
Leveraging Offline Data from Similar Systems for Online Linear Quadratic Control

Shivam Bajaj, Prateek Jaiswal, and Vijay Gupta Shivam Bajaj and Vijay Gupta are with the Department of Electrical and Computer Engineering, Purdue University. Prateek Jaiswal is with the Department of Management, Purdue University (e-mails: {bajaj41,jaiswalp,gupta869}@purdue.edu)
Abstract

The "sim2real gap", in which the system learned in simulation is not an exact representation of the real system, can lead to loss of stability and performance when controllers learned using data from the simulated system are used on the real system. In this work, we address this challenge in the linear quadratic regulator (LQR) setting. Specifically, we consider an LQR problem for a system with unknown system matrices. Along with the state-action pairs from the system to be controlled, a trajectory of length $S$ of state-action pairs from a different unknown system is available. Our proposed algorithm is built upon Thompson sampling and utilizes both the mean and the uncertainty of the dynamics of the system from which the trajectory of length $S$ is obtained. We establish that the algorithm achieves $\tilde{\mathcal{O}}(f(S,M_{\delta})\sqrt{T/S})$ Bayes regret after $T$ time steps, where $M_{\delta}$ characterizes the dissimilarity between the two systems and $f(S,M_{\delta})$ is a function of $S$ and $M_{\delta}$. When $M_{\delta}$ is sufficiently small, the proposed algorithm achieves $\tilde{\mathcal{O}}(\sqrt{T/S})$ Bayes regret and outperforms a naive strategy that does not utilize the available trajectory.

Index Terms:
Identification for control, Sampled Data Control, Autonomous Systems, Adaptive Control

I Introduction

Online learning of linear quadratic regulators (LQRs) with unknown system matrices is a well-studied problem. Many recent works have proposed novel algorithms with performance guarantees on their (cumulative) regret, defined as the difference between the cumulative cost with a controller from the learning algorithm and the cost with an optimal controller that knows the system matrices [1, 2, 3, 4]. However, these algorithms require long exploration times [5], which impedes their usage in many practical applications. To aid these algorithms, we propose to use offline datasets of state-action pairs either from an approximate simulator or a simpler model of the unknown system. We propose an online algorithm that leverages offline data to provably reduce the exploration time, leading to lower regret.

Leveraging offline data is not a new idea. Offline reinforcement learning [6], for instance, uses offline data to learn a policy which is used online. However, this leads to the problem of sim-to-real gap or distribution shift since the system parameters learned offline are different from the ones encountered online [7]. Although many methods have been proposed in the literature to be robust to such issues, in general, such policies are not optimal for the new system. Another approach is to utilize the offline data to warm-start an online learning algorithm. Such strategies have been shown to achieve an improved bound on the regret in multi-armed bandits [8, 9, 10, 11]. However, extending these algorithms to LQR design and establishing their theoretical properties remains unexplored, particularly for characterizing when they provide benefits over learning the policy in a purely online fashion.

Our algorithm provides a framework to incorporate offline data from a similar linear system for online learning, and provably achieves an $\tilde{\mathcal{O}}(f(S,M_{\delta})\sqrt{T/S})$ upper bound on the regret, where $S$ denotes the offline trajectory length and $M_{\delta}$ quantifies the heterogeneity between the two systems. (Two linear systems, characterized by system matrices $A_{i}$ and $B_{i}$, $i\in\{1,2\}$, are said to be similar if they have the same order and their system matrices satisfy $\|\begin{bmatrix}A_{1}&B_{1}\end{bmatrix}-\begin{bmatrix}A_{2}&B_{2}\end{bmatrix}\|\leq M_{\delta}$.) Our algorithm utilizes both the system matrices estimated from the offline data and the residual uncertainty. We show via numerical simulations that, as $S$ increases, an improved regret can be achieved with a fairly small number of measurements from the online system.

Our algorithm uses Thompson Sampling (TS), which samples a model (system matrices) from a belief distribution over the unknown system matrices, takes the optimal action based on the sampled model, and subsequently updates the belief distribution using the observed feedback (cost). In the purely online setting, control of unknown linear dynamical systems using the TS approach has been extensively studied [1, 12, 13, 14]. Under the assumption that the distribution of the true parameters is known, [3] established an $\tilde{\mathcal{O}}(\sqrt{T})$ (Bayes) regret bound. Recently, the same $\tilde{\mathcal{O}}(\sqrt{T})$ regret bound was established without that assumption [15, 2]. Finally, our work is also related to the growing literature on transfer learning for linear systems [16, 17, 18, 19]. However, unlike that stream, this work focuses on establishing regret guarantees for online LQR control while leveraging offline data.

This work is organized as follows. Section II presents the problem definition and a summary of background material. Section III describes the offline data scheme. Section IV presents the proposed algorithm, which is analyzed in Section V. Section VI presents additional numerical insights, and Section VII discusses how this work extends to the case when data from multiple sources is available. Finally, Section VIII summarizes this work and outlines directions for future work.

Notation: $\|\cdot\|$, $\|\cdot\|_{F}$, $\|\cdot\|_{2}$, and $\mathbf{Tr}(\cdot)$ denote the operator norm, Frobenius norm, spectral norm, and the trace, respectively. For a positive definite matrix $A$ (denoted $A\succ 0$), $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ denote its maximum and minimum eigenvalues, respectively. $\mathbf{I}$ denotes the identity matrix and $\eta$ denotes a matrix with independent standard normal entries. Given a set $\mathcal{P}$ and a sample $\theta$, $\mathcal{S}_{\mathcal{P}}$ represents a sampling operator that ensures $\theta\in\mathcal{P}$.

II Problem Formulation

We first review the classical LQR control problem and then describe our model followed by the formal problem statement.

II-A Classical LQR Design

Let $x_{t}\in\mathbb{R}^{n}$ denote the state and $u_{t}\in\mathbb{R}^{m}$ denote the control at time $t$. Let $A_{*}\in\mathbb{R}^{n\times n}$ and $B_{*}\in\mathbb{R}^{n\times m}$ be the system matrices. Further, let $\theta_{*}^{\top}\coloneqq\begin{bmatrix}A_{*}&B_{*}\end{bmatrix}$ and $z_{t}\coloneqq\begin{bmatrix}x_{t}^{\top}&u_{t}^{\top}\end{bmatrix}^{\top}$. Then, for $t\geq 1$ and given matrices $Q\succ 0$, $R\succ 0$, consider a discrete-time linear time-invariant system with the dynamics and the cost function

x_{t+1}=\theta_{*}^{\top}z_{t}+w_{t},  (1)
c_{t}=x_{t}^{\top}Qx_{t}+u_{t}^{\top}Ru_{t},

where $w_{t}\sim\mathcal{N}(0,\mathbf{I})$ is the system noise, assumed to be white, and $x_{1}=0$. The classical LQR control problem is to design a closed-loop control $\pi\colon\mathbb{R}^{n}\to\mathbb{R}^{m}$ with $u_{t}=\pi(x_{t})$ that minimizes the following cost:

J_{\pi}(\theta_{*})=\limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[c_{t}(x_{t},u_{t})].  (2)

When $\theta_{*}$ is known and under the assumption that $(A_{*},B_{*})$ is stabilizable, the optimal policy is $u_{t}=K(\theta_{*})x_{t}$ and the corresponding cost is $J(\theta_{*})\coloneqq\mathbf{Tr}(P(\theta_{*}))$, where

K(\theta_{*})=-(R+B_{*}^{\top}P(\theta_{*})B_{*})^{-1}B_{*}^{\top}P(\theta_{*})A_{*}

is the gain matrix and $P(\theta_{*})$ is the unique positive definite solution to the Riccati equation

P(\theta_{*})=Q+A_{*}^{\top}P(\theta_{*})A_{*}+A_{*}^{\top}P(\theta_{*})B_{*}K(\theta_{*}).
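For a known $\theta_{*}$, the gain and the Riccati solution can be computed numerically. The following minimal sketch uses SciPy's discrete-time Riccati solver; the specific matrices are illustrative placeholders and are not taken from the paper.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_gain(A, B, Q, R):
    """Return (K, P, J): the optimal gain with u_t = K x_t, the Riccati
    solution P(theta), and the optimal average cost J(theta) = Tr(P(theta))
    for unit-covariance process noise."""
    P = solve_discrete_are(A, B, Q, R)
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K, P, np.trace(P)

# Illustrative stabilizable pair.
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K, P, J = lqr_gain(A, B, Q, R)
print("K(theta*) =\n", K, "\nJ(theta*) =", J)
```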

II-B Model and Problem Statement

Consider a system characterized by equation (1) with unknown $\theta_{*}$ and access to an offline dataset obtained through an approximate simulator. The simulator is assumed to be characterized by the following auxiliary system, which is different from $\theta_{*}$ and is also unknown:

\xi_{s+1}={\theta_{*}^{\text{sim}}}^{\top}y_{s}+w^{\text{sim}}_{s},  (3)

where $\xi_{s}\in\mathbb{R}^{n}$ and $v_{s}\in\mathbb{R}^{m}$ denote the state and the control, respectively, at time instant $s$, ${\theta_{*}^{\text{sim}}}^{\top}\coloneqq\begin{bmatrix}A^{\text{sim}}_{*}&B^{\text{sim}}_{*}\end{bmatrix}$ denotes the system matrices, and $y_{s}\coloneqq\begin{bmatrix}\xi_{s}^{\top}&v_{s}^{\top}\end{bmatrix}^{\top}$. The offline data $\mathcal{D}=\{y_{1},\dots,y_{S}\}$ is a trajectory of length $S$ of state-action pairs $(\xi_{s},v_{s})$, $1\leq s\leq S$. We can write $A_{*}=A^{\text{sim}}_{*}+A^{\delta}_{*}$ and $B_{*}=B^{\text{sim}}_{*}+B^{\delta}_{*}$, where $A^{\delta}_{*}$ (resp. $B^{\delta}_{*}$) represents the change in the system matrix $A_{*}$ (resp. $B_{*}$) from $A^{\text{sim}}_{*}$ (resp. $B^{\text{sim}}_{*}$). Thus, the system characterized by equation (1) can be expressed as

x_{t+1}=\left(A^{\text{sim}}_{*}+A^{\delta}_{*}\right)x_{t}+\left(B^{\text{sim}}_{*}+B^{\delta}_{*}\right)u_{t}+w_{t}.  (4)

In this work, we assume that there exists a known constant $M_{\delta}$ such that $\|\theta^{\delta}_{*}\|_{F}\leq M_{\delta}$, where ${\theta_{*}^{\delta}}^{\top}\coloneqq\begin{bmatrix}A^{\delta}_{*}&B^{\delta}_{*}\end{bmatrix}$. Let $\mathcal{F}_{t}\coloneqq\sigma(\{x_{1},u_{1},\dots,x_{t},u_{t}\})$ denote the filtration that represents the knowledge up to time $t$ during the online process. Similarly, let $\mathcal{F}_{s}\coloneqq\sigma(\{\xi_{1},v_{1},\dots,\xi_{S},v_{S}\})$ denote the filtration that represents the knowledge corresponding to the offline data. Then, we make the following standard assumption on the noise process [20].

Assumption 1.

There exist filtrations $\mathcal{F}_{t}$ and $\mathcal{F}_{s}$ such that, for any $t\geq 1$ and $1\leq s\leq S$, $z_{t},x_{t}$ are $\mathcal{F}_{t}\cup\mathcal{F}_{s}$-measurable and $y_{s},\xi_{s}$ are $\mathcal{F}_{s}$-measurable. Further, $w_{t+1}$ and $w^{\text{sim}}_{s+1}$ are individually martingale difference sequences. Finally, for ease of exposition, we assume that $\mathbb{E}[w_{t+1}w_{t+1}^{\top}|\mathcal{F}_{t}\cup\mathcal{F}_{s}]=\mathbf{I}$ and $\mathbb{E}[w^{\text{sim}}_{s+1}{w^{\text{sim}}_{s+1}}^{\top}|\mathcal{F}_{s}]=\mathbf{I}$.

Assuming that the parameter $\theta_{*}$ is a random variable with a known distribution $\mu$, we quantify the performance of our learning algorithm by comparing its cumulative cost to the infinite-horizon cost attained by the LQR controller if the system matrices defined by $\theta_{*}$ were known a priori. Formally, we quantify the performance of our algorithm through the cumulative Bayesian regret defined as follows.

\mathcal{R}(T,\pi)=\mathbb{E}\left[\sum_{t=1}^{T}\left(c_{t}-J(\theta_{*})\right)\right],  (5)

where the expectation is with respect to $w_{t}$, $\mu$, and any randomization in the algorithms used to process the offline and online data. This metric has been previously considered for online control of LQR systems [3].

Problem 1.

The aim of this work is to find a control algorithm that minimizes the expected regret defined in (5) while utilizing the offline data $\mathcal{D}$.

III Offline Data-Generation

In this work, we do not consider a particular algorithm from which the offline data is generated. As we will see later, any algorithm that satisfies the following two properties can be used to generate the offline dataset $\mathcal{D}$. Let $\mathcal{A}^{\text{sim}}$ denote the algorithm used to generate the offline data and, at time $s$, let $U_{s}\coloneqq\sum_{k=0}^{s}y_{k}y_{k}^{\top}$ denote the precision matrix of Algorithm $\mathcal{A}^{\text{sim}}$. We assume the following on Algorithm $\mathcal{A}^{\text{sim}}$.

Assumption 2 (Offline Algorithm).

For a given $\delta_{1}\in(0,1)$, with probability at least $1-\delta_{1}$, Algorithm $\mathcal{A}^{\text{sim}}$ satisfies:

  1. $\|U_{s}^{0.5}(\hat{\theta}_{s}^{\text{sim}}-\theta_{*}^{\text{sim}})\|_{F}\leq\alpha_{s}(\delta_{1})$.

  2. For $s\geq 200(n+m)\log\tfrac{12}{\delta_{1}}$, $\lambda_{\min}(U_{s})\geq\frac{s}{40}$.

Assumption 2 is not restrictive, as there are many algorithms for LQR control that satisfy these properties, such as algorithms based on the Thompson sampling [2] or the Upper Confidence Bound (UCB) [4, 21] principles.
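To make the offline scheme concrete, the following sketch plays the role of $\mathcal{A}^{\text{sim}}$: it rolls out the simulator dynamics (3) under a stabilizing feedback with exploratory noise (the exploration scheme and the zero feedback gain are illustrative assumptions, not requirements of Assumption 2), accumulates the precision matrix $U_{S}$ and the least-squares estimate $\hat{\theta}_{S}^{\text{sim}}$, and checks the eigenvalue condition in Assumption 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, S = 3, 2, 3000

# Simulator matrices (stand-ins for A_*^sim, B_*^sim; here the pair from Section VI).
A_sim = np.array([[0.7, 0.5, 0.4], [0.0, 0.5, 0.4], [0.0, 0.0, 0.4]])
B_sim = np.array([[1.1, 0.5], [0.5, 1.0], [0.5, 0.5]])
K_explore = np.zeros((m, n))     # any stabilizing gain; A_sim is already stable

xi = np.zeros(n)
U_S = np.zeros((n + m, n + m))   # precision matrix  U_S = sum_s y_s y_s^T
G = np.zeros((n + m, n))         # moment matrix     sum_s y_s xi_{s+1}^T
for s in range(S):
    v = K_explore @ xi + rng.standard_normal(m)   # control plus exploration noise
    y = np.concatenate([xi, v])                   # y_s = [xi_s; v_s]
    xi = A_sim @ xi + B_sim @ v + rng.standard_normal(n)
    U_S += np.outer(y, y)
    G += np.outer(y, xi)

theta_hat_sim = np.linalg.solve(U_S, G)  # offline least-squares estimate of theta_*^sim
print("lambda_min(U_S) >= S/40:", np.linalg.eigvalsh(U_S).min() >= S / 40)
```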

IV Thompson Sampling with Offline Data for LQR (TSOD-LQR) Algorithm

Although $(A_{*},B_{*})$ is assumed to be stabilizable, an algorithm based on Thompson sampling may sample parameters that are not stabilizable. Thus, for some fixed constants $M_{P}$ and $\rho<1$, we assume that $\theta_{*}\in\mathcal{Q}$, where

\mathcal{Q}=\{\theta~|~\mathbf{Tr}\left(P(\theta)\right)\leq M_{P},\ \|A_{*}+B_{*}K(\theta)\|_{2}\leq\rho<1\}.  (6)

The assumption that $\theta_{*}\in\mathcal{Q}$ leads to the following result.

Lemma 1 (Proposition 5 in [15]).

The set $\mathcal{Q}$ is compact. Any $\theta\in\mathcal{Q}$ is stabilizable, and there exists a constant $M_{K}<\infty$, where $M_{K}\coloneqq\sup_{\theta\in\mathcal{Q}}\|K(\theta)\|_{2}$.

The idea behind Algorithm TSOD-LQR is to augment the data collected online from the system $\theta_{*}$ with the data collected from the simulated system. To achieve this, we utilize the posterior of $\theta^{\text{sim}}$ to characterize the prior for learning $\theta_{*}$.

1 Input: $T$, $U_{S}$, $\alpha_{S}(\delta_{1})$, $\delta_{2}$, $M_{\delta}$
2 for each $t\in\{1,\dots,T\}$ do
3       Sample $\tilde{\theta}_{t}$ using (7).
4       Compute $K(\tilde{\theta}_{t})$.
5       Apply $u_{t}=K(\tilde{\theta}_{t})x_{t}$.
6       Transition to $x_{t+1}$ and receive the cost $c_{t}(x_{t},u_{t})$.
7       Compute $V_{t+1}$ and $\hat{\theta}_{t+1}$ using (9) and (10).
8 end for
Algorithm 1 Thompson Sampling with Offline Data for LQR (TSOD-LQR)

Our algorithm works as follows and is summarized in Algorithm 1. At each time $t\geq 1$, Algorithm 1 samples a parameter $\tilde{\theta}_{t}$ according to the following equation:

\tilde{\theta}_{t}=\mathcal{S}_{\mathcal{Q}}\left(\hat{\theta}_{t}+\beta_{t}(\delta_{2})V_{t}^{-1/2}\eta_{t}\right),  (7)

where, for any $\delta_{2}\in(0,1)$,

\beta_{t}(\delta_{2})=n\sqrt{2\log\left(\frac{\det(V_{t})^{0.5}}{\det\left(U_{S}\right)^{0.5}\delta_{2}}\right)}+\alpha_{S}(\delta_{1})+\sqrt{\lambda_{\max}(U_{S})}\,M_{\delta}.  (8)

Once the parameter $\tilde{\theta}_{t}$ is sampled, the gain matrix $K(\tilde{\theta}_{t})$ is determined, the corresponding control $u_{t}$ is applied, and the system transitions to the next state $x_{t+1}$. Algorithm 1 then updates $V_{t}$ and $\hat{\theta}_{t}$ using the following equations:

V_{t}=U_{S}+\sum_{k=0}^{t-1}z_{k}z_{k}^{\top},  (9)
\hat{\theta}_{t}=V_{t}^{-1}\left(\sum_{k=0}^{t-1}z_{k}x_{k+1}^{\top}+U_{S}\hat{\theta}_{S}^{\text{sim}}\right).  (10)

Observe that Algorithm 1 does not require knowledge of the distribution $\mu$ (the distribution of $\theta_{*}$). This highlights that Algorithm 1 works even when this distribution is not known; the assumption that $\mu$ is known is required only for the analysis. Our first result, the proof of which is deferred to the Appendix, characterizes the confidence bound on the estimation error of $\theta_{*}$.
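The sketch below spells out the pieces of one round of Algorithm 1, assuming the offline quantities $U_{S}$, $\hat{\theta}_{S}^{\text{sim}}$, and $\alpha_{S}(\delta_{1})$ are available (e.g., from the sketch in Section III) and that the sampling operator $\mathcal{S}_{\mathcal{Q}}$ is realized by rejection sampling. As a simplification, membership in $\mathcal{Q}$ is checked with the sampled pair $(A,B)$ in place of $(A_{*},B_{*})$.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def gain(theta, Q, R, n):
    """Split theta^T = [A B], solve the Riccati equation, return (A, B, K, P)."""
    A, B = theta.T[:, :n], theta.T[:, n:]
    P = solve_discrete_are(A, B, Q, R)
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return A, B, K, P

def beta_t(V, U_S, alpha_S, M_delta, delta2, n):
    """Confidence width beta_t(delta_2) from (8)."""
    log_ratio = np.linalg.slogdet(V)[1] - np.linalg.slogdet(U_S)[1]
    return (n * np.sqrt(2 * (0.5 * log_ratio - np.log(delta2)))
            + alpha_S + np.sqrt(np.linalg.eigvalsh(U_S).max()) * M_delta)

def sample_theta(theta_hat, V, beta, Q, R, n, M_P, rho, rng, max_tries=100):
    """Thompson sample (7), with rejection sampling standing in for S_Q."""
    V_inv_sqrt = np.linalg.cholesky(np.linalg.inv(V))  # one valid square root of V^{-1}
    for _ in range(max_tries):
        eta = rng.standard_normal(theta_hat.shape)
        theta = theta_hat + beta * V_inv_sqrt @ eta
        try:
            A, B, K, P = gain(theta, Q, R, n)
        except (np.linalg.LinAlgError, ValueError):
            continue                                    # Riccati solve failed; resample
        if np.trace(P) <= M_P and np.linalg.norm(A + B @ K, 2) <= rho:
            return theta, K
    return theta_hat, gain(theta_hat, Q, R, n)[2]       # fallback to the mean estimate

def update(V, G, x, u, x_next):
    """Rank-one updates for (9)-(10); G accumulates sum_k z_k x_{k+1}^T + U_S theta_hat_S^sim."""
    z = np.concatenate([x, u])
    V = V + np.outer(z, z)
    G = G + np.outer(z, x_next)
    return V, G, np.linalg.solve(V, G)
```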

Theorem IV.1.

Suppose that, for a given $\delta_{1}\in(0,1)$, Algorithm $\mathcal{A}^{\text{sim}}$ is used to collect the offline data $\mathcal{D}$ for $S$ time steps and Assumption 1 holds. Then, for any $\delta_{2}\in(0,1)$, $\|V_{t}^{0.5}(\hat{\theta}_{t}-\theta_{*})\|_{F}\leq\beta_{t}(\delta_{2})$ holds with probability at least $1-\delta_{1}-\delta_{2}$.

In the next section we will establish an upper bound on the regret for Algorithm 1.

V Regret Analysis

Following the standard technique [15, 2], we begin by defining two concentration ellipsoids $\mathcal{E}^{\text{RLS}}_{t}$ and $\mathcal{E}^{\text{TS}}_{t}$:

\mathcal{E}^{\text{RLS}}_{t}=\{\theta\in\mathbb{R}^{(n+m)\times n}~|~\|V_{t}^{0.5}(\theta-\hat{\theta}_{t})\|_{F}\leq\beta_{t}(\delta_{2})\},
\mathcal{E}^{\text{TS}}_{t}=\{\tilde{\theta}\in\mathbb{R}^{(n+m)\times n}~|~\|V_{t}^{0.5}(\tilde{\theta}-\hat{\theta}_{t})\|_{F}\leq\beta^{\prime}_{t}(\delta_{2})\},

where $\beta^{\prime}_{t}(\delta_{2})=n\sqrt{2(n+m)\log(2n(n+m)/\delta_{2})}\,\beta_{t}(\delta_{2})$. Further, introduce the events $\hat{E}_{t}=\{\forall k\leq t,\ \theta_{*}\in\mathcal{E}^{\text{RLS}}_{k}\}$ and $\tilde{E}_{t}=\{\forall k\leq t,\ \tilde{\theta}_{k}\in\mathcal{E}^{\text{TS}}_{k}\}$.

The following result will be useful to establish that the event $E_{t}\coloneqq\hat{E}_{t}\cap\tilde{E}_{t}$ holds with high probability.

Lemma 2.

Suppose that $S>T$. Then, $\mathbb{P}(E_{T})\geq 1-\frac{\delta}{4}$.

Proof.

Using Theorem IV.1,

\mathbb{P}(\hat{E}_{T})=\mathbb{P}\left(\cap_{t=1}^{T}\left(\|V_{t}^{0.5}(\hat{\theta}_{t}-\theta_{*})\|_{F}\leq\beta_{t}(\delta_{2})\right)\right)=1-\mathbb{P}\left(\cup_{t=1}^{T}\left(\|V_{t}^{0.5}(\hat{\theta}_{t}-\theta_{*})\|_{F}>\beta_{t}(\delta_{2})\right)\right)\geq 1-T(\delta_{1}+\delta_{2}).

Selecting $\delta_{1}=\frac{\delta}{16S}$ and $\delta_{2}=\frac{\delta}{16T}$ and using the fact that $S>T$ yields $\mathbb{P}(\hat{E}_{T})\geq 1-\frac{\delta}{8}$. The proof of $\mathbb{P}(\tilde{E}_{T})\geq 1-\frac{\delta}{8}$ is analogous to that of [15, Proposition 6]. Finally, applying the union bound yields the result. ∎

Remark 1.

The requirement that $S>T$ means that the length of the offline trajectory must be greater than the learning horizon $T$. This is not an onerous assumption, especially when a simulator is used to generate the offline data. Further, since the auxiliary system need not be the same as the true system, data available from any other source (such as a simpler model) can also be used in this work. Finally, in cases where generating large amounts of data is not possible through a simulator (for example, when a high-fidelity simulator is used), one can select $\delta_{1}=\frac{\delta}{T}$ for the offline data collection. However, this requires the horizon length $T$ to be known a priori.

Conditioned on the filtration $\mathcal{F}_{s}\cup\mathcal{F}_{t}$ and the event $E_{t}$, following analogous steps as in [15], the expected regret of Algorithm 1 can be decomposed as

\mathcal{R}(T,\text{TSOD-LQR})\mathds{1}\{E_{T}\}\leq\mathcal{R}_{0}+\mathcal{R}_{1}+\mathcal{R}_{2}+\mathcal{R}_{3},  (11)

where

\mathcal{R}_{0}\coloneqq\mathbb{E}\left[\sum_{t=1}^{T}\{J(\tilde{\theta}_{t})-J(\theta_{*})\}\mathds{1}\{E_{t}\}\right],
\mathcal{R}_{1}\coloneqq\mathbb{E}\left[\sum_{t=1}^{T}x_{t}^{\top}P(\tilde{\theta}_{t})x_{t}\mathds{1}\{E_{t}\}-x_{t+1}^{\top}P(\tilde{\theta}_{t+1})x_{t+1}\mathds{1}\{E_{t+1}\}\right],
\mathcal{R}_{2}\coloneqq\mathbb{E}\left[\sum_{t=1}^{T}\left[\left(\theta_{*}^{\top}z_{t}\right)^{\top}P(\tilde{\theta}_{t})\left(\theta_{*}^{\top}z_{t}\right)-\left(\tilde{\theta}_{t}^{\top}z_{t}\right)^{\top}P(\tilde{\theta}_{t})\left(\tilde{\theta}_{t}^{\top}z_{t}\right)\right]\mathds{1}\{E_{t}\}\right],
\mathcal{R}_{3}\coloneqq\mathbb{E}\left[\sum_{t=1}^{T}\{x_{t+1}^{\top}\left(P(\tilde{\theta}_{t+1})-P(\tilde{\theta}_{t})\right)x_{t+1}\}\mathds{1}\{E_{t+1}\}\right].

We will now characterize an upper bound on each of these terms separately to bound the regret of Algorithm 1.

Lemma 3.

The term $\mathcal{R}_{0}=0$.

Proof.

Since the distribution of $\theta_{*}$ is assumed to be known, from the posterior sampling lemma [22, Lemma 1], it follows that $\mathbb{E}[J(\tilde{\theta}_{t})]=\mathbb{E}[J(\theta_{*})]$ and the claim follows. ∎

Lemma 4.

The term $\mathcal{R}_{1}$ is upper bounded as $\mathcal{R}_{1}\leq M_{P}\|x_{1}\|_{2}^{2}$.

Proof.

The proof directly follows by expanding the terms in the summation and the fact that $P(\tilde{\theta}_{T})$ is positive definite. ∎

Let $X_{T}\coloneqq\max_{t\leq T}\|x_{t}\|_{2}$ and $X_{S}\coloneqq\max_{s\leq S}\|x_{s}\|_{2}$. Then, the following two results, proofs of which are in the appendix, bound $\mathcal{R}_{2}$ and $\mathcal{R}_{3}$.

Lemma 5.

For a given $\delta_{1}\in(0,1)$, suppose that $S\geq 200(n+m)\log\tfrac{12}{\delta_{1}}$. Then, with probability $1-\delta_{1}$ and under the event $E_{T}$,

\mathcal{R}_{2}\leq\tilde{\mathcal{O}}\left(\sqrt{\frac{T}{S}}\,\mathbb{E}\left[\beta_{T}(\delta_{2})X_{T}^{2}\sqrt{\log\left(1+\frac{TM_{K}^{2}X_{T}^{2}}{S(n+m)}\right)}\right]\right),

where $\tilde{\mathcal{O}}$ contains problem-dependent constants and polylog terms in $T$.

Lemma 6.

For a given $\delta_{1}\in(0,1)$, suppose that $S\geq 200(n+m)\log\tfrac{12}{\delta_{1}}$. Then, under the event $E_{T}$ and with probability $1-\delta_{1}$,

\mathcal{R}_{3}\leq\tilde{\mathcal{O}}\left(\sqrt{\frac{T}{S}}\,\mathbb{E}\left[X_{T}^{4}\beta_{T}(\delta_{2})\sqrt{\log\left(1+\frac{TM_{K}^{2}X_{T}^{2}}{S(n+m)}\right)}\right]\right).
Theorem V.1.

Suppose that $S\geq\max\{T,200(n+m)\log\tfrac{12}{\delta_{1}}\}$. Then, with probability at least $1-\delta$, the regret, defined in equation (5), of Algorithm 1 is at most

\tilde{\mathcal{O}}\left(\sqrt{\frac{T}{S}}\left(\log(T)+\mathbb{E}\left[\alpha_{S}(\delta_{1})+\sqrt{\lambda_{\max}(U_{S})}\,M_{\delta}\right]\right)\right),  (12)

where $\tilde{\mathcal{O}}$ contains logarithmic terms in $T$ and $S$ as well as problem-dependent constants.

Proof.

Since we assume that $x_{1}=0$, $\mathcal{R}_{1}=0$. For $\mathcal{R}_{3}$, substituting the expression of $\beta_{T}(\delta_{2})$ in Lemma 6 and expanding the product yields

\mathbb{E}\left[X_{T}^{4}\beta_{T}(\delta_{2})\sqrt{\log\left(1+\frac{TM_{K}^{2}X_{T}^{2}}{S(n+m)}\right)}\right]\leq\underbrace{\mathbb{E}\left[X_{T}^{4}\sqrt{\log\left(\frac{2TM_{K}^{2}X_{T}^{2}}{S(n+m)}\right)}\,\alpha_{S}(\delta_{1})\right]}_{I}+\underbrace{\mathbb{E}\left[X_{T}^{4}\sqrt{\log\left(\frac{2TM_{K}^{2}X_{T}^{2}}{S(n+m)}\right)}\,\sqrt{\lambda_{\max}(U_{S})}\,M_{\delta}\right]}_{II}+\underbrace{n\,\mathbb{E}\left[X_{T}^{4}\sqrt{\log\left(\frac{2TM_{K}^{2}X_{T}^{2}}{S(n+m)}\right)}\,\sqrt{2\log\left(\frac{\det(V_{t})^{0.5}}{\det\left(U_{S}\right)^{0.5}\delta_{2}}\right)}\right]}_{III}.

We begin with an upper bound for term $I$:

\mathbb{E}\left[X_{T}^{4}\sqrt{\log\left(\frac{2TM_{K}^{2}X_{T}^{2}}{S(n+m)}\right)}\,\alpha_{S}(\delta_{1})\right]=\mathbb{E}\left[\alpha_{S}(\delta_{1})\,\mathbb{E}\left[X_{T}^{4}\sqrt{\log\left(\frac{2TM_{K}^{2}X_{T}^{2}}{S(n+m)}\right)}\,\Big|\,\mathcal{F}_{S}\right]\right].

Observe that by Jensen’s inequality

\mathbb{E}\left[\sqrt{X_{T}^{4}\log\left(1+\frac{TM_{K}^{2}X_{T}^{2}}{S(n+m)}\right)}\,\Big|\,\mathcal{F}_{S}\right]\leq\sqrt{\mathbb{E}\left[X_{T}^{4}\log\left(\frac{2TM_{K}^{2}X_{T}^{2}}{S(n+m)}\right)\Big|\mathcal{F}_{S}\right]}=\sqrt{\mathbb{E}\left[X_{T}^{4}\log\left(\frac{2TM_{K}^{2}}{S(n+m)}\right)\Big|\mathcal{F}_{S}\right]+\mathbb{E}\left[X_{T}^{4}\log X_{T}^{2}\,\Big|\,\mathcal{F}_{S}\right]}\leq\tilde{\mathcal{O}}(1),

where we used that $S>T$, the law of total expectation, and Lemma 9. Thus, term $I$ is upper bounded by $\tilde{\mathcal{O}}(\mathbb{E}[\alpha_{S}(\delta_{1})])$. By analogous algebraic manipulations, term $II$ is upper bounded by $\tilde{\mathcal{O}}(\mathbb{E}[\sqrt{\lambda_{\max}(U_{S})}\,M_{\delta}])$. Using Lemma 12 followed by Lemma 9, term $III$ is upper bounded by

n\,\mathbb{E}\left[X_{T}^{4}\log\left(\frac{2TM_{K}^{2}X_{T}^{2}}{S(n+m)}\right)\right]\leq\tilde{\mathcal{O}}(1).

Combining the upper bounds for terms $I$, $II$, and $III$ yields an upper bound for $\mathcal{R}_{3}$. The bound for $\mathcal{R}_{2}$ is obtained analogously and has been omitted for brevity. Combining the bounds for $\mathcal{R}_{2}$ and $\mathcal{R}_{3}$ establishes the claim. ∎

Remark 2.

From Theorem V.1, using offline data from the system $\theta_{*}^{\text{sim}}$ is beneficial if $M_{\delta}$ is sufficiently small.

Corollary 1.

Suppose that $\theta_{*}=\theta_{*}^{\text{sim}}$. Further, suppose that $S\geq\max\{T,200(n+m)\log\tfrac{12}{\delta_{1}}\}$. Then, with probability at least $1-\delta$, the regret, defined in equation (5), of Algorithm 1 is at most $\tilde{\mathcal{O}}\left(\sqrt{T/S}\right)$.

Proof.

By substituting $M_{\delta}=0$, the proof follows directly from Theorem V.1. ∎

Since $S>T$, when offline data from the same system is available, Corollary 1 suggests that the regret of Algorithm 1 is bounded by $\mathcal{O}(\log T)$, where $\mathcal{O}$ contains logarithmic terms in $S$. Such bounds are known to be possible, for instance, when $A_{*}$ or $B_{*}$ is known [21].

Theorem V.1 provides a general regret bound for Algorithm 1 when an arbitrary algorithm is used to generate the data $\mathcal{D}$. The next result provides a regret bound when a particular algorithm, namely Algorithm TSAC [2], is used. To characterize a state bound for Algorithm TSAC, we assume (cf. [2, Assumption 1]) that $\theta_{*}^{\text{sim}}\in\mathcal{P}$, where

\mathcal{P}=\Big\{\theta^{\text{sim}}~|~\mathbf{Tr}\left(P(\theta^{\text{sim}})\right)\leq M_{\textup{sim}},\ \|\theta^{\text{sim}}\|_{F}\leq\phi,\ \|A^{\text{sim}}_{*}+B^{\text{sim}}_{*}K\left(\theta^{\text{sim}}\right)\|_{2}\leq\rho^{\text{sim}}<1\Big\}.  (13)
Theorem V.2.

Suppose that the data $\mathcal{D}$ is generated from Algorithm TSAC for $S\geq\max\{T,200(n+m)\log(12T/\delta)\}$. Further, suppose that $\theta_{*}^{\text{sim}}\in\mathcal{P}$. Then, with probability at least $1-\delta$, the regret, defined in equation (5), of Algorithm 1 is at most

\tilde{\mathcal{O}}\left(\sqrt{\frac{T}{S}}\left(\log S+\sqrt{S}\,M_{\delta}\right)\right),  (14)

where $\tilde{\mathcal{O}}$ contains logarithmic terms in $T$ and $S$ as well as problem-dependent constants.

Proof.

The proof follows directly by using [2, Lemma 15, Lemma 16] followed by using Lemma 10. ∎

Figure 1: Cumulative regret plot comparing Algorithm 1 with an algorithm that (1) does not utilize the offline data and (2) only utilizes the estimate $\hat{\theta}^{\text{sim}}$ computed from the offline data.

VI Numerical Results

We now illustrate the performance of Algorithm 1 through numerical simulations. The system matrices were selected as

A_{*}=\begin{bmatrix}0.6&0.5&0.4\\ 0&0.5&0.4\\ 0&0&0.4\end{bmatrix},\quad A_{*}^{\text{sim}}=\begin{bmatrix}0.7&0.5&0.4\\ 0&0.5&0.4\\ 0&0&0.4\end{bmatrix},\quad B_{*}=\begin{bmatrix}1&0.5\\ 0.5&1\\ 0.5&0.5\end{bmatrix},\quad B_{*}^{\text{sim}}=\begin{bmatrix}1.1&0.5\\ 0.5&1\\ 0.5&0.5\end{bmatrix}.

For all of our numerical results, we run $10$ simulations and present the mean and the standard deviation for each scenario. Figure 1 compares the cumulative regret of Algorithm 1, for $S=3000$, with two baselines. From Figure 1, the proposed approach outperforms an algorithm that either does not utilize the available data or only uses $\hat{\theta}^{\text{sim}}$ computed from the offline data, implying that utilizing both the estimate and the uncertainty from dissimilar systems can be beneficial.
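The comparison in Figure 1 can be reproduced along the following lines. Since the exploration details of the baselines are not specified in the text, the sketch only shows the common scaffolding: the matrices above, the optimal cost $J(\theta_{*})$, and how the cumulative regret in (5) is accumulated for any controller.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A_star = np.array([[0.6, 0.5, 0.4], [0.0, 0.5, 0.4], [0.0, 0.0, 0.4]])
B_star = np.array([[1.0, 0.5], [0.5, 1.0], [0.5, 0.5]])
Q, R = np.eye(3), np.eye(2)

# Regret baseline: the optimal infinite-horizon cost J(theta_*) = Tr(P(theta_*)).
P_star = solve_discrete_are(A_star, B_star, Q, R)
J_star = np.trace(P_star)

def cumulative_regret(controller, T=2000, seed=0):
    """Run the true system (1) with u_t = controller(t, x_t) and return the
    running sum of (c_t - J(theta_*)), i.e., the empirical version of (5)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(3)
    running, out = 0.0, []
    for t in range(T):
        u = controller(t, x)
        running += x @ Q @ x + u @ R @ u - J_star
        out.append(running)
        x = A_star @ x + B_star @ u + rng.standard_normal(3)
    return np.array(out)

# Sanity check: the (non-learning) optimal controller, whose regret should stay near zero.
K_star = -np.linalg.solve(R + B_star.T @ P_star @ B_star, B_star.T @ P_star @ A_star)
print(cumulative_regret(lambda t, x: K_star @ x)[-1])
```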

Figure 2: Cumulative regret plot comparing Algorithm 1 for various values of $S$ and $M_{\delta}=0.15$.

Figure 2 presents the cumulative regret of the proposed Algorithm TSOD-LQR for $S=1500$, $2500$, and $5000$. From Figure 2, the cumulative regret of Algorithm TSOD-LQR decreases as $S$ increases.

VII Extension to Multiple Offline Sources

We now briefly describe how this framework generalizes to the case when multiple trajectories of lengths $S_{1},\dots,S_{N}$ are available from systems $\theta_{*,1}^{\text{sim}},\dots,\theta_{*,N}^{\text{sim}}$, respectively.

Defining the $\ell_{2}$ least-squares error as

e(\theta)=\sum_{i=1}^{N}\mathbf{Tr}\left((\hat{\theta}^{\text{sim}}_{S_{i}}-\theta)^{\top}U_{S_{i}}(\hat{\theta}^{\text{sim}}_{S_{i}}-\theta)\right)+\sum_{k=1}^{t-1}\mathbf{Tr}\left((x_{k+1}-\theta^{\top}z_{k})(x_{k+1}-\theta^{\top}z_{k})^{\top}\right)

and minimizing with respect to θ\theta yields

V_{t}=\sum_{i=1}^{N}U_{S_{i}}+\sum_{k=0}^{t-1}z_{k}z_{k}^{\top},
\hat{\theta}_{t}=V_{t}^{-1}\left(\sum_{k=0}^{t-1}z_{k}x_{k+1}^{\top}+\sum_{i=1}^{N}U_{S_{i}}\hat{\theta}_{S_{i}}^{\text{sim}}\right).

Then, following analogous steps as in the proof of Theorem IV.1, we obtain

\beta_{t}(\delta_{2})=n\sqrt{2\log\left(\frac{\det(V_{t})^{0.5}}{\det\left(\sum_{i=1}^{N}U_{S_{i}}\right)^{0.5}\delta_{2}}\right)}+\sum_{i=1}^{N}\alpha_{S_{i}}(\delta_{1})+\sum_{i=1}^{N}\sqrt{\lambda_{\max}(U_{S_{i}})}\,M_{\delta,i}.

With these modifications, we can now utilize Algorithm 1 for online control of an LQR when offline data from multiple dissimilar sources are available. By defining $S=\sum_{i=1}^{N}S_{i}$ and $M_{\delta}=\max_{i}M_{\delta,i}$ and by following analogous steps as in the proof of Theorem V.1, a similar upper bound on the cumulative regret of Algorithm 1 can be obtained.
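A short sketch of the pooled update above, assuming the per-source precision matrices $U_{S_{i}}$ and offline estimates $\hat{\theta}_{S_{i}}^{\text{sim}}$ have been computed as in Section III and that the online regressors and next states are stacked row-wise.

```python
import numpy as np

def pooled_estimate(U_list, theta_sim_list, Z, X_next):
    """Multi-source version of (9)-(10).

    U_list         : list of per-source precision matrices U_{S_i}, each (n+m, n+m)
    theta_sim_list : list of per-source offline estimates, each (n+m, n)
    Z              : online regressors z_0, ..., z_{t-1}, shape (t, n+m)
    X_next         : next states x_1, ..., x_t, shape (t, n)
    """
    V = sum(U_list) + Z.T @ Z
    G = Z.T @ X_next + sum(U @ th for U, th in zip(U_list, theta_sim_list))
    return V, np.linalg.solve(V, G)
```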

VIII Conclusion

In this work, we considered an online control problem for an LQR when an offline trajectory of length $S$ of state-action pairs from a similar linear system, also with unknown system matrices, is available. We designed and analyzed an algorithm that utilizes the available offline trajectory and established that the algorithm achieves $\tilde{\mathcal{O}}(f(S,M_{\delta})\sqrt{T})$ regret, where $f(S,M_{\delta})$ is a decreasing function of $S$. Finally, we provided additional numerical insights by comparing our algorithm with two other approaches.

IX Acknowledgement

We gratefully acknowledge the valuable comments by Dr. Gugan Thoppe from the Indian Institute of Science (IISc).

-A Proof of Theorem IV.1

Substituting equation (1) into equation (10) yields

\hat{\theta}_{t}=V_{t}^{-1}\sum_{k=0}^{t-1}z_{k}z_{k}^{\top}\theta_{*}+V_{t}^{-1}\sum_{k=0}^{t-1}z_{k}w_{k}^{\top}+V_{t}^{-1}U_{S}\hat{\theta}^{\text{sim}}_{S}=V_{t}^{-1}\sum_{k=0}^{t-1}z_{k}w_{k}^{\top}+V_{t}^{-1}U_{S}(\hat{\theta}^{\text{sim}}_{S}-\theta_{*})+\theta_{*}.

For any vector $z$, it follows that

z^{\top}\hat{\theta}_{t}-z^{\top}\theta_{*}=\langle z,\sum_{k=0}^{t-1}z_{k}w_{k}^{\top}\rangle_{V_{t}^{-1}}+\langle z,U_{S}(\hat{\theta}^{\text{sim}}-\theta_{*})\rangle_{V_{t}^{-1}},
\implies|z^{\top}\hat{\theta}_{t}-z^{\top}\theta_{*}|\leq\|z\|_{V_{t}^{-1}}\left(\Big\|\sum_{k=0}^{t-1}z_{k}w_{k}^{\top}\Big\|_{V_{t}^{-1}}+\left\|U_{S}(\hat{\theta}^{\text{sim}}-\theta_{*})\right\|_{V_{t}^{-1}}\right).

Selecting $z=V_{t}(\hat{\theta}_{t}-\theta_{*})$ yields

\|\hat{\theta}_{t}-\theta_{*}\|_{V_{t}}\leq\Big\|\sum_{k=0}^{t-1}z_{k}w_{k}^{\top}\Big\|_{V_{t}^{-1}}+\left\|U_{S}(\hat{\theta}^{\text{sim}}-\theta_{*})\right\|_{V_{t}^{-1}}.

Since $V_{t}\succeq U_{S}$, it follows that $\|U_{S}(\hat{\theta}^{\text{sim}}-\theta_{*})\|_{V_{t}^{-1}}\leq\|U_{S}^{0.5}(\hat{\theta}^{\text{sim}}-\theta_{*})\|_{F}$. Further, since $\theta_{*}=\theta_{*}^{\text{sim}}+\theta_{*}^{\delta}$, it follows by the triangle inequality that $\|U_{S}^{0.5}(\hat{\theta}^{\text{sim}}-\theta_{*})\|_{F}\leq\|U_{S}^{0.5}(\hat{\theta}^{\text{sim}}-\theta_{*}^{\text{sim}})\|_{F}+\|U_{S}^{0.5}\theta_{*}^{\delta}\|_{F}$. Thus, we obtain

\|\hat{\theta}_{t}-\theta_{*}\|_{V_{t}}\leq\Big\|\sum_{k=0}^{t-1}z_{k}w_{k}^{\top}\Big\|_{V_{t}^{-1}}+\left\|U_{S}^{0.5}(\hat{\theta}^{\text{sim}}-\theta_{*}^{\text{sim}})\right\|_{F}+\left\|U_{S}^{0.5}\theta_{*}^{\delta}\right\|_{F}.

The first term is bounded by [23, Corollary 1] with probability $1-\delta_{2}$ as

\Big\|V_{t}^{-\tfrac{1}{2}}\sum_{k=0}^{t-1}z_{k}w_{k}^{\top}\Big\|_{F}\leq n\sqrt{2\log\left(\frac{\det(V_{t})^{0.5}\det\left(U_{S}\right)^{-0.5}}{\delta_{2}}\right)}.

Further, the second term is bounded with probability $1-\delta_{1}$ by $\alpha_{S}(\delta_{1})$ from Assumption 2. Finally, using the assumption that an upper bound $M_{\delta}$ on $\|\theta_{*}^{\delta}\|_{F}$ is known, the third term is bounded as

\|U_{S}^{0.5}\theta_{*}^{\delta}\|_{F}\leq\sqrt{\lambda_{\max}(U_{S})}\,M_{\delta}.

Combining the three bounds establishes the claim.

-B Proof of Lemma 5

From the fact that every sample and the true parameter belong to the set $\mathcal{Q}$ and from basic algebraic manipulations, we obtain $\mathcal{R}_{2}\leq 2M_{P}M_{\theta}M_{K}\mathbb{E}[X_{T}\sum_{t=1}^{T}\|(\theta_{*}-\tilde{\theta}_{t})^{\top}z_{t}\|]$. We now bound the term with the expectation using the Cauchy-Schwarz inequality as

\mathbb{E}\Big[X_{T}\sum_{t=1}^{T}\|(\theta_{*}-\tilde{\theta}_{t})^{\top}z_{t}\|\Big]\leq\mathbb{E}\Big[\sum_{t=1}^{T}\|V_{t}^{0.5}(\theta_{*}-\tilde{\theta}_{t})\|\,X_{T}\,\|V_{t}^{-0.5}z_{t}\|\Big].  (15)

Adding and subtracting $\hat{\theta}_{t}$ in the term $\|V_{t}^{0.5}(\theta_{*}-\tilde{\theta}_{t})\|$ and applying the triangle inequality yields

\|V_{t}^{0.5}(\theta_{*}-\tilde{\theta}_{t})\|\leq\|V_{t}^{0.5}(\theta_{*}-\hat{\theta}_{t})\|_{F}+\|V_{t}^{0.5}(\hat{\theta}_{t}-\tilde{\theta}_{t})\|_{F}\leq\beta_{t}(\delta_{2})+\beta_{t}^{\prime}(\delta_{2})\leq\beta_{T}(\delta_{2})+\beta_{T}^{\prime}(\delta_{2}),

where we used the facts that, on $E_{t}$, $\|V_{t}^{0.5}(\theta_{*}-\hat{\theta}_{t})\|_{F}\leq\beta_{t}(\delta_{2})$ and $\|V_{t}^{0.5}(\hat{\theta}_{t}-\tilde{\theta}_{t})\|_{F}\leq\beta^{\prime}_{t}(\delta_{2})$ hold and that $\beta_{t}(\delta_{2})$ is increasing in $t$. Thus, by substituting the value of $\beta_{T}^{\prime}(\delta_{2})$, it follows that

\mathbb{E}\Big[X_{T}\sum_{t=1}^{T}\|(\theta_{*}-\tilde{\theta}_{t})^{\top}z_{t}\|\Big]\leq\tilde{\mathcal{O}}\left(\mathbb{E}\Big[\sum_{t=1}^{T}\beta_{T}X_{T}\|V_{t}^{-0.5}z_{t}\|\Big]\right).

Using the fact that $\sum_{t=1}^{T}\|V_{t}^{-0.5}z_{t}\|\leq\sqrt{T}(\sum_{t=1}^{T}\|V_{t}^{-0.5}z_{t}\|^{2})^{0.5}$ followed by Lemma 13, it follows that, with probability $1-\delta_{1}$,

\mathbb{E}\Big[X_{T}\sum_{t=1}^{T}\|(\theta_{*}-\tilde{\theta}_{t})^{\top}z_{t}\|\Big]\leq\tilde{\mathcal{O}}\left(\sqrt{\frac{T}{S}}\,\mathbb{E}\left[\beta_{T}X_{T}^{2}\sqrt{\log\left(\frac{\det(V_{t})}{\det(U_{S})}\right)}\right]\right),

where we used the fact that $\|z_{t}\|^{2}\leq M_{K}^{2}X_{T}^{2}$. Using Lemma 12 establishes the claim.

-C Proof of Lemma 6

The proof of Lemma 6 resembles that of [15, Lemma 1] and so we only provide an outline of the proof, highlighting the differences.

Let $\mathcal{F}_{t}^{x}\coloneqq(\mathcal{F}_{t-1},x_{t})$, let $\bar{\theta}_{t}\coloneqq\hat{\theta}_{t}+\beta_{t}(\delta_{2})V_{t}^{-0.5}\eta_{t}$, and let $\bar{P}_{t}=\mathbb{E}(P(\bar{\theta}_{t})\mathds{1}_{\mathcal{S}_{\mathcal{Q}}}|\mathcal{F}_{t}^{x}\cup\mathcal{F}_{s},E_{t})$. Further, let $\Lambda_{t}\coloneqq\mathbb{E}\left[\|P(\tilde{\theta}_{t})-\bar{P}_{t}\|_{F}|\mathcal{F}_{t}^{x}\cup\mathcal{F}_{s},E_{t}\right]$. Then,

x_{t+1}^{\top}\left(P(\tilde{\theta}_{t+1})-P(\tilde{\theta}_{t})\right)x_{t+1}\mathds{1}_{E_{t+1}}\leq X_{T}^{2}\left\|P(\tilde{\theta}_{t+1})-P(\tilde{\theta}_{t})\right\|_{F}\mathds{1}_{E_{t+1}}\leq X_{T}^{2}\left(\left\|P(\tilde{\theta}_{t+1})-\bar{P}_{t+1}\right\|_{F}+\left\|P(\tilde{\theta}_{t})-\bar{P}_{t}\right\|_{F}+\left\|\bar{P}_{t+1}-\bar{P}_{t}\right\|_{F}\right).

Thus, the term $\mathcal{R}_{3}$ can be upper bounded as

\mathcal{R}_{3}\leq\mathbb{E}\left[\sum_{t=1}^{T}X_{T}^{2}\left(\Lambda_{t+1}+\Lambda_{t}+\|\bar{P}_{t+1}-\bar{P}_{t}\|_{F}\right)\right].  (16)

The result of Lemma 6 can then be obtained by adding the bounds characterized in the following two lemmas.

Lemma 7.
\sum_{t=1}^{T}\mathbb{E}\left[X_{T}^{2}\Lambda_{t}\right]\leq\tilde{\mathcal{O}}\left(\sqrt{T}\,\mathbb{E}\left[X_{T}^{3}\beta_{T}(\delta_{2})\sqrt{\sum_{t=1}^{T}\|V_{t}^{-0.5}z_{t}\|^{2}}\right]\right).
Proof.

Since, for any matrix $X\in\mathbb{R}^{n\times n}$, $\|X\|_{F}\leq\sum_{i,j=1}^{n}|X^{i,j}|$ and $\tilde{\theta}_{t}$ is distributed as $\bar{\theta}_{t}|\mathcal{S}_{\mathcal{Q}}$, we obtain

\Lambda_{t}=\frac{\mathbb{E}\left[\|P(\bar{\theta}_{t})-\bar{P}_{t}\|_{F}\mathds{1}_{\mathcal{S}_{\mathcal{Q}}}(\bar{\theta}_{t})\,|\,\mathcal{F}_{t}^{x}\cup\mathcal{F}_{s},E_{t}\right]}{\mathbb{P}\left(\bar{\theta}_{t}\in\mathcal{S}_{\mathcal{Q}}\,|\,\mathcal{F}_{t}^{x}\cup\mathcal{F}_{s},E_{t}\right)}\leq\frac{\sum_{i,j=1}^{n}\Lambda^{i,j}_{t}}{\mathbb{P}\left(\bar{\theta}_{t}\in\mathcal{S}_{\mathcal{Q}}\,|\,\mathcal{F}_{t}^{x}\cup\mathcal{F}_{s},E_{t}\right)},

where $\Lambda^{i,j}_{t}=\mathbb{E}\left[|P(\bar{\theta}_{t})^{i,j}-\bar{P}_{t}^{i,j}|\mathds{1}_{\mathcal{S}_{\mathcal{Q}}}|\mathcal{F}_{t}^{x}\cup\mathcal{F}_{s},E_{t}\right]$. Using [15, Proposition 7] followed by [15, Proposition 8] yields

\Lambda_{t}\leq\frac{4\rho M_{P}n^{2}\beta_{t}^{\prime}(\delta_{2})}{1-\rho^{2}}\,\mathbb{E}\left[\|V_{t}^{-0.5}H(\tilde{\theta}_{t})\|_{F}\,|\,\mathcal{F}_{t}^{x}\cup\mathcal{F}_{s},E_{t}\right],  (17)

where $H(\theta)^{\top}\coloneqq\begin{bmatrix}\mathbf{I}&K(\theta)^{\top}\end{bmatrix}$. Since, on $E_{t}$, $\bar{\theta}_{t}\in\mathcal{E}_{t}^{\text{TS}}$ and $\tilde{\theta}_{t}=\bar{\theta}_{t}|\mathcal{S}_{\mathcal{Q}}$, applying [15, Proposition 11] (which can be found in the proof of [15, Proposition 9]) yields

\|V_{t}^{-0.5}H(\bar{\theta}_{t})\|_{F}\leq\left(1+\frac{1}{\beta_{0}^{2}}\right)^{2}X_{T}\,\mathbb{E}\left[\|V_{t}^{-0.5}z_{t}\|_{2}\,|\,\mathcal{F}_{t-1}\cup\mathcal{F}_{s},\bar{\theta}_{t},E_{t-1}\right].

Substituting in equation (17) yields

\Lambda_{t}\leq\mathcal{O}\left(\mathbb{E}\left[X_{T}\beta_{T}^{\prime}(\delta_{2})\|V_{t}^{-0.5}z_{t}\|_{2}\,|\,\mathcal{F}_{t}^{x}\cup\mathcal{F}_{s},E_{t}\right]\right),

where we used the law of iterated expectations. Substituting $\beta_{T}^{\prime}(\delta_{2})$ yields

\sum_{t=1}^{T}\mathbb{E}\left[X_{T}^{2}\Lambda_{t}\right]\leq\tilde{\mathcal{O}}\left(\mathbb{E}\left[X_{T}^{3}\beta_{T}(\delta_{2})\sum_{t=1}^{T}\|V_{t}^{-0.5}z_{t}\|_{2}\right]\right).

Applying the Cauchy-Schwarz inequality establishes the claim. ∎

Lemma 8.

\mathbb{E}\left[\sum_{t=1}^{T}X_{T}^{2}\|\bar{P}_{t+1}-\bar{P}_{t}\|_{F}\right]\leq\tilde{\mathcal{O}}\left(\sqrt{T}\,\mathbb{E}\left[X_{T}^{3}\beta_{T}(\delta_{2})\sqrt{\sum_{t=1}^{T}\|V_{t}^{-0.5}z_{t}\|_{2}^{2}}\right]\right).

Proof.

Let $\phi_{t}$ and $\Phi_{t}$ be the probability distribution functions of $\bar{\theta}_{t}|\mathcal{F}_{t}^{x}\cup\mathcal{F}_{s}$ and $\bar{\theta}_{t}|\mathcal{F}_{t}^{x}\cup\mathcal{F}_{s},E_{t}$, respectively. Following similar steps as in [15] yields $\int_{\mathcal{S}_{\mathcal{Q}}}|\phi_{t+1}(\theta)-\phi_{t}(\theta)|d\theta\leq\sqrt{2\text{KL}(\phi_{t}||\phi_{t+1})}$, where $\text{KL}(\cdot||\cdot)$ denotes the KL divergence between two distributions. Using Lemma 11 and considering the expectation and the summation operators from $\mathcal{R}_{3}$ yields

\mathbb{E}\Big[\sum_{t=1}^{T}X_{T}^{2}\|\bar{P}_{t+1}-\bar{P}_{t}\|_{F}\Big]\leq\tilde{\mathcal{O}}\left(\mathbb{E}\left[\beta_{T}(\delta_{2})X_{T}^{3}\sum_{t=1}^{T}\|V_{t}^{-0.5}z_{t}\|_{2}\right]\right).

The claim then follows by using the Cauchy-Schwarz inequality. ∎

-D Additional Lemmas

Lemma 9.

For any $j\geq 1$ and any $T$, $\mathbb{E}[X_{T}^{j}~|~\mathcal{F}_{S}]\leq\mathcal{O}\left(\log(T)(1-\rho)^{-j}\right)$.

Proof.

Since $u_{t}=K(\tilde{\theta}_{t})x_{t}$, using the triangle inequality,

\|x_{t+1}\|_{2}=\|(A_{*}+B_{*}K(\tilde{\theta}_{t}))x_{t}+w_{t}\|_{2}\leq\|(A_{*}+B_{*}K(\tilde{\theta}_{t}))x_{t}\|_{2}+\|w_{t}\|_{2}.

Using the property of the matrix norm and since the sampled parameter is an element of $\mathcal{Q}$ due to the rejection operator,

\|x_{t+1}\|_{2}\leq\|A_{*}+B_{*}K(\tilde{\theta}_{t})\|_{2}\|x_{t}\|_{2}+\|w_{t}\|_{2}\leq\rho\|x_{t}\|_{2}+\|w_{t}\|_{2}.

From this point on, the proof is analogous to the proof of [3, Lemma 2] and has been omitted for brevity. ∎

Lemma 10.

For the set $\mathcal{P}$ defined in equation (13), $\mathbb{E}[X_{S}^{2}]\leq\mathcal{O}(\log S)$ holds for Algorithm TSAC.

Proof.

Suppose that, for any $s\in\{i\tau_{0},\dots,(i+1)\tau_{0}-1\}$ in the $i$th iteration, $s\leq S_{0}$ holds. Then,

\|x_{s+1}\|_{2}=\|A_{\textup{prev}}^{*}x_{s}+B_{\textup{prev}}^{*}K(\tilde{\theta}_{\textup{prev}}^{i})x_{s}+B_{\textup{prev}}^{*}\nu_{s}+w_{s}\|_{2}\leq\rho_{\textup{prev}}\|x_{s}\|_{2}+\|w_{s}\|_{2}+\|B_{\textup{prev}}^{*}\nu_{s}\|_{2},

where in the last inequality we used that the sampled parameter is an element of $\mathcal{P}^{\prime}$. Applying this recursion iteratively yields $\|x_{s}\|_{2}\leq\sum_{j<s}\rho_{\textup{prev}}^{s-j-1}\left(\|w_{j}\|_{2}+\|B_{\textup{prev}}^{*}\nu_{j}\|_{2}\right)$, which further yields $X_{S}^{2}\leq\frac{1}{(1-\rho_{\textup{prev}})^{2}}\left(\max_{j<S}\|w_{j}\|_{2}+\max_{j<S}\|B_{\textup{prev}}^{*}\nu_{j}\|_{2}\right)^{2}$. Let $\nu_{j}^{B}\coloneqq B_{\textup{prev}}^{*}\nu_{j}$. We now bound $\mathbb{E}[\max_{j<S}\|\nu^{B}_{j}\|_{2}^{2}]$. Observe that $\exp\left(\mathbb{E}\left[\max_{j\leq S}\|\nu_{j}^{B}\|_{2}^{2}\right]\right)\leq\mathbb{E}\left[\exp\left(\max_{j\leq S}\|\nu_{j}^{B}\|_{2}^{2}\right)\right]\leq\mathbb{E}\left[\sum_{j\leq S}\exp\left(\|\nu_{j}^{B}\|_{2}^{2}\right)\right]=S\,\mathbb{E}\left[\exp\left(\|\nu_{1}^{B}\|_{2}^{2}\right)\right]$. Similarly, $\exp\left(\mathbb{E}\left[\max_{j\leq S}\|w_{j}\|_{2}^{2}\right]\right)\leq S\,\mathbb{E}\left[\exp\left(\|w_{1}\|_{2}^{2}\right)\right]$. Further, following analogous steps, we can bound $\exp\left(\mathbb{E}\left[\max_{j\leq S}\|w_{j}\|_{2}\right]\right)\leq S\,\mathbb{E}\left[\exp\left(\|w_{1}\|_{2}\right)\right]$ and $\exp\left(\mathbb{E}\left[\max_{j\leq S}\|\nu_{j}^{B}\|_{2}\right]\right)\leq S\,\mathbb{E}\left[\exp\left(\|\nu_{1}^{B}\|_{2}\right)\right]$. Using the fact that $w_{s}$ and $\nu_{s}$ are independent yields $\mathbb{E}[X_{S}^{2}]\leq\mathcal{O}(\log S)$. The proof for the case when $s>S_{0}$ holds, for any $s\in\{i\tau_{0},\dots,(i+1)\tau_{0}-1\}$ in the $i$th iteration, is analogous to that of Lemma 9. ∎

Lemma 11.

Let $\phi_{t}(\theta)$ denote the probability distribution function of $\bar{\theta}_{t}|\mathcal{F}_{t}\cup\mathcal{F}_{s}$. Then, $\text{KL}(\phi_{t}\|\phi_{t+1})\leq\delta_{\textup{KL}}\|V_{t}^{-0.5}z_{t}\|^{2}_{F}$, where $\delta_{\textup{KL}}=\frac{n^{2}(n+m)}{2\beta_{0}}+\frac{1}{2}+\frac{\beta_{T}^{2}(\delta_{2})M_{K}^{2}X_{T}^{2}/\gamma+W}{2\beta_{0}}$.

Proof.

The proof is analogous to that of [15, Proposition 10] and thus has been omitted for brevity. ∎

Lemma 12.

For a given $\delta_{1}\in(0,1)$, suppose that $S\geq 200(n+m)\log\tfrac{12}{\delta_{1}}$ and $\|z_{t}\|\leq Z$ for all $t\geq 0$. Then, with probability $1-\delta_{1}$,

\log\frac{\det(V_{T})}{\det(U_{S})}\leq(n+m)\log\left(1+\frac{40TZ^{2}}{(n+m)S}\right).
Proof.

The proof follows from the AM-GM inequality and Assumption 2. Specifically, since $\det(M)\leq\left(\mathbf{Tr}(M)/(n+m)\right)^{n+m}$ for any positive semidefinite $M\in\mathbb{R}^{(n+m)\times(n+m)}$,

\frac{\det(V_{T})}{\det(U_{S})}=\det\left(U_{S}^{-0.5}V_{T}U_{S}^{-0.5}\right)\leq\left(1+\frac{\sum_{k=0}^{T-1}z_{k}^{\top}U_{S}^{-1}z_{k}}{n+m}\right)^{n+m},

and $z_{k}^{\top}U_{S}^{-1}z_{k}\leq Z^{2}/\lambda_{\min}(U_{S})\leq 40Z^{2}/S$ by Assumption 2. Taking logarithms establishes the claim. ∎

Lemma 13.

For a given $\delta_{1}\in(0,1)$, suppose that $S\geq 200(n+m)\log\tfrac{12}{\delta_{1}}$ and $\|z_{t}\|\leq Z$ for all $t\geq 0$. Then, with probability $1-\delta_{1}$,

\sum_{k=1}^{t}\|V_{k}^{-0.5}z_{k}\|_{2}^{2}\leq 2\max\left\{1,\frac{40Z^{2}}{S}\right\}\log\left(\frac{\det(V_{t})}{\det(U_{S})}\right).
Proof.

From [23, Lemma 4],

\sum_{k=0}^{t}\|V_{k}^{-0.5}z_{k}\|_{2}^{2}\leq 2\max\left\{1,\frac{Z^{2}}{\lambda_{\min}(V_{t})}\right\}\log\left(\frac{\det(V_{t})}{\det(U_{S})}\right)\leq 2\max\left\{1,\frac{Z^{2}}{\lambda_{\min}(U_{S})}\right\}\log\left(\frac{\det(V_{t})}{\det(U_{S})}\right)\leq 2\max\left\{1,\frac{40Z^{2}}{S}\right\}\log\left(\frac{\det(V_{t})}{\det(U_{S})}\right),

where for the second inequality we used the fact that $\lambda_{\min}(V_{t})\geq\lambda_{\min}(U_{S})$ and for the third inequality we used Assumption 2. ∎

References

  • [1] Y. Abbasi-Yadkori and C. Szepesvári, “Regret bounds for the adaptive control of linear quadratic systems,” in Proceedings of the 24th Annual Conference on Learning Theory, pp. 1–26, JMLR Workshop and Conference Proceedings, 2011.
  • [2] T. Kargin, S. Lale, K. Azizzadenesheli, A. Anandkumar, and B. Hassibi, “Thompson sampling achieves $\tilde{O}(\sqrt{T})$ regret in linear quadratic control,” in Conference on Learning Theory, pp. 3235–3284, PMLR, 2022.
  • [3] Y. Ouyang, M. Gagrani, and R. Jain, “Control of unknown linear systems with Thompson sampling,” in 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1198–1205, IEEE, 2017.
  • [4] A. Cohen, T. Koren, and Y. Mansour, “Learning linear-quadratic regulators efficiently with only $\sqrt{T}$ regret,” in International Conference on Machine Learning, pp. 1300–1309, PMLR, 2019.
  • [5] Y. Li, “Reinforcement learning in practice: Opportunities and challenges,” arXiv preprint arXiv:2202.11296, 2022.
  • [6] R. F. Prudencio, M. R. Maximo, and E. L. Colombini, “A survey on offline reinforcement learning: Taxonomy, review, and open problems,” IEEE Transactions on Neural Networks and Learning Systems, 2023.
  • [7] W. Zhao, J. P. Queralta, and T. Westerlund, “Sim-to-real transfer in deep reinforcement learning for robotics: a survey,” in 2020 IEEE symposium series on computational intelligence (SSCI), pp. 737–744, IEEE, 2020.
  • [8] P. Shivaswamy and T. Joachims, “Multi-armed bandit problems with history,” in Artificial Intelligence and Statistics, pp. 1046–1054, PMLR, 2012.
  • [9] C. Zhang, A. Agarwal, H. D. Iii, J. Langford, and S. Negahban, “Warm-starting contextual bandits: Robustly combining supervised and bandit feedback,” in Proceedings of the 36th International Conference on Machine Learning, vol. 97, pp. 7335–7344, PMLR, 2019.
  • [10] C. Kausik, K. Tan, and A. Tewari, “Leveraging offline data in linear latent bandits,” arXiv preprint arXiv:2405.17324, 2024.
  • [11] B. Hao, R. Jain, T. Lattimore, B. Van Roy, and Z. Wen, “Leveraging demonstrations to improve online learning: Quality matters,” in International Conference on Machine Learning, pp. 12527–12545, PMLR, 2023.
  • [12] H. Mania, S. Tu, and B. Recht, “Certainty equivalence is efficient for linear quadratic control,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [13] D. Baby and Y.-X. Wang, “Optimal dynamic regret in LQR control,” Advances in Neural Information Processing Systems, vol. 35, pp. 24879–24892, 2022.
  • [14] T.-J. Chang and S. Shahrampour, “Regret analysis of distributed online LQR control for unknown LTI systems,” IEEE Transactions on Automatic Control, 2023.
  • [15] M. Abeille and A. Lazaric, “Improved regret bounds for Thompson sampling in linear quadratic control problems,” in International Conference on Machine Learning, pp. 1–9, PMLR, 2018.
  • [16] T. Guo and F. Pasqualetti, “Transfer learning for LQR control,” arXiv preprint arXiv:2503.06755, 2025.
  • [17] T. Guo, A. A. Al Makdah, V. Krishnan, and F. Pasqualetti, “Imitation and transfer learning for LQG control,” IEEE Control Systems Letters, vol. 7, pp. 2149–2154, 2023.
  • [18] L. Li, C. De Persis, P. Tesi, and N. Monshizadeh, “Data-based transfer stabilization in linear systems,” IEEE Transactions on Automatic Control, vol. 69, no. 3, pp. 1866–1873, 2023.
  • [19] L. Xin, L. Ye, G. Chiu, and S. Sundaram, “Learning dynamical systems by leveraging data from similar systems,” IEEE Transactions on Automatic Control, 2025.
  • [20] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, “Improved algorithms for linear stochastic bandits,” Advances in neural information processing systems, vol. 24, 2011.
  • [21] A. Cassel, A. Cohen, and T. Koren, “Logarithmic regret for learning linear quadratic regulators efficiently,” in International Conference on Machine Learning, pp. 1328–1337, PMLR, 2020.
  • [22] I. Osband and B. Van Roy, “Posterior sampling for reinforcement learning without episodes,” arXiv preprint arXiv:1608.02731, 2016.
  • [23] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, “Online least squares estimation with self-normalized processes: An application to bandit problems,” arXiv preprint arXiv:1102.2670, 2011.