
Regret-Guaranteed Safe Switching with Minimum Cost: LQR Setting with Unknown Dynamics

Jafar Abbaszadeh Chekan and Cédric Langbort

J. A. Chekan and C. Langbort (emails: jafar2 & langbort@illinois.edu) are with the Coordinated Science Laboratory and the Department of Aerospace Engineering at the University of Illinois at Urbana-Champaign (UIUC).
Abstract

Externally Forced Switched (EFS) systems are a subset of switched systems in which switches occur deliberately to meet an external requirement. However, fast switching can lead to instability, even when all closed-loop modes are stable. In this study, we focus on an EFS scenario with unknown system dynamics, where the next mode to switch to is revealed by an external entity in real time as the switch occurs. The challenge is to track the revealed sequence while (1) minimizing accumulated cost in a regret sense and (2) ensuring that the norm of the system's state does not grow excessively, a property we refer to as 'the safety of switching.' Achieving the latter requires the closed-loop system to remain in each revealed mode for some minimum dwell time, which must be learned online. We propose an algorithm based on the principle of Optimism in the Face of Uncertainty. This algorithm jointly establishes confidence sets for the unknown parameters, devises a feedback policy, and estimates a minimum dwell time for each revealed mode from data. By precisely estimating the dwell-time error, our strategy yields an expected regret of $\mathcal{O}(|M|\sqrt{ns})$, where $ns$ and $|M|$ denote the total number of switches and the number of modes, respectively. We benchmark this approach against scenarios with known parameters.

Index Terms:
Online Learning, Switched Systems, Dwell-Time, Regret

I Introduction

Great strides have been made over the past few years in the application of learning-based and data-driven control techniques to both linear and nonlinear systems whose parameters are unknown [1, 2, 3, 4, 5, 6, 7]. One class of plants for which progress in this direction is not yet as mature, however, is switched systems (i.e., systems consisting of a number of continuous-time modes whose transitions are governed by some discrete protocol [8, 9, 10, 11]), because the combination of continuous and discrete aspects complicates the learning process.

In this work, we are primarily interested in developing such learning-based tools for unknown switched systems in the so-called externally forced switching (EFS) scenarios, in which either the time of the switch, the sequence of modes, or both are dictated to the system by an external entity [8]. In particular, we focus on a setting where only the next mode is dictated, and the decision regarding when to switch to that mode is made by the system itself. More precisely, the next mode to switch to is revealed in an online fashion, as the switch actually occurs, and the problem is to follow the revealed sequence while (1) minimizing accumulated cost in a regret sense and (2) ensuring that the norm of the system's state does not grow inordinately, a property we refer to as 'the safety of switching'. This safety concern removes the option of quick switching (which would appear rational if the goal were solely cost minimization), since rapid switching can result in state explosion even when the individual closed-loop modes are stable [12, 13, 14]. A typical approach to avoiding such state explosion is to ensure that the system remains in each mode for a minimal specified duration, referred to as the minimum dwell time, whose computation usually requires a system model [10, 11]. The central difficulty, then, in the absence of a priori knowledge of the modes' parameters or even of the full sequence of switches (which would allow for computation of an average dwell time), is to compute minimum dwell times for all modes dynamically and directly from data. Our goal in this paper is to design a strategy that involves specifying the switch sequence and designing the feedback gains in a scenario where the system's dynamics are unknown and there is only noisy access to the state of the subsystems (modes). We refer to this setting as "unknown modes-unknown sequence" ($\bar{M}\bar{S}$).
To address this problem, we first analyze it in two distinct scenarios: one where both the modes and the sequence are known ($MS$), and another where the modes are known but the sequence is concealed ($M\bar{S}$). The $M\bar{S}$ scenario, which employs the same information disclosure mechanism as our proposed problem, serves as a reasonable benchmark for the algorithm designed for the $\bar{M}\bar{S}$ setting. We demonstrate that achieving optimal solutions for the $M\bar{S}$ scenario requires more than just knowledge of the next mode. This leads us to propose a practical approach for the $M\bar{S}$ setting that implements the policy obtained by solving the Discrete Algebraic Riccati Equation (DARE) and crafts a corresponding mode-dependent dwell-time strategy. Afterward, we apply an Optimism in the Face of Uncertainty (OFU) based algorithm to tackle the $\bar{M}\bar{S}$ setting. Our proposed strategy designs the minimum dwell time and the feedback policy fully online, relying solely on state measurements, to enable switching according to the given sequence with guaranteed regret compared to a reference strategy for the $M\bar{S}$ setting. Our main contributions are as follows:

  • We propose a computationally efficient algorithm to estimate dwell time in an unknown dynamic setting.

  • We prove that the estimated dwell time is close to the true but unknown dwell time by providing an upper bound for the estimation error.

  • We prove that our proposed algorithm has regret of $\mathcal{O}(|\mathcal{M}|\sqrt{ns})$ compared to the case when the parameters of all modes are known.

At a high level, our work differs from the existing literature in terms of setup (i.e., (1) what is (not) known a priori about the modes and switching sequence, and (2) whether the switching signal is itself actionable), closed-loop goals (mere stability vs. performance guarantees) and methods used (direct vs. no identification of unknown parameters). For example, [15] use a set of random observations from different trajectories of the switched system to perform probabilistic stability analysis even in the absence of knowledge of the modes and switching rule. The authors of [16] propose a data-driven control approach for an unknown discrete-time linear system that switches between a finite number of modes, with the switching signal being fully unknown. They design control rules, which can automatically adapt to changes in the actuation mode, based solely on data and without explicit identification. Likewise, in [17] and [18], stabilizing controllers are designed for linear switched systems without explicit plant identification. These controllers rely solely on experimental data obtained from an experiment that thoroughly excites all the modes. The designed switched state feedback controller is robust in the sense that it guarantees stability for any switching sequence. In the work presented by [19], the authors put forth an online switching controller that accomplishes both mode detection and stabilization. This is achieved solely by utilizing a finite set of trajectories for each mode, which are provided offline.

In contrast with these works, which consider solely stabilization as their goal, Du et al., in [20], consider an unknown Markov jump system within a Linear Quadratic (LQ) framework and propose an algorithm that jointly learns the unknown parameters of the modes and the Markov transition matrix that governs the evolution of the mode switches, while achieving guaranteed regret. The authors of [21] also consider regret minimization as part of their performance objective but, in a manner most closely related to our setup, also allow for the switching time between modes to be part of the control design and to be determined online. Their algorithm employs a multi-armed bandit like analysis and, by relying on cost measurements, switches online between policies drawn from a finite candidate pool in an effort to identify the best controller. This switching adheres to a minimum dwell-time policy of the kind mentioned earlier, but uses a conservative bound for this dwell-time, which impacts the performance of the closed-loop system. This is also the case in our own previous work [22], where we introduced a projection-equipped algorithm for unknown switching over-actuated systems.

At the technical level, our approach builds upon existing model-based RL algorithms, which come with guaranteed regret and use system identification as a core phase. We use the principle of Optimism in the Face of Uncertainty (OFU), as originally introduced in [23] and later strengthened in various ways in [24, 25, 26, 27, 7], to obtain sublinear regret and/or accelerate the stabilization of the closed-loop system. More precisely, we exploit the relaxed-semidefinite programming (SDP) formulation of [28] in our new, switched-system, context.

Paper Structure. The remainder of the paper is organized as follows: Section II presents the problem statement. In Section III, we discuss the solution when all system parameters are known. Section IV provides a brief review of constructing confidence sets (system identification) and the relaxation of the standard LQR-SDP formulation for control and dwell-time design. The proposed algorithm and its main steps are presented in Section V. Section VI summarizes the regret bound guarantees and stability analysis, including dwell-time design. Detailed analysis and proofs are provided in the Appendix.

Notations. $\|M\|_{*}=\operatorname{tr}(\sqrt{M^{\top}M})$ is the nuclear (trace) norm of a matrix $M$, and $\underline{\lambda}(M)$ and $\overline{\lambda}(M)$ are its minimum and maximum eigenvalues. For a set $\mathcal{S}$, $|\mathcal{S}|$ denotes its cardinality.

II Problem Statement and Formulation

Consider a time-invariant switched LQR system

\begin{align}
x_{t+1} &= A^{\sigma(t)}_{*}x_{t}+B^{\sigma(t)}_{*}u_{t}+\omega_{t+1} \tag{1}\\
c^{\sigma(t)}_{t} &= x_{t}^{\top}Q^{\sigma(t)}x_{t}+u_{t}^{\top}R^{\sigma(t)}u_{t} \tag{2}
\end{align}

where $\mathcal{M}$ is an index set and $\sigma:\{0\}\cup\mathbb{N}\rightarrow\mathcal{M}$ is a right-continuous, piecewise-constant switching signal that specifies which mode is active at any time $t$. The matrices $A^{i}_{*}$, $B^{i}_{*}$, $i\in\mathcal{M}$, of the individual modes are initially unknown to the learner, and $Q^{i}$, $R^{i}$ are known cost matrices.

We make the following assumption on the process noise, which is standard in the controls community (see [24, 7, 26]).

Assumption 1

There exists a filtration $\mathcal{F}_{t}$ such that

(1.1) $\omega_{t+1}$ is a martingale difference, i.e., $\mathbb{E}[\omega_{t+1}\,|\,\mathcal{F}_{t}]=0$;

(1.2) $\mathbb{E}[\omega_{t+1}\omega_{t+1}^{\top}\,|\,\mathcal{F}_{t}]=\bar{\sigma}_{\omega}^{2}I_{n}=:W$ for some $\bar{\sigma}_{\omega}^{2}>0$;

(1.3) $\omega_{t}$ are component-wise sub-Gaussian, i.e., there exists $\sigma_{\omega}>0$ such that for any $\gamma\in\mathbb{R}$ and $j=1,2,\ldots,n$

$$\mathbb{E}[e^{\gamma\omega_{j}(t+1)}\,|\,\mathcal{F}_{t}]\leq e^{\gamma^{2}\sigma_{\omega}^{2}/2}.$$

Let $\mathcal{I}=\{i_{0},i_{1},\ldots,i_{n}\}$ represent the sequence of modes through which the system switches. This sequence is not known in advance and is revealed gradually: only the next mode to switch to, denoted $i_{k+1}\in\mathcal{M}$, is disclosed after the current switch to $i_{k}\in\mathcal{M}$ has occurred. We refer to the time interval between two subsequent switches as an 'epoch.' The termination of the sequence is uncertain and is randomly announced after the "last" mode $i_{n}$ is revealed. The objective is to develop an algorithm that guarantees actuation according to the sequence $\mathcal{I}$ with minimum cost, subject to keeping the expected growth of the state norm under control, as defined below.

Definition 1

Let $\mathfrak{T}_{i_{k+1}}^{+}$ and $\mathfrak{T}_{i_{k}}^{+}$ represent the time instants immediately following two subsequent switches between any two arbitrary modes $i_{k}$ and $i_{k+1}$, both in $\mathcal{I}$. We say that the expected state norm growth is $(\bar{\alpha},\bar{\beta})$-under control, where $0<\bar{\alpha}<1$ and $\bar{\beta}>0$, if the following condition holds:

$$\mathbb{E}[x_{\mathfrak{T}_{i_{k+1}}^{+}}^{\top}x_{\mathfrak{T}_{i_{k+1}}^{+}}\,|\,\mathcal{F}_{\mathfrak{T}_{i_{k+1}}-1}]\leq\bar{\alpha}\,\mathbb{E}[x_{\mathfrak{T}_{i_{k}}^{+}}^{\top}x_{\mathfrak{T}_{i_{k}}^{+}}\,|\,\mathcal{F}_{\mathfrak{T}_{i_{k}}-1}]+\bar{\beta}\sigma_{\omega}^{2} \tag{3}$$

In Definition 1, note that $\bar{\alpha}$ is a user-defined parameter, whereas $\bar{\beta}$ depends on the policy class used for control design and on the cost function associated with each mode, both of which are introduced later.

To minimize accumulated cost, switching fast seems rational. However, it is well known that, even when the mode characteristics are known, fast switching between modes can increase the norm of the system's state, which jeopardizes global stability even if each mode is itself stable [8, 9, 10, 11]. Accordingly, switching fast can violate state norm growth control in the sense of Definition 1. The issue can be alleviated by remaining in each mode for a computable minimum duration. Hence, the key challenge in addressing this problem boils down to simultaneously designing the policies to be applied during the different epochs and determining the appropriate timing of the switches.
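The instability mechanism described above can be illustrated with a small noise-free example. The sketch below is not taken from the paper: the closed-loop matrices `A1` and `A2` are hypothetical values chosen for illustration. Each has spectral radius 0.5, yet alternating between them at every step makes the state norm grow, whereas holding each mode for eight steps lets it decay.

```python
import math

# Hypothetical stable closed-loop matrices (both eigenvalues equal 0.5).
A1 = [[0.5, 2.0], [0.0, 0.5]]
A2 = [[0.5, 0.0], [2.0, 0.5]]


def step(A, x):
    """Apply one noise-free step x <- A x for a 2x2 matrix A."""
    return [A[0][0] * x[0] + A[0][1] * x[1],
            A[1][0] * x[0] + A[1][1] * x[1]]


def final_norm(dwell, horizon, x0=(1.0, 1.0)):
    """Alternate between A1 and A2, staying `dwell` steps in each mode,
    and return the Euclidean norm of the state after `horizon` steps."""
    x = list(x0)
    for t in range(horizon):
        A = A1 if (t // dwell) % 2 == 0 else A2
        x = step(A, x)
    return math.hypot(x[0], x[1])


if __name__ == "__main__":
    print("dwell = 1:", final_norm(dwell=1, horizon=40))  # grows
    print("dwell = 8:", final_norm(dwell=8, horizon=80))  # decays
```

With a dwell of one step the two-step product of the matrices has an eigenvalue above one, which is what drives the growth; a longer dwell gives each stable mode time to contract the state before the next switch.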

To lay the foundation for developing an algorithm for the proposed problem, referred to as Unknown-Modes and Unknown-Sequence ($\bar{M}\bar{S}$), it is crucial to first examine the problem in the settings of (1) Known-Modes and Known-Sequence ($MS$) and (2) Known-Modes and Unknown-Sequence ($M\bar{S}$). This analysis not only enhances our understanding of the problem but also establishes a baseline for defining regret, whose upper bound serves as the metric for evaluating our proposed algorithm.

Expanding upon the solution for the $M\bar{S}$ setup presented in Section III, we propose an algorithm in Section V to address the $\bar{M}\bar{S}$ setting. Our algorithm relies on high-probability estimates of the parameters, which are obtained through the system identification procedure outlined in Section IV-B.

By introducing the notation $\Theta_{*}^{i}=(A_{*}^{i},B_{*}^{i})^{\top}$, we can express (1) equivalently as

$$x_{t+1}={\Theta^{i}_{*}}^{\top}z_{t}+\omega_{t+1},\quad z_{t}=\begin{pmatrix}x_{t}\\ u_{t}\end{pmatrix} \tag{4}$$

for the active mode $\sigma(t)=i$. Formulation (4) will be used throughout to represent the dynamics model.

For control design purposes, we restrict the policy for each mode $i$ to the compact set $\mathcal{S}(\Theta_{*}^{i})$, specified by constants $\kappa^{i}_{c}>0$ and $0<\gamma_{c}^{i}<1$, of stabilizing policies $K_{i}$ for the system parameterized by $\Theta_{*}^{i}$:

$$\mathcal{S}(\Theta_{*}^{i})=\big\{K_{i}\in\mathbb{R}^{(n_{i}+m_{i})\times n_{i}}\;\big|\;\|K_{i}\|\leq\kappa^{i}_{c},\;\rho(A_{*}^{i}+B_{*}^{i}K_{i})<1-\gamma_{c}^{i}\big\}. \tag{5}$$

Given the policy class (5), the following lemma explicitly defines the parameter $\bar{\beta}$ in Definition 1.

Lemma 1

With the policy class (5) employed for control design across all modes $i\in\mathcal{M}$, within the framework of the switching scenario of Definition 1, if $\mathfrak{T}_{i_{k+1}}-\mathfrak{T}_{i_{k}}\geq\tau_{i_{k}i_{k+1}}$ for some $\tau_{i_{k}i_{k+1}}\geq 1$, then the expected state norm is $(\bar{\alpha},\bar{\beta})$-under control with a user-defined parameter $\bar{\alpha}$ and $\bar{\beta}$ defined by

$$\bar{\beta}:=\frac{{4\alpha^{*}_{1}}^{2}}{{\alpha^{*}_{0}}^{2}}\frac{{\kappa^{*}_{c}}^{4}(1+{\kappa^{*}_{c}}^{2})^{2}}{{\gamma^{*}_{c}}^{2}} \tag{6}$$

where $\tau_{i_{k}i_{k+1}}$ is $\bar{\alpha}$-dependent, $\kappa^{*}_{c}=\max_{i\in\mathcal{M}}\kappa^{i}_{c}$, $\gamma^{*}_{c}=\max_{i\in\mathcal{M}}\gamma^{i}_{c}$, and $\alpha^{*}_{0}I\preceq Q^{i},R^{i}\preceq\alpha^{*}_{1}I$ for all $i\in\mathcal{M}$.

III Solution for $MS$ and $M\bar{S}$ Setups

In this section, we address the problem of efficient and safe switching in two distinct scenarios: $MS$ and $M\bar{S}$. For each scenario, we first introduce the formulation and then provide an upper bound on the performance gap between their solutions.

III-A $MS$ setup

Consider a scenario where the switch sequence $\mathcal{I}$ and the system dynamics of all subsystems are known. In this context, the central challenge is to determine the switch times $\bar{\mathcal{T}}=\{\mathfrak{T}_{i_{0}},\ldots,\mathfrak{T}_{i_{ns-1}}\}$, with the starting time fixed at $\mathfrak{T}_{i_{0}}=0$, along with the set of feedback gains $\bar{\mathcal{K}}=\{\{\mathcal{K}_{i_{0}}\},\ldots,\{\mathcal{K}_{i_{ns-1}}\}\}$ to be applied at specific time instances, with the aim of minimizing the cumulative cost. The subset $\{\mathcal{K}_{i_{j}}\}:=\{K_{\mathfrak{T}_{i_{j}}},K_{\mathfrak{T}_{i_{j}}+1},\ldots,K_{\mathfrak{T}_{i_{j+1}}-1}\}$, whose members belong to $\mathcal{S}(\Theta_{*}^{i_{j}})$, represents the sequence of feedback gains to be implemented during the time steps within the epoch following the $j$-th switch. It is noteworthy that the cardinality of this sequence is $\mathfrak{T}_{i_{j+1}}-\mathfrak{T}_{i_{j}}$, which is itself a decision variable. The formulation of this problem is given by Problem 0, as follows:

III-A1 $MS$ (Program 0)

\begin{align}
\bar{\mathcal{T}}^{*}_{0},\bar{\mathcal{K}}^{*}_{0}=\operatorname*{argmin}_{\bar{\mathcal{K}},\bar{\mathcal{T}}}\;&\sum_{k=0}^{ns-1}\sum_{t=\mathfrak{T}_{i_{k}}}^{\mathfrak{T}_{i_{k+1}}-1}c_{t}^{i_{k}} \tag{7}\\
\text{s.t.}\quad & x_{t+1}=A^{\sigma(t)}_{*}x_{t}+B^{\sigma(t)}_{*}u_{t}+\omega_{t+1} \notag\\
& \sigma(t)=i_{k}\;\text{for}\;\mathfrak{T}_{i_{k}}\leq t<\mathfrak{T}_{i_{k+1}} \notag\\
& \mathbb{E}[x_{\mathfrak{T}_{i_{k+1}}^{+}}^{\top}x_{\mathfrak{T}_{i_{k+1}}^{+}}\,|\,\mathcal{F}_{\mathfrak{T}_{i_{k+1}}-1}]\leq\bar{\alpha}\,\mathbb{E}[x_{\mathfrak{T}_{i_{k}}^{+}}^{\top}x_{\mathfrak{T}_{i_{k}}^{+}}\,|\,\mathcal{F}_{\mathfrak{T}_{i_{k}}-1}]+\bar{\beta}\sigma_{\omega}^{2} \tag{8}\\
& \mathfrak{T}_{i_{k+1}}-\mathfrak{T}_{i_{k}}\geq 1 \tag{9}
\end{align}

The constraint (9) imposes the requirement of visiting each mode $i_{k}\in\mathcal{I}$. We denote the optimal solution of Program 0 by $\bar{\mathcal{T}}_{0}^{*}=\{\mathfrak{T}_{i_{0}}^{0_{*}},\ldots,\mathfrak{T}^{0_{*}}_{i_{ns-1}}\}$ and $\bar{\mathcal{K}}_{0}^{*}=\{\{\mathcal{K}^{0_{*}}_{i_{0}}\},\ldots,\{\mathcal{K}^{0_{*}}_{i_{ns-1}}\}\}$.

Although Problem 0 benefits from full access to the switch sequence $\mathcal{I}$, solving it remains challenging because it is a complex combinatorial optimization problem. A second challenge arises from the nonconvexity of the set $\mathcal{S}(\Theta_{*}^{i_{k}})$ for any $i_{k}\in\mathcal{I}$, as demonstrated in [29] through a counterexample. This nonconvexity significantly complicates the numerical solution of (7). Moreover, the constraint (8), i.e., the boundedness of the expected state norm as per Definition 1, does not depend explicitly on the design variables $\bar{\mathcal{K}}$ and $\bar{\mathcal{T}}$, which adds a layer of complexity to Problem 0. Problem 1, formulated below, expresses this constraint explicitly in terms of the decision variables by fixing a single policy for each epoch. The fixed per-epoch policy also slightly reduces the combinatorial complexity of the $MS$ Problem 0: solving the problem reduces to determining each epoch's duration and the policy to be employed during that epoch. For Problem 1, without loss of generality and for the sake of brevity, we redefine the decision variables. Letting $\tau_{i_{j}}\in\mathbb{N}$, we use $\mathcal{T}=\{\tau_{i_{0}},\ldots,\tau_{i_{ns-1}}\}$ to represent the set of epoch durations and $\mathcal{K}=\{K_{i_{0}},\ldots,K_{i_{ns-1}}\}$ to denote the set of feedback policies applied in each epoch. In this revised formulation, we have $\tau_{i_{j}i_{j+1}}=\mathfrak{T}_{i_{j+1}}-\mathfrak{T}_{i_{j}}$, and the fixed control gain for the epoch occupying the time interval $[\mathfrak{T}_{i_{j}},\;\mathfrak{T}_{i_{j+1}})$ is $K_{i_{j}}$. With the average expected cost of policy $K_{i_{j}}$ denoted by $J(\Theta^{i_{j}}_{*},K_{i_{j}})$ and computed by

$$J(\Theta^{i}_{*},K_{i})=\lim_{T\rightarrow\infty}\frac{1}{T}\mathbb{E}\Big[\sum_{t=0}^{T-1}x^{\top}_{t}(Q^{i}+K_{i}^{\top}R^{i}K_{i})x_{t}\Big]$$

the expected accumulated cost of operating in mode $i$ for a duration $\tau_{i}$ is $\tau_{i}J(\Theta_{*}^{i},K_{i})$.

Now let $P_{K_{i}}$ be the positive semidefinite (PSD) matrix associated with the fixed gain $K_{i}$ applied in an epoch, defined as the solution of

$$P_{K_{i}}=Q^{i}+K_{i}^{\top}R^{i}K_{i}+(A^{i}_{*}+B^{i}_{*}K_{i})^{\top}P_{K_{i}}(A^{i}_{*}+B^{i}_{*}K_{i}). \tag{10}$$
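As a minimal sketch of how (10) could be evaluated numerically, the code below iterates $P\leftarrow H+L^{\top}PL$ with $L=A+BK$ and $H=Q+K^{\top}RK$, which converges when $K$ is stabilizing. The function names and the numerical values are assumptions made for illustration; in practice a library routine such as `scipy.linalg.solve_discrete_lyapunov` would likely be preferred.

```python
def matmul(X, Y):
    """Multiply two matrices stored as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    """Transpose a matrix stored as nested lists."""
    return [list(row) for row in zip(*X)]


def madd(X, Y):
    """Add two matrices of the same shape."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def lyapunov_cost_matrix(A, B, K, Q, R, iters=2000):
    """Fixed-point iteration for P = H + L^T P L, with L = A + B K and
    H = Q + K^T R K. Assumes the spectral radius of L is below one."""
    L = madd(A, matmul(B, K))
    H = madd(Q, matmul(transpose(K), matmul(R, K)))
    Lt = transpose(L)
    P = [row[:] for row in H]
    for _ in range(iters):
        P = madd(H, matmul(Lt, matmul(P, L)))
    return P, H, L


if __name__ == "__main__":
    # Hypothetical two-state, one-input mode with a stabilizing gain.
    A = [[1.0, 0.2], [0.0, 0.8]]
    B = [[0.0], [1.0]]
    K = [[-0.5, -0.6]]
    Q = [[1.0, 0.0], [0.0, 1.0]]
    R = [[1.0]]
    P, H, L = lyapunov_cost_matrix(A, B, K, Q, R)
    print(P)
```

The returned `P` and `H` are exactly the matrices whose extreme eigenvalues enter the dwell-time quantities in (12).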

The following proposition gives a sufficient condition for the boundedness of the state in the sense of Definition 1.

Proposition 1

Let $\mathfrak{T}_{i_{k+1}}^{+}$ and $\mathfrak{T}_{i_{k}}^{+}$ represent the time instants immediately following two subsequent switches from mode $i_{k}$ to mode $i_{k+1}$, both within $\mathcal{I}$. A sufficient condition for (8), with $\bar{\beta}$ as defined in (6), is

$$\tau_{i_{k}i_{k+1}}:=\mathfrak{T}_{i_{k+1}}-\mathfrak{T}_{i_{k}}\geq-\frac{\ln\bar{\rho}(K_{i_{k}},K_{i_{k+1}})+\ln\bar{\mathcal{X}}(K_{i_{k}},K_{i_{k+1}})-\ln\bar{\alpha}}{\ln\big(1-\bar{\eta}(K_{i_{k}})\big)} \tag{11}$$

where

\begin{align}
\bar{\eta}(K_{i_{k}})&:=\frac{\underline{\lambda}\big(H_{K_{i_{k}}}\big)}{\overline{\lambda}\big(P_{K_{i_{k}}}\big)},\quad\bar{\rho}(K_{i_{k}},K_{i_{k+1}}):=\frac{\overline{\lambda}\big(P_{K_{i_{k+1}}}\big)}{\underline{\lambda}\big(P_{K_{i_{k}}}\big)}, \notag\\
\bar{\mathcal{X}}(K_{i_{k}},K_{i_{k+1}})&:=\frac{\overline{\lambda}\big(P_{K_{i_{k}}}\big)}{\underline{\lambda}\big(P_{K_{i_{k+1}}}\big)},\quad H_{K_{i_{k}}}=Q^{i_{k}}+K_{i_{k}}^{\top}R^{i_{k}}K_{i_{k}}. \tag{12}
\end{align}

III-A2 $MS$ (Program 1)

\begin{align}
\mathcal{K}^{*}_{1},\mathcal{T}^{*}_{1}=\operatorname*{argmin}_{\mathcal{K},\mathcal{T}}\;&\sum_{k=0}^{n-1}\tau_{i_{k}i_{k+1}}J(\Theta_{*}^{i_{k}},K_{i_{k}}) \notag\\
\text{s.t.}\quad & x_{t+1}=A^{\sigma(t)}_{*}x_{t}+B^{\sigma(t)}_{*}u_{t}+\omega_{t+1} \notag\\
& \sigma(t)=i_{k}\;\text{for}\;\mathfrak{T}_{i_{k}}\leq t<\mathfrak{T}_{i_{k+1}} \notag\\
& \ln\bar{\rho}(K_{i_{k}},K_{i_{k+1}})+\tau_{i_{k}}\ln\big(1-\bar{\eta}(K_{i_{k}})\big)+\ln\bar{\mathcal{X}}(K_{i_{k}},K_{i_{k+1}})\leq\ln\bar{\alpha},\quad k=0,1,\ldots,ns-1 \tag{13}\\
& \mathfrak{T}_{i_{k+1}}=\tau_{i_{k}i_{k+1}}+\mathfrak{T}_{i_{k}},\quad\mathfrak{T}_{i_{0}}=0 \notag\\
& \tau_{i_{k}i_{k+1}}\geq 1\quad\forall k \tag{14}
\end{align}

The optimal durations and feedback gains returned by Program 1 are denoted by $\mathcal{T}^{*}_{1}=\{\tau^{1_{*}}_{i_{0}i_{1}},\ldots,\tau^{1_{*}}_{i_{n-1}i_{n}}\}$ and $\mathcal{K}^{*}_{1}=\{K^{1_{*}}_{i_{0}},\ldots,K^{1_{*}}_{i_{ns-1}}\}$.
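The link between the dwell-time bound (11) and the logarithmic constraint (13) can be made explicit with a short calculation. The sketch below is not the paper's proof; it assumes $H_{K_{i_{k}}}\succ 0$, $0<\bar{\eta}(K_{i_{k}})<1$, and that $x_{t}$ is $\mathcal{F}_{t}$-measurable, none of which is stated explicitly above. Writing $P=P_{K_{i_{k}}}$, $H=H_{K_{i_{k}}}$, and $\tau$ for the epoch duration:

```latex
% One step under the fixed gain K_{i_k}, using (10) and Assumption 1:
\begin{align*}
\mathbb{E}\big[x_{t+1}^{\top}P\,x_{t+1}\,\big|\,\mathcal{F}_{t}\big]
  &= x_{t}^{\top}(P-H)\,x_{t}+\bar{\sigma}_{\omega}^{2}\operatorname{tr}(P)\\
  &\le \big(1-\bar{\eta}(K_{i_{k}})\big)\,x_{t}^{\top}P\,x_{t}
     +\bar{\sigma}_{\omega}^{2}\operatorname{tr}(P),
\end{align*}
% because H \succeq \underline{\lambda}(H) I
%           \succeq \frac{\underline{\lambda}(H)}{\overline{\lambda}(P)} P.
% Since \ln(1-\bar{\eta}) < 0, dividing by it reverses the inequality, so
\begin{align*}
\tau \ge -\frac{\ln\bar{\rho}+\ln\bar{\mathcal{X}}-\ln\bar{\alpha}}
               {\ln(1-\bar{\eta})}
\;\Longleftrightarrow\;
\ln\bar{\rho}+\tau\ln(1-\bar{\eta})+\ln\bar{\mathcal{X}}\le\ln\bar{\alpha}
\;\Longleftrightarrow\;
\bar{\rho}\,\bar{\mathcal{X}}\,(1-\bar{\eta})^{\tau}\le\bar{\alpha}.
\end{align*}
```

Read this way, the quadratic form $x^{\top}Px$ contracts by a factor $1-\bar{\eta}$ per step up to a noise term, and the dwell time must be long enough for that contraction to absorb the eigenvalue ratios $\bar{\rho}$ and $\bar{\mathcal{X}}$ of the two modes.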

Proposition 2

Problem 1 achieves its minimum at

$$\tau_{i_{k}i_{k+1}}:=\max\bigg\{1,-\frac{\ln\bar{\rho}(K_{i_{k}},K_{i_{k+1}})+\ln\bar{\mathcal{X}}(K_{i_{k}},K_{i_{k+1}})-\ln\bar{\alpha}}{\ln\big(1-\bar{\eta}(K_{i_{k}})\big)}\bigg\} \tag{15}$$

and, for any arbitrary switch sequence $\mathcal{I}$, every feasible solution of Program 1 has a cost $C_{1}(\mathcal{I})>n\min_{j\in\mathcal{M}}J_{*}(\Theta_{*}^{j},Q^{j},R^{j})=:\mathcal{L}_{1}$.

Problems 0 and 1, apart from their computational complexity, require full knowledge of the switch sequence, which is unavailable in our proposed setting, where only the next mode is revealed to the system and the termination of the sequence is unknown. In the following section, we investigate the problem in the absence of this information.

III-B $M\bar{S}$ Setup

When confronted with a lack of switch sequence information and uncertainty regarding sequence termination, an approach that is robust to the termination time is to minimize the cost on an epoch-by-epoch basis. The following program outlines this approach.

III-B1 $M\bar{S}$ (Problem 2)

Consider the switched system (1), which at time $\mathfrak{T}_{i_{k+1}}$ switches from mode $i_{k}$ to mode $i_{k+1}$ according to the switch sequence $\mathcal{I}$, i.e., $\sigma(\mathfrak{T}_{i_{k+1}}^{-})=i_{k}$ and $\sigma(\mathfrak{T}_{i_{k+1}}^{+})=i_{k+1}$. Additionally, let $K_{i_{k}}\in\mathcal{S}(\Theta_{*}^{i_{k}})$ and $K_{i_{k+1}}\in\mathcal{S}(\Theta_{*}^{i_{k+1}})$ be the respective stabilizing linear controllers for these modes. Then the minimum mode-dependent dwell time $\tau^{2_{*}}_{i_{k}i_{k+1}}$ and the policy $K^{2_{*}}_{i_{k}}$ for the epoch of operating in mode $i_{k}\in\mathcal{I}$ are obtained by solving the following program:

\begin{align}
K_{i_{k}}^{2_{*}},\tau_{i_{k}i_{k+1}}^{2_{*}}=\operatorname*{argmin}_{K_{i_{k}}\in\mathcal{S}(\Theta_{*}^{i_{k}}),\,\tau_{i_{k}}\in\mathbb{Z}^{+}}\;&\tau_{i_{k}i_{k+1}}J(\Theta_{*}^{i_{k}},K_{i_{k}}) \tag{16}\\
\text{s.t.}\quad & x_{t+1}=A^{i_{k}}_{*}x_{t}+B^{i_{k}}_{*}u_{t}+\omega_{t+1}, \notag\\
& \tau_{i_{k}}\geq\bar{\tau}(K_{i_{k}},K_{i_{k+1}}) \notag
\end{align}

where

$$\bar{\tau}(K_{i_{k}},K_{i_{k+1}}):=\max\bigg\{1,-\frac{\ln\bar{\rho}(K_{i_{k}},K_{i_{k+1}})+\ln\bar{\mathcal{X}}(K_{i_{k}},K_{i_{k+1}})-\ln\bar{\alpha}}{\ln\big(1-\bar{\eta}(K_{i_{k}})\big)}\bigg\} \tag{17}$$

with operators $\bar{\rho}(\cdot,\cdot)$, $\bar{\eta}(\cdot)$ and $\bar{\mathcal{X}}(\cdot,\cdot)$ defined by (12).
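The bound (17) is straightforward to evaluate once the three scalar quantities in (12) are available. The following is a minimal sketch under that assumption; the function names are hypothetical, and rounding the bound up to an integer number of steps is our own choice for illustration, since (17) is stated as a real-valued expression.

```python
import math


def dwell_time_bound(rho_bar, chi_bar, eta_bar, alpha_bar):
    """Real-valued bound of (17): max{1, -(ln rho + ln X - ln alpha) / ln(1 - eta)}.
    Inputs are the scalars defined in (12) and the user-chosen alpha_bar."""
    if not (0.0 < eta_bar < 1.0):
        raise ValueError("eta_bar must lie in (0, 1)")
    if not (0.0 < alpha_bar < 1.0):
        raise ValueError("alpha_bar must lie in (0, 1)")
    if rho_bar <= 0.0 or chi_bar <= 0.0:
        raise ValueError("rho_bar and chi_bar must be positive")
    raw = -(math.log(rho_bar) + math.log(chi_bar) - math.log(alpha_bar)) \
        / math.log(1.0 - eta_bar)
    return max(1.0, raw)


def dwell_steps(rho_bar, chi_bar, eta_bar, alpha_bar):
    """Smallest integer number of steps satisfying the bound (an assumption:
    the source does not state how the bound is rounded)."""
    return math.ceil(dwell_time_bound(rho_bar, chi_bar, eta_bar, alpha_bar))


if __name__ == "__main__":
    # Illustrative values only.
    print(dwell_time_bound(1.5, 1.5, 0.3, 0.9))
    print(dwell_steps(1.5, 1.5, 0.3, 0.9))
```

A smaller $\bar{\alpha}$ (a stricter contraction requirement) or a smaller $\bar{\eta}$ (a slower-contracting mode) both lengthen the required dwell time.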

The objective of Problem 2 is the product of the average expected cost $J(\Theta_{*}^{i_{k}},K_{i_{k}})$ incurred by a policy $K_{i_{k}}$ and its associated dwell time $\tau_{i_{k}i_{k+1}}$. In view of Proposition 2, Problem 2 achieves its minimum when $\tau_{i_{k}i_{k+1}}=\bar{\tau}(K_{i_{k}},K_{i_{k+1}})$, where $K_{i_{k+1}}$ is the policy of the next epoch. The problem can then be reformulated as follows:

\begin{align}
K_{i_{k}}^{2_{*}}=\operatorname*{argmin}_{K\in\mathcal{S}(\Theta_{*}^{i_{k}})}\;&\bar{\tau}(K,K_{i_{k+1}})\,J(\Theta_{*}^{i_{k}},K) \tag{18}\\
\text{s.t.}\quad & x_{t+1}=A^{i_{k}}_{*}x_{t}+B^{i_{k}}_{*}u_{t}+\omega_{t+1}. \notag
\end{align}

While formulation (18) seems less complex than Problem 2, it still presents two main hurdles. First, $K_{i_{k+1}}$ is not known, as it is the policy of the subsequent epoch, whose index $i_{k+1}$ is yet to be revealed. In other words, $K_{i_{k}}^{2_{*}}$ is a function of the unknown policy $K_{i_{k+1}}$. Second, solving this problem is computationally intractable because of the complexity of its objective function and the nonconvexity of the set $\mathcal{S}(\Theta_{*}^{i_{k}})$.

To obtain a feasible solution given the disclosed information, one possible approach is to let the system use the optimal feedback gain obtained by solving the DARE at each epoch. This approach fixes the policy to be applied in each epoch; it both removes the need for future switch sequence information and sidesteps the computational burden of solving a complex optimization problem. The only remaining task is to design a minimum dwell time associated with the DARE solution.

The rationale for this approach is supported by the following lemma, which expresses $J(\Theta_{*},K)$ in terms of the optimal average expected cost $J_{*}(\Theta_{*},Q,R)$ and the optimal feedback gain $K_{*}(\Theta_{*},Q,R)$.

Lemma 2

(Lemma 3 of [6]) In the classic Linear Quadratic Regulator (LQR) framework, characterized by the environment parameters $\Theta_{*}=(A_{*},B_{*})^{\top}$ and the cost matrices $(Q,R)$, the application of any stabilizing static linear controller $K$ incurs an average expected cost of

\begin{align}
J(\Theta_{*},K)=J_{*}(\Theta_{*},Q,R)+{} \tag{19}\\
\operatorname{Tr}\Big(\Sigma_{xx}(K)\big(K-K_{*}(\Theta_{*},Q,R)\big)^{\top}(R+B_{*}^{\top}P_{*}B_{*})\big(K-K_{*}(\Theta_{*},Q,R)\big)\Big) \tag{20}
\end{align}

where $\Sigma_{xx}(K)$ is the stationary covariance matrix of the closed-loop system's state, computed from

$$(A_{*}+B_{*}K)^{\top}\Sigma_{xx}(K)(A_{*}+B_{*}K)-\Sigma_{xx}(K)+\sigma_{\omega}^{2}I=0.$$

The expression (20) can be interpreted as the sum of the optimal average expected cost and an additional term that arises from deviations from the optimal stationary policy $K_{*}(\Theta_{*},Q,R)$.
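The decomposition in Lemma 2 can be checked numerically in the scalar case, where every matrix reduces to a number. The sketch below is an illustration under that simplifying assumption, with hypothetical parameter values; the scalar DARE is solved by iterating the Riccati recursion rather than by a library solver.

```python
def scalar_dare(a, b, q, r, iters=500):
    """Iterate the scalar Riccati recursion to a fixed point p and
    return (p, k) with k the optimal gain for u = k x."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    k = -a * b * p / (r + b * b * p)
    return p, k


def lemma2_sides(a, b, q, r, k, sigma2=1.0):
    """Return (J(K), J_* + excess term) for a scalar system, so the two
    sides of the decomposition in Lemma 2 can be compared."""
    p_star, k_star = scalar_dare(a, b, q, r)
    closed_loop = a + b * k
    if abs(closed_loop) >= 1.0:
        raise ValueError("k must be stabilizing")
    sigma_xx = sigma2 / (1.0 - closed_loop ** 2)   # stationary state variance
    j_k = (q + r * k * k) * sigma_xx               # average cost under k
    j_star = sigma2 * p_star                       # optimal average cost
    excess = sigma_xx * (k - k_star) ** 2 * (r + b * b * p_star)
    return j_k, j_star + excess


if __name__ == "__main__":
    lhs, rhs = lemma2_sides(a=1.0, b=1.0, q=1.0, r=1.0, k=-0.5)
    print(lhs, rhs)  # the two values should agree up to rounding
```

Because the excess term is nonnegative, any stabilizing gain other than the DARE gain can only increase the average cost, which is the property the argument below relies on.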

Suppose that the policy for the next mode, $K_{i_{k+1}}$, is known in advance. The minimum dwell time is then solely a function of the policy $K_{i_{k}}$. Even with this information, solving (18) for $K_{i_{k}}$ is computationally burdensome. Furthermore, the policy that minimizes $\tau_{i_{k}i_{k+1}}$ is not necessarily the feedback gain obtained by solving the DARE. However, any such policy incurs an average expected cost greater than that of the DARE solution, as explained in Lemma 2. Therefore, deviating from the DARE policy does not necessarily reduce the accumulated expected cost. This rationale justifies adhering to the policy obtained by solving the DARE: it makes us independent of the next epoch's policy and avoids the computational complexity of solving (18), while remaining reasonably close to the optimal policy.

The following program provides the minimum dwell time for the proposed alternative algorithm.

III-B2 Approximate $M\bar{S}$ (Program 3)

Consider the switched system (1), which at time $\mathfrak{T}_{i_{k+1}}$ switches from mode $i_{k}$ to mode $i_{k+1}$ (i.e., $\sigma(\mathfrak{T}_{i_{k+1}}^{-})=i_{k}$ and $\sigma(\mathfrak{T}_{i_{k+1}}^{+})=i_{k+1}$). Let the corresponding dual SDP solutions for these modes be $P_{*}(\Theta^{i_{k}}_{*},Q^{i_{k}},R^{i_{k}})$ and $P_{*}(\Theta^{i_{k+1}}_{*},Q^{i_{k+1}},R^{i_{k+1}})$, and let the control feedback designed for mode $i_{k}$ be $K_{*}(\Theta^{i_{k}}_{*},Q^{i_{k}},R^{i_{k}})$. Then the minimum mode-dependent dwell time for the epoch starting at $\mathfrak{T}_{i_{k}}$ is

\[
\tau_{*}^{i_{k}i_{k+1}}:=\max\bigg\{1,\;-\frac{\ln\rho_{*}^{i_{k}i_{k+1}}+\ln\mathcal{X}_{*}^{i_{k}i_{k+1}}-\ln\bar{\alpha}}{\ln\big(1-\eta_{*}^{i_{k}}\big)}\bigg\} \tag{21}
\]

where

\begin{align}
\eta_{*}^{i_{k}} &:=\frac{\underline{\lambda}\big(H(\Theta_{*}^{i_{k}},Q^{i_{k}},R^{i_{k}})\big)}{\overline{\lambda}\big(P_{*}(\Theta^{i_{k}}_{*},Q^{i_{k}},R^{i_{k}})\big)}, \notag\\
\rho_{*}^{i_{k}i_{k+1}} &:=\frac{\overline{\lambda}\big(P_{*}(\Theta^{i_{k+1}}_{*},Q^{i_{k+1}},R^{i_{k+1}})\big)}{\underline{\lambda}\big(P_{*}(\Theta^{i_{k}}_{*},Q^{i_{k}},R^{i_{k}})\big)}, \notag\\
\mathcal{X}_{*}^{i_{k}i_{k+1}} &:=\frac{\overline{\lambda}\big(P_{*}(\Theta^{i_{k}}_{*},Q^{i_{k}},R^{i_{k}})\big)}{\underline{\lambda}\big(P_{*}(\Theta^{i_{k+1}}_{*},Q^{i_{k+1}},R^{i_{k+1}})\big)}, \notag\\
H(\Theta_{*}^{i_{k}},Q^{i_{k}},R^{i_{k}}) &= Q^{i_{k}}+K_{*}^{\top}(\Theta^{i_{k}}_{*},Q^{i_{k}},R^{i_{k}})\,R^{i_{k}}\,K_{*}(\Theta^{i_{k}}_{*},Q^{i_{k}},R^{i_{k}}). \tag{22}
\end{align}

We denote the switch time sequence associated with the solution of Program 3 by $\bar{\mathcal{T}}_{*}=\{\mathfrak{T}^{*}_{i_{0}},\ldots,\mathfrak{T}^{*}_{i_{n-1}}\}$, where $\mathfrak{T}^{*}_{i_{0}}=0$ by definition and $\mathfrak{T}^{*}_{i_{k+1}}=\mathfrak{T}^{*}_{i_{k}}+\tau^{i_{k}i_{k+1}}_{*}$.
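As an illustration of (21)-(22), the following minimal sketch evaluates the dwell time for known mode parameters. It is not the authors' implementation: the function names `solve_dare` and `min_dwell_time` are hypothetical, the DARE is solved by a plain fixed-point Riccati iteration (a library solver could be substituted), the convention $u=Kx$ is assumed, and the fallback used when $\eta_{*}\geq 1$ is an assumption of this sketch.

```python
import math
import numpy as np


def solve_dare(A, B, Q, R, iters=10_000, tol=1e-12):
    """Fixed-point Riccati iteration; returns (P, K) with the convention u = K x."""
    P = Q.copy()
    for _ in range(iters):
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P_next = Q + A.T @ P @ (A + B @ K)
        P_next = 0.5 * (P_next + P_next.T)  # keep the iterate symmetric
        converged = np.max(np.abs(P_next - P)) < tol
        P = P_next
        if converged:
            break
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, K


def min_dwell_time(P_i, K_i, Q_i, R_i, P_j, alpha_bar):
    """Evaluate (21)-(22) for a switch from mode i to mode j."""
    eig_i = np.linalg.eigvalsh(P_i)  # eigenvalues in ascending order
    eig_j = np.linalg.eigvalsh(P_j)
    H = Q_i + K_i.T @ R_i @ K_i
    eta = np.linalg.eigvalsh(H)[0] / eig_i[-1]
    rho = eig_j[-1] / eig_i[0]
    chi = eig_i[-1] / eig_j[0]
    if eta >= 1.0:
        # ln(1 - eta) is undefined; return the lower limit of (21) (assumption).
        return 1.0
    numerator = math.log(rho) + math.log(chi) - math.log(alpha_bar)
    return max(1.0, -numerator / math.log(1.0 - eta))


# Scalar example: the same mode is used on both sides of the switch.
A = np.array([[0.9]])
B = np.array([[1.0]])
Q = np.array([[1.0]])
R = np.array([[1.0]])
P, K = solve_dare(A, B, Q, R)
tau = min_dwell_time(P, K, Q, R, P, alpha_bar=0.01)
```

The value returned is the real number given by (21); rounding it up to an integer number of time steps is left to the caller.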

In view of (21), the undesirable scenario is one in which the upcoming switch to mode $i_{k+1}$ forces a longer stay in the current mode $i_{k}$ in order to suppress state explosion in the sense of Definition 1. We call such a switch malignant, since it hinders a quick transition, and define it precisely as follows.

Definition 2

(Malignant and Benign Switch) When the system switches at time $\mathfrak{T}_{i_{k+1}}$ from mode $i_{k}$ to $i_{k+1}$, i.e., $\sigma(\mathfrak{T}_{i_{k+1}}^{-})=i_{k}$ and $\sigma(\mathfrak{T}_{i_{k+1}}^{+})=i_{k+1}$, we call the switch malignant if

\[
\ln\rho_{*}^{i_{k}i_{k+1}}+\ln\mathcal{X}_{*}^{i_{k}i_{k+1}}>\ln\bar{\alpha},
\]

and benign otherwise.

In equation (21), $\tau_{*}^{ij}$ taking the value one indicates a benign switch.

We now assess the performance loss incurred by approximating Program 1 by Program 3. The following proposition provides an upper bound on this loss.

Proposition 3

For any arbitrary switch sequence $\mathcal{I}$, the following properties hold:

  1. The solution of Program 3, $C_{3}(\mathcal{I})$, has the following upper bound:

    \[
    C_{3}(\mathcal{I})\leq\max_{i,j\in\mathcal{M}}\big[-ns\,\tau^{*}_{ij}J_{*}(\Theta_{*}^{i},Q^{i},R^{i})\big]:=\mathcal{U}_{3} \tag{23}
    \]

  2. The optimality gap between Programs 1 and 3, $C_{3}(\mathcal{I})-C_{1}(\mathcal{I})$, has the following upper bound:

    \[
    C_{3}(\mathcal{I})-C_{1}(\mathcal{I})\leq\mathcal{U}_{3}-\mathcal{L}_{1}. \tag{24}
    \]

In conclusion, under the information disclosure mechanism described above and with all mode parameters known, a justifiable strategy when feasibility and computational complexity are crucial is to apply, in each mode, the policy obtained by solving the DARE, and to set the duration between two consecutive switches to the minimum dwell time computed by Program 3, called the minimum mode-dependent dwell time.

Before turning to the problem in the $\bar{M}\bar{S}$ setup, we review some foundational concepts.

IV Preliminaries for the $\bar{M}\bar{S}$ Setup

We review the notion of strong stability introduced in [7], which will also be used in the context of switched systems.

IV-A $(\kappa,\gamma)$-strong stability [7]

Definition 3

Consider a linear plant parameterized by $A$ and $B$. The closed-loop system matrix $A+BK$ is $(\kappa,\gamma)$-strongly stable for $\kappa>0$ and $0<\gamma<1$ if there exist $H\succ 0$ and $L$ such that $A+BK=HLH^{-1}$ and

  1. $\|L\|\leq 1-\gamma$,

  2. $\|H\|\|H^{-1}\|\leq\kappa$.

Furthermore, we say that a control gain $K$ is $(\kappa,\gamma)$-strongly stabilizing for a plant $(A,B)$ if $A+BK$ is $(\kappa,\gamma)$-strongly stable.

As proved in [5], any stabilizing policy $K$ is $(\kappa,\gamma)$-strongly stabilizing for some $\kappa$ and $\gamma$; the definition therefore introduces no additional assumption, but it simplifies the analysis. For completeness, we provide the proof in the Appendix.
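To make Definition 3 concrete, the sketch below builds one admissible pair $(H,L)$ for a Schur-stable closed-loop matrix from the solution $P$ of the Lyapunov equation $P=A_{cl}^{\top}PA_{cl}+I$, taking $H=P^{-1/2}$ and $L=P^{1/2}A_{cl}P^{-1/2}$. This is a standard construction chosen for illustration and is not necessarily the one used in the Appendix; the function name `strong_stability_certificate` is hypothetical, and the resulting $(\kappa,\gamma)$ are one valid choice, not the tightest.

```python
import numpy as np


def strong_stability_certificate(A_cl):
    """Return (kappa, gamma, H, L) with A_cl = H L H^{-1} and ||L|| = 1 - gamma."""
    n = A_cl.shape[0]
    if np.max(np.abs(np.linalg.eigvals(A_cl))) >= 1.0:
        raise ValueError("A_cl must have spectral radius below one")
    # Solve P = A_cl^T P A_cl + I by vectorization (row-major vec).
    M = np.eye(n * n) - np.kron(A_cl.T, A_cl.T)
    P = np.linalg.solve(M, np.eye(n).reshape(-1)).reshape(n, n)
    P = 0.5 * (P + P.T)
    w, U = np.linalg.eigh(P)           # P = U diag(w) U^T with w > 0
    P_half = (U * np.sqrt(w)) @ U.T    # P^{1/2}
    H = (U / np.sqrt(w)) @ U.T         # H = P^{-1/2}, positive definite
    L = P_half @ A_cl @ H              # hence H L H^{-1} = A_cl
    gamma = 1.0 - np.linalg.norm(L, 2)
    kappa = float(np.sqrt(w[-1] / w[0]))  # equals ||H|| ||H^{-1}||
    return kappa, gamma, H, L


A_cl = np.array([[0.5, 1.0], [0.0, 0.5]])
kappa, gamma, H, L = strong_stability_certificate(A_cl)
```

With this choice, $\|L\|^{2}=1-1/\overline{\lambda}(P)$, so $\gamma$ is strictly positive whenever $A_{cl}$ is Schur stable.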

IV-B System Identification

This section summarizes results from [7] and [24], with a slightly modified presentation, especially in the construction of the confidence ellipsoid.

Assumption 2

  1. There are known constants $\alpha^{i}_{0},\alpha^{i}_{1},\vartheta^{i},\nu^{i}>0$ for all $i\in\mathcal{M}$ such that $\alpha^{i}_{0}I\leq Q^{i}\leq\alpha^{i}_{1}I$, $\alpha^{i}_{0}I\leq R^{i}\leq\alpha^{i}_{1}I$, $\|\Theta_{*}^{i}\|\leq\vartheta^{i}$, and $\nu^{i}$ is an upper bound on the average expected cost $J_{*}(\Theta_{*}^{i},Q^{i},R^{i})$ defined in Section IV-D.

  2. There is an initial stabilizing policy $K^{i}_{0}$ for every subsystem.

It is important to mention that Assumption 2 is a standard assumption in the literature and can be found in [7]. The second part of the assumption can be omitted by utilizing the results from [26] and the authors’ previous work [30]. However, for the sake of consistency, we will adhere to the assumption presented in [7].

Now consider the linear switched system (1), operating according to the sequence $\mathcal{I}$ up to an arbitrary time $t$, and let $n_{i}(t)$ denote the total number of time steps the system has spent in mode $i\in\mathcal{M}$ up to time $t$. Then (1) can be written as

\[
X_{n_{i}(t)}=Z_{n_{i}(t)}\Theta^{i}_{*}+W_{n_{i}(t)} \tag{25}
\]

where $W_{n_{i}(t)}$ is the vertical concatenation of $\omega_{1}^{\top},\ldots,\omega_{n_{i}(t)}^{\top}$, and $X_{n_{i}(t)}$ and $Z_{n_{i}(t)}$ are the matrices with rows $x^{\top}_{1},\ldots,x^{\top}_{n_{i}(t)}$ and $z^{\top}_{0},\ldots,z^{\top}_{n_{i}(t)-1}$, respectively.

Using the measured data, the $\ell^{2}$-regularized least-squares estimate is obtained as

\[
\hat{\Theta}^{i}_{n_{i}(t)}=\operatorname*{argmin}_{\Theta^{i}}e(\Theta^{i})={V^{i}}^{-1}_{n_{i}(t)}\big(Z_{n_{i}(t)}^{\top}X_{n_{i}(t)}+\lambda_{i}\Theta^{i}_{0}\big), \tag{26}
\]

where $e(\Theta^{i})$ is defined by

\[
e(\Theta^{i})=\lambda_{i}\operatorname{Tr}\big((\Theta^{i}-\Theta^{i}_{0})^{\top}(\Theta^{i}-\Theta^{i}_{0})\big)+\sum_{s=0}^{n_{i}(t)-1}\operatorname{Tr}\big((x_{s+1}-{\Theta^{i}}^{\top}z_{s})(x_{s+1}-{\Theta^{i}}^{\top}z_{s})^{\top}\big), \tag{27}
\]

where $\Theta^{i}_{0}$ is an initial estimate and $\lambda_{i}$ is a regularization parameter. Furthermore,

\[
V^{i}_{n_{i}(t)}=\lambda_{i}I+\sum_{s=0}^{n_{i}(t)-1}z_{s}z_{s}^{\top}=\lambda_{i}I+Z_{n_{i}(t)}^{\top}Z_{n_{i}(t)}
\]

is the covariance matrix. Assuming martingale-difference properties for the dynamics, sub-Gaussian process noise, and access to an initial estimate $\Theta^{i}_{0}$ such that $\|\Theta^{i}_{*}-\Theta^{i}_{0}\|_{*}\leq\epsilon_{i}$ for some $\epsilon_{i}>0$, a high-probability confidence set around the true but unknown system parameters is constructed as

\[
\mathcal{C}^{i}_{t}(\delta):=\Big\{\Theta^{i}\in\mathbb{R}^{(n_{i}+m_{i})\times n_{i}}\;\Big|\;\operatorname{Tr}\big((\Theta^{i}-\hat{\Theta}^{i}_{n_{i}(t)})^{\top}V^{i}_{n_{i}(t)}(\Theta^{i}-\hat{\Theta}^{i}_{n_{i}(t)})\big)\leq r^{i}_{n_{i}(t)}\Big\} \tag{28}
\]

where

\[
r^{i}_{n_{i}(t)}=\bigg(\sigma_{\omega}\sqrt{2n\log\frac{n\det(V^{i}_{n_{i}(t)})}{\delta\det(\lambda_{i}I)}}+\sqrt{\lambda_{i}}\epsilon_{i}\bigg)^{2}. \tag{29}
\]

The true parameter $\Theta^{i}_{*}$ is guaranteed to belong to the confidence ellipsoid $\mathcal{C}^{i}_{t}(\delta)$ with probability at least $1-\delta$, where $0<\delta<1$. The regularization parameters $\lambda_{i}$ are user-defined and must be chosen so that the stability of the system is guaranteed and the regret scales appropriately; we specify them in the stability analysis section. Note that $\mathcal{C}^{i}_{t}(\delta)$, constructed from $n_{i}(t)$ data points, retains the information about the parameters of mode $i$ gathered up to time $t$.
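The estimation step (26)-(29) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the symbol $n$ in (29) is read as the state dimension (passed as `n_state`), and the log-determinant ratio is evaluated with `slogdet` for numerical stability.

```python
import math
import numpy as np


def rls_estimate(Z, X, Theta0, lam):
    """Regularized least squares (26): returns (Theta_hat, V)."""
    V = lam * np.eye(Z.shape[1]) + Z.T @ Z
    Theta_hat = np.linalg.solve(V, Z.T @ X + lam * Theta0)
    return Theta_hat, V


def confidence_radius(V, lam, n_state, sigma_w, delta, eps):
    """Radius r in (29); n_state is assumed to be the state dimension."""
    d = V.shape[0]
    logdet_V = np.linalg.slogdet(V)[1]
    # log( n det(V) / (delta det(lam I)) ), expanded term by term
    log_term = math.log(n_state / delta) + logdet_V - d * math.log(lam)
    return (sigma_w * math.sqrt(2.0 * n_state * log_term) + math.sqrt(lam) * eps) ** 2


def in_confidence_set(Theta, Theta_hat, V, r):
    """Membership test for the ellipsoid (28)."""
    D = Theta - Theta_hat
    return float(np.trace(D.T @ V @ D)) <= r


# Noise-free toy data with a 2-dimensional state and a scalar input.
rng = np.random.default_rng(0)
Theta_true = rng.normal(size=(3, 2))
Z = rng.normal(size=(50, 3))
X = Z @ Theta_true
Theta_hat, V = rls_estimate(Z, X, Theta0=Theta_true, lam=1.0)
r = confidence_radius(V, lam=1.0, n_state=2, sigma_w=0.1, delta=0.05, eps=0.0)
```

In this noise-free toy case with an exact initial estimate, the estimate coincides with the true parameter, so the membership test for `Theta_true` holds trivially.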

IV-C Primal and Dual Relaxed-SDP Formulation

The LQR control problem with known system parameters can be reformulated as a semidefinite program (SDP) in primal and dual forms; see Appendix A-C for details. Given the confidence set (28)-(29), the primal and dual formulations are relaxed, by applying the perturbation lemma (see Appendix A-E), to account for the estimation error.

The relaxed primal SDP is given as follows:

\[
\begin{array}{ll}
\min & \begin{pmatrix}Q^{i}&0\\ 0&R^{i}\end{pmatrix}\bullet\Sigma\\
\textrm{s.t.} & \Sigma_{xx}\geq{\hat{\Theta}^{i}_{n_{i}(t)}}^{\top}\Sigma\,\hat{\Theta}^{i}_{n_{i}(t)}+W-\mu^{i}_{n_{i}(t)}\big(\Sigma\bullet{V^{i}}^{-1}_{n_{i}(t)}\big)I,\\
& \Sigma\succ 0,
\end{array} \tag{33}
\]

where the minimization is with respect to

\[
\Sigma=\begin{pmatrix}\Sigma_{xx}&\Sigma_{xu}\\ \Sigma_{ux}&\Sigma_{uu}\end{pmatrix}
\]

and $\mu_{n_{i}(t)}^{i}\geq r^{i}_{n_{i}(t)}+\sqrt{r^{i}_{n_{i}(t)}}\,\vartheta^{i}\|V^{i}_{n_{i}(t)}\|^{1/2}$. We denote the optimal solution of this program by the operator $\Sigma(\mathcal{C}_{t}^{i},Q^{i},R^{i})$, which acts on $Q^{i}$, $R^{i}$, and the confidence ellipsoid $\mathcal{C}_{t}^{i}(\delta)$; the latter fully determines $\hat{\Theta}^{i}_{n_{i}(t)}$, $V^{i}_{n_{i}(t)}$, and $\mu^{i}_{n_{i}(t)}$. For brevity, we write $\mathcal{C}_{t}^{i}$ rather than $\mathcal{C}_{t}^{i}(\delta)$ inside the operator.

The controller extracted from the relaxed SDP (33) is deterministic and linear in the state, $u=K(\mathcal{C}_{t}^{i},Q^{i},R^{i})x$, where

\[
K(\mathcal{C}_{t}^{i},Q^{i},R^{i})=\Sigma_{ux}(\mathcal{C}_{t}^{i},Q^{i},R^{i})\,\Sigma_{xx}^{-1}(\mathcal{C}_{t}^{i},Q^{i},R^{i}). \tag{34}
\]

The designed controller is strongly stabilizing in the sense of Definition 3.

The relaxed primal problem (33) is mainly used for control design. For the stability analysis, the design of the minimum mode-dependent dwell time, and the regret bound analysis, however, we need its dual program:

\[
\begin{array}{ll}
\max & P\bullet W\\
\textrm{s.t.} & \begin{pmatrix}Q^{i}-P&0\\ 0&R^{i}\end{pmatrix}+\hat{\Theta}^{i}_{n_{i}(t)}P\,{\hat{\Theta}^{i}_{n_{i}(t)}}^{\top}\succeq\mu^{i}_{n_{i}(t)}\|P\|_{*}{V^{i}}^{-1}_{n_{i}(t)},\\
& P\succeq 0.
\end{array} \tag{39}
\]

We denote the optimal solution of the relaxed dual problem (39), in which the optimization is over $P$, by $P(\mathcal{C}^{i}_{t},Q^{i},R^{i})$.

The derivation of the relaxed primal and dual formulations follows the analysis in [7]. The fundamental difference between our formulation and that of [7] lies in the definition of $\mu_{n_{i}(t)}^{i}$, which arises because we do not normalize the confidence set.

IV-D Regret Definition

Let $\mathcal{A}_{*}$ denote the algorithm that knows the mode parameters. It designs a feedback policy by solving the DARE, computes the mode-dependent dwell time associated with the DARE solution by solving Program 3, and follows the sequence $\mathcal{I}=\{i_{0},i_{1},\ldots,i_{n}\}$. The optimal accumulated cost is then

\[
J_{\mathcal{A}_{*}}(\mathcal{I})=\sum_{k=0}^{n-1}\tau^{*}_{i_{k}i_{k+1}}J_{*}(\Theta_{*}^{i_{k}},Q^{i_{k}},R^{i_{k}}) \tag{40}
\]

where $\tau^{*}_{i_{k}i_{k+1}}$ is given by (21) and $J_{*}(\Theta_{*}^{i},Q^{i},R^{i})=P_{*}(\Theta_{*}^{i},Q^{i},R^{i})\bullet W$ is the optimal average expected cost for mode $i\in\mathcal{M}$. We emphasize that $P_{*}(\Theta_{*}^{i},Q^{i},R^{i})$ is the solution of the dual SDP and should not be confused with $P(\mathcal{C}^{i}_{t},Q^{i},R^{i})$, the solution of the relaxed dual SDP.

Now let an arbitrary algorithm $\mathcal{A}$ address the $\bar{M}\bar{S}$ problem by switching with epoch durations $\mathcal{T}$ and applying episodic policies $\mathcal{K}$. Denoting the corresponding accumulated cost by $J_{\mathcal{A}}(\mathcal{I})$, we define the regret of $\mathcal{A}$ as

\[
R_{\mathcal{A}}(\mathcal{I})=J_{\mathcal{A}}(\mathcal{I})-J_{\mathcal{A}_{*}}(\mathcal{I}). \tag{41}
\]
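The bookkeeping behind (40)-(41) is simple enough to spell out. The sketch below, with hypothetical function names, computes the benchmark cost from per-epoch dwell times and average costs and subtracts it from a realized cost; the numbers in the example are made up for illustration.

```python
def oracle_cost(dwell_times, avg_costs):
    """Accumulated benchmark cost (40): sum over epochs of tau_k * J_k."""
    if len(dwell_times) != len(avg_costs):
        raise ValueError("one dwell time and one average cost per epoch are required")
    return sum(tau * J for tau, J in zip(dwell_times, avg_costs))


def regret(realized_cost, dwell_times, avg_costs):
    """Regret (41): realized accumulated cost minus the benchmark cost."""
    return realized_cost - oracle_cost(dwell_times, avg_costs)


# Three epochs with illustrative dwell times and optimal average costs.
benchmark = oracle_cost([2, 1, 3], [1.5, 4.0, 2.0])   # 2*1.5 + 1*4.0 + 3*2.0
gap = regret(15.0, [2, 1, 3], [1.5, 4.0, 2.0])
```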

V Problem Solution for $\bar{M}\bar{S}$ setup

V-A Overview

In this section, we introduce our algorithm, which designs the feedback gain and the minimum mode-dependent dwell time for the switched system (1)-(2) when only noisy measurements of the state are available. The objective is to make the system switch according to an externally revealed, step-by-step sequence $\mathcal{I}=\{i_{0},i_{1},\ldots,i_{n}\}$ with minimal cost while keeping the expected state norm under control in the sense of Definition 1.

We first introduce some notation to streamline the results and analysis. We denote the estimated minimum mode-dependent dwell time of the $k^{th}$ switch, between $i_{k}$ and $i_{k+1}$, by $\tau_{es}^{i_{k}i_{k+1}}$, where the $0^{th}$ switch starts at time $t=0$ with mode $i_{0}$. This symbol carries the index of the switch and both consecutive modes. The estimated time of occurrence of the $k^{th}$ switch, denoted $\mathfrak{T}^{es}_{i_{k}}$, is defined as $\mathfrak{T}^{es}_{i_{k}}=\tau_{es}^{i_{k-1}i_{k}}+\mathfrak{T}^{es}_{i_{k-1}}$, with $\mathfrak{T}^{es}_{i_{0}}=0$. It follows directly that $\mathfrak{T}^{es}_{i_{k}}=\sum_{q=1}^{k}\tau_{es}^{i_{q-1}i_{q}}$.

We apply an OFU-based principle: we build a high-probability confidence ellipsoid for the system parameters, design the feedback policy by solving the relaxed primal problem (33), and specify the associated minimum mode-dependent dwell time using the solution of the relaxed dual SDP (39). Our use of the OFU principle differs slightly from the OSLO algorithm proposed in [5] in that we apply an off-policy version of it: for the epoch occupying the interval $[\mathfrak{T}^{es}_{i_{k}},\;\mathfrak{T}^{es}_{i_{k+1}})$, we exploit only the information learned up to time $\mathfrak{T}^{es}_{i_{k}}$. The system commits to the designed feedback policy throughout the epoch, whose length is the designed minimum dwell time; this is a sufficient condition for state norm boundedness in the sense of Definition 1.

V-B Main Steps of Algorithm 1

Given initial parameter estimates $\Theta_{0}^{i}$ for all subsystems such that $\|\Theta_{0}^{i}-\Theta_{*}^{i}\|_{*}\leq\epsilon_{i}$ (as defined in Appendix A-D), Algorithm 1 designs the minimum dwell time and the feedback gain to be implemented. To obtain such an initialization, we propose a provably efficient strategy, Algorithm 2, adapted from [5] with slight modifications to the duration of the warm-up phase. For brevity, the supporting arguments and theorem are deferred to Appendix A-D.

The confidence sets are built following the procedure of Section IV-B. For the $k^{th}$ epoch, starting at time $\mathfrak{T}^{es}_{i_{k}}$, we choose $\lambda_{i_{k}}$ such that

\[
\lambda_{i_{k}}\geq\frac{4\bar{\mu}_{i_{k}}\nu_{i_{k}}}{\alpha_{0}^{i_{k}}\sigma_{\omega}^{2}}, \tag{42}
\]

where

\[
\mu^{i_{k}}_{n_{i_{k}}(\mathfrak{T}^{es}_{i_{k}})}=r^{i_{k}}_{n_{i_{k}}(\mathfrak{T}^{es}_{i_{k}})}+\sqrt{r^{i_{k}}_{n_{i_{k}}(\mathfrak{T}^{es}_{i_{k}})}}\;\vartheta_{i_{k}}\Big(\lambda_{i_{k}}+\Big\|\sum_{q=1}^{n_{i_{k}}(\mathfrak{T}^{es}_{i_{k}})}z^{i_{k}}_{q}{z^{i_{k}}_{q}}^{\top}\Big\|\Big)^{0.5}. \tag{43}
\]

The algorithm employed is of the off-policy type, utilizing information learned by the beginning of each epoch to compute feedback gain and the minimum mode-dependent dwell time to be applied during the epoch.

To clarify, let $\Pi_{t}$ denote the set of all confidence ellipsoids updated by time $t$:

\[
\Pi_{t}=\{\mathcal{C}_{t}^{j}(\delta)\;|\;j=1,\ldots,|\mathcal{M}|\} \tag{44}
\]

Now consider the moment $\mathfrak{T}^{es}_{i_{k}}$ at which the epoch of the $k^{th}$ switch, from mode $i_{k}$ to $i_{k+1}$, begins. During this epoch, the fixed control applied is designed from the most recently updated confidence ellipsoid $\mathcal{C}^{i_{k}}_{\mathfrak{T}^{es}_{i_{k}}}(\delta)\in\Pi_{\mathfrak{T}^{es}_{i_{k}}}$ by solving the relaxed primal problem (33). Furthermore, given the revealed next mode $i_{k+1}$, the minimum mode-dependent dwell time $\tau_{es}^{i_{k}i_{k+1}}$ is computed using $\mathcal{C}^{i_{k}}_{\mathfrak{T}^{es}_{i_{k}}}(\delta)$ and $\mathcal{C}^{i_{k+1}}_{\mathfrak{T}^{es}_{i_{k}}}(\delta)$, as in Step 8 of Algorithm 1; Theorem 2 shows how to compute this quantity. By actuating with the designed fixed policy, the algorithm maintains exploration throughout. When the epoch concludes at $\mathfrak{T}^{es}_{i_{k+1}}$, the confidence ellipsoid of mode $i_{k}$ is updated to $\mathcal{C}^{i_{k}}_{\mathfrak{T}^{es}_{i_{k+1}}}(\delta)$. Consequently, the set of confidence ellipsoids is refreshed as $\Pi_{\mathfrak{T}^{es}_{i_{k+1}}}$, where $\mathcal{C}^{j}_{\mathfrak{T}^{es}_{i_{k+1}}}(\delta)=\mathcal{C}^{j}_{\mathfrak{T}^{es}_{i_{k}}}(\delta)$ for all $j\neq i_{k}$.

Algorithm 1 Safe and Fast Switching Algorithm (SFSA)
1:  Inputs: $\bar{\alpha}\in(0,1)$, $\alpha^{i}_{0}$, $\sigma_{\omega}^{2}$, $\vartheta^{i}$, $\nu^{i}>0$, $\delta\in(0,1)$, $Q^{i}$, $R^{i}$, $\Theta^{i}_{0}$ $\forall i=1,\ldots,|\mathcal{M}|$
2:  Set $\mathfrak{T}^{es}_{i_{0}}=0$
3:  Initialize the set $\Pi_{\mathfrak{T}^{es}_{i_{0}}}$ using the initial estimates $\Theta^{i}_{0}$
4:  Initialize first mode $i_{0}=i$
5:  for $k=0,1,\ldots$ do
6:     Receive the next mode $i_{k+1}$
7:     Use $\Pi_{\mathfrak{T}^{es}_{i_{k}}}$ and solve the relaxed primal SDP (33) for $\Sigma(\mathcal{C}_{\mathfrak{T}^{es}_{i_{k}}}^{i_{k}},Q^{i_{k}},R^{i_{k}})$, then compute the feedback $K(\mathcal{C}_{\mathfrak{T}^{es}_{i_{k}}}^{i_{k}},Q^{i_{k}},R^{i_{k}})$ by (34)
8:     Use $\Pi_{\mathfrak{T}^{es}_{i_{k}}}$ and solve the relaxed dual SDP (39) for both $P(\mathcal{C}_{\mathfrak{T}^{es}_{i_{k}}}^{i_{k}},Q^{i_{k}},R^{i_{k}})$ and $P(\mathcal{C}_{\mathfrak{T}^{es}_{i_{k}}}^{i_{k+1}},Q^{i_{k+1}},R^{i_{k+1}})$
9:     Compute $\tau_{es}^{i_{k}i_{k+1}}$ by Theorem 2
10:     Set $\mathfrak{T}^{es}_{i_{k+1}}=\mathfrak{T}^{es}_{i_{k}}+\tau_{es}^{i_{k}i_{k+1}}$
11:     for $\mathfrak{T}^{es}_{i_{k}}\leq t<\mathfrak{T}^{es}_{i_{k+1}}$ do
12:        Actuate in mode $i_{k}$ by executing the control $u_{t}^{i_{k}}=K(\mathcal{C}_{\mathfrak{T}^{es}_{i_{k}}}^{i_{k}},Q^{i_{k}},R^{i_{k}})x_{t}$
13:        Observe the new state $x_{t+1}$; save $(z^{i_{k}}_{t},x_{t+1})$ into the database of mode $i_{k}$
14:     end for
15:     Update the confidence ellipsoid of mode $i_{k}$, $\mathcal{C}_{\mathfrak{T}^{es}_{i_{k+1}}}^{i_{k}}$, by following (25)-(29), using the regularization parameter obtained from (42)-(43)
16:     Update $\Pi_{\mathfrak{T}^{es}_{i_{k}}}$ to obtain $\Pi_{\mathfrak{T}^{es}_{i_{k+1}}}$
17:  end for
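The epoch structure of Algorithm 1 can be summarized by the following skeleton. It is only a structural sketch: every callable (`design_gain`, `dwell_time`, `step`, `update_conf_set`) is a hypothetical placeholder for the corresponding SDP, dwell-time, plant, and system-identification routine, and the stubs in the example are toy stand-ins with no control-theoretic meaning.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def run_switching_loop(
    first_mode: int,
    next_modes: Iterable[int],
    x0,
    conf_sets: Dict[int, object],
    design_gain: Callable[[object], object],
    dwell_time: Callable[[object, object], int],
    step: Callable[[int, object, object], Tuple[object, object]],
    update_conf_set: Callable[[object, List[Tuple[object, object]]], object],
):
    """Epoch loop of Algorithm 1; conf_sets is updated in place."""
    mode, x, t = first_mode, x0, 0
    switch_times = [0]                                      # step 2
    for nxt in next_modes:                                  # step 6: next mode revealed
        K = design_gain(conf_sets[mode])                    # step 7
        tau = dwell_time(conf_sets[mode], conf_sets[nxt])   # steps 8-9
        data = []
        for _ in range(tau):                                # steps 11-14
            z, x = step(mode, K, x)                         # returns (z_t, x_{t+1})
            data.append((z, x))
        t += tau
        switch_times.append(t)                              # step 10
        conf_sets[mode] = update_conf_set(conf_sets[mode], data)  # steps 15-16
        mode = nxt
    return switch_times, x, conf_sets


# Toy stubs: scalar plant x+ = 0.9 x + u, fixed gain, fixed dwell time of 2 steps.
times, x_final, sets = run_switching_loop(
    first_mode=0,
    next_modes=[1, 0, 1],
    x0=1.0,
    conf_sets={0: 0, 1: 0},
    design_gain=lambda c: -0.5,
    dwell_time=lambda c_i, c_j: 2,
    step=lambda mode, K, x: ((x, K * x), 0.9 * x + K * x),
    update_conf_set=lambda c, data: c + len(data),
)
```

As in the off-policy scheme described above, the gain and dwell time for an epoch are fixed before the epoch starts, and only the confidence set of the mode just visited is refreshed at its end.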

VI Analysis and Guarantees

VI-A Stability Analysis

In this section, we first show that the designed learning-based policy is $(\kappa,\gamma)$-strongly stabilizing within an epoch. The results are similar to those of [7], and the rationale is detailed in the proof (see Appendix A-D). The second part of this section presents Theorem 2, which gives a formula for the online computation of the minimum mode-dependent dwell time. This computation is used in Algorithm 1 and ensures control of the state norm growth in the sense of Definition 1.

VI-A1 Stability in an Epoch

The following theorem gives the stability properties of the designed OFU-based control applied to a mode $i$ within an epoch (i.e., the interval between two subsequent switches in $\mathcal{I}$).

Theorem 1

Suppose that at time $t$ a transition to an arbitrary mode $i$ takes place, and let $\mathcal{C}^{i}_{t}(\delta)\in\Pi_{t}$ be the confidence ellipsoid formed using $n_{i}(t)$ data points. Then, with $\kappa_{i}=\sqrt{2\nu_{i}/(\alpha^{i}_{0}\sigma_{\omega}^{2})}$ and $\gamma_{i}=1/(2\kappa_{i}^{2})$, the policy $K(\mathcal{C}_{t}^{i},Q^{i},R^{i})$ obtained by solving the relaxed primal SDP is $(\kappa_{i},\gamma_{i})$-strongly stabilizing in the sense of Definition 3.
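For a quick numerical feel for the constants in Theorem 1, the following sketch evaluates them from $\nu_{i}$, $\alpha_{0}^{i}$, and $\sigma_{\omega}^{2}$. The function name is hypothetical, and the grouping $\kappa_{i}=\sqrt{2\nu_{i}/(\alpha_{0}^{i}\sigma_{\omega}^{2})}$, $\gamma_{i}=1/(2\kappa_{i}^{2})$ is this sketch's reading of the formulas.

```python
import math


def strong_stability_constants(nu, alpha0, sigma_w2):
    """kappa_i and gamma_i of Theorem 1 (grouping of terms is an assumption)."""
    kappa = math.sqrt(2.0 * nu / (alpha0 * sigma_w2))
    gamma = 1.0 / (2.0 * kappa ** 2)
    return kappa, gamma


kappa, gamma = strong_stability_constants(nu=8.0, alpha0=1.0, sigma_w2=1.0)
```

A larger cost bound $\nu_{i}$ gives a larger $\kappa_{i}$ and a smaller $\gamma_{i}$, that is, a weaker stability certificate.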

VI-A2 Minimum Mode-Dependent Dwell-Time

Theorem 2

Consider the switched system described by (1), which transitions to mode $i$ at time $t$ and is then informed of the index $j$ of the next mode for the upcoming switch. Let $K(\mathcal{C}_{t}^{i},Q^{i},R^{i})$ denote the control feedback designed for mode $i$, and let $P(\mathcal{C}_{t}^{i},Q^{i},R^{i})$ and $P(\mathcal{C}_{t}^{j},Q^{j},R^{j})$ denote the corresponding solutions of the relaxed dual SDPs. Then the growth in state norm during the transition from mode $i$ to $j$ is under control in the sense of Definition 1, provided that actuation in mode $i$ lasts for at least the minimum dwell time $\tau^{ij}_{es}$ given by

\[
\tau^{ij}_{es}:=\max\bigg\{1,\;-\frac{\ln\rho(\mathcal{C}^{i}_{t}(\delta),\mathcal{C}^{j}_{t}(\delta))+\ln\mathcal{X}(\mathcal{C}^{i}_{t}(\delta),\mathcal{C}^{j}_{t}(\delta))-\ln\bar{\alpha}}{\ln\big(1-\eta\big(\mathcal{C}^{i}_{t}(\delta)\big)\big)}\bigg\} \tag{45}
\]

where

\begin{align}
\eta\big(\mathcal{C}^{i}_{t}(\delta)\big) &:=\frac{\underline{\lambda}\big(\mathcal{H}(\mathcal{C}^{i}_{t}(\delta))\big)}{\overline{\lambda}\big(P(\mathcal{C}^{i}_{t},Q^{i},R^{i})\big)}, \tag{46}\\
\rho(\mathcal{C}^{i}_{t}(\delta),\mathcal{C}^{j}_{t}(\delta)) &:=\frac{\overline{\lambda}\big(P(\mathcal{C}^{j}_{t},Q^{j},R^{j})\big)}{\underline{\lambda}\big(P(\mathcal{C}^{i}_{t},Q^{i},R^{i})\big)}, \notag\\
\mathcal{X}(\mathcal{C}^{i}_{t}(\delta),\mathcal{C}^{j}_{t}(\delta)) &:=\frac{\overline{\lambda}\big(P(\mathcal{C}^{i}_{t},Q^{i},R^{i})\big)}{\underline{\lambda}\big(P(\mathcal{C}^{j}_{t},Q^{j},R^{j})\big)} \tag{47}
\end{align}

and

\begin{align}
\mathcal{H}(\mathcal{C}^{i}_{t}(\delta)) ={}& Q^{i}+K^{\top}(\mathcal{C}^{i}_{t},Q^{i},R^{i})\,R^{i}\,K(\mathcal{C}^{i}_{t},Q^{i},R^{i}) \notag\\
&-2\mu^{i}_{n_{i}(t)}\|P(\mathcal{C}^{i}_{t},Q^{i},R^{i})\|_{*}\begin{pmatrix}I\\ K(\mathcal{C}^{i}_{t},Q^{i},R^{i})\end{pmatrix}{V^{i}}^{-1}_{n_{i}(t)}\begin{pmatrix}I\\ K(\mathcal{C}^{i}_{t},Q^{i},R^{i})\end{pmatrix}^{\top}. \tag{48}
\end{align}

Note that the proof of Theorem 2 shows that $0<\eta\big(\mathcal{C}^{i}_{t}(\delta)\big)<1$.
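Given the relaxed dual solutions and the designed gain, (45)-(48) reduce to a few eigenvalue computations. The sketch below is illustrative and rests on several assumptions that are not confirmed by the text: the function name is hypothetical, $\|P\|_{*}$ is read as the nuclear (trace) norm, and the stacked matrix $(I;K)$ in (48) is applied as $(I;K)^{\top}V^{-1}(I;K)$ so that the correction term has the same $n\times n$ size as $Q^{i}$. The function raises an error when $\eta$ falls outside $(0,1)$.

```python
import math
import numpy as np


def estimated_dwell_time(P_i, P_j, K_i, Q_i, R_i, V_i, mu_i, alpha_bar):
    """Evaluate (45)-(48) for a switch from mode i to mode j."""
    n = Q_i.shape[0]
    IK = np.vstack([np.eye(n), K_i])        # stacked (I; K), shape (n + m, n)
    P_nuc = np.linalg.norm(P_i, "nuc")      # ||P||_* as nuclear norm (assumption)
    # Transposes placed so the correction is n x n (assumption of this sketch).
    correction = IK.T @ np.linalg.solve(V_i, IK)
    H = Q_i + K_i.T @ R_i @ K_i - 2.0 * mu_i * P_nuc * correction
    H = 0.5 * (H + H.T)
    eig_i = np.linalg.eigvalsh(P_i)         # ascending eigenvalues
    eig_j = np.linalg.eigvalsh(P_j)
    eta = np.linalg.eigvalsh(H)[0] / eig_i[-1]
    if not 0.0 < eta < 1.0:
        raise ValueError("eta must lie in (0, 1); the confidence set may be too loose")
    rho = eig_j[-1] / eig_i[0]
    chi = eig_i[-1] / eig_j[0]
    numerator = math.log(rho) + math.log(chi) - math.log(alpha_bar)
    return max(1.0, -numerator / math.log(1.0 - eta))


# Scalar illustration with made-up values for P, K, V and mu.
P = np.array([[2.0]])
K = np.array([[-0.5]])
Q = np.array([[1.0]])
R = np.array([[1.0]])
V = 100.0 * np.eye(2)
tau_es = estimated_dwell_time(P, P, K, Q, R, V, mu_i=0.1, alpha_bar=0.01)
```

In this toy case a larger $\mu$ (a looser confidence set) lowers $\eta$ and lengthens the estimated dwell time.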

The following lemma gives an upper-bound for the state norm for the switched system.

Lemma 3

Algorithm 1 guarantees

\[
\mathbb{E}\big[x^{\top}_{\mathfrak{T}^{{es}^{+}}_{i_{k+1}}}x_{\mathfrak{T}^{{es}^{+}}_{i_{k+1}}}\,\big|\,\mathcal{F}_{\tau_{q}-1}\big]\leq\mathbb{E}\big[x^{\top}_{\mathfrak{T}^{{es}^{+}}_{i_{k}}}x_{\mathfrak{T}^{{es}^{+}}_{i_{k}}}\,\big|\,\mathcal{F}_{\tau_{q}-1}\big]+\kappa_{*}^{4}\frac{\alpha_{1}^{*}}{\alpha_{0}^{*}}\sigma_{\omega}^{2} \tag{49}
\]

for the state norm right after two subsequent switches occurring at $\mathfrak{T}^{es}_{i_{k}}$ and $\mathfrak{T}^{es}_{i_{k+1}}$.

Furthermore, for any $\mathfrak{T}^{es}_{i_{k}}\leq t<\mathfrak{T}^{es}_{i_{k+1}}$,

\[
\mathbb{E}[x_{t}^{\top}x_{t}\,|\,\mathcal{F}_{t-1}]\leq\bar{\alpha}^{k}\kappa_{*}^{2}x_{0}^{\top}x_{0}+\frac{2-\bar{\alpha}}{1-\bar{\alpha}}\frac{\mathcal{X}_{*}^{2}}{\eta_{*}}\sigma_{\omega}^{2}:=\bar{X}_{k} \tag{50}
\]

or, overall, for any $t>0$,

\[
\mathbb{E}[x_{t}^{\top}x_{t}\,|\,\mathcal{F}_{t-1}]\leq\kappa_{*}^{2}x_{0}^{\top}x_{0}+\frac{2-\bar{\alpha}}{1-\bar{\alpha}}\frac{\mathcal{X}_{*}^{2}}{\eta_{*}}\sigma_{\omega}^{2}:=\tilde{X} \tag{51}
\]

where $\eta_{*}=1/\kappa_{*}^{2}$ and $\mathcal{X}_{*}=\kappa_{*}^{2}\alpha_{1}^{*}/\alpha_{0}^{*}$.
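The bounds (50)-(51) are closed-form and can be evaluated directly. In the sketch below the function name and argument names are hypothetical, and the constants $\kappa_{*}$, $\alpha_{1}^{*}$, $\alpha_{0}^{*}$ are simply taken as inputs; the example values are arbitrary.

```python
def state_norm_bound(k, x0_sq_norm, kappa, alpha1, alpha0, sigma_w2, alpha_bar):
    """Bound X_bar_k of (50); k = 0 gives the uniform bound X_tilde of (51)."""
    eta = 1.0 / kappa ** 2                  # eta_* = 1 / kappa_*^2
    chi = kappa ** 2 * alpha1 / alpha0      # X_* = kappa_*^2 alpha_1^* / alpha_0^*
    noise_term = (2.0 - alpha_bar) / (1.0 - alpha_bar) * chi ** 2 / eta * sigma_w2
    return alpha_bar ** k * kappa ** 2 * x0_sq_norm + noise_term


x_tilde = state_norm_bound(0, 1.0, 2.0, 2.0, 1.0, 0.5, 0.5)   # uniform bound (51)
x_bar_1 = state_norm_bound(1, 1.0, 2.0, 2.0, 1.0, 0.5, 0.5)   # bound during epoch k = 1
```

Since $\bar{\alpha}<1$, the contribution of the initial state decays geometrically with the epoch index $k$, while the noise-driven term stays constant.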

Corollary 1

The policy employed in Algorithm 1 belongs to the class of candidate policies defined by (5), i.e., $\kappa_{i}\leq\kappa^{i}_{c}$ for all $i\in\mathcal{M}$. Since $\gamma^{i}_{c}<1$, comparing (49) of Lemma 3 with (3) shows that Algorithm 1 keeps the state norm within bounds, in line with the conditions of Definition 1.

The following theorem establishes an upper bound on the estimation error of the dwell time. This bound is highly useful for regret bound analysis of the algorithm.

In the $M\bar{S}$ setting, minimizing cost while switching according to $\mathcal{I}$ can be framed as minimizing the expected time, since the objective is to spend as little time as possible in each mode while keeping the state bounded in the sense of Definition 1. In this context, the expected cumulative time required to achieve the goal is

\[
T_{*}=\sum_{k=0}^{n-1}\tau_{*}^{i_{k}i_{k+1}}, \tag{52}
\]

where the $\tau_{*}^{i_{k}i_{k+1}}$ are determined by (21).

In the $\bar{M}\bar{S}$ setting, the duration

\[
T_{es}=\sum_{k=0}^{n-1}\tau_{es}^{i_{k}i_{k+1}} \tag{53}
\]

is not explicitly known. Theorem 3 also gives the order of the total expected time for switching according to $\mathcal{I}$ under the mode index announcement strategy when boundedness of the state must be ensured.

Theorem 3

(Dwell-Time Estimation Error Upper-Bound) Suppose that at time $t$ the system starts actuating in mode $i$ and is immediately informed that the next mode is $j$. Then the following statements hold.

$(a)$ The error of the estimated minimum mode-dependent dwell time $\tau_{es}^{ij}$ has the following upper bound:

\[
\tau_{es}^{ij}-\tau_{*}^{ij}\leq\frac{\ln\big(1+\bar{\beta}_{0}\,\overline{\lambda}\big(\chi^{i}_{t}\big)\big)}{-\ln\big(1-\kappa_{i}^{-2}\big)} \tag{54}
\]

where

\begin{align}
\chi_{t}^{i} &:=\frac{2\kappa_{i}^{2}\mu^{i}_{t}}{\gamma}\|P(\mathcal{C}_{t}^{i},Q^{i},R^{i})\|_{*}\bigg\|\begin{pmatrix}I\\ K(\mathcal{C}_{t}^{i},Q^{i},R^{i})\end{pmatrix}{V^{i}}^{-1}_{n_{i}(t)}\begin{pmatrix}I\\ K(\mathcal{C}_{t}^{i},Q^{i},R^{i})\end{pmatrix}^{\top}\bigg\|I, \tag{55}\\
\alpha_{0}^{i} &\geq\bar{\alpha}^{i}_{0}:=\max_{j}\frac{\sqrt{2\nu_{i}\nu_{j}}}{\sigma_{\omega}^{2}},\qquad\bar{\beta}^{ij}_{0}=\frac{2\nu_{j}}{\sigma_{\omega}^{2}\big(\bar{\alpha}^{i}_{0}\big)^{2}}. \tag{56}
\end{align}

$(b)$ The expected cumulative time $T_{es}$ required to complete the switching according to $\mathcal{I}$ with state norm growth under control, in the sense of Definition 1, is of order $\mathcal{O}(|\mathcal{M}|\sqrt{n})$, where $n$ is the total number of switches.

VI-B Regret Bound Analysis

The regret defined in (41) can be split into two components, $R_{\mathcal{A}}^{1}$ and $R_{\mathcal{A}}^{2}$:

\begin{align}
R_{\mathcal{A}}(\mathcal{I}) ={}& \overbrace{\mathbb{E}\Big[\sum_{k=0}^{n-1}\sum_{t=\mathfrak{T}^{es}_{i_{k}}}^{\mathfrak{T}^{*}_{i_{k+1}}-1}\big(c_{t}^{\sigma(t)}-J_{*}(\Theta_{*}^{i_{k}},Q^{i_{k}},R^{i_{k}})\big)\Big]}^{R_{\mathcal{A}}^{1}} \notag\\
&+\underbrace{\mathbb{E}\Big[\sum_{k=0}^{ns-1}\sum_{t=\mathfrak{T}^{*}_{i_{k+1}}}^{\mathfrak{T}^{es}_{i_{k+1}}-1}c_{t}^{\sigma(t)}\Big]}_{R_{\mathcal{A}}^{2}} \tag{57}
\end{align}

Here, $\sigma(t)=i_{k}$ for $\mathfrak{T}^{es}_{i_{k}}\leq t<\mathfrak{T}^{es}_{i_{k+1}}$.

The component $R_{\mathcal{A}}^{1}$ captures the regret due to the sub-optimality of the feedback policy generated by the algorithm, whereas $R_{\mathcal{A}}^{2}$ captures the regret due to the estimation error of the minimum mode-dependent dwell time. To bound $R^{1}_{\mathcal{A}}$, we adopt a decomposition similar to that in [7], while also accounting for the control of state norm growth. For $R_{\mathcal{A}}^{2}$, we rely on Theorem 3.

To establish an upper bound for the regret, we introduce the event $\mathcal{E}_{\mathfrak{T}^{es}_{i_{k}}}(\mathcal{I})$, defined as

\[
\mathcal{E}_{\mathfrak{T}^{es}_{i_{k}}}(\mathcal{I})=\big\{\Pi_{\mathfrak{T}^{es}_{i_{k}}}\;\textit{and}\;\;\mathbb{E}[x_{t}^{\top}x_{t}\,|\,\mathcal{F}_{t-1}]\leq\bar{X}_{k}\;\;\textit{for}\;\;\mathfrak{T}^{es}_{i_{k}}\leq t<\mathfrak{T}^{es}_{i_{k+1}}\big\}. \tag{58}
\]

This event comprises the collection of confidence ellipsoids updated up to time $\mathfrak{T}^{es}_{i_{k}}$, from which the feedback gain and the minimum mode-dependent dwell time are designed, together with the resulting upper bound on the expected state norm. The value of $\bar{X}_{k}$ is determined by (50) during the execution of the mission following the prescribed sequence $\mathcal{I}$.

Theorem 4

Suppose Assumptions 1 and 2 hold. Then Algorithm 1 achieves an expected regret of $\mathcal{O}(|\mathcal{M}|\sqrt{ns})$ with probability at least $1-\delta$.

VII Conclusion

In this paper, we propose an algorithm that guarantees safe switching with minimum cost in a scenario where switch sequences are revealed in an online fashion. Our algorithm is based on the Optimism in the Face of Uncertainty (OFU) principle, which integrates system identification, control, and dwell-time design procedures. By utilizing constructed confidence sets for parameter estimates, we develop a strategy to accurately estimate the minimum dwell time. Furthermore, we prove that our proposed algorithm achieves an expected regret of $\mathcal{O}(|\mathcal{M}|\sqrt{ns})$ compared to the case when all parameters are known, where $ns$ represents the total number of switches and $|\mathcal{M}|$ is the number of mode candidates. A possible extension of this work involves a setup in which the decision of the next mode to switch to becomes an integral part of the strategy design. For instance, the LQ formulation of the problem discussed in [21] has the potential to jointly design both the timing and selection of the next mode, aiming to identify the most effective actuation mode with minimal regret.

References

  • [1] N. M. Boffi, S. Tu, and J.-J. E. Slotine, “Regret bounds for adaptive nonlinear control,” in Learning for Dynamics and Control.   PMLR, 2021, pp. 471–483.
  • [2] S. Kakade, A. Krishnamurthy, K. Lowrey, M. Ohnishi, and W. Sun, “Information theoretic regret bounds for online nonlinear control,” Advances in Neural Information Processing Systems, vol. 33, pp. 15 312–15 325, 2020.
  • [3] X. Chen and E. Hazan, “Black-box control for linear dynamical systems,” in Conference on Learning Theory.   PMLR, 2021, pp. 1114–1143.
  • [4] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, “On the sample complexity of the linear quadratic regulator,” Foundations of Computational Mathematics, vol. 20, no. 4, pp. 633–679, 2020.
  • [5] A. Cohen, A. Hasidim, T. Koren, N. Lazic, Y. Mansour, and K. Talwar, “Online linear quadratic control,” in International Conference on Machine Learning.   PMLR, 2018, pp. 1029–1038.
  • [6] H. Mania, S. Tu, and B. Recht, “Certainty equivalence is efficient for linear quadratic control,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [7] A. Cohen, T. Koren, and Y. Mansour, “Learning linear-quadratic regulators efficiently with only sqrt(t) regret,” arXiv preprint arXiv:1902.06223, 2019.
  • [8] F. Zhu and P. J. Antsaklis, “Optimal control of hybrid switched systems: A brief survey,” Discrete Event Dynamic Systems, vol. 25, no. 3, pp. 345–364, 2015.
  • [9] J. Zhao and D. J. Hill, “On stability, $L_2$-gain and $H_{\infty}$ control for switched systems,” Automatica, vol. 44, no. 5, pp. 1220–1232, 2008.
  • [10] H. Lin and P. J. Antsaklis, “Stability and stabilizability of switched linear systems: a survey of recent results,” IEEE Transactions on Automatic control, vol. 54, no. 2, pp. 308–322, 2009.
  • [11] D. Liberzon, J. P. Hespanha, and A. S. Morse, “Stability of switched systems: a lie-algebraic condition,” Systems & Control Letters, vol. 37, no. 3, pp. 117–122, 1999.
  • [12] D. Liberzon, Switching in systems and control.   Springer, 2003, vol. 190.
  • [13] J. P. Hespanha and A. S. Morse, “Stability of switched systems with average dwell-time,” in Proceedings of the 38th IEEE conference on decision and control (Cat. No. 99CH36304), vol. 3.   IEEE, 1999, pp. 2655–2660.
  • [14] X. Zhao, L. Zhang, P. Shi, and M. Liu, “Stability and stabilization of switched linear systems with mode-dependent average dwell time,” IEEE Transactions on Automatic Control, vol. 57, no. 7, pp. 1809–1815, 2011.
  • [15] J. Kenanian, A. Balkan, R. M. Jungers, and P. Tabuada, “Data driven stability analysis of black-box switched linear systems,” Automatica, vol. 109, p. 108533, 2019.
  • [16] M. Rotulo, C. De Persis, and P. Tesi, “Online learning of data-driven controllers for unknown switched linear systems,” Automatica, vol. 145, p. 110519, 2022.
  • [17] T. Dai and M. Sznaier, “A moments based approach to designing mimo data driven controllers for switched systems,” in 2018 IEEE Conference on Decision and Control (CDC).   IEEE, 2018, pp. 5652–5657.
  • [18] ——, “A convex optimization approach to synthesizing state feedback data-driven controllers for switched linear systems,” Automatica, vol. 139, p. 110190, 2022.
  • [19] J. Eising, S. Liu, S. Martinez, and J. Cortes, “Data-driven mode detection and stabilization of unknown switched linear systems,” arXiv preprint arXiv:2303.11489, 2023.
  • [20] Z. Du, Y. Sattar, D. A. Tarzanagh, L. Balzano, N. Ozay, and S. Oymak, “Data-driven control of markov jump systems: Sample complexity and regret bounds,” in 2022 American Control Conference (ACC).   IEEE, 2022, pp. 4901–4908.
  • [21] Y. Li, J. A. Preiss, N. Li, Y. Lin, A. Wierman, and J. Shamma, “Online switching control with stability and regret guarantees,” Proceedings of Machine Learning Research vol XX, vol. 1, p. 29, 2023.
  • [22] J. A. Chekan and C. Langbort, “Learn and control while switching: with guaranteed stability and sublinear regret,” arXiv preprint arXiv:2207.10827, 2022.
  • [23] M. C. Campi and P. Kumar, “Adaptive linear quadratic gaussian control: the cost-biased approach revisited,” SIAM Journal on Control and Optimization, vol. 36, no. 6, pp. 1890–1907, 1998.
  • [24] Y. Abbasi-Yadkori and C. Szepesvári, “Regret bounds for the adaptive control of linear quadratic systems,” in Proceedings of the 24th Annual Conference on Learning Theory, 2011, pp. 1–26.
  • [25] M. Ibrahimi, A. Javanmard, and B. V. Roy, “Efficient reinforcement learning for high dimensional linear quadratic systems,” in Advances in Neural Information Processing Systems, 2012, pp. 2636–2644.
  • [26] S. Lale, K. Azizzadenesheli, B. Hassibi, and A. Anandkumar, “Reinforcement learning with fast stabilization in linear dynamical systems,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2022, pp. 5354–5390.
  • [27] J. A. Chekan, K. Azizzadenesheli, and C. Langbort, “Joint stabilization and regret minimization through switching in systems with actuator redundancy,” arXiv preprint arXiv:2105.14709, 2021.
  • [28] A. Cohen, T. Koren, and Y. Mansour, “Learning linear-quadratic regulators efficiently with only $\sqrt{T}$ regret,” in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 97.   PMLR, 09–15 Jun 2019, pp. 1300–1309. [Online]. Available: https://proceedings.mlr.press/v97/cohen19b.html
  • [29] M. Fazel, R. Ge, S. M. Kakade, and M. Mesbahi, “Global convergence of policy gradient methods for linearized control problems,” 2018.
  • [30] J. A. Chekan, K. Azizzadenesheli, and C. Langbort, “Joint stabilization and regret minimization through switching in over-actuated systems,” IFAC-PapersOnLine, vol. 55, no. 25, pp. 79–84, 2022.

Appendix A Further Analysis

In this section, we dig further and provide proofs, rigorous analysis of the algorithms, properties of the closed-loop system, and regret bounds.

A-A Notes on $(\kappa,\gamma)$-stability

To get a better sense of $(\kappa,\gamma)$-stability, we illustrate the proof of Lemma B.1 of [5]. For any linear system defined by the pair $(A,B)$, a stabilizing policy $K$ is $(\kappa,\gamma)$-stabilizing. The closed-loop system is stable if $\rho(A+BK)=1-\gamma$, where $0<\gamma<1$. If we define the matrix $Q=(1-\gamma)^{-1}(A+BK)$, then $Q$ is stable if there exists a positive definite matrix $P$ such that

$$Q^{\top}PQ\preceq P. \tag{59}$$

It is also immediate that when $Q$ is stable, $(A+BK)$ is also stable.

From (59) one can write

$$(A+BK)^{\top}P(A+BK)\preceq(1-\gamma)^{2}P. \tag{60}$$

Pre- and post-multiplying both sides of (60) by $P^{-\frac{1}{2}}$ yields

$$L^{\top}L\preceq(1-\gamma)^{2}I, \tag{61}$$

where $L=P^{\frac{1}{2}}(A+BK)P^{-\frac{1}{2}}$. Then, defining $H=P^{-\frac{1}{2}}$, one can write

$$HLH^{-1}=A+BK. \tag{62}$$

Denoting the condition number of $P^{-\frac{1}{2}}$ by $\kappa$, i.e., $\|H\|\|H^{-1}\|\leq\kappa$, and having $\|L\|\leq 1-\gamma$ from (61) completes the proof.

It is worth noting that this definition directly implies $\rho(A+BK)\leq\|L\|\leq 1-\gamma$.
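The factorization $A+BK=HLH^{-1}$ can be checked numerically. The sketch below is one possible construction and not necessarily the one used in [5]: instead of rescaling by the spectral radius, it obtains $P$ from the discrete Lyapunov equation $M^{\top}PM-P=-I$ with $M=A+BK$, which gives $M^{\top}PM\preceq(1-1/\overline{\lambda}(P))P$. The function name and the example matrices are hypothetical.

```python
import numpy as np


def strong_stability_certificate(A, B, K):
    """Return (kappa, gamma, H, L) such that A + B K = H L H^{-1}."""
    M = A + B @ K
    n = M.shape[0]
    if np.max(np.abs(np.linalg.eigvals(M))) >= 1.0:
        raise ValueError("A + B K is not stable")
    # Solve M^T P M - P = -I through the vectorized linear system.
    lhs = np.eye(n * n) - np.kron(M.T, M.T)
    P = np.linalg.solve(lhs, np.eye(n).reshape(-1)).reshape(n, n)
    P = 0.5 * (P + P.T)  # symmetrize against round-off
    evals, evecs = np.linalg.eigh(P)
    # M^T P M = P - I <= (1 - 1/lambda_max(P)) P, so (1 - gamma)^2 = 1 - 1/lambda_max(P).
    gamma = 1.0 - np.sqrt(1.0 - 1.0 / evals[-1])
    P_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
    H = np.linalg.inv(P_half)  # H = P^{-1/2}
    L = P_half @ M @ H         # L = P^{1/2} (A + B K) P^{-1/2}
    kappa = np.linalg.norm(H, 2) * np.linalg.norm(P_half, 2)
    return kappa, gamma, H, L


# Hypothetical example system and gain.
A = np.array([[0.9, 0.5], [0.0, 0.8]])
B = np.eye(2)
K = -0.2 * np.eye(2)
kappa, gamma, H, L = strong_stability_certificate(A, B, K)
print(kappa, gamma, np.linalg.norm(L, 2))
```

The returned pair is a valid certificate for this particular $P$; other choices of $P$ generally yield different $(\kappa,\gamma)$ values.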

Lemma 4

(Lemma 25 of [7]) Let $X$ and $Z$ denote matrices of equal size and let $Y$ denote a $(\kappa,\gamma)$-stable matrix such that $X\preceq Y^{\top}XY+Z$. Then $X\preceq\frac{\kappa^{2}}{\gamma}\|Z\|I$.

A-B Proofs of Section III

Proof of Corollary 2

Proof:

Considering that Program 1 is a minimization problem linear in $\tau_{i_{k}}$, it attains its minimum at the equality condition of (13). Using an approach similar to, albeit simpler than, the proof of Theorem 2, we can establish that when a system starts actuating at time $t$ in mode $i_{k}$ with a stabilizing policy $K_{i_{k}}$ and is subsequently informed of the next mode to switch to, denoted $i_{k+1}$ and actuated with another stabilizing policy $K_{i_{k+1}}$, the duration of actuation in mode $i_{k}$, denoted $\tau_{i_{k}}$, must satisfy the following condition:

$$\ln\bar{\rho}(K_{i_{k}},K_{i_{k+1}})+\tau_{i_{k}}\ln\big(1-\bar{\eta}(K_{i_{k}})\big)+\ln\bar{\mathcal{X}}(K_{i_{k}},K_{i_{k+1}})\leq\ln\mathcal{Y}$$

or equivalently

$$\tau_{i_{k}}\geq-\frac{\ln\bar{\rho}(K_{i_{k}},K_{i_{k+1}})+\ln\bar{\mathcal{X}}(K_{i_{k}},K_{i_{k+1}})-\ln\mathcal{Y}}{\ln\big(1-\bar{\eta}(K_{i_{k}})\big)}, \tag{63}$$

where

$$\bar{\eta}(K_{i_{k}}):=\frac{\underline{\lambda}\big(H_{K_{i_{k}}}\big)}{\overline{\lambda}\big(P_{K_{i_{k}}}\big)},\quad\bar{\rho}(K_{i_{k}},K_{i_{k+1}}):=\frac{\overline{\lambda}\big(P_{K_{i_{k+1}}}\big)}{\underline{\lambda}\big(P_{K_{i_{k}}}\big)},\quad\bar{\mathcal{X}}(K_{i_{k}},K_{i_{k+1}}):=\frac{\overline{\lambda}\big(P_{K_{i_{k}}}\big)}{\underline{\lambda}\big(P_{K_{i_{k+1}}}\big)},\quad H_{K_{i_{k}}}=Q^{i_{k}}+K_{i_{k}}^{\top}R^{i_{k}}K_{i_{k}}. \tag{64}$$

Then, again similarly to the proof of Theorem 2, we have

$$\mathbb{E}[\mathcal{V}_{i_{k+1}}((t+\tau_{i_{k}})^{+})]\leq\bar{\rho}(K_{i_{k}},K_{i_{k+1}})\big(1-\bar{\eta}(K_{i_{k}})\big)^{\tau_{i_{k}}}\mathbb{E}[\mathcal{V}_{i_{k}}(t^{+})]+\Big(\sum_{j=0}^{\tau_{i_{k}}}\big(1-\bar{\eta}(K_{i_{k}})\big)^{j}\Big)\overline{\lambda}(P_{K_{i_{k}}})\sigma_{\omega}^{2}, \tag{65}$$

where, by the definitions $\mathcal{V}_{i_{k+1}}((t+\tau_{i_{k}})^{+})=x^{\top}_{(t+\tau_{i_{k}})^{+}}P_{K_{i_{k+1}}}x_{(t+\tau_{i_{k}})^{+}}$ and $\mathcal{V}_{i_{k}}(t^{+})=x^{\top}_{t^{+}}P_{K_{i_{k}}}x_{t^{+}}$, applying the Rayleigh-Ritz inequality yields

$$\mathbb{E}[x^{\top}_{(t+\tau_{i_{k}})^{+}}x_{(t+\tau_{i_{k}})^{+}}]\leq\bar{\mathcal{X}}(K_{i_{k}},K_{i_{k+1}})\bar{\rho}(K_{i_{k}},K_{i_{k+1}})\big(1-\bar{\eta}(K_{i_{k}})\big)^{\tau_{i_{k}}}\mathbb{E}[x^{\top}_{t^{+}}x_{t^{+}}]+\frac{\bar{\mathcal{X}}(K_{i_{k}},K_{i_{k+1}})}{\bar{\eta}(K_{i_{k}})}\sigma_{\omega}^{2},$$

where we applied an upper bound on the geometric series in (65).

By definition of $\tau_{i_{k}}$ we have

$$\bar{\mathcal{X}}(K_{i_{k}},K_{i_{k+1}})\bar{\rho}(K_{i_{k}},K_{i_{k+1}})\big(1-\bar{\eta}(K_{i_{k}})\big)^{\tau_{i_{k}}}\leq\mathcal{Y}<1. \tag{66}$$

Furthermore, by applying Lemma 4, we have

$$\|P_{K_{i_{k}}}\|_{*}\leq\alpha_{1}(1+\kappa^{2})\frac{\kappa^{2}}{\gamma},$$

using which one can show

$$\frac{\bar{\mathcal{X}}(K_{i_{k}},K_{i_{k+1}})}{\bar{\eta}(K_{i_{k}})}\sigma_{\omega}^{2}\leq\frac{\alpha_{1}^{2}}{\alpha_{0}^{2}}\frac{\kappa^{4}(1+\kappa^{2})^{2}}{\gamma^{2}}\sigma_{\omega}^{2}, \tag{67}$$

in which we applied the fact that $P_{K_{i}}\succeq\alpha_{0}I$, which holds for $i=i_{k},i_{k+1}$.

Observing that the policy $K_{i_{k}}\in\mathcal{S}(\Theta_{*}^{i_{k}})$, we have $\kappa\leq\kappa_{c}^{*}$ and $\gamma\geq\gamma_{c}^{*}$. This shows that the state remains bounded immediately after a switch, in the sense of Definition 1.

Furthermore, in cases where $\ln\bar{\rho}(K_{i_{k}},K_{i_{k+1}})+\ln\bar{\mathcal{X}}(K_{i_{k}},K_{i_{k+1}})<0$, prolonging mode $i_{k}$ is unnecessary. However, our proposed scenario requires at least one step in mode $i_{k}$. Thus, we include this condition in the definition of the minimum dwell time.
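The bound (63), together with the one-step minimum just mentioned, can be evaluated directly once the three scalar constants in (64) are available. The sketch below is a minimal illustration; the function name and the numerical values are hypothetical, and the constants are passed in as plain numbers rather than computed from $P_{K}$ and $H_{K}$.

```python
import math


def min_dwell_time(rho_bar, chi_bar, eta_bar, y):
    """Smallest integer tau >= 1 satisfying the dwell-time condition (63).

    rho_bar, chi_bar, eta_bar : the constants defined in (64)
    y                         : the contraction target Y, with 0 < Y < 1
    """
    if not 0.0 < eta_bar < 1.0:
        raise ValueError("eta_bar must lie strictly between 0 and 1")
    if not 0.0 < y < 1.0:
        raise ValueError("Y must lie strictly between 0 and 1")
    bound = -(math.log(rho_bar) + math.log(chi_bar) - math.log(y)) / math.log(1.0 - eta_bar)
    # At least one step in the current mode is always required.
    return max(1, math.ceil(bound))


# Hypothetical constants for one pair of consecutive modes.
tau = min_dwell_time(rho_bar=4.0, chi_bar=4.0, eta_bar=0.2, y=0.5)
print(tau)
```

With the returned value, the product on the left-hand side of (66) does not exceed $\mathcal{Y}$, while one step fewer would violate it for these example numbers.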

Proof of Corollary 3

Proof:

The proof follows directly by considering the situation in which the system operates in the mode with the highest cost for the longest duration. This scenario yields a conservative upper bound on the cost that is valid for any given sequence of switches $\mathcal{I}$. ∎

A-C Primal-Dual SDP formulation of LQR

For any system $\Theta_{*}$ with quadratic cost specified by $Q$ and $R$, the standard LQR control problem can be reformulated as a Semi-Definite Program (SDP) as follows:

$$\begin{array}{ll}\textrm{minimize}&\begin{pmatrix}Q&0\\ 0&R\end{pmatrix}\bullet\Sigma(\Theta_{*},Q,R)\\ \textrm{s.t.}&\Sigma_{xx}(\Theta_{*},Q,R)={\Theta_{*}}^{\top}\Sigma(\Theta_{*},Q,R)\Theta_{*}+W\\ &\Sigma(\Theta_{*},Q,R)>0\end{array} \tag{68}$$

where, for a given $\Theta_{*}$, $\Sigma(\Theta_{*},Q,R)$ is the covariance matrix of the joint distribution of $(x,u)$ in steady state, with

$$\Sigma(\Theta_{*},Q,R)=\begin{pmatrix}\Sigma_{xx}&\Sigma_{xu}\\ \Sigma_{ux}&\Sigma_{uu}\end{pmatrix}$$

in which $\Sigma_{xx}(\Theta_{*},Q,R)\in\mathbb{R}^{n\times n}$, $\Sigma_{uu}(\Theta_{*},Q,R)\in\mathbb{R}^{m\times m}$, and $\Sigma_{ux}(\Theta_{*},Q,R)=\Sigma_{xu}^{\top}(\Theta_{*},Q,R)\in\mathbb{R}^{m\times n}$ for $Q\in\mathbb{R}^{n\times n}$ and $R\in\mathbb{R}^{m\times m}$. The optimal value is exactly the average expected cost $J_{*}(\Theta_{*},Q,R)$, and for $W>0$ (which guarantees $\Sigma_{xx}>0$) the optimal policy $K_{*}(\Theta_{*},Q,R)$ can be obtained from $K_{*}(\Theta_{*},Q,R)=\mathcal{K}(\Sigma(\Theta_{*},Q,R))=\Sigma_{ux}(\Theta_{*},Q,R)\Sigma^{-1}_{xx}(\Theta_{*},Q,R)$. In other words, the matrix

$$\mathcal{E}(K)=\begin{pmatrix}X&XK^{\top}\\ KX&KXK^{\top}\end{pmatrix}$$

is a feasible solution for the SDP. With a stabilizing $K_{*}(\Theta_{*},Q,R)$, the system state converges to the steady-state distribution with covariance $X=\mathbb{E}[xx^{\top}]$.

The dual of this program is written as follows:

$$\begin{array}{ll}\max&P(\Theta_{*},Q,R)\bullet W\\ \textrm{s.t.}&\begin{pmatrix}Q-P(\Theta_{*},Q,R)&0\\ 0&R\end{pmatrix}+\Theta_{*}P(\Theta_{*},Q,R){\Theta_{*}}^{\top}=0\\ &P(\Theta_{*},Q,R)\geq 0\end{array} \tag{72}$$

Using its optimal solution $P_{*}(\Theta_{*},Q,R)$, we define the optimal average expected cost as $J_{*}(\Theta_{*},Q,R)=P_{*}(\Theta_{*},Q,R)\bullet W$.
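The agreement between the primal and dual values can be illustrated numerically for a known system. The sketch below does not call an SDP solver; as a swapped-in technique it iterates the discrete Riccati recursion to obtain $P_{*}$ and the gain, builds the matrix $\mathcal{E}(K)$ from the steady-state covariance, and compares $\begin{pmatrix}Q&0\\0&R\end{pmatrix}\bullet\mathcal{E}(K)$ with $P_{*}\bullet W$. All function names and the example system are hypothetical.

```python
import numpy as np


def riccati_solution(A, B, Q, R, iters=5000, tol=1e-12):
    """Fixed-point iteration of the discrete Riccati recursion."""
    P = Q.copy()
    for _ in range(iters):
        G = R + B.T @ P @ B
        P_next = Q + A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(G, B.T @ P @ A)
        converged = np.max(np.abs(P_next - P)) < tol
        P = P_next
        if converged:
            break
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, K


def steady_state_sigma(A, B, K, W):
    """Joint steady-state covariance of (x, u) under u = K x."""
    M = A + B @ K
    n = M.shape[0]
    # X = M X M^T + W, solved through the vectorized linear system.
    X = np.linalg.solve(np.eye(n * n) - np.kron(M, M), W.reshape(-1)).reshape(n, n)
    return np.block([[X, X @ K.T], [K @ X, K @ X @ K.T]])


# Hypothetical example: Theta = (A B)^T, unit costs, unit noise covariance.
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R, W = np.eye(2), np.eye(1), np.eye(2)

P, K = riccati_solution(A, B, Q, R)
Sigma = steady_state_sigma(A, B, K, W)
cost = np.block([[Q, np.zeros((2, 1))], [np.zeros((1, 2)), R]])
primal_value = np.trace(cost @ Sigma)
dual_value = np.trace(P @ W)
print(primal_value, dual_value)
```

The two printed values coincide up to numerical tolerance, and the constructed `Sigma` satisfies the equality constraint of (68) with $\Theta_{*}=(A\;B)^{\top}$.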

A-D Initialization Algorithm

Given a $(\kappa_{0},\gamma_{0})$-stabilizing policy $K_{0}$, Algorithm 2 guarantees reaching an appropriate $\epsilon_{i}$ for all $i\in\mathcal{M}$. The only remaining question is the specification of the algorithm input $T_{0}^{i}$ for each mode; to address it, we first explain the main steps of the warm-up phase.

Algorithm 2 [22] Warm-Up Phase
1:  Inputs: $K^{i}_{0}$, $T^{i}_{0}$, $\vartheta^{i}$, $\delta$, $\sigma_{\omega}$, $\eta_{t}\sim\mathcal{N}(0,2\sigma_{\omega}^{2}{\kappa^{i}_{0}}^{2}I)$, $\forall i$
2:  Set $\lambda_{i}=\sigma_{\omega}^{2}\vartheta_{i}^{-2}$, $V^{i}_{0}=\lambda_{i}I$, $\forall i\in\mathcal{M}$
3:  for $i=1:|\mathcal{M}|$ do
4:     for $t=0:T^{i}_{0}$ do
5:        Observe $x_{t}$
6:        Play $u_{t}=K^{i}_{0}x_{t}+\eta_{t}$
7:        Observe $x_{t+1}$
8:        Using $u_{t}$ and $x_{t}$, form $z_{t}$, save $(z_{t},x_{t+1})$, and update the ellipsoid according to (73)-(75)
9:     end for
10:  end for
11:  Output: $\Theta_{0}$, the center of the constructed ellipsoid

A-D1 Main Steps of Algorithm 2

In Algorithm 2, using data obtained by executing the control input $u_{t}=K^{i}_{0}x_{t}+\eta_{t}$ (which consists of a linear strongly stabilizing term and an additional exploratory sub-Gaussian noise $\eta_{t}$), a confidence set is constructed around the true parameters of the system by following the steps of Section IV-B. For the warm-up phase, we set $\bar{\lambda}_{i}=\sigma_{\omega}^{2}{\vartheta^{i}}^{-1}$ and $\Theta^{i}_{0}=0$ in (27), which results in the by-time-$t$ estimate

$$\hat{\Theta}^{i}_{t}={\bar{V}_{t}^{i}}^{-1}Z_{t}^{\top}X_{t}, \tag{73}$$

where the covariance matrix $\bar{V}^{i}_{t}$ is constructed as in Section IV-B. Recalling that $\|\Theta^{i}_{*}\|_{*}\leq\vartheta^{i}$ by assumption, a high-probability confidence set around the true but unknown parameters of the system is constructed as

$$\bar{\mathcal{C}}^{i}_{t}(\delta):=\big\{\Theta^{i}\in\mathbb{R}^{(n+m_{i})\times n}\;\big|\;\operatorname{Tr}\big((\Theta^{i}-\hat{\Theta}^{i}_{t})^{\top}\bar{V}^{i}_{t}(\Theta^{i}-\hat{\Theta}^{i}_{t})\big)\leq\bar{r}^{i}_{t}\big\}, \tag{74}$$

where

$$\bar{r}^{i}_{t}=\bigg(\sigma_{\omega}\sqrt{2n\log\frac{n\det(\bar{V}^{i}_{t})}{\delta\det(\bar{\lambda}_{i}I)}}+\sqrt{\bar{\lambda}_{i}\vartheta^{i}}\bigg)^{2}. \tag{75}$$
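The warm-up computations (73)-(75) for a single mode can be sketched as follows. This is a minimal simulation under stated assumptions, not the authors' implementation: the system matrices, the zero initial gain (admissible here only because the example $A$ is already stable), the value $\kappa_{0}=1$, and all function names are hypothetical, and the radius follows (75) exactly as printed.

```python
import numpy as np


def warm_up_estimate(A, B, K0, T0, sigma_w, kappa0, lam, rng):
    """Run u_t = K0 x_t + eta_t for T0 steps and return the estimate (73) and V."""
    n, m = B.shape
    V = lam * np.eye(n + m)          # regularized covariance matrix
    ZtX = np.zeros((n + m, n))       # running sum of z_t x_{t+1}^T
    x = np.zeros(n)
    for _ in range(T0):
        eta = rng.normal(0.0, np.sqrt(2.0) * sigma_w * kappa0, size=m)
        u = K0 @ x + eta
        x_next = A @ x + B @ u + rng.normal(0.0, sigma_w, size=n)
        z = np.concatenate([x, u])
        V += np.outer(z, z)
        ZtX += np.outer(z, x_next)
        x = x_next
    return np.linalg.solve(V, ZtX), V


def confidence_radius(V, lam, theta_bound, sigma_w, delta, n):
    """Radius (75); det(lam * I) is taken over the (n + m)-dimensional space."""
    _, logdet_v = np.linalg.slogdet(V)
    log_ratio = np.log(n / delta) + logdet_v - V.shape[0] * np.log(lam)
    return (sigma_w * np.sqrt(2.0 * n * log_ratio) + np.sqrt(lam * theta_bound)) ** 2


def in_confidence_set(theta, theta_hat, V, radius):
    """Membership test for the ellipsoid (74)."""
    diff = theta - theta_hat
    return float(np.trace(diff.T @ V @ diff)) <= radius


rng = np.random.default_rng(0)
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
sigma_w, theta_bound, delta = 0.1, 2.0, 0.05
lam = sigma_w ** 2 / theta_bound     # lambda-bar as set for the warm-up phase
theta_hat, V = warm_up_estimate(A, B, np.zeros((1, 2)), 500, sigma_w, 1.0, lam, rng)
radius = confidence_radius(V, lam, theta_bound, sigma_w, delta, n=2)
theta_true = np.vstack([A.T, B.T])   # Theta = (A B)^T
print(in_confidence_set(theta_true, theta_hat, V, radius))
```

Because the estimate is a regularized least-squares solution, only the running sums `V` and `ZtX` need to be stored, so the ellipsoid can be updated one sample at a time.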

We need the following definition to specify the warm-up phase duration.

Definition 4

[22] The set $\mathcal{N}_{s}(\Theta_{*})=\{\Theta\in\mathbb{R}^{(n+m)\times n}:\;\|\Theta-\Theta_{*}\|\leq\epsilon\}$ for some $\epsilon>0$ is a $(\kappa,\gamma)$-stabilizing neighborhood of the system (1-2) if, for any $\Theta^{\prime}\in\mathcal{N}_{s}(\Theta_{*})$, $K(\Theta^{\prime},Q,R)$ is $(\kappa^{\prime},\gamma^{\prime})$-strongly stabilizing on the system (1) with $\kappa^{\prime}\leq\kappa$.

The initialization phase, with a given initial policy $K_{0}^{i}$, is carried out for all modes to obtain an estimate $\Theta_{0}^{i}$ with $\|\Theta_{0}^{i}-\Theta_{*}^{i}\|\leq\epsilon_{i}$ that fulfills two goals.

First, it resides in a $(\kappa^{i}_{0},\gamma^{i}_{0})$-stabilizing neighborhood. In other words, the confidence ellipsoid $\bar{\mathcal{C}}^{i}_{t}(\delta)$ is guaranteed to be a subset of a $(\kappa^{i}_{0},\gamma^{i}_{0})$-strongly stabilizing neighborhood. This guarantees that Algorithm 1 starts with a policy as good as the initial $(\kappa_{0}^{i},\gamma_{0}^{i})$-stabilizing one, $K_{0}^{i}$.

The following theorem, adopted from [22], gives the required duration $T_{0}^{i}$ to fulfill the first goal.

Theorem 5

[22] For any mode $i$ there are explicit constants $C^{i}_{0}$ and $\epsilon^{i}_{0}=poly({\alpha^{i}_{0}}^{-1},\alpha^{i}_{1},\vartheta_{i},\nu_{i},n,m)$ such that, for a given $(\kappa^{i}_{0},\gamma^{i}_{0})$-strongly stabilizing input $K^{i}_{0}$, the smallest time $T^{i}_{0}$ that satisfies

$$\frac{80}{T^{i}_{0}\sigma^{2}_{\omega}}\bigg(\sigma_{\omega}\bigg[2n\Big(\log\frac{n}{\delta}+\log\Big(1+\frac{300\sigma_{\omega}^{2}{\kappa^{i}_{0}}^{4}}{{\gamma_{0}^{i}}^{2}}(n+\vartheta^{2}\kappa^{2}_{0})\log\frac{T^{i}_{0}}{\delta}\Big)\Big)\bigg]^{1/2}+\sqrt{\lambda}\vartheta^{i}\bigg)^{2}\leq\min\bigg\{{\kappa^{i}_{0}}^{2}\frac{\alpha^{i}_{0}\sigma^{2}_{\omega}}{C^{i}_{0}},{\epsilon_{0}^{i}}^{2}\bigg\}:=\tilde{\epsilon}_{i}^{2} \tag{76}$$

guarantees that $K(\hat{\Theta}^{i}_{T^{i}_{0}},Q^{i},R^{i})$ applied to the system $\Theta^{i}_{*}$ produces a $(\kappa^{i}_{0},\gamma^{i}_{0})$-strongly stable closed-loop system, where $\hat{\Theta}^{i}_{T^{i}_{0}}$ is the center of the confidence ellipsoid $\mathcal{C}^{i}_{T^{i}_{0}}(\delta)$ constructed by Algorithm 2.

Before describing the second goal, we need an upper bound on the duration of the posed problem, given the number of switches $n_{s}$. The following lemma gives an upper bound on the duration of an epoch.

Lemma 5

The maximum duration of an epoch is

$$\tau^{dw}_{max}:=-\frac{\ln\kappa^{*}}{\ln(1-\gamma^{*})}, \tag{77}$$

where, with the definition $\kappa_{i}:=\sqrt{2\nu_{i}/(\sigma_{\omega}^{2}\alpha_{0}^{i})}$,

$$\kappa^{*}=\max_{i\in\{1,...,|\mathcal{B}^{*}|\}}\kappa_{i},\;\;\;\text{and}\;\;\gamma^{*}=\frac{1}{2{\kappa^{*}}^{2}}. \tag{78}$$
Proof:

The proof is straightforward. Applying a strongly stabilizing $K$ to the system with initial state $x_{t_{0}}$ guarantees the boundedness of the state trajectory as follows:

$$\|x_{t}\|\leq\kappa_{i}(1-\gamma_{i})^{t-t_{0}}\|x_{t_{0}}\|+\frac{\kappa_{i}}{\gamma_{i}}\max\limits_{t_{0}\leq s\leq t}\|w_{s}\|. \tag{79}$$

The proof is completed by noting that the worst mode after a switch is the least contractive one, i.e., the one with $\kappa_{i}=\kappa^{*}$. ∎
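The quantities in (77) and (78) are simple to evaluate. The sketch below uses hypothetical per-mode values of $\nu_{i}$ and $\alpha_{0}^{i}$; it also checks one reading of (77), namely that $\tau^{dw}_{max}$ is the time at which the transient factor $\kappa^{*}(1-\gamma^{*})^{t-t_{0}}$ in (79) has decayed to one.

```python
import math


def kappa_of_mode(nu, alpha0, sigma_w):
    """kappa_i = sqrt(2 nu_i / (sigma_w^2 alpha_0^i))."""
    return math.sqrt(2.0 * nu / (alpha0 * sigma_w ** 2))


def max_epoch_duration(nus, alpha0s, sigma_w):
    """Return (kappa_star, gamma_star, tau_max) as in (77) and (78)."""
    kappa_star = max(kappa_of_mode(nu, a0, sigma_w) for nu, a0 in zip(nus, alpha0s))
    gamma_star = 1.0 / (2.0 * kappa_star ** 2)
    tau_max = -math.log(kappa_star) / math.log(1.0 - gamma_star)
    return kappa_star, gamma_star, tau_max


# Hypothetical two-mode example with unit noise level.
kappa_star, gamma_star, tau_max = max_epoch_duration([4.0, 9.0], [1.0, 1.0], 1.0)
print(kappa_star, gamma_star, tau_max)
```

Since $\nu_{i}/(\alpha_{0}^{i}\sigma_{\omega}^{2})\geq 1$, each $\kappa_{i}$ is at least $\sqrt{2}$, so $\gamma^{*}\leq 1/4$ and the logarithm in the denominator is well defined.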

The second goal is that the estimate $\Theta_{0}^{i}$ satisfies $\|\Theta^{i}_{0}-\Theta^{i}_{*}\|\leq\bar{\epsilon}_{i}$, where $\bar{\epsilon}_{i}\bar{\lambda}_{\max}^{i}<1$ and $\bar{\lambda}_{\max}^{i}=\max_{\tau_{q}}\lambda_{\tau_{q}}^{i}$, with the $\tau_{q}$'s being the start times of epochs. It is shown in Corollary 5 of [5] that if $n_{s}\tau^{dw}_{max}>poly(n,\nu,\vartheta^{i},\alpha_{0}^{-1},\sigma_{\omega}^{-1},\kappa_{0},\gamma_{0}^{-1},\log(\delta^{-1}))$, then running the warm-up phase for

$$\tilde{T}_{0}=\mathcal{O}\bigg(\frac{n^{2}\nu_{i}^{2}\vartheta_{i}}{\alpha_{0}^{5}\sigma_{\omega}^{10}}\sqrt{(n_{s}\tau_{max}^{dw})\log^{2}\frac{n_{s}\tau_{max}^{dw}}{\delta}}\bigg) \tag{80}$$

time steps fulfills the second goal.

In a nutshell, to achieve an initial estimate $\Theta_{0}^{i}$ such that $\|\Theta_{0}^{i}-\Theta_{*}^{i}\|\leq\min\{\bar{\epsilon}_{i},\tilde{\epsilon}_{i}\}:=\epsilon_{i}$, we need to run the warm-up algorithm for $T_{0}$ time steps, where $T_{0}$ satisfies (76) and (80).

We can relax the assumption of having an initial stabilizing $K_{0}^{i}$ by adapting the results of [26] and [30]; however, we keep this assumption in order to focus on the core contribution of the paper, learning the mode-dependent minimum dwell time.

A-E Stability Analysis

We first need the following lemmas as required ingredients to prove Theorem 1 and afterward Theorem 2.

Lemma 6

(Perturbation Lemma [7]) Let $X$ and $\Delta$ be matrices of matching sizes and let $\Delta\Delta^{T}\leq rV^{-1}$ for a positive definite matrix $V$. Then for any $P\geq 0$ and $\mu\geq r(1+2\|X\|\|V\|^{1/2})$, we have

$$-\mu\|P\|_{*}V^{-1}\leq(X+\Delta)^{\top}P(X+\Delta)-X^{\top}PX\leq\mu\|P\|_{*}V^{-1}.$$
Proof:

The proof follows the same steps as Lemma 24 in [7]; the only difference is the definition of $\mu$, which differs slightly from the one in [7] because, unlike [7], we have not normalized our confidence ellipsoid. ∎

Lemma 7

(Lemma 16 in [7]) Assume that at time $\tau_{q}$ a switch to mode $i$ occurs. Let $\mu_{i}\geq r^{i}_{\tau_{q}}(1+2\vartheta_{i}\|V^{i}_{n_{i}(\tau_{q})}\|^{1/2})$ and $V^{i}_{n_{i}(\tau_{q})}\geq(4\nu_{i}\mu_{i}/\alpha^{i}_{0}\sigma_{\omega}^{2})I$ (this holds by the rationale illustrated in Theorem 1). Let $\Sigma(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})$ be the solution of the relaxed primal SDP (39) and let $P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})$ denote the solution of the relaxed dual SDP (39). Then $\Sigma_{xx}(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})$ is invertible and

$$\begin{aligned}P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})={}&Q^{i}+K^{\top}(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})R^{i}K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\\ &+(\hat{A}^{i}_{\tau_{q}}+\hat{B}_{\tau_{q}}^{i}K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i}))^{\top}P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})(\hat{A}^{i}_{\tau_{q}}+\hat{B}_{\tau_{q}}^{i}K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i}))\\ &-\mu_{i}\|P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\|_{*}\begin{pmatrix}I\\ K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\end{pmatrix}{V^{i}}^{-1}_{n_{i}(\tau_{q})}\begin{pmatrix}I\\ K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\end{pmatrix}^{\top}\end{aligned} \tag{81}$$

where $(\hat{A}^{i}_{\tau_{q}},\hat{B}^{i}_{\tau_{q}})^{\top}=\hat{\Theta}^{i}_{\tau_{q}}$ and $K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})=\Sigma_{ux}(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i}){\Sigma}^{-1}_{xx}(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})$.

Proof of Theorem 1

Proof:

First, we need a $\mu_{i}$ such that $\mu_{i}\geq r^{i}_{\tau_{q}}(1+2\vartheta_{i}\|V^{i}_{n_{i}(\tau_{q})}\|^{1/2})$. We denote it by $\bar{\mu}^{\tau_{q}}_{i}$, defined as follows:

$$r^{i}_{\tau_{q}}(1+2\vartheta_{i}\|V^{i}_{n_{i}(\tau_{q})}\|^{1/2})\leq r^{i}_{\tau_{q}}\bigg(1+2\vartheta_{i}\Big(n_{i}(\tau_{q})+\Big\|\sum_{k=1}^{n_{i}(\tau_{q})}z^{i}_{k}{z^{i}_{k}}^{\top}\Big\|\Big)^{0.5}\bigg):=\bar{\mu}^{\tau_{q}}_{i}, \tag{82}$$

where $r^{i}_{\tau_{q}}$ is $\mathcal{O}\big(\log(n_{i}(\tau_{q}))\big)$ considering the definition of $\epsilon_{i}$. In the inequality above, we apply the fact that $\lambda_{i}^{\tau_{q}}<n_{i}(\tau_{q})$, considering the definition given for $\lambda_{i}^{\tau_{q}}$ below.

Applying the perturbation Lemma 6 in (81) of Lemma 7 yields

$$\begin{aligned}P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\succeq{}&Q^{i}+K^{\top}(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})R^{i}K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\\ &+(A_{*}+B_{*}^{i}K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i}))^{\top}P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})(A_{*}+B_{*}^{i}K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i}))\\ &-2\bar{\mu}_{i}^{\tau_{q}}\|P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\|_{*}\begin{pmatrix}I\\ K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\end{pmatrix}{V^{i}}^{-1}_{n_{i}(\tau_{q})}\begin{pmatrix}I\\ K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\end{pmatrix}^{\top}.\end{aligned} \tag{83}$$

Recalling the dual formulation (39) and the fact that $\nu_{i}\geq J_{*}^{i}$, one can write

$$\nu_{i}\geq J_{*}^{i}\geq P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\bullet W\geq\begin{pmatrix}Q^{i}&0\\ 0&R^{i}\end{pmatrix}\bullet W\geq\alpha^{i}_{0}\sigma_{\omega}^{2}$$

which clearly shows that $\nu_{i}/\alpha^{i}_{0}\sigma_{\omega}^{2}\geq 1$.

Now we let $\lambda^{\tau_{q}}_{i}:=4\nu_{i}\bar{\mu}^{\tau_{q}}_{i}/\alpha^{i}_{0}\sigma_{\omega}^{2}$. Together with $\|P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\|_{*}\leq\nu_{i}/\sigma_{\omega}^{2}$ and the fact that $V^{i}_{n_{i}(t)}\geq\lambda^{\tau_{q}}_{i}I$ for all $t$, this yields

$$\bar{\mu}_{i}\|P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\|_{*}{V^{i}}_{n_{i}({\tau_{q}})}^{-1}\leq\bar{\mu}_{i}\frac{\nu_{i}}{\sigma_{\omega}^{2}}\frac{\alpha^{i}_{0}\sigma_{\omega}^{2}}{4\nu_{i}\bar{\mu}_{i}}I=\frac{\alpha^{i}_{0}}{4}I. \tag{84}$$

Plugging (84) together with the assumptions $Q^{i},R^{i}\geq\alpha^{i}_{0}I$ into (83) yields

$$\begin{aligned}P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\geq{}&\frac{\alpha^{i}_{0}}{2}I+\frac{\alpha^{i}_{0}}{2}K^{\top}(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\\ &+(A_{*}+B^{i}_{*}K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i}))^{\top}P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})(A_{*}+B^{i}_{*}K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})).\end{aligned} \tag{85}$$

Defining κi=2νiα0iσω2\kappa_{i}=\sqrt{\frac{2\nu_{i}}{\alpha_{0}^{i}\sigma_{\omega}^{2}}} and using the fact that P(Θ^τqi,Qi,Ri)νiσω2\|P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\|\leq\frac{\nu_{i}}{\sigma_{\omega}^{2}} (which implies κi2P(Θ^τqi,Qi,Ri)νiκi2σω2I\kappa_{i}^{-2}P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\leq\frac{\nu_{i}\kappa_{i}^{-2}}{\sigma_{\omega}^{2}}I),

it holds true that

P(Θ^τqi,Qi,Ri)12α0iI(1κi2)P(Θ^τqi,Qi,Ri)\displaystyle P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})-\frac{1}{2}\alpha^{i}_{0}I\leq(1-\kappa_{i}^{-2})P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i}) (86)

From (85) and (86) one has:

P1/2(Θ^τqi,Qi,Ri)(A+BiK(Θ^τqi,Qi,Ri))×\displaystyle{P}^{-1/2}(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})(A_{*}+B^{i}_{*}K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i}))^{\top}\times
P(Θ^τqi,Qi,Ri)(A+BiK(Θ^τqi,Qi,Ri))P1/2(Θ^τqi,Qi,Ri)\displaystyle P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})(A_{*}+B^{i}_{*}K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})){P}^{-1/2}(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})
(1κi2)I.\displaystyle\leq(1-\kappa_{i}^{-2})I. (87)

Denoting

Hτqi=\displaystyle H_{\tau_{q}}^{i}= P1/2(Θ^τqi,Qi,Ri)\displaystyle{P}^{1/2}(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})
Lτqti=\displaystyle L_{\tau_{q}}t^{i}= P1/2(Θ^τqi,Qi,Ri)(A+BiK(Θ^τqi,Qi,Ri))\displaystyle{P}^{-1/2}(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})(A_{*}+B^{i}_{*}K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i}))
P1/2(Θ^τqi,Qi,Ri)\displaystyle{P}^{1/2}(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})

(87) leads Lτqi1κi211/2κi2\|L^{i}_{\tau_{q}}\|\leq\sqrt{1-\kappa_{i}^{-2}}\leq 1-1/2\kappa_{i}^{-2}.

Furthermore, (85) implies $P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\geq\frac{\alpha^{i}_{0}}{2}K^{\top}(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})$, which together with $\|P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\|\leq\frac{\nu_{i}}{\sigma_{\omega}^{2}}$ gives $\|K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\|\leq\kappa_{i}$. Note that the policy is fixed within an epoch: for any $t\in[\tau_{q},\;\tau_{q+1}]$, $K(\hat{\Theta}^{i}_{t},Q^{i},R^{i})=K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})$.

The bounds $\frac{\alpha_{0}^{i}}{2}I\leq P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\leq\frac{\nu_{i}}{\sigma_{\omega}^{2}}I$ also yield $\|H^{i}_{\tau_{q}}\|\leq\sqrt{\nu_{i}/\sigma_{\omega}^{2}}$ and $\|{H^{i}_{\tau_{q}}}^{-1}\|\leq\sqrt{2/\alpha^{i}_{0}}$. This completes the proof: the gains $K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})$ are $(\kappa_{i},\gamma_{i})$-strongly stabilizing. ∎
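As a numerical illustration of the last step, the following sketch builds a symmetric matrix $P$ whose eigenvalues lie between $\alpha_0^i/2$ and $\nu_i/\sigma_\omega^2$ and checks the norm bounds on $H=P^{1/2}$ and $H^{-1}$. The numerical values, the random orthogonal basis, and the helper name `strong_stability_constants` are illustrative assumptions and do not come from the paper.

```python
import numpy as np


def strong_stability_constants(P, alpha0, nu, sigma_w2):
    """Return kappa, ||H|| and ||H^{-1}|| for H = P^{1/2} (hypothetical helper)."""
    kappa = np.sqrt(2.0 * nu / (alpha0 * sigma_w2))
    eigvals, eigvecs = np.linalg.eigh(P)  # P is symmetric positive definite
    H = eigvecs @ np.diag(np.sqrt(eigvals)) @ eigvecs.T
    H_inv = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
    return kappa, np.linalg.norm(H, 2), np.linalg.norm(H_inv, 2)


# Illustrative values (assumptions): alpha0 / 2 = 0.5 and nu / sigma_w2 = 8.
alpha0, nu, sigma_w2 = 1.0, 8.0, 1.0
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # random orthogonal basis
P = basis @ np.diag([alpha0 / 2.0, 2.0, nu / sigma_w2]) @ basis.T
P = 0.5 * (P + P.T)  # symmetrize against round-off

kappa, norm_H, norm_H_inv = strong_stability_constants(P, alpha0, nu, sigma_w2)
print(kappa, norm_H, norm_H_inv)
```

With these values the product $\|H\|\,\|H^{-1}\|$ meets the bound $\kappa_i$ with equality (up to round-off), because the extreme eigenvalues of $P$ were placed exactly at the two limits.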

A-F Minimum Dwell Time

Before proving Theorem 2, we need the following lemma.

Lemma 8

[7] For mode $i$, with the $q$-th epoch starting at $\tau_{q}$,

$$
P(\hat{\Theta}_{\tau_{q}}^{i},Q^{i},R^{i})\preceq P({\Theta}_{*}^{i},Q^{i},R^{i})\preceq P(\hat{\Theta}_{\tau_{q}}^{i},Q^{i},R^{i})+\chi_{\tau_{q}}^{i} \tag{88}
$$

where

$$
\chi_{\tau_{q}}^{i}:=\frac{2\kappa_{i}^{2}\mu_{i}^{\tau_{q}}}{\gamma}\|P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\|_{*}\,\bigg\|\begin{pmatrix}I\\ K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\end{pmatrix}{V^{i}}^{-1}_{n_{i}(\tau_{q})}\begin{pmatrix}I\\ K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\end{pmatrix}^{\top}\bigg\|\,I \tag{89}
$$

holds true.

Proof of Theorem 2

Proof:

Let the system operate in mode $i$ within an epoch starting at time $\tau_{q}$ and ending at $\tau_{q+1}$. For any $t\in[\tau_{q},\;\tau_{q+1}]$, we take the following candidate Lyapunov function:

$$
\mathcal{V}_{i}(t)=x_{t}^{\top}P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})x_{t} \tag{90}
$$

where $P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})$ is the solution of the relaxed dual program (39) for mode $i$, computed from $n_{i}(\tau_{q})$ data points. Since this solution is positive definite, $\mathcal{V}_{i}$ is positive definite and radially unbounded. Also, by the Rayleigh-Ritz inequality, one can write

$$
\begin{aligned}
\underline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)\mathbb{E}[x_{t}^{\top}x_{t}|\mathcal{F}_{t-1}]&\leq\mathbb{E}[\mathcal{V}_{i}(t)|\mathcal{F}_{t-1}]\\
&\leq\overline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)\mathbb{E}[x_{t}^{\top}x_{t}|\mathcal{F}_{t-1}].
\end{aligned}
\tag{91}
$$

By the martingale difference property of the process noise, we have

$$
\begin{aligned}
\mathbb{E}[\mathcal{V}_{i}(t+1)|\mathcal{F}_{t}]-\mathcal{V}_{i}(t)={}&x^{\top}_{t}(A^{i}_{*}+B^{i}_{*}K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i}))^{\top}P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})(A^{i}_{*}+B^{i}_{*}K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i}))x_{t}\\
&+\mathbb{E}[\omega^{\top}_{t+1}P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\omega_{t+1}|\mathcal{F}_{t}]-x^{\top}_{t}P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})x_{t}.
\end{aligned}
\tag{92}
$$

Now, using (83), one can write

$$
\begin{aligned}
&P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})-Q^{i}-K^{\top}(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})R^{i}K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\\
&\quad+2\mu_{i}\|P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\|_{*}\begin{pmatrix}I\\ K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\end{pmatrix}{V^{i}}^{-1}_{n_{i}(\tau_{q})}\begin{pmatrix}I\\ K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\end{pmatrix}^{\top}\\
&\succeq(A^{i}_{*}+B_{*}^{i}K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i}))^{\top}P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})(A^{i}_{*}+B_{*}^{i}K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})).
\end{aligned}
\tag{93}
$$

Substituting (93) into (92) yields

$$
\mathbb{E}[\mathcal{V}_{i}(t+1)|\mathcal{F}_{t}]-\mathcal{V}_{i}(t)\leq-x^{\top}_{t}\mathcal{H}(\mathcal{C}^{i}_{\tau_{q}}(\delta))x_{t}+\overline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)W \tag{94}
$$

in which

$$
\begin{aligned}
\mathcal{H}(\mathcal{C}^{i}_{\tau_{q}}(\delta))={}&Q^{i}+K^{\top}(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})R^{i}K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\\
&-2\mu_{i}^{\tau_{q}}\|P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\|_{*}\begin{pmatrix}I\\ K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\end{pmatrix}{V^{i}}^{-1}_{n_{i}(\tau_{q})}\begin{pmatrix}I\\ K(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\end{pmatrix}^{\top}.
\end{aligned}
\tag{95}
$$

Note that (95) can be computed directly from the confidence ellipsoid $\mathcal{C}^{i}_{\tau_{q}}(\delta)$.

Observing that

$$
\mathbb{E}[\mathbb{E}[\mathcal{V}_{i}(t+1)|\mathcal{F}_{t}]|\mathcal{F}_{t-1}]=\mathbb{E}[\mathcal{V}_{i}(t+1)|\mathcal{F}_{t}]
$$

holds true, and that

$$
\mathbb{E}[\mathcal{H}(\mathcal{C}^{i}_{\tau_{q}}(\delta))|\mathcal{F}_{t-1}]=\mathcal{H}(\mathcal{C}^{i}_{\tau_{q}}(\delta))
$$

and

$$
\mathbb{E}[P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})|\mathcal{F}_{t-1}]=P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})
$$

hold for all $t\geq\tau_{q}$, we can rewrite (94) as follows:

$$
\begin{aligned}
&\mathbb{E}[\mathcal{V}_{i}(t+1)|\mathcal{F}_{t}]-\mathbb{E}[\mathcal{V}_{i}(t)|\mathcal{F}_{t-1}]\\
&\quad\leq-\underline{\lambda}\big(\mathcal{H}(\mathcal{C}^{i}_{\tau_{q}}(\delta))\big)\mathbb{E}[x_{t}^{\top}x_{t}|\mathcal{F}_{t-1}]+\overline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)\sigma_{\omega}^{2}.
\end{aligned}
\tag{96}
$$

Combining this with (91) gives

$$
\begin{aligned}
&\mathbb{E}[\mathcal{V}_{i}(t+1)|\mathcal{F}_{t}]-\mathbb{E}[\mathcal{V}_{i}(t)|\mathcal{F}_{t-1}]\\
&\quad\leq-\frac{\underline{\lambda}\big(\mathcal{H}(\mathcal{C}^{i}_{\tau_{q}}(\delta))\big)}{\overline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)}\mathbb{E}[\mathcal{V}_{i}(t)|\mathcal{F}_{t-1}]+\overline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)\sigma_{\omega}^{2},
\end{aligned}
\tag{97}
$$

which can be written compactly as

$$
\mathbb{E}[\mathcal{V}_{i}(t+1)|\mathcal{F}_{t}]\leq\Big(1-\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)\Big)\mathbb{E}[\mathcal{V}_{i}(t)|\mathcal{F}_{t-1}]+\overline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)\sigma_{\omega}^{2} \tag{98}
$$

where

$$
\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big):=\frac{\underline{\lambda}\big(\mathcal{H}(\mathcal{C}^{i}_{\tau_{q}}(\delta))\big)}{\overline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)}. \tag{99}
$$

By carrying the inequality (98) through an epoch, starting from $\tau_{q+1}^{-}$ (just prior to the next switch time) and propagating back to $\tau_{q}^{+}$ (immediately after the current switch time), we arrive at

$$
\begin{aligned}
\mathbb{E}[\mathcal{V}_{i}(\tau_{q+1}^{-})|\mathcal{F}_{\tau_{q+1}-1}]\leq{}&\Big(1-\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)\Big)^{\tau_{q+1}-\tau_{q}}\mathbb{E}[\mathcal{V}_{i}(\tau_{q}^{+})|\mathcal{F}_{\tau_{q}-1}]\\
&+\sum_{k=\tau_{q}+1}^{\tau_{q+1}}\Big(1-\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)\Big)^{k-\tau_{q}-1}\overline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)\sigma_{\omega}^{2}.
\end{aligned}
\tag{100}
$$
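The unrolling behind (100) can be checked on a scalar recursion. The sketch below iterates $v_{t+1}=(1-\eta)v_{t}+c$, which is (98) taken with equality, and compares the result with the closed form on the right-hand side of (100). The function names and all numbers are illustrative assumptions, with `c` standing in for $\overline{\lambda}(P)\sigma_{\omega}^{2}$.

```python
import math


def iterate_recursion(v0, eta, c, steps):
    """Apply v <- (1 - eta) * v + c for the given number of steps."""
    v = v0
    for _ in range(steps):
        v = (1.0 - eta) * v + c
    return v


def closed_form(v0, eta, c, steps):
    """Right-hand side of (100): decayed initial value plus a geometric sum."""
    geometric_sum = sum((1.0 - eta) ** k for k in range(steps))
    return (1.0 - eta) ** steps * v0 + c * geometric_sum


# Illustrative values (assumptions); steps plays the role of tau_{q+1} - tau_q.
v0, eta, c, steps = 10.0, 0.2, 0.5, 15
iterated = iterate_recursion(v0, eta, c, steps)
unrolled = closed_form(v0, eta, c, steps)
print(iterated, unrolled)
```

Since $0<\eta<1$, the geometric sum is at most $1/\eta$, so the additive term never exceeds $c/\eta$ regardless of the epoch length.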

Suppose the system, after operating in mode $i$ up to time $\tau_{q+1}$, switches to mode $j$. We introduce the following candidate Lyapunov function for mode $j$:

$$
\mathcal{V}_{j}(t)=x_{t}^{\top}P(\hat{\Theta}^{j}_{\tau_{q}},Q^{j},R^{j})x_{t}. \tag{101}
$$

Then one can establish

$$
\begin{aligned}
\mathbb{E}[\mathcal{V}_{i}(\tau_{q+1}^{-})|\mathcal{F}_{\tau_{q+1}-1}]&\geq\underline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)\mathbb{E}[x_{\tau_{q+1}}^{\top}x_{\tau_{q+1}}|\mathcal{F}_{\tau_{q+1}-1}]\\
&\geq\frac{\underline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)}{\overline{\lambda}\big(P(\hat{\Theta}^{j}_{\tau_{q}},Q^{j},R^{j})\big)}\mathbb{E}[\mathcal{V}_{j}(\tau_{q+1}^{+})|\mathcal{F}_{\tau_{q+1}-1}]
\end{aligned}
\tag{102}
$$

where in the first inequality we applied (91) for mode $i$ and in the second the analogous bound for the mode-$j$ function (101).

By defining

$$
\rho(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta)):=\frac{\overline{\lambda}\big(P(\hat{\Theta}^{j}_{\tau_{q}},Q^{j},R^{j})\big)}{\underline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)} \tag{103}
$$

and combining (102) with the left-hand side of (100), we obtain

$$
\begin{aligned}
\mathbb{E}[\mathcal{V}_{j}(\tau_{q+1}^{+})|\mathcal{F}_{\tau_{q+1}-1}]\leq{}&\rho(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))\Big(1-\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)\Big)^{\tau_{q+1}-\tau_{q}}\mathbb{E}[\mathcal{V}_{i}(\tau_{q}^{+})|\mathcal{F}_{\tau_{q}-1}]\\
&+\sum_{k=\tau_{q}+1}^{\tau_{q+1}}\Big(1-\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)\Big)^{k-\tau_{q}-1}\overline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)\sigma_{\omega}^{2}.
\end{aligned}
\tag{104}
$$

Now applying (91) yields

$$
\begin{aligned}
\mathbb{E}[x^{\top}_{\tau_{q+1}^{+}}x_{\tau_{q+1}^{+}}|\mathcal{F}_{\tau_{q+1}-1}]\leq{}&\rho(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))\mathcal{X}(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))\Big(1-\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)\Big)^{\tau_{q+1}-\tau_{q}}\mathbb{E}[x^{\top}_{\tau_{q}^{+}}x_{\tau_{q}^{+}}|\mathcal{F}_{\tau_{q}-1}]\\
&+\sum_{k=\tau_{q}+1}^{\tau_{q+1}}\Big(1-\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)\Big)^{k-\tau_{q}-1}\mathcal{X}(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))\sigma_{\omega}^{2}
\end{aligned}
\tag{105}
$$

where

$$
\mathcal{X}(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta)):=\frac{\overline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)}{\underline{\lambda}\big(P(\hat{\Theta}^{j}_{\tau_{q}},Q^{j},R^{j})\big)}.
$$

The second term on the right-hand side of (105) is bounded by the following argument. Recalling the definition of $\mathcal{H}(\mathcal{C}^{i}_{\tau_{q}}(\delta))$ in (95) and using (83) together with $P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\succeq 0$, we conclude that $P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\succeq\mathcal{H}(\mathcal{C}^{i}_{\tau_{q}}(\delta))$, which results in $\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)<1$. Also, by the stability condition (84), it is straightforward to see that $\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)>0$. Hence the geometric series with ratio $1-\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)$ has a bounded sum. Together with $\overline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)\leq\nu_{i}/\sigma_{\omega}^{2}$, this guarantees the boundedness of the second term on the right-hand side of (105).

To control the growth of the expected state norm, the following condition must hold for a user-defined $\mathcal{Y}<1$:

$$
\rho(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))\,\mathcal{X}(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))\Big(1-\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)\Big)^{\tau_{dw}^{q}}\leq\mathcal{Y}. \tag{106}
$$

This is achieved by requiring that $\tau_{dw}^{q}:=\tau_{q+1}-\tau_{q}$, the dwell time of the epoch starting at time $\tau_{q}$, satisfies

$$
\tau_{dw}^{q}\geq-\frac{\ln\rho(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))+\ln\mathcal{X}(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))-\ln\mathcal{Y}}{\ln\Big(1-\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)\Big)}. \tag{107}
$$

Note that when

$$
\ln\rho(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))+\ln\mathcal{X}(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))<\ln\mathcal{Y}
$$

the situation is benign: there is no need to prolong mode $i$, and a subsequent switch can occur quickly. However, since it is mandatory to spend a minimum of one second in each mode, we define the minimum mode-specific dwell time as follows:

$$
[\tau_{md}^{ij}]_{q}:=\max\Bigg\{1,\;-\frac{\ln\rho(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))+\ln\mathcal{X}(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))-\ln\mathcal{Y}}{\ln\Big(1-\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)\Big)}\Bigg\} \tag{108}
$$

which completes the proof. ∎
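To make (108) concrete, the following sketch evaluates $\eta$, $\rho$, $\mathcal{X}$ and the resulting minimum dwell time from given symmetric matrices. It is an illustration under stated assumptions rather than the paper's implementation: the function name `min_dwell_time`, the example matrices, and the rounding up to a whole number of steps are all choices made here, and `H_i` stands in for $\mathcal{H}(\mathcal{C}^{i}_{\tau_{q}}(\delta))$ from (95).

```python
import math

import numpy as np


def min_dwell_time(P_i, P_j, H_i, Y):
    """Evaluate (108) for symmetric P_i, P_j, H_i and a target Y < 1."""
    eig_i = np.linalg.eigvalsh(P_i)  # eigenvalues in ascending order
    eig_j = np.linalg.eigvalsh(P_j)
    eta = np.linalg.eigvalsh(H_i)[0] / eig_i[-1]  # decay rate, as in (99)
    if not 0.0 < eta < 1.0:
        raise ValueError("eta must lie in (0, 1)")
    rho = eig_j[-1] / eig_i[0]  # ratio defined in (103)
    chi = eig_i[-1] / eig_j[0]  # the script-X ratio defined after (105)
    bound = -(math.log(rho) + math.log(chi) - math.log(Y)) / math.log(1.0 - eta)
    # Assumption: round up so the dwell time is a whole number of steps.
    return max(1, math.ceil(bound)), rho, chi, eta


# Illustrative matrices (assumptions), chosen so that H_i <= P_i.
P_i = np.diag([1.0, 4.0])
P_j = np.diag([2.0, 3.0])
H_i = np.diag([0.5, 1.0])
tau_md, rho, chi, eta = min_dwell_time(P_i, P_j, H_i, Y=0.5)
print(tau_md)  # 19 for these matrices
```

For these values $\rho\mathcal{X}=6$ and $\eta=0.125$, so condition (106) first holds at 19 steps: $6\cdot 0.875^{19}\approx 0.47\leq 0.5$, whereas $6\cdot 0.875^{18}\approx 0.54$.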

Proof of Lemma 3

Proof:

Our first step is to determine a lower bound for $\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)$. Given the definition of $\mathcal{H}(\mathcal{C}^{i}_{\tau_{q}}(\delta))$ in (95), we can conclude that

$$
\mathcal{H}(\mathcal{C}^{i}_{\tau_{q}}(\delta))\geq\alpha_{0}(1+\kappa^{2})-2\frac{\alpha_{0}}{4}(1+\kappa^{2})=\frac{\alpha_{0}}{2}(1+\kappa^{2})
$$

in which we applied (84). Consequently, following the definitions of $\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)$ and $\mathcal{X}(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))$, one can write

$$
\begin{aligned}
\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)&\geq\frac{\frac{\alpha^{i}_{0}}{2}(1+\kappa_{i}^{2})}{\frac{\nu^{i}}{\sigma_{\omega}^{2}}}=\frac{(1+\kappa_{i}^{2})}{\kappa_{i}^{2}}\geq\frac{1}{\kappa_{i}^{2}}\geq\frac{1}{\kappa_{*}^{2}}:=\eta_{*},\\
\mathcal{X}(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))&\leq\frac{\frac{\nu^{i}}{\sigma_{\omega}^{2}}}{\frac{\alpha_{0}^{j}}{2}}=\kappa_{i}^{2}\frac{\alpha_{0}^{i}}{\alpha_{0}^{j}}\leq\kappa_{*}^{2}\frac{\alpha_{1}^{*}}{\alpha_{0}^{*}}:=\mathcal{X}_{*}
\end{aligned}
\tag{109}
$$

where the last inequality in each expression holds by the definitions $\max_{i\in|\mathcal{M}|}\kappa_{i}=:\kappa_{*}$ and $\alpha^{*}_{0}I\preceq Q^{i},R^{i}\preceq\alpha^{*}_{1}I$ for all $i\in|\mathcal{M}|$.

We begin with the proof of the first statement of the lemma. Equations (105) and (106) yield

$$
\begin{aligned}
\mathbb{E}[x^{\top}_{\tau_{q+1}^{+}}x_{\tau_{q+1}^{+}}|\mathcal{F}_{\tau_{q+1}-1}]&\leq\mathcal{Y}\,\mathbb{E}[x^{\top}_{\tau_{q}^{+}}x_{\tau_{q}^{+}}|\mathcal{F}_{\tau_{q}-1}]+\sum_{k=\tau_{q}+1}^{\tau_{q+1}}\Big(1-\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)\Big)^{k-\tau_{q}-1}\mathcal{X}(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))\sigma_{\omega}^{2} &&(110)\\
&\leq\mathcal{Y}\,\mathbb{E}[x^{\top}_{\tau_{q}^{+}}x_{\tau_{q}^{+}}|\mathcal{F}_{\tau_{q}-1}]+\frac{1}{\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)}\frac{\frac{\nu^{i}}{\sigma_{\omega}^{2}}}{\frac{\alpha_{0}^{j}}{2}}\sigma_{\omega}^{2} &&(111)\\
&\leq\mathcal{Y}\,\mathbb{E}[x^{\top}_{\tau_{q}^{+}}x_{\tau_{q}^{+}}|\mathcal{F}_{\tau_{q}-1}]+\kappa_{i}^{4}\frac{\alpha_{0}^{i}}{\alpha_{0}^{j}}\sigma_{\omega}^{2} &&(112)\\
&\leq\mathcal{Y}\,\mathbb{E}[x^{\top}_{\tau_{q}^{+}}x_{\tau_{q}^{+}}|\mathcal{F}_{\tau_{q}-1}]+\kappa_{*}^{4}\frac{\alpha_{1}^{*}}{\alpha_{0}^{*}}\sigma_{\omega}^{2} &&(113)
\end{aligned}
$$

in which we bounded the sum of the geometric series by $1/\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)$, using the fact that $0<\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)<1$, and employed the inequalities $\overline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)\leq\nu^{i}/\sigma_{\omega}^{2}$ and $\underline{\lambda}\big(P(\hat{\Theta}^{j}_{\tau_{q}},Q^{j},R^{j})\big)\geq\alpha_{0}^{j}/2$. We then applied (109). This completes the proof of the first part of the lemma.

To prove the second and third parts of the lemma, we proceed as follows.

Recalling the definition of the $k$-th switch time, $\tau_{k}=\sum_{q=0}^{k-1}[\tau_{md}^{i_{q}i_{q+1}}]_{q}$, and employing (105), we propagate from $\tau_{k}$ back to the initial time $t=0$. Using the quantities $\eta_{*}$ and $\mathcal{X}_{*}$ defined in (109) and applying the geometric series bound to the sum of powers of $(1-\eta_{*})$, we obtain

$$
\mathbb{E}[x^{\top}_{\tau_{k}^{+}}x_{\tau_{k}^{+}}|\mathcal{F}_{\tau_{k}-1}]\leq\mathcal{Y}^{k}x_{0}^{\top}x_{0}+\frac{\mathcal{X}_{*}}{\eta_{*}}\sigma_{\omega}^{2}\sum_{m=1}^{k-1}\mathcal{Y}^{m}. \tag{114}
$$

Moreover, for any time $t$ within the epoch that starts at $\tau_{k}$ and operates in mode $j$, applying (98) with an argument analogous to (100) gives

$$
\begin{aligned}
\mathbb{E}[\mathcal{V}_{j}(t)|\mathcal{F}_{t-1}]\leq{}&\Big(1-\eta\big(\mathcal{C}^{j}_{\tau_{k}}(\delta)\big)\Big)^{t-\tau_{k}}\mathbb{E}[\mathcal{V}_{j}(\tau_{k}^{+})|\mathcal{F}_{\tau_{k}-1}]\\
&+\sum_{m=\tau_{k}+1}^{t}\Big(1-\eta\big(\mathcal{C}^{j}_{\tau_{k}}(\delta)\big)\Big)^{m-\tau_{k}-1}\overline{\lambda}\big(P(\hat{\Theta}^{j}_{\tau_{k}},Q^{j},R^{j})\big)\sigma_{\omega}^{2}
\end{aligned}
\tag{115}
$$

which yields

$$
\mathbb{E}[x_{t}^{\top}x_{t}|\mathcal{F}_{t-1}]\leq\kappa_{*}^{2}\Big(1-\eta\big(\mathcal{C}^{j}_{\tau_{k}}(\delta)\big)\Big)^{t-\tau_{k}}\mathbb{E}[x^{\top}_{\tau_{k}^{+}}x_{\tau_{k}^{+}}|\mathcal{F}_{\tau_{k}-1}]+\frac{\mathcal{X}_{*}}{\eta_{*}}\sigma_{\omega}^{2}. \tag{116}
$$

Combining (114) and (116) for $\tau_{k}\leq t<\tau_{k+1}$, we obtain

$$
\begin{aligned}
\mathbb{E}[x_{t}^{\top}x_{t}|\mathcal{F}_{t-1}]&\leq\mathcal{Y}^{k}\kappa_{*}^{2}\Big(1-\eta\big(\mathcal{C}^{j}_{\tau_{k}}(\delta)\big)\Big)^{t-\tau_{k}}x_{0}^{\top}x_{0}+\Big(1+\frac{1}{1-\mathcal{Y}}\Big)\frac{\mathcal{X}_{*}^{2}}{\eta_{*}}\sigma_{\omega}^{2}\\
&\leq\mathcal{Y}^{k}\kappa_{*}^{2}x_{0}^{\top}x_{0}+\frac{2-\mathcal{Y}}{1-\mathcal{Y}}\frac{\mathcal{X}_{*}^{2}}{\eta_{*}}\sigma_{\omega}^{2}
\end{aligned}
\tag{117}
$$

which completes the proof of the second part. The third part follows directly from the second.

A-G Minimum Mode-dependent Dwell-Time Estimation Error

In this subsection, we give an upper bound on the dwell-time estimation error $[\tau_{md}^{ij}]_{\tau_{q}}-\tau_{*}^{ij}$, which is used in the regret bound analysis.

Proof of Theorem 3

Proof:

Proof of part (a)

The most pessimistic upper bound on the error in estimating the minimum mode-dependent dwell time, $[\tau_{md}^{ij}]_{\tau_{q}}-\tau^{ij}_{*}$, occurs when the switch is benign in the sense of Definition 2. In that case, $\tau_{*}^{ij}$ attains its lowest value of one. Consequently, we directly bound the minimum mode-dependent dwell time itself. The same result can also be used to prove part (b).

By Lemma 8, for mode $i$ we can write

$$
\begin{aligned}
\overline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)&\leq\overline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})+\chi^{i}_{\tau_{q}}\big)\\
&\leq\overline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)+\overline{\lambda}\big(\chi^{i}_{\tau_{q}}\big)\\
&\leq\kappa_{i}^{2}\underline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)+\overline{\lambda}\big(\chi^{i}_{\tau_{q}}\big)
\end{aligned}
\tag{118}
$$

where in the second inequality we applied Weyl's inequality and in the last inequality we applied

$$
\frac{\overline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)}{\underline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)}\leq\frac{\frac{\nu_{i}}{\sigma_{\omega}^{2}}}{\alpha_{0}^{i}/2}=\kappa_{i}^{2}. \tag{119}
$$

Dividing both sides of (118) by $\underline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)$ and noting that $\underline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)\geq\alpha_{0}^{i}/2$, one can write

$$
\frac{\overline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)}{\underline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)}\leq\kappa_{i}^{2}+\frac{2}{\alpha_{0}^{i}}\overline{\lambda}\big(\chi^{i}_{\tau_{q}}\big). \tag{120}
$$

Now, we have

$$
\begin{aligned}
&\ln\rho(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))+\ln\mathcal{X}(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))\\
&\quad=\ln\frac{\overline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)}{\underline{\lambda}\big(P(\hat{\Theta}^{i}_{\tau_{q}},Q^{i},R^{i})\big)}+\ln\frac{\overline{\lambda}\big(P(\hat{\Theta}^{j}_{\tau_{q}},Q^{j},R^{j})\big)}{\underline{\lambda}\big(P(\hat{\Theta}^{j}_{\tau_{q}},Q^{j},R^{j})\big)}\\
&\quad\leq\ln\Big(\kappa_{i}^{2}+\frac{2}{\alpha_{0}^{i}}\overline{\lambda}\big(\chi^{i}_{\tau_{q}}\big)\Big)+\ln\Big(\kappa_{j}^{2}+\frac{2}{\alpha_{0}^{j}}\overline{\lambda}\big(\chi^{j}_{\tau_{q}}\big)\Big)\\
&\quad=\ln\kappa_{i}^{2}+\ln\Big(1+\frac{2}{\kappa_{i}^{2}\alpha_{0}^{i}}\overline{\lambda}\big(\chi^{i}_{\tau_{q}}\big)\Big)+\ln\kappa_{j}^{2}+\ln\Big(1+\frac{2}{\kappa_{j}^{2}\alpha_{0}^{j}}\overline{\lambda}\big(\chi^{j}_{\tau_{q}}\big)\Big).
\end{aligned}
\tag{121}
$$

Now, considering the definition of the minimum dwell time and the fact that $\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)\geq 1/\kappa_{i}^{2}$, one can write

$$
-\ln\Big(1-\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)\Big)\geq-\ln\big(1-\kappa_{i}^{-2}\big) \tag{122}
$$

which yields

$$
[\tau_{md}^{ij}]_{\tau_{q}}-\tau_{*}^{ij}\leq[\tau_{md}^{ij}]_{\tau_{q}}\leq\frac{\sum_{s=i,j}\Big[\ln\kappa_{s}^{2}+\ln\Big(1+\frac{2}{\kappa_{s}^{2}\alpha_{0}^{s}}\overline{\lambda}\big(\chi^{s}_{\tau_{q}}\big)\Big)\Big]-\ln\mathcal{Y}}{-\ln\big(1-\kappa_{i}^{-2}\big)} \tag{123}
$$

and completes the proof. ∎

The following lemma provides a bound that is used in the regret bound analysis.

Lemma 9

The following statement holds true:

$$
[\tau_{md}^{ij}]_{\tau_{q}}-\tau_{*}^{ij}\leq\frac{\ln\frac{\kappa_{i}^{2}\kappa_{j}^{2}}{\mathcal{Y}}}{-\ln\big(1-\kappa_{i}^{-2}\big)}+\sum_{s=i,j}\frac{\sqrt{\frac{2}{\kappa_{s}^{2}\alpha_{0}^{s}}\overline{\lambda}\big(\chi^{s}_{\tau_{q}}\big)}}{-\ln\big(1-\kappa_{s}^{-2}\big)} \tag{124}
$$

Proof:

The proof follows by applying to each logarithmic term in (123) the inequality

$$
\ln\big(1+x\big)\leq\frac{x}{\sqrt{1+x}}\leq\sqrt{x},\quad\text{for } x>0.
$$
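As a quick sanity check of the inequality used in this proof, the sketch below evaluates $\ln(1+x)$, $x/\sqrt{1+x}$ and $\sqrt{x}$ on a grid of positive points. The grid range and the small relative tolerance are arbitrary choices made for illustration, and a finite grid is of course not a proof.

```python
import numpy as np

x = np.logspace(-3, 6, 2000)   # grid of positive test points (assumption)
log_term = np.log1p(x)         # ln(1 + x)
middle = x / np.sqrt(1.0 + x)  # bound used in the proof of Lemma 9
sqrt_term = np.sqrt(x)         # square-root form that appears in (124)

# A tiny relative tolerance guards against floating-point round-off.
first_ok = bool(np.all(log_term <= middle * (1.0 + 1e-12)))
second_ok = bool(np.all(middle <= sqrt_term))
print(first_ok, second_ok)
```

The second comparison holds because $x/\sqrt{1+x}\leq\sqrt{x}$ is equivalent to $\sqrt{x}\leq\sqrt{1+x}$, which is what turns the logarithmic terms of (123) into the square-root terms of (124).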

A-H Regret Bound Analysis (Proof of Theorem 4)

Recalling (57), to bound the first term $R_{T}^{1}$ we start by upper-bounding the regret in the epoch starting with the $k$-th switch, i.e., $\tau_{k}\leq t<\mathfrak{T}_{k}$. For brevity, we let $s_{1}:=\tau_{k}$, $s_{2}:=\mathfrak{T}_{k}$, $i:=i_{k}$, and $j:=i_{k+1}$. The regret incurred in this epoch is then written as

$$
\mathcal{R}_{\tau_{k}:\mathfrak{T}_{k}}=\sum_{t=\tau_{k}}^{\mathfrak{T}_{k}-1}\big(c_{t}^{i}-J_{*}(\Theta_{*}^{i},Q^{i},R^{i})\big)
$$

where

$$
c_{t}^{i}=x_{t}^{\top}\big(Q^{i}+K^{\top}(\hat{\Theta}^{i}_{\tau_{k}},Q^{i},R^{i})R^{i}K(\hat{\Theta}^{i}_{\tau_{k}},Q^{i},R^{i})\big)x_{t} \tag{125}
$$

since the algorithm commits to a fixed policy during the epoch.

Given that $J_{*}(\Theta_{*}^{i},Q^{i},R^{i})=P(\Theta_{*}^{i},Q^{i},R^{i})\bullet\sigma_{\omega}^{2}I$ and $P(\hat{\Theta}^{i}_{s_{1}},Q^{i},R^{i})\preceq P({\Theta}^{i}_{*},Q^{i},R^{i})$ (see Lemma 19 in [7]), it follows that

$$
J_{*}(\Theta_{*}^{i},Q^{i},R^{i})\geq\sigma_{\omega}^{2}\|P(\hat{\Theta}^{i}_{\tau_{k}},Q^{i},R^{i})\|_{*}. \tag{126}
$$

On the other hand, by (83) we have

$$
\begin{aligned}
&Q^{i}+K^{\top}(\hat{\Theta}^{i}_{\tau_{k}},Q^{i},R^{i})R^{i}K(\hat{\Theta}^{i}_{\tau_{k}},Q^{i},R^{i})\\
&\quad\preceq P(\hat{\Theta}^{i}_{\tau_{k}},Q^{i},R^{i})-(A_{*}+B_{*}^{i}K(\hat{\Theta}^{i}_{\tau_{k}},Q^{i},R^{i}))^{\top}P(\hat{\Theta}^{i}_{\tau_{k}},Q^{i},R^{i})(A_{*}+B_{*}^{i}K(\hat{\Theta}^{i}_{\tau_{k}},Q^{i},R^{i}))\\
&\qquad+2\mu_{i}\|P(\hat{\Theta}^{i}_{\tau_{k}},Q^{i},R^{i})\|_{*}\begin{pmatrix}I\\ K(\hat{\Theta}^{i}_{\tau_{k}},Q^{i},R^{i})\end{pmatrix}{V^{i}}^{-1}_{n_{i}(\tau_{k})}\begin{pmatrix}I\\ K(\hat{\Theta}^{i}_{\tau_{k}},Q^{i},R^{i})\end{pmatrix}^{\top}.
\end{aligned}
\tag{127}
$$

Combining the above inequalities results in

$$
\begin{aligned}
\mathcal{R}_{\tau_{k}:\mathfrak{T}_{k}}&\leq\mathcal{R}_{\tau_{k}:\mathfrak{T}_{k}}^{11}+\mathcal{R}_{\tau_{k}:\mathfrak{T}_{k}}^{12}+\mathcal{R}_{\tau_{k}:\mathfrak{T}_{k}}^{13}+\mathcal{R}_{\tau_{k}:\mathfrak{T}_{k}}^{14},\\
\mathcal{R}_{\tau_{k}:\mathfrak{T}_{k}}^{11}&=\sum_{t=\tau_{k}}^{\mathfrak{T}_{k}-1}\Big(x_{t}^{\top}P(\hat{\Theta}^{i_{k}}_{\tau_{k}},Q^{i_{k}},R^{i_{k}})x_{t}-x_{t+1}^{\top}P(\hat{\Theta}^{i_{k}}_{\tau_{k}},Q^{i_{k}},R^{i_{k}})x_{t+1}\Big)1_{\mathcal{E}_{t}},\\
\mathcal{R}_{\tau_{k}:\mathfrak{T}_{k}}^{12}&=\sum_{t=\tau_{k}}^{\mathfrak{T}_{k}-1}\big(\omega_{t+1}^{\top}P(\hat{\Theta}^{i_{k}}_{\tau_{k}},Q^{i_{k}},R^{i_{k}})x_{t}\big)1_{\mathcal{E}_{t}},\\
\mathcal{R}_{\tau_{k}:\mathfrak{T}_{k}}^{13}&=\sum_{t=\tau_{k}}^{\mathfrak{T}_{k}-1}\Big(\omega_{t+1}^{\top}P(\hat{\Theta}^{i_{k}}_{\tau_{k}},Q^{i_{k}},R^{i_{k}})\omega_{t+1}-\sigma_{\omega}^{2}\|P(\hat{\Theta}^{i_{k}}_{\tau_{k}},Q^{i_{k}},R^{i_{k}})\|_{*}\Big)1_{\mathcal{E}_{t}},\\
\mathcal{R}_{\tau_{k}:\mathfrak{T}_{k}}^{14}&=\sum_{t=\tau_{k}}^{\mathfrak{T}_{k}-1}\frac{4\nu_{i_{k}}\mu_{i_{k}}}{\sigma_{\omega}^{2}}\big(z_{t}^{\top}{V_{n_{i}(\tau_{k})}^{i_{k}}}^{-1}z_{t}\big)1_{\mathcal{E}_{t}}.
\end{aligned}
\tag{128}
$$

Note that in the terms above we do not state the mode index separately, since it is known from the context: for an epoch starting at $\tau_{k}$ the corresponding mode is $i_{k}\in\mathcal{I}:=\{i_{0},i_{1},\ldots,i_{n-1}\}$.

Furthermore, recalling the definition of $R_{\mathcal{A}}^{2}$, we can write

$$
\begin{aligned}
\sum_{t=\mathfrak{T}_{k}}^{\tau_{k+1}-1}c_{t}^{i_{k}}&=\sum_{t=\mathfrak{T}_{k}}^{\tau_{k+1}-1}x_{t}^{\top}\big(Q^{i_{k}}+K^{\top}(\hat{\Theta}^{i_{k}}_{\tau_{k}},Q^{i_{k}},R^{i_{k}})R^{i_{k}}K(\hat{\Theta}^{i_{k}}_{\tau_{k}},Q^{i_{k}},R^{i_{k}})\big)x_{t}\\
&\leq\mathcal{R}^{21}_{\mathfrak{T}_{k}:\tau_{k+1}}+\mathcal{R}^{22}_{\mathfrak{T}_{k}:\tau_{k+1}},
\end{aligned}
$$

with

$$
\begin{aligned}
\mathcal{R}^{21}_{\mathfrak{T}_{k}:\tau_{k+1}}=\sum_{t=\mathfrak{T}_{k}}^{\tau_{k+1}-1}\Big(&x_{t}^{\top}P(\hat{\Theta}^{i_{k}}_{\tau_{k}},Q^{i_{k}},R^{i_{k}})x_{t}\\
&-x_{t}^{\top}(A_{*}^{i_{k}}+B_{*}^{i_{k}}K(\hat{\Theta}^{i_{k}}_{\tau_{k}},Q^{i_{k}},R^{i_{k}}))^{\top}P(\hat{\Theta}^{i_{k}}_{\tau_{k}},Q^{i_{k}},R^{i_{k}})(A_{*}^{i_{k}}+B_{*}^{i_{k}}K(\hat{\Theta}^{i_{k}}_{\tau_{k}},Q^{i_{k}},R^{i_{k}}))x_{t}\Big)1_{\mathcal{E}_{t}}
\end{aligned}
\tag{129}
$$

$$
\mathcal{R}^{22}_{\mathfrak{T}_{k}:\tau_{k+1}}=\sum_{t=\mathfrak{T}_{k}}^{\tau_{k+1}-1}2{\mu}_{i_{k}}^{\tau_{k}}\|P(\hat{\Theta}^{i_{k}}_{\tau_{k}},Q^{i_{k}},R^{i_{k}})\|_{*}\begin{pmatrix}I\\ K(\hat{\Theta}^{i_{k}}_{\tau_{k}},Q^{i_{k}},R^{i_{k}})\end{pmatrix}{V^{i_{k}}}^{-1}_{n_{i}(\tau_{k})}\begin{pmatrix}I\\ K(\hat{\Theta}^{i_{k}}_{\tau_{k}},Q^{i_{k}},R^{i_{k}})\end{pmatrix}^{\top}1_{\mathcal{E}_{t}} \tag{130}
$$

in which we applied (83).

With this decomposition, the regret (57) is upper-bounded as follows:

$$
\begin{aligned}
R_{T}\leq{}&\mathbb{E}\Big[\sum_{k=0}^{ns-1}\mathcal{R}_{\tau_{k}:\mathfrak{T}_{k}}^{11}+\mathcal{R}_{\tau_{k}:\mathfrak{T}_{k}}^{12}+\mathcal{R}_{\tau_{k}:\mathfrak{T}_{k}}^{13}+\mathcal{R}_{\tau_{k}:\mathfrak{T}_{k}}^{14}\Big]\\
&+\mathbb{E}\Big[\sum_{k=0}^{ns-1}\mathcal{R}^{21}_{\mathfrak{T}_{k}:\tau_{k+1}}+\mathcal{R}^{22}_{\mathfrak{T}_{k}:\tau_{k+1}}\Big].
\end{aligned}
$$

We now bound each term individually. First, we show that $\mathbb{E}[\sum_{k=0}^{ns-1}\mathcal{R}^{12}_{\tau_{k}:\mathfrak{T}_{k}}]=0$ and $\mathbb{E}[\sum_{k=0}^{ns-1}\mathcal{R}^{13}_{\tau_{k}:\mathfrak{T}_{k}}]=0$. The following lemmas establish these claims.

Lemma 10

On the event $\mathcal{E}_{t}$,

$$
\mathbb{E}\Big[\sum_{k=0}^{ns-1}\mathcal{R}_{s_{1}:s_{2}}^{2}\Big]=0
\tag{131}
$$

holds.

Proof:

$$
\begin{aligned}
&\mathbb{E}\Big[\sum_{k=0}^{ns-1}\sum_{t=s_{1}}^{s_{2}-1}\big(\omega_{t+1}^{\top}P(\hat{\Theta}^{i_{k}}_{s_{1}},Q^{i_{k}},R^{i_{k}})x_{t}\big)1_{\mathcal{E}_{t}}\Big]\\
&\quad=\sum_{k=0}^{ns-1}\mathbb{E}\Big[\sum_{t=s_{1}}^{s_{2}-1}\big(\omega_{t+1}^{\top}P(\hat{\Theta}^{i_{k}}_{s_{1}},Q^{i_{k}},R^{i_{k}})x_{t}\big)1_{\mathcal{E}_{t}}\,\Big|\,\mathcal{F}_{k-1}\Big]\\
&\quad=\sum_{k=0}^{ns-1}P(\hat{\Theta}^{i_{k}}_{s_{1}},Q^{i_{k}},R^{i_{k}})\bullet\mathbb{E}\Big[\sum_{t=s_{1}}^{s_{2}-1}\big(\omega_{t+1}^{\top}x_{t}\big)1_{\mathcal{E}_{t}}\,\Big|\,\mathcal{F}_{k-1}\Big]\\
&\quad=\sum_{k=0}^{ns-1}P(\hat{\Theta}^{i_{k}}_{s_{1}},Q^{i_{k}},R^{i_{k}})\bullet\sum_{t=s_{1}}^{s_{2}-1}\mathbb{E}\big[\big(\omega_{t+1}^{\top}x_{t}\big)1_{\mathcal{E}_{t}}\,\big|\,\mathcal{F}_{t-1}\big]=0.
\end{aligned}
\tag{132}
$$

Note that the second equality holds because $P(\hat{\Theta}^{i_{k}}_{s_{1}},Q^{i_{k}},R^{i_{k}})$ is $\mathcal{F}_{k-1}$-measurable. The last equality holds because $x_{t}$ and $\omega_{t+1}$ are independent and $\omega_{t+1}$ is a martingale difference sequence, i.e., $\mathbb{E}[\omega_{s+1}|\mathcal{F}_{s}]=0$ for all $s=0,1,\dots,t$. ∎

Lemma 11

It holds that

$$
\mathbb{E}\Big[\sum_{k=0}^{ns-1}\mathcal{R}_{s_{1}:s_{2}}^{3}\Big]=0.
\tag{133}
$$

Proof:

$$
\begin{aligned}
&\mathbb{E}\Big[\sum_{k=0}^{ns-1}\sum_{t=s_{1}}^{s_{2}-1}\big(\omega_{t+1}^{\top}P(\hat{\Theta}^{i_{k}}_{s_{1}},Q^{i_{k}},R^{i_{k}})\omega_{t+1}-\sigma_{\omega}^{2}\|P(\hat{\Theta}^{i_{k}}_{s_{1}},Q^{i_{k}},R^{i_{k}})\|_{*}\big)1_{\mathcal{E}_{t}}\Big]\\
&\quad=\sum_{k=0}^{ns-1}\mathbb{E}\Big[\sum_{t=s_{1}}^{s_{2}-1}\big(\omega_{t+1}^{\top}P(\hat{\Theta}^{i_{k}}_{s_{1}},Q^{i_{k}},R^{i_{k}})\omega_{t+1}-\sigma_{\omega}^{2}\|P(\hat{\Theta}^{i_{k}}_{s_{1}},Q^{i_{k}},R^{i_{k}})\|_{*}\big)1_{\mathcal{E}_{t}}\,\Big|\,\mathcal{F}_{k-1}\Big]\\
&\quad=\sum_{k=0}^{ns-1}\Big(\mathbb{E}\Big[\sum_{t=s_{1}}^{s_{2}-1}\big(\omega_{t+1}^{\top}P(\hat{\Theta}^{i_{k}}_{s_{1}},Q^{i_{k}},R^{i_{k}})\omega_{t+1}\big)1_{\mathcal{E}_{t}}\,\Big|\,\mathcal{F}_{k-1}\Big]\\
&\qquad\qquad-\mathbb{E}\Big[\sum_{t=s_{1}}^{s_{2}-1}\big(\sigma_{\omega}^{2}\|P(\hat{\Theta}^{i_{k}}_{s_{1}},Q^{i_{k}},R^{i_{k}})\|_{*}\big)1_{\mathcal{E}_{t}}\,\Big|\,\mathcal{F}_{k-1}\Big]\Big)\\
&\quad=\sum_{k=0}^{ns-1}\Big(P(\hat{\Theta}^{i_{k}}_{s_{1}},Q^{i_{k}},R^{i_{k}})\bullet\mathbb{E}\Big[\sum_{t=s_{1}}^{s_{2}-1}\big(\omega_{t+1}^{\top}\omega_{t+1}\big)1_{\mathcal{E}_{t}}\,\Big|\,\mathcal{F}_{k-1}\Big]\\
&\qquad\qquad-\mathbb{E}\Big[\sum_{t=s_{1}}^{s_{2}-1}\big(\sigma_{\omega}^{2}\|P(\hat{\Theta}^{i_{k}}_{s_{1}},Q^{i_{k}},R^{i_{k}})\|_{*}\big)1_{\mathcal{E}_{t}}\,\Big|\,\mathcal{F}_{k-1}\Big]\Big)\\
&\quad\leq\sum_{k=0}^{ns-1}\|P(\hat{\Theta}^{i_{k}}_{s_{1}},Q^{i_{k}},R^{i_{k}})\|_{*}\sum_{t=s_{1}}^{s_{2}-1}\mathbb{E}\big[\big(\omega_{t+1}^{\top}\omega_{t+1}-\sigma_{\omega}^{2}I\big)1_{\mathcal{E}_{t}}\,\big|\,\mathcal{F}_{t-1}\big]=0,
\end{aligned}
\tag{134}
$$

where we applied the $\mathcal{F}_{k-1}$-measurability of $P(\hat{\Theta}^{i_{k}}_{s_{1}},Q^{i_{k}},R^{i_{k}})$ in the last equality, and in the last inequality we applied the property $A\bullet B\leq\|A\|_{*}\|B\|_{*}$. ∎
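As an informal sanity check of Lemmas 10 and 11, the following sketch estimates both expectations by Monte Carlo. It is only an illustration under stated assumptions: the noise is taken to be Gaussian with covariance $\sigma_{\omega}^{2}I$, a fixed random positive definite matrix stands in for $P(\hat{\Theta}^{i_{k}}_{s_{1}},Q^{i_{k}},R^{i_{k}})$, and $\|P\|_{*}$ is read as the trace of $P$ (the trace norm of a positive semidefinite matrix). The function name, dimensions and sample size are hypothetical and not taken from the paper.

```python
import numpy as np

def noise_term_means(num_samples=200_000, dim=3, sigma=0.5, seed=0):
    """Monte Carlo estimates (mean, standard error) of the two noise terms."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((dim, dim))
    P = G @ G.T + np.eye(dim)                                # stand-in for P(.), positive definite
    x = 1.0 + 2.0 * rng.standard_normal((num_samples, dim))  # stand-in states x_t (nonzero mean)
    w = sigma * rng.standard_normal((num_samples, dim))      # noise w_{t+1}, independent of x_t

    # Cross term of Lemma 10: w^T P x, expected to average to zero.
    cross = np.einsum("si,ij,sj->s", w, P, x)
    # Centered quadratic term of Lemma 11: w^T P w - sigma^2 * trace(P).
    quad = np.einsum("si,ij,sj->s", w, P, w) - sigma**2 * np.trace(P)

    def summary(v):
        return v.mean(), v.std() / np.sqrt(v.size)

    return summary(cross), summary(quad)

if __name__ == "__main__":
    (m1, se1), (m2, se2) = noise_term_means()
    print(f"cross term:     mean {m1:+.4f}, s.e. {se1:.4f}")
    print(f"quadratic term: mean {m2:+.4f}, s.e. {se2:.4f}")
```

Both sample means should fall within a few standard errors of zero, which is what the two lemmas assert in expectation.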

Lemma 12

It holds that

$$
\mathbb{E}\Big[\sum_{k=0}^{ns-1}\mathcal{R}_{s_{1}:s_{2}}^{1}\Big]\leq\frac{\nu_{*}}{\sigma_{\omega}^{2}}\mathbb{E}[x_{0}^{\top}x_{0}]+\frac{\nu_{*}}{\sigma_{\omega}^{2}}\kappa_{*}^{4}.
\tag{135}
$$
Proof:

We have

$$
\begin{aligned}
\mathcal{R}_{s_{1}:s_{2}}^{1}
&=\sum_{t=s_{1}}^{s_{2}-1}\Big(x_{t}^{\top}P(\hat{\Theta}^{i}_{s_{1}},Q^{i},R^{i})x_{t}-x_{t+1}^{\top}P(\hat{\Theta}^{i}_{s_{1}},Q^{i},R^{i})x_{t+1}\Big)1_{\mathcal{E}_{t}}\\
&=\Big[x_{s_{1}}^{\top}P(\hat{\Theta}^{i}_{s_{1}},Q^{i},R^{i})x_{s_{1}}-x_{s_{2}}^{\top}P(\hat{\Theta}^{i}_{s_{1}},Q^{i},R^{i})x_{s_{2}}\\
&\qquad+\sum_{t=s_{1}+1}^{s_{2}-1}\big(x_{t}^{\top}P(\hat{\Theta}^{i}_{s_{1}},Q^{i},R^{i})x_{t}-x_{t}^{\top}P(\hat{\Theta}^{i}_{s_{1}},Q^{i},R^{i})x_{t}\big)\Big]1_{\mathcal{E}_{t}}\\
&=x_{s_{1}}^{\top}P(\hat{\Theta}^{i}_{s_{1}},Q^{i},R^{i})x_{s_{1}}-x_{s_{2}}^{\top}P(\hat{\Theta}^{i}_{s_{1}},Q^{i},R^{i})x_{s_{2}}
\end{aligned}
\tag{136}
$$

by telescoping. Applying (136) together with the definitions of $\mathfrak{T}_{k}$ and $\tau_{k}$ gives

$$
\begin{aligned}
\mathbb{E}\Big[\sum_{k=0}^{ns-1}\mathcal{R}_{\tau_{k}:\mathfrak{T}_{k}}^{1}\Big]=\sum_{k=0}^{ns-1}\Big(&-\mathbb{E}\big[x_{\mathfrak{T}_{k}}^{\top}P(\hat{\Theta}^{i_{k-1}}_{\tau_{k}},Q^{i_{k-1}},R^{i_{k-1}})x_{\mathfrak{T}_{k}}\,\big|\,\mathcal{F}_{\mathfrak{T}_{k}-1}\big]\\
&+\mathbb{E}\big[x_{\tau_{k}}^{\top}P(\hat{\Theta}^{i_{k}}_{\tau_{k}},Q^{i_{k}},R^{i_{k}})x_{\tau_{k}}\,\big|\,\mathcal{F}_{\tau_{k}-1}\big]\Big).
\end{aligned}
\tag{137}
$$

Case A: Contraction case $\mathcal{Y}<1$

In this case, applying (49) yields

$$
\begin{aligned}
\mathbb{E}\Big[\sum_{k=0}^{ns-1}\mathcal{R}_{\tau_{k}:\mathfrak{T}_{k}}^{1}\Big]
&\leq\sum_{k=0}^{ns-1}\mathbb{E}\big[x_{\tau_{k}}^{\top}P(\hat{\Theta}^{i_{k}}_{\tau_{k}},Q^{i_{k}},R^{i_{k}})x_{\tau_{k}}\,\big|\,\mathcal{F}_{\tau_{k}-1}\big]\\
&\leq\frac{\nu_{*}}{\sigma_{\omega}^{2}}\sum_{k=0}^{ns-1}\mathbb{E}\big[x_{\tau_{k}}^{\top}x_{\tau_{k}}\,\big|\,\mathcal{F}_{\tau_{k}-1}\big]
\end{aligned}
\tag{138}
$$

$$
\mathbb{E}\Big[\sum_{k=0}^{ns-1}\mathcal{R}_{s_{1}:s_{2}}^{1}\Big]
\tag{139}
$$

By (105) we have

$$
\begin{aligned}
\zeta(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))W\geq{}&\mathbb{E}[\mathcal{V}_{j}(\tau_{q+1}^{+})|\mathcal{F}_{\tau_{q+1}-1}]\\
&-\rho(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))\Big(1-\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)\Big)^{\tau_{q+1}-\tau_{q}}\mathbb{E}[\mathcal{V}_{i}(\tau_{q}^{+})|\mathcal{F}_{\tau_{q}-1}].
\end{aligned}
$$

Based on the dwell-time design, the inequality

$$
\rho(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))\Big(1-\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)\Big)^{\tau_{q+1}-\tau_{q}}\leq 1
$$

holds if $\tau_{q+1}-\tau_{q}\geq[\tau_{dw}^{ij}]_{\tau_{q}}$. Therefore, for any $\tau<[\tau_{dw}^{ij}]_{\tau_{q}}$,

$$
\begin{aligned}
\zeta(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))W\geq{}&\mathbb{E}[\mathcal{V}_{j}(\tau_{q+1}^{+})|\mathcal{F}_{\tau_{q+1}-1}]\\
&-\rho(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))\Big(1-\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)\Big)^{[\tau_{dw}^{ij}]_{\tau_{q}}}\mathbb{E}[\mathcal{V}_{i}(\tau_{q}^{+})|\mathcal{F}_{\tau_{q}-1}]\\
\geq{}&\mathbb{E}[\mathcal{V}_{j}(\tau_{q+1}^{+})|\mathcal{F}_{\tau_{q+1}-1}]\\
&-\rho(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))\Big(1-\eta\big(\mathcal{C}^{i}_{\tau_{q}}(\delta)\big)\Big)^{\tau}\mathbb{E}[\mathcal{V}_{i}(\tau_{q}^{+})|\mathcal{F}_{\tau_{q}-1}].
\end{aligned}
\tag{140}
$$

Also, by definition we have

$$
\mathbb{E}[\mathcal{V}_{i_{k-1}}(s_{2}^{k-1})|\mathcal{F}_{s_{2}^{k-1}-1}]:=\mathbb{E}\big[P(\hat{\Theta}^{i_{k-1}}_{s^{k-1}_{1}},Q^{i_{k-1}},R^{i_{k-1}})\bullet[x_{s^{k-1}_{2}}^{\top}x_{s^{k-1}_{2}}]\,\big|\,\mathcal{F}_{s^{k-1}_{2}-1}\big]
$$

and similarly

$$
\mathbb{E}[\mathcal{V}_{i_{k}}(s_{1}^{k})|\mathcal{F}_{s_{1}^{k}-1}]:=\mathbb{E}\big[P(\hat{\Theta}^{i_{k}}_{s^{k}_{1}},Q^{i_{k}},R^{i_{k}})\bullet[x_{s^{k}_{1}}^{\top}x_{s^{k}_{1}}]\,\big|\,\mathcal{F}_{s_{1}^{k}-1}\big].
$$

Now, noting that $s_{2}^{k}-s_{1}^{k}<[\tau_{dw}^{ij}]_{\tau_{q}}$, (140) implies

$$
\begin{aligned}
&-\mathbb{E}\big[P(\hat{\Theta}^{i_{k-1}}_{s^{k-1}_{1}},Q^{i_{k-1}},R^{i_{k-1}})\bullet[x_{s^{k-1}_{2}}^{\top}x_{s^{k-1}_{2}}]\,\big|\,\mathcal{F}_{s^{k-1}_{2}-1}\big]\\
&\quad+\mathbb{E}\big[P(\hat{\Theta}^{i_{k}}_{s^{k}_{1}},Q^{i_{k}},R^{i_{k}})\bullet[x_{s^{k}_{1}}^{\top}x_{s^{k}_{1}}]\,\big|\,\mathcal{F}_{s^{k}_{1}-1}\big]\leq\zeta(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))W.
\end{aligned}
\tag{141}
$$

Summing the right-hand side of (141) over the switch sequence, we have

$$
\zeta(\mathcal{C}^{i}_{\tau_{q}}(\delta),\mathcal{C}^{j}_{\tau_{q}}(\delta))W\leq\frac{\nu_{*}}{\sigma_{\omega}^{2}}\kappa_{*}^{4},
\tag{142}
$$

which follows by an analysis similar to the one carried out for Lemma 3. It is then straightforward to see that

$$
\mathbb{E}\big[P(\hat{\Theta}^{i_{0}}_{s_{1}},Q^{i_{0}},R^{i_{0}})\bullet[x_{0}^{\top}x_{0}]\,\big|\,\mathcal{F}_{0^{-}}\big]\leq\frac{\nu_{*}}{\sigma_{\omega}^{2}}\mathbb{E}[x_{0}^{\top}x_{0}|\mathcal{F}_{0^{-}}],
\tag{143}
$$

which completes the proof.
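The telescoping step (136) used in this proof can be illustrated numerically. The sketch below simulates a short trajectory of a hypothetical stable closed loop (the matrices `A_cl` and `P`, the noise level and the horizon are arbitrary choices for illustration, not values from the paper) and compares the summed one-step differences of $x_{t}^{\top}Px_{t}$ with the difference of the two endpoint values.

```python
import numpy as np

def telescoping_gap(steps=50, seed=1):
    """Return sum_t (v_t - v_{t+1}) - (v_first - v_last), with v_t = x_t^T P x_t."""
    rng = np.random.default_rng(seed)
    A_cl = np.array([[0.9, 0.1], [0.0, 0.8]])   # hypothetical stable closed-loop matrix
    P = np.array([[2.0, 0.3], [0.3, 1.0]])      # hypothetical positive definite P
    x = np.zeros((steps + 1, 2))
    x[0] = [1.0, -1.0]
    for t in range(steps):
        x[t + 1] = A_cl @ x[t] + 0.1 * rng.standard_normal(2)

    v = np.einsum("ti,ij,tj->t", x, P, x)       # v[t] = x_t^T P x_t
    summed = np.sum(v[:-1] - v[1:])             # left-hand side of (136) with the indicator equal to 1
    endpoints = v[0] - v[-1]                    # right-hand side of (136)
    return summed - endpoints

if __name__ == "__main__":
    print(f"telescoping gap: {telescoping_gap():.2e}")
```

Up to floating-point rounding the gap is zero, independently of the trajectory, since all interior terms cancel.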

To bound the term $\mathbb{E}[\sum_{k=0}^{ns-1}\mathcal{R}_{s_{1}:s_{2}}^{4}]$ we need the following lemma.

Lemma 13

Let the matrix $M$ be positive definite and let $z_{t}$ be an $\mathcal{F}_{t}$-measurable vector. If $\sum_{t=s_{1}}^{s_{2}}\mathbb{E}[z_{t}^{\top}M^{-1}z_{t}|\mathcal{F}_{t-1}]\leq 1$, then

$$
\sum_{t=s_{1}}^{s_{2}}\mathbb{E}[z_{t}^{\top}M^{-1}z_{t}|\mathcal{F}_{t-1}]\leq 2\log\bigg(\frac{\det\big(M+\sum_{t=s_{1}}^{s_{2}}\mathbb{E}[z_{t}z_{t}^{\top}|\mathcal{F}_{t-1}]\big)}{\det(M)}\bigg).
\tag{144}
$$
Proof:

The following holds by properties of the determinant:

$$
\begin{aligned}
\det\Big(M+\sum_{t=s_{1}}^{s_{2}}\mathbb{E}[z_{t}z_{t}^{\top}|\mathcal{F}_{t-1}]\Big)
&=\det(M)\det\Big(I+M^{-1/2}\sum_{t=s_{1}}^{s_{2}}\mathbb{E}[z_{t}z_{t}^{\top}|\mathcal{F}_{t-1}]M^{-1/2}\Big)\\
&\geq\det(M)\Big(1+\sum_{t=s_{1}}^{s_{2}}\mathbb{E}[z_{t}^{\top}M^{-1}z_{t}|\mathcal{F}_{t-1}]\Big),
\end{aligned}
$$

where the last step uses $\det(I+A)\geq 1+\operatorname{tr}(A)$ for positive semidefinite $A$ (with equality when $A$ has rank one). This results in

$$
\log\Big(1+\sum_{t=s_{1}}^{s_{2}}\mathbb{E}[z_{t}^{\top}M^{-1}z_{t}|\mathcal{F}_{t-1}]\Big)\leq\log\bigg(\frac{\det\big(M+\sum_{t=s_{1}}^{s_{2}}\mathbb{E}[z_{t}z_{t}^{\top}|\mathcal{F}_{t-1}]\big)}{\det(M)}\bigg).
$$

If $\sum_{t=s_{1}}^{s_{2}}\mathbb{E}[z_{t}^{\top}M^{-1}z_{t}|\mathcal{F}_{t-1}]\leq 1$, then applying the inequality $x\leq 2\log(1+x)$, which holds for $0\leq x\leq 1$, one can write

$$
\sum_{t=s_{1}}^{s_{2}}\mathbb{E}[z_{t}^{\top}M^{-1}z_{t}|\mathcal{F}_{t-1}]\leq 2\log\bigg(\frac{\det\big(M+\sum_{t=s_{1}}^{s_{2}}\mathbb{E}[z_{t}z_{t}^{\top}|\mathcal{F}_{t-1}]\big)}{\det(M)}\bigg).
$$

∎
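The bound of Lemma 13 can be checked numerically on a small instance. The sketch below is a simplified illustration: it replaces the conditional expectations by deterministic vectors $z_{t}$, draws a random positive definite $M$, and rescales the $z_{t}$ so that $\sum_{t}z_{t}^{\top}M^{-1}z_{t}=0.9\leq 1$. The function name, dimensions and the value 0.9 are assumptions made for this example only.

```python
import numpy as np

def logdet_check(num_terms=5, dim=3, seed=0):
    """Return (x, log(1+x), log det(M+S) - log det(M)) with x = sum_t z_t^T M^{-1} z_t."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((dim, dim))
    M = G @ G.T + dim * np.eye(dim)                  # positive definite M
    Minv = np.linalg.inv(M)

    Z = rng.standard_normal((num_terms, dim))        # rows are the vectors z_t
    raw = np.einsum("ti,ij,tj->", Z, Minv, Z)
    Z *= np.sqrt(0.9 / raw)                          # enforce the lemma's condition x <= 1

    x = np.einsum("ti,ij,tj->", Z, Minv, Z)          # equals 0.9 after rescaling
    S = Z.T @ Z                                      # sum_t z_t z_t^T
    log_ratio = np.linalg.slogdet(M + S)[1] - np.linalg.slogdet(M)[1]
    return x, np.log1p(x), log_ratio

if __name__ == "__main__":
    x, log1p_x, log_ratio = logdet_check()
    print(f"x = {x:.4f}, log(1+x) = {log1p_x:.4f}, log-det ratio = {log_ratio:.4f}")
```

On any such instance one should observe $\log(1+x)\leq\log\det(M+S)-\log\det(M)$ and hence $x\leq 2\big(\log\det(M+S)-\log\det(M)\big)$, which is the form of (144).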

Lemma 14

On the event $\mathcal{E}_{t}$ it holds

$$
\mathbb{E}\Big[\sum_{k=0}^{ns-1}\mathcal{R}_{s_{1}:s_{2}}^{4}\Big]
\tag{145}
$$
Proof:

Recall the definition of $\bar{\mu}^{\tau_{q}}_{i}$ given by (82), where $\tau_{q}$ denotes a switch time. To simplify notation, we slightly abuse it and write $\bar{\mu}^{k}_{i}$, where $k$ is the index of the $k$-th switch. Since the algorithm is off-policy and the control is designed from a confidence ellipsoid constructed with data gathered before the current epoch starts, $\mathbb{E}[\bar{\mu}_{i}^{k^{-}}|\mathcal{F}_{k^{-}}]=\mathbb{E}[\bar{\mu}_{i}^{k^{-}}|\mathcal{F}_{k-1}]=\bar{\mu}_{i}^{k^{-}}$. The second equality holds because the last update time of $\bar{\mu}^{k}_{i}$ is at least two epochs back. Now, one can write

$$
\begin{aligned}
\mathbb{E}\Big[\sum_{k=0}^{ns-1}\mathcal{R}_{s_{1}:s_{2}}^{4}\Big]
&=\mathbb{E}\Big[\sum_{k=0}^{ns-1}\sum_{t=s_{1}}^{s_{2}-1}\frac{4\nu_{i_{k}}\bar{\mu}^{k^{-}}_{i_{k}}}{\sigma_{\omega}^{2}}\big(z_{t}^{\top}{V^{i_{k}}}_{n_{i_{k}}(s_{1})}^{-1}z_{t}\big)1_{\mathcal{E}_{t}}\Big]\\
&=\mathbb{E}\Big[\mathbb{E}\Big[\sum_{k=0}^{ns-1}\sum_{t=s_{1}}^{s_{2}-1}\frac{4\nu_{i_{k}}\bar{\mu}^{k^{-}}_{i_{k}}}{\sigma_{\omega}^{2}}\big(z_{t}^{\top}{V^{i_{k}}}_{n_{i_{k}}(s_{1})}^{-1}z_{t}\big)1_{\mathcal{E}_{t}}\,\Big|\,\bar{\mu}_{i_{k}}^{k^{-}}\Big]\Big]\\
&=\mathbb{E}\Big[\sum_{k=0}^{ns-1}\frac{4\nu_{i_{k}}\bar{\mu}^{k^{-}}_{i_{k}}}{\sigma_{\omega}^{2}}{V^{i_{k}}}_{n_{i_{k}}(s_{1})}^{-1}\bullet\mathbb{E}\Big[\sum_{t=s_{1}}^{s_{2}-1}\big(z_{t}^{\top}z_{t}\big)1_{\mathcal{E}_{t}}\,\Big|\,\mathcal{F}_{k-1}\Big]\,\Big|\,\bar{\mu}_{i_{k}}^{k^{-}}\Big]\\
&=\sum_{k=0}^{ns-1}\frac{4\nu_{i_{k}}\mathbb{E}[\bar{\mu}^{k^{-}}_{i_{k}}]}{\sigma_{\omega}^{2}}\mathbb{E}[{V^{i_{k}}}_{n_{i_{k}}(s_{1})}^{-1}]\bullet\mathbb{E}\Big[\sum_{t=s_{1}}^{s_{2}-1}\big(z_{t}^{\top}z_{t}\big)1_{\mathcal{E}_{t}}\,\Big|\,\mathcal{F}_{t-1}\Big]\\
&=\sum_{k=0}^{ns-1}\frac{4\nu_{i_{k}}\mathbb{E}[\bar{\mu}^{k^{-}}_{i_{k}}]}{\sigma_{\omega}^{2}}\sum_{t=s_{1}}^{s_{2}-1}\mathbb{E}\big[z_{t}^{\top}\mathbb{E}[{V^{i_{k}}}_{n_{i_{k}}(s_{1})}^{-1}]z_{t}1_{\mathcal{E}_{t}}\,\big|\,\mathcal{F}_{t-1}\big].
\end{aligned}
\tag{146}
$$

By ${V^{i_{k}}}_{n_{i_{k}}(s_{1})}\succsim\lambda^{k}_{i_{k}}I$, we have

$$
\sum_{t=s_{1}}^{s_{2}-1}\mathbb{E}\big[z_{t}^{\top}\mathbb{E}[{V^{i_{k}}}_{n_{i_{k}}(s_{1})}^{-1}]z_{t}1_{\mathcal{E}_{t}}\,\big|\,\mathcal{F}_{t-1}\big]\leq\frac{1}{\lambda^{k}_{i_{k}}}\sum_{t=s_{1}}^{s_{2}-1}\mathbb{E}\big[z_{t}^{\top}z_{t}1_{\mathcal{E}_{t}}\,\big|\,\mathcal{F}_{t-1}\big].
$$

Since $\mathbb{E}[z_{t}^{\top}z_{t}|\mathcal{F}_{t-1}]$ is bounded for $s_{1}\leq t\leq s_{2}-1$, the quantity $\frac{1}{\lambda^{k}_{i_{k}}}\sum_{t=s_{1}}^{s_{2}-1}\mathbb{E}\big[z_{t}^{\top}z_{t}1_{\mathcal{E}_{t}}|\mathcal{F}_{t-1}\big]$ is $\mathcal{O}(s_{2}-s_{1})$. However, since $\lambda^{k}_{i_{k}}$ is $\mathcal{O}(\sqrt{n_{i_{k}}(s_{1})})$ by definition, for a long enough switch sequence there exists a switch $k^{\prime}$ with corresponding time $t^{\prime}$ such that $\mathcal{O}\big(n_{i}(t^{\prime})\big)=\mathcal{O}\big((\tau_{\max}^{d\omega})^{2}\big)$, and for $k\geq k^{\prime}$ the following inequality holds:

$$
\frac{1}{\lambda^{k}_{i_{k}}}\sum_{t=s_{1}}^{s_{2}-1}\mathbb{E}\big[z_{t}^{\top}z_{t}1_{\mathcal{E}_{t}}\,\big|\,\mathcal{F}_{t-1}\big]\leq 1,
\tag{147}
$$

where $s_{1}\geq t^{\prime}$. The inequality (147) indeed fulfills the condition of Lemma 13. Therefore, for all modes $i\in\mathcal{I}$ and $s_{1}\geq t^{\prime}$,

$$
\sum_{t=s_{1}}^{s_{2}-1}\mathbb{E}\big[z_{t}^{\top}\mathbb{E}[{V^{i}}_{n_{i}(s_{1})}^{-1}]z_{t}1_{\mathcal{E}_{t}}\,\big|\,\mathcal{F}_{t-1}\big]\leq 2\log\bigg(\frac{\det\big(\mathbb{E}[{V^{i}}_{n_{i}(s_{2})}]\big)}{\det\big(\mathbb{E}[{V^{i}}_{n_{i}(s_{1})}]\big)}\bigg)
\tag{148}
$$

holds, where

$$
\mathbb{E}[{V^{i}}_{n_{i}(s_{2})}]=\mathbb{E}[{V^{i}}_{n_{i}(s_{1})}]+\sum_{t=s_{1}}^{s_{2}-1}\mathbb{E}\big[z_{t}z_{t}^{\top}1_{\mathcal{E}_{t}}\,\big|\,\mathcal{F}_{t-1}\big].
$$

The inequality (148) implies that for $t>t^{\prime}$ the left-hand side is no longer $\mathcal{O}(s_{2}-s_{1})$, which yields a better regret.

Now, we proceed with decomposing (146) as follows:

$$
\begin{aligned}
&\sum_{k=0}^{ns-1}\frac{4\nu_{i_{k}}\mathbb{E}[\bar{\mu}^{k^{-}}_{i_{k}}]}{\sigma_{\omega}^{2}}\sum_{t=s_{1}}^{s_{2}-1}\mathbb{E}\big[z_{t}^{\top}\mathbb{E}[{V^{i_{k}}}_{n_{i_{k}}(s_{1})}^{-1}]z_{t}1_{\mathcal{E}_{t}}\,\big|\,\mathcal{F}_{t-1}\big]\\
&\quad=\overbrace{\sum_{k=0}^{k^{\prime}-1}\frac{4\nu_{i_{k}}\mathbb{E}[\bar{\mu}^{k^{-}}_{i_{k}}]}{\sigma_{\omega}^{2}}\sum_{t=s_{1}}^{s_{2}-1}\mathbb{E}\big[z_{t}^{\top}\mathbb{E}[{V^{i_{k}}}_{n_{i_{k}}(s_{1})}^{-1}]z_{t}1_{\mathcal{E}_{t}}\,\big|\,\mathcal{F}_{t-1}\big]}^{\Gamma_{1}}\\
&\qquad+\underbrace{\sum_{k=k^{\prime}}^{ns-1}\frac{4\nu_{i_{k}}\mathbb{E}[\bar{\mu}^{k^{-}}_{i_{k}}]}{\sigma_{\omega}^{2}}\sum_{t=s_{1}}^{s_{2}-1}\mathbb{E}\big[z_{t}^{\top}\mathbb{E}[{V^{i_{k}}}_{n_{i_{k}}(s_{1})}^{-1}]z_{t}1_{\mathcal{E}_{t}}\,\big|\,\mathcal{F}_{t-1}\big]}_{\Gamma_{2}}.
\end{aligned}
\tag{149}
$$

We upper-bound the two terms individually. For the term $\Gamma_{1}$ we have

$$
\begin{aligned}
\Gamma_{1}&\leq\sum_{k=0}^{k^{\prime}-1}\alpha_{0}^{i_{k}}\sum_{t=s_{1}}^{s_{2}-1}\mathbb{E}\big[z_{t}^{\top}z_{t}1_{\mathcal{E}_{t}}\,\big|\,\mathcal{F}_{t-1}\big]\\
&\leq\sum_{k=0}^{k^{\prime}-1}2\alpha_{0}^{i_{k}}\kappa^{2}_{i_{k}}\sum_{t=s_{1}}^{s_{2}-1}\mathbb{E}\big[x_{t}^{\top}x_{t}1_{\mathcal{E}_{t}}\,\big|\,\mathcal{F}_{t-1}\big]\\
&\leq\sum_{k=0}^{k^{\prime}-1}2\alpha_{0}^{i_{k}}\kappa^{2}_{i_{k}}\alpha(s_{2}-s_{1})\\
&=\sum_{i\in\mathcal{I}}2\alpha_{0}^{i}\kappa^{2}_{i}\alpha\,n_{i}(t^{\prime})\leq 2|\mathcal{M}|\alpha\max_{i\in\mathcal{M}}\alpha_{0}^{i}\kappa^{2}_{i}n_{i}(t^{\prime}),
\end{aligned}
$$

where in the first inequality we applied ${V^{i_{k}}}_{n_{i_{k}}(s_{1})}^{-1}\preceq(1/\lambda_{i_{k}}^{n_{i_{k}}})I$ and the definition of $\lambda_{i_{k}}^{n_{i_{k}}}$. In the second inequality, we used the fact that on the event $\mathcal{E}_{t}$, $z_{t}=\begin{pmatrix}I\\ K(\hat{\Theta}_{\tau_{q}}^{i},Q^{i},R^{i})\end{pmatrix}x_{t}$ and $\Big\|\begin{pmatrix}I\\ K(\hat{\Theta}_{\tau_{q}}^{i},Q^{i},R^{i})\end{pmatrix}\Big\|^{2}\leq 2\kappa_{i}^{2}$ since $\kappa_{i}\geq 1$, i.e.,

$$
\mathbb{E}\big[z_{t}^{\top}z_{t}1_{\mathcal{E}_{t}}\,\big|\,\mathcal{F}_{t-1}\big]\leq 2\kappa^{2}_{i}\mathbb{E}\big[x_{t}^{\top}x_{t}\,\big|\,\mathcal{F}_{t-1}\big].
\tag{150}
$$
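The norm bound behind (150) can be illustrated with a few lines of code. The sketch assumes, as the text suggests, that $\kappa_{i}\geq 1$ upper-bounds the spectral norm of the gain $K$; it then uses the identity $\|(I;K)\|_{2}^{2}=1+\|K\|_{2}^{2}$ for the stacked matrix. The helper name and the random gain are hypothetical.

```python
import numpy as np

def stacked_gain_norm_sq(K):
    """Squared spectral norm of the block matrix [I; K], where K has shape (m, n)."""
    n = K.shape[1]
    stacked = np.vstack([np.eye(n), K])      # (n + m) x n matrix mapping x_t to z_t
    return np.linalg.norm(stacked, 2) ** 2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K = rng.standard_normal((2, 4))          # hypothetical gain, 2 inputs and 4 states
    k_norm = np.linalg.norm(K, 2)
    kappa = max(1.0, k_norm)                 # assumed bound: kappa >= 1 and kappa >= ||K||
    print(f"||[I; K]||^2 = {stacked_gain_norm_sq(K):.4f}")
    print(f"1 + ||K||^2  = {1 + k_norm**2:.4f}")
    print(f"2 * kappa^2  = {2 * kappa**2:.4f}")
```

Since $1+\|K\|^{2}\leq 2\max(1,\|K\|)^{2}$, the first two printed values coincide and neither exceeds the third.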

In the third inequality we applied Lemma 3 with $\alpha$ defined as follows:

$$
\alpha:=\kappa_{*}^{4}\mathbb{E}[x_{0}^{\top}x_{0}|\mathcal{F}_{0^{-}}]+\kappa_{*}^{6}W.
\tag{151}
$$

The equality holds because

$$
\sum_{k=0}^{k^{\prime}-1}(s_{2}-s_{1})=\sum_{i\in\mathcal{I}}n_{i}(t^{\prime})
\tag{152}
$$

by definition.

Now we proceed to upper-bound $\Gamma_{2}$. For $k\geq k^{\prime}$, since (147), which is the condition of Lemma 13, is fulfilled, we obtain

$$
\begin{aligned}
\Gamma_{2}&\leq\sum_{k=k^{\prime}}^{ns-1}\frac{8\nu_{i_{k}}\mathbb{E}[\mu^{k^{-}}_{i_{k}}]}{\sigma_{\omega}^{2}}\log\bigg(\frac{\det\big(\mathbb{E}[{V^{i_{k}}}_{n_{i_{k}}(s_{2})}]\big)}{\det\big(\mathbb{E}[{V^{i_{k}}}_{n_{i_{k}}(s_{1})}]\big)}\bigg)\\
&\leq\frac{8\nu_{*}\mathbb{E}[\mu^{ns^{-}}_{*}]}{\sigma_{\omega}^{2}}\sum_{k=k^{\prime}}^{ns-1}\log\bigg(\frac{\det\big(\mathbb{E}[{V^{i_{k}}}_{n_{i_{k}}(s_{2})}]\big)}{\det\big(\mathbb{E}[{V^{i_{k}}}_{n_{i_{k}}(s_{1})}]\big)}\bigg),
\end{aligned}
\tag{153}
$$

where $\mathbb{E}[\mu^{ns^{-}}_{*}]=\max_{i\in\mathcal{I}}\mathbb{E}[\mu^{ns^{-}}_{i}]$, which is given by (156).

Moreover,

$$
\sum_{k=k^{\prime}}^{ns-1}\log\bigg(\frac{\det\big(\mathbb{E}[{V^{i_{k}}}_{n_{i_{k}}(s_{2})}]\big)}{\det\big(\mathbb{E}[{V^{i_{k}}}_{n_{i_{k}}(s_{1})}]\big)}\bigg)\leq\sum_{i\in\mathcal{M}}\log\bigg(\frac{\det\big(\mathbb{E}[{V^{i}}_{n_{i}(T)}]\big)}{\det\big(\mathbb{E}[{V^{i}}_{n_{i}(t^{\prime})}]\big)}\bigg),
\tag{154}
$$

in which we expanded the summation, rearranged the terms by mode, and then applied the property $\log\frac{a}{b}+\log\frac{b}{c}=\log\frac{a}{c}$ to the terms of each mode. In conclusion, we have

$$
\Gamma_{2}\leq\frac{8\nu_{*}\mathbb{E}[\mu^{ns^{-}}_{*}]}{\sigma_{\omega}^{2}}\sum_{i\in\mathcal{M}}\log\bigg(\frac{\det\big(\mathbb{E}[{V^{i}}_{n_{i}(T)}]\big)}{\det\big(\mathbb{E}[{V^{i}}_{n_{i}(t^{\prime})}]\big)}\bigg),
\tag{155}
$$

where $T$ is the minimum expected time to accomplish the mission. Note that in (155) the sum of logarithmic terms is of order $\mathcal{O}(|\mathcal{M}|\log(|\mathcal{M}|\sqrt{ns}))$, given that $T$ is $\mathcal{O}(|\mathcal{M}|\sqrt{ns})$ by Theorem 3.
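The per-mode telescoping of the log-determinant ratios used to pass from (153) to (155) can be illustrated as follows. The sketch keeps one covariance-like matrix per mode, updates only the active mode's matrix during each epoch, and compares the sum of per-epoch log-determinant increments with the overall per-mode log-determinant ratio. The mode sequence, epoch lengths and data are random placeholders rather than quantities from the paper, and deterministic matrices stand in for the expectations $\mathbb{E}[V^{i}]$.

```python
import numpy as np

def logdet_telescoping(num_epochs=12, num_modes=3, dim=2, seed=0):
    """Compare per-epoch log-det increments, grouped by mode, with the per-mode total."""
    rng = np.random.default_rng(seed)
    V = [np.eye(dim) for _ in range(num_modes)]             # V^i at the start
    logdet_start = [np.linalg.slogdet(v)[1] for v in V]
    per_epoch_sum = np.zeros(num_modes)

    for _ in range(num_epochs):
        i = int(rng.integers(num_modes))                    # mode active in this epoch
        length = int(rng.integers(1, 6))                    # epoch length s_2 - s_1
        Z = rng.standard_normal((length, dim))              # rows play the role of z_t
        before = np.linalg.slogdet(V[i])[1]
        V[i] = V[i] + Z.T @ Z                               # V(s_2) = V(s_1) + sum_t z_t z_t^T
        per_epoch_sum[i] += np.linalg.slogdet(V[i])[1] - before

    total = np.array([np.linalg.slogdet(V[i])[1] - logdet_start[i]
                      for i in range(num_modes)])
    return per_epoch_sum, total

if __name__ == "__main__":
    per_epoch_sum, total = logdet_telescoping()
    print("sum of per-epoch increments:", per_epoch_sum)
    print("per-mode log-det ratio:     ", total)
```

The two arrays agree up to rounding because, within each mode, the end matrix of one epoch is the start matrix of that mode's next epoch.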

Furthermore, for any $i\in\mathcal{M}$ one can write:

$$
\begin{aligned}
\mathbb{E}[\mu^{ns}_{i}]&=\mathbb{E}\Big[r^{i}_{n_{i}(T)}\Big(1+2\vartheta_{i}\Big(n_{i}(T)+\Big\|\sum_{k=1}^{n_{i}(T)}z^{i}_{k}{z^{i}_{k}}^{\top}\Big\|\Big)^{0.5}\Big)\Big]\\
&\leq\bar{R}_{i}\big(1+2\nu_{i}(1+2\kappa_{i}^{2}\alpha)^{0.5}\sqrt{n_{i}(T)}\big).
\end{aligned}
\tag{156}
$$

By definition, $\mathbb{E}[\mu^{ns}_{*}]=\max_{i\in\mathcal{M}}\mathbb{E}[\mu^{ns}_{i}]$, where, given the definition of $r^{i}_{n_{i}(T)}$, $\bar{R}_{i}$ is $\mathcal{O}(\log(|\mathcal{M}|\sqrt{ns}))$. Moreover, $n_{i}(T)\leq T$ and, by Theorem 3, $T$ is $\mathcal{O}(|\mathcal{M}|\sqrt{ns})$. We conclude that $\Gamma_{2}$ is $\mathcal{O}\big(|\mathcal{M}|^{3/2}{ns}^{\frac{1}{4}}\big)$, where $\mathcal{O}$ absorbs logarithmic factors in $ns$ and $|\mathcal{M}|$.

It is noteworthy that the regret term $\Gamma_{1}$ arises because we apply an off-policy OFU-based strategy instead of an on-policy one.

A-I Upper-bounding $R_{T}^{2}$

We now need to upper-bound the term $R_{T}^{2}$, which arises from the learning error in estimating the mode-dependent minimum dwell time. Beforehand, we need the following ingredients.

$$
\begin{aligned}
R_{T}^{2}&=\mathbb{E}\Big[\sum_{k=0}^{ns-1}\sum_{t=\mathfrak{T}_{k}}^{\bar{\mathfrak{T}}_{k+1}-1}c_{t}^{\sigma(t)}\Big]=\mathbb{E}\Big[\sum_{k=0}^{ns-1}\sum_{t=\mathfrak{T}_{k}}^{\bar{\mathfrak{T}}_{k+1}-1}c_{t}^{i_{k}}\Big]\\
&=\mathbb{E}\Big[\sum_{k=0}^{ns-1}\sum_{t=\mathfrak{T}_{k}}^{\bar{\mathfrak{T}}_{k+1}-1}x_{t}^{\top}\Big(Q^{i_{k}}+K^{\top}(\hat{\Theta}^{i_{k}}_{\bar{\mathfrak{T}}_{k}},Q^{i_{k}},R^{i_{k}})R^{i_{k}}K(\hat{\Theta}^{i_{k}}_{\bar{\mathfrak{T}}_{k}},Q^{i_{k}},R^{i_{k}})\Big)x_{t}\Big]
\end{aligned}
\tag{157}
$$

and from (83) we have the following inequality:

$$
\begin{aligned}
&Q^{i_{k}}+K^{\top}(\hat{\Theta}^{i_{k}}_{\bar{\mathfrak{T}}_{k}},Q^{i_{k}},R^{i_{k}})R^{i_{k}}K(\hat{\Theta}^{i_{k}}_{\bar{\mathfrak{T}}_{k}},Q^{i_{k}},R^{i_{k}})\\
&\quad\preceq P(\hat{\Theta}^{i_{k}}_{\bar{\mathfrak{T}}_{k}},Q^{i_{k}},R^{i_{k}})+2\mu^{\bar{\mathfrak{T}}_{k}}_{i_{k}}\|P(\hat{\Theta}^{i_{k}}_{\bar{\mathfrak{T}}_{k}},Q^{i_{k}},R^{i_{k}})\|_{*}\begin{pmatrix}I\\ K(\hat{\Theta}^{i_{k}}_{\bar{\mathfrak{T}}_{k}},Q^{i_{k}},R^{i_{k}})\end{pmatrix}{V^{i_{k}}}^{-1}_{n_{i_{k}}(\bar{\mathfrak{T}}_{k})}\begin{pmatrix}I\\ K(\hat{\Theta}^{i_{k}}_{\bar{\mathfrak{T}}_{k}},Q^{i_{k}},R^{i_{k}})\end{pmatrix}^{\top}\\
&\quad\preceq\frac{\nu_{i_{k}}}{\sigma_{\omega}^{2}}I+\frac{\alpha_{0}^{i_{k}}}{2}\big(I+K^{\top}(\hat{\Theta}^{i_{k}}_{\bar{\mathfrak{T}}_{k}},Q^{i_{k}},R^{i_{k}})K(\hat{\Theta}^{i_{k}}_{\bar{\mathfrak{T}}_{k}},Q^{i_{k}},R^{i_{k}})\big)\\
&\quad\preceq\Big(\frac{\nu_{i_{k}}}{\sigma_{\omega}^{2}}+\frac{\alpha_{0}^{i_{k}}}{2}(1+\kappa^{2}_{i_{k}})\Big)I,
\end{aligned}
\tag{158}
$$

where in the second inequality we used $\|P(\hat{\Theta}^{i_{k}}_{\bar{\mathfrak{T}}_{k}},Q^{i_{k}},R^{i_{k}})\|\leq\nu_{i_{k}}/\sigma^{2}_{\omega}$, applied ${V^{i_{k}}}_{n_{i_{k}}(s_{1})}^{-1}\preceq(1/\lambda_{i_{k}}^{n_{i_{k}}})I$, and used the definition of $\lambda_{i_{k}}^{n_{i_{k}}}$.

Applying the bound (158) to (157) yields

$$
\begin{aligned}
R_{T}^{2}&\leq\Big(\frac{\nu_{*}}{\sigma_{\omega}^{2}}+\frac{\alpha_{0}^{*}}{2}(1+\kappa^{2}_{*})\Big)\alpha\,\mathbb{E}\sum_{k=0}^{ns-1}(\tau_{md}^{k}-\tau_{*}^{k})\\
&\leq\Big(\frac{\nu_{*}}{\sigma_{\omega}^{2}}+\frac{\alpha_{0}^{*}}{2}(1+\kappa^{2}_{*})\Big)\alpha T,
\end{aligned}
$$

where $T$ is an $\mathcal{O}(|\mathcal{M}|\sqrt{ns})$ term by Theorem 3, which implies $R_{T}^{2}=\mathcal{O}(|\mathcal{M}|\sqrt{ns})$.