
Learning to Seek: Multi-Agent Online Source Seeking Against Non-Stochastic Disturbances

Bin Du, Kun Qian, Christian Claudel, and Dengfeng Sun

Bin Du is with the College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China (iniesdu@nuaa.edu.cn). Kun Qian and Christian Claudel are with the Department of Civil, Architectural, and Environmental Engineering, The University of Texas at Austin, Austin, TX 78712, USA ({kunqian, christian.claudel}@utexas.edu). Dengfeng Sun is with the School of Aeronautics and Astronautics, Purdue University, West Lafayette, IN 47906, USA (dsun@purdue.edu).
Abstract

This paper leverages emerging learning techniques to devise a multi-agent online source seeking algorithm for an unknown environment. Two features are of particular significance in our problem setup: i) the underlying environment is not only unknown, but also dynamically changing and perturbed by two types of non-stochastic disturbances; and ii) a group of agents is deployed and expected to cooperatively seek as many sources as possible. Correspondingly, a new technique of discounted Kalman filtering is developed to tackle the non-stochastic disturbances, and a notion of confidence bound of a polytope nature is utilized to enable computation-efficient cooperation among the multiple agents. Under standard assumptions on the unknown environment as well as the disturbances, our algorithm is shown to achieve sub-linear regrets under both types of non-stochastic disturbances; both results are comparable to the state-of-the-art. Numerical examples on a real-world pollution monitoring application are provided to demonstrate the effectiveness of our algorithm.

I Introduction

The problem of online source seeking, in which one or multiple agents are deployed to adaptively localize the underlying sources in a possibly unknown and disturbed environment, has recently gained increasing attention among researchers in both the control and robotics communities [1, 2, 3, 4]. Two challenges are of particular significance in solving such a source seeking problem: i) how to obtain a reliable perception or estimation of the unknown environment via observations; and ii) how to integrate the environment estimation with task planning so that the agent(s) can seek sources in an online manner.

To tackle the above two challenges, a variety of methodologies have been investigated in the literature, among which the mainstream approaches are typically based on the estimation of environment gradients [5, 6, 7]. Since the sources are often associated with the maximum/minimum values of a function characterizing the state of the environment, gradient-based approaches naturally steer the agents to search along the direction of the estimated gradients toward locations where the gradients are close to zero. An appealing feature of this method is that only local measurements are collected during the searching process, without knowledge of the agents' global positions. However, a critical disadvantage is that the agents are easily trapped at local extrema when the environment cannot be modeled as an ideal convex/concave function.

To further address the above issue, recent methods, building on certain learning techniques [8, 9], interleave the process of learning the unknown environment with source seeking based on the learned environmental information. In particular, a novel algorithm termed AdaSearch is proposed in [8], which leverages the notions of upper and lower confidence bounds to guide the agent's adaptive search for static sources. Our previous work [9] considers a more sophisticated searching scenario, in which i) the unknown environment follows certain linear dynamics, and thus the underlying sources are moving around; and ii) multiple agents are deployed simultaneously with the aim of cooperatively locating as many moving sources as possible. Indeed, one of the significant challenges in such a multi-agent source seeking setup is the combinatorial growth of the search space as the number of agents increases. To deal with this challenge, we developed a novel notion of confidence bound, termed D-UCB, which constructs an appropriate polytope confidence set and helps decompose the search space for each agent. As a consequence, our algorithm achieves linear complexity with respect to the number of agents, which enables computation-efficient cooperation among the multiple agents.

Despite the remarkable feature of our D-UCB algorithm in reducing the computational complexity, one critical drawback is its dependence on precise knowledge of the environment dynamics. Nevertheless, since uncertainties and/or disturbances are almost ubiquitous in practice, an exact model of the environment is rarely available in real-world applications. To take the disturbances in system dynamics into account, a set of classical approaches, such as the linear quadratic regulator, incorporates stochastic process noise which is usually assumed to be independent and identically (Gaussian) distributed, in most cases with zero mean. Recently, with the great advancement of learning theory applied to control problems, relevant works have started to turn to a new paradigm in which the stochastic disturbances are replaced by non-stochastic ones. It is well recognized that, in most problems, the non-stochastic setup is more challenging than the stochastic one, as the standard statistical properties of the disturbances are no longer available. On the other hand, it is also more general, since non-stochastic disturbances can not only characterize the modeling deviation of the environment, but can also be interpreted as disturbances arbitrarily injected by an underlying adversary. As such, we consider in this paper the multi-agent online source seeking problem in the non-stochastic setup, where the environment is perturbed by two types of non-stochastic disturbances. Our objective is to enhance the D-UCB algorithm with the capability of dealing with the non-stochastic disturbances while still enjoying low computational complexity with a guaranteed source seeking performance.

I-A Related Works

As mentioned earlier, the predominant approaches to the source seeking problem, including the well-known technique of extremum seeking control [10, 11], often build on the estimation of environment gradients. These approaches can indeed be viewed as variants of first-order optimization algorithms, which drive the agent to search for local extreme values. In particular, by modeling the unknown environment as a time-invariant and concave real-valued function, the authors in [12] designed a distributed source seeking control law for a group of cooperative agents. A diffusion process is further considered in [13] to investigate the scenario of dynamic environments. The source seeking problem is also studied in [14, 15] by steering the multiple agents into a specific circular formation. In addition, stochastic gradient based methods are proposed in [16, 17] for the case where the gradient estimation is subject to environment and/or measurement noise. We should note that, also inherited from first-order optimization algorithms, the above gradient based methods are very likely to be stuck at local extrema when the considered environment is non-convex/non-concave. Furthermore, the gradient estimation is also sensitive to the measurement/environment noise, and thus additional statistical properties of the noise, such as a known distribution with zero mean, need to be imposed as assumptions in the problem setup.

While it remains unclear how to deal with noises lacking statistical properties in the context of source seeking, non-stochastic disturbances have been considered increasingly broadly in the control community. Within the classical robust control framework, the non-stochastic disturbance is often treated by considering the worst-case performance; see e.g., [18]. However, more recent works related to learning-based control mainly concern the development of adaptive approaches which aim at controlling a typically linear system with adversarial disturbances while optimizing a certain objective function of the system states and control inputs [19, 20, 21, 22, 23]. To measure the performance of adaptive controllers, the notion of regret is adopted; that is, one measures the discrepancy between the gain/cost of the devised controller and that of the best controller in hindsight. In particular, the authors in [19] devise the first $\mathcal{O}(\sqrt{K})$-regret algorithm by assuming a convex cost function and known system dynamics. Afterwards, such a regret bound is sharpened to a logarithmic one in [20, 21] within the same problem setup. To further relax the requirement of known dynamics, the authors in [22] develop an algorithm which attains $\mathcal{O}(K^{2/3})$-regret, and this bound is later improved to $\mathcal{O}(\sqrt{K})$ in [21, 23]. Though the above works have investigated quite thoroughly the non-stochastic setting in the context of learning-based control, we remark that our paper considers a different problem, where some standard conditions in control, such as controllability and observability, can no longer be simply assumed. In fact, our problem is more related to a sequential decision process; that is, the agents make their source seeking decisions in sequence while interacting with their perception of the unknown environment.

This sequential feature also makes our setting closely related to the well-known problem of multi-armed bandits. Therefore, another rich line of relevant works is the series of bandit algorithms. More specifically, when non-stochastic disturbances are involved, linear bandits are investigated within the two settings of non-stationary environments and adversarial corruptions, respectively. While the former interprets the non-stochastic disturbance as a variation of the environment, the latter corresponds to corruptions injected by potential adversaries. Both cases are well studied in the literature, with algorithms guaranteeing sub-linear regrets. To deal with the environmental non-stationarity, the WindowUCB algorithm is first proposed in [24] along with the technique of sliding-window least squares. It is shown that the algorithm achieves a regret of $\widetilde{\mathcal{O}}(K^{2/3}B_K^{1/3})$, where $B_K$ is a measure of the level of non-stationarity. The same regret is proved for the weighted linear bandit algorithm proposed in [25], which leverages a weighted least-squares estimator. Further, a simple restart strategy is developed in [26], obtaining the same regret. It is indeed proved that the $\widetilde{\mathcal{O}}(K^{2/3}B_K^{1/3})$-regret is optimal in the setting of non-stationary bandits. In terms of adversarial bandits, a robust algorithm is proposed in [27] which guarantees an $\widetilde{\mathcal{O}}(B_K K^{3/4})$-regret, and is thus sub-linear only if the level of adversarial corruptions satisfies $B_K=o(K^{1/4})$. More recently, such a regret has been improved to $\widetilde{\mathcal{O}}(B_K+\sqrt{K})$ in [28, 29], which is also shown to be nearly optimal in the adversarial setting. It can be concluded from the above discussion that, once $B_K$ grows sub-linearly, the regrets in both cases are guaranteed to be sub-linear. This is also the state-of-the-art performance that we aim to achieve with the algorithm developed in this work.

I-B Statement of Contributions

This paper proposes an online source seeking algorithmic framework using emerging learning techniques, which is capable of i) dealing with the unknown environment in the presence of non-stochastic disturbances; and ii) taking advantage of the cooperation within the multi-agent network. In terms of the non-stochastic disturbances, two specific types are considered: i) an external one which disturbs the measurable states of the environment; and ii) an internal one which truly evolves with the environment dynamics. To deal with them, a unified technique of discounted Kalman filtering is proposed to estimate the unknown environment states while mitigating the disturbances. Meanwhile, to build the cooperation among multiple agents and avoid the combinatorial complexity, we leverage the polytope confidence set, and as a result, the proposed algorithm is exceptionally computation-efficient in the multi-agent setting. The regret analysis shows that our algorithm attains sub-linear regrets against both types of non-stochastic disturbances; the two obtained regrets are both comparable to the state-of-the-art in the studies of non-stationary and adversarial bandit algorithms. Finally, all theoretical findings are validated by simulation examples on a real-world pollution monitoring application.

II Problem Statement

II-A Unknown Environment with Non-Stochastic Disturbances

Consider an obstacle-free environment, assumed to be bounded and discretized by a finite set of points $\mathcal{S}$, where each $\mathbf{s}\in\mathcal{S}$ represents the corresponding position. Suppose that the unknown state of the environment at each discrete time $k$ is described by a real-valued function $\phi_k(\cdot):\mathcal{S}\to\mathbb{R}_+$ which maps the positional information $\mathbf{s}$ to a positive quantity $\phi_k(\mathbf{s})$ indicating the environmental value of interest. Let $N$ denote the total number of points, i.e., $N=|\mathcal{S}|$, and for simplicity, let $\bm{\phi}_k\in\mathbb{R}_+^N$ denote the vector which stacks all individual $\phi_k(\mathbf{s})$. Further, to characterize the dynamics of the changing environment, we consider that the evolution of the state $\bm{\phi}_k$ is governed by the following nominal linear time-varying (LTV) model

$$\bm{\phi}_{k+1}=A_{k+1}\bm{\phi}_k, \tag{1}$$

where the state transition matrix $A_k\in\mathbb{R}^{N\times N}$ is assumed to be known a priori. In order for the considered source seeking problem to be well-defined, we need the state $\bm{\phi}_k$ to be neither explosive nor vanishing to zero, which can be ensured by the following assumption.

Assumption 1

For the LTV dynamics (1), there exists a pair of uniform lower and upper bounds $0<\underaccent{\bar}{\alpha}\leq\bar{\alpha}<\infty$ such that, for all $k\geq t>0$,

$$\underaccent{\bar}{\alpha}\cdot\mathbf{I}_N \preceq A[k:t]^{\top}A[k:t] \preceq \bar{\alpha}\cdot\mathbf{I}_N, \tag{2}$$

where $\mathbf{I}_N$ represents the $N\times N$ identity matrix and the state propagation matrix is defined as $A[k:t]:=A_k A_{k-1}\cdots A_t$. (By convention, we let $A[k:t]=\mathbf{I}_N$ when $k<t$.)

We should note that the above Assumption 1 not only helps confine the behavior of the environment states, but also implies the invertibility of the state transition matrices $A_k$'s, which aids the subsequent regret analysis of our algorithm. In fact, such an assumption is not unusual in the study of system control and estimation problems; see e.g., [30, 31, 32, 33].
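As a quick illustration (not part of the paper's formal development), condition (2) can be checked numerically for a given sequence of transition matrices; the list `A_list` and the bounds below are assumptions made purely for this sketch.

```python
import numpy as np

def propagation_matrix(A_list, k, t):
    """A[k:t] = A_k A_{k-1} ... A_t, with A[k:t] = I when k < t (by convention).
    Here A_list[j] stores A_j."""
    P = np.eye(A_list[0].shape[0])
    for j in range(t, k + 1):
        P = A_list[j] @ P
    return P

def satisfies_assumption_1(A_list, alpha_lo, alpha_hi):
    """Check alpha_lo*I <= A[k:t]^T A[k:t] <= alpha_hi*I for all k >= t > 0
    by inspecting the extreme eigenvalues of each product."""
    K = len(A_list)
    for t in range(1, K):
        for k in range(t, K):
            P = propagation_matrix(A_list, k, t)
            eigs = np.linalg.eigvalsh(P.T @ P)
            if eigs.min() < alpha_lo or eigs.max() > alpha_hi:
                return False
    return True
```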

Now, in order to further impose the underlying disturbances into the environment model, let us consider the following two types of non-stochastic ones on top of the nominal dynamics:

$$\texttt{Type I}:\quad \widetilde{\bm{\phi}}_{k+1}=A_{k+1}\bm{\phi}_k+\bm{\delta}_k, \tag{3a}$$
$$\texttt{Type II}:\quad \widetilde{\bm{\phi}}_{k+1}=A_{k+1}\widetilde{\bm{\phi}}_k+\bm{\delta}_k. \tag{3b}$$

Note that in both types, $\widetilde{\bm{\phi}}_k\in\mathbb{R}_+^N$ denotes the disturbed state. However, while the first type of disturbance can be interpreted as an external one, since $\bm{\phi}_k$ in (3a) still evolves according to the nominal dynamics (1) and the disturbance $\bm{\delta}_k$ only affects the state $\widetilde{\bm{\phi}}_{k+1}$ for one step, the second type can be viewed as an internal one, since the disturbance $\bm{\delta}_k$ is intrinsically imposed into the dynamics and accumulates during the evolution of $\widetilde{\bm{\phi}}_k$. In fact, both types of disturbances find a wide range of real-world applications. For instance, in the scenario of pollution monitoring investigated in our simulations, the external disturbance could correspond to certain unrelated emitters which do not change the locations of the sources of interest but interfere with the perceptible environment states, while the internal one might result from environmental conditions, such as wind, which truly affect the diffusion of pollutants and thus change their positions. This example also suggests that the localization of sources should be treated differently in the two cases; more details will be found in Section II-B. In addition, we note that the internal disturbance can also capture, to some extent, the unmodeled dynamics of the unknown environment. However, no matter which type of disturbance is involved in the process, only the disturbed state $\widetilde{\bm{\phi}}_k$ is measurable by the agents which are later deployed to operate in the environment.

As we have remarked earlier, the disturbances of both types are supposed to be non-stochastic, i.e., no statistical property in any form is assumed regarding $\bm{\delta}_k$. Instead, to characterize the effect of both disturbances in the long term, we impose the following assumption.

Assumption 2

There exists a positive sequence $\{B_K\}_{K\in\mathbb{N}_+}$ such that, for all $K\geq 0$,

$$\sum_{k=0}^{K}\|\bm{\delta}_k\|\leq B_K. \tag{4}$$
Remark 1

The sequence $\{B_K\}_{K\in\mathbb{N}_+}$ in Assumption 2 is not necessarily required to be bounded by a constant in our work. In fact, we consider the problem under the condition that $B_K$ increases at a sub-linear rate and aim to provide a performance guarantee for our algorithm in terms of its dependence on $B_K$. A sub-linearly increasing $B_K$ often implies that either the total number of occurrences of the disturbance $\bm{\delta}_k$ grows sub-linearly, or the magnitude $\|\bm{\delta}_k\|$ vanishes over the time-steps $k$. While the former is often referred to as an abruptly-changing disturbance, the latter is regarded as a slowly-varying one. In addition, in the context of learning theory in adversarial/non-stationary settings, such a sequence $\{B_K\}_{K\in\mathbb{N}_+}$ is also viewed as the attack budget of an adversary; see e.g., [34, 28].
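To give two concrete instances (ours, for illustration only): if a uniformly bounded disturbance $\|\bm{\delta}_k\|\leq c$ occurs only at the square time-steps $k\in\{1,4,9,16,\cdots\}$, then $B_K\leq c\sqrt{K}$ (an abruptly-changing disturbance); if instead $\|\bm{\delta}_k\|\leq c/k^2$ for all $k\geq 1$, then $B_K\leq c\pi^2/6$ is even bounded by a constant (a slowly-varying one). Both instances satisfy the sub-linearity assumed in the sequel.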

II-B Multi-Agent Source Seeking

With the aim of locating the potential sources, which usually correspond to the extreme values of the unknown environment state, we deploy a network of $I$ agents and expect each of them, $i\in\mathcal{I}:=\{1,2,\cdots,I\}$, to seek its best position $\mathbf{p}_k^{\star}[i]\in\mathcal{S}$ at each time $k$ by solving the following maximization problem,

$$\mathop{\text{maximize}}_{\mathbf{p}[i]\in\mathcal{S},\,i\in\mathcal{I}}\quad F_k(\mathbf{p}[1],\mathbf{p}[2],\cdots,\mathbf{p}[I])=\sum_{\mathbf{s}\in\cup_{i=1}^{I}\mathbf{p}[i]}\phi_k(\mathbf{s}). \tag{5}$$

Notice that the summation in the objective function $F_k(\cdot):\mathcal{S}^I\to\mathbb{R}_+$ is taken over the union of the positions $\mathbf{p}[i]$'s; therefore, all agents naturally tend to occupy as many distinct positions as possible in order to maximize $F_k(\cdot)$. In addition, it is now clear why Assumption 1 is needed: the maximization in (5) is otherwise not well-defined if the environment state $\bm{\phi}_k$ explodes or vanishes to zero. Further, an inherent difference arises in which state should be counted in the objective function for the two types of disturbances. More precisely, for the first type of disturbance, i.e., the external one, the positions of the sources should be reflected by the undisturbed $\bm{\phi}_k$, even though only the disturbed $\widetilde{\bm{\phi}}_k$ is measurable by the agents. On the contrary, for the second type, i.e., the internal disturbance, the disturbed $\widetilde{\bm{\phi}}_k$ should be taken into account in (5), since $\bm{\delta}_k$ evolves with the environment dynamics and changes the positions of the sources. On this account, we emphasize that while the maximization problem (5) is precisely the one that the agents would like to solve under the external disturbance, for the internal one the objective function should be amended as

$$\widetilde{F}_k(\mathbf{p}[1],\mathbf{p}[2],\cdots,\mathbf{p}[I])=\sum_{\mathbf{s}\in\cup_{i=1}^{I}\mathbf{p}[i]}\widetilde{\phi}_k(\mathbf{s}). \tag{6}$$

With the above difference in the objective functions, the main challenges in solving the two maximization problems are also distinct in principle. While the former requires extracting the true information hidden in $\bm{\phi}_k$ when only $\widetilde{\bm{\phi}}_k$ is accessible, the latter requires identifying and compensating for the unmodeled disturbance $\bm{\delta}_k$. Despite this difference, we develop in this paper a unified algorithmic framework for both cases, enabling the agents to track the dynamical sources in an online manner; we remark that this is one of the main contributions of our work.
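To make the structure of (5) concrete, observe that, because the objective sums the environment values over the union of the selected positions, its exact maximizer is simply the set of $I$ positions carrying the largest values. A minimal sketch of this selection (our illustration, not part of the algorithm's formal statement) is:

```python
import numpy as np

def best_positions(phi, I):
    """Exact maximizer of (5): since F_k sums the values at the distinct
    selected positions, picking the I largest entries of phi suffices --
    a linear-time selection instead of a combinatorial search over S^I."""
    return np.argpartition(phi, -I)[-I:]
```

This separable structure is precisely what the polytope confidence set of Section III-B exploits to keep the per-step complexity linear in the number of agents.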

Another common technical issue, regardless of the type of disturbance involved, is the estimation of the environment. To this end, we leverage the following linear stochastic measurement model,

$$\mathbf{z}_k^i=H^i\big(\mathbf{p}_k[i]\big)\widetilde{\bm{\phi}}_k+\mathbf{n}_k^i, \tag{7}$$

where $\mathbf{z}_k^i\in\mathbb{R}^m$ is the $i$-th agent's measurement at time-step $k$; $H^i\big(\mathbf{p}_k[i]\big)\in\mathbb{R}^{m\times N}$ denotes the measurement matrix depending on the agent's position $\mathbf{p}_k[i]$; and $\mathbf{n}_k^i\in\mathbb{R}^m$ is the measurement noise, assumed to be independent and identically distributed (i.i.d.) Gaussian with zero mean and variance $V^i=v^i\cdot\mathbf{I}_m$. We note that the measurement matrix $H^i\big(\mathbf{p}_k[i]\big)$ is not specified in (7); in fact, it can be defined by various means based on the agent's position. Nevertheless, we assume that each $H^i\big(\mathbf{p}_k[i]\big)$ has the following basic form,

$$H^i\big(\mathbf{p}_k[i]\big)=\big[\mathbf{e}_l\big]^{\top}_{l\in\mathcal{C}_k^i}, \tag{8}$$

where $\mathbf{e}_l$ denotes the unit vector, i.e., the $l$-th column of the identity matrix, and $\mathcal{C}_k^i$ is the set of positions covered by the agent's sensing area at time-step $k$. It is natural to assume that the position where the agent currently locates falls into its sensing area, i.e., $\mathbf{p}_k[i]\in\mathcal{C}_k^i$.
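For concreteness, a minimal construction of (7)–(8) might look as follows, assuming (as in our simulations) a $D\times D$ grid and a circular sensing area of radius $R$; the grid indexing convention is an illustrative choice.

```python
import numpy as np

def measurement_matrix(pos, D, R):
    """Build H^i(p) of (8) on a D x D grid: one row e_l^T for each grid
    cell l inside the circular sensing area of radius R centered at the
    agent position `pos` (a (row, col) pair); cell (r, c) has index r*D + c."""
    rows = []
    for r in range(D):
        for c in range(D):
            if (r - pos[0]) ** 2 + (c - pos[1]) ** 2 <= R ** 2:
                e = np.zeros(D * D)
                e[r * D + c] = 1.0
                rows.append(e)
    return np.vstack(rows)

# A noisy measurement following (7):
rng = np.random.default_rng(0)
D, R, v = 50, 5, 4.0
phi_tilde = rng.uniform(0.0, 1.0, D * D)     # placeholder disturbed state
H = measurement_matrix((10, 10), D, R)
z = H @ phi_tilde + rng.normal(0.0, np.sqrt(v), H.shape[0])
```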

III Development of the Online Algorithm

In this section, we develop our online source seeking algorithm, which relies on two central ingredients: 1) a discounted Kalman filter, which is capable of providing an estimate of the unknown environment while dealing with the two types of non-stochastic disturbances in a unified framework; and 2) a D-UCB approach, which helps determine the agents' seeking positions sequentially in a computation-efficient manner.

III-A Estimation of the Environment States with Disturbances

According to the measurement model (7) introduced in the previous section, let us first express it in a compact form which accounts for all agents within the network. For this purpose, we stack all measurements $\mathbf{z}_k^i$'s and noises $\mathbf{n}_k^i$'s into the concatenated vectors $\mathbf{z}_k\in\mathbb{R}^M$ and $\mathbf{n}_k\in\mathbb{R}^M$ with $M=mI$, e.g., $\mathbf{z}_k:=[(\mathbf{z}_k^1)^{\top},(\mathbf{z}_k^2)^{\top},\cdots,(\mathbf{z}_k^I)^{\top}]^{\top}\in\mathbb{R}^M$. Likewise, we define the concatenated measurement matrix $H_k\in\mathbb{R}^{M\times N}$ by stacking all local $H^i(\mathbf{p}_k[i])$'s. Consequently, the measurement model in compact form can be written as

$$\mathbf{z}_k=H_k\widetilde{\bm{\phi}}_k+\mathbf{n}_k. \tag{9}$$

Note that, in the notation $H_k$, we have absorbed for simplicity the dependency on the agents' positions $\mathbf{p}_k[i]$'s into the index $k$. In addition, by our assumption on the measurement noise, the concatenated noise $\mathbf{n}_k$ is also i.i.d. Gaussian with zero mean and variance

$$V:=\text{Diag}\{V^1,V^2,\cdots,V^I\}\in\mathbb{R}^{M\times M}. \tag{10}$$

Equipped with the agents' measurement model in its compact form (9), we are now ready to present the technique of discounted Kalman filtering. As in the standard Kalman filter, we use a mean $\widehat{\bm{\phi}}_k\in\mathbb{R}^N$ and a covariance $\Sigma_k\in\mathbb{R}^{N\times N}$ to recursively generate estimates of the unknown environment. However, a primary difference is that two positive sequences of weights $\{\lambda_k\}_{k\in\mathbb{N}_+}$ and $\{\omega_k\}_{k\in\mathbb{N}_+}$ are imposed in the filtering process with the aim of mitigating the effect of the disturbances present in the environment. Keeping this in mind, the discounted Kalman filter performs the following recursions,

$$\Sigma_{k+1/2}=\big(\Sigma_k^{-1}+\lambda_k Y_k\big)^{-1}, \tag{11a}$$
$$\widehat{\bm{\phi}}_{k+1/2}=\widehat{\bm{\phi}}_k+\lambda_k\Sigma_{k+1/2}(\mathbf{y}_k-Y_k\widehat{\bm{\phi}}_k), \tag{11b}$$
$$\Sigma_{k+1}=A_{k+1}\big(\Sigma_{k+1/2}^{-1}+(\omega_k-\omega_{k-1})\Gamma_k^{-1}\big)^{-1}A_{k+1}^{\top}, \tag{11c}$$
$$\widehat{\bm{\phi}}_{k+1}=\Sigma_{k+1}A_{k+1}^{-\top}\Sigma_{k+1/2}^{-1}\widehat{\bm{\phi}}_{k+1/2}, \tag{11d}$$
$$\Gamma_{k+1}=A_{k+1}\Gamma_k A_{k+1}^{\top}. \tag{11e}$$

Notice that $\Sigma_{k+1/2}\in\mathbb{R}^{N\times N}$ and $\widehat{\bm{\phi}}_{k+1/2}\in\mathbb{R}^N$ here denote intermediate results of the recursions; $\Gamma_k\in\mathbb{R}^{N\times N}$ is an auxiliary matrix initialized by $\Gamma_0=\mathbf{I}_N$; and the variables $\mathbf{y}_k:=H_k^{\top}V^{-1}\mathbf{z}_k\in\mathbb{R}^N$ and $Y_k:=H_k^{\top}V^{-1}H_k\in\mathbb{R}^{N\times N}$, which can be readily acquired by consensus schemes, e.g., [35], incorporate the latest measurements into the update of the estimates. Next, to better show how the imposed weights help deal with the non-stochastic disturbances, we present in the subsequent lemma another expression of the discounted Kalman filter (11).
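Before presenting the lemma, we note that one step of (11) may be sketched as follows; this is our illustration with plain matrix inversions for readability, not an optimized implementation.

```python
import numpy as np

def discounted_kf_step(phi_hat, Sigma, Gamma, A_next, y, Y, lam, omega_diff):
    """One step of the discounted Kalman recursion (11).
    y = H_k^T V^{-1} z_k and Y = H_k^T V^{-1} H_k aggregate the agents'
    measurements; lam = lambda_k and omega_diff = omega_k - omega_{k-1}."""
    Sigma_half = np.linalg.inv(np.linalg.inv(Sigma) + lam * Y)            # (11a)
    phi_half = phi_hat + lam * Sigma_half @ (y - Y @ phi_hat)             # (11b)
    Sigma_next = A_next @ np.linalg.inv(
        np.linalg.inv(Sigma_half) + omega_diff * np.linalg.inv(Gamma)
    ) @ A_next.T                                                          # (11c)
    phi_next = Sigma_next @ np.linalg.inv(A_next).T @ (
        np.linalg.inv(Sigma_half) @ phi_half)                             # (11d)
    Gamma_next = A_next @ Gamma @ A_next.T                                # (11e)
    return phi_next, Sigma_next, Gamma_next
```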

Lemma 1

Suppose that the state estimates $\widehat{\bm{\phi}}_k$ and $\Sigma_k$ are generated by (11) with the initialization $\widehat{\bm{\phi}}_0$, $\Sigma_0$ and $\omega_{-1}=0$; then at each iteration $k$, it is equivalent to have

$$\Sigma_k=A[k:1]\Upsilon_k^{-1}A[k:1]^{\top}; \tag{12a}$$
$$\widehat{\bm{\phi}}_k=A[k:1]\Upsilon_k^{-1}\Big(\Sigma_0^{-1}\widehat{\bm{\phi}}_0+\sum_{t=0}^{k-1}\lambda_t A[t:1]^{\top}\mathbf{y}_t\Big), \tag{12b}$$

where the matrix $\Upsilon_k\in\mathbb{R}^{N\times N}$ is defined as

$$\Upsilon_k:=\Sigma_0^{-1}+\sum_{t=0}^{k-1}\lambda_t A[t:1]^{\top}Y_t A[t:1]+\omega_{k-1}\cdot\mathbf{I}_N. \tag{13}$$
Proof:

See Appendix I. ∎

Remark 2

According to the form (12) of the discounted Kalman filter, it can be observed that the sequence $\{\lambda_k\}_{k\in\mathbb{N}_+}$ serves to adjust the weights on the measurements obtained during the process. Since the cumulative magnitude of the disturbance is upper bounded by the sequence $\{B_K\}_{K\in\mathbb{N}_+}$ (see Assumption 2), the influence of the disturbances in general vanishes over time if $B_K$ increases sub-linearly. In this case, a significant disturbance which took place at an early stage can be expected to be gradually mitigated by the discounted Kalman filtering. Further, unlike the weight $\lambda_k$, which is only applied to the measurements locally, the other sequence of weights $\{\omega_k\}_{k\in\mathbb{N}_+}$ is applied to globally adjust the covariance $\Sigma_k$, so that it can compensate for the effect of internal disturbances more directly.

III-B Multi-Agent Online Source Seeking via D-UCB

Based on $\widehat{\bm{\phi}}_k$ and $\Sigma_k$, we now introduce the key notion of the D-UCB $\bm{\mu}_k\in\mathbb{R}^N$, which is defined as follows,

$$\bm{\mu}_k:=\widehat{\bm{\phi}}_k+\beta_k(\delta)\cdot\text{diag}^{1/2}(\Sigma_k). \tag{14}$$

Note that the operator $\text{diag}^{1/2}(\cdot):\mathbb{R}^{N\times N}\to\mathbb{R}^N$ maps the square roots of the diagonal elements of a matrix to a vector, and the sequence $\{\beta_k(\delta)\}_{k\in\mathbb{N}_+}$, which depends on a predefined confidence level $\delta$, will be specified in the next section. With the aid of the D-UCB $\bm{\mu}_k$, one can update the agents' seeking positions in an online manner by solving the following maximization problem:

$$\mathbf{p}_k\in\operatorname*{arg\,max}_{\mathbf{p}[i]\in\mathcal{S},\,i\in\mathcal{I}}\;\sum_{\mathbf{s}\in\cup_{i=1}^{I}\mathbf{p}[i]}\mu_k(\mathbf{s}). \tag{15}$$

Here, $\mathbf{p}_k\in\mathcal{S}^I$ stacks the decided seeking positions $\mathbf{p}_k[i]$'s of all agents, and likewise, $\mu_k(\mathbf{s})\in\mathbb{R}$ represents the component of the vector $\bm{\mu}_k$ which corresponds to the position $\mathbf{s}\in\mathcal{S}$. The complete multi-agent online source seeking scheme under an environment with disturbances is outlined in Algorithm 1.
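Steps 3 and 4 of the algorithm below can be sketched jointly as follows; as with the earlier selection sketch for (5), the separable objective makes (15) an exact top-$I$ selection.

```python
import numpy as np

def ducb_positions(phi_hat, Sigma, beta, I):
    """Compute the D-UCB index (14) and solve (15) exactly: mu is an
    element-wise upper confidence bound, so the maximizing position set
    consists of the I entries with the largest values of mu."""
    mu = phi_hat + beta * np.sqrt(np.diag(Sigma))   # (14)
    return np.argpartition(mu, -I)[-I:]             # (15)
```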

Initialization: Each agent $i$ initializes its estimates $\widehat{\bm{\phi}}_0$
and $\Sigma_0$, and computes its initial position $\mathbf{p}_0[i]$. Set the
confidence level $\delta$ and generate the sequence $\{\beta_k(\delta)\}_{k\in\mathbb{N}_+}$.
while the stopping criterion is NOT satisfied do
       Each agent $i$ simultaneously performs
       Step 1 (Measuring): Obtain the measurement $\mathbf{z}_k^i$ based on the measurement matrix $H^i(\mathbf{p}_k[i])$;

      Step 2 (Discounted Kalman Filtering): Collect information from neighbors, and obtain the estimates $\widehat{\bm{\phi}}_{k+1}$ and $\Sigma_{k+1}$ by (11);
      Step 3 (D-UCB Computing): Compute the updated D-UCB $\bm{\mu}_{k+1}$ via (14) using $\widehat{\bm{\phi}}_{k+1}$ and $\Sigma_{k+1}$;
      Step 4 (Seeking Position Updating): Assign the new seeking position $\mathbf{p}_{k+1}[i]$ by solving (15).
      Let $k\leftarrow k+1$, and continue.
end while
Algorithm 1 Multi-agent online source seeking under an environment with non-stochastic disturbances

IV Regret Analysis

In this section, we provide theoretical performance guarantees for our algorithm via the notion of regret. More specifically, we perform the regret analysis for the two cases subject to the two types of non-stochastic disturbances, respectively. By showing sub-linear cumulative regrets in both cases, we ensure that the agents are capable of tracking the dynamical sources under an unknown and disturbed environment.

IV-A On the Disturbance of Type I

As noted in the previous discussion, for the first type of disturbance, the objective function $F_k(\cdot)$ in (5) takes into account the undisturbed state $\bm{\phi}_k$. Therefore, we introduce the notion of regret for the first case as follows,

$$r_k:=F_k(\mathbf{p}_k^{\star})-F_k(\mathbf{p}_k), \tag{16}$$

where $\mathbf{p}_k^{\star}$ denotes the optimal solution of problem (5) and $\mathbf{p}_k$ the decision generated by our source seeking algorithm. We aim to show that the cumulative regret $R_K:=\sum_{k=0}^{K}r_k$ increases sub-linearly with respect to the number of time-steps $K$; namely, the regret $r_k$ converges to zero on average. To this end, let us first establish the following result, which formalizes that the D-UCB $\bm{\mu}_k$ indeed provides a valid upper bound for the unknown state $\bm{\phi}_k$.

Proposition 1

Under Assumptions 1 and 2, let $\widehat{\bm{\phi}}_k$ and $\Sigma_k$ be generated by the discounted Kalman filter (11) with $\omega_k\equiv 0$ and $\lambda_k=\min\{1,\bar{\lambda}/\|Y_k\|_{\Sigma_k}\}$. Suppose that the initialization satisfies $\underaccent{\bar}{\sigma}\cdot\mathbf{I}_N\preceq\Sigma_0\preceq\bar{\sigma}\cdot\mathbf{I}_N$ and likewise the noise variance satisfies $\underaccent{\bar}{v}\cdot\mathbf{I}_M\preceq V\preceq\bar{v}\cdot\mathbf{I}_M$; then it holds that,

$$\mathbb{P}(\bm{\phi}_k\preceq\bm{\mu}_k)\geq 1-\delta,\quad\forall k\geq 0, \tag{17}$$

where $\preceq$ is defined element-wise, the probability $\mathbb{P}(\cdot)$ is taken over the random noises $(\mathbf{n}_1,\mathbf{n}_2,\cdots,\mathbf{n}_k)$, and the sequence $\{\beta_k(\delta)\}_{k\in\mathbb{N}_+}$ in the D-UCB is chosen to satisfy

$$\beta_k(\delta)\geq\sqrt{N}\cdot\Bigg(\bar{\lambda}B_k+C_1+C_2\sqrt{N}\cdot\sqrt{\log\Big(\frac{\bar{\sigma}/\underaccent{\bar}{\sigma}+\bar{\alpha}\bar{\sigma}\cdot k/\underaccent{\bar}{v}^2}{\delta^{2/N}}\Big)}\Bigg), \tag{18}$$

where $B_k$ is defined in Assumption 2, $C_1=\|\widehat{\bm{\phi}}_0-\bm{\phi}_0\|/\sqrt{\underaccent{\bar}{\sigma}}$, and $C_2=\bar{v}^2\sqrt{\max\{2,2/\underaccent{\bar}{v}\}}$.

Proof:

See Appendix II-A. ∎

It can be concluded from Proposition 1 that the D-UCB $\bm{\mu}_k$ is guaranteed to be an upper bound for $\bm{\phi}_k$ with probability at least $1-\delta$. In fact, since the disturbance of type I does not truly evolve with the environment dynamics, the weight $\omega_k$ is set to zero during the whole process. Further, to extract the true information, we set the weight $\lambda_k$ adaptively according to the current estimate of the environment. Since the estimate covariance $\Sigma_k$ generally decreases as more measurements are absorbed during the filtering process, the sequence $\{\lambda_k\}_{k\in\mathbb{N}_+}$ increases, with an upper bound set to $\bar{\lambda}$. With the help of Proposition 1, we are now ready to present the regret analysis of our algorithm.

Theorem 1

Suppose that $\{\mathbf{p}_k\}_{k\in\mathbb{N}_+}$ is the sequence generated by Algorithm 1 under the conditions in Proposition 1, and let $\bar{\lambda}$ be specified as $\bar{\lambda}=\sqrt{N}/B_K$; then it holds with probability at least $1-\delta$ that,

$$R_K\leq\widetilde{\mathcal{O}}\Big(N^2\sqrt{K}+N^{5/2}B_K\Big),\quad\forall K>0. \tag{19}$$
Proof:

See Appendix II-B. ∎

IV-B On the Disturbance of Type II

Similar to the previous analysis, we rely on the notion of regret to provide a performance guarantee for our algorithm in this part. However, considering the different features of the disturbance of type II (see Section II-B for details), the definition of regret should be amended accordingly,

$$\widetilde{r}_k:=\widetilde{F}_k(\mathbf{p}_k^{\star})-\widetilde{F}_k(\mathbf{p}_k), \tag{20}$$

where the objective function $\widetilde{F}_k(\cdot)$ is defined in (6). Likewise, we aim to show a sub-linear cumulative regret, i.e., that $\widetilde{R}_K:=\sum_{k=0}^{K}\widetilde{r}_k$ increases sub-linearly with respect to $K$.

Since the disturbance of type II is imposed in the environment dynamics, the current state $\widetilde{\bm{\phi}}_k$ inherently accumulates all disturbances prior to time $k$. As a result, Assumptions 1 and 2 do not necessarily imply that $\widetilde{\bm{\phi}}_k$ is upper bounded if the sequence $B_K$ is allowed to increase indefinitely. Thus, to ensure the well-posedness of our problem, we need an additional assumption.

Assumption 3

There exists a uniform upper bound $\bar{\phi}>0$ such that $\|\widetilde{\bm{\phi}}_k\|\leq\bar{\phi}$ for all $k\geq 0$.

Now, we follow a similar path as in the previous analysis to show the sub-linearity of the regret $\widetilde{R}_K$. Note that, due to the long-term effect of the second type of disturbance on the state $\widetilde{\bm{\phi}}_k$, one cannot expect the D-UCB $\bm{\mu}_k$ to serve as an upper bound for $\widetilde{\bm{\phi}}_k$ directly. To deal with this issue, we construct an auxiliary variable $\overline{\bm{\phi}}_k\in\mathbb{R}^N$, i.e.,

$$\overline{\bm{\phi}}_k:=A[k:1]\Upsilon_k^{-1}\Big(\Sigma_0^{-1}\widetilde{\bm{\phi}}_0+\sum_{t=0}^{k-1}\lambda_t A[t:1]^{\top}H_t^{\top}V^{-1}H_t\widetilde{\bm{\phi}}_t+\lambda_{k-1}A[k:1]^{-1}\widetilde{\bm{\phi}}_k\Big), \tag{21}$$

which helps build a connection between $\bm{\mu}_k$ and the state $\widetilde{\bm{\phi}}_k$, as shown in the following proposition.

Proposition 2

Under Assumptions 1–3 and the conditions in Proposition 1, let $\widehat{\bm{\phi}}_k$ and $\Sigma_k$ be generated by the discounted Kalman filter (11) with $\lambda_k=\omega_k=(1/\gamma)^k$ where $0<\gamma<1$; then it holds that,

$$\mathbb{P}(\overline{\bm{\phi}}_k\preceq\bm{\mu}_k)\geq 1-\delta,\quad\forall k\geq 0, \tag{22}$$

when the sequence $\{\beta_k(\delta)\}_{k\in\mathbb{N}_+}$ in the D-UCB satisfies

$$\beta_k(\delta)\geq\sqrt{N}\cdot\Bigg(C_1+C_3\gamma^{(1-k)/2}+C_2\sqrt{N}\gamma^{(1-k)/2}\cdot\sqrt{\log\Big(\frac{1+\bar{\alpha}/\underaccent{\bar}{v}^2\cdot\sum_{t=0}^{k-1}\gamma^{2(k-t-1)}}{\delta^{2/N}}\Big)}\Bigg), \tag{23}$$

where $C_1$ and $C_2$ are defined the same as in Proposition 1 and $C_3=\bar{\phi}/\sqrt{\underaccent{\bar}{\alpha}}$.

Proof:

See Appendix II-C. ∎

Proposition 2 shows that the D-UCB $\bm{\mu}_k$ provides a valid upper bound for the constructed variable $\overline{\bm{\phi}}_k$ if $\beta_k(\delta)$ is chosen appropriately. To further build the connection between $\bm{\mu}_k$ and the true state $\widetilde{\bm{\phi}}_k$, it can be shown that the discrepancy between $\overline{\bm{\phi}}_k$ and $\widetilde{\bm{\phi}}_k$ is bounded by a term related to the disturbances $\bm{\delta}_k$. For the sake of presentation, such a result is deferred, and we directly state the sub-linear regret of our algorithm in the following theorem; the bound on $\|\overline{\bm{\phi}}_k-\widetilde{\bm{\phi}}_k\|$ appears as an intermediate step in the proof of the theorem in the Appendix.

Theorem 2

Suppose that $\{\mathbf{p}_k\}_{k\in\mathbb{N}_+}$ is the sequence generated by Algorithm 1 under the conditions in Proposition 2, and let $\gamma$ be specified as $\gamma=1-(B_K/K)^{2/3}$; then it holds with probability at least $1-\delta$ that,

$$\widetilde{R}_K\leq\widetilde{\mathcal{O}}\Big(N^2 B_K^{1/3}K^{2/3}\Big),\quad\forall K>0. \tag{24}$$
Proof:

See Appendix II-D. ∎

Remark 3

Note that, in Proposition 2 and Theorem 2, the two weights $\lambda_k$ and $\omega_k$ are specified as $\lambda_k=\omega_k=(1/\gamma)^k$ with $\gamma<1$. This means that they increase exponentially with the time-step $k$; therefore, numerical overflow may arise in the discounted Kalman filtering (11) when $k$ is large. To deal with this issue, we note that the discounted Kalman filter, when $\lambda_k$ and $\omega_k$ are chosen as above, can be implemented equivalently by the following recursions,

$$\widetilde{\Sigma}_{k+1/2}=\big(\gamma\widetilde{\Sigma}_k^{-1}+Y_k\big)^{-1}, \tag{25a}$$
$$\widehat{\bm{\phi}}_{k+1/2}=\widehat{\bm{\phi}}_k+\widetilde{\Sigma}_{k+1/2}(\mathbf{y}_k-Y_k\widehat{\bm{\phi}}_k), \tag{25b}$$
$$\widetilde{\Sigma}_{k+1}=A_{k+1}\Big(\widetilde{\Sigma}_{k+1/2}^{-1}+(1-\gamma)\Gamma_k^{-1}\Big)^{-1}A_{k+1}^{\top}, \tag{25c}$$
$$\widehat{\bm{\phi}}_{k+1}=\Big(A_{k+1}-(1-\gamma)\widetilde{\Sigma}_{k+1}A_{k+1}^{-\top}\Gamma_k^{-1}\Big)\widehat{\bm{\phi}}_{k+1/2}, \tag{25d}$$
where $\Gamma_k$ is defined as before. It should also be noted that the covariance in (25) differs slightly from the one in (11), in the sense that $\widetilde{\Sigma}_k=(1/\gamma)^{k-1}\Sigma_k$. This needs to be taken into account in Algorithm 1 when generating the D-UCB $\bm{\mu}_k$ from $\widetilde{\Sigma}_k$.
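A minimal sketch of one step of the rescaled recursion (25) is given below; as before, the plain matrix inversions are for readability only.

```python
import numpy as np

def stable_dkf_step(phi_hat, Sigma_t, Gamma, A_next, y, Y, gamma):
    """One step of (25). Sigma_t plays the role of the rescaled covariance
    Sigma_tilde_k = (1/gamma)^{k-1} Sigma_k, which avoids the numerical
    overflow of lambda_k = omega_k = (1/gamma)^k in the raw recursion (11)."""
    S_half = np.linalg.inv(gamma * np.linalg.inv(Sigma_t) + Y)            # (25a)
    phi_half = phi_hat + S_half @ (y - Y @ phi_hat)                       # (25b)
    S_next = A_next @ np.linalg.inv(
        np.linalg.inv(S_half) + (1.0 - gamma) * np.linalg.inv(Gamma)
    ) @ A_next.T                                                          # (25c)
    phi_next = (A_next - (1.0 - gamma) * S_next @ np.linalg.inv(A_next).T
                @ np.linalg.inv(Gamma)) @ phi_half                        # (25d)
    Gamma_next = A_next @ Gamma @ A_next.T
    return phi_next, S_next, Gamma_next
```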

IV-C Further Discussions

Before ending this section, a few more remarks should be added on the results of the above regret analysis.

First, to tackle the two types of non-stochastic disturbances, the sequences of weights $\lambda_k$ and $\omega_k$ are determined differently, as can be seen from the propositions. More specifically, for the external disturbance, which only affects the measured state $\widetilde{\bm{\phi}}_k$ and does not evolve with the nominal dynamics, the sequence $\lambda_k$ is chosen to increase at the rate of $1/\|Y_k\|_{\Sigma_k}$. This is because the disturbance $\bm{\delta}_k$ in this case only comes into play when the state is measured, and thus the weight $\lambda_k$ is adjusted according to the measurement information in $Y_k$ and the current progress of $\Sigma_k$. Since the covariance $\Sigma_k$, which essentially quantifies the uncertainty of our estimate, decreases as more measurements are absorbed, the weight $\lambda_k$ increases during the process, meaning that measurements received later are trusted more. For the internal disturbance, the sequence $\lambda_k$ is also chosen to increase, but at the fixed exponential rate $(1/\gamma)^k$. Another primary difference is that, while $\omega_k$ was previously set to zero, here we let $\omega_k$ increase at the same exponential rate $(1/\gamma)^k$. The reason for this difference can be explained as follows. Since the internal disturbance, regardless of the measurements, accumulates during the whole process, an additional weight needs to be incorporated to deal with it globally, and therefore the increasing $\omega_k$ is introduced to decrease the covariance $\Sigma_k$ accordingly. Note that this does not mean the uncertainty of our estimate is decreased artificially, since in the D-UCB $\bm{\mu}_k$ the sequence $\beta_k(\delta)$ is also increased by an extra term related to $1/\gamma$ to adjust the construction of the confidence bound.

Second, it can be concluded from the two theorems that once the disturbance bound $B_K$ increases sub-linearly, the regrets generated by our algorithm in both cases also grow at a sub-linear rate, meaning that the agents are able to track the moving sources dynamically under the disturbed environment. More precisely, while the regret for the first case increases at the rate of $\widetilde{\mathcal{O}}(\sqrt{K}+B_K)$, the rate is $\widetilde{\mathcal{O}}(B_K^{1/3}K^{2/3})$ for the second case. Both are identical to the state-of-the-art results in the study of bandit algorithms under non-stationary and adversarial settings. Therefore, we can conclude that our developments of the discounted Kalman filter and the D-UCB do not degrade the convergence performance of the algorithm. However, in terms of the scale of the problem $N$, i.e., the size of the search environment, the complexity of our algorithm grows at the rate of $\widetilde{\mathcal{O}}(N^2)$, as compared to $\widetilde{\mathcal{O}}(N)$ in the literature. This is mainly because the ellipsoid confidence sets of the classical UCB-based methods are replaced by the polytope one in our algorithm. Despite this fact, we argue that such an increase in complexity is reasonable, since far more computation is saved by avoiding the combinatorial problems at each step.

V Simulation

In this section, numerical examples are provided to validate the effectiveness of our multi-agent source seeking algorithm. We consider a pollution monitoring application where three mobile robots are deployed in a pollution diffusion field with the aim of localizing as many leaking sources as possible. The dynamics of the pollution field is governed by a convection-diffusion equation. More details of the simulation settings can be found in [9], including the linearization of the partial differential equation, the robots' measurement models and their communication topology, the specification of the pollution field, etc. However, a key difference here is that the non-stochastic disturbances are assumed to be present after the linearization of the dynamics. More concretely, the linearized model of the pollution field is represented by

$$\bm{\Phi}_{k+1}=A\bm{\Phi}_k+\bm{\delta}_k, \tag{26}$$

where $\bm{\Phi}_k$ denotes the discretized states of the field, $A$ is the state transition matrix, and $\bm{\delta}_k$ represents the non-stochastic disturbance.

In particular, we consider in this simulation that the pollution field is modeled by a $D\times D$ lattice with $D=50$. Each mobile robot is capable of sensing a circular area of radius $R=5$ during the searching process. The sensing noise is assumed to be i.i.d. Gaussian with zero mean and covariance $V^i=4\cdot\mathbf{I}_m$, $i=1,2,3$. In terms of the disturbance, we consider two different scenarios: i) a slowly-varying disturbance which occurs externally; and ii) an abruptly-changing one which occurs internally. For the slowly-varying disturbance of type I, it is assumed that $\bm{\delta}_k=0$ when $k<100$ and $\bm{\delta}_k=1/k^2\cdot\bm{\Pi}_0$ when $k\geq 100$, where $\bm{\Pi}_0$ is randomly generated. For the abruptly-changing disturbance of type II, we consider that two more leaking sources are randomly injected into the field during the periods $[150,165]$ and $[600,615]$. That is, $\bm{\delta}_k=\bm{\Pi}_1$ for $150\leq k\leq 165$ and $\bm{\delta}_k=\bm{\Pi}_2$ for $600\leq k\leq 615$, where $\bm{\Pi}_1,\bm{\Pi}_2\in\mathbb{R}^N$ are randomly generated, and $\bm{\delta}_k=0$ otherwise.
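For concreteness, the two disturbance schedules above can be sketched as follows; the random directions $\bm{\Pi}$ are drawn arbitrarily here, mirroring the random generation described in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50 * 50                                   # D x D lattice with D = 50
Pi0, Pi1, Pi2 = (rng.standard_normal(N) for _ in range(3))

def delta_type1(k):
    """Slowly-varying external disturbance: zero before k = 100,
    then 1/k^2 * Pi0 afterwards."""
    return Pi0 / k ** 2 if k >= 100 else np.zeros(N)

def delta_type2(k):
    """Abruptly-changing internal disturbance: two extra sources are
    injected during [150, 165] and [600, 615], respectively."""
    if 150 <= k <= 165:
        return Pi1
    if 600 <= k <= 615:
        return Pi2
    return np.zeros(N)
```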

Figure 1: Regrets generated by the algorithms under different scenarios. (a) Slowly-varying disturbance; (b) abruptly-changing disturbance.

To illustrate the performance of our algorithm in seeking the dynamical pollution sources under the two types of disturbances, we show the cumulative regrets $R_K$ produced by Algorithm 1, respectively. The obtained numerical results are shown in Fig. 1, in which each curve corresponds to 20 independent trials. It can be observed from the figures that our algorithm produces a smaller cumulative regret than the standard D-UCB algorithm. We can thus conclude that, while the standard D-UCB algorithm fails to localize the sources when the disturbances are present in the field, our algorithm manages to complete the task in both scenarios, with the external and internal disturbances respectively. More specifically, with respect to the internal abruptly-changing disturbance, we also compare the performance of our algorithm under different choices of the parameter $\gamma$. Note that by setting $\gamma=1$, our algorithm naturally reduces to the standard D-UCB algorithm. It can be observed that, after the disturbances are injected, our algorithm soon adapts to the disturbed pollution field and then tracks the newly-added sources accordingly; the standard D-UCB algorithm, on the contrary, fails to do so. In addition, it can also be seen from Fig. 1(b) that a smaller $\gamma$ results in a shorter adaptation period. This is mainly because the agents tend to perform more exploration when a small $\gamma$ is chosen. As a consequence of the classical exploration-exploitation dilemma, however, a disadvantage of a smaller $\gamma$ is that the cumulative regret grows more rapidly after the sources are localized.

VI Conclusion

In this paper, a learning-based algorithm is developed to solve the problem of multi-agent online source seeking under an environment disturbed by non-stochastic perturbations. Building on the technique of discounted Kalman filtering as well as the notion of D-UCB proposed in our previous work, our algorithm enables computation-efficient cooperation among the multi-agent network and is robust against the non-stochastic perturbations (also interpreted as adversarial disturbances in the context of multi-armed bandits). It is shown that a sub-linear cumulative regret is achieved by our algorithm, which is comparable to the state-of-the-art. Numerical results on a real-world pollution monitoring application are finally provided to support our theoretical findings.

Appendix I: Proof of Lemma 1

Let us prove Lemma 1 by mathematical induction. First, it is straightforward to confirm that, given the initialization $\widehat{\bm{\phi}}_0$, $\Sigma_0$, and $\omega_{-1}=0$, the recursions (11) and (12) produce identical $\widehat{\bm{\phi}}_1$ and $\Sigma_1$. Next, assuming that (12) generates the same results as (11) up to time-step $k$, it suffices to prove the consistency at time-step $k+1$.

In fact, based on the recursion of $\Sigma_k$ in (11), we have

$$\begin{aligned}\Sigma_{k+1}&=A_{k+1}\Big(\Sigma_k^{-1}+\lambda_k Y_k+(\omega_k-\omega_{k-1})\Gamma_k^{-1}\Big)^{-1}A_{k+1}^{\top}\\&=A_{k+1}\Big(A[k:1]^{-\top}\Upsilon_k A[k:1]^{-1}+\lambda_k Y_k+(\omega_k-\omega_{k-1})\Gamma_k^{-1}\Big)^{-1}A_{k+1}^{\top}\\&=A[k+1:1]\Big(\Upsilon_k+A[k:1]^{\top}Y_k A[k:1]+(\omega_k-\omega_{k-1})\mathbf{I}_N\Big)^{-1}A[k+1:1]^{\top}\\&=A[k+1:1]\Upsilon_{k+1}^{-1}A[k+1:1]^{\top},\end{aligned} \tag{27}$$

where the second equality comes from the assumed form (12a) of $\Sigma_k$ and the last equality is due to the definition of $\Upsilon_k$ in (13). Similarly, based on the recursion of $\widehat{\bm{\phi}}_k$ in (11), we have

$$\begin{aligned}\widehat{\bm{\phi}}_{k+1}&=A[k+1:1]\Upsilon_{k+1}^{-1}A[k+1:1]^{\top}A_{k+1}^{-\top}\Sigma_{k+1/2}^{-1}\widehat{\bm{\phi}}_{k+1/2}\\&=A[k+1:1]\Upsilon_{k+1}^{-1}A[k:1]^{\top}\Big(\Sigma_{k+1/2}^{-1}\widehat{\bm{\phi}}_k+\lambda_k(\mathbf{y}_k-Y_k\widehat{\bm{\phi}}_k)\Big)\\&=A[k+1:1]\Upsilon_{k+1}^{-1}A[k:1]^{\top}\Big(\Sigma_k^{-1}\widehat{\bm{\phi}}_k+\lambda_k\mathbf{y}_k\Big)\\&=A[k+1:1]\Upsilon_{k+1}^{-1}\Big(\Upsilon_k A[k:1]^{-1}\widehat{\bm{\phi}}_k+\lambda_k A[k:1]^{\top}\mathbf{y}_k\Big)\\&=A[k+1:1]\Upsilon_{k+1}^{-1}\Big(\Sigma_0^{-1}\widehat{\bm{\phi}}_0+\sum_{t=0}^{k}\lambda_t A[t:1]^{\top}\mathbf{y}_t\Big),\end{aligned} \tag{28}$$

where the first equality comes from (27), which has just been proved; the second and third equalities are due to (11); the second-to-last equality follows from the form (12a) of $\Sigma_k$; and the last one is due to the assumed form (12b) of $\widehat{\bm{\phi}}_k$.

Appendix II: Proofs of Main Theorems

We note that the proofs in this section are mainly inspired by [25] and [29], which performed the regret analysis in the context of stochastic linear bandits under non-stationary and adversarial environments, respectively. The contributions of our proofs are i) the integration of the linear dynamics and Kalman filtering into the algorithmic framework; and ii) the adaptation of the new notion of D-UCB into the regret analysis.

To facilitate the following proofs, let us start by introducing some useful vector norms. First, associated with the diagonal matrix of an arbitrary positive definite matrix $M\in\mathbb{R}^{N\times N}$, i.e., $\mathcal{D}_M=\text{Diag}\{m_{11},m_{22},\cdots,m_{NN}\}\in\mathbb{R}^{N\times N}$, we define the $\mathcal{L}_2$-based vector norm $\|\cdot\|_{\mathcal{D}_M}:\mathbb{R}^N\to\mathbb{R}_+$ as

$$\|\mathbf{x}\|_{\mathcal{D}_M}:=\sqrt{\sum_{i=1}^{N}m_{ii}\cdot x_i^2}, \tag{29}$$

where $\mathbf{x}=[x_1,x_2,\cdots,x_N]^{\top}\in\mathbb{R}^N$. Further, let us define the $\mathcal{L}_\infty$-based norm $\|\cdot\|_{\mathcal{D}_M,\infty}:\mathbb{R}^N\to\mathbb{R}_+$ with respect to the matrix $\mathcal{D}_M$ as

$$\|\mathbf{x}\|_{\mathcal{D}_M,\infty}:=\max_{1\leq i\leq N}\;m_{ii}\cdot|x_i|. \tag{30}$$

Note that the above norm $\|\cdot\|_{\mathcal{D}_M,\infty}$ is well-defined since the positive definiteness of $M$ ensures that $m_{ii}>0$. Similarly, we define the $\mathcal{L}_1$-based norm $\|\cdot\|_{\mathcal{D}_M,1}:\mathbb{R}^N\to\mathbb{R}_+$ as

$$\|\mathbf{x}\|_{\mathcal{D}_M,1}:=\sum_{i=1}^{N}m_{ii}\cdot|x_i|. \tag{31}$$

With the vector norms introduced above, it can be immediately verified that $\|\cdot\|_{\mathcal{D}_M,1}$ and $\|\cdot\|_{\mathcal{D}_M^{-1},\infty}$ are dual norms, where $\mathcal{D}_M^{-1}$ denotes the inverse of the matrix $\mathcal{D}_M$. In addition, we provide in the following lemma the connections among all defined norms.

Lemma 2

For an arbitrary positive definite matrix $M$, it holds that: 1) $\|\mathbf{x}\|_{\mathcal{D}_M,\infty}\leq\|\mathbf{x}\|_{\mathcal{D}_M^2}$; 2) $\|\mathbf{x}\|_{\mathcal{D}_M,1}\leq\sqrt{N}\cdot\|\mathbf{x}\|_{\mathcal{D}_M^2}$; and 3) $\|\mathbf{x}\|_M\leq\sqrt{N}\cdot\|\mathbf{x}\|_{\mathcal{D}_M}$.

Proof:

While inequalities 1) and 2) can be straightforwardly confirmed by the definitions and by the inequality of arithmetic and geometric means, respectively, part 3) is proved as follows.

\|\mathbf{x}\|_{M}^{2}\leq\sum_{i=1}^{N}m_{ii}\cdot x_{i}^{2}+\sum_{i=1}^{N}\sum_{j\neq i}|m_{ij}|\cdot|x_{i}x_{j}| (32)
\leq\sum_{i=1}^{N}m_{ii}\cdot x_{i}^{2}+\sum_{i=1}^{N}\sum_{j\neq i}\sqrt{m_{ii}m_{jj}}\cdot|x_{i}x_{j}|
\leq\sum_{i=1}^{N}m_{ii}\cdot x_{i}^{2}+\sum_{i=1}^{N}\sum_{j\neq i}\frac{1}{2}\big(m_{ii}\cdot x_{i}^{2}+m_{jj}\cdot x_{j}^{2}\big)
=N\cdot\|\mathbf{x}\|^{2}_{\mathcal{D}_{M}}.

Note that the first inequality is the triangle inequality; the second is due to the positive definiteness of M, i.e., |m_{ij}|\leq\sqrt{m_{ii}m_{jj}}; and the third follows from the inequality of arithmetic and geometric means. Hence, the proof is completed. ∎
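As a quick illustration of Lemma 2, the three inequalities can be checked numerically on a random positive definite matrix. The following Python sketch (with illustrative dimensions and a seed of our choosing) is not part of the analysis; it is only a sanity check of the norm definitions (29)–(31).

import numpy as np

rng = np.random.default_rng(0)
N = 5
B = rng.standard_normal((N, N))
M = B @ B.T + N * np.eye(N)                 # arbitrary positive definite matrix
x = rng.standard_normal(N)

d = np.diag(M)                              # diagonal entries m_ii > 0
norm_inf = np.max(d * np.abs(x))            # ||x||_{D_M, infty}, cf. (30)
norm_one = np.sum(d * np.abs(x))            # ||x||_{D_M, 1}, cf. (31)
norm_D2  = np.sqrt(np.sum(d**2 * x**2))     # ||x||_{D_M^2}, cf. (29) with D_M^2
norm_M   = np.sqrt(x @ M @ x)               # ||x||_M
norm_D   = np.sqrt(np.sum(d * x**2))        # ||x||_{D_M}

assert norm_inf <= norm_D2 + 1e-12                 # Lemma 2-1)
assert norm_one <= np.sqrt(N) * norm_D2 + 1e-12    # Lemma 2-2)
assert norm_M   <= np.sqrt(N) * norm_D + 1e-12     # Lemma 2-3)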

VI-A Proof of Proposition 1

To prove the inequality (17) in Proposition 1 with the help of the above-defined vector norms, it suffices to show that

\mathbb{P}\Big(\big\|\widehat{\bm{\phi}}_{k}-\bm{\phi}_{k}\big\|_{\mathcal{D}^{-1/2}_{\Sigma_{k}},\infty}\leq\beta_{k}(\delta)\Big)\geq 1-\delta. (33)

Note that the inequality in (33) is stronger than the one in (17), in the sense that the state \bm{\phi}_{k} is both upper and lower bounded. Though the lower bound is not reflected in the development of our algorithm, it aids the proof of the sub-linear regret. In addition, since the weight \omega_{k} is specified as \omega_{k}\equiv 0 in this part, the generation of state estimates simplifies, with the matrix \Upsilon_{k} reducing to

\Upsilon_{k}:=\Sigma_{0}^{-1}+\sum_{t=0}^{k-1}\lambda_{t}A[t:1]^{\top}Y_{t}A[t:1]. (34)

According to the nature of the first type of disturbance, the disturbed state can be expressed as \widetilde{\bm{\phi}}_{k}=A[k:1]\bm{\phi}_{0}+\bm{\delta}_{k}, and thus the measurement is \mathbf{z}_{k}=H_{k}(A[k:1]\bm{\phi}_{0}+\bm{\delta}_{k})+\mathbf{n}_{k}. Then, by Lemma 1 and the definitions of \Upsilon_{k} and Y_{k}, the state estimate \widehat{\bm{\phi}}_{k} satisfies

\widehat{\bm{\phi}}_{k}=A[k:1]\Upsilon_{k}^{-1}\Big(\Sigma_{0}^{-1}\widehat{\bm{\phi}}_{0}+\sum_{t=0}^{k-1}\lambda_{t}A[t:1]^{\top}H_{t}^{\top}V^{-1}\mathbf{n}_{t}+\sum_{t=0}^{k-1}\lambda_{t}A[t:1]^{\top}Y_{t}\big(A[t:1]\bm{\phi}_{0}+\bm{\delta}_{t}\big)\Big) (35)
=\bm{\phi}_{k}+A[k:1]\Upsilon_{k}^{-1}\Big(\sum_{t=0}^{k-1}\lambda_{t}A[t:1]^{\top}H_{t}^{\top}V^{-1}\mathbf{n}_{t}+\sum_{t=0}^{k-1}\lambda_{t}A[t:1]^{\top}Y_{t}\bm{\delta}_{t}+\Sigma_{0}^{-1}(\widehat{\bm{\phi}}_{0}-\bm{\phi}_{0})\Big).

Therefore, it holds for all \mathbf{x}\in\mathbb{R}^{N} that

\mathbf{x}^{\top}(\widehat{\bm{\phi}}_{k}-\bm{\phi}_{k}) (36)
\overset{(1.a)}{\leq}\big\|A[k:1]^{\top}\mathbf{x}\big\|_{\Upsilon_{k}^{-1}}\cdot\Big(\big\|\Sigma_{0}^{-1}(\widehat{\bm{\phi}}_{0}-\bm{\phi}_{0})\big\|_{\Upsilon_{k}^{-1}}+\Big\|\sum_{t=0}^{k-1}\lambda_{t}A[t:1]^{\top}Y_{t}\bm{\delta}_{t}\Big\|_{\Upsilon_{k}^{-1}}+\Big\|\sum_{t=0}^{k-1}\lambda_{t}A[t:1]^{\top}H_{t}^{\top}V^{-1}\mathbf{n}_{t}\Big\|_{\Upsilon_{k}^{-1}}\Big)
\overset{(1.b)}{=}\big\|\mathbf{x}\big\|_{\Sigma_{k}}\cdot\Big(\big\|\Sigma_{0}^{-1}(\widehat{\bm{\phi}}_{0}-\bm{\phi}_{0})\big\|_{\Upsilon_{k}^{-1}}+\Big\|\sum_{t=0}^{k-1}\lambda_{t}A[t:1]^{\top}Y_{t}\bm{\delta}_{t}\Big\|_{\Upsilon_{k}^{-1}}+\Big\|\sum_{t=0}^{k-1}\lambda_{t}A[t:1]^{\top}H_{t}^{\top}V^{-1}\mathbf{n}_{t}\Big\|_{\Upsilon_{k}^{-1}}\Big)
\overset{(1.c)}{\leq}\sqrt{N}\cdot\big\|\mathbf{x}\big\|_{\mathcal{D}_{\Sigma_{k}}}\cdot\Big(\big\|\Sigma_{0}^{-1}(\widehat{\bm{\phi}}_{0}-\bm{\phi}_{0})\big\|_{\Upsilon_{k}^{-1}}+\Big\|\sum_{t=0}^{k-1}\lambda_{t}A[t:1]^{\top}Y_{t}\bm{\delta}_{t}\Big\|_{\Upsilon_{k}^{-1}}+\Big\|\sum_{t=0}^{k-1}\lambda_{t}A[t:1]^{\top}H_{t}^{\top}V^{-1}\mathbf{n}_{t}\Big\|_{\Upsilon_{k}^{-1}}\Big),

where (1.a) follows from the Cauchy–Schwarz and triangle inequalities; (1.b) is due to the recursion of \Sigma_{k} in the form of (12a); and (1.c) is based on Lemma 2-3).

Now, by Lemma 2-1), it follows that

\big\|\widehat{\bm{\phi}}_{k}-\bm{\phi}_{k}\big\|_{\mathcal{D}^{-1/2}_{\Sigma_{k}},\infty}\leq\big\|\widehat{\bm{\phi}}_{k}-\bm{\phi}_{k}\big\|_{\mathcal{D}^{-1}_{\Sigma_{k}}} (37)
\leq\sqrt{N}\cdot\Big(\big\|\Sigma_{0}^{-1}(\widehat{\bm{\phi}}_{0}-\bm{\phi}_{0})\big\|_{\Upsilon_{k}^{-1}}+\Big\|\sum_{t=0}^{k-1}\lambda_{t}A[t:1]^{\top}Y_{t}\bm{\delta}_{t}\Big\|_{\Upsilon_{k}^{-1}}+\Big\|\sum_{t=0}^{k-1}\lambda_{t}A[t:1]^{\top}H_{t}^{\top}V^{-1}\mathbf{n}_{t}\Big\|_{\Upsilon_{k}^{-1}}\Big),

where the last inequality follows from (36) by taking \mathbf{x}=\mathcal{D}^{-1}_{\Sigma_{k}}(\widehat{\bm{\phi}}_{k}-\bm{\phi}_{k}). Next, to prove the inequality (33), we upper bound the three terms on the right-hand side of (37) in the following three lemmas, respectively.

Lemma 3

Under the conditions in Proposition 1, there exists a constant C_{1}=\|\widehat{\bm{\phi}}_{0}-\bm{\phi}_{0}\|/\sqrt{\underaccent{\bar}{\sigma}} such that,

\big\|\Sigma_{0}^{-1}(\widehat{\bm{\phi}}_{0}-\bm{\phi}_{0})\big\|_{\Upsilon_{k}^{-1}}\leq C_{1},\quad\forall k\geq 0. (38)
Proof:

By the definition of the matrix \Upsilon_{k}, it is straightforward to see that \Upsilon_{k}^{-1}\preceq\Sigma_{0}, and therefore,

\big\|\Sigma_{0}^{-1}(\widehat{\bm{\phi}}_{0}-\bm{\phi}_{0})\big\|^{2}_{\Upsilon_{k}^{-1}}=(\widehat{\bm{\phi}}_{0}-\bm{\phi}_{0})^{\top}\Sigma_{0}^{-1}\Upsilon_{k}^{-1}\Sigma_{0}^{-1}(\widehat{\bm{\phi}}_{0}-\bm{\phi}_{0}) (39)
\leq(\widehat{\bm{\phi}}_{0}-\bm{\phi}_{0})^{\top}\Sigma_{0}^{-1}(\widehat{\bm{\phi}}_{0}-\bm{\phi}_{0})
\leq 1/\underaccent{\bar}{\sigma}\cdot\|\widehat{\bm{\phi}}_{0}-\bm{\phi}_{0}\|^{2},

where the last inequality is due to the assumption \Sigma_{0}\succeq\underaccent{\bar}{\sigma}\cdot\mathbf{I}. Thus, the proof is completed. ∎

Lemma 4

Under the conditions in Proposition 1, it holds that

\Big\|\sum_{t=0}^{k-1}\lambda_{t}A[t:1]^{\top}Y_{t}\bm{\delta}_{t}\Big\|_{\Upsilon_{k}^{-1}}\leq\bar{\lambda}B_{k}, (40)

where the sequence \{B_{k}\}_{k\in\mathbb{N}_{+}} is defined in Assumption 2.

Proof:

It holds that

\Big\|\sum_{t=0}^{k-1}\lambda_{t}A[t:1]^{\top}Y_{t}\bm{\delta}_{t}\Big\|_{\Upsilon_{k}^{-1}}\overset{(2.a)}{\leq}\sum_{t=0}^{k-1}\lambda_{t}\big\|A[t:1]^{\top}Y_{t}\big\|_{\Upsilon_{t}^{-1}}\|\bm{\delta}_{t}\| (41)
\overset{(2.b)}{=}\sum_{t=0}^{k-1}\lambda_{t}\big\|Y_{t}\big\|_{\Sigma_{t}}\|\bm{\delta}_{t}\|\overset{(2.c)}{\leq}\sum_{t=0}^{k-1}\bar{\lambda}\|\bm{\delta}_{t}\|\overset{(2.d)}{\leq}\bar{\lambda}B_{k},

where (2.a) is due to the triangle inequality and the fact that \Upsilon_{k}\succeq\Upsilon_{t},\forall k\geq t, by (34); (2.b) is based on the recursion (12a) of \Sigma_{t}; (2.c) is due to the specification of the weights, i.e., \lambda_{t}=\min\{1,\bar{\lambda}/\|Y_{t}\|_{\Sigma_{t}}\}; and (2.d) is by Assumption 2. ∎

Lemma 5

Under the conditions in Proposition 1, there exists a constant C_{2}=\bar{v}^{2}\sqrt{\max\{2,2/\underaccent{\bar}{v}\}} such that the following inequality holds with probability at least 1-\delta,

\Big\|\sum_{t=0}^{k-1}\lambda_{t}A[t:1]^{\top}H_{t}^{\top}V^{-1}\mathbf{n}_{t}\Big\|_{\Upsilon_{k}^{-1}}\leq C_{2}\sqrt{N}\cdot\sqrt{\log\Big(\frac{\bar{\sigma}/\underaccent{\bar}{\sigma}+\bar{\alpha}\bar{\sigma}\cdot k/\underaccent{\bar}{v}^{2}}{\delta^{2/N}}\Big)}. (42)
Proof:

This proof is based on existing results on self-normalized martingales; see, e.g., [25]. For notational simplicity, let us define

X_{t}:=\lambda_{t}A[t:1]^{\top}H_{t}^{\top}V^{-1}\in\mathbb{R}^{N\times M}. (43)

Then, according to the result on self-normalized martingales, it holds with probability at least 1-\delta that,

\Big\|\sum_{t=0}^{k-1}X_{t}\mathbf{n}_{t}\Big\|_{\Omega_{k}^{-1}}\leq 2\bar{v}^{2}\cdot\sqrt{\log\Big(\frac{\det(\Omega_{k})^{1/2}\det(\Sigma_{0})^{1/2}}{\delta}\Big)}, (44)

where \Omega_{k}:=\Sigma_{0}^{-1}+\sum_{t=0}^{k-1}X_{t}X_{t}^{\top}\in\mathbb{R}^{N\times N}. Note that there is a slight difference between \Omega_{k} and \Upsilon_{k}; we next show that there exists a constant C^{\prime}_{2}=\max\{1,1/\underaccent{\bar}{v}\} such that \Omega_{k}\preceq C^{\prime}_{2}\Upsilon_{k}. In fact, it holds that

\Omega_{k}=\Sigma_{0}^{-1}+\sum_{t=0}^{k-1}\lambda_{t}^{2}A[t:1]^{\top}H_{t}^{\top}V^{-2}H_{t}A[t:1] (45)
\preceq\Sigma_{0}^{-1}+1/\underaccent{\bar}{v}\cdot\sum_{t=0}^{k-1}\lambda_{t}A[t:1]^{\top}H_{t}^{\top}V^{-1}H_{t}A[t:1]
\preceq\max\{1,1/\underaccent{\bar}{v}\}\cdot\Upsilon_{k}.

Note that the first inequality is due to \lambda_{t}\leq 1 and the assumption \underaccent{\bar}{v}\cdot\mathbf{I}_{M}\preceq V\preceq\bar{v}\cdot\mathbf{I}_{M}. Hence, the claimed relation holds with C^{\prime}_{2}=\max\{1,1/\underaccent{\bar}{v}\}, which implies that \Upsilon_{k}^{-1}\preceq C^{\prime}_{2}\Omega_{k}^{-1}. Together with the inequality (44), it follows that

\Big\|\sum_{t=0}^{k-1}X_{t}\mathbf{n}_{t}\Big\|_{\Upsilon_{k}^{-1}}\leq\sqrt{C^{\prime}_{2}}\cdot\Big\|\sum_{t=0}^{k-1}X_{t}\mathbf{n}_{t}\Big\|_{\Omega_{k}^{-1}} (46)
\leq 2\bar{v}^{2}\sqrt{\max\{1,1/\underaccent{\bar}{v}\}}\cdot\sqrt{\log\Big(\frac{\det(\Omega_{k})^{1/2}\det(\Sigma_{0})^{1/2}}{\delta}\Big)}.

Moreover, based on the inequality of arithmetic and geometric means and the definition of \Omega_{k}, it holds that

\det(\Omega_{k})\leq\Big(1/N\cdot\text{Tr}\big(\Sigma_{0}^{-1}\big)+1/N\cdot\sum_{t=0}^{k-1}\text{Tr}(X_{t}X_{t}^{\top})\Big)^{N}, (47)

where the trace of the matrix X_{t}X_{t}^{\top} further satisfies

\text{Tr}(X_{t}X_{t}^{\top})=\text{Tr}\Big(\lambda_{t}^{2}A[t:1]^{\top}H_{t}^{\top}V^{-2}H_{t}A[t:1]\Big) (48)
\overset{(2.a)}{\leq}1/\underaccent{\bar}{v}^{2}\cdot\sum_{n=1}^{N}\mathbf{e}_{n}^{\top}A[t:1]^{\top}H_{t}^{\top}H_{t}A[t:1]\mathbf{e}_{n}
\overset{(2.b)}{\leq}1/\underaccent{\bar}{v}^{2}\cdot\sum_{n=1}^{N}\mathbf{e}_{n}^{\top}A[t:1]^{\top}A[t:1]\mathbf{e}_{n}
\overset{(2.c)}{\leq}N\cdot\bar{\alpha}/\underaccent{\bar}{v}^{2}.

Note that (2.a) is due to the assumption \underaccent{\bar}{v}\cdot\mathbf{I}_{M}\preceq V\preceq\bar{v}\cdot\mathbf{I}_{M}, where \mathbf{e}_{n}\in\mathbb{R}^{N} denotes the n-th standard unit vector; (2.b) follows from the special form of the measurement matrix H_{t}, i.e., each row has exactly one element equal to one and all others equal to zero; and (2.c) is based on Assumption 1. In addition, given that the initialization \Sigma_{0} satisfies \underaccent{\bar}{\sigma}\cdot\mathbf{I}_{N}\preceq\Sigma_{0}\preceq\bar{\sigma}\cdot\mathbf{I}_{N}, it follows that \text{Tr}(\Sigma_{0}^{-1})\leq N/\underaccent{\bar}{\sigma} and \det(\Sigma_{0})\leq\bar{\sigma}^{N}. As a result, we have

\sqrt{\log\Big(\det(\Omega_{k})^{1/2}\det(\Sigma_{0})^{1/2}/\delta\Big)}=\sqrt{1/2\cdot\log\big(\det(\Omega_{k})\big)+1/2\cdot\log\big(\det(\Sigma_{0})\big)-\log(\delta)} (49)
\leq\sqrt{N/2}\cdot\sqrt{\log\Big(\frac{\bar{\sigma}/\underaccent{\bar}{\sigma}+\bar{\alpha}\bar{\sigma}\cdot k/\underaccent{\bar}{v}^{2}}{\delta^{2/N}}\Big)}.

Combining the inequalities (46) and (49), the proof is completed. ∎
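For intuition, the self-normalized martingale bound (44) can also be probed empirically: drawing Gaussian noise \mathbf{n}_{t} repeatedly and counting how often the bound fails should give a frequency well below \delta. The following Python sketch does this for arbitrary fixed X_{t}'s; all dimensions, scales, and the noise model are illustrative assumptions, not quantities from the paper.

import numpy as np

# Empirical probe of the self-normalized martingale bound (44); the
# matrices X_t and all dimensions below are illustrative assumptions.
rng = np.random.default_rng(1)
N, M, k, v_bar, delta = 4, 2, 50, 1.0, 0.1
Sigma0 = np.eye(N)
Xs = [0.3 * rng.standard_normal((N, M)) for _ in range(k)]
Omega = np.linalg.inv(Sigma0) + sum(X @ X.T for X in Xs)

threshold = 2 * v_bar**2 * np.sqrt(np.log(
    np.sqrt(np.linalg.det(Omega)) * np.sqrt(np.linalg.det(Sigma0)) / delta))

failures = 0
trials = 2000
for _ in range(trials):
    S = sum(X @ (v_bar * rng.standard_normal(M)) for X in Xs)   # sum_t X_t n_t
    failures += np.sqrt(S @ np.linalg.solve(Omega, S)) > threshold
print(f"empirical failure rate {failures / trials:.4f} <= delta = {delta}")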

Now, combining Lemmas 3–5 with (37), we have shown that, with probability at least 1-\delta,

\big\|\widehat{\bm{\phi}}_{k}-\bm{\phi}_{k}\big\|_{\mathcal{D}^{-1/2}_{\Sigma_{k}},\infty}\leq\sqrt{N}\cdot\Bigg(\bar{\lambda}B_{k}+C_{1}+C_{2}\sqrt{N}\cdot\sqrt{\log\Big(\frac{\bar{\sigma}/\underaccent{\bar}{\sigma}+\bar{\alpha}\bar{\sigma}\cdot k/\underaccent{\bar}{v}^{2}}{\delta^{2/N}}\Big)}\Bigg). (50)

Recalling the definition of \beta_{k}(\delta) in (18), the inequality in (33) is proved, and so is Proposition 1.

VI-B Proof of Theorem 1

To facilitate the following proof, let us first introduce a mapping \mathbf{a}(\cdot):\mathcal{S}^{I}\to\mathbb{R}^{N} which translates the positional information \mathbf{p}=\big[\mathbf{p}[1],\mathbf{p}[2],\cdots,\mathbf{p}[I]\big]\in\mathcal{S}^{I} into an N-dimensional action vector \mathbf{a}(\mathbf{p})\in\mathbb{R}^{N}, i.e.,

\mathbf{a}(\mathbf{p})=\sum_{i=1}^{I}\mathbf{e}_{s_{i}}, (51)

where each s_{i} corresponds to the index of the position \mathbf{p}[i] in the environment \mathcal{S} and \mathbf{e}_{s_{i}}\in\mathbb{R}^{N} denotes the s_{i}-th standard unit vector. Now, by the definitions of \mathbf{p}_{k} and \mathbf{p}^{\star}_{k}, it can be immediately verified that the vectors \mathbf{a}(\mathbf{p}_{k}) and \mathbf{a}(\mathbf{p}^{\star}_{k}) must have I elements equal to one and all others equal to zero. Further, we denote by \mathcal{A} the set of all such vectors, i.e.,

\mathcal{A}:=\{\mathbf{a}\,|\,\mathbf{a}\in\{0,1\}^{N},\;\mathbf{1}^{\top}\mathbf{a}=I\}. (52)
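To make the encoding (51)–(52) concrete, the following Python sketch builds the action vector for a small team. The grid size, agent positions, and the toy state \bm{\phi}_{k} are hypothetical values chosen only for illustration, and agent positions are assumed distinct so that \mathbf{a}(\mathbf{p})\in\{0,1\}^{N}.

import numpy as np

def action_vector(indices, N):
    # One-hot sum encoding a(p) in (51); `indices` are the cell
    # indices s_i of the agents' positions (assumed distinct).
    a = np.zeros(N)
    for s in indices:
        a[s] += 1.0
    return a

N, I = 10, 3                             # toy grid with 10 cells, 3 agents
a_k      = action_vector([2, 5, 7], N)   # current positions p_k
a_k_star = action_vector([1, 5, 9], N)   # optimal positions p_k^*
assert a_k.sum() == I and set(a_k) <= {0.0, 1.0}   # a_k lies in A, cf. (52)

phi_k = np.linspace(0.0, 1.0, N)         # hypothetical environment state
r_k = (a_k_star - a_k) @ phi_k           # instantaneous regret, cf. (53) below
print(r_k)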

For brevity, we subsequently abbreviate \mathbf{a}(\mathbf{p}_{k}) and \mathbf{a}(\mathbf{p}^{\star}_{k}) to \mathbf{a}_{k}\in\mathcal{A} and \mathbf{a}^{\star}_{k}\in\mathcal{A}, respectively. Based on the definition of F_{k}(\cdot) as well as the introduced notations, the regret r_{k} can be expressed as

r_{k}=F_{k}\big(\mathbf{p}^{\star}_{k}\big)-F_{k}(\mathbf{p}_{k})=\langle\mathbf{a}^{\star}_{k}-\mathbf{a}_{k},\bm{\phi}_{k}\rangle. (53)

To proceed, we show the following lemma which provides an upper bound for the regret r_{k} at each time-step k.

Lemma 6

Suppose that the conditions in Proposition 1 hold and the actions \mathbf{a}_{k} are generated by Algorithm 1. Then it holds with probability at least 1-\delta that,

r_{k}\leq 2\sqrt{N}\beta_{k}(\delta)\cdot\|\mathbf{a}_{k}\|_{\mathcal{D}_{\Sigma_{k}}}. (54)
Proof:

By the expression of r_{k} in (53), it follows that

r_{k}=\langle\mathbf{a}^{\star}_{k},\;\bm{\phi}_{k}\rangle-\langle\mathbf{a}_{k},\;\bm{\phi}_{k}\rangle (55)
\overset{(3.a)}{\leq}\langle\mathbf{a}_{k},\;\bm{\mu}_{k}-\bm{\phi}_{k}\rangle
\overset{(3.b)}{\leq}\|\mathbf{a}_{k}\|_{\mathcal{D}^{1/2}_{\Sigma_{k}},1}\cdot\|\bm{\mu}_{k}-\bm{\phi}_{k}\|_{\mathcal{D}^{-1/2}_{\Sigma_{k}},\infty}
\overset{(3.c)}{\leq}2\beta_{k}(\delta)\cdot\|\mathbf{a}_{k}\|_{\mathcal{D}^{1/2}_{\Sigma_{k}},1}
\overset{(3.d)}{\leq}2\sqrt{N}\beta_{k}(\delta)\cdot\|\mathbf{a}_{k}\|_{\mathcal{D}_{\Sigma_{k}}},

where (3.a) is due to \langle\mathbf{a}^{\star}_{k},\;\bm{\phi}_{k}\rangle\leq\langle\mathbf{a}^{\star}_{k},\;\bm{\mu}_{k}\rangle\leq\langle\mathbf{a}_{k},\;\bm{\mu}_{k}\rangle, which can be verified by Proposition 1 and the definition of \mathbf{a}_{k}; (3.b) follows from Hölder's inequality and the fact that \|\cdot\|_{\mathcal{D}_{M},1} and \|\cdot\|_{\mathcal{D}_{M}^{-1},\infty} are dual norms; (3.c) is based on the inequality (33) which has been proved previously; and (3.d) is due to Lemma 2-2). ∎

Lemma 6 shows that the regret can be upper bounded in terms of \|\mathbf{a}_{k}\|_{\mathcal{D}_{\Sigma_{k}}}. To investigate this key term, we next establish an intermediate result which will be used to bound \|\mathbf{a}_{k}\|_{\mathcal{D}_{\Sigma_{k}}}.

Lemma 7

Under the conditions in Proposition 1, it holds that

\sum_{k=0}^{K-1}\min\big\{1,\lambda_{k}\text{Tr}(Y_{k}\Sigma_{k})\big\}\leq 2N\cdot\log\Big(\bar{\sigma}/\underaccent{\bar}{\sigma}+K\bar{\sigma}\bar{\alpha}I/\underaccent{\bar}{v}\Big). (56)
Proof:

Recalling the definition (34) of \Upsilon_{k}, the matrix can also be generated recursively as

\Upsilon_{k+1}=\Upsilon_{k}+\lambda_{k}A[k:1]^{\top}Y_{k}A[k:1]. (57)

For simplicity, let us denote A[k:1]^{\top}Y_{k}A[k:1] by a new matrix \Xi_{k}\in\mathbb{R}^{N\times N}. Now, considering the determinants of the \Upsilon_{k}'s, it holds that

\det(\Upsilon_{k+1})=\det\Big(\Upsilon_{k}^{1/2}\big(\mathbf{I}+\lambda_{k}\Upsilon_{k}^{-1/2}\Xi_{k}\Upsilon_{k}^{-1/2}\big)\Upsilon_{k}^{1/2}\Big) (58)
=\det(\Upsilon_{k})\cdot\det\big(\mathbf{I}+\lambda_{k}\Upsilon_{k}^{-1/2}\Xi_{k}\Upsilon_{k}^{-1/2}\big)
\overset{(4.a)}{=}\det(\Upsilon_{k})\cdot\prod_{n=1}^{N}\Big(1+\lambda_{k}\bm{\lambda}_{n}(\Upsilon_{k}^{-1/2}\Xi_{k}\Upsilon_{k}^{-1/2})\Big)
\overset{(4.b)}{\geq}\det(\Upsilon_{k})\cdot\Big(1+\sum_{n=1}^{N}\lambda_{k}\bm{\lambda}_{n}(\Upsilon_{k}^{-1/2}\Xi_{k}\Upsilon_{k}^{-1/2})\Big)
=\det(\Upsilon_{k})\cdot\Big(1+\lambda_{k}\text{Tr}(\Upsilon_{k}^{-1/2}\Xi_{k}\Upsilon_{k}^{-1/2})\Big),

where \bm{\lambda}_{n}(\cdot) denotes the n-th eigenvalue of its matrix argument; (4.a) follows from the eigenvalue decomposition, and (4.b) holds since all eigenvalues of \Upsilon_{k}^{-1/2}\Xi_{k}\Upsilon_{k}^{-1/2} are non-negative, so that expanding the product yields the sum as a lower bound. Based on the cyclic property of the matrix trace and the recursion of \Sigma_{k} in (12a), it follows that

\text{Tr}(\Upsilon_{k}^{-1/2}\Xi_{k}\Upsilon_{k}^{-1/2})=\text{Tr}(Y_{k}\Sigma_{k}). (59)

Therefore, (58) can be continued as

\det(\Upsilon_{k+1})\geq\det(\Upsilon_{k})\cdot\Big(1+\lambda_{k}\text{Tr}(Y_{k}\Sigma_{k})\Big). (60)

Now, applying the above inequality (60) recursively yields

\det(\Upsilon_{k+1})\geq\det(\Upsilon_{0})\cdot\prod_{t=0}^{k}\Big(1+\lambda_{t}\text{Tr}(Y_{t}\Sigma_{t})\Big). (61)

Notice that \min\{1,x\}\leq 2\log(1+x) holds for any non-negative scalar x\geq 0 (for x\leq 1, one has \log(1+x)\geq x/2; for x>1, 2\log(1+x)\geq 2\log 2>1). Thus, one can have that

\sum_{t=0}^{k}\min\big\{1,\lambda_{t}\text{Tr}(Y_{t}\Sigma_{t})\big\}\leq\sum_{t=0}^{k}2\log\big(1+\lambda_{t}\text{Tr}(Y_{t}\Sigma_{t})\big) (62)
\leq 2\log\Big(\det(\Upsilon_{k+1})/\det(\Upsilon_{0})\Big).

Based on the definition (13) of \Upsilon_{k}, it follows that

\det(\Upsilon_{k+1})\leq\Big(1/N\cdot\text{Tr}(\Sigma_{0}^{-1})+1/N\cdot\sum_{t=0}^{k}\lambda_{t}\text{Tr}(\Xi_{t})\Big)^{N} (63)
\leq\Big(1/\underaccent{\bar}{\sigma}+(k+1)\cdot\bar{\alpha}I/\underaccent{\bar}{v}\Big)^{N}.

Therefore, combining (62) and (63), we have that

\sum_{t=0}^{k}\min\big\{1,\lambda_{t}\text{Tr}(Y_{t}\Sigma_{t})\big\}\leq 2N\cdot\log\Big(\bar{\sigma}/\underaccent{\bar}{\sigma}+(k+1)\bar{\sigma}\bar{\alpha}I/\underaccent{\bar}{v}\Big), (64)

which completes the proof. ∎
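The determinant step (60) at the heart of Lemma 7 is easy to confirm numerically: for any positive definite \Upsilon and positive semi-definite \Xi, \det(\Upsilon+\lambda\Xi)\geq\det(\Upsilon)(1+\lambda\,\text{Tr}(\Xi\Upsilon^{-1})). A minimal Python sketch with random matrices (illustrative sizes and seed) is given below.

import numpy as np

# Numerical check of the determinant inequality (60): here Xi plays the
# role of A[k:1]^T Y_k A[k:1], and Tr(Xi @ inv(Upsilon)) equals Tr(Y_k Sigma_k)
# by the cyclic property of the trace; sizes are illustrative.
rng = np.random.default_rng(2)
N, lam = 4, 0.7
B = rng.standard_normal((N, N))
Upsilon = B @ B.T + np.eye(N)          # positive definite information matrix
C = rng.standard_normal((N, N))
Xi = C @ C.T                           # positive semi-definite update

lhs = np.linalg.det(Upsilon + lam * Xi)
rhs = np.linalg.det(Upsilon) * (1 + lam * np.trace(Xi @ np.linalg.inv(Upsilon)))
assert lhs >= rhs - 1e-9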

Next, in order to bound \|\mathbf{a}_{k}\|_{\mathcal{D}_{\Sigma_{k}}} by using the above Lemma 7, we build the connection between \text{Tr}(Y_{k}\Sigma_{k}) and \|\mathbf{a}_{k}\|_{\mathcal{D}_{\Sigma_{k}}} as follows.

Lemma 8

Let the matrices \Sigma_{k} be generated by (12a) and the actions \mathbf{a}_{k} be generated by Algorithm 1; then the following statements hold for all k\geq 0:

1. \text{Tr}(Y_{k}\Sigma_{k})\geq 1/\bar{v}\cdot\|\mathbf{a}_{k}\|^{2}_{\mathcal{D}_{\Sigma_{k}}};

2. \underaccent{\bar}{v}\cdot\|Y_{k}\|^{2}_{\Sigma_{k}}\leq\text{Tr}(Y_{k}\Sigma_{k})\leq\bar{v}N\cdot\|Y_{k}\|^{2}_{\Sigma_{k}}.

Proof:

Statement 1): Due to the specific forms of the covariance matrix V and the measurement matrix H_{k}, it can be confirmed that Y_{k} is diagonal and can be expressed as

Y_{k}=\sum_{i=1}^{I}\sum_{l\in\mathcal{C}_{k}^{i}}1/v_{i}\cdot\mathbf{e}_{l}\mathbf{e}_{l}^{\top}, (65)

where \mathcal{C}_{k}^{i} denotes the i-th agent's sensing area at time k; see the definition in (8). Let us introduce a binary variable \delta^{i}_{k}(n); let \delta^{i}_{k}(n)=1 if the position indexed by n is in the sensing area \mathcal{C}_{k}^{i}, and \delta^{i}_{k}(n)=0 otherwise. As a direct result, it holds that

\text{Tr}(Y_{k}\Sigma_{k})=\sum_{n=1}^{N}\Big(\sigma_{n}^{k}\cdot\sum_{i=1}^{I}\delta_{k}^{i}(n)/v_{i}\Big), (66)

where \sigma_{n}^{k} denotes the n-th diagonal entry of the matrix \Sigma_{k}. Now, letting s_{k}^{i} be the index of agent i's position, one has \delta_{k}^{i}(s_{k}^{i})=1 and therefore,

\text{Tr}(Y_{k}\Sigma_{k})\geq 1/\bar{v}\cdot\sum_{i=1}^{I}\mathbf{e}_{s^{i}_{k}}^{\top}\Sigma_{k}\mathbf{e}_{s^{i}_{k}} (67)
=1/\bar{v}\cdot\mathbf{a}_{k}^{\top}\mathcal{D}_{\Sigma_{k}}\mathbf{a}_{k}
=1/\bar{v}\cdot\|\mathbf{a}_{k}\|^{2}_{\mathcal{D}_{\Sigma_{k}}},

where the first equality is due to the definition of \mathbf{a}_{k}.

Statement 2): Based on the equality (66) and the fact that \delta^{i}_{k}(n) is a binary variable, it follows that

\text{Tr}(Y_{k}\Sigma_{k})=\sum_{n=1}^{N}\Big(\sigma_{n}^{k}\cdot\sum_{i=1}^{I}v_{i}\big(\delta_{k}^{i}(n)/v_{i}\big)^{2}\Big) (68)
\geq\underaccent{\bar}{v}\cdot\sum_{n=1}^{N}\Big(\sigma_{n}^{k}\cdot\sum_{i=1}^{I}\big(\delta_{k}^{i}(n)/v_{i}\big)^{2}\Big)
=\underaccent{\bar}{v}\cdot\text{Tr}(Y_{k}\Sigma_{k}Y_{k})
\geq\underaccent{\bar}{v}\cdot\|Y_{k}\|^{2}_{\Sigma_{k}},

where the last inequality is due to the definition of the matrix norm \|\cdot\|_{\Sigma_{k}}, i.e., \|Y_{k}\|^{2}_{\Sigma_{k}} equals the largest eigenvalue of the matrix Y_{k}\Sigma_{k}Y_{k}.

On the other hand, one can also have that

\text{Tr}(Y_{k}\Sigma_{k})\leq\bar{v}\cdot\sum_{n=1}^{N}\Big(\sigma_{n}^{k}\cdot\sum_{i=1}^{I}\big(\delta_{k}^{i}(n)/v_{i}\big)^{2}\Big) (69)
=\bar{v}\cdot\text{Tr}(Y_{k}\Sigma_{k}Y_{k})
\leq\bar{v}N\cdot\|Y_{k}\|^{2}_{\Sigma_{k}}.

Therefore, the proof is completed. ∎
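As a sanity check of Lemma 8-1), the inequality \text{Tr}(Y_{k}\Sigma_{k})\geq 1/\bar{v}\cdot\|\mathbf{a}_{k}\|^{2}_{\mathcal{D}_{\Sigma_{k}}} can be verified on a randomly generated instance; the grid size, sensing areas, and noise levels in the Python sketch below are illustrative assumptions consistent with the structure (65)–(67).

import numpy as np

# Random instance for Lemma 8-1); each agent's sensing area is assumed to
# contain the agent's own cell, as in the construction of (65)-(67).
rng = np.random.default_rng(3)
N, I = 10, 3
Sigma = np.diag(rng.uniform(0.5, 2.0, N))        # diagonal surrogate of Sigma_k
v = rng.uniform(1.0, 2.0, I)                     # noise levels v_i
v_bar = v.max()
pos = rng.choice(N, size=I, replace=False)       # agent positions s_k^i

Y = np.zeros((N, N))
for i in range(I):
    area = np.union1d(rng.choice(N, size=3), [pos[i]])   # sensing area C_k^i
    for l in area:
        Y[l, l] += 1.0 / v[i]

a = np.zeros(N)
a[pos] = 1.0                                     # action vector a_k
norm_a_sq = float(a @ np.diag(np.diag(Sigma)) @ a)   # ||a_k||^2_{D_Sigma}
assert np.trace(Y @ Sigma) >= norm_a_sq / v_bar - 1e-12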

With the help of the above lemmas, we are now in a position to prove the theorem. By the definition of the regret in (53), it is easy to see that the regret r_{k} has a uniform upper bound, i.e., r_{k}\leq\bar{\gamma}:=2\sqrt{I\bar{\alpha}}\cdot\|\bm{\phi}_{0}\|^{2}. Based on the above Lemma 6, we have

r_{k}\leq\min\big\{\bar{\gamma},\;2\sqrt{N}\beta_{k}(\delta)\cdot\|\mathbf{a}_{k}\|_{\mathcal{D}_{\Sigma_{k}}}\big\} (70)
\leq\beta^{\prime}_{k}(\delta)\sqrt{N}\cdot\min\big\{1,\;1/\sqrt{\bar{v}}\cdot\|\mathbf{a}_{k}\|_{\mathcal{D}_{\Sigma_{k}}}\big\},

where we denote \beta^{\prime}_{k}(\delta)=\max\{\bar{\gamma},2\sqrt{\bar{v}}\beta_{k}(\delta)\}. According to the definition (18) of the sequence \{\beta_{k}(\delta)\}_{k\in\mathbb{N}_{+}}, it follows that \beta^{\prime}_{k}(\delta)\leq\beta^{\prime}_{k+1}(\delta). Therefore, the cumulative regret satisfies

\sum_{k=0}^{K-1}r_{k}\leq\beta^{\prime}_{K}(\delta)\sqrt{N}\cdot\sum_{k=0}^{K-1}\min\big\{1,\;1/\sqrt{\bar{v}}\cdot\|\mathbf{a}_{k}\|_{\mathcal{D}_{\Sigma_{k}}}\big\} (71)
=\beta^{\prime}_{K}(\delta)\sqrt{N}\cdot\Bigg(\sum_{k=0}^{K-1}\mathbbm{1}_{\lambda_{k}=1}\cdot\min\big\{1,\;\sqrt{\lambda_{k}/\bar{v}}\cdot\|\mathbf{a}_{k}\|_{\mathcal{D}_{\Sigma_{k}}}\big\}+\sum_{k=0}^{K-1}\mathbbm{1}_{\lambda_{k}<1}\cdot\min\big\{1,\;1/\sqrt{\bar{v}}\cdot\|\mathbf{a}_{k}\|_{\mathcal{D}_{\Sigma_{k}}}\big\}\Bigg).

Note that \mathbbm{1}_{\lambda_{k}=1} and \mathbbm{1}_{\lambda_{k}<1} represent indicator functions, and the last equality is due to the fact that \lambda_{k}\leq 1,\forall k\geq 0. Now, let us investigate the two terms in (71) separately. For the first term, by Lemmas 7 and 8, it follows that

\sum_{k=0}^{K-1}\mathbbm{1}_{\lambda_{k}=1}\cdot\min\big\{1,\;\sqrt{\lambda_{k}/\bar{v}}\cdot\|\mathbf{a}_{k}\|_{\mathcal{D}_{\Sigma_{k}}}\big\} (72)
\overset{(5.a)}{\leq}\sqrt{K}\cdot\sqrt{\sum_{k=0}^{K-1}\min\big\{1,\;\lambda_{k}/\bar{v}\cdot\|\mathbf{a}_{k}\|^{2}_{\mathcal{D}_{\Sigma_{k}}}\big\}}
\overset{(5.b)}{\leq}\sqrt{K}\cdot\sqrt{\sum_{k=0}^{K-1}\min\big\{1,\;\lambda_{k}\cdot\text{Tr}(Y_{k}\Sigma_{k})\big\}}
\overset{(5.c)}{\leq}\sqrt{2NK\cdot\log\big(\bar{\sigma}/\underaccent{\bar}{\sigma}+K\bar{\alpha}I/\underaccent{\bar}{v}\big)},

where (5.a) is due to the Cauchy–Schwarz inequality and the fact that \sum_{k=0}^{K-1}\mathbbm{1}_{\lambda_{k}=1}\leq K; (5.b) is based on Lemma 8-1); and (5.c) is according to Lemma 7. For the second term, it holds that

\sum_{k=0}^{K-1}\mathbbm{1}_{\lambda_{k}<1}\cdot\min\big\{1,\;1/\sqrt{\bar{v}}\cdot\|\mathbf{a}_{k}\|_{\mathcal{D}_{\Sigma_{k}}}\big\} (73)
\overset{(6.a)}{\leq}\sum_{k=0}^{K-1}\mathbbm{1}_{\lambda_{k}<1}\cdot\min\big\{1,\;\sqrt{\bar{v}N}\cdot\|Y_{k}\|_{\Sigma_{k}}\big\}
\overset{(6.b)}{=}\sum_{k=0}^{K-1}\mathbbm{1}_{\lambda_{k}<1}\cdot\min\big\{1,\;\sqrt{\bar{v}N}\lambda_{k}/\bar{\lambda}\cdot\|Y_{k}\|^{2}_{\Sigma_{k}}\big\}
\overset{(6.c)}{\leq}\sum_{k=0}^{K-1}\mathbbm{1}_{\lambda_{k}<1}\cdot\min\big\{1,\;\sqrt{\bar{v}N}\lambda_{k}/(\bar{\lambda}\underaccent{\bar}{v})\cdot\text{Tr}(Y_{k}\Sigma_{k})\big\}
\overset{(6.d)}{\leq}\lambda^{\prime}\cdot\sum_{k=0}^{K-1}\min\big\{1,\;\lambda_{k}\cdot\text{Tr}(Y_{k}\Sigma_{k})\big\}
\overset{(6.e)}{\leq}2N\lambda^{\prime}\cdot\log\big(\bar{\sigma}/\underaccent{\bar}{\sigma}+K\bar{\sigma}\bar{\alpha}I/\underaccent{\bar}{v}\big),

where (6.a) is due to the fact that \|\mathbf{a}_{k}\|_{\mathcal{D}_{\Sigma_{k}}}\leq\bar{v}\sqrt{N}\cdot\|Y_{k}\|_{\Sigma_{k}} by Lemma 8; (6.b) is according to the choice of the weights, i.e., \lambda_{k}=\bar{\lambda}/\|Y_{k}\|_{\Sigma_{k}} given that \lambda_{k}<1; (6.c) is based on Lemma 8-2); in (6.d), we let \lambda^{\prime}=\max\{1,\sqrt{\bar{v}N}/(\bar{\lambda}\underaccent{\bar}{v})\}; and (6.e) is according to Lemma 7.

Now, combining the results obtained in (71)–(73) with the definition of \beta_{k}(\delta) in (18), and letting \bar{\lambda}=\sqrt{N}/B_{K}, it yields that

\sum_{k=0}^{K-1}r_{k}\leq\beta^{\prime}_{K}(\delta)\sqrt{N}\cdot\Big(\sqrt{2NK\cdot\log\big(\bar{\sigma}/\underaccent{\bar}{\sigma}+K\bar{\sigma}\bar{\alpha}I/\underaccent{\bar}{v}\big)}+2N\lambda^{\prime}\cdot\log\big(\bar{\sigma}/\underaccent{\bar}{\sigma}+K\bar{\sigma}\bar{\alpha}I/\underaccent{\bar}{v}\big)\Big) (74)
\leq\mathcal{O}\Big(N^{3/2}\sqrt{\log{K}}\cdot\big(\sqrt{NK\log{K}}+NB_{K}\log{K}\big)\Big)
=\mathcal{O}\Big(N^{2}\sqrt{K}\log{K}+N^{5/2}B_{K}\log^{3/2}{K}\Big)
=\widetilde{\mathcal{O}}\Big(N^{2}\sqrt{K}+N^{5/2}B_{K}\Big).

Therefore, the proof is completed.

VI-C Proof of Proposition 2

This proof can be completed by following similar steps as those for Proposition 1, except for the main differences in the dynamics of the state \widetilde{\bm{\phi}}_{k} (due to the different type of disturbance) and in the specification of the weights \lambda_{k} and \omega_{k}.

Taking the dynamics (3b) into account and following the same steps as before, one can show (details omitted) that

\big\|\widehat{\bm{\phi}}_{k}-\overline{\bm{\phi}}_{k}\big\|_{\mathcal{D}^{-1/2}_{\Sigma_{k}},\infty}\leq\sqrt{N}\cdot\Big(\big\|\Sigma_{0}^{-1}(\widehat{\bm{\phi}}_{0}-\widetilde{\bm{\phi}}_{0})\big\|_{\Upsilon_{k}^{-1}}+\lambda_{k-1}\big\|A[k:1]^{-1}\widetilde{\bm{\phi}}_{k}\big\|_{\Upsilon_{k}^{-1}}+\Big\|\sum_{t=0}^{k-1}\lambda_{t}A[t:1]^{\top}H_{t}^{\top}V^{-1}\mathbf{n}_{t}\Big\|_{\Upsilon_{k}^{-1}}\Big), (75)

where \overline{\bm{\phi}}_{k} is defined as in (21). While the first term on the right-hand side can be bounded exactly as in Lemma 3, the last two need specific attention to obtain the upper bounds.

First, we show in the following lemma that the second term can indeed be bounded in terms of the weight \lambda_{k-1}.

Lemma 9

Under the conditions in Proposition 2, there exists a constant C_{3}=\bar{\phi}/\sqrt{\underaccent{\bar}{\alpha}} such that

\lambda_{k-1}\big\|A[k:1]^{-1}\widetilde{\bm{\phi}}_{k}\big\|_{\Upsilon_{k}^{-1}}\leq C_{3}\sqrt{\lambda_{k-1}}. (76)
Proof:

By the definition (13) of \Upsilon_{k} and the specification of \lambda_{k}=\omega_{k}=(1/\gamma)^{k}, it can be verified that \Upsilon_{k}^{-1}\preceq 1/\lambda_{k-1}\cdot\mathbf{I}_{N}. Therefore, it holds that

\big\|A[k:1]^{-1}\widetilde{\bm{\phi}}_{k}\big\|^{2}_{\Upsilon_{k}^{-1}}\leq 1/\lambda_{k-1}\cdot\big\|A[k:1]^{-1}\widetilde{\bm{\phi}}_{k}\big\|^{2} (77)
=1/\lambda_{k-1}\cdot\|\widetilde{\bm{\phi}}_{k}\|^{2}_{A[k:1]^{-\top}A[k:1]^{-1}}
\leq 1/(\underaccent{\bar}{\alpha}\lambda_{k-1})\cdot\|\widetilde{\bm{\phi}}_{k}\|^{2}
\leq\bar{\phi}^{2}/(\underaccent{\bar}{\alpha}\lambda_{k-1}),

where the last two inequalities are due to Assumption 1 and the condition \|\widetilde{\bm{\phi}}_{k}\|\leq\bar{\phi} in Assumption 3, respectively. Therefore, the proof is completed. ∎

Next, the third term can be handled by applying the result on self-normalized martingales as in Lemma 5. Nevertheless, to accommodate the change in the matrix \Upsilon_{k}, we need to modify the definition of \Omega_{k} accordingly,

\Omega_{k}:=\sum_{t=0}^{k-1}X_{t}X_{t}^{\top}+\lambda_{k-1}^{2}\cdot\mathbf{I}_{N}\in\mathbb{R}^{N\times N}, (78)

where X_{t} is defined as before; see equation (43). As a result, we have that, with probability at least 1-\delta,

\Big\|\sum_{t=0}^{k-1}X_{t}\mathbf{n}_{t}\Big\|_{\Omega_{k}^{-1}}\leq 2\bar{v}^{2}\cdot\sqrt{\log\Big(\frac{\det(\Omega_{k})^{1/2}\cdot\lambda_{k-1}^{-N}}{\delta}\Big)}. (79)

Due to the fact that the sequence \{\lambda_{k}\}_{k\in\mathbb{N}_{+}} is increasing with \lambda_{1}>1, it can be proved by following the same steps as before that \Omega_{k}\preceq\max\{1,1/\underaccent{\bar}{v}\}\cdot\lambda_{k-1}\Upsilon_{k}, and furthermore,

\Big\|\sum_{t=0}^{k-1}X_{t}\mathbf{n}_{t}\Big\|_{\Upsilon_{k}^{-1}}\leq\sqrt{\max\{1,1/\underaccent{\bar}{v}\}\cdot\lambda_{k-1}}\cdot\Big\|\sum_{t=0}^{k-1}X_{t}\mathbf{n}_{t}\Big\|_{\Omega_{k}^{-1}} (80)
\leq C_{2}\sqrt{N}\sqrt{\lambda_{k-1}}\cdot\sqrt{\log\Big(\frac{1+\bar{\alpha}/\underaccent{\bar}{v}^{2}\cdot\lambda_{k-1}^{-2}\sum_{t=0}^{k-1}\lambda_{t}^{2}}{\delta^{2/N}}\Big)}
=C_{2}\sqrt{N}\sqrt{\lambda_{k-1}}\cdot\sqrt{\log\Big(\frac{1+\bar{\alpha}/\underaccent{\bar}{v}^{2}\cdot\sum_{t=0}^{k-1}\gamma^{2(k-t-1)}}{\delta^{2/N}}\Big)},

where C_{2}=\bar{v}^{2}\sqrt{\max\{2,2/\underaccent{\bar}{v}\}}.

Now, combining the above inequality with (75) as well as Lemmas 3 and 9, one has

\big\|\widehat{\bm{\phi}}_{k}-\overline{\bm{\phi}}_{k}\big\|_{\mathcal{D}^{-1/2}_{\Sigma_{k}},\infty}\leq\sqrt{N}\cdot\Bigg(C_{1}+C_{3}\sqrt{\lambda_{k-1}}+C_{2}\sqrt{N}\sqrt{\lambda_{k-1}}\cdot\sqrt{\log\Big(\frac{1+\bar{\alpha}/\underaccent{\bar}{v}^{2}\cdot\sum_{t=0}^{k-1}\gamma^{2(k-t-1)}}{\delta^{2/N}}\Big)}\Bigg). (81)

Therefore, recalling the definition of \beta_{k}(\delta) in (23), the proof of Proposition 2 is completed.

VI-D Proof of Theorem 2

By applying the notions defined in the proof of Theorem 1, one has

\widetilde{r}_{k}=\widetilde{F}_{k}\big(\mathbf{p}^{\star}_{k}\big)-\widetilde{F}_{k}(\mathbf{p}_{k})=\langle\mathbf{a}^{\star}_{k}-\mathbf{a}_{k},\widetilde{\bm{\phi}}_{k}\rangle (82)
=\langle\mathbf{a}^{\star}_{k},\;\overline{\bm{\phi}}_{k}\rangle-\langle\mathbf{a}_{k},\;\overline{\bm{\phi}}_{k}\rangle+\langle\mathbf{a}^{\star}_{k}-\mathbf{a}_{k},\;\widetilde{\bm{\phi}}_{k}-\overline{\bm{\phi}}_{k}\rangle
\overset{(7.a)}{\leq}\langle\mathbf{a}_{k},\;\bm{\mu}_{k}-\overline{\bm{\phi}}_{k}\rangle+2\sqrt{I}\cdot\|\widetilde{\bm{\phi}}_{k}-\overline{\bm{\phi}}_{k}\|
\overset{(7.b)}{\leq}\|\mathbf{a}_{k}\|_{\mathcal{D}^{1/2}_{\Sigma_{k}},1}\cdot\|\bm{\mu}_{k}-\overline{\bm{\phi}}_{k}\|_{\mathcal{D}^{-1/2}_{\Sigma_{k}},\infty}+2\sqrt{I}\cdot\|\widetilde{\bm{\phi}}_{k}-\overline{\bm{\phi}}_{k}\|
\overset{(7.c)}{\leq}2\sqrt{N}\beta_{k}(\delta)\cdot\|\mathbf{a}_{k}\|_{\mathcal{D}_{\Sigma_{k}}}+2\sqrt{I}\cdot\|\widetilde{\bm{\phi}}_{k}-\overline{\bm{\phi}}_{k}\|,

where (7.a) is due to the definitions of \mathbf{a}^{\star}_{k} and \mathbf{a}_{k} as well as the fact that \langle\mathbf{a}^{\star}_{k},\;\overline{\bm{\phi}}_{k}\rangle\leq\langle\mathbf{a}^{\star}_{k},\;\bm{\mu}_{k}\rangle\leq\langle\mathbf{a}_{k},\;\bm{\mu}_{k}\rangle; (7.b) follows from Hölder's inequality; and (7.c) comes from the inequality (81) proved in the proof of Proposition 2. It can be seen from the above result that, due to the involvement of all prior disturbances \bm{\delta}_{k} in the state \widetilde{\bm{\phi}}_{k}, the regret \widetilde{r}_{k} can no longer be bounded by the action term \|\mathbf{a}_{k}\|_{\mathcal{D}_{\Sigma_{k}}} alone; an extra term related to the discrepancy between \widetilde{\bm{\phi}}_{k} and \overline{\bm{\phi}}_{k} must also be considered. Hence, we next provide an upper bound for \|\widetilde{\bm{\phi}}_{k}-\overline{\bm{\phi}}_{k}\| with respect to the disturbances \bm{\delta}_{k}.

Lemma 10

Under the conditions in Proposition 2, it holds that, for 0\leq D\leq k,

\|\widetilde{\bm{\phi}}_{k}-\overline{\bm{\phi}}_{k}\|\leq C_{4}/\lambda_{k-1}+C_{5}/\lambda_{k-1}\cdot\sum_{t=0}^{k-D-1}\lambda_{t}+C_{6}\sum_{t=k-D}^{k-1}\|\bm{\delta}_{t}\|, (83)

where C_{4}=\bar{\phi}\sqrt{\bar{\alpha}}\cdot(1+1/\sqrt{\underaccent{\bar}{\alpha}})/\underaccent{\bar}{\sigma}, C_{5}=\bar{\alpha}\bar{\phi}\cdot(1+1/\sqrt{\underaccent{\bar}{\alpha}})/(\underaccent{\bar}{v}I), and C_{6}=\sqrt{\bar{\alpha}/\underaccent{\bar}{\alpha}}.

Proof:

Recalling the definition (21) of \overline{\bm{\phi}}_{k}, it follows that

\overline{\bm{\phi}}_{k}-\widetilde{\bm{\phi}}_{k}=A[k:1]\Upsilon_{k}^{-1}\Big(\Sigma_{0}^{-1}\big(\widetilde{\bm{\phi}}_{0}-A[k:1]^{-1}\widetilde{\bm{\phi}}_{k}\big)+\sum_{t=0}^{k-D-1}\lambda_{t}A[t:1]^{\top}H_{t}^{\top}V^{-1}H_{t}\big(\widetilde{\bm{\phi}}_{t}-A[k:t+1]^{-1}\widetilde{\bm{\phi}}_{k}\big)+\sum_{t=k-D}^{k-1}\lambda_{t}A[t:1]^{\top}H_{t}^{\top}V^{-1}H_{t}\big(\widetilde{\bm{\phi}}_{t}-A[k:t+1]^{-1}\widetilde{\bm{\phi}}_{k}\big)\Big). (84)

We next upper bound, in order, the three terms on the right-hand side of (84).

Term I: Due to the fact that \Upsilon_{k}^{-1}\preceq 1/\lambda_{k-1}\cdot\mathbf{I}_{N}, it holds that

\Big\|A[k:1]\Upsilon_{k}^{-1}\Sigma_{0}^{-1}\big(\widetilde{\bm{\phi}}_{0}-A[k:1]^{-1}\widetilde{\bm{\phi}}_{k}\big)\Big\|\leq\big\|\widetilde{\bm{\phi}}_{0}\big\|_{M_{1}}+\big\|\widetilde{\bm{\phi}}_{k}\big\|_{M_{2}}. (85)

Note that, in (85), the two matrices M_{1} and M_{2} satisfy

M_{1}:=\Sigma_{0}^{-1}\Upsilon_{k}^{-1}A[k:1]^{\top}A[k:1]\Upsilon_{k}^{-1}\Sigma_{0}^{-1}\preceq\bar{\alpha}/(\lambda_{k-1}\underaccent{\bar}{\sigma})^{2}\cdot\mathbf{I}_{N}, (86)

and

M_{2}:=A[k:1]^{-\top}\Sigma_{0}^{-1}\Upsilon_{k}^{-1}A[k:1]^{\top}A[k:1]\Upsilon_{k}^{-1}\Sigma_{0}^{-1}A[k:1]^{-1}\preceq\bar{\alpha}/\big(\underaccent{\bar}{\alpha}(\lambda_{k-1}\underaccent{\bar}{\sigma})^{2}\big)\cdot\mathbf{I}_{N}. (87)

As a result of \|\widetilde{\bm{\phi}}_{k}\|\leq\bar{\phi} in Assumption 3, we have

\Big\|A[k:1]\Upsilon_{k}^{-1}\Sigma_{0}^{-1}\big(\widetilde{\bm{\phi}}_{0}-A[k:1]^{-1}\widetilde{\bm{\phi}}_{k}\big)\Big\|\leq C_{4}/\lambda_{k-1}, (88)

where C_{4}=\bar{\phi}\sqrt{\bar{\alpha}}\cdot(1+1/\sqrt{\underaccent{\bar}{\alpha}})/\underaccent{\bar}{\sigma}.

Term II: Following the same path as in the analysis of the first term, it can be shown that

\Big\|A[k:1]\Upsilon_{k}^{-1}\cdot\sum_{t=0}^{k-D-1}\lambda_{t}A[t:1]^{\top}H_{t}^{\top}V^{-1}H_{t}\big(\widetilde{\bm{\phi}}_{t}-A[k:t+1]^{-1}\widetilde{\bm{\phi}}_{k}\big)\Big\| (89)
\leq\bar{\alpha}\bar{\phi}/(\underaccent{\bar}{v}I\lambda_{k-1})\cdot\sum_{t=0}^{k-D-1}\lambda_{t}+\bar{\alpha}\bar{\phi}/(\sqrt{\underaccent{\bar}{\alpha}}\underaccent{\bar}{v}I\lambda_{k-1})\cdot\sum_{t=0}^{k-D-1}\lambda_{t}
\leq C_{5}/\lambda_{k-1}\cdot\sum_{t=0}^{k-D-1}\lambda_{t},

where C_{5}=\bar{\alpha}\bar{\phi}\cdot(1+1/\sqrt{\underaccent{\bar}{\alpha}})/(\underaccent{\bar}{v}I).

Term III: Note that, for kt+1\forall k\geq t+1, we can have

A[k:t+1]^{-1}\widetilde{\bm{\phi}}_{k}-\widetilde{\bm{\phi}}_{t}=\sum_{s=t}^{k-1}A[s+1:t+1]^{-1}(\widetilde{\bm{\phi}}_{s+1}-A_{s+1}\widetilde{\bm{\phi}}_{s}). (90)

Therefore, it holds that

\Big\|A[k:1]\Upsilon_{k}^{-1}\cdot\sum_{t=k-D}^{k-1}\lambda_{t}A[t:1]^{\top}H_{t}^{\top}V^{-1}H_{t}\big(A[k:t+1]^{-1}\widetilde{\bm{\phi}}_{k}-\widetilde{\bm{\phi}}_{t}\big)\Big\| (91)
\overset{(8.a)}{\leq}\sqrt{\bar{\alpha}}\cdot\Big\|\Upsilon_{k}^{-1}\sum_{t=k-D}^{k-1}\sum_{s=t}^{k-1}\lambda_{t}A[t:1]^{\top}H_{t}^{\top}V^{-1}H_{t}\cdot A[s+1:t+1]^{-1}(\widetilde{\bm{\phi}}_{s+1}-A_{s+1}\widetilde{\bm{\phi}}_{s})\Big\|
\overset{(8.b)}{=}\sqrt{\bar{\alpha}}\cdot\Big\|\sum_{s=k-D}^{k-1}\Upsilon_{k}^{-1}\sum_{t=k-D}^{s}\lambda_{t}A[t:1]^{\top}H_{t}^{\top}V^{-1}H_{t}A[t:1]\cdot A[s+1:1]^{-1}(\widetilde{\bm{\phi}}_{s+1}-A_{s+1}\widetilde{\bm{\phi}}_{s})\Big\|
\overset{(8.c)}{\leq}\sqrt{\bar{\alpha}}\cdot\sum_{s=k-D}^{k-1}\Big\|\Upsilon_{k}^{-1}\sum_{t=k-D}^{s}\lambda_{t}A[t:1]^{\top}H_{t}^{\top}V^{-1}H_{t}A[t:1]\cdot A[s+1:1]^{-1}(\widetilde{\bm{\phi}}_{s+1}-A_{s+1}\widetilde{\bm{\phi}}_{s})\Big\|
\overset{(8.d)}{\leq}\sqrt{\bar{\alpha}}\cdot\sum_{s=k-D}^{k-1}\Big\|A[s+1:1]^{-1}(\widetilde{\bm{\phi}}_{s+1}-A_{s+1}\widetilde{\bm{\phi}}_{s})\Big\|
\overset{(8.e)}{\leq}\sqrt{\bar{\alpha}/\underaccent{\bar}{\alpha}}\cdot\sum_{s=k-D}^{k-1}\big\|\widetilde{\bm{\phi}}_{s+1}-A_{s+1}\widetilde{\bm{\phi}}_{s}\big\|.

Note that (8.a) and (8.e) are due to Assumption 1; in (8.b), we exchange the order of summation and apply (90); (8.c) follows from the triangle inequality; and (8.d) is due to the fact that

\Upsilon_{k}^{-1}\sum_{t=k-D}^{s}\lambda_{t}A[t:1]^{\top}H_{t}^{\top}V^{-1}H_{t}A[t:1]\preceq\Upsilon_{k}^{-1}\Big(\Sigma_{0}^{-1}+\sum_{t=0}^{k-1}\lambda_{t}A[t:1]^{\top}H_{t}^{\top}V^{-1}H_{t}A[t:1]+\lambda_{k-1}\cdot\mathbf{I}_{N}\Big)=\mathbf{I}_{N}. (92)

Thus, letting C_{6}=\sqrt{\bar{\alpha}/\underaccent{\bar}{\alpha}} and based on the upper bounds of the three terms, i.e., inequalities (88), (89) and (91), the proof of Lemma 10 is completed. ∎

In terms of \|\mathbf{a}_{k}\|_{\mathcal{D}_{\Sigma_{k}}} appearing in the bound (82) of the regret \widetilde{r}_{k}, we apply the same analysis as in the proof of Theorem 1 (see Lemma 7) and obtain the following result.

Lemma 11

Under the conditions in Proposition 2, it holds that

\sum_{k=0}^{K}\min\big\{1,\lambda_{k}\text{Tr}(Y_{k}\Sigma_{k})\big\}\leq 2N\cdot\log\Big(\bar{\sigma}\big(\underaccent{\bar}{\sigma}^{-1}+\lambda_{K}+\bar{\alpha}I/\underaccent{\bar}{v}\cdot\sum_{k=0}^{K}\lambda_{k}\big)\Big). (93)
Proof:

This proof can be finished by following the same steps as the one for Lemma 7. However, two differences should be noted, which result from the distinct definition of \Upsilon_{k}.

First, under the conditions in this lemma, the recursion of \Upsilon_{k} follows

\Upsilon_{k+1}=\Upsilon_{k}+\lambda_{k}A[k:1]^{\top}Y_{k}A[k:1]+(\lambda_{k}-\lambda_{k-1})\cdot\mathbf{I}_{N}. (94)

Despite the difference compared to (57), due to the fact that

\det\big(\Upsilon_{k+1}\big)\geq\det\Big(\Upsilon_{k}+\lambda_{k}A[k:1]^{\top}Y_{k}A[k:1]\Big), (95)

the (in)equalities in (58) remain valid, and so does the subsequent deduction. At last, based on the definition of \Upsilon_{k} in (13), the final bound in (93) is obtained from

\det(\Upsilon_{k+1})\leq\Big(1/N\cdot\text{Tr}(\Sigma_{0}^{-1})+1/N\cdot\sum_{t=0}^{k}\lambda_{t}\text{Tr}(\Xi_{t})+\lambda_{k}\Big)^{N} (96)
\leq\Big(\underaccent{\bar}{\sigma}^{-1}+\lambda_{k}+\bar{\alpha}I/\underaccent{\bar}{v}\cdot\sum_{t=0}^{k}\lambda_{t}\Big)^{N},
which completes the proof. ∎

With the help of the above lemmas, we are ready to prove the sub-linear regret stated in Theorem 2. Notice that a uniform upper bound \bar{\gamma}:=2\sqrt{I}\bar{\phi} still exists for the regret \widetilde{r}_{k}. Therefore, it follows from (82) that

\widetilde{r}_{k}\leq\min\big\{\bar{\gamma},\;2\sqrt{N}\beta_{k}(\delta)\cdot\|\mathbf{a}_{k}\|_{\mathcal{D}_{\Sigma_{k}}}\big\}+2\sqrt{I}\cdot\|\widetilde{\bm{\phi}}_{k}-\overline{\bm{\phi}}_{k}\| (97)
\leq\sqrt{N}\beta^{\prime}_{k}(\delta)\cdot\min\big\{1,\;\sqrt{\lambda_{k}/\bar{v}}\cdot\|\mathbf{a}_{k}\|_{\mathcal{D}_{\Sigma_{k}}}\big\}+2\sqrt{I}\cdot\|\widetilde{\bm{\phi}}_{k}-\overline{\bm{\phi}}_{k}\|,

where we let \beta^{\prime}_{k}(\delta):=\max\{\bar{\gamma},\,2\sqrt{\bar{v}/\lambda_{k}}\beta_{k}(\delta)\} in the last inequality. Therefore, it holds that

\sum_{k=0}^{K-1}\widetilde{r}_{k}\leq\sqrt{N}\beta^{\prime}_{K}(\delta)\cdot\sum_{k=0}^{K-1}\min\big\{1,\;\sqrt{\lambda_{k}/\bar{v}}\cdot\|\mathbf{a}_{k}\|_{\mathcal{D}_{\Sigma_{k}}}\big\}+2\sqrt{I}\cdot\sum_{k=0}^{K-1}\|\widetilde{\bm{\phi}}_{k}-\overline{\bm{\phi}}_{k}\| (98)
\overset{(9.a)}{\leq}\sqrt{N}\beta^{\prime}_{K}(\delta)\cdot\sqrt{K\cdot\sum_{k=0}^{K-1}\min\big\{1,\;\lambda_{k}\text{Tr}(Y_{k}\Sigma_{k})\big\}}+2\sqrt{I}\cdot\sum_{k=0}^{K-1}\|\widetilde{\bm{\phi}}_{k}-\overline{\bm{\phi}}_{k}\|
\overset{(9.b)}{\leq}N\beta^{\prime}_{K}(\delta)\sqrt{2K}\cdot\log\Big(\bar{\sigma}\big(\underaccent{\bar}{\sigma}^{-1}+\lambda_{K}+\bar{\alpha}I/\underaccent{\bar}{v}\cdot\sum_{k=0}^{K-1}\lambda_{k}\big)\Big)+2\sqrt{I}C_{4}\cdot\sum_{k=0}^{K-1}1/\lambda_{k}+2\sqrt{I}C_{5}\cdot\sum_{k=0}^{K-1}1/\lambda_{k}\cdot\sum_{t=0}^{k-D-1}\lambda_{t}+2\sqrt{I}C_{6}\cdot\sum_{k=0}^{K-1}\sum_{t=k-D}^{k-1}\|\widetilde{\bm{\phi}}_{t+1}-A_{t+1}\widetilde{\bm{\phi}}_{t}\|
\overset{(9.c)}{\leq}N\beta^{\prime}_{K}(\delta)\sqrt{2K}\cdot\log\Big(\bar{\sigma}\big(\underaccent{\bar}{\sigma}^{-1}+\gamma^{-K}+\bar{\alpha}I/\underaccent{\bar}{v}\cdot\gamma^{-K}/(1-\gamma)\big)\Big)+2\sqrt{I}C_{4}/(1-\gamma)+2\sqrt{I}C_{5}\cdot(K-D)\gamma^{D+1}/(1-\gamma)+2\sqrt{I}C_{6}\cdot DB_{K},

where (9.a) is due to the Cauchy–Schwarz inequality and Lemma 8-1); (9.b) is by Lemmas 10 and 11; and (9.c) follows from the specification of \lambda_{k}=(1/\gamma)^{k} with 0<\gamma<1.

Now, given that \gamma=1-(B_{K}/K)^{2/3} (see the condition of Theorem 2) and letting D=\lfloor\log(K)/(1-\gamma)\rfloor, it can be confirmed that D\leq B_{K}^{-2/3}K^{2/3}\log(K) and therefore,

2\sqrt{I}C_{6}\cdot DB_{K}=\widetilde{\mathcal{O}}\Big(B_{K}^{1/3}K^{2/3}\Big). (99)

Further, considering that \log(1/\gamma)\sim 1-\gamma=(B_{K}/K)^{2/3}, it holds that \gamma^{D}=e^{D\log(\gamma)}\leq e^{\log(K)\log(\gamma)/(1-\gamma)}=\widetilde{\mathcal{O}}(1/K), and consequently,

2\sqrt{I}C_{4}/(1-\gamma)+2\sqrt{I}C_{5}\cdot(K-D)\gamma^{D+1}/(1-\gamma)\sim 1/(1-\gamma)=\widetilde{\mathcal{O}}\Big(B_{K}^{-2/3}K^{2/3}\Big). (100)

According to the definitions of \beta^{\prime}_{k}(\delta) and \beta_{k}(\delta), it holds that

\beta^{\prime}_{K}(\delta)\sim\beta_{K}(\delta)/\sqrt{\lambda_{K}}\sim N\sqrt{\log\Big(\sum_{t=0}^{K-1}\gamma^{2(K-t-1)}\Big)} (101)
\leq N\sqrt{\log\big(1/(1-\gamma)\big)}=\mathcal{O}\Big(N\sqrt{\log(K/B_{K})}\Big).

As a result, one has

N\beta^{\prime}_{K}(\delta)\sqrt{2K}\cdot\log\Big(\bar{\sigma}\big(\underaccent{\bar}{\sigma}^{-1}+\gamma^{-K}+\bar{\alpha}I/\underaccent{\bar}{v}\cdot\gamma^{-K}/(1-\gamma)\big)\Big) (102)
\sim N^{2}\sqrt{\log(K/B_{K})}\cdot\sqrt{K}\cdot\sqrt{K\log(1/\gamma)+\log\big(1/(1-\gamma)\big)}
=\widetilde{\mathcal{O}}\Big(N^{2}B_{K}^{1/3}K^{2/3}\Big).

At last, combining (98)–(100) and (102) arrives at the conclusion of Theorem 2, i.e., the cumulative regret generated by our algorithm is upper bounded as \widetilde{R}_{K}\leq\widetilde{\mathcal{O}}\big(N^{2}B_{K}^{1/3}K^{2/3}\big).

References

  • [1] J. Poveda, M. Benosman, A. Teel, and R. Sanfelice. Robust coordinated hybrid source seeking with obstacle avoidance in multi-vehicle autonomous systems. IEEE Transactions on Automatic Control, 2021.
  • [2] B. Angélico, L. Chamon, S. Paternain, A. Ribeiro, and G. Pappas. Source seeking in unknown environments with convex obstacles. In Proceedings of 2021 American Control Conference, pages 5055–5061. IEEE, 2021.
  • [3] T. Li, B. Jayawardhana, A. Kamat, and A. Kottapalli. Source-seeking control of unicycle robots with 3-D printed flexible piezoresistive sensors. IEEE Transactions on Robotics, 2021.
  • [4] W. Liu, X. Huo, G. Duan, and K. Ma. Semi-global stability analysis of source seeking with dynamic sensor reading and a class of nonlinear maps. International Journal of Control, pages 1–10, 2020.
  • [5] E. Ramirez-Llanos and S. Martinez. Stochastic source seeking for mobile robots in obstacle environments via the SPSA method. IEEE Transactions on Automatic Control, 64(4):1732–1739, 2018.
  • [6] S. Azuma, M. Sakar, and G. Pappas. Stochastic source seeking by mobile robots. IEEE Transactions on Automatic Control, 57(9):2308–2321, 2012.
  • [7] J. Habibi, H. Mahboubi, and A. Aghdam. A gradient-based coverage optimization strategy for mobile sensor networks. IEEE Transactions on Control of Network Systems, 4(3):477–488, 2016.
  • [8] E. Rolf, D. Fridovich-Keil, M. Simchowitz, B. Recht, and C. Tomlin. A successive-elimination approach to adaptive robotic source seeking. IEEE Transactions on Robotics, 37(1):34–47, 2020.
  • [9] B. Du, K. Qian, H. Iqbal, C. Claudel, and D. Sun. Multi-robot dynamical source seeking in unknown environments. In Proceedings of 2021 IEEE International Conference on Robotics and Automation, pages 9036–9042. IEEE, 2021.
  • [10] J. Feiling, S. Koga, M. Krstić, and T. Oliveira. Gradient extremum seeking for static maps with actuation dynamics governed by diffusion PDEs. Automatica, 95:197–206, 2018.
  • [11] S. Dougherty and M. Guay. An extremum-seeking controller for distributed optimization over sensor networks. IEEE Transactions on Automatic Control, 62(2):928–933, 2016.
  • [12] S. Li, R. Kong, and Y. Guo. Cooperative distributed source seeking by multiple robots: Algorithms and experiments. IEEE/ASME Transactions on Mechatronics, 19(6):1810–1820, 2014.
  • [13] R. Fabbiano, C. Canudas-de-Wit, and F. Garin. Source localization by gradient estimation based on Poisson integral. Automatica, 50(6):1715–1724, 2014.
  • [14] L. Briñón-Arranz, L. Schenato, and A. Seuret. Distributed source seeking via a circular formation of agents under communication constraints. IEEE Transactions on Control of Network Systems, 3(2):104–115, 2015.
  • [15] R. Fabbiano, F. Garin, and C. Canudas-de-Wit. Distributed source seeking without global position information. IEEE Transactions on Control of Network Systems, 5(1):228–238, 2016.
  • [16] N. Atanasov, J. Le Ny, N. Michael, and G. Pappas. Stochastic source seeking in complex environments. In Proceedings of 2012 IEEE International Conference on Robotics and Automation, pages 3013–3018. IEEE, 2012.
  • [17] N. Atanasov, J. Le Ny, and G. Pappas. Distributed algorithms for stochastic source seeking with mobile robot networks. Journal of Dynamic Systems, Measurement, and Control, 137(3), 2015.
  • [18] K. Zhou and J. Doyle. Essentials of Robust Control, volume 104. Prentice Hall, Upper Saddle River, NJ, 1998.
  • [19] N. Agarwal, B. Bullins, E. Hazan, S. Kakade, and K. Singh. Online control with adversarial disturbances. In Proceedings of 2019 International Conference on Machine Learning, pages 111–119. PMLR, 2019.
  • [20] D. Foster and M. Simchowitz. Logarithmic regret for adversarial online control. In Proceedings of 2020 International Conference on Machine Learning, pages 3211–3221. PMLR, 2020.
  • [21] M. Simchowitz, K. Singh, and E. Hazan. Improper learning for non-stochastic control. In Proceedings of 2020 Conference on Learning Theory, pages 3320–3436. PMLR, 2020.
  • [22] E. Hazan, S. Kakade, and K. Singh. The nonstochastic control problem. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, pages 408–421. PMLR, 2020.
  • [23] M. Simchowitz. Making non-stochastic control (almost) as easy as stochastic. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 18318–18329, 2020.
  • [24] W. Cheung, D. Simchi-Levi, and R. Zhu. Learning to optimize under non-stationarity. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pages 1079–1087. PMLR, 2019.
  • [25] Y. Russac, C. Vernade, and O. Cappé. Weighted linear bandits for non-stationary environments. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 12040–12049, 2019.
  • [26] P. Zhao, L. Zhang, Y. Jiang, and Z.-H. Zhou. A simple approach for non-stationary linear bandits. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, pages 746–755. PMLR, 2020.
  • [27] Q. Ding, C.-J. Hsieh, and J. Sharpnack. Robust stochastic linear contextual bandits under adversarial attacks. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, pages 7111–7123. PMLR, 2022.
  • [28] I. Bogunovic, A. Losalka, A. Krause, and J. Scarlett. Stochastic linear bandits robust to adversarial attacks. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, pages 991–999. PMLR, 2021.
  • [29] J. He, D. Zhou, T. Zhang, and Q. Gu. Nearly optimal algorithms for linear contextual bandits with adversarial corruptions. In Advances in Neural Information Processing Systems, 2022.
  • [30] W. Li, Z. Wang, D. Ho, and G. Wei. On boundedness of error covariances for Kalman consensus filtering problems. IEEE Transactions on Automatic Control, 65(6):2654–2661, 2019.
  • [31] G. Battistelli and L. Chisci. Kullback–Leibler average, consensus on probability densities, and distributed state estimation with guaranteed stability. Automatica, 50(3):707–718, 2014.
  • [32] G. Battistelli, L. Chisci, G. Mugnai, A. Farina, and A. Graziano. Consensus-based linear and nonlinear filtering. IEEE Transactions on Automatic Control, 60(5):1410–1415, 2014.
  • [33] F. Cattivelli and A. Sayed. Diffusion strategies for distributed Kalman filtering and smoothing. IEEE Transactions on Automatic Control, 55(9):2069–2084, 2010.
  • [34] L. Yang, M. Hajiesmaili, M. Talebi, J. Lui, and W. Wong. Adversarial bandits with corruptions: Regret lower bound and no-regret algorithm. In Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020.
  • [35] R. Olfati-Saber and J. Shamma. Consensus filters for sensor networks and distributed sensor fusion. In Proceedings of the 44th IEEE Conference on Decision and Control, pages 6698–6703. IEEE, 2005.