Learning to Seek: Multi-Agent Online Source Seeking Against Non-Stochastic Disturbances
Abstract
This paper leverages emerging learning techniques to devise a multi-agent online source seeking algorithm for an unknown environment. Of particular significance in our problem setup are: i) the underlying environment is not only unknown but dynamically changing, and is also perturbed by two types of non-stochastic disturbances; and ii) a group of agents is deployed and expected to cooperatively seek as many sources as possible. Correspondingly, a new technique of discounted Kalman filtering is developed to tackle the non-stochastic disturbances, and a notion of confidence bound of a polytope nature is utilized to enable computation-efficient cooperation among the agents. Under standard assumptions on the unknown environment and the disturbances, our algorithm is shown to achieve sub-linear regrets under the two types of non-stochastic disturbances; both results are comparable to the state-of-the-art. Numerical examples on a real-world pollution monitoring application are provided to demonstrate the effectiveness of our algorithm.
I Introduction
The problem of online source seeking, in which one or multiple agents are deployed to adaptively localize the underlying sources in a possibly unknown and disturbed environment, has gained considerable attention recently among researchers in both the control and robotics communities [1, 2, 3, 4]. Two challenges are of particular significance in solving such a source seeking problem: i) how to obtain a reliable perception or estimation of the unknown environment via observations; and ii) how to integrate the environment estimation with task planning so that the agent(s) can seek sources in an online manner.
To tackle the above two challenges, a variety of methodologies have been investigated in the literature, among which the mainstream approaches are typically based on the estimation of environment gradients [5, 6, 7]. Since the sources are often associated with the maximum/minimum values of a function that characterizes the state of the environment, gradient-based approaches naturally steer the agents to search along the direction of the estimated gradients toward locations whose gradients are close to zero. An appealing feature of this method is that only local measurements are collected during the searching process, without knowledge of the agents’ global positions. However, a critical disadvantage is that the agents are easily trapped in local extrema when the environment cannot be modeled as an ideal convex/concave function.
To further address the above issue, recent methods building on certain learning techniques [8, 9] interleave the learning of an unknown environment with source seeking based on the learned environmental information. In particular, a novel algorithm termed AdaSearch is proposed in [8], which leverages the notions of upper and lower confidence bounds to guide the agent’s adaptive search for static sources. Our previous work [9] considers a more sophisticated searching scenario, in which i) the unknown environment follows certain linear dynamics, so the underlying sources are moving around; and ii) multiple agents are deployed simultaneously with the aim of cooperatively locating as many moving sources as possible. Indeed, one of the significant challenges in such a multi-agent source seeking setup is the combinatorial growth of the searching space as the number of agents increases. To deal with this challenge, we developed a novel notion of confidence bound, termed D-UCB, which constructs a polytope confidence set and helps decompose the searching space for each agent. As a consequence, our algorithm achieves linear complexity with respect to the number of agents, which enables computation-efficient cooperation among multiple agents.
Despite the remarkable feature of our D-UCB algorithm in reducing computational complexity, one critical drawback is its dependence on precise knowledge of the environment dynamics. However, considering that uncertainties and/or disturbances are almost ubiquitous in practice, an exact model of the environment is barely available in real-world applications. To account for disturbances in system dynamics, a set of classical approaches, such as the linear quadratic regulator, incorporates stochastic process noise which is usually assumed to be independent and identically (Gaussian) distributed, in most cases with zero mean. Recently, with the great advancement of learning theory applied to control problems, relevant works have turned to a new paradigm in which the stochastic disturbances are replaced by non-stochastic ones. It is well recognized that, in most problems, the non-stochastic setup is more challenging than the stochastic one, since standard statistical properties of the disturbances are no longer available. On the other hand, it is also more general, since non-stochastic disturbances can not only characterize the modeling deviation of the environment, but can also be interpreted as signals arbitrarily injected by an underlying adversary. As such, we consider in this paper the multi-agent online source seeking problem in the non-stochastic setup, where the environment is perturbed by two types of non-stochastic disturbances. Our objective is to endow the D-UCB algorithm with the capability of dealing with the non-stochastic disturbances while still enjoying low computational complexity and a guaranteed source seeking performance.
I-A Related Works
As mentioned earlier, the predominant approaches to the source seeking problem, including the well-known technique of extremum seeking control [10, 11], often build on the estimation of environment gradients. These approaches can indeed be viewed as variants of first-order optimization algorithms, which drive the agent to search for local extreme values. In particular, by modeling the unknown environment as a time-invariant and concave real-valued function, the authors in [12] designed a distributed source seeking control law for a group of cooperative agents. In addition, a diffusion process is considered in [13] to investigate the scenario of dynamical environments. The source seeking problem is also studied in [14, 15] by forcing the agents to follow a specific circular formation. Further, stochastic gradient based methods are proposed in [16, 17] for the case where the gradient estimation is subject to environment and/or measurement noises. We should note that, also inherited from first-order optimization algorithms, the above gradient based methods are very likely to get stuck at local extrema when the considered environment is non-convex/non-concave. Furthermore, the gradient estimation is sensitive to measurement/environment noise, and thus additional statistical properties of the noise, such as a known distribution with zero mean, need to be imposed as assumptions in the problem setup.
While it is unknown how to deal with noises lacking statistical properties in the context of source seeking, non-stochastic disturbances have received increasingly broad attention in the control community. Within the classical robust control framework, the non-stochastic disturbance is often treated by considering the worst-case performance; see, e.g., [18]. More recent works related to learning based control, however, mainly concern the development of adaptive approaches which aim at controlling a (typically linear) system with adversarial disturbances while optimizing a certain objective function of the system states and control inputs [19, 20, 21, 22, 23]. To measure the performance of adaptive controllers, the notion of regret is adopted; that is, the discrepancy between the gain/cost of the devised controller and that of the best one in hindsight. In particular, the authors in [19] devise the first -regret algorithm by assuming a convex cost function and known system dynamics. Afterwards, such a regret bound is improved to logarithmic in [20, 21] within the same problem setup. To further relax the requirement of known dynamics, the authors in [22] develop an algorithm which attains -regret, and such a bound is later improved to in [21, 23]. Though the above works have investigated the non-stochastic setting in learning based control quite thoroughly, we remark that our paper considers a different problem where some standard conditions in control, such as controllability and observability, can no longer simply be assumed. In fact, our problem is more closely related to a sequential decision process; that is, the agents make their source seeking decisions in sequence while interacting with their perception of the unknown environment.
This sequential feature also makes our setting closely related to the well-known problem of multi-armed bandits. Therefore, another rich line of relevant works is the series of bandit algorithms. More specifically, in the presence of non-stochastic disturbances, linear bandits are investigated in two settings: non-stationary environments and adversarial corruptions. While the former interprets the non-stochastic disturbance as a variation of the environment, the latter corresponds to corruptions injected by potential adversaries. Both cases are well studied in the literature, with algorithms guaranteeing sub-linear regrets. To deal with environmental non-stationarity, the WindowUCB algorithm is first proposed in [24] along with the technique of sliding-window least squares. It is shown that the algorithm achieves a regret of , where measures the level of non-stationarity. The same regret is proved for the weighted linear bandit algorithm proposed in [25], which leverages a weighted least-squares estimator. Further, a simple restart strategy is developed in [26], obtaining the same regret. It is in fact proved that the -regret is the optimal one achievable in the setting of non-stationary bandits. In terms of adversarial bandits, a robust algorithm is proposed in [27] which guarantees -regret, and is thus sub-linear only if the level of adversarial corruptions satisfies . More recently, such a regret has been improved to in [28, 29], which is also shown to be nearly optimal in the adversarial setting. It can be concluded from the above discussion that, once grows sub-linearly, the regrets in both cases are guaranteed to be sub-linear. These are also the state-of-the-art results that we aim to match with the algorithm developed in this work.
I-B Statement of Contributions
This paper proposes an online source seeking algorithmic framework using emerging learning techniques, which is capable of i) dealing with the unknown environment in the presence of non-stochastic disturbances; and ii) taking advantage of the cooperation within the multi-agent network. In terms of the non-stochastic disturbances, two specific types are considered: i) an external one which disturbs the measurable states of the environment; and ii) an internal one which truly evolves with the environment dynamics. To deal with both, a unified technique of discounted Kalman filtering is proposed to estimate the unknown environment states while mitigating the disturbances. Meanwhile, to enable cooperation among multiple agents and avoid combinatorial complexity, we leverage the polytope confidence set; as a result, the proposed algorithm is exceptionally computation-efficient in the multi-agent setting. It is shown by regret analysis that our algorithm attains sub-linear regrets against both types of non-stochastic disturbances. The two obtained regrets are both comparable to the state-of-the-art in the studies of non-stationary and adversarial bandit algorithms. Finally, all theoretical findings are validated by simulation examples on a real-world pollution monitoring application.
II Problem Statement
II-A Unknown Environment with Non-Stochastic Disturbances
Consider an obstacle-free environment which is assumed to be bounded and discretized by a finite set of points, where each represents the corresponding position. Suppose that the unknown state of the environment at each discrete time is described by a real-valued function which maps the positional information to a positive quantity indicating the environmental value of interest. Let denote the total number of points, i.e., , and for simplicity, denote the vector which stacks all individual . Further, to characterize the dynamics of the changing environment, we consider that the evolution of the state is governed by the following nominal linear time-varying (LTV) model
(1)
where the state transition matrix is assumed to be known a priori. For the considered source seeking problem to be well-defined, we need the state to be neither explosive nor vanishing to zero, which is ensured by the following assumption.
Assumption 1
For the LTV dynamics (1), there exists a pair of uniform lower and upper bounds such that, for ,
(2)
where represents the identity matrix and the state propagation matrix is defined as . (By convention, we let when .)
We should note that Assumption 1 not only helps confine the behavior of the environment states, but also implies the invertibility of the state transition matrices ’s, which aids the subsequent regret analysis of our algorithm. In fact, such an assumption is not unusual in the study of system control and estimation problems; see, e.g., [30, 31, 32, 33].
Now, to further impose the underlying disturbances on the environment model, let us consider the following two types of non-stochastic disturbances on top of the nominal dynamics:
(3a)
(3b)
Note that in both types, denotes the disturbed state. While the first type of disturbance can be interpreted as an external one, since in (3a) still evolves according to the nominal dynamics (1) and the disturbance only affects the state for one step, the second type can be viewed as an internal one, since the disturbance is intrinsically imposed on the dynamics and accumulates during the evolution of . In fact, both types of disturbances find a wide range of real-world applications. For instance, in the pollution monitoring scenario investigated in our simulations, the external disturbance could correspond to certain unrelated emitters which do not change the locations of the sources of interest but interfere with the perceptible environment states, while the internal one might result from environmental conditions, such as wind, which truly affect the diffusion of pollutants and thus change their positions. This example also suggests that the localization of sources should be treated differently in the two cases; more details are given in Section II-B. In addition, we note that the internal disturbance can also be used to capture, to some extent, the unmodeled dynamics of the unknown environment. Regardless of which type of disturbance is involved, however, only the disturbed state is measurable by the agents which are later employed to operate in the environment.
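To make the distinction between the two disturbance types concrete, the following sketch simulates both on a toy linear model. All names (`A`, `external`, `internal`) and the identity transition matrices are illustrative assumptions, since the paper's notation is not reproduced here; the sketch only captures that a type-I disturbance perturbs the measurable state for a single step, while a type-II disturbance enters the dynamics and accumulates.

```python
import numpy as np

n, T = 4, 50                               # grid size and horizon (illustrative)
A = [np.eye(n) for _ in range(T)]          # nominal transition matrices (assumed)

def external(x0, d):
    """Type I: the nominal state evolves undisturbed; d[t] only perturbs
    what is measurable at step t and does not propagate."""
    xs, x = [], x0
    for t in range(T):
        xs.append(x + d[t])                # disturbed, measurable state
        x = A[t] @ x                       # nominal evolution, d[t] discarded
    return xs

def internal(x0, d):
    """Type II: the disturbance enters the dynamics and accumulates."""
    xs, x = [], x0
    for t in range(T):
        xs.append(x)
        x = A[t] @ x + d[t]                # d[t] persists in all later states
    return xs
```

With a single impulse disturbance at the first step, the external state recovers immediately while the internal state carries the impulse forever, mirroring the one-step versus accumulated effect described above.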
As remarked earlier, the disturbances of both types are supposed to be non-stochastic, i.e., no statistical property of any form is assumed regarding . Instead, to characterize the long-term effect of both disturbances, we impose the following assumption.
Assumption 2
There exists a positive sequence such that, for ,
(4)
Remark 1
The sequence in Assumption 2 is not necessarily required to be bounded by a constant in our work. In fact, we consider the problem under the condition that increases at a sub-linear rate, and aim to provide a performance guarantee for our algorithm in terms of its dependence on . Such sub-linear growth often implies that either the total number of occurrences of the disturbance increases sub-linearly, or the effect of the disturbance vanishes to zero over the time-steps . While the former is often referred to as an abruptly-changing disturbance, the latter is regarded as a slowly-varying one. In addition, in the context of learning theory in adversarial/non-stationary settings, such a sequence is also viewed as the attack budget of an adversary; see, e.g., [34, 28].
II-B Multi-Agent Source Seeking
To locate the potential sources, which usually correspond to the extreme values of the unknown environment state, we deploy a network of agents and expect each of them to seek its best positions at each time by solving the following maximization problem,
(5)
Notice that the summation in the objective function is taken over the union of positions ’s; therefore, all agents will naturally tend to occupy as many distinct positions as possible in order to maximize . In addition, it is now clear why Assumption 1 is needed: the maximization in (5) is otherwise not well-defined if the environment state explodes or vanishes to zero. Further, we should also note that an inherent difference arises in the state counted in the objective function under the two types of disturbances. More precisely, for the first type of disturbance, i.e., the external one, the positions of sources should be reflected by the undisturbed , though only the information of the disturbed is measurable by the agents. On the contrary, for the second type, i.e., the internal disturbance, the disturbed should be taken into account in (5), since it evolves with the environment dynamics and changes the positions of the sources. On this account, we emphasize that while the maximization problem (5) is precisely the one the agents would like to solve under the external disturbance, for the internal one the objective function should be amended as
(6)
Given the above difference in the objective functions, the main challenges in solving the maximization problems are also distinguishable in principle. While the former requires extracting the true information hidden in when only is accessible, the latter requires identifying and compensating for the unmodeled disturbance . Despite this difference, we develop in this paper a unified algorithmic framework for both cases, enabling the agents to track the dynamical sources in an online manner. We remark that this is also one of the main contributions of our work.
Another common technical issue, regardless of the type of disturbance involved, is the estimation of the environment. To this end, we leverage the following linear stochastic measurement model,
(7)
where is the -th agent’s measurement at time-step ; denotes the measurement matrix depending on the agent’s position ; and is the measurement noise, which is assumed to be independent and identically distributed (i.i.d.) Gaussian with zero mean and variance . We note that the measurement matrix is not specified in (7); in fact, it can be defined in various ways based on the agent’s position. Nevertheless, we assume that each has the following basic form,
(8)
where denotes the unit vector, i.e., the -th column of the identity matrix, and is the set of positions covered by the agent’s sensing area at time-step . It is natural to assume that the position where the agent is currently located falls within its sensing area, i.e., .
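As a concrete illustration of the basic form (8), the sketch below builds a measurement matrix whose rows are one-hot unit vectors indexed by the positions in the agent's sensing set. The function name and the sorted-index convention are assumptions for illustration only.

```python
import numpy as np

def measurement_matrix(sensing_set, n):
    """Stack a unit (one-hot) row for every grid index p covered by the
    agent's sensing area, in the spirit of (8). Applying this matrix to
    the state vector extracts exactly the sensed entries."""
    C = np.zeros((len(sensing_set), n))
    for row, p in enumerate(sorted(sensing_set)):
        C[row, p] = 1.0
    return C
```

For example, an agent whose sensing area covers grid cells {2, 3, 4} on a 6-cell grid obtains measurements of exactly those three state entries (plus noise).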
III Development of the Online Algorithm
In this section, we develop our online source seeking algorithm, which relies on two central ingredients: 1) a discounted Kalman filter, which is capable of providing an estimate of the unknown environment while dealing with the two types of non-stochastic disturbances in a unified framework; and 2) a D-UCB approach, which helps determine the agents’ seeking positions sequentially in a computation-efficient manner.
III-A Estimation of the Environment States with Disturbances
According to the measurement model (7) introduced in the previous section, let us first express it in a compact form which accounts for all agents in the network. For this purpose, we stack all the measurements ’s and the noises ’s as the concatenated vectors and with , e.g., . Likewise, we define the concatenated measurement matrix by stacking all local ’s. Consequently, the measurement model in compact form can be written as
(9)
Note that, in the notation , we have absorbed for simplicity the dependency on the agents’ positions ’s into the index . In addition, by our assumption on the measurement noise, the concatenated noise is also i.i.d. Gaussian with zero mean and variance
(10)
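The stacking described above can be sketched as follows. For simplicity, a single common noise variance `sigma2` is assumed for all agents, which makes the concatenated covariance (10) a scaled identity; the function and variable names are illustrative assumptions.

```python
import numpy as np

def stack_measurements(C_list, y_list, sigma2):
    """Concatenate per-agent measurement matrices and measurements into a
    compact model in the spirit of (9). Since the agents' noises are
    mutually independent, the stacked noise covariance, cf. (10), is
    (block) diagonal; with a common variance it is a scaled identity."""
    C = np.vstack(C_list)            # concatenated measurement matrix
    y = np.concatenate(y_list)       # concatenated measurement vector
    Sigma = sigma2 * np.eye(C.shape[0])
    return C, y, Sigma
```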
Equipped with the agents’ measurement model in its compact form (9), we are now ready to present the technique of discounted Kalman filtering. As in the standard Kalman filter, we use a mean and covariance to recursively generate estimates of the unknown environment. A primary difference, however, is that two positive sequences of weights and are imposed in the filtering process with the aim of mitigating the effect of the disturbances present in the environment. Keeping this in mind, the discounted Kalman filter performs the following recursions,
(11a)
(11b)
(11c)
(11d)
(11e)
Notice that and here denote intermediate results of the recursions; is an auxiliary matrix initialized by ; and the variables and , which can be readily acquired by consensus schemes, e.g., [35], incorporate the latest measurements into the update of the estimates. Next, to better show how the imposed weights help deal with the non-stochastic disturbances, we present in the subsequent lemma another expression of the discounted Kalman filter (11).
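Since the exact recursions (11) are specific to the paper's notation, the following is only a generic discounted Kalman-style sketch of the underlying idea: a discount factor inflates the prior covariance at every prediction step, geometrically down-weighting old measurements, while a per-step weight scales the trust placed in the current measurement. All parameter names here are assumptions, not the paper's symbols.

```python
import numpy as np

def discounted_kf_step(xhat, P, A, C, y, sigma2, beta=0.95, w=1.0):
    """One step of a discounted Kalman-style recursion (illustrative sketch,
    not the exact recursion (11)). Dividing the predicted covariance by
    beta < 1 discounts the influence of past data; w rescales the effective
    measurement noise, weighting the current measurement."""
    # Predict through the nominal dynamics, with discounting.
    xpred = A @ xhat
    Ppred = (A @ P @ A.T) / beta
    # Measurement update with effective noise variance sigma2 / w.
    R = (sigma2 / w) * np.eye(C.shape[0])
    S = C @ Ppred @ C.T + R
    K = Ppred @ C.T @ np.linalg.solve(S, np.eye(S.shape[0]))
    xnew = xpred + K @ (y - C @ xpred)
    Pnew = (np.eye(P.shape[0]) - K @ C) @ Ppred
    return xnew, Pnew
```

Iterating this step on noisy full-state observations of a constant state drives the estimate toward the truth while the covariance settles at a bounded positive value, reflecting the permanent forgetting induced by the discount.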
Lemma 1
Suppose that the state estimates and are generated by (11) with the initialization and ; then at each iteration , it is equivalent to have
(12a)
(12b)
where the matrix is defined as
(13)
Proof:
See Appendix I. ∎
Remark 2
According to the form (12) of the discounted Kalman filter, it can be observed that the sequence serves to adjust the weights on the measurements obtained during the process. Considering that the cumulative quantity of the disturbance is upper bounded by the sequence (see Assumption 2), this implies that, in general, the influence of the disturbances vanishes over time if increases sub-linearly. In this case, a significant disturbance which took place at an early stage can be expected to be gradually mitigated by the discounted Kalman filtering. Further, unlike the weight , which is only applied to the measurements locally, the other sequence of weights is applied globally to adjust the covariance , so that it can compensate for the effect of internal disturbances more directly.
III-B Multi-Agent Online Source Seeking via D-UCB
Based on and , we now introduce the key notion of D-UCB, which is defined as follows,
(14)
Note that the operator maps the square roots of the diagonal elements of the matrix to a vector, and the sequence , depending on a predefined confidence level , will be specified in the next section. With the aid of the defined D-UCB, the agents’ seeking positions can be updated in an online manner by solving the following maximization problem:
(15)
Here, stacks the decided seeking positions ’s for all agents, and likewise, represents the component of the vector which corresponds to the position . The complete multi-agent online source seeking scheme under the disturbed environment is outlined in Algorithm 1.
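The computational advantage of the polytope confidence set can be sketched as follows: because (14) yields one scalar upper bound per grid point, maximizing the sum over the union of the agents' positions in (15) reduces to picking the top distinct UCB values, avoiding a combinatorial search over joint assignments. This is a simplified sketch assuming agents may occupy any distinct grid points; motion constraints and the exact form of (14) are abstracted away, and all names are illustrative.

```python
import numpy as np

def ducb_positions(mu, P, beta_t, n_agents):
    """Select seeking positions by maximizing the sum of element-wise
    D-UCB values over distinct grid points, in the spirit of (15).
    Complexity is O(K log K) in the number of grid points K, linear in
    the number of agents, rather than combinatorial."""
    ucb = mu + beta_t * np.sqrt(np.diag(P))      # element-wise UCB, cf. (14)
    return np.argsort(ucb)[::-1][:n_agents]     # top-n_agents distinct points
```

Because the bound is element-wise, each agent's choice decouples from the others once the top indices are known, which is the decomposition property the polytope set provides.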
IV Regret Analysis
In this section, we provide a theoretical performance guarantee for our algorithm via the notion of regret. More specifically, we perform the regret analysis for both cases, subject to the two types of non-stochastic disturbances, respectively. By showing sub-linear cumulative regrets for both cases, it is ensured that the agents are capable of tracking the dynamical sources in an unknown and disturbed environment.
IV-A On the Disturbance of Type I
As noted in the previous discussion, for the first type of disturbance, the objective function in (5) takes into account the undisturbed state . Therefore, we introduce the notion of regret for the first case as follows,
(16)
where denotes the optimal solution to problem (5) and corresponds to the decision generated by our source seeking algorithm. Here, we aim to show that the cumulative regret, i.e., , increases sub-linearly with respect to the number of time-steps ; namely, the regret converges to zero on average. To this end, let us first show the following result, which formalizes that the D-UCB indeed provides a valid upper bound for the unknown state .
Proposition 1
Under Assumptions 1 and 2, let and be generated by the discounted Kalman filter (11) with and . Suppose that the initialization satisfies and likewise the noise variance satisfies ; then it holds that,
(17)
where is defined element-wise, the probability is taken over the random noises, and the sequence in the D-UCB is chosen to satisfy
(18)
where is defined in Assumption 2, and .
Proof:
See Appendix II-A. ∎
It can be concluded from Proposition 1 that the D-UCB is guaranteed to be an upper bound for with probability at least . In fact, considering that the disturbance of type I does not really evolve with the environment dynamics, the weight is set to zero during the whole process. Further, to extract the true information, we set the weight adaptively according to the current estimate of the environment. Since the estimate covariance generally decreases as more measurements are absorbed during the filtering process, the sequence will increase with an upper bound of . With the help of Proposition 1, we are now ready to present the regret analysis for our algorithm.
Theorem 1
Proof:
See Appendix II-B. ∎
IV-B On the Disturbance of Type II
Similar to the previous analysis, to provide the performance guarantee for our algorithm in this part, we also rely on the notion of regret. However, considering the different features of the disturbance of type II (see Section II-B), the definition of regret should be amended accordingly,
(20)
where the objective function is defined in (6). Likewise, we aim to show a sub-linear cumulative regret, i.e., that increases sub-linearly with respect to .
Since the disturbance of type II enters the environment dynamics, the current state inherently accumulates all disturbances prior to time . As a result, Assumptions 1 and 2 do not necessarily imply that is upper bounded if the sequence is allowed to grow indefinitely. Thus, to ensure the well-definedness of our problem, we need an additional assumption.
Assumption 3
There exists a uniform upper bound such that .
Now, we follow a path similar to the previous analysis to show the sub-linear growth of the regret . Note that, due to the long-term effect of the second type of disturbance on the state , one cannot expect the D-UCB to serve as an upper bound for . To deal with this issue, we construct an auxiliary variable , i.e.,
(21)
which helps build a connection between and the state as shown in the following propositions.
Proposition 2
Proof:
See Appendix II-C. ∎
Proposition 2 proves that the D-UCB provides a valid upper bound for the constructed variable if is chosen appropriately. To further connect with the true state , it can be shown that the discrepancy between and is bounded by a term related to the disturbances . For the sake of presentation, such a result is deferred, and we directly state the sub-linear regret of our algorithm in the following theorem. The bound on appears as an intermediate step in the proof of the theorem in the Appendix.
Theorem 2
Proof:
See Appendix II-D. ∎
Remark 3
Note that, in Proposition 2 and Theorem 2, the two weights and are specified as where . This means that they increase exponentially with respect to the time-step . Therefore, numerical overflow may arise in the discounted Kalman filtering (11) when is large. To deal with this issue, we note that the discounted Kalman filter, when and are chosen as above, can be implemented equivalently by the following recursions,
(25a)
(25b)
(25c)
(25d)
where is defined the same as before. It should also be noted that, in (25), the covariance is slightly different from the one in (11), in the sense that . This needs to be taken into account in Algorithm 1 when generating the D-UCB by using .
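The rescaling idea behind (25) can be illustrated on a scalar least-squares toy problem: weighting the measurement at step t by an exponentially growing factor γ^(-t) overflows for long horizons, but dividing the running quantities by the common growth factor at every step yields an equivalent forgetting-factor recursion whose iterates stay bounded. This is a hedged sketch of the principle under assumed names, not the paper's exact recursion (25).

```python
import numpy as np

def rls_growing_weights(ys, cs, gamma):
    """Weighted least squares with weights gamma**(-t): the weights grow
    exponentially, so this form is prone to overflow for long horizons."""
    V, b = 1e-6, 0.0                   # small regularizer for invertibility
    for t, (c, y) in enumerate(zip(cs, ys)):
        w = gamma ** (-t)
        V += w * c * c
        b += w * c * y
    return b / V

def rls_forgetting(ys, cs, gamma):
    """Equivalent forgetting-factor recursion: rescaling V and b by gamma
    at every step keeps all quantities bounded while preserving the same
    ratio b / V (up to the negligible regularizer)."""
    V, b = 1e-6, 0.0
    for c, y in zip(cs, ys):
        V = gamma * V + c * c
        b = gamma * b + c * y
    return b / V
```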
IV-C Further Discussions
Before the end of this section, a few more remarks should be added on the obtained results of the above regret analysis.
First, to tackle the two types of non-stochastic disturbances, it can be seen from the propositions that the sequences of weights and are determined differently. More specifically, for the external disturbance, which only affects the measured state and does not evolve with the nominal dynamics, the sequence is chosen to increase at the same rate as . This is because the disturbance in this case only comes into play when the state is measured, and thus the weight is adjusted according to the measurement information in and the current progress on . Since the covariance , which essentially quantifies the uncertainty of our estimate, decreases as more measurements are absorbed, the weight increases during the process, meaning that measurements received later are trusted more. For the internal disturbance, the sequence is also chosen to increase, but at a fixed exponential rate of . Another primary difference is that, while is set to zero in the former case, here we let increase at the same exponential rate of . The reason for this difference can be explained as follows. Since the internal disturbance accumulates during the whole process regardless of the measurements, an additional weight needs to be incorporated to deal with it globally, and therefore the increasing is introduced to decrease the covariance accordingly. Note that this does not mean the uncertainty of our estimate is decreased artificially, since in the D-UCB the sequence is also increased by an extra term related to , adjusting our construction of the confidence bound.
Second, it can be concluded from the two theorems that once the disturbance bound increases sub-linearly, the regrets of our algorithm in both cases also grow at a sub-linear rate, meaning that the agents are able to track the moving sources dynamically in the disturbed environment. More precisely, while the regret for the first case increases at the rate of , the rate is for the second case. Note that both rates match the state-of-the-art results in the study of bandit algorithms in non-stationary and adversarial settings. Therefore, we can conclude that our developments of the discounted Kalman filter and D-UCB do not degrade the convergence performance of the algorithm. However, in terms of the scale of the problem, i.e., the size of the searching environment, the complexity of our algorithm grows at the rate of , compared to in the literature. This is mainly because the ellipsoid confidence sets in classical UCB-based methods are replaced by the polytope one in our algorithm. Despite this fact, we argue that such an increase in complexity is reasonable, since far more computation is saved by avoiding the combinatorial problems at each step.
V Simulation
In this section, numerical examples are provided to validate the effectiveness of our multi-agent source seeking algorithm. We consider a pollution monitoring application in which three mobile robots are deployed in a pollution diffusion field with the aim of localizing as many leaking sources as possible. The dynamics of the pollution field is governed by a convection-diffusion equation. More details of the simulation settings can be found in [9], including the linearization of the partial differential equation, the robots’ measurement models and communication topology, the specification of the pollution field, etc. A key difference here, however, is that the non-stochastic disturbances are assumed to be present after the linearization of the dynamics. More concretely, the linearized model of the pollution field is represented by
(26)
where denotes the discretized states of the field, is the state transition matrix, and represents the non-stochastic disturbance.
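As a rough sketch of how such a linear model can arise (the 1-D stencil and all parameter values below are illustrative assumptions, not the settings of [9]), an explicit finite-difference discretization of a convection-diffusion equation yields a linear transition matrix for the update x_{t+1} = A x_t, with the disturbance added separately.

```python
import numpy as np

def convection_diffusion_A(n, D=0.1, v=0.05, dt=0.1, dx=1.0):
    """Explicit finite-difference transition matrix for a 1-D
    convection-diffusion equation du/dt = D*u_xx - v*u_x on an n-cell
    lattice, giving the linear update x_{t+1} = A @ x_t."""
    r = D * dt / dx**2          # diffusion number
    c = v * dt / (2 * dx)       # convection term (central difference)
    A = np.eye(n) * (1 - 2 * r)
    for i in range(n - 1):
        A[i, i + 1] = r - c     # inflow from the right neighbor
        A[i + 1, i] = r + c     # inflow from the left neighbor
    return A
```

For the chosen step sizes the scheme is stable, and interior rows sum to one, reflecting mass conservation away from the boundary.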
In particular, we consider in this simulation that the pollution field is modeled by a lattice with . Each mobile robot is capable of sensing a circular area of radius during the searching process. The sensing noise is assumed to be i.i.d. Gaussian with zero mean and covariance . In terms of the disturbance, we consider two different scenarios: i) a slowly-varying disturbance which occurs externally; and ii) an abruptly-changing disturbance which occurs internally. For the slowly-varying disturbance of type I, it is assumed that when and when , where is randomly generated. For the abruptly-changing disturbance of type II, we consider that two more leaking sources are randomly injected into the field during the periods of and . That is, for and for , where are randomly generated, and otherwise.
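Because the exact activation windows and magnitudes are specified in the paper's notation (omitted above), the following is only a schematic way to generate the two disturbance types; every window, scale, and seed below is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 100, 25                      # horizon and lattice size (illustrative)

# Type I: slowly-varying external disturbance, active on a sub-interval.
d_ext = np.zeros((T, n))
base = rng.normal(scale=0.1, size=n)
for t in range(20, 80):             # active window (placeholder bounds)
    d_ext[t] = base * np.sin(0.05 * t)   # slow sinusoidal modulation

# Type II: abrupt internal disturbance, two sources injected mid-run.
d_int = np.zeros((T, n))
src = rng.choice(n, size=2, replace=False)
d_int[40:60, src[0]] = 1.0          # first injected source
d_int[50:70, src[1]] = 1.0          # second injected source
```

The external disturbance perturbs every affected cell smoothly over time, while the internal one switches on abruptly at randomly chosen cells, mirroring the two scenarios described above.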


To illustrate the performance of our algorithm in seeking the dynamical pollution sources under the two types of disturbances, we show the cumulative regrets produced by Algorithm 1, respectively. The obtained numerical results are shown in Fig. 1, in which each curve corresponds to independent trials. It can be observed from the figures that our algorithm produces a smaller cumulative regret than the standard D-UCB algorithm. We can thus conclude that, while the standard D-UCB algorithm fails to localize the sources when disturbances are present in the field, our algorithm manages to complete the task in both the external- and internal-disturbance scenarios. More specifically, with respect to the internal abruptly-changing disturbance, we also compare the performance of our algorithm under different choices of the parameter . Note that by setting , our algorithm reduces to the standard D-UCB algorithm. It can be observed that, after the disturbances are injected, our algorithm soon adapts to the disturbed pollution field and then tracks the newly-added sources accordingly; the standard D-UCB algorithm fails to do so. In addition, it can also be seen from Fig. 1(b) that a smaller results in a shorter adaptation period. This is mainly because the agents tend to perform more exploration when a small is chosen. As a consequence of the classical exploration-exploitation dilemma, however, a disadvantage of a smaller is that the cumulative regret grows more rapidly after the sources are localized.
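The cumulative-regret curves in Fig. 1 are simply running sums of per-step optimality gaps; a minimal sketch of this bookkeeping, independent of any specific algorithm, is:

```python
import numpy as np

def cumulative_regret(optimal_rewards, obtained_rewards):
    """Cumulative regret curve: running sum of the per-step gap between
    the best achievable reward and the reward actually collected."""
    gap = np.asarray(optimal_rewards) - np.asarray(obtained_rewards)
    return np.cumsum(gap)
```

A sub-linear regret then corresponds to a curve that flattens out over time, which is what distinguishes our algorithm's curves from the standard D-UCB ones in Fig. 1.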
VI Conclusion
In this paper, a learning-based algorithm is developed to solve the problem of multi-agent online source seeking in an environment disturbed by non-stochastic perturbations. Building on the technique of discounted Kalman filtering as well as the notion of D-UCB proposed in our previous work, our algorithm enables computation-efficient cooperation across the multi-agent network and is robust against the non-stochastic perturbations (also interpreted as adversarial disturbances in the context of multi-armed bandits). It is shown that a sub-linear cumulative regret is achieved by our algorithm, which is comparable to the state-of-the-art. Numerical results on a real-world pollution monitoring application are finally provided to support our theoretical findings.
Appendix I: proof of Lemma 1
Let us prove Lemma 1 by mathematical induction. First, it is straightforward to confirm that, given the initialization , , and , the recursions (11) and (12) produce identical and . Next, assuming that (12) generates the same results as (11) up to time-step , it suffices to prove the consistency at time-step .
In fact, based on the recursion of in (11), we have
(27)
where the second equality comes from our assumption of in the form of (12a) and the last equality is due to the definition of in (13). Similarly, based on the recursion of in (11), we have
(28)
where the first equality comes from (27), which has just been proved; the second and third equalities are due to (11); the second-to-last equality follows from the form of (12a); and the last one is due to our assumption of in the form of (12b).
Appendix II: proofs of Main Theorems
We note that the proofs in this section are mainly inspired by [25] and [29], which performed the regret analysis in the context of stochastic linear bandits under non-stationary and adversarial environments, respectively. The contributions of our proofs are: i) the integration of linear dynamics and Kalman filtering into the algorithmic framework; and ii) the adaptation of the new notion of D-UCB into the regret analysis.
To facilitate the subsequent proofs, let us start by introducing some useful vector norms. First, associated with the diagonal matrix of an arbitrary positive definite matrix , i.e., , we define the -based vector norm as
(29)
where . Further, let us define the -based norm with respect to the matrix as
(30)
Note that the above norm is well-defined since the positive definiteness of ensures that . Similarly, we define the -based norm as
(31)
With the vector norms introduced above, it can be immediately verified that and are dual norms, where denotes the inverse of the matrix . In addition, the following lemma provides connections among all the defined norms.
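The duality claim can be checked numerically. The sketch below verifies the standard weighted Cauchy–Schwarz inequality |x^T y| ≤ ||x||_M ||y||_{M^{-1}} for a random positive definite M; the specific matrices here are illustrative, not the paper's covariance matrices.

```python
import numpy as np

def weighted_norm(x, M):
    """Vector norm induced by a positive definite matrix M."""
    return float(np.sqrt(x @ M @ x))

# Numerical check that the M-norm and the inverse-M-norm are dual:
# |x^T y| <= ||x||_M * ||y||_{M^{-1}} for any positive definite M.
rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
M = B @ B.T + 4 * np.eye(4)         # random positive definite matrix
x, y = rng.normal(size=4), rng.normal(size=4)
lhs = abs(x @ y)
rhs = weighted_norm(x, M) * weighted_norm(y, np.linalg.inv(M))
```

The inequality follows by applying Cauchy–Schwarz to the pair M^{1/2}x and M^{-1/2}y, which is exactly the duality used throughout the proofs.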
Lemma 2
For an arbitrary positive definite matrix , it holds that: 1) ; 2) ; and 3) .
Proof:
While inequalities 1) and 2) can be straightforwardly confirmed by the definitions and by the inequality of arithmetic and geometric means, respectively, part 3) is proved as follows.
(32)
Note that the first inequality is due to the positive definiteness of , i.e., . Hence, the proof is completed. ∎
VI-A Proof of Proposition 1
To prove the inequality (17) in Proposition 1, with the help of the vector norms defined above, it suffices to show
(33)
Note that the inequality in (33) is stronger than the one in (17), in the sense that the state is both upper and lower bounded. Though the lower bound is not reflected in the development of our algorithm, it facilitates the proof of the sub-linear regret. In addition, since the weight is specified as in this part, the generation of the state estimates changes accordingly, simplifying the matrix to
(34)
According to the nature of the first type of disturbance, the disturbed state can be expressed as and thus the measurement is . Then, by Lemma 1 and the definitions of and , the state estimate satisfies
(35)
Therefore, it holds for that
(36)
where follows from the Cauchy–Schwarz and triangle inequalities; is due to the recursion of in the form of (12a); and is based on Lemma 2-3).
Now, by Lemma 2-1), it follows that
(37)
where the last inequality follows from (36) with . Next, to prove the inequality (33), we upper bound the three terms on the right-hand side of (37) in the following three lemmas, respectively.
Lemma 3
Under the conditions in Proposition 1, there exists a constant such that,
(38)
Proof:
By the definition of the matrix , it is straightforward to see that , and therefore,
(39)
where the last inequality is due to the assumption . Thus, the proof is completed. ∎
Lemma 4
Proof:
Lemma 5
Under the conditions in Proposition 1, there exists a constant such that the following inequality holds with probability at least ,
(42)
Proof:
This proof is based on existing results on self-normalized martingales; see, e.g., [25]. For notational simplicity, let us define
(43)
Then, according to the result on self-normalized martingales, it holds with probability at least that,
(44)
where . Note that there is a slight difference between and ; we show that there exists a constant such that . In fact, it holds that
(45)
Note that the first inequality is due to and the assumption . Therefore, the previous statement can be immediately verified by letting , which implies that . Together with the inequality (44), it follows that
(46)
Moreover, based on the inequality of arithmetic and geometric means and the definition of , it holds that
(47)
where the trace of the matrix further satisfies
(48)
Note that is due to the assumption , where denotes the unit vector; follows from the special form of the measurement matrix , i.e., each row has only one element equal to one and all others equal to zero; and is based on Assumption 1. In addition, given that the initialization has , it follows that and . As a result, we have
(49)
Based on the inequality (44), the proof is completed. ∎
VI-B Proof of Theorem 1
To facilitate the following proof, let us first introduce a new mapping which translates the positional information into a -dimensional action vector , i.e.,
(51)
where each corresponds to the index of the position in the environment and denotes the unit vector. Now, by the definitions of and , it can be immediately verified that the vectors and must have elements equal to one and all others equal to zero. Further, we denote the set of all possibilities of these vectors by
(52)
For simplicity, we abbreviate the above and as and in the sequel. Based on the definition of as well as the introduced notations, the regret can be expressed as
(53)
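The one-hot action encoding just introduced can be sketched directly; this is a generic illustration with placeholder names, not the paper's exact mapping.

```python
import numpy as np

def positions_to_action(positions, n):
    """Map a list of distinct agent position indices in an n-cell
    environment to a single action vector: the sum of the corresponding
    unit vectors, i.e., a 0/1 vector with one entry per occupied cell."""
    a = np.zeros(n)
    for p in positions:
        a[p] = 1.0
    return a
```

The resulting vector has exactly as many ones as there are agents, matching the structure of the action set defined above.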
To proceed, we show the following lemma which provides an upper bound for the regret at each time-step .
Lemma 6
Proof:
The above Lemma 6 shows that the regret can be upper bounded by . To investigate the key term , we next present in the following lemma an intermediate result which can be used to bound .
Lemma 7
Under the conditions in Proposition 1, it holds,
(56)
Proof:
Recalling the recursion (34) of , the matrix can also be generated as follows,
(57)
For simplicity, let us further denote a new matrix by . Now, considering the determinants of ’s, it holds that
(58)
where denotes the -th eigenvalue of the matrix in , and is due to the inequality of arithmetic and geometric means. Based on the cyclic property of the matrix trace and the recursion of in (12a), it follows that
(59)
Therefore, (58) can be continued as
(60)
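The arithmetic-geometric-mean step used in this determinant bound can be sanity-checked numerically: for any positive definite matrix, the determinant (product of eigenvalues) is at most the n-th power of the average eigenvalue, i.e., det(M) ≤ (tr(M)/n)^n. The matrix below is a random illustrative example.

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(4, 4))
M = B @ B.T + np.eye(4)              # random positive definite matrix
n = M.shape[0]
det_M = np.linalg.det(M)
amgm_bound = (np.trace(M) / n) ** n  # AM-GM applied to the eigenvalues
```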
Next, in order to bound by using the above Lemma 7, we build the connection between and as follows.
Lemma 8
Proof:
Statement 1): Due to the specific forms of the covariance matrix and the measurement matrix , it can be confirmed that must be diagonal and can be expressed as
(65)
where denotes the -th agent’s sensing area at time ; see the definition in (8). Let us introduce a binary variable ; let if the position indexed by is in the sensing area , and otherwise. As a direct result, it holds that
(66)
where denotes the -th diagonal entry of the matrix . Now, letting be the index of agent ’s position, one has that , and therefore,
(67)
where the first equality is due to the definition of .
Statement 2): Based on the equality (66) and the fact that is a binary variable, it follows that
(68)
where the last inequality is due to the definition of the matrix norm , i.e., equals the largest eigenvalue of the matrix .
On the other hand, one can also have that
(69)
Therefore, the proof is completed. ∎
With the help of the above lemmas, we are in a position to prove the theorem. By the definition of the regret in (53), it is easy to see that the regret has a uniform upper bound, i.e., . Based on the above Lemma 6, we have
(70)
where we denote . According to the definition (18) of the sequence , it follows that . Therefore, the cumulative regret satisfies
(71)
Note that and represent indicator functions, and the last equality is due to the fact that . Now, let us investigate the two terms in (71) separately. For the first term, by Lemmas 7 and 8, it follows that
(72)
where is due to the inequality of arithmetic and geometric means and the fact that ; is based on Lemma 8-1); and is according to Lemma 7. For the second term, it holds that
(73)
where is due to the fact that by Lemma 8; is according to the choice of the weights, i.e., given that ; is based on Lemma 8-2); in , we let ; and is according to Lemma 7.
VI-C Proof of Proposition 2
This proof can be completed by following similar steps as those for Proposition 1, except for the main differences in the state dynamics (due to the different type of disturbance) and in the specification of the weights and .
Taking the dynamics (3b) into account and following the same steps as before, it can be shown (details omitted) that
(75)
where is defined as in (21). While the first term on the right-hand side can be bounded exactly as in the previous Lemma 3, the last two require specific attention to obtain the upper bounds.
First, we show in the following lemma that the second term can indeed be bounded by the weight .
Lemma 9
Under the conditions in Proposition 2, there exists a constant , such that
(76)
Proof:
Next, the third term can be handled by applying the self-normalized martingale result as in Lemma 5. Nevertheless, to accommodate the change of the matrix , we modify the definition of accordingly,
(78)
where is defined the same as before; see equation (43). As a result, we have that, with probability at least ,
(79)
Since the sequence is increasing in , it can be proved, following the same steps as before, that , and furthermore,
(80)
where .
VI-D Proof of Theorem 2
By applying the notions defined in the proof of Theorem 1, one has that
(82)
where is due to the definitions of and as well as the fact that ; follows from Hölder’s inequality; and comes from the inequality (81) proved in the proof of Proposition 2. It can be seen from the above result that, due to the involvement of all prior disturbances ’s in the state , the regret can no longer be bounded by the action term alone; the extra term, related to the discrepancy between and , must also be considered. Hence, we next provide an upper bound for with respect to the disturbances ’s.
Lemma 10
Proof:
Recalling the definition (21) of , it follows that
(84)
We next upper bound, in order, the three terms on the right-hand side of (84).
Term I: Since , it holds that
(85)
Note that, in (85), the two matrices and satisfy
(86)
and
(87)
As a result of in Assumption 3, we have
(88)
where .
Term II: Following the same approach as for the first term, it can be shown that
(89)
where .
In terms of appearing in the regret bound (82), we apply the same analysis as in the proof of Theorem 1 (see Lemma 7) and obtain the following result.
Lemma 11
Under the conditions in Proposition 2, it holds,
(93)
Proof:
This proof can be finished by following the same steps as in the proof of Lemma 7. However, two differences should be noted, which result from the distinct definition of .
First, under the conditions in this lemma, the recursion of follows
(94)
Despite the difference compared to (57), since
(95)
the (in)equalities in (58) remain valid, and so does the subsequent deduction. Finally, based on the definition of in (13), the final bound in (93) is obtained by
(96)
∎
With the help of the above lemmas, we are ready to prove the sub-linear regret stated in Theorem 2. Notice that a uniform upper bound still exists for the regret . Therefore, it follows from (82) that
(97)
where we let in the last inequality. Therefore, it holds that
(98)
where is due to the Cauchy–Schwarz inequality and Lemma 8-1); is by Lemmas 10 and 11; and follows from the specification of with .
Now, provided that (see the condition of Theorem 2) and letting , it can be confirmed that , and therefore,
(99)
Further, considering that , it holds that , and consequently,
(100)
According to the definitions of and , it holds that
(101)
As a result, one has
(102)
References
- [1] J. Poveda, M. Benosman, A. Teel, and R. Sanfelice. Robust coordinated hybrid source seeking with obstacle avoidance in multi-vehicle autonomous systems. IEEE Transactions on Automatic Control, 2021.
- [2] B. Angélico, L. Chamon, S. Paternain, A. Ribeiro, and G. Pappas. Source seeking in unknown environments with convex obstacles. In Proceedings of 2021 American Control Conference, pages 5055–5061. IEEE, 2021.
- [3] T. Li, B. Jayawardhana, A. Kamat, and A. Kottapalli. Source-seeking control of unicycle robots with 3-D printed flexible piezoresistive sensors. IEEE Transactions on Robotics, 2021.
- [4] W. Liu, X. Huo, G. Duan, and K. Ma. Semi-global stability analysis of source seeking with dynamic sensor reading and a class of nonlinear maps. International Journal of Control, pages 1–10, 2020.
- [5] E. Ramirez-Llanos and S. Martinez. Stochastic source seeking for mobile robots in obstacle environments via the SPSA method. IEEE Transactions on Automatic Control, 64(4):1732–1739, 2018.
- [6] S. Azuma, M. Sakar, and G. Pappas. Stochastic source seeking by mobile robots. IEEE Transactions on Automatic Control, 57(9):2308–2321, 2012.
- [7] J. Habibi, H. Mahboubi, and A. Aghdam. A gradient-based coverage optimization strategy for mobile sensor networks. IEEE Transactions on Control of Network Systems, 4(3):477–488, 2016.
- [8] E. Rolf, D. Fridovich-Keil, M. Simchowitz, B. Recht, and C. Tomlin. A successive-elimination approach to adaptive robotic source seeking. IEEE Transactions on Robotics, 37(1):34–47, 2020.
- [9] B. Du, K. Qian, H. Iqbal, C. Claudel, and D. Sun. Multi-robot dynamical source seeking in unknown environments. In Proceedings of 2021 IEEE International Conference on Robotics and Automation, pages 9036–9042. IEEE, 2021.
- [10] J. Feiling, S. Koga, M. Krstić, and T. Oliveira. Gradient extremum seeking for static maps with actuation dynamics governed by diffusion PDEs. Automatica, 95:197–206, 2018.
- [11] S. Dougherty and M. Guay. An extremum-seeking controller for distributed optimization over sensor networks. IEEE Transactions on Automatic Control, 62(2):928–933, 2016.
- [12] S. Li, R. Kong, and Y. Guo. Cooperative distributed source seeking by multiple robots: Algorithms and experiments. IEEE/ASME Transactions on Mechatronics, 19(6):1810–1820, 2014.
- [13] R. Fabbiano, C. Canudas de Wit, and F. Garin. Source localization by gradient estimation based on Poisson integral. Automatica, 50(6):1715–1724, 2014.
- [14] L. Briñón-Arranz, L. Schenato, and A. Seuret. Distributed source seeking via a circular formation of agents under communication constraints. IEEE Transactions on Control of Network Systems, 3(2):104–115, 2015.
- [15] R. Fabbiano, F. Garin, and C. Canudas de Wit. Distributed source seeking without global position information. IEEE Transactions on Control of Network Systems, 5(1):228–238, 2016.
- [16] N. Atanasov, J. Le Ny, N. Michael, and G. Pappas. Stochastic source seeking in complex environments. In Proceedings of 2012 IEEE International Conference on Robotics and Automation, pages 3013–3018. IEEE, 2012.
- [17] N. Atanasov, J. Le Ny, and G. Pappas. Distributed algorithms for stochastic source seeking with mobile robot networks. Journal of Dynamic Systems, Measurement, and Control, 137(3), 2015.
- [18] K. Zhou and J. Doyle. Essentials of Robust Control, volume 104. Prentice Hall, Upper Saddle River, NJ, 1998.
- [19] N. Agarwal, B. Bullins, E. Hazan, S. Kakade, and K. Singh. Online control with adversarial disturbances. In Proceedings of 2019 International Conference on Machine Learning, pages 111–119. PMLR, 2019.
- [20] D. Foster and M. Simchowitz. Logarithmic regret for adversarial online control. In Proceedings of 2020 International Conference on Machine Learning, pages 3211–3221. PMLR, 2020.
- [21] M. Simchowitz, K. Singh, and E. Hazan. Improper learning for non-stochastic control. In Proceedings of 2020 Conference on Learning Theory, pages 3320–3436. PMLR, 2020.
- [22] E. Hazan, S. Kakade, and K. Singh. The nonstochastic control problem. In Proceedings of the 31st International Conference on Algorithmic Learning Theory, pages 408–421. PMLR, 2020.
- [23] M. Simchowitz. Making non-stochastic control (almost) as easy as stochastic. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 18318–18329. PMLR, 2020.
- [24] W. Cheung, D. Simchi-Levi, and R. Zhu. Learning to optimize under non-stationarity. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pages 1079–1087. PMLR, 2019.
- [25] Y. Russac, C. Vernade, and O. Cappé. Weighted linear bandits for non-stationary environments. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 12040–12049, 2019.
- [26] P. Zhao, L. Zhang, Y. Jiang, and Z.-H. Zhou. A simple approach for non-stationary linear bandits. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, pages 746–755. PMLR, 2020.
- [27] Q. Ding, C.-J. Hsieh, and J. Sharpnack. Robust stochastic linear contextual bandits under adversarial attacks. In International Conference on Artificial Intelligence and Statistics, pages 7111–7123. PMLR, 2022.
- [28] I. Bogunovic, A. Losalka, A. Krause, and J. Scarlett. Stochastic linear bandits robust to adversarial attacks. In International Conference on Artificial Intelligence and Statistics, pages 991–999. PMLR, 2021.
- [29] J. He, D. Zhou, T. Zhang, and Q. Gu. Nearly optimal algorithms for linear contextual bandits with adversarial corruptions. In Advances in Neural Information Processing Systems, 2022.
- [30] W. Li, Z. Wang, D. Ho, and G. Wei. On boundedness of error covariances for Kalman consensus filtering problems. IEEE Transactions on Automatic Control, 65(6):2654–2661, 2019.
- [31] G. Battistelli and L. Chisci. Kullback–Leibler average, consensus on probability densities, and distributed state estimation with guaranteed stability. Automatica, 50(3):707–718, 2014.
- [32] G. Battistelli, L. Chisci, G. Mugnai, A. Farina, and A. Graziano. Consensus-based linear and nonlinear filtering. IEEE Transactions on Automatic Control, 60(5):1410–1415, 2014.
- [33] F. Cattivelli and A. Sayed. Diffusion strategies for distributed Kalman filtering and smoothing. IEEE Transactions on automatic control, 55(9):2069–2084, 2010.
- [34] L. Yang, M. H. Hajiesmaili, M. S. Talebi, J. C. S. Lui, and W. S. Wong. Adversarial bandits with corruptions: Regret lower bound and no-regret algorithm. In Advances in Neural Information Processing Systems, 2020.
- [35] R. Olfati-Saber and J. Shamma. Consensus filters for sensor networks and distributed sensor fusion. In Proceedings of the 44th IEEE Conference on Decision and Control, pages 6698–6703. IEEE, 2005.