Global Linear Convergence of Evolution Strategies on More than Smooth Strongly Convex Functions
Abstract
Evolution strategies (ESs) are zeroth-order stochastic black-box optimization heuristics invariant to monotonic transformations of the objective function. They evolve a multivariate normal distribution, from which candidate solutions are generated. Among different variants, CMA-ES is nowadays recognized as one of the state-of-the-art zeroth-order optimizers for difficult problems. Despite ample empirical evidence that ESs with a step-size control mechanism converge linearly, theoretical guarantees of linear convergence of ESs have been established only on limited classes of functions. In particular, theoretical results on convex functions are missing, where zeroth-order and also first-order optimization methods are often analyzed. In this paper, we establish almost sure linear convergence and a bound on the expected hitting time of an ES family, which includes the (1+1)-ES with (generalized) one-fifth success rule and an abstract covariance matrix adaptation with bounded condition number, on a broad class of functions. The analysis holds for monotonic transformations of positively homogeneous functions and of quadratically bounded functions; the latter class in particular includes monotonic transformations of strongly convex functions with Lipschitz continuous gradient. As far as the authors know, this is the first work that proves linear convergence of ESs on such a broad class of functions.
keywords:
Evolution strategies, Randomized derivative-free optimization, Black-box optimization, Linear convergence, Stochastic algorithms
AMS subject classifications: 65K05, 90C25, 90C26, 90C56, 90C59
1 Introduction
We consider the unconstrained minimization of an objective function without the use of derivatives, where an optimization solver accesses the objective only as a zeroth-order black-box oracle [48, 13, 49]. This setting is also referred to as derivative-free optimization [16]. Such problems can be advantageously approached by randomized algorithms, which can typically be more robust to noise, non-convexity, and irregularities of the objective function than deterministic algorithms. There has recently been a vivid interest in randomized derivative-free algorithms, giving rise to several theoretical studies of randomized direct search methods [26], trust region [10, 27] and model-based methods [14, 50]. We refer to [41] for an in-depth survey including the references of this paragraph and additional ones.
In this context, we investigate Evolution Strategies (ESs), which are among the oldest randomized derivative-free or zeroth-order black-box methods [17, 54, 51]. They are widely used in applications in different domains [57, 40, 45, 23, 5, 28, 58, 12, 21, 22]. Notably, a specific ES called covariance matrix adaptation ES (CMA-ES) [31] is among the best solvers for difficult black-box problems. It is affine-invariant and implements complex adaptation mechanisms for the sampling covariance matrix and step-size. It performs well on many ill-conditioned, non-convex, non-smooth, and non-separable problems [30, 53]. ESs are known to be difficult to analyze. Yet, given their importance in practice, it is essential to study them from a theoretical convergence perspective.
We focus on the arguably simplest and oldest adaptive ES, denoted (1+1)-ES. It samples a candidate solution from a Gaussian distribution whose step-size (standard deviation) is adapted. The candidate solution is accepted if and only if it is better than the current one (see pseudo-code Algorithm 1). The algorithm shares some similarities with simplified direct search, whose complexity analysis has been presented in [39]. Yet the (1+1)-ES is comparison-based and thus invariant to strictly increasing transformations of the objective function. Simplified direct search can be thought of as a variant of mesh adaptive direct search [7, 2]. In contrast to direct search, a sufficient decrease condition cannot be guaranteed, which causes some difficulties for the analysis. The (1+1)-ES is also rotationally invariant, while direct search creates candidate solutions along a predefined set of vectors. While the CMA-ES should always be preferred for practical applications over the (1+1)-ES variant analyzed here, this latter variant achieves faster linear convergence on well-conditioned problems when compared to algorithms with established complexity analysis (see [55, Table 6.3 and Figure 6.1] and [9, Figure B.4], where the random pursuit algorithm and the (1+1)-ES are compared, and also Appendix A).
Prior theoretical studies of the (1+1)-ES with success rule have established global linear convergence on differentiable positively homogeneous functions (composed with a strictly increasing function) with a single optimum [9, 8]. Those results establish almost sure linear convergence from all initial states. They however do not provide the dependency of the convergence rate with respect to the dimension. A more specific study on the sphere function establishes lower and upper bounds on the expected hitting time of an $\epsilon$-ball around the optimum $x^\star$ that scale as $d \log(\|X_0 - x^\star\|/\epsilon)$, where $X_0$ is the initial solution and $d$ is the problem dimension [4]. Prior to that, a variant of the (1+1)-ES with one-fifth success rule had been analyzed on the sphere and certain convex quadratic functions, establishing hitting time bounds that hold with overwhelming probability and scale with the condition number (the ratio between the largest and smallest eigenvalues) of the Hessian [34, 36, 37, 35]. Recently, the class of functions on which convergence of the (1+1)-ES is proven has been extended to continuously differentiable functions. This analysis does not address the question of linear convergence, focusing only on convergence as such, which is possibly sublinear [24].
Our main contribution is as follows. For a generalized version of the (1+1)-ES with one-fifth success rule, we prove bounds on the expected hitting time akin to linear convergence, i.e., hitting an $\epsilon$-ball in $O(\log(1/\epsilon))$ iterations, on a quite general class of functions. This class includes all composites of Lipschitz-smooth strongly convex functions with a strictly increasing transformation. This latter transformation allows us to include some non-continuous functions, and even functions with non-smooth level sets. We additionally deduce linear convergence with probability one. Our analysis relies on finding an appropriate Lyapunov function with lower- and upper-bounded expected drift. It builds on classical fundamental ideas presented by Hajek [29] and widely used to analyze stochastic hill-climbing algorithms on discrete search spaces [43].
Notation
Throughout the paper, we use the following notation. The set of natural numbers is denoted $\mathbb{N}$. Open, closed, and left-open intervals on $\mathbb{R}$ are denoted by $(a, b)$, $[a, b]$, and $(a, b]$, respectively. The set of strictly positive real numbers is denoted by $\mathbb{R}_{+}$. The Euclidean norm on $\mathbb{R}^d$ is denoted by $\|\cdot\|$. Open and closed balls with center $x$ and radius $r$ are denoted as $B(x, r)$ and $\bar{B}(x, r)$, respectively. The Lebesgue measures on $\mathbb{R}^d$ and $\mathbb{R}$ are both denoted by the same symbol $\mu$. A multivariate normal distribution with mean $m$ and covariance matrix $\Sigma$ is denoted by $\mathcal{N}(m, \Sigma)$; its probability measure and its induced probability density with respect to the Lebesgue measure are denoted by $\Pr_{\mathcal{N}(m,\Sigma)}$ and $p_{\mathcal{N}(m,\Sigma)}$. The indicator function of a set or condition $E$ is denoted by $\mathbb{1}\{E\}$. We use the Bachmann-Landau notations $O$, $\Omega$, and $\Theta$: $f = O(g)$ means $f \leq c\,g$ for some $c > 0$, $f = \Omega(g)$ means $f \geq c\,g$ for some $c > 0$, and $f = \Theta(g)$ means both $f = O(g)$ and $f = \Omega(g)$.
2 Algorithm, Definitions and Objective Function Assumptions
2.1 Algorithm: (1+1)-ES with Success-based Step-size Control
We analyze a generalized version of the (1+1)-ES with one-fifth success rule presented in Algorithm 1, which implements one of the oldest approaches to adapt the step-size in randomized optimization methods [51, 17, 54]. The specific implementation was proposed in [38]. At each iteration, a candidate solution is sampled. It is centered at the current incumbent and follows a multivariate normal distribution whose covariance matrix is the squared step-size times a positive-definite matrix (equal to the identity matrix in the absence of covariance matrix adaptation). The candidate solution is accepted, that is, it becomes the new incumbent, if and only if it is at least as good as the current incumbent. In this case, we say that the candidate solution is successful. The step-size is adapted so as to maintain the probability of success at approximately the target success probability. To do so, the step-size is multiplied by an increase factor in case of success (which is an indication that the step-size is likely too small) and by a decrease factor otherwise. The covariance matrix of the sampling distribution of candidate solutions is adapted within the set of positive-definite symmetric matrices with fixed determinant and upper-bounded condition number. We do not assume any specific update mechanism for the covariance matrix, but we assume that its update is invariant to any strictly increasing transformation of the objective function. We call such an update comparison-based (see Lines 7 and 11 of Algorithm 1). Then, our algorithm behaves in exactly the same way on $f$ and on $g \circ f$ for any strictly increasing function $g$. This defines a class of comparison-based randomized algorithms, which we denote as the (1+1) class. For the case where the covariance matrix is kept equal to the identity, the algorithm is simply denoted as the (1+1)-ES.
Note that the increase and decrease factors are not meant to be tuned depending on the function properties. How to choose such constants is well known and is related to the so-called evolution window [52]. In practice, a target success probability around 1/5 is the most commonly used setting, which leads to the classical one-fifth success rule. It has been shown to be close to optimal, giving a nearly optimal (linear) convergence rate on the sphere function [51, 17]. Hereunder we write $\theta = (X, \sigma, C)$ for the state of the algorithm, and the state-space is denoted by $\Theta$.
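To make the working principle concrete, the following is a minimal, self-contained sketch of Algorithm 1 restricted to the identity covariance matrix. The function name, the specific increase factor, and the default target success probability are our illustrative choices, not the constants used in the analysis.

```python
import numpy as np

def one_plus_one_es(f, x0, sigma0, p_target=0.2, n_iter=10_000, f_target=1e-10):
    """Minimal (1+1)-ES with success-based step-size control and C = I.

    The increase/decrease factors are coupled so that the log step-size
    is stationary when the empirical success rate equals p_target
    (generalized one-fifth success rule)."""
    alpha_up = np.exp(0.5)                                  # illustrative choice
    alpha_down = alpha_up ** (-p_target / (1.0 - p_target))
    x = np.asarray(x0, dtype=float)
    sigma, fx = float(sigma0), float(f(x))
    for _ in range(n_iter):
        y = x + sigma * np.random.randn(x.size)   # sample a candidate solution
        fy = float(f(y))
        if fy <= fx:                  # success: accept and increase step-size
            x, fx, sigma = y, fy, sigma * alpha_up
        else:                         # failure: keep x, decrease step-size
            sigma *= alpha_down
        if fx <= f_target:
            break
    return x, fx, sigma

# Linear convergence on the sphere function:
x, fx, sigma = one_plus_one_es(lambda z: np.sum(z**2), np.ones(10), 1.0)
```

The coupling `alpha_down = alpha_up ** (-p_target / (1 - p_target))` makes the expected change of the log step-size zero exactly when the success rate equals the target, so the step-size drifts up when success is too frequent and down otherwise.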
Figure 1 shows typical runs of the (1+1)-ES and of a version of the (1+1) class proposed in [6], known as the (1+1)-CMA-ES, on ellipsoidal functions with different condition numbers of the Hessian. It is empirically observed that the covariance matrix in the (1+1)-CMA-ES approaches the inverse Hessian of the objective function up to a scalar factor if the objective function is convex quadratic. The runtime of the (1+1)-ES scales linearly with the condition number (notice the logarithmic scale of the horizontal axis), while the runtime of the (1+1)-CMA-ES suffers only an additive penalty, roughly proportional to the logarithm of the condition number. Once the inverse Hessian is well approximated by the covariance matrix (up to a scalar factor), the algorithm approaches the global optimum geometrically at the same rate for different condition numbers.
In our analysis, we do not assume any specific covariance matrix update mechanism; hence the algorithm does not necessarily behave as shown in Figure 1. Our analysis is therefore a worst-case analysis (for the upper bound on the runtime) and a best-case analysis (for the lower bound on the runtime) over the algorithms in the (1+1) class.
Figure 1. Right: three runs of the (1+1)-ES on a spherical function with three different initial step-sizes (in blue, red, and green, respectively). Plotted are the distance to the optimum (dotted lines), the step-size (dashed lines), and the potential function defined in (22) (solid lines).
2.2 Preliminary Definitions
2.2.1 Spatial Suboptimality Function
The algorithms studied in this paper are comparison-based and thus invariant to strictly increasing transformations of the objective function $f$. If the convergence of the algorithms is measured in terms of $f$, say by investigating the convergence or hitting time of the sequence $(f(X_t))$, this will not reflect the invariance to monotonic transformations of $f$, because the first iteration such that $f(X_t) \leq \epsilon$ is not in general equal to the first iteration such that $g(f(X_t)) \leq \epsilon$ for a strictly increasing $g$. For this reason, we introduce a quality measure called the spatial suboptimality function [24]. It is the $d$-th root of the volume of the sublevel set where the function value is better than or equal to $f(x)$:
Definition 2.1 (Spatial Suboptimality Function).
Let $f: \mathbb{R}^d \to \mathbb{R}$ be a measurable function with respect to the Borel $\sigma$-algebra of $\mathbb{R}^d$ (simply referred to as a measurable function in the sequel). Then the spatial suboptimality function $f_\mu$ is defined as
$$f_\mu(x) \;=\; \mu\big(\{y \in \mathbb{R}^d : f(y) \leq f(x)\}\big)^{1/d} . \tag{1}$$
We remark that the suboptimality function is non-negative everywhere. For any measurable $f$ and any strictly increasing function $g$, $f$ and its composite $g \circ f$ have the same spatial suboptimality function, so that the hitting time of $f_\mu$ smaller than a threshold is the same for $f$ and $g \circ f$. Moreover, there exists a strictly increasing function $g$ such that $f_\mu = g \circ f$ holds $\mu$-almost everywhere [24, Lemma 1].
We will investigate the expected first hitting time of the spatial suboptimality $f_\mu(X_t)$ to values at most $\epsilon$. To relate this to the distance to the optimum, consider first a strictly convex quadratic function with Hessian $H$ and minimal solution $x^\star$. The suboptimality $f_\mu(x)$ is then equivalent, up to a factor involving the $d$-th root of the volume of the $d$-dimensional unit ball and the determinant and condition number of $H$, to the distance $\|x - x^\star\|$ [3]. This implies that the first hitting time of $f_\mu(X_t)$ to $\epsilon$ translates into the first hitting time of $\|X_t - x^\star\|$ to a constant times $\epsilon$. More generally, we will formalize an assumption on $f$ later on (Assumption A1), which allows us to bound $f_\mu(x)$ by a constant times $\|x - x^\star\|$ from above and below (see (6)), implying that the first hitting time of $\|X_t - x^\star\|$ to a constant times $\epsilon$ is bounded by that of $f_\mu(X_t)$ to $\epsilon$, times a constant.
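As an illustration, the suboptimality function can be estimated by simple Monte Carlo integration. This is a sketch under our own assumption (for the example only) that the sublevel set fits into a known box; it also illustrates numerically the invariance of $f_\mu$ under strictly increasing transformations.

```python
import numpy as np

def suboptimality(f, x, half_width=2.0, n=100_000, rng=None):
    """Monte Carlo estimate of the spatial suboptimality f_mu(x): the
    d-th root of the Lebesgue volume of the sublevel set
    {y : f(y) <= f(x)}. Assumes (for this sketch) that the sublevel set
    is contained in the box [-half_width, half_width]^d."""
    rng = rng or np.random.default_rng()
    d = x.size
    y = rng.uniform(-half_width, half_width, size=(n, d))
    values = np.array([f(row) for row in y])
    volume = np.mean(values <= f(x)) * (2.0 * half_width) ** d
    return volume ** (1.0 / d)

# f_mu is invariant under strictly increasing transformations of f:
f = lambda z: np.sum(z**2)
g_of_f = lambda z: np.exp(np.sum(z**2))   # strictly increasing transformation
x = np.array([0.5, 0.5])
print(suboptimality(f, x, rng=np.random.default_rng(0)))
print(suboptimality(g_of_f, x, rng=np.random.default_rng(0)))  # identical
```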
2.2.2 Success Probability
The success probability, i.e., the probability of sampling a candidate solution with an objective function value better than or equal to that of the current solution, plays an important role in the analysis of the (1+1) with success-based step-size control mechanism. We present here several useful definitions related to the success probability.
We start with the definition of the success domain with rate $r$ and the success probability with rate $r$. The probability to sample in the $r$-success domain is called the success probability with rate $r$. When $r = 1$, we simply talk about the success probability.¹

¹ For $r < 1$, the $r$-success domain is not necessarily equivalent to a sublevel set of $f$. However, since $f_\mu$ agrees $\mu$-almost everywhere with a strictly increasing transformation of $f$ [24, Lemma 1], and due to the absolute continuity of the multivariate normal distribution, the success probability with rate $r = 1$ is equivalent to the probability of sampling a candidate at least as good as the current solution, with the success probability defined in (3).
Definition 2.2 (Success Domain).
For a measurable function $f$ and $r \in (0, 1]$, the $r$-success domain at a point $x$ with $0 < f_\mu(x) < \infty$ is defined as
$$S_r(x) \;=\; \{y \in \mathbb{R}^d : f_\mu(y) \leq r \cdot f_\mu(x)\} . \tag{2}$$
Definition 2.3 (Success Probability).
Let $f$ be a measurable function and let $x$ be a search point satisfying $0 < f_\mu(x) < \infty$. For any $r \in (0, 1]$ and any covariance matrix $C$, the success probability with rate $r$ at $(x, C)$ under the normalized step-size $\bar\sigma = \sigma / f_\mu(x)$ is defined as
$$p^{r}_{\bar\sigma}(x, C) \;=\; \Pr_{Y \sim \mathcal{N}\left(x,\, (\bar\sigma f_\mu(x))^{2} C\right)}\big[\,Y \in S_r(x)\,\big] . \tag{3}$$
Definition 2.3 introduces the notion of normalized step-size: the success probability is defined as a function of the normalized step-size $\bar\sigma$ rather than the actual step-size $\sigma$. This is motivated by the fact that, as $X_t$ approaches the global optimum of $f$, the step-size needs to shrink for the success probability to remain constant. If the objective function has spherical level sets and the covariance matrix is the identity matrix, then the success probability is fully controlled by $\bar\sigma$ and is independent of $x$. This statement can be formalized in the following way.
Lemma 2.4.
If $f$ has spherical level sets and $C = I$, then, letting $\bar\sigma = \sigma / f_\mu(x)$, the success probability $p_{\bar\sigma}(x, I)$ depends only on $\bar\sigma$ and is independent of $x$.
Proof 2.5.
The suboptimality function is the $d$-th root of the volume of a ball of radius $\|x\|$; hence it is proportional to $\|x\|$. Then, the proof follows the derivation in Section 3 of [4].
Therefore, the normalized step-size is more discriminative than the step-size itself. In general, the optimal step-size is not necessarily proportional to either the distance to the optimum or the suboptimality $f_\mu$.
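The following sketch illustrates Lemma 2.4 numerically on the sphere function with identity covariance: the estimated success probability is (up to Monte Carlo noise) the same for points at very different distances from the optimum, because it depends on the state only through the normalized step-size. For simplicity we normalize by the distance $\|x\|$ rather than by $f_\mu(x)$; on the sphere the two differ only by a constant factor.

```python
import numpy as np

def success_probability(x, sigma_bar, n=100_000, rng=None):
    """Monte Carlo estimate of the success probability on the sphere
    function f(y) = ||y||^2 with C = I. The actual step-size is taken as
    sigma = sigma_bar * ||x||, i.e., sigma_bar plays the role of the
    normalized step-size (up to a constant volume factor)."""
    rng = rng or np.random.default_rng()
    y = x + sigma_bar * np.linalg.norm(x) * rng.standard_normal((n, x.size))
    return np.mean(np.sum(y**2, axis=1) <= np.sum(x**2))

# The two estimates agree up to Monte Carlo noise, although the points
# are at very different distances from the optimum:
for x in (np.ones(10), 1e-3 * np.ones(10)):
    print(success_probability(x, sigma_bar=0.3))
```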
Since the success probability under a given normalized step-size depends on $x$ and $C$, we define the upper and lower success probabilities as follows.
Definition 2.6 (Lower and Upper Success Probability).
Let $\epsilon > 0$. Given the normalized step-size $\bar\sigma$, the lower and upper success probabilities are defined as the infimum and the supremum, respectively, of the success probability $p_{\bar\sigma}(x, C)$ over all admissible covariance matrices $C$ and all points $x$ whose suboptimality exceeds $\epsilon$.
A central quantity for our analysis is the limit of the success probability as the normalized step-size tends to zero. Intuitively, if this limit is too small (compared to the target success probability), then, because the ruling principle of the algorithm is to decrease the step-size when the probability of success is smaller than the target, the step-size will keep decreasing, causing undesired premature convergence. Following Glasmachers [24], we introduce the concepts of $p$-improvability and $p$-criticality. They are defined in [24] by the probability of sampling a better point from the isotropic normal distribution in the limit of the step-size to zero. Here, we define $p$-improvability and $p$-criticality for a general multivariate normal distribution.
Definition 2.7 (-improvability and -criticality).
Let $f$ be a measurable function. The function $f$ is called $p$-improvable at $x$ under the covariance matrix $C$ if there exists $r \in (0, 1]$ such that
$$\liminf_{\bar\sigma \to +0}\; p^{r}_{\bar\sigma}(x, C) \;\geq\; p . \tag{4}$$
Otherwise, it is called -critical.
The connection to the classical definition of the critical points for continuously differentiable functions is summarized in the following proposition, which is an extension of Lemma 4 in [24], taking a non-identity covariance matrix into account.
Proposition 2.8.
Let $f = g \circ h$ be a measurable function, where $g$ is any strictly increasing function and $h$ is continuously differentiable. Then, $f$ is $p$-improvable with $p = 1/2$ at any regular point (where $\nabla h \neq 0$) under any covariance matrix. Moreover, if $h$ is twice continuously differentiable at a critical point (where $\nabla h = 0$) and at least one eigenvalue of the Hessian $\nabla^2 h$ is non-zero, then, under any covariance matrix, $f$ is $p$-improvable with $p = 1/2$ if $\nabla^2 h$ has only non-positive eigenvalues, $p$-critical for every $p > 0$ if $\nabla^2 h$ has only non-negative eigenvalues, and $p$-improvable with some $p > 0$ if $\nabla^2 h$ has at least one strictly negative eigenvalue.
Proof 2.9.
Note that sampling a candidate from $\mathcal{N}(x, \sigma^2 C)$ for $f$ is equivalent to sampling from $\mathcal{N}(C^{-1/2}x, \sigma^2 I)$ for the function $y \mapsto f(C^{1/2} y)$, which satisfies the same regularity assumptions. Therefore, it suffices to show that the claims hold under the identity covariance matrix, which is proven in Lemma 4 in [24].
2.3 Main Assumptions on the Objective Functions
Given positive real numbers and satisfying , and a measurable objective function, let be the set defined in Definition 2.6.
We pose two core assumptions on the objective functions under which we will derive an upper bound on the expected first hitting time (Theorem 4.6). First, we require the sublevel sets of $f$ to include, and be embedded in, balls of radius scaling with the suboptimality $f_\mu$. We do not require this to hold on the whole search space, but only on a set of points whose suboptimality lies in a prescribed range.
-
A1
We assume that $f$ is a measurable function and that, for any $x$ in the set introduced above, there exist an open ball with radius proportional to $f_\mu(x)$ and a closed ball with radius proportional to $f_\mu(x)$ (with proportionality constants independent of $x$) such that the open ball is included in the sublevel set $\{y : f(y) \leq f(x)\}$ and the sublevel set is included in the closed ball.
We do not specify the centers of those balls, which may or may not be centered on an optimum of the function. We will see in Proposition 4.1 that this assumption allows us to bound the lower and upper success probabilities by tractable functions of the normalized step-size, which will be essential for the analysis. The property is illustrated in Figure 2.
The second assumption requires that the function is $p$-improvable with $p$ lower-bounded uniformly over the considered set of points and covariance matrices.
-
A2
Let $f$ be a measurable function. We assume that there exist $p^* > 0$ and $r^* \in (0, 1]$ such that, for any $x$ in the set introduced above and any admissible covariance matrix $C$, the objective function is $p^*$-improvable with some rate $r \geq r^*$, i.e.,
$$\liminf_{\bar\sigma \to +0}\; p^{r}_{\bar\sigma}(x, C) \;\geq\; p^* \quad \text{for some } r \geq r^* . \tag{5}$$
The property is illustrated in Figure 2. For a continuous function, this assumption implies in particular that the considered set does not contain any local optimum. Such an assumption is required to obtain global convergence [24, Theorem 2] even without any covariance matrix adaptation (i.e., with the identity covariance matrix), and it can be intuitively understood: if we have a point that is $p$-improvable only with $p$ below the target success probability and that is not a local minimum of the function, then, starting with a small step-size, the success-based step-size control may keep decreasing the step-size at such a point, and the (1+1) will prematurely converge to a point that is not a local optimum.
If A1 is satisfied with balls centered at the optimum $x^\star$ of the function, then it is easy to see that, for all $x$ in the considered set,
$$C_1 \|x - x^\star\| \;\leq\; f_\mu(x) \;\leq\; C_2 \|x - x^\star\| \tag{6}$$
for constants $0 < C_1 \leq C_2$ derived from the radii in A1. If the balls are not centered at the optimum, we still have the one-sided inequality $f_\mu(x) \geq C_1 \|x - x^\star\|$ (with a possibly different constant). Hence, the expected first hitting time of $f_\mu(X_t)$ to $[0, \epsilon]$ translates to an upper bound for the expected first hitting time of $\|X_t - x^\star\|$ to $[0, \epsilon/C_1]$.
We remark that A1 and A2 allow us to include some non-differentiable functions with non-convex sublevel sets, as illustrated in Figure 2.
We now give two examples of function classes that satisfy A1 and A2, including classes on which linear convergence of numerical optimization algorithms is typically analyzed. The first class consists of quadratically bounded functions; it includes all strongly convex functions with Lipschitz continuous gradient, as well as some non-convex functions. The second class consists of positively homogeneous functions. The level sets of a positively homogeneous function are all geometrically similar around the optimum.
-
A3
We assume that $f = g \circ h$, where $g$ is a strictly increasing function and $h$ is measurable, continuously differentiable with the unique critical point $x^\star$, and quadratically bounded around $x^\star$, i.e., for some constants $0 < c_1 \leq c_2$,
$$c_1 \|x - x^\star\|^2 \;\leq\; h(x) - h(x^\star) \;\leq\; c_2 \|x - x^\star\|^2 . \tag{7}$$
-
A4
We assume that $f = g \circ h$, where $g$ is a strictly increasing function and $h$ is continuously differentiable and positively homogeneous with respect to its unique optimum $x^\star$, i.e., for some degree $\alpha > 0$,
$$h(x^\star + \rho\,(x - x^\star)) - h(x^\star) \;=\; \rho^{\alpha}\,\big(h(x) - h(x^\star)\big) \quad \text{for all } \rho > 0 \text{ and } x \in \mathbb{R}^d . \tag{8}$$
These assumptions imply A1 and A2, as stated in Lemma 2.10 and Lemma 2.11; the proofs of the lemmas are presented in Section B.1 and Section B.2, respectively.
3 Methodology: Additive Drift on Unbounded Continuous Domains
3.1 First Hitting Time
We start with the generic definition of the first hitting time of a stochastic process , defined as follows.
Definition 3.1 (First hitting time).
Let $(X_t)_{t \geq 0}$ be a sequence of real-valued random variables adapted to the natural filtration, with initial condition $X_0$. For $\epsilon > 0$, the first hitting time of $(X_t)$ to the set $(-\infty, \epsilon]$ is defined as $T_\epsilon = \inf\{t \in \mathbb{N} : X_t \leq \epsilon\}$.
The first hitting time is the number of iterations that the stochastic process requires to reach the target level for the first time. In our situation, $X_t$ measures the distance from the current solution to the target point (typically, a global or local optimum) after $t$ iterations. Then, $\epsilon$ defines the target accuracy and $T_\epsilon$ is the runtime of the algorithm until it finds an $\epsilon$-neighborhood of the target. The first hitting time is a random variable, as $(X_t)$ is a stochastic process. In this paper, we focus on the expected first hitting time $\mathbb{E}[T_\epsilon]$. We want to derive lower and upper bounds on this expected hitting time that relate to the linear convergence of $X_t$ towards the target. Such bounds take the following form: there exist constants $a_\ell, a_u$ and $b_\ell, b_u > 0$ such that, for any sufficiently small $\epsilon > 0$,
$$-a_\ell + b_\ell \log\frac{X_0}{\epsilon} \;\leq\; \mathbb{E}[T_\epsilon] \;\leq\; a_u + b_u \log\frac{X_0}{\epsilon} . \tag{9}$$
That is, the time to reach the target accuracy scales logarithmically with the ratio between the initial accuracy $X_0$ and the target accuracy $\epsilon$. The first pair of constants, $a_\ell$ and $a_u$, captures the transient time, which is the time that adaptive algorithms typically spend for adaptation. The second pair of constants, $b_\ell$ and $b_u$, reflects the speed of convergence (logarithmic convergence rate). Intuitively, assuming that $b_\ell$ and $b_u$ are close, the distance to the optimum decreases in each step at a rate of approximately $\exp(-1/b_u)$. While upper bounds inform us about (linear) convergence, the lower bound helps us understand whether the upper bounds are tight.
Alternatively, linear convergence can be defined as the following property: there exists $c > 0$ such that
$$\limsup_{t \to \infty} \frac{1}{t} \log\frac{X_t}{X_0} \;\leq\; -c \quad \text{almost surely} . \tag{10}$$
When we have an equality in the previous statement, we say that $\exp(-c)$ is the convergence rate.
Figure 1 (right plot) visualizes three different runs of the (1+1)-ES on a function with spherical level sets, with different initial step-sizes. First of all, we clearly observe linear convergence. The first hitting time of accuracy $\epsilon$ scales linearly with $\log(1/\epsilon)$ for sufficiently small $\epsilon$. Second, the convergence speed is independent of the initial condition. Therefore, we expect universal constants $b_\ell$ and $b_u$ independent of the initial state. Last, depending on the initial step-size, the transient time can vary. If the initial step-size is too large or too small, the algorithm makes no progress in terms of the distance to the optimum until the step-size is well adapted. Therefore, $a_\ell$ and $a_u$ depend on the initial condition, with a logarithmic dependency on the initial multiplicative mismatch of the step-size.
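The logarithmic scaling in (9) can be reproduced with a toy model: a process whose quality measure is multiplied each step by a random factor with negative expected logarithm. This is our stand-in for the ES dynamics, not the algorithm itself; the rate and noise level are arbitrary illustrative values.

```python
import numpy as np

def hitting_time(eps, rate=0.25, noise=0.5, x0=1.0, rng=None):
    """First hitting time of [0, eps] for a toy linearly converging
    process: the quality measure is multiplied each step by a random
    factor whose log has mean -rate and standard deviation noise."""
    rng = rng or np.random.default_rng()
    x, t = x0, 0
    while x > eps:
        x *= np.exp(-rate + noise * rng.standard_normal())
        t += 1
    return t

# The average hitting time grows linearly in log(x0/eps), matching (9).
for eps in (1e-2, 1e-4, 1e-8):
    times = [hitting_time(eps) for _ in range(200)]
    print(eps, np.mean(times))
```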
3.2 Bounds of the Hitting Time via Drift Conditions
We are going to use drift analysis, which consists in deducing properties of a sequence $(X_t)$ (adapted to a natural filtration $(\mathcal{F}_t)$) from its drift, defined as $\mathbb{E}[X_{t+1} - X_t \mid \mathcal{F}_t]$ [29]. Drift analysis has been widely used to analyze hitting times of evolutionary algorithms defined on discrete search spaces (mainly on binary search spaces) [32, 33, 11, 46, 20, 19]. Though they were developed mainly for finite search spaces, the drift theorems naturally generalize to continuous domains [42, 44]. Indeed, Jägersküpper's work [34, 36, 37] is based on the same idea, although the link to drift analysis was implicit.
Since many drift conditions have been developed for analyzing algorithms on discrete domains, the domain of the process is often implicitly assumed to be bounded. This assumption is violated in our situation, where we will use $\log f_\mu(X_t)$ as the quality measure: it is unbounded from below and is meant to approach $-\infty$. We refer to [4] for additional details. In general, translating expected progress into hitting time bounds requires bounding the tail of the progress distribution, as formalized in [29].
To control the tails of the drift distribution, we construct a stochastic process $(Y_t)$ iteratively as follows: $Y_0 = X_0$ and
$$Y_{t+1} \;=\; Y_t + \mathbb{1}\{Y_t > \log\epsilon\}\,\max\{X_{t+1} - X_t,\; -v\} \tag{11}$$
for some $v > 0$. It clips the one-step change $X_{t+1} - X_t$ to $-v$ from below. We introduce the indicator for a technical reason: the process is frozen once it has reached the target level. The process disregards progress larger than $v$ in a single step, so that, in particular, the progress of the step that hits the target set is at most $v$. It is formalized in the following theorem, which is our main mathematical tool to derive an upper bound on the expected first hitting time of the (1+1) in the form of (9).
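A sketch of this construction, under our reading of (11); the truncation constant and target level in the example are illustrative.

```python
import numpy as np

def truncated_process(x, v, log_eps):
    """Builds the clipped process Y from a realized trajectory
    X = (x_0, x_1, ...), following our reading of (11): Y tracks the
    increments of X, single-step progress is truncated at v, and Y is
    frozen once it has reached the target level log_eps."""
    y = [x[0]]
    for t in range(len(x) - 1):
        if y[-1] <= log_eps:                      # target reached: freeze
            y.append(y[-1])
        else:                                     # clip increment below at -v
            y.append(y[-1] + max(x[t + 1] - x[t], -v))
    return np.array(y)

# Example: a trajectory with one unrealistically large improvement.
x = np.array([0.0, -0.5, -5.0, -5.2, -6.0])
print(truncated_process(x, v=1.0, log_eps=-4.0))  # the -4.5 jump is clipped to -1
```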
Theorem 3.2.
Let $(X_t)_{t \geq 0}$ be a sequence of real-valued random variables adapted to a filtration $(\mathcal{F}_t)$, with $X_0$ deterministic. For $\epsilon > 0$, let $T^X_\epsilon$ be the first hitting time of the set $(-\infty, \log\epsilon]$ by $(X_t)$. Define a stochastic process $(Y_t)$ iteratively through (11) for some $v > 0$, and let $T^Y_\epsilon$ be the first hitting time of the set $(-\infty, \log\epsilon]$ by $(Y_t)$. If $(Y_t)$ is integrable, i.e. $\mathbb{E}[|Y_t|] < \infty$ for all $t$, and there exists $B \in (0, v]$ such that
$$\mathbb{E}\big[Y_{t+1} - Y_t \;\big|\; \mathcal{F}_t\big] \;\leq\; -B\,\mathbb{1}\{Y_t > \log\epsilon\} , \tag{12}$$
then the expectation of $T^Y_\epsilon$ satisfies
$$\mathbb{E}\big[T^Y_\epsilon\big] \;\leq\; \frac{X_0 - \log\epsilon + v}{B} . \tag{13}$$
Proof 3.3 (Proof of Theorem 3.2).
We consider the stopped process . Then, we have for and for all .
We will prove that
(14) |
We start from
(15) |
and bound the different terms:
(16) |
where we have used that , , , and are all -measurable. Also
(17) |
where we have used condition Eq. 12. Hence, by injecting Eq. 16 and Eq. 17 into Eq. 15, we obtain Eq. 14. From Eq. 14, by taking the expectation we deduce . Following the same approach as [44, Theorem 1], since is a random variable taking values in , it can be rewritten as and thus it holds
Since , we have . Given that , we deduce that for all . With , we have
Because for , we have , implying . This completes the proof.
This theorem can be intuitively understood: assume for the sake of simplicity a process such that $Y_t = X_t$. Then (12) states that the process progresses in expectation by at least $B$ per step. The theorem concludes that the expected time needed to reach a value smaller than $\log\epsilon$ when started at $X_0$ is at most $(X_0 - \log\epsilon)/B$ (what we would get for a deterministic algorithm) plus $v/B$. This last term is due to the stochastic nature of the algorithm. It is minimized if $v$ is as close as possible to $B$, which corresponds to a highly concentrated progress distribution.
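A quick numerical illustration of this intuition, under our reading of the theorem; the exponential step distribution is an arbitrary choice, and what matters is that per-step progress is clipped at $v$ and has mean roughly $B$.

```python
import numpy as np

def simulate_hitting(x0, log_eps, B, v, n_runs=2000, rng=None):
    """Average hitting time of (-inf, log_eps] for a process that makes
    clipped progress: each step decreases the value by min(E, v), where
    E is exponential with mean B (so the drift is close to, at most, B)."""
    rng = rng or np.random.default_rng()
    times = []
    for _ in range(n_runs):
        y, t = x0, 0
        while y > log_eps:
            y -= min(rng.exponential(B), v)   # clipped single-step progress
            t += 1
        times.append(t)
    return float(np.mean(times))

# Empirical mean vs. the (x0 - log_eps)/B + v/B flavor of the bound (13):
print(simulate_hitting(x0=0.0, log_eps=-10.0, B=0.5, v=2.0))   # roughly 21
print((0.0 - (-10.0)) / 0.5 + 2.0 / 0.5)                       # bound: 24
```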
Jägersküpper [36, Theorem 2] established a general lower bound on the expected first hitting time of the (1+1)-ES. We borrow the same idea to prove the following general theorem for a lower bound on the expected first hitting time, which generalizes [37, Lemma 12]. See Theorem 2.3 in [4] for its proof.
Theorem 3.4.
Let $(X_t)_{t \geq 0}$ be a sequence of integrable real-valued random variables adapted to a filtration $(\mathcal{F}_t)$, such that $X_0 = x_0$ is deterministic and $\mathbb{E}[X_{t+1} - X_t \mid \mathcal{F}_t] \geq -C$ for some $C > 0$ and all $t \geq 0$. For $\epsilon > 0$ we define $T_\epsilon = \inf\{t \geq 0 : X_t \leq \log\epsilon\}$. Then the expected hitting time is lower bounded by $\mathbb{E}[T_\epsilon] \geq (x_0 - \log\epsilon)/C$.
4 Main Result: Expected First Hitting Time Bound
4.1 Mathematical Modeling of the Algorithm
In the sequel, we will analyze the process $(\theta_t)_{t \geq 0}$ with $\theta_t = (X_t, \sigma_t, C_t)$ generated by the (1+1) algorithm. We assume from now on that the optimized objective function is measurable with respect to the Borel $\sigma$-algebra. We equip the state-space $\Theta$ with its Borel $\sigma$-algebra.
4.2 Preliminaries
We present two preliminary results. In Assumption A1, we assume that we can include a ball of radius proportional to $f_\mu(x)$ into the sublevel set of $x$ and embed the sublevel set into a ball of radius proportional to $f_\mu(x)$. This allows us to upper and lower bound the probability of success by the probability to sample inside balls of these radii with appropriate centers. From this we can upper-bound the upper success probability, and similarly lower-bound the lower success probability, by tractable functions of the normalized step-size. The corresponding proof is given in Section B.3.
Proposition 4.1.
We use the previous proposition to establish the next lemma, which guarantees the existence of a finite range of normalized step-sizes for which the success probability lies in a range independent of $x$ and $C$, and which provides a lower bound on the success probability with rate $r^*$ when the normalized step-size is in this range. Its proof is provided in Section B.4.
4.3 Potential Function
Lemma 4.2 divides the domain of the normalized step-size into three disjoint subsets: a too-small-step-size regime, where the success probability exceeds the target for all $x$ and $C$; a too-large-step-size regime, where the success probability is below the target for all $x$ and $C$; and a reasonable regime, where the success probability with rate $r^*$ is lower bounded by Eq. 21. Since the step-size is increased upon success and decreased otherwise, the normalized step-size is expected to be maintained in the reasonable regime.
Our potential function is defined as follows. In light of Lemma 4.2, we can take constants $\bar\sigma_\ell$ and $\bar\sigma_u$ with $0 < \bar\sigma_\ell < \bar\sigma_u < \infty$ delimiting the reasonable regime. With some constant $w > 0$, we define our potential function as
$$V(\theta) \;=\; \log f_\mu(X) + w \max\left\{0,\; \log\frac{\bar\sigma_\ell\, f_\mu(X)}{\sigma},\; \log\frac{\sigma}{\bar\sigma_u\, f_\mu(X)}\right\} . \tag{22}$$
The rationale behind the second term on the right-hand side (RHS) is as follows. The second and third terms inside the max are positive only if the normalized step-size $\sigma / f_\mu(X)$ is smaller than $\bar\sigma_\ell$ or greater than $\bar\sigma_u$, respectively. The potential value is $\log f_\mu(X)$ if the normalized step-size is in $[\bar\sigma_\ell, \bar\sigma_u]$, and it is penalized if the normalized step-size is too small or too large. We need this penalization for the following reason. If the normalized step-size is too small, the success probability is close to $1/2$ at non-critical points (assuming a continuously differentiable function composed with a strictly increasing transformation), but the progress per step is very small because the step-size directly controls the progress, for instance measured as the decrease of $\log f_\mu(X_t)$. If the normalized step-size is too large, the success probability is close to zero and the algorithm produces no progress with high probability. If we used $\log f_\mu(X_t)$ as a potential function instead of $V(\theta_t)$, the progress would be arbitrarily small in such situations, which prevents the application of drift arguments. The above potential function penalizes such situations and guarantees a certain progress in the penalized quantity, since the step-size will be increased or decreased, respectively, with high probability, leading to a certain decrease of $V$. We illustrate in Figure 1 that $\log f_\mu(X_t)$ cannot work alone as a potential function while $V(\theta_t)$ does: when we start from a too small or too large step-size, $f_\mu(X_t)$ looks constant (dotted lines in green and blue). Only when the step-size is started well adapted do we see immediate progress in $f_\mu(X_t)$. Also, the step-size can always get arbitrarily worse, with a very small probability, which forces us to handle the case of a badly adapted step-size properly. Yet the simulation of $V(\theta_t)$ shows that in all three situations (small, large, and well-adapted step-sizes compared to the distance to the optimum), we observe a geometric decrease of $V(\theta_t)$.
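A sketch of a potential of this form; the names and thresholds are ours, the exact constants in the paper come from Lemma 4.2.

```python
import numpy as np

def potential(f_mu_x, sigma, w, sbar_low, sbar_up):
    """Potential in the spirit of (22): log f_mu(x) plus a penalty
    (weighted by w) that is positive only when the normalized step-size
    sigma / f_mu(x) leaves the reasonable range [sbar_low, sbar_up]."""
    sbar = sigma / f_mu_x                     # normalized step-size
    penalty = max(0.0, np.log(sbar_low / sbar), np.log(sbar / sbar_up))
    return np.log(f_mu_x) + w * penalty

# The penalty leaves well-adapted states untouched:
print(potential(1.0, 0.5, w=1.0, sbar_low=0.1, sbar_up=1.0))   # 0.0 (no penalty)
print(potential(1.0, 1e-4, w=1.0, sbar_low=0.1, sbar_up=1.0))  # penalized
```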
4.4 Upper Bound of the First Hitting Time
We are now ready to establish that the potential function defined in (22) satisfies a (truncated) drift condition from Theorem 3.2. This will in turn imply an upper bound on the expected hitting time of $f_\mu(X_t)$ to $[0, \epsilon]$. The proof follows the same line of argumentation as the proof of [4, Proposition 4.2], which was restricted to the case of spherical functions. It was generalized under similar assumptions as in this paper, but for a fixed covariance matrix equal to the identity, in [47, Proposition 6]. The detailed proof is given in Section B.5.
Proposition 4.3.
Consider the (1+1) described in Algorithm 1 with state . Assume that the minimized objective function satisfies A1 and A2 for some . Let and be constants satisfying and . Then, there exists and such that , where and are defined in Lemma 4.2. For any , taking satisfying , and the potential function Eq. 22, we have
(23) |
where
(24) |
Moreover, for any admissible choice of the constants, there exists $v > 0$ such that the drift coefficient $B$ is positive.
We apply Theorem 3.2 along with Proposition 4.3 to derive the expected first hitting time bound. To do so, we need to confirm that the prerequisite of the theorem is satisfied: integrability of the process defined in Eq. 11 with $X_t = V(\theta_t)$.
Lemma 4.4.
Let $(\theta_t)$ be the sequence of parameters defined by the (1+1) with initial condition $\theta_0$, optimizing a measurable function $f$. Set $X_t = V(\theta_t)$ as defined in Eq. 22 and define the process $(Y_t)$ as in Theorem 3.2. Then, for any $v > 0$, $(Y_t)$ is integrable, i.e., $\mathbb{E}[|Y_t|] < \infty$ for each $t$.
Proof 4.5 (Proof of Lemma 4.4).
The one-step change of the process is by construction bounded by $-v$ from below. It is also bounded by a constant from above: from the proof of Proposition 4.3, it is easy to find an upper bound, say $\bar{v}$, on the truncated one-step change, without using A1 and A2. Then, by recursion, $|Y_t| \leq |Y_0| + t \max\{v, \bar{v}\}$. Hence $\mathbb{E}[|Y_t|] < \infty$ for all $t$.
Finally, we derive a bound on the expected first hitting time.
Theorem 4.6.
Consider the same situation as described in Proposition 4.3. Let $T_\epsilon$ be the first hitting time of $f_\mu(X_t)$ to $[0, \epsilon]$, with the target accuracy $\epsilon$ chosen within the range covered by Definition 2.6. If $f_\mu(X_0) > \epsilon$, the expected first hitting time is upper bounded by $\mathbb{E}[T_\epsilon] \leq \big(V(\theta_0) - \log\epsilon + v\big)/B$ for $B$ described in Proposition 4.3, where $V$ is the potential function defined in Eq. 22. Equivalently, we have $\mathbb{E}[T_\epsilon] = O(\log(f_\mu(X_0)/\epsilon))$, with an additive constant depending on the initial step-size mismatch.
Moreover, under A1 the above result yields an upper bound on the expected first hitting time of $\|X_t - x^\star\|$ to a ball whose radius is a constant multiple of $\epsilon$.
Proof 4.7.
Theorem 3.2, with Proposition 4.3 and Lemma 4.4, bounds the expected first hitting time of the potential $V(\theta_t)$ to the target level. Since $\log f_\mu(X_t) \leq V(\theta_t)$, the first hitting time of $\log f_\mu(X_t)$ is bounded by that of $V(\theta_t)$. The inequality is preserved if we take the expectation. The last claim is trivial from inequality (6), which holds under A1.
Theorem 4.6 shows an upper bound on the expected hitting time of the (1+1) with success-based step-size adaptation that corresponds to linear convergence towards the global optimum on functions satisfying A1 and A2. Moreover, when the assumptions hold for arbitrarily small suboptimality levels, this bound holds from all initial search points. If the assumptions only hold down to some positive suboptimality level, the bound in Theorem 4.6 does not translate into linear convergence, but we still obtain an upper bound on the expected first hitting time of the target accuracy. This is useful for understanding the behavior of the (1+1) on multimodal functions, and on functions with degenerate Hessian matrix at the optimum.
4.5 Lower Bound of the First Hitting Time
We derive a general lower bound on the expected first hitting time of $\|X_t - x^\star\|$ to $[0, \epsilon]$. The following results hold for an arbitrary measurable function and for a (1+1) with an arbitrary step-size control mechanism. The following lemma provides a lower bound on the expected one-step progress measured by the logarithm of the distance to the optimum.
Lemma 4.8.
We consider the process $(X_t)$ generated by a (1+1) algorithm with an arbitrary step-size adaptation mechanism and an arbitrary covariance matrix update, optimizing an arbitrary measurable function. We assume that the optimum $x^\star$ has not been hit. We consider the natural filtration $(\mathcal{F}_t)$. Then, the expected single-step progress is lower-bounded by
$$\mathbb{E}\big[\log\|X_{t+1} - x^\star\| \;\big|\; \mathcal{F}_t\big] - \log\|X_t - x^\star\| \;\geq\; -\frac{c}{d} \tag{25}$$
for a constant $c > 0$ independent of the dimension $d$ (the constant may depend on the bound on the condition number of the covariance matrix).
Proof 4.9 (Proof of Lemma 4.8).
Note first that . This value can be positive since does not imply in general. Clipping the positive part to zero, we obtain a lower bound, which is the RHS of the above equality times the indicator . Since the quantity is non-positive, dropping the indicator only decreases the lower bound. Hence, we have . Then,
We rewrite the lower bound of the drift. The RHS of the above inequality is the integral of in the integral domain under the probability measure . Performing a variable change (through rotation and scaling) so that becomes and letting , we can further rewrite it as the integral of in under . With , we have , see Lemma B.1. Altogether, we obtain the lower bound . The RHS is equivalent to times the single step progress of the (1+1)-ES on the spherical function at and , which is proven in the proof of Lemma 4.4 of [4] to be lower bounded by for . This completes the proof.
The following theorem proves that the expected first hitting time of the (1+1) is $\Omega(d \log(\|X_0 - x^\star\|/\epsilon))$ for any measurable function, implying that the algorithm cannot converge faster than linearly, and that the runtime scales at least linearly with respect to the dimension $d$. The proof is a direct application of Lemma 4.8 to Theorem 3.4.
Theorem 4.10.
We consider the process generated by a (1+1) described in Algorithm 1 and assume that the objective is a measurable function whose optimum $x^\star$ is not hit at initialization. Let $T_\epsilon$ be the first hitting time of $[0, \epsilon]$ by $\|X_t - x^\star\|$. Then, the expected first hitting time is lower bounded by $\Omega(d \log(\|X_0 - x^\star\|/\epsilon))$. The bound holds for arbitrary step-size adaptation mechanisms. If A1 holds, it also gives a lower bound on the expected first hitting time of $f_\mu(X_t)$ to a constant times $[0, \epsilon]$.
Proof 4.11 (Proof of Theorem 4.10).
Let for . Define iteratively as and . Then, it is easy to see that and for all . Note that , where the RMS is lower bounded in light of Lemma 4.8. Then, applying Theorem 3.4, we obtain the lower bound. The last statement directly follows from under A1.
4.6 Almost Sure Linear Convergence
In addition to the expected first hitting time bound, we can deduce from Proposition 4.3 almost sure linear convergence, as stated in the following proposition.
Proposition 4.12.
Consider the same situation as described in Proposition 4.3. Then, for any initial condition, we have
$$\limsup_{t \to \infty} \frac{1}{t} \log\frac{f_\mu(X_t)}{f_\mu(X_0)} \;\leq\; -B \quad \text{almost surely}, \tag{26}$$
where $B$ is as defined in Proposition 4.3. Hence almost sure linear convergence holds at a rate upper-bounded by $\exp(-B)$.
Proof 4.13 (Proof of Proposition 4.12).
Let be defined in (22). Let and . Define for . Then, is a martingale difference sequence on the filtration produced by . We hence have , and from Proposition 4.3 we obtain
By repeatedly applying the above inequality and dividing it by , we obtain , where and is a martingale sequence. In light of the strong law of large numbers for martingales [15], if , we have almost surely. By the definition of and the working mechanism of the (1+1), we have . Hence, . Hence, we have almost surely. Along with , we obtain Equation 26.
4.7 Wrap-up of the Results: Global Linear Convergence
As a corollary to the lower bound from Theorem 4.10, the upper bound from Theorem 4.6, Proposition 4.12 (stating almost sure linear convergence), and the fact that the different assumptions discussed in Section 2.3 imply A1 and A2, we summarize our linear convergence results in the following theorem.
Theorem 4.14 (Global Linear Convergence).
We consider the (1+1) optimizing an objective function $f$. Suppose that either
- (a) $f$ satisfies A3 (a strictly increasing transformation of a quadratically bounded function), or
- (b) $f$ satisfies A4 (a strictly increasing transformation of a continuously differentiable positively homogeneous function).
Then, for any target accuracy $\epsilon > 0$ and any initial condition, the expected hitting time of $\|X_t - x^\star\|$ to $[0, \epsilon]$ is $\Theta(\log(\|X_0 - x^\star\|/\epsilon))$. Moreover, both $f_\mu(X_t)$ and $\|X_t - x^\star\|$ converge linearly almost surely, i.e.,
$$\limsup_{t \to \infty} \frac{1}{t} \log\frac{\|X_t - x^\star\|}{\|X_0 - x^\star\|} \;\leq\; -B \quad \text{almost surely},$$
where $B$ is as defined in Proposition 4.3. The convergence rate is thus upper-bounded by $\exp(-B)$.
4.8 Tightness in the Sphere Function Case
Now we consider a specific convex quadratic function, namely the sphere function, for which the spatial suboptimality function is proportional to the distance to the optimum. In Theorem 4.14 we have formulated that the expected hitting time of a ball of radius $\epsilon$ for the (1+1) scales logarithmically in $1/\epsilon$. Yet, this statement does not give information on how the constants hidden in the $\Theta$-notation scale with the dimension. In particular, the convergence rate of the algorithm is upper-bounded by $\exp(-B)$, where $B$ is given in (24); see Theorem 4.6. In this section, we estimate precisely the scaling of $B$ in Proposition 4.3 with respect to the dimension and compare it with the general lower bound on the expected first hitting time given in Theorem 4.10. We then conclude that the bound is tight with respect to the scaling with the dimension in the case of the sphere function.
Let us assume that the covariance matrix is kept equal to the identity, that is, we consider the (1+1)-ES without covariance matrix adaptation. Then the success probability is independent of $x$, as described in Lemma 2.4, and is solely controlled by the normalized step-size $\bar\sigma$.
The following proposition states that the convergence speed is $\Theta(1/d)$; hence the expected first hitting time scales as $\Theta(d \log(f_\mu(X_0)/\epsilon))$. The proof is provided in Section B.6.
Proposition 4.15.
For a target success probability independent of $d$, and increase and decrease factors whose logarithms are $\Theta(1/d)$, we have $B = \Theta(1/d)$.
The two conditions on the parameter choice in Proposition 4.15 are understood as follows. The first condition states that the target success probability must be independent of the dimension $d$. In the 1/5 success rule, the increase and decrease factors are set so that the target success probability is 1/5, independent of $d$. The second condition states that the logarithms of the step-size increase and decrease factors must be $\Theta(1/d)$. Note that on the sphere function the normalized step-size is kept around a constant during the search. This implies that the convergence speeds of $f_\mu(X_t)$ and $\sigma_t$ must agree. Therefore the speed of the adaptation of the step-size must not be too small to achieve the $\Theta(d \log(1/\epsilon))$ scaling of the expected first hitting time.
Proposition 4.15 and Theorem 4.6 imply an upper bound on the expected first hitting time proportional to $d \log(f_\mu(X_0)/\epsilon)$, and Theorem 4.10 implies a matching lower bound. Together they yield a $\Theta(d \log(f_\mu(X_0)/\epsilon))$ expected first hitting time. This result shows i) that the runtime of the (1+1)-ES on the sphere function is proportional to the dimension $d$, and ii) that from our methodology one can derive a tight bound on the runtime in some cases. The result is formally stated as follows.
Theorem 4.16.
The (1+1)-ES (Algorithm 1) converges globally and linearly in terms of $\|X_t - x^\star\|$ from any starting point and any initial step-size on any function $g(\|x - x^\star\|)$, where $g$ is a strictly increasing function. Moreover, if the target success probability is independent of the dimension and the logarithms of the increase and decrease factors are $\Theta(1/d)$, the expected first hitting time of $\|X_t - x^\star\|$ to $[0, \epsilon]$ is $\Theta(d \log(\|X_0 - x^\star\|/\epsilon))$, and the almost sure convergence rate is upper-bounded by $\exp(-\Theta(1/d))$.
Since the lower bound holds for an arbitrary step-size adaptation mechanism, the above result not only implies that our upper bound is tight, but also that the success-based step-size control mechanism achieves the best possible convergence rate, up to a constant factor, on the spherical function.
5 Discussion
We have established almost sure global linear convergence of the (1+1), also expressed as a bound on the expected hitting time of an $\epsilon$-neighborhood of the solution. Assumption A1 has been the key to obtaining the expected first hitting time bound of the (1+1) in the form of (9). The convergence results hold on a wide class of functions. It includes
-
(i)
strongly convex functions with Lipschitz gradient, where linear convergence of numerical optimization algorithms is usually analyzed,
-
(ii)
continuously differentiable positively homogeneous functions, for which previous linear convergence results had been established, and
-
(iii)
functions with non-smooth level sets as illustrated in Figure 2.
Because the analyzed algorithms are invariant to strictly monotonic transformations of the objective functions, all results that hold on $f$ also hold on $g \circ f$, where $g$ is a strictly increasing transformation, which can thus introduce discontinuities in the objective function. In contrast to the previous result establishing the convergence of CMA-ES [18] by adding a step to enforce a sufficient decrease (which works well for direct search methods, but which is unnatural for ESs), we did not need to modify the adaptation mechanism of the (1+1)-ES to achieve our convergence proofs. We believe that this is crucial, since it allows our analysis to reflect the main mechanism that makes the algorithm work well in practice.
Theorem 4.16 proves that we can derive a tight convergence rate with Proposition 4.3 on the sphere function in the case of the identity covariance matrix, i.e., without covariance matrix adaptation. This partially supports the utility of our methodology. However, its derivation relies on the fact that both the level sets of the objective function and the equal-density curves of the sampling distribution are isotropic, and hence it does not generalize immediately. Moreover, the lower bound (Theorem 4.10) seems to be loose on convex quadratic functions, where we empirically observe that the logarithmic convergence rate degrades with the condition number (see Figure 1), while its dependency on the dimension is tight.
A better lower bound of the expected first hitting time and a handy way to estimate the convergence rate are relevant directions of future work. Further directions of future work are as follows:
Proving linear convergence of the (1+1) does not reveal the benefits of covariance matrix adaptation over the (1+1)-ES without it. The motivation for introducing the covariance matrix is to improve the convergence rate and to broaden the class of functions on which linear convergence is exhibited. Neither is achieved in this paper.
On convex quadratic functions, we empirically observe that the covariance matrix approaches a stable distribution that is closely concentrated around the inverse Hessian up to a scalar factor, and the convergence speed on all convex quadratic functions is equal to that on the sphere function (see Figure 1). This behavior is not described by our result.
Covariance matrix adaptation is also important for optimizing functions with non-smooth level sets. On continuously differentiable functions, we can always set the target success probability and the adaptation factors so that the target lies below the limit success probability of 1/2 at non-critical points. This is the rationale behind the 1/5 success rule. Indeed, a success probability around 1/5 is known to approximate the optimal situation on the sphere function, where the expected one-step progress is maximized [51]. Therefore, one does not need to tune these parameters in a problem-specific manner. However, if the objective is not continuously differentiable and its level sets are non-smooth, then the limit success probability is in general smaller than 1/2; it can even be exponentially small in the dimension. Without an appropriate adaptation of the covariance matrix, the success probability will then be smaller than the target, and one must tune the target success probability and the adaptation factors in order to converge to the optimum, which requires information about the objective function. By adapting the covariance matrix appropriately, the success probability can be increased arbitrarily close to 1/2 (by elongating steps in the direction of the success domain), and the target success probability and adaptation factors do not require tuning.
To achieve a reasonable convergence rate bound and broaden the class of functions on which linear convergence is exhibited, one needs to find another potential function that may penalize a high condition number and replace the definitions of and accordingly. This point is left for future work.
Acknowledgement
We gratefully acknowledge support by Dagstuhl seminar 17191 “Theory of Randomized Search Heuristics”. We would like to thank Per Kristian Lehre, Carsten Witt, and Johannes Lengler for valuable discussions and advice on drift theory. Y. A. is supported by JSPS KAKENHI Grant Number 19H04179.
References
- [2] M.A. Abramson, C. Audet, J.E. Dennis Jr, and S. Le Digabel, OrthoMADS: A deterministic MADS instance with orthogonal directions, SIAM Journal on Optimization, 20 (2009), pp. 948–966.
- [3] Y. Akimoto, Analysis of a natural gradient algorithm on monotonic convex-quadratic-composite functions, in GECCO, 2012, pp. 1293–1300.
- [4] Y. Akimoto, A. Auger, and T. Glasmachers, Drift theory in continuous search spaces: expected hitting time of the (1+ 1)-ES with 1/5 success rule, in GECCO, 2018, pp. 801–808.
- [5] S. Alvernaz and J. Togelius, Autoencoder-augmented neuroevolution for visual doom playing, in IEEE CIG, 2017, pp. 1–8.
- [6] D. V. Arnold and N. Hansen, Active covariance matrix adaptation for the (1+1)-CMA-ES, in GECCO, 2010, pp. 385–392.
- [7] C. Audet and J.E. Dennis Jr, Mesh adaptive direct search algorithms for constrained optimization, SIAM Journal on Optimization, 17 (2006), pp. 188–217.
- [8] A. Auger and N. Hansen, Linear convergence on positively homogeneous functions of a comparison based step-size adaptive randomized search: the (1+1)-ES with generalized one-fifth success rule, 2013, https://arxiv.org/abs/1310.8397.
- [9] A. Auger and N. Hansen, Linear convergence of comparison-based step-size adaptive randomized search via stability of Markov chains, SIAM Journal on Optimization, 26 (2016), pp. 1589–1624.
- [10] A. S. Bandeira, K. Scheinberg, and L. N. Vicente, Convergence of trust-region methods based on probabilistic models, SIAM Journal on Optimization, 24 (2014), pp. 1238–1264.
- [11] B. Baritompa and M. Steel, Bounds on absorption times of directionally biased random sequences, Random Structures & Algorithms, 9 (1996), pp. 279–293.
- [12] P. Bontrager, A. Roy, J. Togelius, N. Memon, and A. Ross, DeepMasterPrints: Generating MasterPrints for Dictionary Attacks via Latent Variable Evolution, in IEEE BTAS, 2018, pp. 1–9.
- [13] S. Bubeck, Convex optimization: Algorithms and complexity, 2014, https://arxiv.org/abs/1405.4980.
- [14] C. Cartis and K. Scheinberg, Global convergence rate analysis of unconstrained optimization methods based on probabilistic models, Mathematical Programming, 169 (2018), pp. 337–375.
- [15] Y. S. Chow, On a strong law of large numbers for martingales, Ann. Math. Statist., 38 (1967), p. 610.
- [16] A. R. Conn, K. Scheinberg, and L. N. Vicente, Introduction to Derivative-Free Optimization, SIAM, 2009.
- [17] L. Devroye, The compound random search, in International Symposium on Systems Engineering and Analysis, 1972, pp. 195–110.
- [18] Y. Diouane, S. Gratton, and L. N. Vicente, Globally convergent evolution strategies, Mathematical Programming, 152 (2015), pp. 467–490.
- [19] B. Doerr and L. A. Goldberg, Adaptive drift analysis, Algorithmica, 65 (2013), pp. 224–250.
- [20] B. Doerr, D. Johannsen, and C. Winzen, Multiplicative drift analysis, Algorithmica, 64 (2012), pp. 673–697.
- [21] Y. Dong, H. Su, B. Wu, Z. Li, W. Liu, T. Zhang, and J. Zhu, Efficient decision-based black-box adversarial attacks on face recognition, in CVPR, 2019.
- [22] G. Fujii, M. Takahashi, and Y. Akimoto, CMA-ES-based structural topology optimization using a level set boundary expression—application to optical and carpet cloaks, Computer Methods in Applied Mechanics and Engineering, 332 (2018), pp. 624 – 643.
- [23] T. Geijtenbeek, M. Van De Panne, and A. F. Van Der Stappen, Flexible muscle-based locomotion for bipedal creatures, ACM Transactions on Graphics (TOG), 32 (2013), pp. 1–11.
- [24] T. Glasmachers, Global convergence of the (1+1) Evolution Strategy to a critical point, Evolutionary Computation, 28 (2020), pp. 27–53.
- [25] D. Golovin, J. Karro, G. Kochanski, C. Lee, X. Song, and Q. Zhang, Gradientless descent: High-dimensional zeroth-order optimization, in ICLR, 2020.
- [26] S. Gratton, C. W. Royer, L. N. Vicente, and Z. Zhang, Direct search based on probabilistic descent, SIAM Journal on Optimization, 25 (2015), pp. 1515–1541.
- [27] S. Gratton, C. W. Royer, L. N. Vicente, and Z. Zhang, Complexity and global rates of trust-region methods based on probabilistic models, IMA Journal of Numerical Analysis, 38 (2017), pp. 1579–1597.
- [28] D. Ha and J. Schmidhuber, Recurrent world models facilitate policy evolution, in NeurIPS, 2018, pp. 2450–2462.
- [29] B. Hajek, Hitting-time and occupation-time bounds implied by drift analysis with applications, Advances in Applied probability, 14 (1982), pp. 502–525.
- [30] N. Hansen, A. Auger, R. Ros, S. Finck, and P. Pošík, Comparing results of 31 algorithms from the black-box optimization benchmarking bbob-2009, in GECCO, 2010, pp. 1689–1696.
- [31] N. Hansen and A. Ostermeier, Completely derandomized self-adaptation in evolution strategies, Evolutionary Computation, 9 (2001), pp. 159–195.
- [32] J. He and X. Yao, Drift analysis and average time complexity of evolutionary algorithms, Artificial intelligence, 127 (2001), pp. 57–85.
- [33] J. He and X. Yao, A study of drift analysis for estimating computation time of evolutionary algorithms, Natural Computing, 3 (2004), pp. 21–35.
- [34] J. Jägersküpper, Analysis of a simple evolutionary algorithm for minimization in Euclidean spaces, Automata, Languages and Programming, 2003, pp. 188–188.
- [35] J. Jägersküpper, Rigorous runtime analysis of the (1+1)-ES: 1/5-rule and ellipsoidal fitness landscapes, in FOGA, 2005, pp. 260–281.
- [36] J. Jägersküpper, How the (1+1)-ES using isotropic mutations minimizes positive definite quadratic forms, Theoretical Computer Science, 361 (2006), pp. 38–56.
- [37] J. Jägersküpper, Algorithmic analysis of a basic evolutionary algorithm for continuous optimization, Theoretical Computer Science, 379 (2007), pp. 329–347.
- [38] S. Kern, S. D. Müller, N. Hansen, D. Büche, J. Ocenasek, and P. Koumoutsakos, Learning probability distributions in continuous evolutionary algorithms–a comparative review, Natural Computing, 3 (2004), pp. 77–112.
- [39] J. Konečnỳ and P. Richtárik, Simple complexity analysis of simplified direct search, 2014, https://arxiv.org/abs/1410.0390.
- [40] I. Kriest, V. Sauerland, S. Khatiwala, A. Srivastav, and A. Oschlies, Calibrating a global three-dimensional biogeochemical ocean model (mops-1.0), Geoscientific Model Development, 10 (2017), p. 127.
- [41] J. Larson, M. Menickelly, and S. M. Wild, Derivative-free optimization methods, Acta Numerica, 28 (2019), pp. 287–404.
- [42] P. K. Lehre and C. Witt, General drift analysis with tail bounds, 2013, https://arxiv.org/abs/1307.2559.
- [43] J. Lengler, Drift analysis, in Theory of Evolutionary Computation, Springer, 2020, pp. 89–131.
- [44] J. Lengler and A. Steger, Drift analysis and evolutionary algorithms revisited, 2016, https://arxiv.org/abs/1608.03226.
- [45] P. MacAlpine, S. Barrett, D. Urieli, V. Vu, and P. Stone, Design and optimization of an omnidirectional humanoid walk: A winning approach at the RoboCup 2011 3D simulation competition, in AAAI, 2012.
- [46] B. Mitavskiy, J. Rowe, and C. Cannings, Theoretical analysis of local search strategies to optimize network communication subject to preserving the total number of links, International Journal of Intelligent Computing and Cybernetics, 2 (2009), pp. 243–284.
- [47] D. Morinaga and Y. Akimoto, Generalized drift analysis in continuous domain: linear convergence of (1+ 1)-ES on strongly convex functions with lipschitz continuous gradients, in FOGA, 2019, pp. 13–24.
- [48] A. Nemirovski, Information-based complexity of convex programming, Lecture Notes, (1995).
- [49] Y. Nesterov, Lectures on convex optimization, vol. 137, Springer, 2018.
- [50] C. Paquette and K. Scheinberg, A stochastic line search method with convergence rate analysis, 2018, https://arxiv.org/abs/1807.07994.
- [51] I. Rechenberg, Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution, Frommann-Holzboog, 1973.
- [52] I. Rechenberg, Evolutionsstrategie '94, Frommann-Holzboog, 1994.
- [53] L. M. Rios and N. V. Sahinidis, Derivative-free optimization: a review of algorithms and comparison of software implementations, Journal of Global Optimization, 56 (2013), pp. 1247–1293.
- [54] M. Schumer and K. Steiglitz, Adaptive step size random search, IEEE Transactions on Automatic Control, 13 (1968), pp. 270–276.
- [55] S. U. Stich, C. L. Muller, and B. Gartner, Optimization of convex functions with random pursuit, SIAM Journal on Optimization, 23 (2013), pp. 1284–1309.
- [56] S. U. Stich, C. L. Müller, and B. Gärtner, Variable metric random pursuit, Mathematical Programming, 156 (2016), pp. 549–579.
- [57] J. Uhlendorf, A. Miermont, T. Delaveau, G. Charvin, F. Fages, S. Bottani, G. Batt, and P. Hersen, Long-term model predictive control of gene expression at the population and single-cell levels, Proceedings of the National Academy of Sciences, 109 (2012), pp. 14271–14276.
- [58] V. Volz, J. Schrum, J. Liu, S. M. Lucas, A. Smith, and S. Risi, Evolving Mario levels in the latent space of a deep convolutional generative adversarial network, in GECCO, 2018, pp. 221–228.
Appendix A Some Numerical Results
We present experiments with five algorithms on two convex quadratic functions. We compare the (1+1)-ES, the (1+1)-CMA-ES, simplified direct search [39], random pursuit [55], and gradientless descent [25].
All algorithms were started at the same initial search point. We implemented the algorithms as follows, with their parameters tuned where necessary. The ES always uses the same fixed setting of the target success probability and step-size adaptation factors. For simplified direct search we set a small constant in the sufficient decrease condition, and we employed the standard basis vectors as well as their negatives as candidate directions; in each iteration we looped over the set of directions in random order, which greatly boosted performance over a fixed order. Random pursuit was implemented with a golden section line search with a rather loose target precision relative to either the initial step size or the length of the previous step. For gradientless descent we used the initial step size as the maximal step size and defined a fixed target precision, which is reached by the ES in all cases. The experiments are designed to demonstrate several different effects: (a) We perform all experiments in two different dimensions to investigate dimension-dependent effects. (b) We investigate best-case performance by running the algorithms on the spherical function, i.e., on the separable convex quadratic function with minimal condition number. (c) We investigate the dependency of the performance on initial parameter settings by repeating the same experiment as above, but with a much too small initial step size. (d) We investigate the dependence on problem difficulty by running the algorithms on an ellipsoid problem with a moderate condition number, the eigenvalues of the Hessian being evenly distributed on a log scale, and with the same initial step size as in the first experiment. Each experiment uses a fixed budget of function evaluations.
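For concreteness, a sketch of the ellipsoid test function described above; the function name and parameterization are ours.

```python
import numpy as np

def ellipsoid(d, kappa):
    """Convex quadratic test function whose Hessian has condition number
    kappa, with coefficients evenly distributed on a log scale between
    1 and kappa, as in the experimental setup described above."""
    coefficients = kappa ** np.linspace(0.0, 1.0, d)
    return lambda x: float(np.sum(coefficients * np.asarray(x)**2))

f_sphere = ellipsoid(d=10, kappa=1.0)     # condition number 1: the sphere
f_moderate = ellipsoid(d=10, kappa=1e3)   # moderately ill-conditioned
print(f_sphere(np.ones(10)), f_moderate(np.ones(10)))
```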
The experimental results are presented in Figure 3.
Interpretation. We observe only moderate dimension-dependent effects, besides the expected linear increase of the runtime. We see robust performance of the ES, in particular with covariance matrix adaptation. The second experiment demonstrates the practical importance of the ability to grow the step size: the ES is essentially unaffected by wrong initial parameter settings, while gradientless descent and simplified direct search are affected (which can be understood directly from the algorithms themselves). This property does not show up in convergence rates and is therefore often (but not always) neglected in algorithm design. The last experiment clearly demonstrates the benefit of variable-metric methods like CMA-ES. It should be noted that variable-metric techniques can be incorporated into most existing algorithms. This is rarely done, though, with random pursuit being a notable exception [56].
Appendix B Proofs
B.1 Proof of Lemma 2.10
Since is invariant to , without loss of generality we assume in this proof. Inequality Eq. 7 implies that , meaning that . Since is the th root of the volume of the left-hand side of the above relation, we find . Analogously, we obtain and . From these inequalities, we obtain and . This implies A1 for . A2 is immediately implied by Proposition 2.8. This completes the proof.
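The volume step in the proof above is of the following standard form, stated here with hypothetical radii $r \le R$ since the specific quantities are elided: if a set $S \subseteq \mathbb{R}^d$ satisfies $B(x^\star, r) \subseteq S \subseteq B(x^\star, R)$, then monotonicity of the Lebesgue measure and the scaling $\operatorname{Vol}(B(x^\star, \rho)) = \rho^d \operatorname{Vol}(B(0,1))$ give
\[
r \operatorname{Vol}(B(0,1))^{1/d} \;\le\; \operatorname{Vol}(S)^{1/d} \;\le\; R \operatorname{Vol}(B(0,1))^{1/d},
\]
so the $d$th root of the volume of $S$ is sandwiched between the radii of the inscribed and circumscribed balls, up to the constant factor $\operatorname{Vol}(B(0,1))^{1/d}$.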
B.2 Proof of Lemma 2.11
We first prove that A1 holds for and with and , and that these constants are finite.
It is easy to see that the spatial suboptimality function is proportional to . Let for some . Then, is also a homogeneous function. Since it is homogeneous, A1 reduces to the existence of open and closed balls with radii and satisfying the conditions described in the assumption with . Such constants are obtained by and .
Due to the continuity of , there exists an open ball around such that for all . Then, it holds that for all . This implies that is no smaller than the radius of , which is positive. Hence, .
We show the finiteness of by contradiction. Suppose . Then there is a direction such that with arbitrarily large . Since is homogeneous, we have , and this must hold for any . This implies , which contradicts the assumption that is the unique global optimum. Hence, .
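The contradiction step above uses positive homogeneity as follows (a sketch, assuming the optimum sits at the origin with optimal value zero and writing $\alpha > 0$ for the degree of homogeneity): if $f(d) = 0$ for some direction $d \neq 0$, then
\[
f(t d) \;=\; t^{\alpha} f(d) \;=\; 0 \qquad \text{for all } t > 0,
\]
so every point on the ray $\{t d : t > 0\}$ would attain the optimal value, contradicting the uniqueness of the global optimum.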
The above argument proves that A1 holds with the above constants for and . Proposition 2.8 proves A2.
B.3 Proof of Proposition 4.1
For a given , there is a closed ball such that , see Figure 2. We have
The integral is maximized if the ball is centered at . By a variable change (),
Here we used for any , which is proven in Lemma B.1 below. The right-most side (RMS) of the above inequality is independent of . This proves Eq. 18.
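The maximization step above rests on the fact that a radially decreasing density assigns the most mass to a ball when the ball is centered at its mode. For the standard Gaussian density $\varphi$ on $\mathbb{R}^d$ (a restatement under our reading of the elided notation, with arbitrary center $c$ and radius $\rho > 0$):
\[
\int_{B(c, \rho)} \varphi(z) \,\mathrm{d}z \;\le\; \int_{B(0, \rho)} \varphi(z) \,\mathrm{d}z .
\]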
Similarly, there are balls and such that . We have
The integral is minimized if the ball is at the opposite side of on the ball , see Figure 2. By a variable change (moving to the origin) and letting ,
Here we used for any and (Lemma B.1). The RMS of the above inequality is independent of , as its value is constant over all unit vectors . Replacing with , we obtain Eq. 19.
Lemma B.1.
For all , and .
Proof B.2.
For , we have and . Since and , we have . Therefore, we have and . Then we obtain . With this inequality we have
Analogously, we obtain . Taking the integral over , we obtain the second statement.
B.4 Proof of Lemma 4.2
The upper bound of given in Eq. 18 is strictly decreasing in and converges to zero as goes to infinity. This guarantees the existence of as a finite value. The existence of is obvious under A2. A1 guarantees that there exists an open ball with radius such that . Then, analogously to the proof of Proposition 4.1, the success probability with rate is lower bounded by
(27)
The probability is independent of , positive, and continuous in . Therefore the minimum is attained. This completes the proof.
B.5 Proof of Proposition 4.3
First, we remark that is equivalent to the condition . If or , both sides of Eq. 23 are zero, hence the inequality is trivial. In the following we assume that .
For the sake of simplicity we introduce . We rewrite the potential function as
(28)
The potential function at time can be written as
We want to estimate the conditional expectation
(29)
We partition the possible values of into three sets: first, the set of such that ( is small); second, the set of such that ( is large); and last, the set of such that (reasonable ). In the following, we bound Eq. 29 in each of the three cases; the final bound is then the minimum of the three case-wise bounds.
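Schematically, writing $\Delta_t$ for the one-step change of the potential function and $C_{\mathrm{small}}, C_{\mathrm{large}}, C_{\mathrm{reas}}$ as hypothetical names for the guaranteed decreases derived in the three cases below, the case distinction combines into
\[
\mathbb{E}\bigl[\Delta_t \mid \mathcal{F}_t\bigr] \;\le\; -\min\bigl\{ C_{\mathrm{small}},\, C_{\mathrm{large}},\, C_{\mathrm{reas}} \bigr\},
\]
since whichever case occurs, the realized decrease is at least the smallest of the three case-wise guarantees.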
Reasonable case: . In case of success, where , we have , implying that is always . Similarly, in case of failure, and we find that is always zero. We rearrange and into
Then, the one-step change is upper bounded by
(30)
The truncated one-step change is upper bounded by
(31)
To compute the expectation of the above upper bound, we need the expectation of the maximum of and . Let and then . Applying this and taking the conditional expectation, a trivial upper bound for the conditional expectation of is times the probability of being no greater than . The latter condition is equivalent to , corresponding to successes with rate or better. That is,
(32)
Note also that the expected value of is the success probability, namely, . We obtain an upper bound for the conditional expectation of in the case of reasonable as
(33)
Small case: . If , the second summand in Eq. 28 is positive. Moreover, if , we have , and hence the second summand in Eq. 28 is positive for as well. If , any regime can happen. Then,
On the RMS of the above equality, the first term is guaranteed to be non-positive since . The second and third terms are non-positive as well since and . Replacing the indicator with in the last term provides an upper bound. Altogether, we obtain
Note that the RHS is larger than since it is lower bounded by and . Then, the conditional expectation of is
(34)
Here we used for the first inequality, for the second inequality, and for the last two equalities.
Large case: . Since , the third summand in Eq. 28 is positive in both and . For the second summand in Eq. 28, recall that since we have assumed that . Hence, for , the second summand in Eq. 28 is zero. Also, , and thus for , the second summand in Eq. 28 also equals . We obtain
The first term on the RHS is guaranteed to be non-positive since , yielding . On the other hand,
where the last inequality comes from the prerequisite . Hence,
Then, the conditional expectation of is
(35)
Here we used .
Conclusion. Eqs. 33, 34 and 35 together cover all possible cases, and we hence obtain Eq. 24.
Finally, we prove the positivity of for an arbitrary . Lemma 4.2 guarantees the positivity of for any choice of since for any and . Therefore, for any and . Moreover, for a sufficiently small , is strictly positive for any . Hence, one can take a sufficiently small that satisfies . The first term in the minimum in Eq. 24 is then positive, and the second term therein is clearly positive for . This completes the proof.
B.6 Proof of Proposition 4.15
Consider . We set . We bound from below by taking a specific value for instead of considering for . Our candidate is , where and . It holds that , and hence , from which we obtain .
We bound the terms in Eq. 24 as: and . Therefore, we have . Note that one can take since the only condition is . To obtain , it is sufficient to show for .
Fix and independently of . In light of Lemma 3.1 in [4], is continuous and strictly decreasing from to for all . Therefore, for each there exists an inverse map . Define and for each . It follows from Lemma 3.2 in [4] that is also strictly decreasing, hence invertible. The existence of is also proved in [4]. We let and . Because of the pointwise convergence of to , we have and for . Hence, for any and with , there exists such that for all we have and . We now fix and in this way. This amounts to selecting and .
We have since , and hence, according to Lemma 3.2 in [4], we have
where the equality follows from the pointwise convergence of to and the continuity of and . (Footnote: Let be a sequence of continuous functions on and a continuous function that is the pointwise limit of this sequence. Since they are continuous, the minimizers of and in a compact set exist. Let and , where the minimum is taken over and we pick one minimizer if more than one exists. It is easy to see that , hence . Let be the sub-sequence of indices such that . Since is a bounded sequence, the Bolzano–Weierstraß theorem provides a convergent sub-sequence, whose limit we denote by . Of course, . Due to the continuity of and the pointwise convergence to , we have . Therefore, . Since is the minimizer of in and , it must hold that . Hence, .) This completes the proof.