

Global Linear Convergence of Evolution Strategies on More than Smooth Strongly Convex Functions

Youhei Akimoto, Faculty of Engineering, Information and Systems, University of Tsukuba; RIKEN AIP, Tsukuba, Japan (akimoto@cs.tsukuba.ac.jp)    Anne Auger, Inria and CMAP, Ecole Polytechnique, IP Paris, France (anne.auger@inria.fr)    Tobias Glasmachers, Institute for Neural Computation, Ruhr-University Bochum, Bochum, Germany (tobias.glasmachers@ini.rub.de)    Daiki Morinaga, Department of Computer Science, University of Tsukuba; RIKEN AIP, Tsukuba, Japan (morinaga@bbo.cs.tsukuba.ac.jp)
Abstract

Evolution strategies (ESs) are zeroth-order stochastic black-box optimization heuristics invariant to monotonic transformations of the objective function. They evolve a multivariate normal distribution, from which candidate solutions are generated. Among different variants, CMA-ES is nowadays recognized as one of the state-of-the-art zeroth-order optimizers for difficult problems. Despite ample empirical evidence that ESs with a step-size control mechanism converge linearly, theoretical guarantees of linear convergence of ESs have been established only on limited classes of functions. In particular, theoretical results on convex functions are missing, where zeroth-order and also first-order optimization methods are often analyzed. In this paper, we establish almost sure linear convergence and a bound on the expected hitting time of an ES family, namely the $(1+1)_{\kappa}$-ES, which includes the (1+1)-ES with (generalized) one-fifth success rule and an abstract covariance matrix adaptation with bounded condition number, on a broad class of functions. The analysis holds for monotonic transformations of positively homogeneous functions and of quadratically bounded functions, the latter of which particularly includes monotonic transformations of strongly convex functions with Lipschitz continuous gradient. As far as the authors know, this is the first work that proves linear convergence of ESs on such a broad class of functions.

keywords:
Evolution strategies, Randomized derivative free optimization, Black-box optimization, Linear convergence, Stochastic algorithms
AMS subject classifications:

65K05, 90C25, 90C26, 90C56, 90C59

1 Introduction

We consider the unconstrained minimization of an objective function $f:\mathbb{R}^{d}\to\mathbb{R}$ without the use of derivatives, where an optimization solver sees $f$ as a zeroth-order black-box oracle [48, 13, 49]. This setting is also referred to as derivative-free optimization [16]. Such problems can be advantageously approached by randomized algorithms, which are typically more robust to noise, non-convexity, and irregularities of the objective function than deterministic algorithms. There has recently been a vivid interest in randomized derivative-free algorithms, giving rise to several theoretical studies of randomized direct search methods [26], trust region [10, 27] and model-based methods [14, 50]. We refer to [41] for an in-depth survey including the references of this paragraph and additional ones.

In this context, we investigate Evolution Strategies (ESs), which are among the oldest randomized derivative-free or zeroth-order black-box methods [17, 54, 51]. They are widely used in applications in different domains [57, 40, 45, 23, 5, 28, 58, 12, 21, 22]. Notably, a specific ES called covariance matrix adaptation ES (CMA-ES) [31] is among the best solvers for difficult black-box problems. It is affine-invariant and implements complex adaptation mechanisms for the sampling covariance matrix and the step-size. It performs well on many ill-conditioned, non-convex, non-smooth, and non-separable problems [30, 53]. ESs are known to be difficult to analyze. Yet, given their importance in practice, it is essential to study them from a theoretical convergence perspective.

We focus on the arguably simplest and oldest adaptive ES, denoted (1+1)-ES. It samples a candidate solution from a Gaussian distribution whose step-size (standard deviation) is adapted. The candidate solution is accepted if and only if it is better than the current one (see the pseudo-code in Algorithm 1). The algorithm shares some similarities with simplified direct search, whose complexity analysis has been presented in [39]. Yet the (1+1)-ES is comparison-based and thus invariant to strictly increasing transformations of the objective function. Simplified direct search can be thought of as a variant of mesh adaptive direct search [7, 2]. Arguably, in contrast to direct search, a sufficient decrease condition cannot be guaranteed. This causes some difficulties for the analysis. The (1+1)-ES is rotationally invariant, while direct search creates candidate solutions along a predefined set of vectors. While CMA-ES should always be preferred over the (1+1)-ES variant analyzed here for practical applications, this latter variant achieves faster linear convergence on well-conditioned problems when compared to algorithms with established complexity analysis (see [55, Table 6.3 and Figure 6.1] and [9, Figure B.4], where the random pursuit algorithm and the (1+1)-ES algorithms are compared, and also Appendix A).

Prior theoretical studies of the (1+1)-ES with $1/5$ success rule have established global linear convergence on differentiable positively homogeneous functions (composed with a strictly increasing function) with a single optimum [9, 8]. Those results establish almost sure linear convergence from all initial states. They however do not provide the dependency of the convergence rate with respect to the dimension. A more specific study on the sphere function $f(x)=\frac{1}{2}\|x\|^{2}$ establishes lower and upper bounds on the expected hitting time of an $\epsilon$-ball around the optimum in $\Theta(\log(d\|m_{0}-x^{*}\|/\epsilon))$, where $x^{*}$ is the optimum of the function, $m_{0}$ is the initial solution, and $d$ is the problem dimension [4]. Prior to that, a variant of the (1+1)-ES with one-fifth success rule had been analyzed on the sphere and certain convex quadratic functions, establishing bounds on the hitting time, holding with overwhelming probability, in $\Theta(\log(\kappa_{f}d\|m_{0}-x^{*}\|/\epsilon))$, where $\kappa_{f}$ is the condition number (the ratio between the greatest and smallest eigenvalues) of the Hessian [34, 36, 37, 35]. Recently, the class of functions on which convergence of the (1+1)-ES is proven has been extended to continuously differentiable functions. This analysis does not address the question of linear convergence, focusing only on convergence as such, which is possibly sublinear [24].

Our main contribution is as follows. For a generalized version of the (1+1)-ES with one-fifth success rule, we prove bounds on the expected hitting time akin to linear convergence, i.e., hitting an $\epsilon$-ball in $\Theta(\log(\|m_{0}-x^{*}\|/\epsilon))$ iterations, on a quite general class of functions. This class includes all composites of Lipschitz-smooth strongly convex functions with a strictly increasing transformation. This latter transformation allows us to include some discontinuous functions, and even functions with non-smooth level sets. We additionally deduce linear convergence with probability one. Our analysis relies on finding an appropriate Lyapunov function with lower- and upper-bounded expected drift. It builds on classical fundamental ideas presented by Hajek [29] and widely used to analyze stochastic hill-climbing algorithms on discrete search spaces [43].

Notation

Throughout the paper, we use the following notation. The set of natural numbers $\{1,2,\ldots\}$ is denoted $\mathbb{N}$. Open, closed, and left-open intervals on $\mathbb{R}$ are denoted by $(\cdot,\cdot)$, $[\cdot,\cdot]$, and $(\cdot,\cdot]$, respectively. The set of strictly positive real numbers is denoted by $\mathbb{R}_{>}$. The Euclidean norm on $\mathbb{R}^{d}$ is denoted by $\lVert\cdot\rVert$. Open and closed balls with center $c$ and radius $r$ are denoted as $\mathcal{B}(c,r)=\{x\in\mathbb{R}^{d}:\lVert x-c\rVert<r\}$ and $\bar{\mathcal{B}}(c,r)=\{x\in\mathbb{R}^{d}:\lVert x-c\rVert\leqslant r\}$, respectively. The Lebesgue measures on $\mathbb{R}$ and $\mathbb{R}^{d}$ are both denoted by the same symbol $\mu$. A multivariate normal distribution with mean $m$ and covariance matrix $\Sigma$ is denoted by $\mathcal{N}(m,\Sigma)$. Its probability measure and its probability density with respect to the Lebesgue measure are denoted by $\Phi(\cdot;m,\Sigma)$ and $\varphi(\cdot;m,\Sigma)$, respectively. The indicator function of a set or condition $C$ is denoted by $1\{C\}$. We use the Bachmann-Landau notations $f\in o(g)$, $O(g)$, $\omega(g)$, $\Omega(g)$ and $\Theta(g)$ to mean $\limsup_{\epsilon\to 0}f(\epsilon)/g(\epsilon)=0$, $\limsup_{\epsilon\to 0}f(\epsilon)/g(\epsilon)\leqslant C$ for some $C>0$, $\liminf_{\epsilon\to 0}f(\epsilon)/g(\epsilon)=\infty$, $\liminf_{\epsilon\to 0}f(\epsilon)/g(\epsilon)\geqslant C$ for some $C>0$, and $f\in O(g)$ and $f\in\Omega(g)$, respectively.

2 Algorithm, Definitions and Objective Function Assumptions

2.1 Algorithm: (1+1)-ES with Success-based Step-size Control

We analyze a generalized version of the (1+1)-ES with one-fifth success rule presented in Algorithm 1, which implements one of the oldest approaches to adapt the step-size in randomized optimization methods [51, 17, 54]. The specific implementation was proposed in [38]. At each iteration, a candidate solution $x_{t}$ is sampled. It is centered at the current incumbent $m_{t}$ and follows a multivariate normal distribution with mean vector $m_{t}$ and covariance matrix $\sigma_{t}^{2}\Sigma_{t}$ (with $\Sigma_{t}=I_{d}$, the identity matrix, in the plain (1+1)-ES). The candidate solution is accepted, that is, $m_{t+1}$ becomes $x_{t}$, if and only if $x_{t}$ is at least as good as $m_{t}$ (i.e., $f(x_{t})\leqslant f(m_{t})$). In this case, we say that the candidate solution is successful. The step-size $\sigma_{t}$ is adapted so as to maintain the probability of success approximately at the target success probability $p_{\mathrm{target}}:=\frac{\log(1/\alpha_{\downarrow})}{\log(\alpha_{\uparrow}/\alpha_{\downarrow})}$. To do so, the step-size is multiplied by the increase factor $\alpha_{\uparrow}>1$ in case of success (which is an indication that the step-size is likely to be too small) and multiplied by the decrease factor $\alpha_{\downarrow}<1$ otherwise. The covariance matrix $\Sigma_{t}$ of the sampling distribution of candidate solutions is adapted in the set $\mathcal{S}_{\kappa}$ of positive-definite symmetric matrices with determinant $\det(\Sigma)=1$ and condition number $\operatorname{Cond}(\Sigma)\leqslant\kappa$. We do not assume any specific update mechanism for $\Sigma$, but we assume that the update of $\Sigma$ is invariant to any strictly increasing transformation of $f$. We call such an update comparison-based (see Lines 7 and 11 of Algorithm 1). Then, our algorithm behaves identically on $f$ and on $g\circ f$ for every strictly increasing function $g:\mathbb{R}\to\mathbb{R}$ (i.e., $g(s)\lesseqgtr g(t)\Leftrightarrow s\lesseqgtr t$). This defines a class of comparison-based randomized algorithms, which we denote by $(1+1)\text{-ES}_{\kappa}$. For $\kappa=1$, it is simply denoted (1+1)-ES.

Algorithm 1 $(1+1)\text{-ES}_{\kappa}$ with success-based step-size adaptation
1:  input $m_{0}\in\mathbb{R}^{d}$, $\sigma_{0}>0$, $\Sigma_{0}=I$, $f:\mathbb{R}^{d}\to\mathbb{R}$, parameters $\alpha_{\uparrow}>1>\alpha_{\downarrow}>0$
2:  for $t=0,1,2,\dots$, until stopping criterion is met do
3:     sample $x_{t}\sim m_{t}+\sigma_{t}\mathcal{N}(0,\Sigma_{t})$
4:     if $f(x_{t})\leqslant f(m_{t})$ then
5:        $m_{t+1}\leftarrow x_{t}$ ▷ move to the better solution
6:        $\sigma_{t+1}\leftarrow\sigma_{t}\alpha_{\uparrow}$ ▷ increase the step-size
7:        $\Sigma_{t+1}\in\mathcal{S}_{\kappa}$ ▷ adapt the covariance matrix
8:     else
9:        $m_{t+1}\leftarrow m_{t}$ ▷ stay where we are
10:       $\sigma_{t+1}\leftarrow\sigma_{t}\alpha_{\downarrow}$ ▷ decrease the step-size
11:       $\Sigma_{t+1}\in\mathcal{S}_{\kappa}$ ▷ adapt the covariance matrix

Note that $\alpha_{\uparrow}$ and $\alpha_{\downarrow}$ are not meant to be tuned depending on the function properties. How to choose such constants for $\Sigma_{t}=I_{d}$ is well known and is related to the so-called evolution window [52]. In practice, $\alpha_{\downarrow}=\alpha_{\uparrow}^{-1/4}$ is the most commonly used setting, which leads to $p_{\mathrm{target}}=1/5$. This setting has been shown to be close to optimal, yielding a nearly optimal (linear) convergence rate on the sphere function [51, 17]. Hereunder, we write $\theta=(m,\sigma,\Sigma)$ for the state of the algorithm, $\theta_{t}=(m_{t},\sigma_{t},\Sigma_{t})$ for the state at iteration $t$, and denote the state-space by $\Theta$.
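To make the adaptation loop concrete, the following Python sketch implements Algorithm 1 in the isotropic case $\kappa=1$ (i.e., $\Sigma_{t}=I_{d}$ kept fixed), with the common setting $\alpha_{\downarrow}=\alpha_{\uparrow}^{-1/4}$. The function name, default parameter values, and the stopping rule are illustrative choices, not part of the algorithm specification.

import numpy as np

rng = np.random.default_rng(0)

def one_plus_one_es(f, m0, sigma0, alpha_up=1.1, max_iter=10_000, target=1e-10):
    """(1+1)-ES with success-based step-size adaptation (Algorithm 1, Sigma_t = I)."""
    alpha_down = alpha_up ** (-1 / 4)  # one-fifth success rule: p_target = 1/5
    m, sigma = np.asarray(m0, dtype=float), float(sigma0)
    fm = f(m)
    for t in range(max_iter):
        x = m + sigma * rng.standard_normal(m.shape)  # line 3: sample candidate
        fx = f(x)
        if fx <= fm:  # lines 4-6: success, accept and enlarge the step-size
            m, fm = x, fx
            sigma *= alpha_up
        else:  # lines 8-10: failure, keep m and shrink the step-size
            sigma *= alpha_down
        if np.linalg.norm(m) <= target:  # illustrative stop; assumes the optimum is 0
            return m, sigma, t + 1
    return m, sigma, max_iter

# Sphere function in dimension 10: the hitting time of the 1e-10 ball scales
# with log(1/epsilon), in line with the bounds discussed in this paper.
m, sigma, t = one_plus_one_es(lambda x: 0.5 * x @ x, m0=np.ones(10), sigma0=1.0)
print(t, np.linalg.norm(m))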

Figure 1 shows typical runs of the (1+1)-ES and of a version of the $(1+1)\text{-ES}_{\kappa}$ proposed in [6], known as the (1+1)-CMA-ES, on a $10$-dimensional ellipsoidal function with different condition numbers $\kappa_{f}$ of the Hessian. It is empirically observed that, if the objective function is convex quadratic, $\Sigma_{t}$ in the (1+1)-CMA-ES approaches the inverse of the Hessian $\nabla^{2}f(m_{t})$ of the objective function up to a scalar factor. The runtime of the (1+1)-ES scales linearly with $\kappa_{f}$ (notice the logarithmic scale of the horizontal axis), while the runtime of the (1+1)-CMA-ES suffers only an additive penalty, roughly proportional to the logarithm of $\kappa_{f}$. Once the inverse Hessian is well approximated by $\Sigma$ (up to a scalar factor), the algorithm approaches the global optimum geometrically at the same rate for different values of $\kappa_{f}$.

In our analysis, we do not assume any specific update mechanism for $\Sigma$; hence $\Sigma_{t}$ does not necessarily behave as shown in Figure 1. Our analysis is therefore a worst-case analysis (for the upper bound on the runtime) and a best-case analysis (for the lower bound on the runtime) over the algorithms in $(1+1)\text{-ES}_{\kappa}$.

Figure 1: Convergence of the (1+1)-ES (left) and the (1+1)-CMA-ES (middle) on the $10$-dimensional ellipsoidal function $f(x)=\frac{1}{2}\sum_{i=1}^{d}\kappa_{f}^{\frac{i-1}{d-1}}x_{i}^{2}$ with $\kappa_{f}=10^{0},10^{1},\dots,10^{6}$. The y-axis displays the distance to the optimum (and not the function value). We employ the covariance matrix adaptation mechanism proposed by [6], where $\sigma$ is adapted as in Algorithm 1 with $\alpha_{\uparrow}=e^{0.1}$ and $\alpha_{\downarrow}=e^{-0.025}$. Note the logarithmic scale of the time axis of the left plot vs. the linear time axis of the middle plot.
Right: Three runs of the (1+1)-ES ($\alpha_{\uparrow}=e^{0.1}$ and $\alpha_{\downarrow}=e^{-0.025}$) on the $10$-dimensional spherical function $f(x)=\frac{1}{2}\lVert x-x^{*}\rVert^{2}$ with initial step-size $\sigma_{0}=10^{-4}$, $1$, and $10^{4}$ (in blue, red, green, respectively). Plotted are the distance to the optimum (dotted line), the step-size (dashed line), and the potential function $V(\theta)$ defined in (22) (solid line) with $v=4/d$, $\ell=\alpha_{\uparrow}^{-10}$, and $u=\alpha_{\downarrow}^{-10}$.

2.2 Preliminary Definitions

2.2.1 Spatial Suboptimality Function

The algorithms studied in this paper are comparison-based and thus invariant to strictly increasing transformations of $f$. If the convergence of the algorithms is measured in terms of $f$, say by investigating the convergence or hitting time of the sequence $f(m_{t})$, this will not reflect the invariance to monotonic transformations of $f$, because the first iteration $t_{0}$ such that $f(m_{t_{0}})\leqslant\epsilon$ is not equal to the first iteration $t_{0}^{\prime}$ such that $g(f(m_{t_{0}^{\prime}}))\leqslant\epsilon$ for some $\epsilon>0$. For this reason, we introduce a quality measure called the spatial suboptimality function [24]. It is the $d$-th root of the volume of the sub-levelset where the function value is better than or equal to $f(x)$:

Definition 2.1 (Spatial Suboptimality Function).

Let $f:\mathbb{R}^{d}\to\mathbb{R}$ be a measurable function with respect to the Borel $\sigma$-algebra of $\mathbb{R}^{d}$ (simply referred to as a measurable function in the sequel). Then the spatial suboptimality function $f_{\mu}:\mathbb{R}^{d}\to[0,+\infty]$ is defined as

(1) $f_{\mu}(x)=\sqrt[d]{\mu\left(f^{-1}\left((-\infty,f(x)]\right)\right)}=\sqrt[d]{\mu\left(\left\{y\in\mathbb{R}^{d}\,\middle|\,f(y)\leqslant f(x)\right\}\right)}\enspace.$

We remark that for any $f$, the suboptimality function $f_{\mu}$ is greater than or equal to zero. For any $f$ and any strictly increasing function $g:\mathrm{Im}(f)\to\mathbb{R}$, $f$ and its composite $g\circ f$ have the same spatial suboptimality function, so that the first hitting time of $f_{\mu}$ below $\epsilon>0$ is the same for $f$ and $g\circ f$. Moreover, there exists a strictly increasing function $g$ such that $f_{\mu}(x)=g(f(x))$ holds $\mu$-almost everywhere [24, Lemma 1].

We will investigate the expected first hitting time of $\lVert m_{t}-x^{*}\rVert$ to $[0,\epsilon]$ for $\epsilon>0$. For this, we will bound the first hitting time of $\lVert m_{t}-x^{*}\rVert$ to $\epsilon$ by the first hitting time of $f_{\mu}(m_{t})$ to a constant times $\epsilon$. To understand why, consider first a strictly convex quadratic function $f$ with Hessian $H$ and optimal solution $x^{*}$. We have $f_{\mu}(x)=V_{d}\big[2(f(x)-f(x^{*}))/\det(H)^{1/d}\big]^{1/2}$ for all $x\in\mathbb{R}^{d}$, where $V_{d}=\pi^{1/2}/\Gamma^{1/d}(d/2+1)$ is the $d$-th root of the volume of the $d$-dimensional unit hyper-sphere [3]. This implies that the first hitting time of $f_{\mu}(m_{t})$ translates to the first hitting time of $\sqrt{f(m_{t})-f(x^{*})}$. We have $\sqrt{\lambda_{\min}/2}\lVert x-x^{*}\rVert\leqslant\sqrt{f(x)-f(x^{*})}\leqslant\sqrt{\lambda_{\max}/2}\lVert x-x^{*}\rVert$, where $\lambda_{\min}$ and $\lambda_{\max}$ are the minimal and maximal eigenvalues of $H$. For example, for $f(x)=\lVert x-x^{*}\rVert^{2}+1$, the first hitting time of $f_{\mu}(m_{t})$ hence also translates to the first hitting time of $\lVert m_{t}-x^{*}\rVert$. More generally, we will formalize an assumption on $f$ later on (Assumption A1) that allows us to bound $\lVert x-x^{*}\rVert$ from above and below by constants times $f_{\mu}(x)$ (see (6)), implying that the first hitting time of $\lVert m_{t}-x^{*}\rVert$ to $\epsilon$ is bounded by that of $f_{\mu}(m_{t})$ to $\epsilon$, times a constant.
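As a sanity check of the closed-form expression above, the following Python sketch (illustrative values for $H$ and $x$, not from the paper) estimates the sub-levelset volume of a two-dimensional convex quadratic by rejection sampling and compares its $d$-th root with $V_{d}\big[2(f(x)-f(x^{*}))/\det(H)^{1/d}\big]^{1/2}$.

import numpy as np
from scipy.special import gamma

rng = np.random.default_rng(0)
d = 2
H = np.array([[4.0, 1.0], [1.0, 2.0]])  # positive definite Hessian
f = lambda y: 0.5 * y @ H @ y           # minimum x* = 0 with f(x*) = 0
x = np.array([1.0, -0.5])
c = f(x)                                # level of the sub-levelset

# The sub-levelset is contained in the ball of radius sqrt(2c/lambda_min):
R = np.sqrt(2 * c / np.linalg.eigvalsh(H)[0])
Y = rng.uniform(-R, R, size=(2_000_000, d))  # rejection sampling in a box
inside = 0.5 * np.einsum('ij,jk,ik->i', Y, H, Y) <= c
vol = (2 * R) ** d * inside.mean()           # estimate of mu({y : f(y) <= f(x)})

V_d = np.pi ** 0.5 / gamma(d / 2 + 1) ** (1 / d)
closed_form = V_d * np.sqrt(2 * c / np.linalg.det(H) ** (1 / d))
print(vol ** (1 / d), closed_form)           # both approx 2.04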

2.2.2 Success Probability

The success probability, i.e., the probability of sampling a candidate solution $x_{t}$ with an objective function value better than or equal to that of the current solution $m_{t}$, plays an important role in the analysis of the $(1+1)\text{-ES}_{\kappa}$ with success-based step-size control. We present here several useful definitions related to the success probability.

We start with the definitions of the success domain with rate $r$ and of the success probability with rate $r$: the probability of sampling in the $r$-success domain is called the success probability with rate $r$. When $r=0$, we simply speak of the success probability.¹ (Footnote 1: For $r=0$, the success domain $S_{0}(m)$ is not necessarily equal to the sub-levelset $S_{0}^{\prime}(m):=\{x\in\mathbb{R}^{d}\mid f(x)\leqslant f(m)\}$; it always holds that $S_{0}^{\prime}(m)\subseteq S_{0}(m)$. However, since $\mu(S_{0}(m)\setminus S_{0}^{\prime}(m))=0$ is guaranteed by [24, Lemma 1], and since $\Phi(\cdot;0,\Sigma)$ is absolutely continuous for $\Sigma\in\mathcal{S}_{\kappa}$, the success probability with rate $r=0$ is equal to $\Pr_{z\sim\mathcal{N}(0,\Sigma)}\left[m+f_{\mu}(m)\bar{\sigma}z\in S_{0}^{\prime}(m)\right]$, with $\bar{\sigma}$ defined in (3).)

Definition 2.2 (Success Domain).

For a measurable function $f:\mathbb{R}^{d}\to\mathbb{R}$ and $m\in\mathbb{R}^{d}$ such that $f_{\mu}(m)<\infty$, the $r$-success domain at $m$ with $r\in[0,1]$ is defined as

(2) $S_{r}(m)=\{x\in\mathbb{R}^{d}\mid f_{\mu}(x)\leqslant(1-r)f_{\mu}(m)\}\enspace.$

Definition 2.3 (Success Probability).

Let $f$ be a measurable function and let $m_{0}\in\mathbb{R}^{d}$ be the initial search point, satisfying $f_{\mu}(m_{0})<\infty$. For any $r\in[0,1]$ and any $m\in S_{0}(m_{0})$, the success probability with rate $r$ at $m$ under the normalized step-size $\bar{\sigma}$ is defined as

(3) $p^{\mathrm{succ}}_{r}(\bar{\sigma};m,\Sigma)=\Pr_{z\sim\mathcal{N}(0,\Sigma)}\left[m+f_{\mu}(m)\bar{\sigma}z\in S_{r}(m)\right]\enspace.$

Definition 2.3 introduces the notion of the normalized step-size $\bar{\sigma}$, and the success probability is defined as a function of $\bar{\sigma}$ rather than of the actual step-size $\sigma=f_{\mu}(m)\bar{\sigma}$. This is motivated by the fact that as $m$ approaches the global optimum $x^{*}$ of $f$, the step-size $\sigma$ needs to shrink for the success probability to stay constant. If the objective function is $f(x)=\frac{1}{2}\lVert x-x^{*}\rVert^{2}$ and the covariance matrix is the identity matrix, then the success probability is fully controlled by $\bar{\sigma}_{t}=\sigma_{t}/f_{\mu}(m_{t})\propto\sigma_{t}/\lVert m_{t}-x^{*}\rVert$ and is independent of $m_{t}$. This statement can be formalized in the following way.

Lemma 2.4.

If $f(x)=\frac{1}{2}\lVert x-x^{*}\rVert^{2}$, then, letting $e_{1}=(1,0,\ldots,0)$, we have

$p^{\mathrm{succ}}_{r}(\bar{\sigma};m,I)=\Pr_{z\sim\mathcal{N}(0,I)}\left[m+f_{\mu}(m)\bar{\sigma}z\in S_{r}(m)\right]=\Pr_{z\sim\mathcal{N}(0,I)}\left[\|e_{1}+V_{d}\bar{\sigma}z\|\leqslant 1-r\right]\enspace.$

Proof 2.5.

The suboptimality function is the $d$-th root of the volume of a ball of radius $\|x-x^{*}\|$. Hence $f_{\mu}(x)=V_{d}\lVert x-x^{*}\rVert$. The proof then follows the derivation in Section 3 of [4].

Therefore, $\bar{\sigma}$ is more discriminative than $\sigma$ itself. In general, however, the optimal step-size is not necessarily proportional to either $\lVert m_{t}-x^{*}\rVert$ or $f_{\mu}(m_{t})$.
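A small Monte Carlo experiment (a sketch, not from the paper) illustrates Lemma 2.4: on the sphere with $\Sigma=I$, the estimated success probability agrees, up to sampling noise, for incumbents at very different distances from $x^{*}$, as long as the normalized step-size $\bar{\sigma}$ is the same.

import numpy as np
from scipy.special import gamma

rng = np.random.default_rng(1)
d, n = 10, 200_000
V_d = np.pi ** 0.5 / gamma(d / 2 + 1) ** (1 / d)
x_star = np.zeros(d)

def p_succ(m, sigma_bar, r=0.0):
    """Estimate p_r^succ(sigma_bar; m, I) on f(x) = 0.5 ||x - x*||^2."""
    f_mu = V_d * np.linalg.norm(m - x_star)  # f_mu(m) on the sphere
    x = m + f_mu * sigma_bar * rng.standard_normal((n, d))
    return np.mean(np.linalg.norm(x - x_star, axis=1)
                   <= (1 - r) * np.linalg.norm(m - x_star))

m1 = np.full(d, 1.0)                # ||m1 - x*|| = sqrt(10)
m2 = np.r_[100.0, np.zeros(d - 1)]  # ||m2 - x*|| = 100
for sb in (0.01, 0.1, 0.5):
    print(sb, p_succ(m1, sb), p_succ(m2, sb))  # the two columns agree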

Since the success probability under a given normalized step-size depends on $m$ and $\Sigma$, we define the upper and lower success probabilities as follows.

Definition 2.6 (Lower and Upper Success Probability).

Let $\mathcal{X}_{a}^{b}=\{x\in\mathbb{R}^{d}:a<f_{\mu}(x)\leqslant b\}$. Given the normalized step-size $\bar{\sigma}>0$, the lower and upper success probabilities are defined as

$p^{\mathrm{lower}}_{(a,b]}(\bar{\sigma})=\inf_{m\in\mathcal{X}_{a}^{b}}\inf_{\Sigma\in\mathcal{S}_{\kappa}}p^{\mathrm{succ}}_{0}(\bar{\sigma};m,\Sigma)\enspace,\qquad p^{\mathrm{upper}}_{(a,b]}(\bar{\sigma})=\sup_{m\in\mathcal{X}_{a}^{b}}\sup_{\Sigma\in\mathcal{S}_{\kappa}}p^{\mathrm{succ}}_{0}(\bar{\sigma};m,\Sigma)\enspace.$

A central quantity for our analysis is the limit of the success probability $p^{\mathrm{succ}}_{0}(\bar{\sigma};m,\Sigma)$ as $\bar{\sigma}\to 0$. Intuitively, if this limit is too small (compared to $p_{\mathrm{target}}$) at a given $m$, then, because the ruling principle of the algorithm is to decrease the step-size whenever the probability of success is smaller than $p_{\mathrm{target}}$, the step-size will keep decreasing, causing undesired convergence. Following Glasmachers [24], we introduce the concepts of $p$-improvability and $p$-criticality. They are defined in [24] through the probability of sampling a better point from the isotropic normal distribution in the limit of the step-size going to zero. Here, we define $p$-improvability and $p$-criticality for a general multivariate normal distribution.

Definition 2.7 (pp-improvability and pp-criticality).

Let $f$ be a measurable function. The function $f$ is called $p$-improvable at $m\in\mathbb{R}^{d}$ under the covariance matrix $\Sigma\in\mathcal{S}_{\kappa}$ if there exists $p\in(0,1]$ such that

(4) $p=\liminf_{\bar{\sigma}\to+0}p^{\mathrm{succ}}_{0}(\bar{\sigma};m,\Sigma)\enspace.$

Otherwise, it is called $p$-critical.

The connection to the classical definition of the critical points for continuously differentiable functions is summarized in the following proposition, which is an extension of Lemma 4 in [24], taking a non-identity covariance matrix into account.

Proposition 2.8.

Let $f=g\circ h$ be a measurable function, where $g$ is a strictly increasing function and $h$ is continuously differentiable. Then, $f$ is $p$-improvable with $p=1/2$ at any regular point $m$ (where $\nabla h(m)\neq 0$) under any $\Sigma\in\mathcal{S}_{\kappa}$. Moreover, if $h$ is twice continuously differentiable at a critical point $m$ (where $\nabla h(m)=0$) and at least one eigenvalue of $\nabla^{2}h(m)$ is non-zero, then, under any $\Sigma\in\mathcal{S}_{\kappa}$, $m$ is $p$-improvable with $p=1$ if $\nabla^{2}h(m)$ has only non-positive eigenvalues, $p$-critical if $\nabla^{2}h(m)$ has only non-negative eigenvalues, and $p$-improvable with some $p>0$ if $\nabla^{2}h(m)$ has at least one strictly negative eigenvalue.

Proof 2.9.

Note that $p^{\mathrm{succ}}_{0}(\bar{\sigma};m,\Sigma)$ on $f$ is equal to $p^{\mathrm{succ}}_{0}(\bar{\sigma};m,I_{d})$ on $f^{\prime}(x)=f(m+\sqrt{\Sigma}(x-m))$. Therefore, it suffices to show that the claims hold for $\Sigma=I_{d}$ on $f^{\prime}$, which is proven in Lemma 4 of [24].
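A quick simulation illustrates Proposition 2.8 for the smooth case $h(x)=\lVert x\rVert^{2}$. We use the plain step-size $\sigma$ as a stand-in for $\bar{\sigma}$: at a regular point the two differ only by the constant factor $f_{\mu}(m)$, while at the minimum $f_{\mu}(m)=0$ and the success probability vanishes for every $\sigma$. All numeric values are illustrative.

import numpy as np

rng = np.random.default_rng(2)
d, n = 10, 400_000
f = lambda x: np.sum(x * x, axis=-1)  # smooth, unique minimum at 0

def p_succ(m, sigma):
    z = rng.standard_normal((n, d))
    return np.mean(f(m + sigma * z) <= f(m))

m_regular, m_critical = np.full(d, 1.0), np.zeros(d)
for sigma in (1.0, 0.1, 0.01):
    print(sigma, p_succ(m_regular, sigma), p_succ(m_critical, sigma))
# first column tends to 1/2 (p-improvable with p = 1/2);
# second column is identically 0 (the minimum is p-critical)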

2.3 Main Assumptions on the Objective Functions

Figure 2: The sampling distribution is indicated by the mean $m$ and the shaded orange circle, indicating one standard deviation. The blue set is the sub-levelset $S_{0}(m)$ of points improving upon $m$. Left: Illustration of property A1 in Section 2.3. The blue set is enclosed in the red (outer) ball of radius $C_{u}f_{\mu}(m)$ and contains the dark green (inner) ball of radius $C_{\ell}f_{\mu}(m)$. The shaded light green ball indicates the worst-case situation captured by the bound, namely that the small ball is positioned within the large ball at maximal distance to $m$. Right: Illustration of property A2 in Section 2.3. Although the level set has a kink at $m$, there exists a cone centered at $m$ covering a probability mass of $p^{\mathrm{limit}}$ of improving steps (inside $S_{0}(m)$) for a small enough step-size $\sigma$ (green outline). It contains a smaller cone (red outline) covering a probability mass of $p_{\mathrm{target}}$.

Given $a$ and $b$ satisfying $0\leqslant a<b\leqslant+\infty$, and a measurable objective function, let $\mathcal{X}_{a}^{b}$ be the set defined in Definition 2.6.

We pose two core assumptions on the objective function under which we will derive an upper bound on the expected first hitting time of $[0,\epsilon]$ by $f_{\mu}(m_{t})$ (Theorem 4.6), provided $a\leqslant\epsilon\leqslant f_{\mu}(m_{0})\leqslant b$. First, we require that balls whose radius scales with $f_{\mu}(m)$ can be included in and wrapped around the sublevel sets of $f$. We do not require this to hold on the whole search space, but only on a set $\mathcal{X}_{a}^{b}$.

A1. We assume that $f$ is a measurable function and that there exists $\mathcal{X}_{a}^{b}$ such that for any $m\in\mathcal{X}_{a}^{b}$, there exist an open ball $\mathcal{B}_{\ell}$ with radius $C_{\ell}f_{\mu}(m)$ and a closed ball $\bar{\mathcal{B}}_{u}$ with radius $C_{u}f_{\mu}(m)$ such that $\mathcal{B}_{\ell}\subseteq\{x\in\mathbb{R}^{d}\mid f_{\mu}(x)<f_{\mu}(m)\}$ and $\{x\in\mathbb{R}^{d}\mid f_{\mu}(x)\leqslant f_{\mu}(m)\}\subseteq\bar{\mathcal{B}}_{u}$.

We do not specify the centers of those balls; they may or may not be centered at an optimum of the function. We will see in Proposition 4.1 that this assumption allows us to bound $p^{\mathrm{lower}}_{(a,b]}(\bar{\sigma})$ and $p^{\mathrm{upper}}_{(a,b]}(\bar{\sigma})$ by tractable functions of $\bar{\sigma}$, which will be essential for the analysis. The property is illustrated in Figure 2 (left).

The second assumption requires the function to be $p$-improvable with $p$ lower-bounded uniformly over $\mathcal{X}_{a}^{b}$.

A2. Let $f$ be a measurable function. We assume that there exist $\mathcal{X}_{a}^{b}$ and $p^{\mathrm{limit}}>p_{\mathrm{target}}$ such that for any $m\in\mathcal{X}_{a}^{b}$ and any $\Sigma\in\mathcal{S}_{\kappa}$, the objective function $f$ is $p$-improvable for some $p\geqslant p^{\mathrm{limit}}$, i.e.,

(5) $\liminf_{\bar{\sigma}\downarrow 0}p^{\mathrm{lower}}_{(a,b]}(\bar{\sigma})\geqslant p^{\mathrm{limit}}\enspace.$

The property is illustrated in Figure 2 (right). For a continuous function, this assumption implies in particular that $\mathcal{X}_{a}^{b}$ does not contain any local optimum. Such an assumption is required to obtain global convergence [24, Theorem 2] even without any covariance matrix adaptation (i.e., with $\kappa=1$), and it can be understood intuitively: if we have a point that is $p$-improvable with $p<p_{\mathrm{target}}$ and that is not a local minimum of the function, then, starting with a small step-size, the success-based step-size control may keep decreasing the step-size at such a point, and the $(1+1)\text{-ES}_{\kappa}$ will prematurely converge to a point that is not a local optimum.

If A1 is satisfied with balls centered at the optimum $x^{*}$ of the function $f$, then it is easy to see that for all $x\in\mathcal{X}_{a}^{b}$

(6) $C_{\ell}f_{\mu}(x)\leqslant\lVert x-x^{*}\rVert\leqslant C_{u}f_{\mu}(x)\enspace.$

If the balls are not centered at the optimum, we still have the one-sided inequality $\lVert x-x^{*}\rVert\leqslant 2C_{u}f_{\mu}(x)$, since both $x$ and $x^{*}$ belong to the enclosing ball $\bar{\mathcal{B}}_{u}$ of radius $C_{u}f_{\mu}(x)$. Hence, the expected first hitting time of $f_{\mu}(m_{t})$ to $[0,\epsilon]$ translates into an upper bound on the expected first hitting time of $\lVert m_{t}-x^{*}\rVert$ to $[0,2C_{u}\epsilon]$.

We remark that A1 and A2 being satisfied for $a=0$ allows us to include some non-differentiable functions with non-convex sublevel sets, as illustrated in Figure 2.

We now give two examples of function classes that satisfy A1 and A2, including classes on which linear convergence of numerical optimization algorithms is typically analyzed. The first class consists of quadratically bounded functions. It includes all strongly convex functions with Lipschitz continuous gradient, as well as some non-convex functions. The second class consists of positively homogeneous functions. The level sets of a positively homogeneous function are all geometrically similar around $x^{*}$.

A3. We assume that $f=g\circ h$, where $g$ is a strictly increasing function and $h$ is measurable, continuously differentiable with the unique critical point $x^{*}$, and quadratically bounded around $x^{*}$, i.e., for some $L_{u}\geqslant L_{\ell}>0$,

(7) $(L_{\ell}/2)\lVert x-x^{*}\rVert^{2}\leqslant h(x)-h(x^{*})\leqslant(L_{u}/2)\lVert x-x^{*}\rVert^{2}\enspace.$

A4. We assume that $f=g\circ h$, where $g$ is a strictly increasing function and $h$ is continuously differentiable and positively homogeneous with a unique optimum $x^{*}$, i.e., for all $\gamma>0$ and all $x\in\mathbb{R}^{d}$,

(8) $h(x^{*}+\gamma x)=h(x^{*})+\gamma\left(h(x^{*}+x)-h(x^{*})\right)\enspace.$

The following lemmas show that these assumptions imply A1 and A2. The proofs of the lemmas are presented in Section B.1 and Section B.2, respectively.

Lemma 2.10.

Let $f$ satisfy A3. Then, $f$ satisfies A1 and A2 with $a=0$, $b=\infty$, $C_{\ell}=\frac{1}{V_{d}}\sqrt{\frac{L_{\ell}}{L_{u}}}$ and $C_{u}=\frac{1}{V_{d}}\sqrt{\frac{L_{u}}{L_{\ell}}}$.

Lemma 2.11.

Let $f$ be positively homogeneous satisfying A4. Then the suboptimality function $f_{\mu}(x)$ is proportional to $h(x)-h(x^{*})$, and $f$ satisfies A1 and A2 for $a=0$ and $b=\infty$ with $C_{u}=\sup\{\lVert x-x^{*}\rVert:f_{\mu}(x)=1\}$ and $C_{\ell}=\inf\{\lVert x-x^{*}\rVert:f_{\mu}(x)=1\}$.

3 Methodology: Additive Drift on Unbounded Continuous Domains

3.1 First Hitting Time

We start with the generic definition of the first hitting time of a stochastic process $\{X_{t}:t\geqslant 0\}$.

Definition 3.1 (First hitting time).

Let $\{X_{t}:t\geqslant 0\}$ be a sequence of real-valued random variables adapted to the natural filtration $\{\mathcal{F}_{t}:t\geqslant 0\}$ with initial condition $X_{0}=\beta_{0}\in\mathbb{R}$. For $\beta<\beta_{0}$, the first hitting time $T_{\beta}^{X}$ of $X_{t}$ to the set $(-\infty,\beta]$ is defined as $T_{\beta}^{X}=\inf\{t:X_{t}\leqslant\beta\}$.

The first hitting time is the number of iterations that the stochastic process requires to reach the target level $\beta<\beta_{0}$ for the first time. In our situation, $X_{t}=\lVert m_{t}-x^{*}\rVert$ measures the distance from the current solution $m_{t}$ to the target point $x^{*}$ (typically a global or local optimum) after $t$ iterations. Then $\beta=\epsilon>0$ defines the target accuracy, and $T_{\epsilon}^{X}$ is the runtime of the algorithm until it finds the $\epsilon$-neighborhood $\mathcal{B}(x^{*},\epsilon)$. The first hitting time $T_{\epsilon}^{X}$ is a random variable, as $m_{t}$ is a random variable. In this paper, we focus on the expected first hitting time $\mathbb{E}[T_{\epsilon}^{X}]$. We want to derive lower and upper bounds on this expected hitting time that relate to the linear convergence of $m_{t}$ towards $x^{*}$. Such bounds take the following form: there exist $C_{T},\tilde{C}_{T}\in\mathbb{R}$ and $C_{R}>0$, $\tilde{C}_{R}>0$ such that for any $0<\epsilon\leqslant\beta_{0}$

(9) $\tilde{C}_{T}+\frac{\log\left(\lVert m_{0}-x^{*}\rVert/\epsilon\right)}{\tilde{C}_{R}}\leqslant\mathbb{E}[T_{\epsilon}^{X}\,|\,\mathcal{F}_{0}]\leqslant C_{T}+\frac{\log\left(\lVert m_{0}-x^{*}\rVert/\epsilon\right)}{C_{R}}\enspace.$

That is, the time to reach the target accuracy scales logarithmically with the ratio between the initial accuracy $\lVert m_{0}-x^{*}\rVert$ and the target accuracy $\epsilon$. The first pair of constants, $C_{T}$ and $\tilde{C}_{T}$, captures the transient time, which is the time that adaptive algorithms typically spend on adaptation. The second pair of constants, $C_{R}$ and $\tilde{C}_{R}$, reflects the speed of convergence (logarithmic convergence rate). Intuitively, assuming that $C_{R}$ and $\tilde{C}_{R}$ are close, the distance to the optimum decreases in each step at a rate of approximately $\exp(-C_{R})\approx\exp(-\tilde{C}_{R})$. While the upper bound informs us about (linear) convergence, the lower bound helps in understanding whether the upper bound is tight.

Alternatively, linear convergence can be defined as the following property: there exists $C>0$ such that

(10) $\limsup_{t\to\infty}\frac{1}{t}\log\frac{\|m_{t}-x^{*}\|}{\|m_{0}-x^{*}\|}\leqslant-C\quad\text{almost surely.}$

When we have equality in the previous statement, we say that $\exp(-C)$ is the convergence rate.

Figure 1 (right plot) visualizes three runs of the (1+1)-ES on a function with spherical level sets, started with different initial step-sizes. First of all, we clearly observe linear convergence: the first hitting time of $\mathcal{B}(x^{*},\epsilon)$ scales linearly with $\log(1/\epsilon)$ for sufficiently small $\epsilon>0$. Second, the convergence speed is independent of the initial condition. Therefore, we expect universal constants $C_{R}$ and $\tilde{C}_{R}$ independent of the initial state. Last, depending on the initial step-size, the transient time can vary. If the initial step-size is too large or too small, the algorithm makes no progress in terms of $\lVert m_{t}-x^{*}\rVert$ until the step-size is well adapted. Therefore, $C_{T}$ and $\tilde{C}_{T}$ depend on the initial condition, with a logarithmic dependency on the initial multiplicative mismatch.

3.2 Bounds of the Hitting Time via Drift Conditions

We are going to use drift analysis, which consists in deducing properties of a sequence $\{X_{t}:t\geqslant 0\}$ (adapted to a natural filtration $\{\mathcal{F}_{t}:t\geqslant 0\}$) from its drift, defined as $\mathbb{E}[X_{t+1}\mid\mathcal{F}_{t}]-X_{t}$ [29]. Drift analysis has been widely used to analyze hitting times of evolutionary algorithms defined on discrete search spaces (mainly on binary search spaces) [32, 33, 11, 46, 20, 19]. Though developed mainly for finite search spaces, the drift theorems naturally generalize to continuous domains [42, 44]. Indeed, Jägersküpper's work [34, 36, 37] is based on the same idea, although the link to drift analysis was implicit.

Since many drift conditions have been developed for analyzing algorithms on discrete domains, the domain of $X_{t}$ is often implicitly assumed to be bounded. However, this assumption is violated in our situation, where we will use $X_{t}=\log\big(f_{\mu}(m_{t})\big)$ as the quality measure, which takes values in $\mathbb{R}\cup\{-\infty\}$ and is meant to approach $-\infty$. We refer to [4] for additional details. In general, translating expected progress into a hitting-time bound requires bounding the tail of the progress distribution, as formalized in [29].

To control the tails of the drift distribution, we construct a stochastic process $\{Y_{t}:t\geqslant 0\}$ iteratively as follows: $Y_{0}=X_{0}$ and

(11) $Y_{t+1}=Y_{t}+\max\big\{X_{t+1}-X_{t},-A\big\}1\{T_{\beta}^{X}>t\}-B1\{T_{\beta}^{X}\leqslant t\}$

for some $A\geqslant B>0$ and $\beta<\beta_{0}$, with $X_{0}=\beta_{0}$. The process clips $X_{t+1}-X_{t}$ from below at the constant $-A$ ($A>0$); we introduce the indicator $1\{T_{\beta}^{X}>t\}$ for a technical reason. The process thus disregards progress larger than $A$, and it fixes to $B$ the progress of every step from the hitting time onward. This construction is formalized in the following theorem, which is our main mathematical tool to derive an upper bound on the expected first hitting time of the $(1+1)\text{-ES}_{\kappa}$ in the form of (9).

Theorem 3.2.

Let $\{X_{t}:t\geqslant 0\}$ be a sequence of real-valued random variables adapted to a filtration $\{\mathcal{F}_{t}:t\geqslant 0\}$ with $X_{0}=\beta_{0}\in\mathbb{R}$. For $\beta<\beta_{0}$, let $T^{X}_{\beta}=\inf\{t:X_{t}\leqslant\beta\}$ be the first hitting time of the set $(-\infty,\beta]$. Define a stochastic process $\{Y_{t}:t\geqslant 0\}$ iteratively through (11) with $Y_{0}=X_{0}$ for some $A\geqslant B>0$, and let $T^{Y}_{\beta}=\inf\{t:Y_{t}\leqslant\beta\}$ be the first hitting time of the set $(-\infty,\beta]$ by $Y_{t}$. If $Y_{t}$ is integrable, i.e., $\mathbb{E}\left[|Y_{t}|\right]<\infty$, and

(12) $\mathbb{E}\left[\max\left\{X_{t+1}-X_{t},-A\right\}1\{T_{\beta}^{X}>t\}\,\big|\,\mathcal{F}_{t}\right]\leqslant-B1\{T_{\beta}^{X}>t\}\enspace,$

then the expectation of $T^{X}_{\beta}$ satisfies

(13) $\mathbb{E}\left[T^{X}_{\beta}\right]\leqslant\mathbb{E}\left[T^{Y}_{\beta}\right]\leqslant\frac{A+\beta_{0}-\beta}{B}\enspace.$

Proof 3.3 (Proof of Theorem 3.2).

We consider the stopped process $\bar{Y}_{t}=Y_{\min\{t,T^{Y}_{\beta}\}}$. Then, we have $Y_{t}=\bar{Y}_{t}$ for $t\leqslant T^{Y}_{\beta}$ and $\bar{Y}_{t}\geqslant Y_{T^{Y}_{\beta}}$ for all $t$.

We will prove that

(14) $\mathbb{E}[\bar{Y}_{t+1}\mid\mathcal{F}_{t}]\leqslant\bar{Y}_{t}-B1\{T^{Y}_{\beta}>t\}\enspace.$

We start from

(15) $\mathbb{E}[\bar{Y}_{t+1}\mid\mathcal{F}_{t}]=\mathbb{E}[\bar{Y}_{t+1}1\{T^{Y}_{\beta}\leqslant t\}\mid\mathcal{F}_{t}]+\mathbb{E}[\bar{Y}_{t+1}1\{T^{Y}_{\beta}>t\}\mid\mathcal{F}_{t}]$

and bound the two terms separately. First,

(16) $\mathbb{E}[\bar{Y}_{t+1}1\{T^{Y}_{\beta}\leqslant t\}\mid\mathcal{F}_{t}]=\mathbb{E}[\bar{Y}_{t}1\{T^{Y}_{\beta}\leqslant t\}\mid\mathcal{F}_{t}]=\bar{Y}_{t}1\{T^{Y}_{\beta}\leqslant t\}\enspace,$

where we have used that $1\{T_{\beta}^{X}>t\}$, $Y_{t}$, $1\{T_{\beta}^{Y}>t\}$, and $\bar{Y}_{t}$ are all $\mathcal{F}_{t}$-measurable. Second,

(17) $\mathbb{E}[\bar{Y}_{t+1}1\{T^{Y}_{\beta}>t\}\mid\mathcal{F}_{t}]=\mathbb{E}[Y_{t+1}\mid\mathcal{F}_{t}]1\{T^{Y}_{\beta}>t\}\leqslant\big(Y_{t}-B1\{T^{X}_{\beta}>t\}-B1\{T^{X}_{\beta}\leqslant t\}\big)1\{T^{Y}_{\beta}>t\}=(\bar{Y}_{t}-B)1\{T^{Y}_{\beta}>t\}\enspace,$

where we have used condition Eq. 12. Hence, by injecting Eq. 16 and Eq. 17 into Eq. 15, we obtain Eq. 14. Taking the expectation of Eq. 14, we deduce $\mathbb{E}[\bar{Y}_{t+1}]\leqslant\mathbb{E}[\bar{Y}_{t}]-B\Pr[T_{\beta}^{Y}>t]$. Following the same approach as [44, Theorem 1], since $T^{Y}_{\beta}$ is a random variable taking values in $\mathbb{N}$, its expectation can be written as $\mathbb{E}[T^{Y}_{\beta}]=\sum_{t=0}^{+\infty}\Pr[T^{Y}_{\beta}>t]$, and thus it holds that

$B\mathbb{E}\left[T^{Y}_{\beta}\right]\;\stackrel{\tilde{t}\to\infty}{\longleftarrow}\;\sum_{t=0}^{\tilde{t}}B\Pr\left[T^{Y}_{\beta}>t\right]\leqslant\sum_{t=0}^{\tilde{t}}\Big(\mathbb{E}[\bar{Y}_{t}]-\mathbb{E}[\bar{Y}_{t+1}]\Big)=\mathbb{E}[\bar{Y}_{0}]-\mathbb{E}[\bar{Y}_{\tilde{t}}]\enspace.$

Since $Y_{t+1}\geqslant Y_{t}-A$, we have $Y_{T^{Y}_{\beta}}\geqslant\beta-A$. Given that $\bar{Y}_{\tilde{t}}\geqslant Y_{T^{Y}_{\beta}}$, we deduce that $\mathbb{E}[\bar{Y}_{\tilde{t}}]\geqslant\beta-A$ for all $\tilde{t}$. With $\mathbb{E}[\bar{Y}_{0}]=\beta_{0}$, we obtain

$\mathbb{E}\left[T^{Y}_{\beta}\right]\leqslant(A/B)+B^{-1}(\beta_{0}-\beta)\enspace.$

Because $\beta<X_{t}\leqslant Y_{t}$ for $0\leqslant t<T^{X}_{\beta}$, we have $T^{X}_{\beta}\leqslant T^{Y}_{\beta}$, implying $\mathbb{E}[T^{X}_{\beta}]\leqslant\mathbb{E}[T^{Y}_{\beta}]$. This completes the proof.

This theorem can be intuitively understood. Assume, for the sake of simplicity, a process $X_{t}$ such that $X_{t+1}\geqslant X_{t}-A$. Then (12) states that the process progresses in expectation by at least $B$ per step. The theorem concludes that the expected time needed to reach a value smaller than $\beta$ when started at $\beta_{0}$ equals $(\beta_{0}-\beta)/B$ (what we would get for a deterministic algorithm) plus $A/B$. This last term is due to the stochastic nature of the algorithm. It is minimized when $A$ is as close as possible to $B$, which corresponds to a highly concentrated process.
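A toy simulation (not from the paper) makes the bound (13) tangible: for i.i.d. steps uniform on $[-2B,0]$ and $A\geqslant 2B$, the truncation in (12) is inactive and the expected drift is exactly $-B$, so the empirical mean hitting time must stay below $(A+\beta_{0}-\beta)/B$.

import numpy as np

rng = np.random.default_rng(3)
A, B = 1.0, 0.25
beta0, beta = 10.0, 0.0

def hitting_time():
    x, t = beta0, 0
    while x > beta:
        x += rng.uniform(-2 * B, 0.0)  # E[step] = -B and step >= -2B >= -A
        t += 1
    return t

T = np.array([hitting_time() for _ in range(10_000)])
print(T.mean(), (A + beta0 - beta) / B)  # empirical mean (approx 41) <= bound 44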

Jägersküpper [36, Theorem 2] established a general lower bound on the expected first hitting time of the (1+1)-ES. We borrow the same idea to prove the following general theorem for a lower bound on the expected first hitting time, which generalizes [37, Lemma 12]. See Theorem 2.3 in [4] for its proof.

Theorem 3.4.

Let $\{X_{t}:t\geqslant 0\}$ be a sequence of integrable real-valued random variables adapted to a filtration $\{\mathcal{F}_{t}:t\geqslant 0\}$ such that $X_{0}=\beta_{0}$, $X_{t+1}\leqslant X_{t}$, and $\mathbb{E}[X_{t+1}\mid\mathcal{F}_{t}]-X_{t}\geqslant-C$ for some $C>0$. For $\beta<\beta_{0}$, we define $T^{X}_{\beta}=\min\{t:X_{t}\leqslant\beta\}$. Then the expected hitting time is lower bounded by $\mathbb{E}\left[T_{\beta}^{X}\right]\geqslant-(1/2)+(4C)^{-1}(\beta_{0}-\beta)$.

4 Main Result: Expected First Hitting Time Bound

4.1 Mathematical Modeling of the Algorithm

In the sequel, we analyze the process $\{\theta_{t}:t\geqslant 0\}$, where $\theta_{t}=(m_{t},\sigma_{t},\Sigma_{t})\in\mathbb{R}^{d}\times\mathbb{R}_{>}\times\mathcal{S}_{\kappa}$, generated by the $(1+1)\text{-ES}_{\kappa}$ algorithm. We assume from now on that the optimized objective function $f$ is measurable with respect to the Borel $\sigma$-algebra. We equip the state-space $\mathcal{X}=\mathbb{R}^{d}\times\mathbb{R}_{>}\times\mathcal{S}_{\kappa}$ with its Borel $\sigma$-algebra, denoted $\mathcal{B}(\mathcal{X})$.

4.2 Preliminaries

We present two preliminary results. In Assumption A1, we assume that for $m\in\mathcal{X}_{a}^{b}$, we can include a ball of radius $C_{\ell}f_{\mu}(m)$ into the sublevel set $S_{0}(m)$ and embed $S_{0}(m)$ into a ball of radius $C_{u}f_{\mu}(m)$. This allows us to upper and lower bound the probability of success, for all $m\in\mathcal{X}_{a}^{b}$ and all $\Sigma\in\mathcal{S}_{\kappa}$, by the probability of sampling inside balls of radius $C_{u}f_{\mu}(m)$ and $C_{\ell}f_{\mu}(m)$ with appropriate centers. From this we can upper-bound $p^{\mathrm{upper}}_{(a,b]}(\bar{\sigma})$ by a function of $\bar{\sigma}$; similarly, we can lower-bound $p^{\mathrm{lower}}_{(a,b]}(\bar{\sigma})$ by a function of $\bar{\sigma}$. The corresponding proof is given in Section B.3.

Proposition 4.1.

Suppose that $f$ satisfies A1. Consider the lower and upper success probabilities $p^{\mathrm{lower}}_{(a,b]}$ and $p^{\mathrm{upper}}_{(a,b]}$ defined in Definition 2.6. Then

(18) $p^{\mathrm{upper}}_{(a,b]}(\bar{\sigma})\leqslant\kappa^{d/2}\Phi\left(\bar{\mathcal{B}}\left(0,\frac{C_{u}}{\bar{\sigma}\kappa^{1/2}}\right);0,I\right)$

(19) $p^{\mathrm{lower}}_{(a,b]}(\bar{\sigma})\geqslant\kappa^{-d/2}\Phi\left(\bar{\mathcal{B}}\left(\frac{(2C_{u}-C_{\ell})\kappa^{1/2}}{\bar{\sigma}}e_{1},\ \frac{C_{\ell}\kappa^{1/2}}{\bar{\sigma}}\right);0,I\right)\enspace,$

where $e_{1}=(1,0,\dots,0)$.
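Both right-hand sides can be evaluated numerically: for $z\sim\mathcal{N}(0,I_{d})$, the squared distance $\lVert z-ce_{1}\rVert^{2}$ follows a noncentral chi-square distribution with $d$ degrees of freedom and noncentrality $c^{2}$. The following sketch uses placeholder values for $\kappa$, $C_{u}$, and $C_{\ell}$.

import numpy as np
from scipy.stats import chi2, ncx2

d, kappa, C_u, C_ell = 10, 2.0, 1.0, 0.5  # placeholder constants

def rhs_18(sigma_bar):
    """Right-hand side of (18): ball centered at 0, central chi-square."""
    R = C_u / (sigma_bar * np.sqrt(kappa))
    return kappa ** (d / 2) * chi2.cdf(R ** 2, df=d)

def rhs_19(sigma_bar):
    """Right-hand side of (19): ball shifted by c e_1, noncentral chi-square."""
    c = (2 * C_u - C_ell) * np.sqrt(kappa) / sigma_bar
    R = C_ell * np.sqrt(kappa) / sigma_bar
    return kappa ** (-d / 2) * ncx2.cdf(R ** 2, df=d, nc=c ** 2)

for sb in (0.05, 0.1, 0.3):
    print(sb, rhs_19(sb), rhs_18(sb))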

We use the previous proposition to establish the next lemma, which guarantees the existence of a finite range of normalized step-sizes that keeps the success probability within some range $(p_{u},p_{\ell})$ independently of $m$ and $\Sigma$, and provides a lower bound on the success probability with rate $r$ when the normalized step-size is in this range. Its proof is provided in Section B.4.

Lemma 4.2.

Assume that $f$ satisfies A1 and A2 for some $0\leqslant a<b\leqslant\infty$. Then, for any $p_{u}$ and $p_{\ell}$ satisfying $0<p_{u}<p_{\mathrm{target}}<p_{\ell}<p^{\mathrm{limit}}$, the constants

$\bar{\sigma}_{\ell}=\sup\left\{\bar{\sigma}>0:p^{\mathrm{lower}}_{(a,b]}(\bar{\sigma})\geqslant p_{\ell}\right\}\qquad\text{and}\qquad\bar{\sigma}_{u}=\inf\left\{\bar{\sigma}>0:p^{\mathrm{upper}}_{(a,b]}(\bar{\sigma})\leqslant p_{u}\right\}$

exist as positive finite values. Let $\ell\leqslant\bar{\sigma}_{\ell}$ and $u\geqslant\bar{\sigma}_{u}$ be such that $u/\ell\geqslant\alpha_{\uparrow}/\alpha_{\downarrow}$. Then, for $r\in[0,1]$, $p^{*}_{r}$ defined as

(20) $p^{*}_{r}:=\inf_{\ell\leqslant\bar{\sigma}\leqslant u}\inf_{m\in\mathcal{X}_{a}^{b}}\inf_{\Sigma\in\mathcal{S}_{\kappa}}p^{\mathrm{succ}}_{r}(\bar{\sigma};m,\Sigma)$

is lower bounded by the positive number

(21) $\min_{\ell\leqslant\bar{\sigma}\leqslant u}\kappa^{-d/2}\Phi\left(\mathcal{B}\left(\frac{(2C_{u}-(1-r)C_{\ell})\kappa^{1/2}}{\bar{\sigma}}e_{1},\ \frac{(1-r)C_{\ell}\kappa^{1/2}}{\bar{\sigma}}\right);0,I\right)\enspace.$

4.3 Potential Function

Lemma 4.2 divides the domain of the normalized step-size into three disjoint subsets: $\bar{\sigma}\in(0,\ell)$ is the too-small-step-size regime, where $p^{\mathrm{succ}}_{0}(\bar{\sigma};m,\Sigma)\geqslant p_{\ell}$ for all $m\in\mathcal{X}_{a}^{b}$ and $\Sigma\in\mathcal{S}_{\kappa}$; $\bar{\sigma}\in(u,\infty)$ is the too-large-step-size regime, where $p^{\mathrm{succ}}_{0}(\bar{\sigma};m,\Sigma)\leqslant p_{u}$ for all $m\in\mathcal{X}_{a}^{b}$ and $\Sigma\in\mathcal{S}_{\kappa}$; and $\bar{\sigma}\in[\ell,u]$ is the reasonable regime, where the success probability with rate $r$ is lower bounded by Eq. 21. Since $p_{\mathrm{target}}\in[p_{u},p_{\ell}]$, the normalized step-size is expected to be maintained in the reasonable range.

Our potential function is defined as follows. In light of Lemma 4.2, we can take $\ell\leqslant\bar{\sigma}_{\ell}$ and $u\geqslant\bar{\sigma}_{u}$ such that $u/\ell\geqslant\alpha_{\uparrow}/\alpha_{\downarrow}$. With some constant $v>0$, we define our potential function as

(22) $V(\theta)=\log(f_{\mu}(m))+\max\left\{0,\ v\log\left[\frac{\alpha_{\uparrow}\ell f_{\mu}(m)}{\sigma}\right],\ v\log\left[\frac{\sigma}{\alpha_{\downarrow}uf_{\mu}(m)}\right]\right\}\enspace.$

The rationale behind the second term on the right-hand side is as follows. The second and third terms inside the max are positive only if the normalized step-size $\bar{\sigma}=\sigma/f_{\mu}(m)$ is smaller than $\ell\alpha_{\uparrow}$ or greater than $u\alpha_{\downarrow}$, respectively. The potential value is $\log f_{\mu}(m)$ if the normalized step-size is in $[\ell\alpha_{\uparrow},u\alpha_{\downarrow}]$, and it is penalized if the normalized step-size is too small or too large. We need this penalization for the following reason. If the normalized step-size is too small, the success probability is close to $1/2$ at non-critical points (assuming $f=g\circ h$ with $h$ continuously differentiable), but the progress per step is very small, because the step-size directly controls the progress, measured for instance as $\|m_{t+1}-m_{t}\|=\sigma_{t}\|\mathcal{N}(0,\Sigma_{t})\|1\{f(x_{t})\leqslant f(m_{t})\}$. If the normalized step-size is too large, the success probability is close to zero and no progress is made with high probability. If we used $\log f_{\mu}(m)$ as the potential function instead of $V(\theta)$, the progress would be arbitrarily small in such situations, which prevents the application of drift arguments. The above potential function penalizes such situations and guarantees a certain progress in the penalized quantity, since the step-size will be increased or decreased, respectively, with high probability, leading to a certain decrease of $V(\theta)$. Figure 1 illustrates that $\log(f_{\mu}(m))$ cannot work alone as a potential function while $V(\theta)$ does: when we start from a too small or too large step-size, $\log(f_{\mu}(m))$ looks constant (dotted lines in green and blue). Only when the step-size is started at $1$ do we see progress in $\log(f_{\mu}(m))$. Also, the step-size can always get arbitrarily worse, with a very small probability, which forces us to handle the case of a badly adapted step-size properly. Yet the simulation of $V(\theta)$ shows that in all three situations (small, large, and well-adapted step-size compared to the distance to the optimum), we observe a geometric decrease of $V(\theta)$.
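For reference, here is a direct transcription of the potential function (22) in Python; the constants in the usage example are those reported for Figure 1 (right).

import numpy as np

def potential(f_mu_m, sigma, ell, u, v, alpha_up, alpha_down):
    """V(theta) from (22): log f_mu(m) plus a penalty for ill-adapted step-sizes."""
    return np.log(f_mu_m) + max(
        0.0,
        v * np.log(alpha_up * ell * f_mu_m / sigma),    # sigma_bar below ell * alpha_up
        v * np.log(sigma / (alpha_down * u * f_mu_m)))  # sigma_bar above u * alpha_down

# Constants of Figure 1 (right): v = 4/d, ell = alpha_up^(-10), u = alpha_down^(-10).
d, a_up, a_dn = 10, np.exp(0.1), np.exp(-0.025)
print(potential(f_mu_m=1e-3, sigma=1.0, ell=a_up ** -10, u=a_dn ** -10,
                v=4 / d, alpha_up=a_up, alpha_down=a_dn))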

4.4 Upper Bound of the First Hitting Time

We are now ready to establish that the potential function defined in (22) satisfies the (truncated) drift condition of Theorem 3.2. This will in turn imply an upper bound on the expected hitting time of $f_{\mu}(m)$ to $[0,\epsilon]$, provided $a\leqslant\epsilon$. The proof follows the same line of argumentation as the proof of [4, Proposition 4.2], which was restricted to the case of spherical functions. It was generalized under similar assumptions as in this paper, but for a fixed covariance matrix equal to the identity, in [47, Proposition 6]. The detailed proof is given in Section B.5.

Proposition 4.3.

Consider the $(1+1)\text{-ES}_{\kappa}$ described in Algorithm 1 with state $\theta_{t}=(m_{t},\sigma_{t},\Sigma_{t})$. Assume that the minimized objective function $f$ satisfies A1 and A2 for some $0\leqslant a<b\leqslant\infty$. Let $p_{u}$ and $p_{\ell}$ be constants satisfying $0<p_{u}<p_{\mathrm{target}}<p_{\ell}<p^{\mathrm{limit}}$ and $p_{\ell}+p_{u}=2p_{\mathrm{target}}$. Then, there exist $\ell\leqslant\bar{\sigma}_{\ell}$ and $u\geqslant\bar{\sigma}_{u}$ such that $u/\ell\geqslant\alpha_{\uparrow}/\alpha_{\downarrow}$, where $\bar{\sigma}_{\ell}$ and $\bar{\sigma}_{u}$ are defined in Lemma 4.2. For any $A>0$, taking $v$ satisfying $0<v<\min\left\{1,\ \frac{A}{\log(1/\alpha_{\downarrow})},\ \frac{A}{\log(\alpha_{\uparrow})}\right\}$ and the potential function Eq. 22, we have

(23) $\mathbb{E}\left[\max\{V(\theta_{t+1})-V(\theta_{t}),-A\}1\{m_{t}\in\mathcal{X}_{a}^{b}\}\mid\mathcal{F}_{t}\right]\leqslant-B1\{m_{t}\in\mathcal{X}_{a}^{b}\}\enspace,$

where

(24) $B=\min\left\{Ap^{*}_{r}-v\log\left(\frac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\right),\ v\frac{p_{\ell}-p_{u}}{2}\log\left(\frac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\right)\right\}\enspace,$

and $p^{*}_{r}=\inf_{\bar{\sigma}\in[\ell,u]}\inf_{m\in\mathcal{X}_{a}^{b}}\inf_{\Sigma\in\mathcal{S}_{\kappa}}p^{\mathrm{succ}}_{r}(\bar{\sigma};m,\Sigma)$ with $r=1-\exp\left(-\frac{A}{1-v}\right)$. Moreover, for any $A>0$ there exists $v$ such that $0<B<A$.

We apply Theorem 3.2 along with Proposition 4.3 to derive the expected first hitting time bound. To do so, we need to confirm that the prerequisite of the theorem is satisfied, namely the integrability of the process $\{Y_{t}:t\geqslant 0\}$ defined in Eq. 11 with $X_{t}=V(\theta_{t})$.

Lemma 4.4.

Let $\{\theta_{t}:t\geqslant 0\}$ be the sequence of parameters $\theta_{t}=(m_{t},\sigma_{t},\Sigma_{t})$ generated by the $(1+1)\text{-ES}_{\kappa}$ with initial condition $\theta_{0}=(m_{0},\sigma_{0},\Sigma_{0})$, optimizing a measurable function $f$. Set $X_{t}=V(\theta_{t})$ with $V$ as defined in Eq. 22 and define the process $Y_{t}$ as in Theorem 3.2. Then, for any $A>0$, $\{Y_{t}:t\geqslant 0\}$ is integrable, i.e., $\mathbb{E}[\lvert Y_{t}\rvert]<\infty$ for each $t$.

Proof 4.5 (Proof of Lemma 4.4).

The increment $Y_{t+1}-Y_{t}=\max\big\{V(\theta_{t+1})-V(\theta_{t}),-A\big\}1\{T_{\beta}^{X}>t\}-B1\{T_{\beta}^{X}\leqslant t\}$ is by construction bounded from below by $-A$ (recall that $A\geqslant B$). It is also bounded from above by a constant: from the proof of Proposition 4.3, it is easy to find an upper bound, say $C$, of the truncated one-step change $\Delta_{t}$ appearing there, without using A1 and A2. Let $D=\max\{A,C\}$. Then, by recursion, $\lvert Y_{t}\rvert\leqslant\lvert Y_{0}\rvert+\lvert Y_{t}-Y_{0}\rvert\leqslant\lvert Y_{0}\rvert+Dt$. Hence $\mathbb{E}[\lvert Y_{t}\rvert]\leqslant\lvert Y_{0}\rvert+Dt<\infty$ for all $t$.

Finally, we derive the expected first hitting time of logfμ(mt)\log f_{\mu}(m_{t}).

Theorem 4.6.

Consider the same situation as described in Proposition 4.3. Let T_{\epsilon}=\min\{t:f_{\mu}(m_{t})\leqslant\epsilon\} be the first hitting time of f_{\mu}(m_{t}) to [0,\epsilon]. Choose \epsilon such that a\leqslant\epsilon<f_{\mu}(m_{0})\leqslant b, where a and b appear in Definition 2.6. If m_{0}\in\mathcal{X}_{a}^{b}, the expected first hitting time is upper bounded by \mathbb{E}[T_{\epsilon}]\leqslant\big{(}V(\theta_{0})-\log(\epsilon)+A\big{)}/B for A>B>0 described in Proposition 4.3, where V(\theta) is the potential function defined in Eq. 22. Equivalently, we have \mathbb{E}[T_{\epsilon}]\leqslant C_{T}+C_{R}^{-1}\log(f_{\mu}(m_{0})/\epsilon), where

CT\displaystyle C_{T} =AB+vBmax{0,log(αfμ(m0)σ0),log(σ0αufμ(m0))},\displaystyle=\frac{A}{B}+\frac{v}{B}\max\left\{0,\log\left(\frac{\alpha_{\uparrow}\ell f_{\mu}(m_{0})}{\sigma_{0}}\right),\log\left(\frac{\sigma_{0}}{\alpha_{\downarrow}uf_{\mu}(m_{0})}\right)\right\}\enspace, CR\displaystyle C_{R} =B.\displaystyle=B\enspace.

Moreover, the above result yields an upper bound of the expected first hitting time of mtx\lVert m_{t}-x^{*}\rVert to [0,2Cuϵ][0,2C_{u}\epsilon].

Proof 4.7.

Theorem 3.2, combined with Proposition 4.3 and Lemma 4.4, bounds the expected first hitting time of V(\theta_{t}) to (-\infty,\log(\epsilon)] by \big{(}V(\theta_{0})-\log(\epsilon)+A\big{)}/B. Since \log f_{\mu}(m_{t})\leqslant V(\theta_{t}), T_{\epsilon} is bounded by the first hitting time of V(\theta_{t}) to (-\infty,\log(\epsilon)], and the inequality is preserved when taking expectations. The last claim follows directly from the inequality \lVert x-x^{*}\rVert\leqslant 2C_{u}f_{\mu}(x), which holds under A1.

Theorem 4.6 provides an upper bound on the expected hitting time of the (1+1)\text{-ES}_{\kappa} with success-based step-size adaptation, which establishes linear convergence towards the global optimum x^{*} on functions satisfying A1 and A2 with a=0. Moreover, for b=\infty, this bound holds for all initial search points m_{0}. If a>0, the bound in Theorem 4.6 does not translate into linear convergence, but we still obtain an upper bound on the expected first hitting time of the target accuracy \epsilon\geqslant a. This is useful for understanding the behavior of the (1+1)\text{-ES}_{\kappa} on multimodal functions and on functions with a degenerate Hessian matrix at the optimum.
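For concreteness, the following Python sketch implements the analyzed scheme in the special case \kappa=1, i.e., with \Sigma fixed to the identity. It is an illustration, not a verbatim transcription of Algorithm 1; the step-size multipliers \alpha_{\uparrow}=\exp(4/d) and \alpha_{\downarrow}=\alpha_{\uparrow}^{-1/4}, which realize p_{\mathrm{target}}=1/5, are the ones used in the experiments of Appendix A.

```python
import numpy as np

def one_plus_one_es(f, m0, sigma0, budget=100_000, target=1e-10, seed=0):
    """(1+1)-ES with success-based step-size adaptation and Sigma = I."""
    rng = np.random.default_rng(seed)
    m = np.asarray(m0, dtype=float)
    d = len(m)
    alpha_up = np.exp(4.0 / d)           # increase factor on success
    alpha_down = alpha_up ** (-1 / 4)    # decrease factor on failure
    sigma, fm = float(sigma0), f(m)
    for t in range(budget):
        x = m + sigma * rng.standard_normal(d)  # candidate ~ N(m, sigma^2 I)
        fx = f(x)
        if fx <= fm:                     # success: accept x, enlarge sigma
            m, fm, sigma = x, fx, sigma * alpha_up
        else:                            # failure: keep m, shrink sigma
            sigma *= alpha_down
        if fm <= target:
            break
    return m, fm, t + 1
```

For example, one_plus_one_es(lambda x: float(x @ x), np.ones(10), 1.0) minimizes the sphere function in dimension 10 down to the target accuracy.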

4.5 Lower Bound of the First Hitting Time

We derive a general lower bound on the expected first hitting time of \lVert m_{t}-x^{*}\rVert to [0,\epsilon]. The following results hold for an arbitrary measurable function f and for a (1+1)\text{-ES}_{\kappa} with an arbitrary \sigma-control mechanism. The following lemma provides a lower bound on the expected one-step progress measured by the logarithm of the distance to the optimum.

Lemma 4.8.

We consider the process {θt:t0}\{\theta_{t}:t\geqslant 0\} generated by a (1+1)-ESκ\text{-ES}_{\kappa} algorithm with an arbitrary step-size adaptation mechanism and an arbitrary covariance matrix update optimizing an arbitrary measurable function ff. We assume d2d\geqslant 2 and κt=Cond(Σt)κ\kappa_{t}=\operatorname{Cond}(\Sigma_{t})\leqslant\kappa. We consider the natural filtration t\mathcal{F}_{t}. Then, the expected single-step progress is lower-bounded by

(25) 𝔼[min(log(mt+1x/mtx),0)t]κtd2/d.\mathbb{E}[\min(\log(\lVert m_{t+1}-x^{*}\rVert/\lVert m_{t}-x^{*}\rVert),0)\mid\mathcal{F}_{t}]\geqslant-\kappa_{t}^{\frac{d}{2}}/d\enspace.

Proof 4.9 (Proof of Lemma 4.8).

Note first that log(mt+1x/mtx)=log(xtx/mtx)1{f(xt)f(mt)}\log(\lVert m_{t+1}-x^{*}\rVert/\lVert m_{t}-x^{*}\rVert)=\log(\lVert x_{t}-x^{*}\rVert/\lVert m_{t}-x^{*}\rVert)1\left\{\scriptstyle{f(x_{t})\leqslant f(m_{t})}\right\}. This value can be positive since f(xt)f(mt)f(x_{t})\leqslant f(m_{t}) does not imply xtxmtx\lVert x_{t}-x^{*}\rVert\leqslant\lVert m_{t}-x^{*}\rVert in general. Clipping the positive part to zero, we obtain a lower bound, which is the RHS of the above equality times the indicator 1{xtxmtx}1\left\{\scriptstyle{\lVert x_{t}-x^{*}\rVert\leqslant\lVert m_{t}-x^{*}\rVert}\right\}. Since the quantity is non-positive, dropping the indicator 1{f(xt)f(mt)}1\left\{\scriptstyle{f(x_{t})\leqslant f(m_{t})}\right\} only decreases the lower bound. Hence, we have min(log(mt+1x/mtx),0)log(xtx/mtx)1{xtxmtx}\min(\log(\lVert m_{t+1}-x^{*}\rVert/\lVert m_{t}-x^{*}\rVert),0)\geqslant\log(\lVert x_{t}-x^{*}\rVert/\lVert m_{t}-x^{*}\rVert)1\left\{\scriptstyle{\lVert x_{t}-x^{*}\rVert\leqslant\lVert m_{t}-x^{*}\rVert}\right\}. Then,

𝔼[min(log(mt+1x)log(mtx),0)t]𝔼[log(xtx/mtx)1{xtxmtx}t].\mathbb{E}[\min(\log(\lVert m_{t+1}-x^{*}\rVert)-\log(\lVert m_{t}-x^{*}\rVert),0)\mid\mathcal{F}_{t}]\\ \geqslant\mathbb{E}[\log(\lVert x_{t}-x^{*}\rVert/\lVert m_{t}-x^{*}\rVert)1\left\{\scriptstyle{\lVert x_{t}-x^{*}\rVert\leqslant\lVert m_{t}-x^{*}\rVert}\right\}\mid\mathcal{F}_{t}]\enspace.

We rewrite the lower bound of the drift. The RHS of the above inequality is the integral of \log(\lVert x-x^{*}\rVert/\lVert m_{t}-x^{*}\rVert) over the domain \bar{\mathcal{B}}(x^{*},\lVert m_{t}-x^{*}\rVert) under the probability measure \Phi(\cdot;m_{t},\sigma_{t}^{2}\Sigma_{t}). Performing a variable change (through rotation and scaling) so that m_{t}-x^{*} becomes e_{1}=(1,0,\cdots,0) and letting \tilde{\sigma}_{t}=\sigma_{t}/\lVert m_{t}-x^{*}\rVert, we can further rewrite it as the integral of \log(\lVert x\rVert) over \bar{\mathcal{B}}(0,1) under \Phi(\cdot;e_{1},\tilde{\sigma}_{t}^{2}\Sigma_{t}). With \kappa_{t}=\operatorname{Cond}(\Sigma_{t}), we have \varphi(\cdot;e_{1},\tilde{\sigma}_{t}^{2}\Sigma_{t})\leqslant\kappa_{t}^{d/2}\varphi(\cdot;e_{1},\kappa_{t}\tilde{\sigma}_{t}^{2}\mathrm{I}), see Lemma B.1. Since \log(\lVert x\rVert)\leqslant 0 on \bar{\mathcal{B}}(0,1), we obtain the lower bound \mathbb{E}[\log(\lVert x_{t}-x^{*}\rVert/\lVert m_{t}-x^{*}\rVert)1\{\lVert x_{t}-x^{*}\rVert\leqslant\lVert m_{t}-x^{*}\rVert\}\mid\mathcal{F}_{t}]\geqslant\kappa_{t}^{d/2}\int_{\bar{\mathcal{B}}(0,1)}\log(\lVert x\rVert)\varphi(x;e_{1},\kappa_{t}\tilde{\sigma}_{t}^{2}\mathrm{I})\mathrm{d}x. The RHS equals -\kappa_{t}^{d/2} times the expected single-step progress of the (1+1)-ES on the spherical function at m_{t}=e_{1} and \sigma=\sqrt{\kappa_{t}}\tilde{\sigma}_{t}, which is shown in the proof of [4, Lemma 4.4] to be at most 1/d for d\geqslant 2. This completes the proof.

The following theorem proves that the expected first hitting time of the (1+1)\text{-ES}_{\kappa} is \Omega(\log(\lVert m_{0}-x^{*}\rVert/\epsilon)) for any measurable function f, implying that it cannot converge faster than linearly. In the case \kappa=1 the lower runtime bound becomes \Omega(d\log(\lVert m_{0}-x^{*}\rVert/\epsilon)), meaning that the runtime scales at least linearly with respect to d. The proof is a direct application of Lemma 4.8 to Theorem 3.4.

Theorem 4.10.

We consider the process \{\theta_{t}:t\geqslant 0\} generated by a (1+1)\text{-ES}_{\kappa} described in Algorithm 1 and assume that f is a measurable function with d\geqslant 2. Let T_{\epsilon}=\inf\{t:\lVert m_{t}-x^{*}\rVert\leqslant\epsilon\} be the first hitting time of [0,\epsilon] by \lVert m_{t}-x^{*}\rVert. Then, the expected first hitting time is lower bounded by \mathbb{E}[T_{\epsilon}]\geqslant-(1/2)+\frac{d}{4\kappa^{d/2}}\log(\lVert m_{0}-x^{*}\rVert/\epsilon). The bound holds for arbitrary step-size adaptation mechanisms. If A1 holds, it also gives a lower bound for the expected first hitting time of f_{\mu}(m_{t}) to [0,2C_{\ell}\epsilon].

Proof 4.11 (Proof of Theorem 4.10).

Let X_{t}=\log\lVert m_{t}-x^{*}\rVert for t\geqslant 0. Define Y_{t} iteratively as Y_{0}=X_{0} and Y_{t+1}=Y_{t}+\min(X_{t+1}-X_{t},0). Then, it is easy to see that Y_{t}\leqslant X_{t} and Y_{t+1}\leqslant Y_{t} for all t\geqslant 0. Note that \mathbb{E}[Y_{t+1}-Y_{t}\mid\mathcal{F}_{t}]=\mathbb{E}[\min(X_{t+1}-X_{t},0)\mid\mathcal{F}_{t}]=\mathbb{E}[\min(\log(\lVert m_{t+1}-x^{*}\rVert/\lVert m_{t}-x^{*}\rVert),0)\mid\mathcal{F}_{t}], where the right-most side is lower bounded in light of Lemma 4.8. Then, applying Theorem 3.4, we obtain the lower bound. The last statement directly follows from \lVert x-x^{*}\rVert\leqslant 2C_{\ell}f_{\mu}(x) under A1.

4.6 Almost Sure Linear Convergence

In addition to the expected first hitting time bound, we can deduce from Proposition 4.3 almost sure linear convergence, as stated in the following proposition.

Proposition 4.12.

Consider the same situation as described in Proposition 4.3, where a=0 and 0<b\leqslant\infty. Then, for any m_{0}\in\mathcal{X}_{0}^{b}, \sigma_{0}>0 and \Sigma_{0}\in\mathcal{S}_{\kappa}, we have

(26) Pr[lim supt1tlogfμ(mt)B]=Pr[lim supt1tlogmtxB]=1,\Pr\left[\limsup_{t\to\infty}\frac{1}{t}\log f_{\mu}(m_{t})\leqslant-B\right]=\Pr\left[\limsup_{t\to\infty}\frac{1}{t}\log\|m_{t}-x^{*}\|\leqslant-B\right]=1\enspace,

where B>0B>0 is as defined in Proposition 4.3. Hence almost sure linear convergence holds at a rate exp(C)\exp(-C) such that exp(C)exp(B)\exp(-C)\leqslant\exp(-B).

Proof 4.13 (Proof of Proposition 4.12).

Let V be defined in (22). Let Y_{0}=V(\theta_{0}) and Y_{t+1}=Y_{t}+\max(-A,V(\theta_{t+1})-V(\theta_{t})). Define Z_{t}=Y_{t}-\mathbb{E}_{t-1}[Y_{t}] for t\geqslant 1. Then, \{Z_{t}\} is a martingale difference sequence with respect to the filtration \{\mathcal{F}_{t}\} generated by \{\theta_{t}\}. Since the increments of Y_{t} dominate those of V(\theta_{t}), we have \frac{1}{t}\log f_{\mu}(m_{t})\leqslant\frac{1}{t}V(\theta_{t})\leqslant\frac{1}{t}Y_{t}, and from Proposition 4.3 we obtain

Yt=𝔼t1[Yt]+Zt=Yt1+𝔼t1[YtYt1]+ZtYt1B+Zt.\displaystyle Y_{t}=\mathbb{E}_{t-1}[Y_{t}]+Z_{t}=Y_{t-1}+\mathbb{E}_{t-1}[Y_{t}-Y_{t-1}]+Z_{t}\leqslant Y_{t-1}-B+Z_{t}\enspace.

By repeatedly applying the above inequality and dividing by t, we obtain \frac{1}{t}Y_{t}\leqslant-B+\frac{1}{t}Y_{0}+\frac{1}{t}\sum_{i=1}^{t}Z_{i}, where \lim_{t\to\infty}\frac{1}{t}Y_{0}=0 and \sum_{i=1}^{t}Z_{i} is a martingale. In light of the strong law of large numbers for martingales [15], if \sum_{t=1}^{\infty}\mathbb{E}[Z_{t}^{2}]/t^{2}<\infty, we have \lim_{t\to\infty}\frac{1}{t}\sum_{i=1}^{t}Z_{i}=0 almost surely. By the definition of V(\theta_{t}) and the working mechanism of the (1+1)\text{-ES}_{\kappa}, we have V(\theta_{i})-V(\theta_{i-1})\leqslant v\log(\alpha_{\uparrow}/\alpha_{\downarrow}). Since the conditional variance is bounded by the conditional second moment, \mathbb{E}[Z_{i}^{2}]=\mathbb{E}[(Y_{i}-\mathbb{E}_{i-1}[Y_{i}])^{2}]\leqslant\mathbb{E}[\max(-A,V(\theta_{i})-V(\theta_{i-1}))^{2}]\leqslant\max(A,v\log(\alpha_{\uparrow}/\alpha_{\downarrow}))^{2}, and the summability condition holds because \sum_{t=1}^{\infty}1/t^{2}<\infty. Hence, we have \limsup_{t\to\infty}\frac{1}{t}\log f_{\mu}(m_{t})\leqslant-B+\lim_{t\to\infty}\frac{1}{t}Y_{0}+\lim_{t\to\infty}\frac{1}{t}\sum_{i=1}^{t}Z_{i}=-B almost surely. Along with \lVert x-x^{*}\rVert\leqslant 2C_{u}f_{\mu}(x), we obtain Equation 26.

4.7 Wrap-up of the Results: Global Linear Convergence

As a corollary of the lower bound from Theorem 4.10, the upper bound from Theorem 4.6, Proposition 4.12 stating almost sure linear convergence, and the fact that the different assumptions discussed in Section 2.3 imply A1 and A2, we summarize our linear convergence results in the following theorem.

Theorem 4.14 (Global Linear Convergence).

We consider the (1+1)-ESκ\text{-ES}_{\kappa} optimizing an objective function ff. Suppose either

  • (a)

    ff satisfies A1 and A2 for a=0a=0, plimit>ptargetp^{\mathrm{limit}}>p^{\mathrm{target}}, and m0𝒳0bm_{0}\in\mathcal{X}_{0}^{b}; or

  • (b)

    ff satisfies either A3 or A4, ptarget<1/2p^{\mathrm{target}}<1/2, and m0dm_{0}\in\mathbb{R}^{d}.

Then, for any \sigma_{0}>0 and \Sigma_{0}\in\mathcal{S}_{\kappa}, the expected hitting time \mathbb{E}[T_{\epsilon}] of \lVert m_{t}-x^{*}\rVert to [0,\epsilon] is \Theta\big{(}\log(\lVert m_{0}-x^{*}\rVert/\epsilon)\big{)} for all \epsilon>0. Moreover, both f_{\mu}(m_{t}) and \lVert m_{t}-x^{*}\rVert converge linearly almost surely, i.e.

Pr[lim supt1tlogfμ(mt)B]=Pr[lim supt1tlogmtxB]=1,\Pr\left[\limsup_{t\to\infty}\frac{1}{t}\log f_{\mu}(m_{t})\leqslant-B\right]=\Pr\left[\limsup_{t\to\infty}\frac{1}{t}\log\lVert m_{t}-x^{*}\rVert\leqslant-B\right]=1\enspace,

where B>0B>0 is as defined in Proposition 4.3. The convergence rate exp(C)\exp(-C) is thus upper-bounded by exp(B)\exp(-B).

4.8 Tightness in the Sphere Function Case

Now we consider a specific convex quadratic function, namely the sphere function f(x)=\frac{1}{2}\lVert x\rVert^{2}, for which the spatial suboptimality function equals f_{\mu}(x)=V_{d}\lVert x\rVert. In Theorem 4.14 we have established that the expected hitting time of a ball of radius \epsilon for the (1+1)\text{-ES}_{\kappa} is \Theta(\log(\lVert m_{0}-x^{*}\rVert/\epsilon)). Yet, this statement does not give information on how the constants hidden in the \Theta-notation scale with the dimension. In particular, the convergence rate of the algorithm is upper-bounded by \exp(-B), where B is given in (24), see Theorem 4.6. In this section, we estimate precisely the scaling of B in Proposition 4.3 with respect to the dimension and compare it with the general lower bound of the expected first hitting time given in Theorem 4.10. We then conclude that the bound is tight with respect to the scaling with d in the case of the sphere function.

Let us assume κ=1\kappa=1, that is, we consider the (1+1)-ES without covariance matrix adaptation (Σ=I\Sigma=I). Then, p(a,b]lower(σ¯)=p(a,b]upper(σ¯)=prsucc(σ¯;m,Σ)p^{\mathrm{lower}}_{(a,b]}(\bar{\sigma})=p^{\mathrm{upper}}_{(a,b]}(\bar{\sigma})=p_{r}^{\mathrm{succ}}(\bar{\sigma};m,\Sigma), where the right-most side is independent of mm and Σ\Sigma as described in Lemma 2.4. This means that the success probability is solely controlled by the normalized step-size σ¯\bar{\sigma}.

The following proposition states that the convergence speed is \Omega(1/d); hence the expected first hitting time scales as O(d\log(f_{\mu}(m_{0})/\epsilon)). The proof is provided in Section B.6.

Proposition 4.15.

For A=1/dA=1/d, ptargetΘ(1)p_{\mathrm{target}}\in\Theta(1) and log(α/α)ω(1/d)\log(\alpha_{\uparrow}/\alpha_{\downarrow})\in\omega(1/d), we have BΩ(1/d)B\in\Omega(1/d).

The two conditions on the choice of \alpha_{\uparrow} and \alpha_{\downarrow}, namely p_{\mathrm{target}}=\log(1/\alpha_{\downarrow})/\log(\alpha_{\uparrow}/\alpha_{\downarrow})\in\Theta(1) and \log(\alpha_{\uparrow}/\alpha_{\downarrow})\in\omega(1/d), are understood as follows. The first condition requires the target success probability p_{\mathrm{target}} to be independent of d. In the 1/5 success rule, \alpha_{\uparrow} and \alpha_{\downarrow} are set so that p_{\mathrm{target}}=1/5 independently of d. The second condition requires the step-size increase and decrease factors to satisfy \log(\alpha_{\uparrow})\in\omega(1/d) and \log(1/\alpha_{\downarrow})\in\omega(1/d). Note that on the sphere function the normalized step-size \bar{\sigma}\propto\sigma/\lVert m-x^{*}\rVert is kept around a constant value during the search, which implies that the convergence speeds of \lVert m-x^{*}\rVert and \sigma must agree. Therefore, the speed of the step-size adaptation must not be too small if the \Theta(d) scaling of the expected first hitting time is to be achieved.
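For illustration (our example; the proposition does not prescribe a particular choice), take \alpha_{\uparrow}=\exp(4/\sqrt{d}) and \alpha_{\downarrow}=\exp(-1/\sqrt{d}). Then p_{\mathrm{target}}=\log(1/\alpha_{\downarrow})/\log(\alpha_{\uparrow}/\alpha_{\downarrow})=(1/\sqrt{d})/(5/\sqrt{d})=1/5\in\Theta(1) and \log(\alpha_{\uparrow}/\alpha_{\downarrow})=5/\sqrt{d}\in\omega(1/d), so both conditions of Proposition 4.15 are satisfied.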

Proposition 4.15 and Theorem 4.6 imply \mathbb{E}[T_{\epsilon}]\in O(d\log(\lVert m_{0}\rVert/\epsilon)), and Theorem 4.10 implies \mathbb{E}[T_{\epsilon}]\in\Omega(d\log(\lVert m_{0}\rVert/\epsilon)). Together they yield \mathbb{E}[T_{\epsilon}]\in\Theta(d\log(\lVert m_{0}\rVert/\epsilon)). This result shows i) that the runtime of the (1+1)-ES on the sphere function is proportional to d as long as \log(\alpha_{\uparrow}/\alpha_{\downarrow})\in\omega(1/d), and ii) that our methodology can deliver a tight runtime bound in some cases. The result is formally stated as follows.

Theorem 4.16.

The (1+1)-ES (Algorithm 1) with κ=1\kappa=1 and ptarget<1/2p^{\mathrm{target}}<1/2 converges globally and linearly in terms of logmtx\log\lVert m_{t}-x^{*}\rVert from any starting point m0dm_{0}\in\mathbb{R}^{d}, σ0>0\sigma_{0}>0, and Σ0=I\Sigma_{0}=I on any function f(x)=g(xx)f(x)=g(\lVert x-x^{*}\rVert), where gg is a strictly increasing function. Moreover, if ptargetΘ(1)p^{\mathrm{target}}\in\Theta(1) and log(α/α)ω(1/d)\log(\alpha_{\uparrow}/\alpha_{\downarrow})\in\omega(1/d), the expected first hitting time TϵT_{\epsilon} of logmtx\log\lVert m_{t}-x^{*}\rVert to (,log(ϵ)](-\infty,\log(\epsilon)] is Θ(dlog(m0/ϵ))\Theta(d\log(\lVert m_{0}\rVert/\epsilon)) and the almost sure convergence rate is upper-bounded by exp(Θ(1/d))\exp(-\Theta(1/d)).

Since the lower bound holds for an arbitrary σ\sigma-adaptation mechanism, the above result not only implies that our upper bound is tight, but it also implies that the success-based σ\sigma-control mechanism achieves the best possible convergence rate except for a constant factor on the spherical function.
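The \Theta(d\log(\lVert m_{0}\rVert/\epsilon)) scaling is easy to probe empirically. The following self-contained Python sketch (an illustration under the parameter choice of Appendix A, not part of the formal argument) measures the hitting time of the (1+1)-ES with \kappa=1 on the sphere function and prints the ratio T_{\epsilon}/(d\log(\lVert m_{0}\rVert/\epsilon)), which should stabilize around a constant as d grows:

```python
import numpy as np

def hitting_time(d, eps=1e-8, seed=0):
    """Iterations until ||m|| <= eps on the sphere function, starting
    from ||m0|| = 1, for the (1+1)-ES with the one-fifth success rule."""
    rng = np.random.default_rng(seed)
    alpha_up, alpha_down = np.exp(4.0 / d), np.exp(-1.0 / d)
    m, sigma, t = np.ones(d) / np.sqrt(d), 1.0, 0
    while m @ m > eps ** 2:
        x = m + sigma * rng.standard_normal(d)
        if x @ x <= m @ m:               # success on f(x) = ||x||^2
            m, sigma = x, sigma * alpha_up
        else:
            sigma *= alpha_down
        t += 1
    return t

for d in (5, 10, 20, 40, 80):
    print(d, hitting_time(d) / (d * np.log(1.0 / 1e-8)))  # roughly constant
```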

5 Discussion

We have established the almost sure global linear convergence of the (1+1)\text{-ES}_{\kappa}, together with a bound on the expected hitting time of an \epsilon-neighborhood of the solution. Assumption A1 has been the key to obtaining the expected first hitting time bound of the (1+1)\text{-ES}_{\kappa} in the form of (9). The convergence results hold on a wide class of functions, which includes

  • (i)

    strongly convex functions with Lipschitz continuous gradient, on which linear convergence of numerical optimization algorithms is usually analyzed,

  • (ii)

    continuously differentiable positively homogeneous functions, on which previous linear convergence results have been established, and

  • (iii)

    functions with non-smooth level sets as illustrated in Figure 2.

Because the analyzed algorithms are invariant to strictly monotonic transformations of the objective function, all results that hold on f also hold on g\circ f, where g:{\rm Im}(f)\to\mathbb{R} is a strictly increasing transformation, which can thus introduce discontinuities into the objective function. In contrast to the previous result establishing the convergence of CMA-ES [18] by adding a step that enforces a sufficient decrease (which works well for direct search methods but is unnatural for ESs), we did not need to modify the adaptation mechanism of the (1+1)-ES to achieve our convergence proofs. We believe that this is crucial, since it allows our analysis to reflect the main mechanism that makes the algorithm work well in practice.

Theorem 4.16 proves that we can derive a tight convergence rate with Proposition 4.3 on the sphere function in the case where κ=1\kappa=1, i.e., without covariance matrix adaptation. This partially supports the utility of our methodology. However, its derivation relies on the fact that both the level sets of the objective function and the equal-density curves of the sampling distribution are isotropic, and hence does not generalize immediately. Moreover, the lower bound (Theorem 4.10) seems to be loose even for κ=1\kappa=1 on convex quadratic functions, where we empirically observe that the logarithmic convergence rate scales like Θ(1/Cond(f))\Theta(1/\operatorname{Cond}(\nabla\nabla f)), see Figure 1, while its dependency on the dimension is tight.

A better lower bound on the expected first hitting time and a handy way to estimate the convergence rate are relevant directions for future work. Further directions are as follows:

Proving linear convergence of the (1+1)\text{-ES}_{\kappa} does not reveal the benefits of the (1+1)\text{-ES}_{\kappa} over the (1+1)-ES without covariance matrix adaptation. The motivation for introducing the covariance matrix is to improve the convergence rate and to broaden the class of functions on which linear convergence is exhibited. Neither is achieved in this paper.

On convex quadratic functions, we empirically observe that the covariance matrix approaches a stable distribution that is closely concentrated around the inverse Hessian up to a scalar factor, and the convergence speed on all convex quadratic functions is equal to that on the sphere function (see Figure 1). This behavior is not described by our result.

Covariance matrix adaptation is also important for optimizing functions with non-smooth level sets. On continuously differentiable functions, we can always set \alpha_{\uparrow} and \alpha_{\downarrow} so that p=\frac{\log(1/\alpha_{\downarrow})}{\log(\alpha_{\uparrow}/\alpha_{\downarrow})}<p^{\mathrm{limit}}=1/2. This is the rationale behind the 1/5 success rule, where p=1/5. Indeed, p=1/5 is known to approximate the optimal situation on the sphere function, where the expected one-step progress is maximized [51]. Therefore, one does not need to tune these parameters in a problem-specific manner. However, if the objective is not continuously differentiable and the level sets are non-smooth, then p^{\mathrm{limit}} is in general smaller than 1/2. For example, it can be as low as p^{\mathrm{limit}}=1/2^{d} on f(x)=\lVert x\rVert_{\infty}=\max_{i=1,\dots,d}\lvert x_{i}\rvert. Without an appropriate adaptation of the covariance matrix the success probability will be smaller than p=1/5, and one must tune \alpha_{\uparrow} and \alpha_{\downarrow} in order to converge to the optimum, which requires information about p^{\mathrm{limit}}. By adapting the covariance matrix appropriately, the success probability can be increased arbitrarily close to 1/2 (by elongating steps in the direction of the success domain), and \alpha_{\uparrow} and \alpha_{\downarrow} do not require tuning.
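The 1/2^{d} figure can be checked numerically: near a corner of the cube-shaped level set of f(x)=\lVert x\rVert_{\infty}, a small isotropic step succeeds only if all d coordinates move inward, which happens with probability about (1/2)^{d}. A small Monte Carlo sketch (illustrative; m=(1,\dots,1) is a corner of the level set \{f\leqslant 1\}):

```python
import numpy as np

def success_prob(d, sigma=1e-3, n=200_000, seed=1):
    """Estimate Pr[f(m + sigma*z) < f(m)] for f(x) = max_i |x_i| with
    m = (1,...,1) and z standard normal; sigma is small on purpose."""
    rng = np.random.default_rng(seed)
    y = 1.0 + sigma * rng.standard_normal((n, d))  # coordinates of m + sigma*z
    return float(np.mean(np.max(np.abs(y), axis=1) < 1.0))

for d in (2, 4, 8):
    print(d, success_prob(d), 0.5 ** d)  # estimate vs. the predicted 1/2^d
```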

To achieve a reasonable convergence rate bound and broaden the class of functions on which linear convergence is exhibited, one needs to find another potential function VV that may penalize a high condition number Cond(f(mt)Σt)\operatorname{Cond}(\nabla\nabla f(m_{t})\Sigma_{t}) and replace the definitions of pupperp^{\mathrm{upper}} and plowerp^{\mathrm{lower}} accordingly. This point is left for future work.

Acknowledgement

We gratefully acknowledge support by Dagstuhl seminar 17191 “Theory of Randomized Search Heuristics”. We would like to thank Per Kristian Lehre, Carsten Witt, and Johannes Lengler for valuable discussions and advice on drift theory. Y. A. is supported by JSPS KAKENHI Grant Number 19H04179.

References

  • [2] M.A. Abramson, C. Audet, J.E. Dennis Jr, and S. Le Digabel, OrthoMADS: A deterministic MADS instance with orthogonal directions, SIAM Journal on Optimization, 20 (2009), pp. 948–966.
  • [3] Y. Akimoto, Analysis of a natural gradient algorithm on monotonic convex-quadratic-composite functions, in GECCO, 2012, pp. 1293–1300.
  • [4] Y. Akimoto, A. Auger, and T. Glasmachers, Drift theory in continuous search spaces: expected hitting time of the (1+ 1)-ES with 1/5 success rule, in GECCO, 2018, pp. 801–808.
  • [5] S. Alvernaz and J. Togelius, Autoencoder-augmented neuroevolution for visual doom playing, in IEEE CIG, 2017, pp. 1–8.
  • [6] D. V. Arnold and N. Hansen, Active covariance matrix adaptation for the (1+1)-CMA-ES, in GECCO, 2010, pp. 385–392.
  • [7] C. Audet and J.E. Dennis Jr, Mesh adaptive direct search algorithms for constrained optimization, SIAM Journal on Optimization, 17 (2006), pp. 188–217.
  • [8] A. Auger and N. Hansen, Linear convergence on positively homogeneous functions of a comparison based step-size adaptive randomized search: the (1+1)-ES with generalized one-fifth success rule, 2013, https://arxiv.org/abs/1310.8397.
  • [9] A. Auger and N. Hansen, Linear convergence of comparison-based step-size adaptive randomized search via stability of Markov chains, SIAM Journal on Optimization, 26 (2016), pp. 1589–1624.
  • [10] A. S. Bandeira, K. Scheinberg, and L. N. Vicente, Convergence of trust-region methods based on probabilistic models, SIAM Journal on Optimization, 24 (2014), pp. 1238–1264.
  • [11] B. Baritompa and M. Steel, Bounds on absorption times of directionally biased random sequences, Random Structures & Algorithms, 9 (1996), pp. 279–293.
  • [12] P. Bontrager, A. Roy, J. Togelius, N. Memon, and A. Ross, DeepMasterPrints: Generating MasterPrints for Dictionary Attacks via Latent Variable Evolution, in IEEE BTAS, 2018, pp. 1–9.
  • [13] S. Bubeck, Convex optimization: Algorithms and complexity, 2014, https://arxiv.org/abs/1405.4980.
  • [14] C. Cartis and K. Scheinberg, Global convergence rate analysis of unconstrained optimization methods based on probabilistic models, Mathematical Programming, 169 (2018), pp. 337–375.
  • [15] Y. S. Chow, On a strong law of large numbers for martingales, Ann. Math. Statist., 38 (1967), p. 610.
  • [16] A. R. Conn, K. Scheinberg, and L. N. Vicente, Introduction to Derivative-Free Optimization, SIAM, 2009.
  • [17] L. Devroye, The compound random search, in International Symposium on Systems Engineering and Analysis, 1972, pp. 195–110.
  • [18] Y. Diouane, S. Gratton, and L. N. Vicente, Globally convergent evolution strategies, Mathematical Programming, 152 (2015), pp. 467–490.
  • [19] B. Doerr and L. A. Goldberg, Adaptive drift analysis, Algorithmica, 65 (2013), pp. 224–250.
  • [20] B. Doerr, D. Johannsen, and C. Winzen, Multiplicative drift analysis, Algorithmica, 64 (2012), pp. 673–697.
  • [21] Y. Dong, H. Su, B. Wu, Z. Li, W. Liu, T. Zhang, and J. Zhu, Efficient decision-based black-box adversarial attacks on face recognition, in CVPR, 2019.
  • [22] G. Fujii, M. Takahashi, and Y. Akimoto, CMA-ES-based structural topology optimization using a level set boundary expression—application to optical and carpet cloaks, Computer Methods in Applied Mechanics and Engineering, 332 (2018), pp. 624 – 643.
  • [23] T. Geijtenbeek, M. Van De Panne, and A. F. Van Der Stappen, Flexible muscle-based locomotion for bipedal creatures, ACM Transactions on Graphics (TOG), 32 (2013), pp. 1–11.
  • [24] T. Glasmachers, Global convergence of the (1+1) Evolution Strategy to a critical point, Evolutionary Computation, 28 (2020), pp. 27–53.
  • [25] D. Golovin, J. Karro, G. Kochanski, C. Lee, X. Song, and Q. Zhang, Gradientless descent: High-dimensional zeroth-order optimization, in ICLR, 2020.
  • [26] S. Gratton, C. W. Royer, L. N. Vicente, and Z. Zhang, Direct search based on probabilistic descent, SIAM Journal on Optimization, 25 (2015), pp. 1515–1541.
  • [27] S. Gratton, C. W. Royer, L. N. Vicente, and Z. Zhang, Complexity and global rates of trust-region methods based on probabilistic models, IMA Journal of Numerical Analysis, 38 (2017), pp. 1579–1597.
  • [28] D. Ha and J. Schmidhuber, Recurrent world models facilitate policy evolution, in NeurIPS, 2018, pp. 2450–2462.
  • [29] B. Hajek, Hitting-time and occupation-time bounds implied by drift analysis with applications, Advances in Applied probability, 14 (1982), pp. 502–525.
  • [30] N. Hansen, A. Auger, R. Ros, S. Finck, and P. Pošík, Comparing results of 31 algorithms from the black-box optimization benchmarking bbob-2009, in GECCO, 2010, pp. 1689–1696.
  • [31] N. Hansen and A. Ostermeier, Completely derandomized self-adaptation in evolution strategies, Evolutionary Computation, 9 (2001), pp. 159–195.
  • [32] J. He and X. Yao, Drift analysis and average time complexity of evolutionary algorithms, Artificial intelligence, 127 (2001), pp. 57–85.
  • [33] J. He and X. Yao, A study of drift analysis for estimating computation time of evolutionary algorithms, Natural Computing, 3 (2004), pp. 21–35.
  • [34] J. Jägersküpper, Analysis of a simple evolutionary algorithm for minimization in Euclidean spaces, Automata, Languages and Programming, 2003, pp. 188–188.
  • [35] J. Jägersküpper, Rigorous runtime analysis of the (1+1)-ES: 1/5-rule and ellipsoidal fitness landscapes, in FOGA, 2005, pp. 260–281.
  • [36] J. Jägersküpper, How the (1+1)-ES using isotropic mutations minimizes positive definite quadratic forms, Theoretical Computer Science, 361 (2006), pp. 38–56.
  • [37] J. Jägersküpper, Algorithmic analysis of a basic evolutionary algorithm for continuous optimization, Theoretical Computer Science, 379 (2007), pp. 329–347.
  • [38] S. Kern, S. D. Müller, N. Hansen, D. Büche, J. Ocenasek, and P. Koumoutsakos, Learning probability distributions in continuous evolutionary algorithms–a comparative review, Natural Computing, 3 (2004), pp. 77–112.
  • [39] J. Konečnỳ and P. Richtárik, Simple complexity analysis of simplified direct search, 2014, https://arxiv.org/abs/1410.0390.
  • [40] I. Kriest, V. Sauerland, S. Khatiwala, A. Srivastav, and A. Oschlies, Calibrating a global three-dimensional biogeochemical ocean model (mops-1.0), Geoscientific Model Development, 10 (2017), p. 127.
  • [41] J. Larson, M. Menickelly, and S. M. Wild, Derivative-free optimization methods, Acta Numerica, 28 (2019), pp. 287–404.
  • [42] P. K. Lehre and C. Witt, General drift analysis with tail bounds, 2013, https://arxiv.org/abs/1307.2559.
  • [43] J. Lengler, Drift analysis, in Theory of Evolutionary Computation, Springer, 2020, pp. 89–131.
  • [44] J. Lengler and A. Steger, Drift analysis and evolutionary algorithms revisited, 2016, https://arxiv.org/abs/1608.03226.
  • [45] P. MacAlpine, S. Barrett, D. Urieli, V. Vu, and P. Stone, Design and optimization of an omnidirectional humanoid walk: A winning approach at the RoboCup 2011 3D simulation competition, in AAAI, 2012.
  • [46] B. Mitavskiy, J. Rowe, and C. Cannings, Theoretical analysis of local search strategies to optimize network communication subject to preserving the total number of links, International Journal of Intelligent Computing and Cybernetics, 2 (2009), pp. 243–284.
  • [47] D. Morinaga and Y. Akimoto, Generalized drift analysis in continuous domain: linear convergence of (1+ 1)-ES on strongly convex functions with lipschitz continuous gradients, in FOGA, 2019, pp. 13–24.
  • [48] A. Nemirovski, Information-based complexity of convex programming, Lecture Notes, (1995).
  • [49] Y. Nesterov, Lectures on convex optimization, vol. 137, Springer, 2018.
  • [50] C. Paquette and K. Scheinberg, A stochastic line search method with convergence rate analysis, 2018, https://arxiv.org/abs/1807.07994.
  • [51] I. Rechenberg, Evolutionsstrategie: Optimierung technisher Systeme nach Prinzipien der biologischen Evolution, Frommann-Holzboog, 1973.
  • [52] I. Rechenberg, Evolutionsstrategie ’94, Frommann-Holzboog, 1994.
  • [53] L. M. Rios and N. V. Sahinidis, Derivative-free optimization: a review of algorithms and comparison of software implementations, Journal of Global Optimization, 56 (2013), pp. 1247–1293.
  • [54] M. Schumer and K. Steiglitz, Adaptive step size random search, Automatic Control, IEEE Transactions on, 13 (1968), pp. 270–276.
  • [55] S. U. Stich, C. L. Muller, and B. Gartner, Optimization of convex functions with random pursuit, SIAM Journal on Optimization, 23 (2013), pp. 1284–1309.
  • [56] S. U. Stich, C. L. Müller, and B. Gärtner, Variable metric random pursuit, Mathematical Programming, 156 (2016), pp. 549–579.
  • [57] J. Uhlendorf, A. Miermont, T. Delaveau, G. Charvin, F. Fages, S. Bottani, G. Batt, and P. Hersen, Long-term model predictive control of gene expression at the population and single-cell levels, Proceedings of the National Academy of Sciences, 109 (2012), pp. 14271–14276.
  • [58] V. Volz, J. Schrum, J. Liu, S. M. Lucas, A. Smith, and S. Risi, Evolving Mario levels in the latent space of a deep convolutional generative adversarial network, in GECCO, 2018, pp. 221–228.

Appendix A Some Numerical Results

We present experiments with five algorithms on two convex quadratic functions. We compare the (1+1)-ES, the (1+1)-CMA-ES, simplified direct search [39], random pursuit [55], and gradientless descent [25].

All algorithms were started at the initial search point x_{0}=\frac{1}{\sqrt{d}}(1,\dots,1)\in\mathbb{R}^{d}. We implemented the algorithms as follows, with their parameters tuned where necessary. The ES always uses the setting \alpha_{\uparrow}=\exp(4/d) and \alpha_{\downarrow}=\alpha_{\uparrow}^{-1/4} for step-size adaptation. We set the constant c in the sufficient decrease condition of simplified direct search to \frac{1}{10}, and we employed the standard basis vectors as well as their negatives as candidate directions; in each iteration we looped over the set of directions in random order, which greatly boosted performance over a fixed order. Random pursuit was implemented with a golden section line search in the range [-2\sigma,2\sigma] with a rather loose target precision of \sigma/2, where \sigma is either the initial step size or the length of the previous step. For gradientless descent we used the initial step size as the maximal step size and defined a target precision of 10^{-10}; this target is reached by the ES in all cases. The experiments are designed to demonstrate several different effects (a sketch of the ellipsoid function used in (d) is given after this list):

  • (a) We perform all experiments in d=10 and d=50 dimensions to investigate dimension-dependent effects.

  • (b) We investigate best-case performance by running the algorithms on the sphere function \lVert x\rVert^{2}, i.e., on the separable convex quadratic function with minimal condition number. The initial step size is set to \sigma_{0}=1. All algorithms have a budget of 100d function evaluations.

  • (c) We investigate the dependency of the performance on initial parameter settings by repeating the same experiment as above, but with an initial step size of \sigma_{0}=\frac{1}{1000}. All algorithms have a budget of 700d function evaluations.

  • (d) We investigate the dependence on problem difficulty by running the algorithms on an ellipsoid problem with a moderate condition number of \kappa_{f}=100. The eigenvalues of the Hessian are evenly distributed on a log-scale. We use \sigma_{0}=1 as in the first experiment. All algorithms have a budget of 500d function evaluations.
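As referenced in item (d) above, the test function can be realized as follows; this is our illustrative sketch, since the text only fixes the condition number and the log-uniform spread of the Hessian eigenvalues (the exact function of Figure 1 is defined there):

```python
import numpy as np

def ellipsoid(x, cond=100.0):
    """Convex quadratic f(x) = sum_i w_i * x_i^2 whose Hessian eigenvalues
    are evenly spread on a log scale between 1 and `cond` (d >= 2)."""
    x = np.asarray(x, dtype=float)
    d = len(x)
    w = cond ** (np.arange(d) / (d - 1))  # w_0 = 1, ..., w_{d-1} = cond
    return float(np.sum(w * x ** 2))
```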

Figure 3: Comparison of the (1+1)-ES with and without covariance matrix adaptation with three well-analyzed derivative-free optimization algorithms on two convex quadratic functions. The left column of plots shows the performance on the sphere function \lVert x\rVert^{2} in dimensions 10 (top) and 50 (bottom). The middle column shows the same problem, but the initial step size is smaller by a factor of 1000 (and the horizontal axis differs), simulating that the distance to the optimum was under-estimated. The right column shows the performance on the ellipsoid function (defined in Figure 1). The plots show the evolution of the best-so-far function value (on a logarithmic scale), with five individual runs (thin curves) as well as median performance (bold curves).

The experimental results are presented in Figure 3.

Interpretation. We observe only moderate dimension-dependent effects, besides the expected linear increase of the runtime. We see robust performance of the ES, in particular with covariance matrix adaptation. The second experiment demonstrates the practical importance of the ability to grow the step size: the ES is essentially unaffected by wrong initial parameter settings, while gradientless descent and simplified direct search are strongly affected (which can be understood directly from the algorithms themselves). This property does not show up in convergence rates and is therefore often (but not always) neglected in algorithm design. The last experiment clearly demonstrates the benefit of variable-metric methods like CMA-ES. It should be noted that variable-metric techniques can be incorporated into most existing algorithms. This is rarely done, though, with random pursuit being a notable exception [56].

Appendix B Proofs

B.1 Proof of Lemma 2.10

Since f_{\mu} is invariant to g, without loss of generality we assume f(x)=h(x)-h(x^{*}) in this proof. Inequality Eq. 7 implies that f(y)\leqslant f(x)\Rightarrow(L_{\ell}/2)\lVert y-x^{*}\rVert^{2}\leqslant f(x), meaning that \{y:f(y)\leqslant f(x)\}\subseteq\bar{\mathcal{B}}\Big{(}x^{*},\sqrt{\frac{f(x)}{L_{\ell}/2}}\Big{)}. Since f_{\mu}(x) is the d-th root of the volume of the left-hand side of the above relation, we find f_{\mu}(x)\leqslant\mu^{\frac{1}{d}}\Big{(}\bar{\mathcal{B}}\Big{(}x^{*},\sqrt{\frac{f(x)}{L_{\ell}/2}}\Big{)}\Big{)}=V_{d}\sqrt{\frac{f(x)}{L_{\ell}/2}}. Analogously, we obtain \mathcal{B}\Big{(}x^{*},\sqrt{\frac{f(x)}{L_{u}/2}}\Big{)}\subseteq\{y:f(y)<f(x)\} and f_{\mu}(x)\geqslant V_{d}\sqrt{\frac{f(x)}{L_{u}/2}}. From these inequalities, we obtain \{y:f(y)\leqslant f(x)\}\subseteq\bar{\mathcal{B}}\Big{(}x^{*},\sqrt{\frac{L_{u}}{L_{\ell}}}\frac{f_{\mu}(x)}{V_{d}}\Big{)} and \mathcal{B}\Big{(}x^{*},\sqrt{\frac{L_{\ell}}{L_{u}}}\frac{f_{\mu}(x)}{V_{d}}\Big{)}\subseteq\{y:f(y)<f(x)\}. This implies A1 for \mathcal{X}_{0}^{\infty}. A2 is immediately implied by Proposition 2.8. This completes the proof.

B.2 Proof of Lemma 2.11

We first prove that A1 holds for a=0a=0 and b=b=\infty with Cu=sup{xx:fμ(x)=1}C_{u}=\sup\{\lVert x-x^{*}\rVert:f_{\mu}(x)=1\} and C=inf{xx:fμ(x)=1}C_{\ell}=\inf\{\lVert x-x^{*}\rVert:f_{\mu}(x)=1\} and they are finite.

It is easy to see that the spatial suboptimality function f_{\mu}(x) is proportional to h(x)-h(x^{*}). Let f_{\mu}(x)=c(h(x)-h(x^{*})) for some c>0. Then, f_{\mu} is also a homogeneous function. Since it is homogeneous, A1 reduces to the existence of an open and a closed ball with radii C_{\ell} and C_{u}, respectively, satisfying the conditions described in the assumption for f_{\mu}(m)=1. Such constants are obtained by C_{u}=\sup\{\lVert x-x^{*}\rVert:f_{\mu}(x)=1\} and C_{\ell}=\inf\{\lVert x-x^{*}\rVert:f_{\mu}(x)=1\}.

Due to the continuity of f there exists an open ball \mathcal{B} around x^{*} such that h(x)<h(x^{*})+1/c for all x\in\mathcal{B}. Then, it holds that f_{\mu}(x)<1 for all x\in\mathcal{B}. This implies that C_{\ell} is no smaller than the radius of \mathcal{B}, which is positive. Hence, C_{\ell}>0.

We show the finiteness of C_{u} by a contradiction argument. Suppose C_{u}=\infty. Then, there is a direction v such that f_{\mu}(x^{*}+Mv)\leqslant 1 for an arbitrarily large M>0. Since f_{\mu} is homogeneous, we have f_{\mu}(x^{*}+v)\leqslant 1/M, and this must hold for any M>0. This implies f_{\mu}(x^{*}+v)=c(h(x^{*}+v)-h(x^{*}))=0, which contradicts the assumption that x^{*} is the unique global optimum. Hence, C_{u}<\infty.

The above argument proves that A1 holds with the above constants for a=0a=0 and b=b=\infty. Proposition 2.8 proves A2.

B.3 Proof of Proposition 4.1

For a given m𝒳abm\in\mathcal{X}_{a}^{b}, there is a closed ball ¯u\bar{\mathcal{B}}_{u} such that S0(m)¯uS_{0}(m)\subseteq\bar{\mathcal{B}}_{u}, see Figure 2. We have

p(a,b]upper(σ¯)\textstyle p^{\mathrm{upper}}_{(a,b]}(\bar{\sigma}) =supm𝒳absupΣ𝒮κS0(m)φ(x;m,(fμ(m)σ¯)2Σ)𝑑x\textstyle=\sup_{m\in\mathcal{X}_{a}^{b}}\sup_{\Sigma\in\mathcal{S}_{\kappa}}\int_{S_{0}(m)}\varphi\big{(}x;m,\left(f_{\mu}(m)\bar{\sigma}\right)^{2}\Sigma\big{)}dx
supm𝒳absupΣ𝒮κ¯uφ(x;m,(fμ(m)σ¯)2Σ)𝑑x(1).\textstyle\leqslant\sup_{m\in\mathcal{X}_{a}^{b}}\sup_{\Sigma\in\mathcal{S}_{\kappa}}\underbrace{\int_{\bar{\mathcal{B}}_{u}}\varphi\big{(}x;m,\left(f_{\mu}(m)\bar{\sigma}\right)^{2}\Sigma\big{)}dx}_{(*1)}\enspace.

The integral is maximized if the ball is centered at mm. By a variable change (xxmx\leftarrow x-m),

(1)\textstyle(*1) xCufμ(m)φ(x;0,(fμ(m)σ¯)2Σ)𝑑x=xCu/σ¯φ(x;0,Σ)𝑑x\textstyle\leqslant\int_{\lVert x\rVert\leqslant C_{u}f_{\mu}(m)}\varphi\big{(}x;0,\left(f_{\mu}(m)\bar{\sigma}\right)^{2}\Sigma\big{)}dx=\int_{\lVert x\rVert\leqslant C_{u}/\bar{\sigma}}\varphi(x;0,\Sigma)dx
κd/2Φ(¯(0,Cuσ¯κ1/2);0,I).\textstyle\leqslant\kappa^{d/2}\Phi\left(\bar{\mathcal{B}}\left(0,\frac{C_{u}}{\bar{\sigma}\kappa^{1/2}}\right);0,\mathrm{I}\right)\enspace.

Here we used \Phi\big{(}\bar{\mathcal{B}}(0,r);0,\Sigma\big{)}\leqslant\kappa^{d/2}\Phi\left(\bar{\mathcal{B}}\left(0,\kappa^{-1/2}r\right);0,\mathrm{I}\right) for any r>0, which is proven in Lemma B.1 below. The right-most side (RMS) of the above inequality is independent of m. This proves Eq. 18.

Similarly, there are balls \mathcal{B}_{\ell} and ¯u\bar{\mathcal{B}}_{u} such that S0(m)¯u\mathcal{B}_{\ell}\subseteq S_{0}(m)\subseteq\bar{\mathcal{B}}_{u}. We have

p(a,b]lower(σ¯)\textstyle p^{\mathrm{lower}}_{(a,b]}(\bar{\sigma}) =infm𝒳abinfΣ𝒮κS0(m)φ(x;m,(fμ(m)σ¯)2Σ)𝑑x\textstyle=\inf_{m\in\mathcal{X}_{a}^{b}}\inf_{\Sigma\in\mathcal{S}_{\kappa}}\int_{S_{0}(m)}\varphi\big{(}x;m,\left(f_{\mu}(m)\bar{\sigma}\right)^{2}\Sigma\big{)}dx
infm𝒳abinfΣ𝒮κφ(x;m,(fμ(m)σ¯)2Σ)𝑑x(2).\textstyle\geqslant\inf_{m\in\mathcal{X}_{a}^{b}}\inf_{\Sigma\in\mathcal{S}_{\kappa}}\underbrace{\int_{\mathcal{B}_{\ell}}\varphi\big{(}x;m,\left(f_{\mu}(m)\bar{\sigma}\right)^{2}\Sigma\big{)}dx}_{(*2)}\enspace.

The integral is minimized if the ball \mathcal{B}_{\ell} lies on the opposite side of m within the ball \bar{\mathcal{B}}_{u}, see Figure 2. By a variable change (moving m to the origin) and letting e_{m}=m/\lVert m\rVert,

(2)\textstyle(*2) x((2CuC)fμ(m))emCfμ(m)φ(x;0,(fμ(m)σ¯)2Σ)𝑑x\textstyle\geqslant\int_{\lVert x-((2C_{u}-C_{\ell})f_{\mu}(m))e_{m}\rVert\leqslant C_{\ell}f_{\mu}(m)}\varphi\big{(}x;0,\left(f_{\mu}(m)\bar{\sigma}\right)^{2}\Sigma\big{)}dx
=x((2CuC)/σ¯)emC/σ¯φ(x;0,Σ)𝑑x\textstyle=\int_{\lVert x-((2C_{u}-C_{\ell})/\bar{\sigma})e_{m}\rVert\leqslant C_{\ell}/\bar{\sigma}}\varphi(x;0,\Sigma)dx
κd/2Φ(¯(((2CuC)κ1/2σ¯)em,Cκ1/2σ¯);0,I).\textstyle\geqslant\kappa^{-d/2}\Phi\left(\bar{\mathcal{B}}\left(\left(\frac{(2C_{u}-C_{\ell})\kappa^{1/2}}{\bar{\sigma}}\right)e_{m},\frac{C_{\ell}\kappa^{1/2}}{\bar{\sigma}}\right);0,\mathrm{I}\right)\enspace.

Here we used Φ(¯(c,r);0,Σ)κd/2Φ(¯(κ1/2c,κ1/2r);0,I)\Phi\big{(}\bar{\mathcal{B}}(c,r);0,\Sigma\big{)}\geqslant\kappa^{-d/2}\Phi\big{(}\bar{\mathcal{B}}(\kappa^{1/2}c,\kappa^{1/2}r);0,\mathrm{I}\big{)} for any cdc\in\mathbb{R}^{d} and r>0r>0 (Lemma B.1). The RMS of the above inequality is independent of mm as its value is constant over all unit vectors eme_{m}. Replacing eme_{m} with e1e_{1}, we have Eq. 19.

Lemma B.1.

For all Σ𝒮κ\Sigma\in\mathcal{S}_{\kappa}, κd/2φ(x;0,κ1I)φ(x;0,Σ)κd/2φ(x;0,κI)\kappa^{-d/2}\varphi\left(x;0,\kappa^{-1}\mathrm{I}\right)\leqslant\varphi\big{(}x;0,\Sigma\big{)}\leqslant\kappa^{d/2}\varphi\left(x;0,\kappa\mathrm{I}\right) and κd/2Φ((κc,κr);0,I)Φ((c,r);0,Σ)κd/2Φ((c/κ,r/κ);0,I)\kappa^{-d/2}\Phi\left(\mathcal{B}(\sqrt{\kappa}c,\sqrt{\kappa}r);0,\mathrm{I}\right)\leqslant\Phi\big{(}\mathcal{B}(c,r);0,\Sigma\big{)}\leqslant\kappa^{d/2}\Phi\left(\mathcal{B}(c/\sqrt{\kappa},r/\sqrt{\kappa});0,\mathrm{I}\right).

Proof B.2.

For \Sigma\in\mathcal{S}_{\kappa}, we have \det(\Sigma)=1 and \operatorname{Cond}(\Sigma)=\lambda_{\max}(\Sigma)/\lambda_{\min}(\Sigma)\leqslant\kappa. Since \det(\Sigma)=1 and \det(\Sigma)=\prod_{i=1}^{d}\lambda_{i}(\Sigma), we have \lambda_{\max}(\Sigma)\geqslant 1\geqslant\lambda_{\min}(\Sigma). Therefore, we have \lambda_{\min}(\Sigma)\geqslant\lambda_{\max}(\Sigma)/\kappa\geqslant\kappa^{-1} and \lambda_{\max}(\Sigma)\leqslant\kappa\lambda_{\min}(\Sigma)\leqslant\kappa. Then we obtain \kappa^{-1}x^{\mathrm{T}}\mathrm{I}x\leqslant x^{\mathrm{T}}\Sigma^{-1}x\leqslant\kappa x^{\mathrm{T}}\mathrm{I}x. With this inequality we have

φ(x;0,Σ)=(2π)d/2exp(xTΣ1x/2)(2π)d/2exp(xTIx/(2κ))=κd/2(2πκ)d/2exp(xTIx/(2κ))=κd/2φ(x;0,κI).\textstyle\varphi\big{(}x;0,\Sigma\big{)}=(2\pi)^{-d/2}\exp(-x^{\mathrm{T}}\Sigma^{-1}x/2)\leqslant(2\pi)^{-d/2}\exp(-x^{\mathrm{T}}\mathrm{I}x/(2\kappa))\\ \textstyle=\kappa^{d/2}(2\pi\kappa)^{-d/2}\exp(-x^{\mathrm{T}}\mathrm{I}x/(2\kappa))=\kappa^{d/2}\varphi\big{(}x;0,\kappa\mathrm{I}\big{)}\enspace.

Analogously, we obtain φ(x;0,Σ)κd/2φ(x;0,κ1I)\varphi\big{(}x;0,\Sigma\big{)}\geqslant\kappa^{-d/2}\varphi\big{(}x;0,\kappa^{-1}\mathrm{I}\big{)}. Taking the integral over (c,r)\mathcal{B}(c,r), we obtain the second statement.
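As a quick numerical sanity check of Lemma B.1 (an illustration only; it uses SciPy and a randomly generated \Sigma with \det(\Sigma)=1 and \operatorname{Cond}(\Sigma)\leqslant\kappa):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(3)
d, kappa = 4, 10.0
# Random SPD Sigma with det(Sigma) = 1 and Cond(Sigma) <= kappa.
lam = kappa ** rng.uniform(-0.5, 0.5, d)   # eigenvalue ratio at most kappa
lam /= lam.prod() ** (1.0 / d)             # rescale so that det(Sigma) = 1
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
Sigma = Q @ np.diag(lam) @ Q.T

x = rng.standard_normal(d)
lower = kappa ** (-d / 2) * mvn.pdf(x, mean=np.zeros(d), cov=np.eye(d) / kappa)
upper = kappa ** (d / 2) * mvn.pdf(x, mean=np.zeros(d), cov=kappa * np.eye(d))
assert lower <= mvn.pdf(x, mean=np.zeros(d), cov=Sigma) <= upper
```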

B.4 Proof of Lemma 4.2

The upper bound of p^{\mathrm{upper}}_{(a,b]} given in (18) is strictly decreasing in \bar{\sigma} and converges to zero as \bar{\sigma} goes to infinity. This guarantees the existence of \bar{\sigma}_{u} as a finite value. The existence of \bar{\sigma}_{\ell}>0 is obvious under A2. A1 guarantees that there exists an open ball \mathcal{B}_{\ell} with radius C_{\ell}(1-r)f_{\mu}(m) such that \mathcal{B}_{\ell}\subseteq\{x\in\mathbb{R}^{d}\mid f_{\mu}(x)<(1-r)f_{\mu}(m)\}. Then, analogously to the proof of Proposition 4.1, the success probability with rate r is lower bounded by

(27) prsucc(σ¯;m,Σ)κd/2Φ((((2Cu(1r)C)κ1/2σ¯)e1,(1r)Cκ1/2σ¯);0,I).\textstyle p^{\mathrm{succ}}_{r}(\bar{\sigma};m,\Sigma)\geqslant\kappa^{-d/2}\Phi\left(\mathcal{B}\left(\left(\frac{(2C_{u}-(1-r)C_{\ell})\kappa^{1/2}}{\bar{\sigma}}\right)e_{1},\frac{(1-r)C_{\ell}\kappa^{1/2}}{\bar{\sigma}}\right);0,\mathrm{I}\right).

The probability is independent of mm, positive, and continuous in σ¯[,u]\bar{\sigma}\in[\ell,u]. Therefore the minimum is attained. This completes the proof.

B.5 Proof of Proposition 4.3

First, we remark that m_{t}\in\mathcal{X}_{a}^{b} is equivalent to the condition a<f_{\mu}(m_{t})\leqslant b. If f_{\mu}(m_{t})\leqslant a or f_{\mu}(m_{t})>b, both sides of Eq. 23 are zero, hence the inequality is trivial. In the following we assume that m_{t}\in\mathcal{X}_{a}^{b}.

For the sake of simplicity we introduce log+(x)=log(x)1{x1}\log^{+}(x)=\log(x)1\left\{\scriptstyle{x\geqslant 1}\right\}. We rewrite the potential function as

(28) V(θt)=\textstyle V(\theta_{t})= log(fμ(mt))+vlog+(αfμ(mt)σt)+vlog+(σtαufμ(mt)).\textstyle\log\left(f_{\mu}(m_{t})\right)+v\log^{+}\left(\frac{\alpha_{\uparrow}\ell f_{\mu}(m_{t})}{\sigma_{t}}\right)+v\log^{+}\left(\frac{\sigma_{t}}{\alpha_{\downarrow}uf_{\mu}(m_{t})}\right)\enspace.

The potential function at time t+1t+1 can be written as

V(θt+1)=logfμ(mt+1)+vlog+fμ(mt+1)σt1{σt+1>σt}P2+vlog+αfμ(mt)ασt1{σt+1<σt}P3+vlog+ασtαufμ(mt+1)1{σt+1>σt}P4+vlog+σtufμ(mt)1{σt+1<σt}P5.\textstyle V(\theta_{t+1})=\log f_{\mu}(m_{t+1})+\underbrace{v\log^{+}\frac{\ell f_{\mu}(m_{t+1})}{\sigma_{t}}1\left\{\scriptstyle{\sigma_{t+1}>\sigma_{t}}\right\}}_{P_{2}}+\underbrace{v\log^{+}\frac{\alpha_{\uparrow}\ell f_{\mu}(m_{t})}{\alpha_{\downarrow}\sigma_{t}}1\left\{\scriptstyle{\sigma_{t+1}<\sigma_{t}}\right\}}_{P_{3}}\\ \textstyle+\underbrace{v\log^{+}\frac{\alpha_{\uparrow}\sigma_{t}}{\alpha_{\downarrow}uf_{\mu}(m_{t+1})}1\left\{\scriptstyle{\sigma_{t+1}>\sigma_{t}}\right\}}_{P_{4}}+\underbrace{v\log^{+}\frac{\sigma_{t}}{uf_{\mu}(m_{t})}1\left\{\scriptstyle{\sigma_{t+1}<\sigma_{t}}\right\}}_{P_{5}}\enspace.

We want to estimate the conditional expectation

(29) 𝔼[max{V(θt+1)V(θt),A}θt].\mathbb{E}\left[\max\{V(\theta_{t+1})-V(\theta_{t})\,,\,-A\}\mid\theta_{t}\right].

We partition the possible values of \theta_{t} into three sets: first the set of \theta_{t} such that \sigma_{t}<\ell f_{\mu}(m_{t}) (\sigma_{t} is small), second the set of \theta_{t} such that \sigma_{t}>uf_{\mu}(m_{t}) (\sigma_{t} is large), and last the set of \theta_{t} such that \ell f_{\mu}(m_{t})\leqslant\sigma_{t}\leqslant uf_{\mu}(m_{t}) (reasonable \sigma_{t}). In the following, we bound Eq. 29 in each of the three cases; in the end, our bound B equals the minimum of the three bounds obtained (the small and large \sigma_{t} cases yield the same bound, resulting in the two terms of Eq. 24).

Reasonable σt\sigma_{t} case: fμ(mt)σt[1u,1]\frac{f_{\mu}(m_{t})}{\sigma_{t}}\in\left[\frac{1}{u},\frac{1}{\ell}\right]. In case of success, where 1{σt+1>σt}=11\left\{\scriptstyle{\sigma_{t+1}>\sigma_{t}}\right\}=1, we have fμ(mt+1)/σt+1fμ(mt)/(ασt)1/(α)f_{\mu}(m_{t+1})/\sigma_{t+1}\leqslant f_{\mu}(m_{t})/(\alpha_{\uparrow}\sigma_{t})\leqslant 1/(\alpha_{\uparrow}\ell), implying that P2P_{2} is always 0. Similarly, in case of failure, fμ(mt+1)/σt+1=fμ(mt)/(ασt)1/(αu)f_{\mu}(m_{t+1})/\sigma_{t+1}=f_{\mu}(m_{t})/(\alpha_{\downarrow}\sigma_{t})\geqslant 1/(\alpha_{\downarrow}u) and we find that P5P_{5} is always zero. We rearrange P3P_{3} and P4P_{4} into

P3\textstyle P_{3} =vlog+(αfμ(mt)ασt)1{σt+1<σt},\textstyle=v\log^{+}\left(\frac{\alpha_{\uparrow}\ell f_{\mu}(m_{t})}{\alpha_{\downarrow}\sigma_{t}}\right)1\left\{\scriptstyle{\sigma_{t+1}<\sigma_{t}}\right\}\enspace,
P4\textstyle P_{4} =v[log(ασtαufμ(mt))log(fμ(mt+1)fμ(mt))]1{αufμ(mt+1)ασt<1}1{σt+1>σt}.\textstyle=v\left[\log\left(\frac{\alpha_{\uparrow}\sigma_{t}}{\alpha_{\downarrow}uf_{\mu}(m_{t})}\right)-\log\left(\frac{f_{\mu}(m_{t+1})}{f_{\mu}(m_{t})}\right)\right]1\left\{\scriptstyle{\frac{\alpha_{\downarrow}uf_{\mu}(m_{t+1})}{\alpha_{\uparrow}\sigma_{t}}<1}\right\}1\left\{\scriptstyle{\sigma_{t+1}>\sigma_{t}}\right\}\enspace.

Then, the one-step change Δt=V(θt+1)V(θt)\Delta_{t}=V(\theta_{t+1})-V(\theta_{t}) is upper bounded by

(30) Δt(1v1{αufμ(mt)ασt<1}1{σt+1>σt})log(fμ(mt+1)fμ(mt))+vlog+(αfμ(mt)ασt)1{σt+1<σt}+vlog+(ασtαufμ(mt))1{σt+1>σt}(1v)logfμ(mt+1)fμ(mt)+vlog+αfμ(mt)ασt1{σt+1<σt}+vlog+ασtαufμ(mt)1{σt+1>σt}.\textstyle\Delta_{t}\leqslant\left(1-v1\left\{\scriptstyle{\frac{\alpha_{\downarrow}uf_{\mu}(m_{t})}{\alpha_{\uparrow}\sigma_{t}}<1}\right\}1\left\{\scriptstyle{\sigma_{t+1}>\sigma_{t}}\right\}\right)\log\left(\frac{f_{\mu}(m_{t+1})}{f_{\mu}(m_{t})}\right)\\ \textstyle+v\log^{+}\left(\frac{\alpha_{\uparrow}\ell f_{\mu}(m_{t})}{\alpha_{\downarrow}\sigma_{t}}\right)1\left\{\scriptstyle{\sigma_{t+1}<\sigma_{t}}\right\}+v\log^{+}\left(\frac{\alpha_{\uparrow}\sigma_{t}}{\alpha_{\downarrow}uf_{\mu}(m_{t})}\right)1\left\{\scriptstyle{\sigma_{t+1}>\sigma_{t}}\right\}\\ \textstyle\leqslant(1-v)\log\frac{f_{\mu}(m_{t+1})}{f_{\mu}(m_{t})}+v\log^{+}\frac{\alpha_{\uparrow}\ell f_{\mu}(m_{t})}{\alpha_{\downarrow}\sigma_{t}}1\left\{\scriptstyle{\sigma_{t+1}<\sigma_{t}}\right\}+v\log^{+}\frac{\alpha_{\uparrow}\sigma_{t}}{\alpha_{\downarrow}uf_{\mu}(m_{t})}1\left\{\scriptstyle{\sigma_{t+1}>\sigma_{t}}\right\}\enspace.

The truncated one-step change max{Δt,A}\max\{\Delta_{t}\,,\,-A\} is upper bounded by

(31) max{Δt,A}(1v)max{log(fμ(mt+1)fμ(mt)),A1v}+vlog+(αfμ(mt)ασt)1{σt+1<σt}+vlog+(ασtαufμ(mt))1{σt+1>σt}.\textstyle\max\{\Delta_{t}\,,\,-A\}\leqslant(1-v)\max\left\{\log\left(\frac{f_{\mu}(m_{t+1})}{f_{\mu}(m_{t})}\right)\,,\,-\frac{A}{1-v}\right\}\\ \textstyle+v\log^{+}\left(\frac{\alpha_{\uparrow}\ell f_{\mu}(m_{t})}{\alpha_{\downarrow}\sigma_{t}}\right)1\left\{\scriptstyle{\sigma_{t+1}<\sigma_{t}}\right\}+v\log^{+}\left(\frac{\alpha_{\uparrow}\sigma_{t}}{\alpha_{\downarrow}uf_{\mu}(m_{t})}\right)1\left\{\scriptstyle{\sigma_{t+1}>\sigma_{t}}\right\}\enspace.

To consider the expectation of the above upper bound, we need to compute the expectation of the maximum of \log\left(\frac{f_{\mu}(m_{t+1})}{f_{\mu}(m_{t})}\right) and -\frac{A}{1-v}. For y\leqslant 0 and z\in\mathbb{R} we have \max(y,z)=y1\left\{\scriptstyle{y>z}\right\}+z1\left\{\scriptstyle{y\leqslant z}\right\}\leqslant z1\left\{\scriptstyle{y\leqslant z}\right\}, and \log\left(\frac{f_{\mu}(m_{t+1})}{f_{\mu}(m_{t})}\right)\leqslant 0 holds by elitist selection. Applying this and taking the conditional expectation, an upper bound for the conditional expectation of \max\left\{\log\left(\frac{f_{\mu}(m_{t+1})}{f_{\mu}(m_{t})}\right)\,,\,-\frac{A}{1-v}\right\} is -\frac{A}{1-v} times the probability of \log\left(\frac{f_{\mu}(m_{t+1})}{f_{\mu}(m_{t})}\right) being no greater than -\frac{A}{1-v}. The latter condition is equivalent to f_{\mu}(m_{t+1})\leqslant(1-r)f_{\mu}(m_{t}), corresponding to successes with rate r=1-\exp\left(-\frac{A}{1-v}\right) or better. That is,

(32) (1v)𝔼[max{log(fμ(mt+1)fμ(mt)),A1v}]Aprsucc(σtfμ(mt);mt,Σt).\textstyle(1-v)\mathbb{E}\left[\max\left\{\log\left(\frac{f_{\mu}(m_{t+1})}{f_{\mu}(m_{t})}\right)\,,\,-\frac{A}{1-v}\right\}\right]\\ \leqslant-Ap^{\mathrm{succ}}_{r}\left(\frac{\sigma_{t}}{f_{\mu}(m_{t})};m_{t},\Sigma_{t}\right)\enspace.

Note also that the expected value of 1{σt+1>σt}1\left\{\scriptstyle{\sigma_{t+1}>\sigma_{t}}\right\} is the success probability, namely, p0succ(σtfμ(mt);mt,Σt)p^{\mathrm{succ}}_{0}\left(\frac{\sigma_{t}}{f_{\mu}(m_{t})};m_{t},\Sigma_{t}\right). We obtain an upper bound for the conditional expectation of max{Δt,A}\max\{\Delta_{t}\,,\,-A\} in the case of reasonable σt\sigma_{t} as

(33) 𝔼[max{Δt,A}|θt]Aprsucc(σtfμ(mt);mt,Σt)+(log(αα)+log(fμ(mt)σt)0)v(1p0succ(σtfμ(mt);mt,Σt))+(log(αα)+log(σtufμ(mt))0)vp0succ(σtfμ(mt);mt,Σt)Apr+vlog(αα).\textstyle\mathbb{E}\left[\max\{\Delta_{t}\,,\,-A\}|\theta_{t}\right]\leqslant-Ap^{\mathrm{succ}}_{r}\left(\frac{\sigma_{t}}{f_{\mu}(m_{t})};m_{t},\Sigma_{t}\right)\\ \textstyle+\bigg{(}\log\left(\frac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\right)+\underbrace{\log\left(\frac{\ell f_{\mu}(m_{t})}{\sigma_{t}}\right)}_{\leqslant 0}\bigg{)}v\left(1-p^{\mathrm{succ}}_{0}\left(\frac{\sigma_{t}}{f_{\mu}(m_{t})};m_{t},\Sigma_{t}\right)\right)\\ \textstyle+\bigg{(}\log\left(\frac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\right)+\underbrace{\log\left(\frac{\sigma_{t}}{uf_{\mu}(m_{t})}\right)}_{\leqslant 0}\bigg{)}vp^{\mathrm{succ}}_{0}\left(\frac{\sigma_{t}}{f_{\mu}(m_{t})};m_{t},\Sigma_{t}\right)\leqslant-Ap^{*}_{r}+v\log\left(\frac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\right)\enspace.

Small σt\sigma_{t} case: fμ(mt)σt>1\frac{f_{\mu}(m_{t})}{\sigma_{t}}>\frac{1}{\ell}. If fμ(mt)>σt\ell f_{\mu}(m_{t})>\sigma_{t}, the 2nd summand in Eq. 28 is positive. Moreover, if σt+1<σt\sigma_{t+1}<\sigma_{t}, we have fμ(mt+1)=fμ(mt)>σt>σt+1\ell f_{\mu}(m_{t+1})=\ell f_{\mu}(m_{t})>\sigma_{t}>\sigma_{t+1} and hence the 2nd summand in Eq. 28 is positive for V(θt+1)V(\theta_{t+1}) as well. If σt+1>σt\sigma_{t+1}>\sigma_{t}, any regime can happen. Then, V(θt+1)V(θt)=V(\theta_{t+1})-V(\theta_{t})=

$\begin{aligned}
&=\log\frac{f_{\mu}(m_{t+1})}{f_{\mu}(m_{t})}-v\log\frac{\alpha_{\uparrow}\ell f_{\mu}(m_{t})}{\sigma_{t}}+v\log\frac{\ell f_{\mu}(m_{t+1})}{\sigma_{t}}\,1\big\{\tfrac{\ell f_{\mu}(m_{t+1})}{\sigma_{t}}>1\big\}\,1\{\sigma_{t+1}>\sigma_{t}\}\\
&\quad+v\log\frac{\alpha_{\uparrow}\ell f_{\mu}(m_{t})}{\alpha_{\downarrow}\sigma_{t}}\,1\big\{\tfrac{\alpha_{\uparrow}\ell f_{\mu}(m_{t})}{\alpha_{\downarrow}\sigma_{t}}>1\big\}\,1\{\sigma_{t+1}<\sigma_{t}\}\\
&\quad+v\log\frac{\alpha_{\uparrow}\sigma_{t}}{\alpha_{\downarrow}u f_{\mu}(m_{t+1})}\,1\big\{\tfrac{\alpha_{\downarrow}u f_{\mu}(m_{t+1})}{\alpha_{\uparrow}\sigma_{t}}<1\big\}\,1\{\sigma_{t+1}>\sigma_{t}\}\\
&=\log\left(\frac{f_{\mu}(m_{t+1})}{f_{\mu}(m_{t})}\right)\left[1+v\left(1\big\{\tfrac{\ell f_{\mu}(m_{t+1})}{\sigma_{t}}>1\big\}-1\big\{\tfrac{\alpha_{\downarrow}u f_{\mu}(m_{t+1})}{\alpha_{\uparrow}\sigma_{t}}<1\big\}\right)1\{\sigma_{t+1}>\sigma_{t}\}\right]\\
&\quad-v\log\left(\frac{\alpha_{\downarrow}u f_{\mu}(m_{t})}{\alpha_{\uparrow}\sigma_{t}}\right)1\big\{\tfrac{\alpha_{\downarrow}u f_{\mu}(m_{t+1})}{\alpha_{\uparrow}\sigma_{t}}<1\big\}\,1\{\sigma_{t+1}>\sigma_{t}\}\\
&\quad-v\log\left(\frac{\ell f_{\mu}(m_{t})}{\sigma_{t}}\right)\left[1-1\big\{\tfrac{\ell f_{\mu}(m_{t+1})}{\sigma_{t}}>1\big\}1\{\sigma_{t+1}>\sigma_{t}\}-1\big\{\tfrac{\alpha_{\uparrow}\ell f_{\mu}(m_{t})}{\alpha_{\downarrow}\sigma_{t}}>1\big\}1\{\sigma_{t+1}<\sigma_{t}\}\right]\\
&\quad-v\left(\log(\alpha_{\uparrow})-\log\big(\tfrac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\big)1\big\{\tfrac{\alpha_{\uparrow}\ell f_{\mu}(m_{t})}{\alpha_{\downarrow}\sigma_{t}}>1\big\}1\{\sigma_{t+1}<\sigma_{t}\}\right)\enspace.
\end{aligned}$

On the RHS of the above equality, the first term is guaranteed to be non-positive since $v\in(0,1)$. The second and third terms are non-positive as well since $\frac{\alpha_{\downarrow}u f_{\mu}(m_{t})}{\alpha_{\uparrow}\sigma_{t}}>\frac{\alpha_{\downarrow}u}{\alpha_{\uparrow}\ell}\geqslant 1$ and $\frac{\ell f_{\mu}(m_{t})}{\sigma_{t}}>1$. Replacing the indicator $1\big\{\frac{\alpha_{\uparrow}\ell f_{\mu}(m_{t})}{\alpha_{\downarrow}\sigma_{t}}>1\big\}$ with $1$ in the last term provides an upper bound. Altogether, we obtain

$\Delta_{t}=V(\theta_{t+1})-V(\theta_{t})\leqslant-v\left(\log(\alpha_{\uparrow})-\log(\alpha_{\uparrow}/\alpha_{\downarrow})\,1\{\sigma_{t+1}<\sigma_{t}\}\right)\enspace.$

Note that the RHS is no smaller than $-A$, since it is lower bounded by $-v\log(\alpha_{\uparrow})$ and $v\leqslant A/\log(\alpha_{\uparrow})$. Then, the conditional expectation of $\max\{\Delta_{t},-A\}$ is

(34) $\mathbb{E}\left[\max\{\Delta_{t},-A\}\,\middle|\,\mathcal{F}_{t}\right]\leqslant-v\left(\log\big(\tfrac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\big)\,p^{\mathrm{succ}}_{0}\left(\tfrac{\sigma_{t}}{f_{\mu}(m_{t})};m_{t},\Sigma_{t}\right)+\log(\alpha_{\downarrow})\right)\leqslant-v\left(\log\big(\tfrac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\big)\,p_{\ell}+\log(\alpha_{\downarrow})\right)=-v\log\big(\tfrac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\big)\left(p_{\ell}-p_{\mathrm{target}}\right)=-v\,\frac{p_{\ell}-p_{u}}{2}\log\big(\tfrac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\big)\enspace.$

Here we used $\mathbb{E}[1\{\sigma_{t+1}<\sigma_{t}\}\mid\mathcal{F}_{t}]=1-p_{0}^{\mathrm{succ}}\left(\frac{\sigma_{t}}{f_{\mu}(m_{t})};m_{t},\Sigma_{t}\right)$ for the first inequality, $p^{\mathrm{succ}}_{0}\left(\frac{\sigma_{t}}{f_{\mu}(m_{t})};m_{t},\Sigma_{t}\right)>p_{\ell}$ for the second inequality, and $p_{\mathrm{target}}=\log\left(\frac{1}{\alpha_{\downarrow}}\right)\big/\log\left(\frac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\right)$ and $p_{\mathrm{target}}=(p_{u}+p_{\ell})/2$ for the last two equalities.
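As a concrete check of the relation $p_{\mathrm{target}}=\log(1/\alpha_{\downarrow})/\log(\alpha_{\uparrow}/\alpha_{\downarrow})$ (an illustration, not part of the proof): choosing, say, $\alpha_{\uparrow}=2$ and $\alpha_{\downarrow}=2^{-1/4}$ yields

$p_{\mathrm{target}}=\frac{\log 2^{1/4}}{\log 2^{5/4}}=\frac{1/4}{5/4}=\frac{1}{5}\enspace,$

which recovers the classical one-fifth success rule as a special case of the generalized rule analyzed here.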

Large $\sigma_{t}$ case: $\frac{f_{\mu}(m_{t})}{\sigma_{t}}<\frac{1}{u}$. Since $\frac{f_{\mu}(m_{t+1})}{\sigma_{t+1}}\leqslant\frac{f_{\mu}(m_{t})}{\alpha_{\downarrow}\sigma_{t}}<\frac{1}{\alpha_{\downarrow}u}$, the 3rd summand in Eq. 28 is positive in both $V(\theta_{t})$ and $V(\theta_{t+1})$. For the 2nd summand in Eq. 28, recall that $\alpha_{\uparrow}\ell f_{\mu}(m_{t})/\sigma_{t}<\alpha_{\uparrow}\ell/u\leqslant\alpha_{\downarrow}<1$ since we have assumed that $u/\ell\geqslant\alpha_{\uparrow}/\alpha_{\downarrow}$. Hence, for $V(\theta_{t})$ the 2nd summand in Eq. 28 is zero. Also, $\alpha_{\uparrow}\ell f_{\mu}(m_{t+1})/\sigma_{t+1}\leqslant\alpha_{\uparrow}\ell/(\alpha_{\downarrow}u)=(\alpha_{\uparrow}/\alpha_{\downarrow})\,\ell/u\leqslant 1$, and thus for $V(\theta_{t+1})$ the 2nd summand in Eq. 28 also equals $0$. We obtain

$V(\theta_{t+1})-V(\theta_{t})=(1-v)\big(\log\left(f_{\mu}(m_{t+1})\right)-\log\left(f_{\mu}(m_{t})\right)\big)+v\log\left(\sigma_{t+1}/\sigma_{t}\right)\enspace.$

The first term on the RHS is guaranteed to be non-positive since $v<1$, yielding $\Delta_{t}\leqslant v\log(\sigma_{t+1}/\sigma_{t})$. On the other hand,

$\begin{aligned}
v\log(\sigma_{t+1}/\sigma_{t})&=v\left(\log(\alpha_{\uparrow})1\{\sigma_{t+1}>\sigma_{t}\}+\log(\alpha_{\downarrow})1\{\sigma_{t+1}<\sigma_{t}\}\right)\\
&=v\left(\log(\alpha_{\uparrow}/\alpha_{\downarrow})1\{\sigma_{t+1}>\sigma_{t}\}-\log(1/\alpha_{\downarrow})\right)\\
&\geqslant-v\log(1/\alpha_{\downarrow})\geqslant-A\enspace,
\end{aligned}$

where the last inequality comes from the prerequisite $v\leqslant A/\log(1/\alpha_{\downarrow})$. Hence,

$\max\{\Delta_{t},-A\}\leqslant\max\{v\log(\sigma_{t+1}/\sigma_{t}),-A\}=v\log(\sigma_{t+1}/\sigma_{t})\enspace.$

Then, the conditional expectation of $\max\{\Delta_{t},-A\}$ is

(35) $\mathbb{E}\left[\max\{\Delta_{t},-A\}\,\middle|\,\theta_{t}\right]\leqslant v\left(\log(\alpha_{\downarrow})+\log\big(\tfrac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\big)\,p^{\mathrm{succ}}_{0}\left(\tfrac{\sigma_{t}}{f_{\mu}(m_{t})};m_{t},\Sigma_{t}\right)\right)\leqslant v\left(\log(\alpha_{\downarrow})+\log\big(\tfrac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\big)\,p_{u}\right)=v\log\big(\tfrac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\big)\left(p_{u}-p_{\mathrm{target}}\right)=-v\,\frac{p_{\ell}-p_{u}}{2}\log\big(\tfrac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\big)\enspace.$

Here we used $p^{\mathrm{succ}}_{0}\left(\frac{\sigma_{t}}{f_{\mu}(m_{t})};m_{t},\Sigma_{t}\right)\leqslant p_{u}$ for the second inequality.
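The identity underlying Eqs. 34 and 35, namely $\mathbb{E}[v\log(\sigma_{t+1}/\sigma_{t})]=v\log(\alpha_{\uparrow}/\alpha_{\downarrow})(p^{\mathrm{succ}}-p_{\mathrm{target}})$ for a step with success probability $p^{\mathrm{succ}}$, can be checked numerically. The following minimal sketch (illustrative only; the constants are arbitrary choices, not values from the paper) simulates the multiplicative step-size update and compares the empirical mean of $\log(\sigma_{t+1}/\sigma_{t})$ with the closed form.

import numpy as np

rng = np.random.default_rng(0)
alpha_up, alpha_down = 2.0, 2.0 ** (-1 / 4)  # arbitrary illustrative values
p_succ = 0.3                                 # assumed success probability
p_target = np.log(1 / alpha_down) / np.log(alpha_up / alpha_down)  # = 1/5 here

# sigma_{t+1} = alpha_up * sigma_t on success, alpha_down * sigma_t otherwise
success = rng.random(10**6) < p_succ
log_ratio = np.where(success, np.log(alpha_up), np.log(alpha_down))

empirical = log_ratio.mean()
closed_form = np.log(alpha_up / alpha_down) * (p_succ - p_target)
print(empirical, closed_form)  # the two values agree up to Monte Carlo error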

Conclusion. Inequalities Eqs. 33, 34 and 35 together cover all possible cases (reasonable, small, and large $\sigma_{t}$), and we hence obtain Eq. 24.

Finally, we prove the positivity of $B$ for an arbitrary $A>0$. Lemma 4.2 guarantees the positivity of $p_{r}^{*}$ for any choice of $A$, since $r=1-\exp(-A/(1-v))\in(0,1)$ for any $A>0$ and $v<1$. Therefore, $Ap_{r}^{*}>0$ for any $A$ and $v\leqslant\min(1,\,A/\log(1/\alpha_{\downarrow}),\,A/\log(\alpha_{\uparrow}))$. Moreover, $p^{*}_{r}$ remains strictly positive for any $A>0$ as $v$ becomes small. Therefore, one can take a sufficiently small $v$ that satisfies $Ap_{r}^{*}>v\log(\alpha_{\uparrow}/\alpha_{\downarrow})$, so that the first term in the minimum in Eq. 24 is positive. The second term therein is clearly positive for $v>0$. This completes the proof.
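To illustrate this final choice of $v$ (an illustration, not part of the proof), the sketch below scans $v$ over its admissible range and verifies that both terms in the minimum defining $B$ in Eq. 24 are positive once $v$ is small enough. Here $p_{r}^{*}$ is treated as a fixed positive number, which Lemma 4.2 guarantees but does not compute; all constants are assumed for illustration.

import numpy as np

# illustrative constants only (not values from the paper); chosen consistently:
# p_target = log(1/alpha_down)/log(alpha_up/alpha_down) = (p_ell + p_u)/2 = 1/5
A = 0.1
alpha_up, alpha_down = 2.0, 2.0 ** (-1 / 4)
p_star_r = 0.05          # stand-in for p_r^*, positive by Lemma 4.2
p_ell, p_u = 0.3, 0.1    # lower/upper bounds on the success probability

v_max = min(1.0, A / np.log(1 / alpha_down), A / np.log(alpha_up))
for v in np.linspace(v_max / 100, v_max, 5):
    term1 = A * p_star_r - v * np.log(alpha_up / alpha_down)
    term2 = v * (p_ell - p_u) / 2 * np.log(alpha_up / alpha_down)
    print(f"v={v:.5f}  term1={term1:+.5f}  term2={term2:+.5f}  B>={min(term1, term2):+.5f}")
# term2 > 0 for every v > 0, while term1 > 0 once
# v < A * p_star_r / log(alpha_up / alpha_down); both hold for small enough v.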

B.6 Proof of Proposition 4.15

Consider $d\geqslant 2$. We set $A=1/d$. We bound $B$ from below by taking a specific value of $v\in(0,\,\min(1,\,A/\log(1/\alpha_{\downarrow}),\,A/\log(\alpha_{\uparrow})))$ instead of taking the supremum over $v$. Our candidate is $v=\frac{Ap^{\prime}}{\log(\alpha_{\uparrow}/\alpha_{\downarrow})}\frac{2}{2+p_{\ell}-p_{u}}$, where $p^{\prime}=\inf_{\bar{\sigma}\in[\ell,u]}p_{r^{\prime}}(\bar{\sigma})$ and $r^{\prime}=1-\exp\big(-A\big(1-\frac{1}{d\log(\alpha_{\uparrow}/\alpha_{\downarrow})}\big)^{-1}\big)$. It holds that $v<\frac{1}{d\log(\alpha_{\uparrow}/\alpha_{\downarrow})}$, and hence $r^{\prime}>r$, from which we obtain $p^{\prime}<p^{*}$.

We bound the terms in Eq. 24 as $Ap^{*}-v\log(\alpha_{\uparrow}/\alpha_{\downarrow})=\frac{p^{\prime}}{d}\left(\frac{p^{*}}{p^{\prime}}-\frac{2}{2+p_{\ell}-p_{u}}\right)\geqslant\frac{p^{\prime}}{d}\left(\frac{p_{\ell}-p_{u}}{2+p_{\ell}-p_{u}}\right)$ and $v\frac{p_{\ell}-p_{u}}{2}\log\left(\frac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\right)=\frac{p^{\prime}}{d}\frac{p_{\ell}-p_{u}}{2+p_{\ell}-p_{u}}$. Therefore, we have $B\geqslant\frac{p^{\prime}}{d}\frac{p_{\ell}-p_{u}}{2+p_{\ell}-p_{u}}$. Note that one can take $p_{\ell}-p_{u}\in\Theta(1)$ since the only condition is $p_{\mathrm{target}}=(p_{\ell}+p_{u})/2\in\Theta(1)$. To obtain $B\in\Omega(1/d)$, it is sufficient to show $p^{\prime}\in\Theta(1)$ as $d\to\infty$.
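For transparency, substituting the candidate $v$ into $v\log(\alpha_{\uparrow}/\alpha_{\downarrow})$ gives

$v\log\left(\frac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\right)=Ap^{\prime}\,\frac{2}{2+p_{\ell}-p_{u}}=\frac{p^{\prime}}{d}\,\frac{2}{2+p_{\ell}-p_{u}}\enspace,$

so that $Ap^{*}-v\log(\alpha_{\uparrow}/\alpha_{\downarrow})=\frac{p^{*}}{d}-\frac{p^{\prime}}{d}\frac{2}{2+p_{\ell}-p_{u}}=\frac{p^{\prime}}{d}\left(\frac{p^{*}}{p^{\prime}}-\frac{2}{2+p_{\ell}-p_{u}}\right)$, and the displayed inequality then uses $p^{*}\geqslant p^{\prime}$ together with $1-\frac{2}{2+p_{\ell}-p_{u}}=\frac{p_{\ell}-p_{u}}{2+p_{\ell}-p_{u}}$.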

Fix $p_{\ell}$ and $p_{u}$ independently of $d$. In light of Lemma 3.1 in [4], $p_{0}:\mathbb{R}_{>}\to(0,1/2)$ is continuous and strictly decreasing from $1/2$ to $0$ for all $d\in\mathbb{N}$. Therefore, for each $d\in\mathbb{N}$ there exists an inverse map $p_{0}^{-1}:(0,1/2)\to\mathbb{R}_{>}$. Define $\hat{\sigma}_{\ell}^{d}=dV_{d}\,p_{0}^{-1}(p_{\ell})$ and $\hat{\sigma}_{u}^{d}=dV_{d}\,p_{0}^{-1}(p_{u})$ for each $d\in\mathbb{N}$. It follows from Lemma 3.2 in [4] that $p_{0}^{\mathrm{lim}}:\hat{\sigma}\mapsto\lim_{d\to\infty}p_{0}(\hat{\sigma}/(dV_{d}))$ is also strictly decreasing, hence invertible; the existence of this limit is also proved in [4]. We let $\hat{\sigma}_{\ell}^{\infty}=(p_{0}^{\mathrm{lim}})^{-1}(p_{\ell})$ and $\hat{\sigma}_{u}^{\infty}=(p_{0}^{\mathrm{lim}})^{-1}(p_{u})$. Because of the pointwise convergence of $p_{0}(\hat{\sigma}/(dV_{d}))$ to $p_{0}^{\mathrm{lim}}(\hat{\sigma})$, we have $\hat{\sigma}_{\ell}^{d}\to\hat{\sigma}_{\ell}^{\infty}$ and $\hat{\sigma}_{u}^{d}\to\hat{\sigma}_{u}^{\infty}$ as $d\to\infty$. Hence, for any $\hat{u}>\hat{\sigma}_{u}^{\infty}$ and $\hat{\ell}<\hat{\sigma}_{\ell}^{\infty}$ with $\hat{u}/\hat{\ell}\geqslant\alpha_{\uparrow}/\alpha_{\downarrow}$, there exists $D\in\mathbb{N}$ such that for all $d\geqslant D$ we have $\hat{u}>\hat{\sigma}_{u}^{d}$ and $\hat{\ell}<\hat{\sigma}_{\ell}^{d}$. We now fix $\hat{u}$ and $\hat{\ell}$ in this way, which amounts to selecting $u=dV_{d}\hat{u}$ and $\ell=dV_{d}\hat{\ell}$.
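As a numerical illustration of this dimension scaling (not part of the proof), suppose $p_{0}$ is read as the probability of sampling a strictly improving offspring on the spherical function $f(x)=\|x\|$ with $\Sigma_{t}=I$; this reading, as well as the simplified normalization $\sigma=\hat{\sigma}\|m\|/d$ used below in place of the $\hat{\sigma}/(dV_{d})$ scaling, is an assumption on our part. The sketch estimates the success probability for growing $d$, exhibiting the monotonicity in $\hat{\sigma}$ and the convergence as $d\to\infty$ invoked above.

import numpy as np

rng = np.random.default_rng(1)

def success_prob(sigma_hat, d, n=20_000):
    # Monte Carlo estimate of P(||m + sigma * N(0, I)|| < ||m||) on the sphere,
    # with m = e_1 and the normalization sigma = sigma_hat * ||m|| / d
    m = np.zeros(d); m[0] = 1.0
    sigma = sigma_hat / d
    z = rng.standard_normal((n, d))
    return np.mean(np.linalg.norm(m + sigma * z, axis=1) < 1.0)

for d in (2, 10, 100, 1000):
    print(d, [round(success_prob(s, d), 3) for s in (0.5, 1.0, 2.0, 4.0)])
# each row is decreasing in sigma_hat, and the rows stabilize as d grows,
# mirroring the monotonicity and pointwise convergence used in the proof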

We have $\lim_{d\to\infty}dr^{\prime}=1$ since $\lim_{d\to\infty}d\log(\alpha_{\uparrow}/\alpha_{\downarrow})=\infty$, and hence, according to Lemma 3.2 in [4], we have

$\begin{aligned}
\liminf_{d\to\infty}p^{\prime}&=\liminf_{d\to\infty}\min_{\bar{\sigma}\in[\ell,u]}\left\{p_{r^{\prime}}(\bar{\sigma})\right\}=\liminf_{d\to\infty}\min_{\hat{\sigma}\in[\hat{\ell},\hat{u}]}p_{r^{\prime}}\left(\frac{\hat{\sigma}}{dV_{d}}\right)\\
&\overset{(\star)}{=}\min_{\hat{\sigma}\in[\hat{\ell},\hat{u}]}\lim_{d\to\infty}p_{r^{\prime}}\left(\frac{\hat{\sigma}}{dV_{d}}\right)=\min_{\hat{\sigma}\in[\hat{\ell},\hat{u}]}\Psi\left(-\frac{1}{\hat{\sigma}}-\frac{\hat{\sigma}}{2}\right)\enspace,
\end{aligned}$

where the equality $(\star)$ follows from the pointwise convergence of $p_{r^{\prime}}$ to $\lim_{d\to\infty}p_{r^{\prime}}$ and the continuity of $p_{r^{\prime}}$ and $\lim_{d\to\infty}p_{r^{\prime}}$ (see Footnote 2). Since the last minimum is a strictly positive constant independent of $d$, we conclude $p^{\prime}\in\Theta(1)$ and hence $B\in\Omega(1/d)$.

Footnote 2: Let $\{f_{n}:n\geqslant 1\}$ be a sequence of continuous functions on $\mathbb{R}$ and $f$ be a continuous function such that $f$ is the pointwise limit $\lim_{n}f_{n}(x)=f(x)$ of the sequence. Since they are continuous, the minimizers of $f_{n}$ and $f$ over a compact set $[\ell,u]$ exist. Let $x_{n}=\operatorname{argmin}f_{n}(x)$ and $x^{*}=\operatorname{argmin}f(x)$, where the $\operatorname{argmin}$ is taken over $x\in[\ell,u]$ and we pick one minimizer if there is more than one. It is easy to see that $f_{n}(x_{n})\leqslant f_{n}(x^{*})$, hence $\liminf_{n}f_{n}(x_{n})\leqslant\liminf_{n}f_{n}(x^{*})=f(x^{*})$. Let $\{n_{i}:i\geqslant 1\}$ be the sub-sequence of indices such that $\liminf_{n}f_{n}(x_{n})=\lim_{i}f_{n_{i}}(x_{n_{i}})$. Since $\{x_{n_{i}}:i\geqslant 1\}$ is a bounded sequence, the Bolzano–Weierstraß theorem provides a convergent sub-sequence $\{x_{n_{i_{k}}}:k\geqslant 1\}$, whose limit we denote by $x_{*}$. Of course we have $\liminf_{n}f_{n}(x_{n})=\lim_{k}f_{n_{i_{k}}}(x_{n_{i_{k}}})$. Due to the continuity of $\{f_{n}:n\geqslant 1\}$ and the pointwise convergence to $f$, we have $\lim_{k}f_{n_{i_{k}}}(x_{n_{i_{k}}})=\lim_{k}f_{n_{i_{k}}}(x_{*})=f(x_{*})$. Therefore, $\liminf_{n}f_{n}(x_{n})=f(x_{*})\leqslant f(x^{*})$. Since $x^{*}$ is the minimizer of $f$ in $[\ell,u]$ and $x_{*}\in[\ell,u]$, it must hold that $f(x_{*})\geqslant f(x^{*})$. Hence, $\liminf_{n}f_{n}(x_{n})=f(x^{*})$. This completes the proof.
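Returning to the limit expression: assuming $\Psi$ denotes the standard normal cumulative distribution function (our reading of the notation), the limiting constant $\min_{\hat{\sigma}\in[\hat{\ell},\hat{u}]}\Psi(-\tfrac{1}{\hat{\sigma}}-\tfrac{\hat{\sigma}}{2})$ is straightforward to evaluate numerically; a minimal sketch over an arbitrary illustrative interval:

import numpy as np
from scipy.stats import norm

l_hat, u_hat = 0.5, 4.0  # arbitrary illustrative interval [l_hat, u_hat]
sigma = np.linspace(l_hat, u_hat, 10_001)
vals = norm.cdf(-1.0 / sigma - sigma / 2.0)
print(vals.min())  # strictly positive and independent of d

# -1/s - s/2 is maximized at s = sqrt(2) (where 1/s^2 - 1/2 = 0), so the
# expression is unimodal and the minimum is attained at an endpoint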