

Global Linear Convergence of Evolution Strategies on More than Smooth Strongly Convex Functions

Youhei Akimoto, Faculty of Engineering, Information and Systems, University of Tsukuba; RIKEN AIP, Tsukuba, Japan (akimoto@cs.tsukuba.ac.jp)    Anne Auger, Inria and CMAP, Ecole Polytechnique, IP Paris, France (anne.auger@inria.fr)    Tobias Glasmachers, Institute for Neural Computation, Ruhr-University Bochum, Bochum, Germany (tobias.glasmachers@ini.rub.de)    Daiki Morinaga, Department of Computer Science, University of Tsukuba; RIKEN AIP, Tsukuba, Japan (morinaga@bbo.cs.tsukuba.ac.jp)
Abstract

Evolution strategies (ESs) are zeroth-order stochastic black-box optimization heuristics invariant to monotonic transformations of the objective function. They evolve a multivariate normal distribution, from which candidate solutions are generated. Among different variants, CMA-ES is nowadays recognized as one of the state-of-the-art zeroth-order optimizers for difficult problems. Despite ample empirical evidence that ESs with a step-size control mechanism converge linearly, theoretical guarantees of linear convergence of ESs have been established only on limited classes of functions. In particular, theoretical results on convex functions are missing, where zeroth-order and also first-order optimization methods are often analyzed. In this paper, we establish almost sure linear convergence and a bound on the expected hitting time of an ES family, namely the $(1+1)_{\kappa}$-ES, which includes the (1+1)-ES with (generalized) one-fifth success rule and an abstract covariance matrix adaptation with bounded condition number, on a broad class of functions. The analysis holds for monotonic transformations of positively homogeneous functions and of quadratically bounded functions, the latter of which particularly includes monotonic transformations of strongly convex functions with Lipschitz continuous gradient. As far as the authors know, this is the first work that proves linear convergence of ESs on such a broad class of functions.

keywords:
Evolution strategies, Randomized derivative free optimization, Black-box optimization, Linear convergence, Stochastic algorithms
AMS subject classifications:

65K05, 90C25, 90C26, 90C56, 90C59

1 Introduction

We consider the unconstrained minimization of an objective function $f:\mathbb{R}^{d}\to\mathbb{R}$ without the use of derivatives, where an optimization solver sees $f$ as a zeroth-order black-box oracle [48, 13, 49]. This setting is also referred to as derivative-free optimization [16]. Such problems can be advantageously approached by randomized algorithms, which are typically more robust to noise, non-convexity, and irregularities of the objective function than deterministic algorithms. There has recently been a vivid interest in randomized derivative-free algorithms, giving rise to several theoretical studies of randomized direct search methods [26], trust region [10, 27] and model-based methods [14, 50]. We refer to [41] for an in-depth survey including the references of this paragraph and additional ones.

In this context, we investigate Evolution Strategies (ESs), which are among the oldest randomized derivative-free or zeroth-order black-box methods [17, 54, 51]. They are widely used in applications in different domains [57, 40, 45, 23, 5, 28, 58, 12, 21, 22]. Notably, a specific ES called covariance matrix adaptation ES (CMA-ES) [31] is among the best solvers for difficult black-box problems. It is affine-invariant and implements complex adaptation mechanisms for the sampling covariance matrix and the step-size. It performs well on many ill-conditioned, non-convex, non-smooth, and non-separable problems [30, 53]. ESs are known to be difficult to analyze. Yet, given their importance in practice, it is essential to study them from a theoretical convergence perspective.

We focus on the arguably simplest and oldest adaptive ES, denoted (1+1)-ES. It samples a candidate solution from a Gaussian distribution whose step-size (standard deviation) is adapted. The candidate solution is accepted if and only if it is better than the current one (see the pseudo-code in Algorithm 1). The algorithm shares some similarities with simplified direct search, whose complexity analysis has been presented in [39]. Yet the (1+1)-ES is comparison-based and thus invariant to strictly increasing transformations of the objective function. Simplified direct search can be thought of as a variant of mesh adaptive direct search [7, 2]. Arguably, in contrast to direct search, a sufficient decrease condition cannot be guaranteed. This causes some difficulties for the analysis. The (1+1)-ES is rotationally invariant, while direct search creates candidate solutions along a predefined set of vectors. While CMA-ES should always be preferred over the (1+1)-ES variant analyzed here for practical applications, this latter variant achieves faster linear convergence on well-conditioned problems when compared to algorithms with established complexity analysis (see [55, Table 6.3 and Figure 6.1] and [9, Figure B.4], where the random pursuit algorithm and the (1+1)-ES algorithms are compared, and also Appendix A).

Prior theoretical studies of the (1+1)-ES with $1/5$ success rule have established global linear convergence on differentiable positively homogeneous functions (composed with a strictly increasing function) with a single optimum [9, 8]. Those results establish almost sure linear convergence from all initial states. They however do not provide the dependency of the convergence rate with respect to the dimension. A more specific study on the sphere function $f(x)=\frac{1}{2}\|x\|^{2}$ establishes lower and upper bounds on the expected hitting time of an $\epsilon$-ball around the optimum in $\Theta(\log(d\|m_{0}-x^{*}\|/\epsilon))$, where $x^{*}$ is the optimum of the function, $m_{0}$ is the initial solution, and $d$ is the problem dimension [4]. Prior to that, a variant of the (1+1)-ES with one-fifth success rule had been analyzed on the sphere and certain convex quadratic functions, establishing bounds on the hitting time, holding with overwhelming probability, in $\Theta(\log(\kappa_{f}d\|m_{0}-x^{*}\|/\epsilon))$, where $\kappa_{f}$ is the condition number (the ratio between the greatest and smallest eigenvalues) of the Hessian [34, 36, 37, 35]. Recently, the class of functions on which convergence of the (1+1)-ES is proven has been extended to continuously differentiable functions. This analysis does not address the question of linear convergence, focusing only on convergence as such, which is possibly sublinear [24].

Our main contribution is as follows. For a generalized version of the (1+1)-ES with one-fifth success rule, we prove bounds on the expected hitting time akin to linear convergence, i.e., hitting an $\epsilon$-ball in $\Theta(\log(\|m_{0}-x^{*}\|/\epsilon))$ iterations, on a quite general class of functions. This class includes all composites of Lipschitz-smooth strongly convex functions with a strictly increasing transformation. This latter transformation allows us to include some discontinuous functions, and even functions with non-smooth level sets. We additionally deduce linear convergence with probability one. Our analysis relies on finding an appropriate Lyapunov function with lower- and upper-bounded expected drift. It builds on classical fundamental ideas presented by Hajek [29] and widely used to analyze stochastic hill-climbing algorithms on discrete search spaces [43].

Notation

Throughout the paper, we use the following notation. The set of natural numbers $\{1,2,\ldots\}$ is denoted $\mathbb{N}$. Open, closed, and left-open intervals on $\mathbb{R}$ are denoted by $(\cdot,\cdot)$, $[\cdot,\cdot]$, and $(\cdot,\cdot]$, respectively. The set of strictly positive real numbers is denoted by $\mathbb{R}_{>}$. The Euclidean norm on $\mathbb{R}^{d}$ is denoted by $\lVert\cdot\rVert$. Open and closed balls with center $c$ and radius $r$ are denoted as $\mathcal{B}(c,r)=\{x\in\mathbb{R}^{d}:\lVert x-c\rVert<r\}$ and $\bar{\mathcal{B}}(c,r)=\{x\in\mathbb{R}^{d}:\lVert x-c\rVert\leqslant r\}$, respectively. The Lebesgue measures on $\mathbb{R}$ and $\mathbb{R}^{d}$ are both denoted by the same symbol $\mu$. A multivariate normal distribution with mean $m$ and covariance matrix $\Sigma$ is denoted by $\mathcal{N}(m,\Sigma)$. Its probability measure and its probability density with respect to the Lebesgue measure are denoted by $\Phi(\cdot;m,\Sigma)$ and $\varphi(\cdot;m,\Sigma)$, respectively. The indicator function of a set or condition $C$ is denoted by $1\{C\}$. We use the Bachmann-Landau notations $f\in o(g)$, $O(g)$, $\omega(g)$, $\Omega(g)$ and $\Theta(g)$ to mean $\limsup_{\epsilon\to 0}f(\epsilon)/g(\epsilon)=0$, $\limsup_{\epsilon\to 0}f(\epsilon)/g(\epsilon)\leqslant C$ for some $C>0$, $\liminf_{\epsilon\to 0}f(\epsilon)/g(\epsilon)=\infty$, $\liminf_{\epsilon\to 0}f(\epsilon)/g(\epsilon)\geqslant C$ for some $C>0$, and $f\in O(g)$ and $f\in\Omega(g)$, respectively.

2 Algorithm, Definitions and Objective Function Assumptions

2.1 Algorithm: (1+1)-ES with Success-based Step-size Control

We analyze a generalized version of the (1+1)-ES with one-fifth success rule presented in Algorithm 1, which implements one of the oldest approaches to adapt the step-size in randomized optimization methods [51, 17, 54]. The specific implementation was proposed in [38]. At each iteration, a candidate solution $x_{t}$ is sampled. It is centered at the current incumbent $m_{t}$ and follows a multivariate normal distribution with mean vector $m_{t}$ and covariance matrix $\sigma_{t}^{2}\Sigma_{t}$ (with $\Sigma_{t}=I_{d}$, the identity matrix, in the plain (1+1)-ES). The candidate solution is accepted, that is, $m_{t+1}$ becomes $x_{t}$, if and only if $x_{t}$ is at least as good as $m_{t}$ (i.e., $f(x_{t})\leqslant f(m_{t})$). In this case, we say that the candidate solution is successful. The step-size $\sigma_{t}$ is adapted so as to maintain the probability of success approximately at the target success probability $p_{\mathrm{target}}:=\frac{\log(1/\alpha_{\downarrow})}{\log(\alpha_{\uparrow}/\alpha_{\downarrow})}$. To do so, the step-size is multiplied by the increase factor $\alpha_{\uparrow}>1$ in case of success (which is an indication that the step-size is likely to be too small) and multiplied by the decrease factor $\alpha_{\downarrow}<1$ otherwise. The covariance matrix $\Sigma_{t}$ of the sampling distribution of candidate solutions is adapted in the set $\mathcal{S}_{\kappa}$ of positive-definite symmetric matrices with determinant $\det(\Sigma)=1$ and condition number $\operatorname{Cond}(\Sigma)\leqslant\kappa$. We do not assume any specific update mechanism for $\Sigma$, but we assume that the update of $\Sigma$ is invariant to any strictly increasing transformation of $f$. We call such an update comparison-based (see Lines 7 and 11 of Algorithm 1). Then, our algorithm behaves identically on $f$ and on $g\circ f$ for every strictly increasing function $g:\mathbb{R}\to\mathbb{R}$ (i.e., $g(s)\lesseqgtr g(t)\Leftrightarrow s\lesseqgtr t$). This defines a class of comparison-based randomized algorithms, which we denote by $(1+1)\text{-ES}_{\kappa}$. For $\kappa=1$, it is simply denoted (1+1)-ES.

Algorithm 1 $(1+1)\text{-ES}_{\kappa}$ with success-based step-size adaptation
1:  input $m_{0}\in\mathbb{R}^{d}$, $\sigma_{0}>0$, $\Sigma_{0}=I$, $f:\mathbb{R}^{d}\to\mathbb{R}$, parameters $\alpha_{\uparrow}>1>\alpha_{\downarrow}>0$
2:  for $t=0,1,2,\dots$, until stopping criterion is met do
3:     sample $x_{t}\sim m_{t}+\sigma_{t}\mathcal{N}(0,\Sigma_{t})$
4:     if $f(x_{t})\leqslant f(m_{t})$ then
5:        $m_{t+1}\leftarrow x_{t}$ ▷ move to the better solution
6:        $\sigma_{t+1}\leftarrow\sigma_{t}\alpha_{\uparrow}$ ▷ increase the step-size
7:        $\Sigma_{t+1}\in\mathcal{S}_{\kappa}$ ▷ adapt the covariance matrix
8:     else
9:        $m_{t+1}\leftarrow m_{t}$ ▷ stay where we are
10:       $\sigma_{t+1}\leftarrow\sigma_{t}\alpha_{\downarrow}$ ▷ decrease the step-size
11:       $\Sigma_{t+1}\in\mathcal{S}_{\kappa}$ ▷ adapt the covariance matrix

Note that $\alpha_{\uparrow}$ and $\alpha_{\downarrow}$ are not meant to be tuned depending on the function properties. How to choose such constants for $\Sigma_{t}=I_{d}$ is well known and is related to the so-called evolution window [52]. In practice, $\alpha_{\downarrow}=\alpha_{\uparrow}^{-1/4}$ is the most commonly used setting, which leads to $p_{\mathrm{target}}=1/5$. This setting has been shown to be close to optimal, yielding a nearly optimal (linear) convergence rate on the sphere function [51, 17]. Hereunder, we write $\theta=(m,\sigma,\Sigma)$ for the state of the algorithm, $\theta_{t}=(m_{t},\sigma_{t},\Sigma_{t})$ for the state at iteration $t$, and denote the state-space by $\Theta$.
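To make the adaptation loop concrete, the following Python sketch implements Algorithm 1 in the isotropic case $\kappa=1$ (i.e., $\Sigma_{t}=I_{d}$ kept fixed), with the common setting $\alpha_{\downarrow}=\alpha_{\uparrow}^{-1/4}$. The function name, default parameter values, and the stopping rule are illustrative choices, not part of the algorithm specification.

import numpy as np

rng = np.random.default_rng(0)

def one_plus_one_es(f, m0, sigma0, alpha_up=1.1, max_iter=10_000, target=1e-10):
    """(1+1)-ES with success-based step-size adaptation (Algorithm 1, Sigma_t = I)."""
    alpha_down = alpha_up ** (-1 / 4)  # one-fifth success rule: p_target = 1/5
    m, sigma = np.asarray(m0, dtype=float), float(sigma0)
    fm = f(m)
    for t in range(max_iter):
        x = m + sigma * rng.standard_normal(m.shape)  # line 3: sample candidate
        fx = f(x)
        if fx <= fm:  # lines 4-6: success, accept and enlarge the step-size
            m, fm = x, fx
            sigma *= alpha_up
        else:  # lines 8-10: failure, keep m and shrink the step-size
            sigma *= alpha_down
        if np.linalg.norm(m) <= target:  # illustrative stop; assumes the optimum is 0
            return m, sigma, t + 1
    return m, sigma, max_iter

# Sphere function in dimension 10: the hitting time of the 1e-10 ball scales
# with log(1/epsilon), in line with the bounds discussed in this paper.
m, sigma, t = one_plus_one_es(lambda x: 0.5 * x @ x, m0=np.ones(10), sigma0=1.0)
print(t, np.linalg.norm(m))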

Figure 1 shows typical runs of the (1+1)-ES and of a version of the $(1+1)\text{-ES}_{\kappa}$ proposed in [6], known as the (1+1)-CMA-ES, on a $10$-dimensional ellipsoidal function with different condition numbers $\kappa_{f}$ of the Hessian. It is empirically observed that, if the objective function is convex quadratic, $\Sigma_{t}$ in the (1+1)-CMA-ES approaches the inverse of the Hessian $\nabla^{2}f(m_{t})$ of the objective function up to a scalar factor. The runtime of the (1+1)-ES scales linearly with $\kappa_{f}$ (notice the logarithmic scale of the horizontal axis), while the runtime of the (1+1)-CMA-ES suffers only an additive penalty, roughly proportional to the logarithm of $\kappa_{f}$. Once the inverse Hessian is well approximated by $\Sigma$ (up to a scalar factor), the algorithm approaches the global optimum geometrically at the same rate for different values of $\kappa_{f}$.

In our analysis, we do not assume any specific update mechanism for $\Sigma$; hence $\Sigma_{t}$ does not necessarily behave as shown in Figure 1. Our analysis is therefore a worst-case analysis (for the upper bound on the runtime) and a best-case analysis (for the lower bound on the runtime) over the algorithms in $(1+1)\text{-ES}_{\kappa}$.

Figure 1: Convergence of the (1+1)-ES (left) and the (1+1)-CMA-ES (middle) on the $10$-dimensional ellipsoidal function $f(x)=\frac{1}{2}\sum_{i=1}^{d}\kappa_{f}^{\frac{i-1}{d-1}}x_{i}^{2}$ with $\kappa_{f}=10^{0},10^{1},\dots,10^{6}$. The y-axis displays the distance to the optimum (and not the function value). We employ the covariance matrix adaptation mechanism proposed by [6], where $\sigma$ is adapted as in Algorithm 1 with $\alpha_{\uparrow}=e^{0.1}$ and $\alpha_{\downarrow}=e^{-0.025}$. Note the logarithmic scale of the time axis of the left plot vs. the linear time axis of the middle plot.
Right: Three runs of the (1+1)-ES ($\alpha_{\uparrow}=e^{0.1}$ and $\alpha_{\downarrow}=e^{-0.025}$) on the $10$-dimensional spherical function $f(x)=\frac{1}{2}\lVert x-x^{*}\rVert^{2}$ with initial step-size $\sigma_{0}=10^{-4}$, $1$, and $10^{4}$ (in blue, red, green, respectively). Plotted are the distance to the optimum (dotted line), the step-size (dashed line), and the potential function $V(\theta)$ defined in (22) (solid line) with $v=4/d$, $\ell=\alpha_{\uparrow}^{-10}$, and $u=\alpha_{\downarrow}^{-10}$.

2.2 Preliminary Definitions

2.2.1 Spatial Suboptimality Function

The algorithms studied in this paper are comparison-based and thus invariant to strictly increasing transformations of $f$. If the convergence of the algorithms is measured in terms of $f$, say by investigating the convergence or hitting time of the sequence $f(m_{t})$, this will not reflect the invariance to monotonic transformations of $f$, because the first iteration $t_{0}$ such that $f(m_{t_{0}})\leqslant\epsilon$ is not equal to the first iteration $t_{0}^{\prime}$ such that $g(f(m_{t_{0}^{\prime}}))\leqslant\epsilon$ for some $\epsilon>0$. For this reason, we introduce a quality measure called the spatial suboptimality function [24]. It is the $d$-th root of the volume of the sub-levelset where the function value is better than or equal to $f(x)$:

Definition 2.1 (Spatial Suboptimality Function).

Let $f:\mathbb{R}^{d}\to\mathbb{R}$ be a measurable function with respect to the Borel $\sigma$-algebra of $\mathbb{R}^{d}$ (simply referred to as a measurable function in the sequel). Then the spatial suboptimality function $f_{\mu}:\mathbb{R}^{d}\to[0,+\infty]$ is defined as

(1) $f_{\mu}(x)=\sqrt[d]{\mu\left(f^{-1}\left((-\infty,f(x)]\right)\right)}=\sqrt[d]{\mu\left(\left\{y\in\mathbb{R}^{d}\,\middle|\,f(y)\leqslant f(x)\right\}\right)}\enspace.$

We remark that for any $f$, the suboptimality function $f_{\mu}$ is greater than or equal to zero. For any $f$ and any strictly increasing function $g:\mathrm{Im}(f)\to\mathbb{R}$, $f$ and its composite $g\circ f$ have the same spatial suboptimality function, so that the first hitting time of $f_{\mu}$ below $\epsilon>0$ is the same for $f$ and $g\circ f$. Moreover, there exists a strictly increasing function $g$ such that $f_{\mu}(x)=g(f(x))$ holds $\mu$-almost everywhere [24, Lemma 1].

We will investigate the expected first hitting time of $\lVert m_{t}-x^{*}\rVert$ to $[0,\epsilon]$ for $\epsilon>0$. For this, we will bound the first hitting time of $\lVert m_{t}-x^{*}\rVert$ to $\epsilon$ by the first hitting time of $f_{\mu}(m_{t})$ to a constant times $\epsilon$. To understand why, consider first a strictly convex quadratic function $f$ with Hessian $H$ and optimal solution $x^{*}$. We have $f_{\mu}(x)=V_{d}\big[2(f(x)-f(x^{*}))/\det(H)^{1/d}\big]^{1/2}$ for all $x\in\mathbb{R}^{d}$, where $V_{d}=\pi^{1/2}/\Gamma^{1/d}(d/2+1)$ is the $d$-th root of the volume of the $d$-dimensional unit hyper-sphere [3]. This implies that the first hitting time of $f_{\mu}(m_{t})$ translates to the first hitting time of $\sqrt{f(m_{t})-f(x^{*})}$. We have $\sqrt{\lambda_{\min}/2}\lVert x-x^{*}\rVert\leqslant\sqrt{f(x)-f(x^{*})}\leqslant\sqrt{\lambda_{\max}/2}\lVert x-x^{*}\rVert$, where $\lambda_{\min}$ and $\lambda_{\max}$ are the minimal and maximal eigenvalues of $H$. For example, for $f(x)=\lVert x-x^{*}\rVert^{2}+1$, the first hitting time of $f_{\mu}(m_{t})$ hence also translates to the first hitting time of $\lVert m_{t}-x^{*}\rVert$. More generally, we will formalize an assumption on $f$ later on (Assumption A1) that allows us to bound $\lVert x-x^{*}\rVert$ from above and below by constants times $f_{\mu}(x)$ (see (6)), implying that the first hitting time of $\lVert m_{t}-x^{*}\rVert$ to $\epsilon$ is bounded by that of $f_{\mu}(m_{t})$ to $\epsilon$, times a constant.
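As a sanity check of the closed-form expression above, the following Python sketch (illustrative values for $H$ and $x$, not from the paper) estimates the sub-levelset volume of a two-dimensional convex quadratic by rejection sampling and compares its $d$-th root with $V_{d}\big[2(f(x)-f(x^{*}))/\det(H)^{1/d}\big]^{1/2}$.

import numpy as np
from scipy.special import gamma

rng = np.random.default_rng(0)
d = 2
H = np.array([[4.0, 1.0], [1.0, 2.0]])  # positive definite Hessian
f = lambda y: 0.5 * y @ H @ y           # minimum x* = 0 with f(x*) = 0
x = np.array([1.0, -0.5])
c = f(x)                                # level of the sub-levelset

# The sub-levelset is contained in the ball of radius sqrt(2c/lambda_min):
R = np.sqrt(2 * c / np.linalg.eigvalsh(H)[0])
Y = rng.uniform(-R, R, size=(2_000_000, d))  # rejection sampling in a box
inside = 0.5 * np.einsum('ij,jk,ik->i', Y, H, Y) <= c
vol = (2 * R) ** d * inside.mean()           # estimate of mu({y : f(y) <= f(x)})

V_d = np.pi ** 0.5 / gamma(d / 2 + 1) ** (1 / d)
closed_form = V_d * np.sqrt(2 * c / np.linalg.det(H) ** (1 / d))
print(vol ** (1 / d), closed_form)           # both approx 2.04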

2.2.2 Success Probability

The success probability, i.e., the probability of sampling a candidate solution $x_{t}$ with an objective function value better than or equal to that of the current solution $m_{t}$, plays an important role in the analysis of the $(1+1)\text{-ES}_{\kappa}$ with success-based step-size control. We present here several useful definitions related to the success probability.

We start with the definitions of the success domain with rate $r$ and of the success probability with rate $r$: the probability of sampling in the $r$-success domain is called the success probability with rate $r$. When $r=0$, we simply speak of the success probability.¹ (Footnote 1: For $r=0$, the success domain $S_{0}(m)$ is not necessarily equal to the sub-levelset $S_{0}^{\prime}(m):=\{x\in\mathbb{R}^{d}\mid f(x)\leqslant f(m)\}$; it always holds that $S_{0}^{\prime}(m)\subseteq S_{0}(m)$. However, since $\mu(S_{0}(m)\setminus S_{0}^{\prime}(m))=0$ is guaranteed by [24, Lemma 1], and since $\Phi(\cdot;0,\Sigma)$ is absolutely continuous for $\Sigma\in\mathcal{S}_{\kappa}$, the success probability with rate $r=0$ is equal to $\Pr_{z\sim\mathcal{N}(0,\Sigma)}\left[m+f_{\mu}(m)\bar{\sigma}z\in S_{0}^{\prime}(m)\right]$, with $\bar{\sigma}$ defined in (3).)

Definition 2.2 (Success Domain).

For a measurable function $f:\mathbb{R}^{d}\to\mathbb{R}$ and $m\in\mathbb{R}^{d}$ such that $f_{\mu}(m)<\infty$, the $r$-success domain at $m$ with $r\in[0,1]$ is defined as

(2) $S_{r}(m)=\{x\in\mathbb{R}^{d}\mid f_{\mu}(x)\leqslant(1-r)f_{\mu}(m)\}\enspace.$

Definition 2.3 (Success Probability).

Let $f$ be a measurable function and let $m_{0}\in\mathbb{R}^{d}$ be the initial search point, satisfying $f_{\mu}(m_{0})<\infty$. For any $r\in[0,1]$ and any $m\in S_{0}(m_{0})$, the success probability with rate $r$ at $m$ under the normalized step-size $\bar{\sigma}$ is defined as

(3) $p^{\mathrm{succ}}_{r}(\bar{\sigma};m,\Sigma)=\Pr_{z\sim\mathcal{N}(0,\Sigma)}\left[m+f_{\mu}(m)\bar{\sigma}z\in S_{r}(m)\right]\enspace.$

Definition 2.3 introduces the notion of the normalized step-size $\bar{\sigma}$, and the success probability is defined as a function of $\bar{\sigma}$ rather than of the actual step-size $\sigma=f_{\mu}(m)\bar{\sigma}$. This is motivated by the fact that as $m$ approaches the global optimum $x^{*}$ of $f$, the step-size $\sigma$ needs to shrink for the success probability to stay constant. If the objective function is $f(x)=\frac{1}{2}\lVert x-x^{*}\rVert^{2}$ and the covariance matrix is the identity matrix, then the success probability is fully controlled by $\bar{\sigma}_{t}=\sigma_{t}/f_{\mu}(m_{t})\propto\sigma_{t}/\lVert m_{t}-x^{*}\rVert$ and is independent of $m_{t}$. This statement can be formalized in the following way.

Lemma 2.4.

If $f(x)=\frac{1}{2}\lVert x-x^{*}\rVert^{2}$, then, letting $e_{1}=(1,0,\ldots,0)$, we have

$p^{\mathrm{succ}}_{r}(\bar{\sigma};m,I)=\Pr_{z\sim\mathcal{N}(0,I)}\left[m+f_{\mu}(m)\bar{\sigma}z\in S_{r}(m)\right]=\Pr_{z\sim\mathcal{N}(0,I)}\left[\|e_{1}+V_{d}\bar{\sigma}z\|\leqslant 1-r\right]\enspace.$

Proof 2.5.

The suboptimality function is the $d$-th root of the volume of a ball of radius $\|x-x^{*}\|$. Hence $f_{\mu}(x)=V_{d}\lVert x-x^{*}\rVert$. The proof then follows the derivation in Section 3 of [4].

Therefore, $\bar{\sigma}$ is more discriminative than $\sigma$ itself. In general, however, the optimal step-size is not necessarily proportional to either $\lVert m_{t}-x^{*}\rVert$ or $f_{\mu}(m_{t})$.
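A small Monte Carlo experiment (a sketch, not from the paper) illustrates Lemma 2.4: on the sphere with $\Sigma=I$, the estimated success probability agrees, up to sampling noise, for incumbents at very different distances from $x^{*}$, as long as the normalized step-size $\bar{\sigma}$ is the same.

import numpy as np
from scipy.special import gamma

rng = np.random.default_rng(1)
d, n = 10, 200_000
V_d = np.pi ** 0.5 / gamma(d / 2 + 1) ** (1 / d)
x_star = np.zeros(d)

def p_succ(m, sigma_bar, r=0.0):
    """Estimate p_r^succ(sigma_bar; m, I) on f(x) = 0.5 ||x - x*||^2."""
    f_mu = V_d * np.linalg.norm(m - x_star)  # f_mu(m) on the sphere
    x = m + f_mu * sigma_bar * rng.standard_normal((n, d))
    return np.mean(np.linalg.norm(x - x_star, axis=1)
                   <= (1 - r) * np.linalg.norm(m - x_star))

m1 = np.full(d, 1.0)                # ||m1 - x*|| = sqrt(10)
m2 = np.r_[100.0, np.zeros(d - 1)]  # ||m2 - x*|| = 100
for sb in (0.01, 0.1, 0.5):
    print(sb, p_succ(m1, sb), p_succ(m2, sb))  # the two columns agree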

Since the success probability under a given normalized step-size depends on $m$ and $\Sigma$, we define the upper and lower success probabilities as follows.

Definition 2.6 (Lower and Upper Success Probability).

Let $\mathcal{X}_{a}^{b}=\{x\in\mathbb{R}^{d}:a<f_{\mu}(x)\leqslant b\}$. Given the normalized step-size $\bar{\sigma}>0$, the lower and upper success probabilities are defined as

$p^{\mathrm{lower}}_{(a,b]}(\bar{\sigma})=\inf_{m\in\mathcal{X}_{a}^{b}}\inf_{\Sigma\in\mathcal{S}_{\kappa}}p^{\mathrm{succ}}_{0}(\bar{\sigma};m,\Sigma)\enspace,\qquad p^{\mathrm{upper}}_{(a,b]}(\bar{\sigma})=\sup_{m\in\mathcal{X}_{a}^{b}}\sup_{\Sigma\in\mathcal{S}_{\kappa}}p^{\mathrm{succ}}_{0}(\bar{\sigma};m,\Sigma)\enspace.$

A central quantity for our analysis is the limit of the success probability $p^{\mathrm{succ}}_{0}(\bar{\sigma};m,\Sigma)$ as $\bar{\sigma}\to 0$. Intuitively, if this limit is too small (compared to $p_{\mathrm{target}}$) at a given $m$, then, because the ruling principle of the algorithm is to decrease the step-size whenever the probability of success is smaller than $p_{\mathrm{target}}$, the step-size will keep decreasing, causing undesired convergence. Following Glasmachers [24], we introduce the concepts of $p$-improvability and $p$-criticality. They are defined in [24] through the probability of sampling a better point from the isotropic normal distribution in the limit of the step-size going to zero. Here, we define $p$-improvability and $p$-criticality for a general multivariate normal distribution.

Definition 2.7 (pp-improvability and pp-criticality).

Let $f$ be a measurable function. The function $f$ is called $p$-improvable at $m\in\mathbb{R}^{d}$ under the covariance matrix $\Sigma\in\mathcal{S}_{\kappa}$ if there exists $p\in(0,1]$ such that

(4) $p=\liminf_{\bar{\sigma}\to+0}p^{\mathrm{succ}}_{0}(\bar{\sigma};m,\Sigma)\enspace.$

Otherwise, it is called $p$-critical.

The connection to the classical definition of the critical points for continuously differentiable functions is summarized in the following proposition, which is an extension of Lemma 4 in [24], taking a non-identity covariance matrix into account.

Proposition 2.8.

Let $f=g\circ h$ be a measurable function, where $g$ is a strictly increasing function and $h$ is continuously differentiable. Then, $f$ is $p$-improvable with $p=1/2$ at any regular point $m$ (where $\nabla h(m)\neq 0$) under any $\Sigma\in\mathcal{S}_{\kappa}$. Moreover, if $h$ is twice continuously differentiable at a critical point $m$ (where $\nabla h(m)=0$) and at least one eigenvalue of $\nabla^{2}h(m)$ is non-zero, then, under any $\Sigma\in\mathcal{S}_{\kappa}$, $m$ is $p$-improvable with $p=1$ if $\nabla^{2}h(m)$ has only non-positive eigenvalues, $p$-critical if $\nabla^{2}h(m)$ has only non-negative eigenvalues, and $p$-improvable with some $p>0$ if $\nabla^{2}h(m)$ has at least one strictly negative eigenvalue.

Proof 2.9.

Note that $p^{\mathrm{succ}}_{0}(\bar{\sigma};m,\Sigma)$ on $f$ is equal to $p^{\mathrm{succ}}_{0}(\bar{\sigma};m,I_{d})$ on $f^{\prime}(x)=f(m+\sqrt{\Sigma}(x-m))$. Therefore, it suffices to show that the claims hold for $\Sigma=I_{d}$ on $f^{\prime}$, which is proven in Lemma 4 of [24].
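A quick simulation illustrates Proposition 2.8 for the smooth case $h(x)=\lVert x\rVert^{2}$. We use the plain step-size $\sigma$ as a stand-in for $\bar{\sigma}$: at a regular point the two differ only by the constant factor $f_{\mu}(m)$, while at the minimum $f_{\mu}(m)=0$ and the success probability vanishes for every $\sigma$. All numeric values are illustrative.

import numpy as np

rng = np.random.default_rng(2)
d, n = 10, 400_000
f = lambda x: np.sum(x * x, axis=-1)  # smooth, unique minimum at 0

def p_succ(m, sigma):
    z = rng.standard_normal((n, d))
    return np.mean(f(m + sigma * z) <= f(m))

m_regular, m_critical = np.full(d, 1.0), np.zeros(d)
for sigma in (1.0, 0.1, 0.01):
    print(sigma, p_succ(m_regular, sigma), p_succ(m_critical, sigma))
# first column tends to 1/2 (p-improvable with p = 1/2);
# second column is identically 0 (the minimum is p-critical)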

2.3 Main Assumptions on the Objective Functions

Figure 2: The sampling distribution is indicated by the mean $m$ and the shaded orange circle, indicating one standard deviation. The blue set is the sub-levelset $S_{0}(m)$ of points improving upon $m$. Left: Illustration of property A1 in Section 2.3. The blue set is enclosed in the red (outer) ball of radius $C_{u}f_{\mu}(m)$ and contains the dark green (inner) ball of radius $C_{\ell}f_{\mu}(m)$. The shaded light green ball indicates the worst-case situation captured by the bound, namely that the small ball is positioned within the large ball at maximal distance to $m$. Right: Illustration of property A2 in Section 2.3. Although the level set has a kink at $m$, there exists a cone centered at $m$ covering a probability mass of $p^{\mathrm{limit}}$ of improving steps (inside $S_{0}(m)$) for a small enough step-size $\sigma$ (green outline). It contains a smaller cone (red outline) covering a probability mass of $p_{\mathrm{target}}$.

Given $a$ and $b$ satisfying $0\leqslant a<b\leqslant+\infty$, and a measurable objective function, let $\mathcal{X}_{a}^{b}$ be the set defined in Definition 2.6.

We pose two core assumptions on the objective function under which we will derive an upper bound on the expected first hitting time of $[0,\epsilon]$ by $f_{\mu}(m_{t})$ (Theorem 4.6), provided $a\leqslant\epsilon\leqslant f_{\mu}(m_{0})\leqslant b$. First, we require that balls whose radius scales with $f_{\mu}(m)$ can be included in and wrapped around the sublevel sets of $f$. We do not require this to hold on the whole search space, but only on a set $\mathcal{X}_{a}^{b}$.

A1. We assume that $f$ is a measurable function and that there exists $\mathcal{X}_{a}^{b}$ such that for any $m\in\mathcal{X}_{a}^{b}$, there exist an open ball $\mathcal{B}_{\ell}$ with radius $C_{\ell}f_{\mu}(m)$ and a closed ball $\bar{\mathcal{B}}_{u}$ with radius $C_{u}f_{\mu}(m)$ such that $\mathcal{B}_{\ell}\subseteq\{x\in\mathbb{R}^{d}\mid f_{\mu}(x)<f_{\mu}(m)\}$ and $\{x\in\mathbb{R}^{d}\mid f_{\mu}(x)\leqslant f_{\mu}(m)\}\subseteq\bar{\mathcal{B}}_{u}$.

We do not specify the centers of those balls; they may or may not be centered at an optimum of the function. We will see in Proposition 4.1 that this assumption allows us to bound $p^{\mathrm{lower}}_{(a,b]}(\bar{\sigma})$ and $p^{\mathrm{upper}}_{(a,b]}(\bar{\sigma})$ by tractable functions of $\bar{\sigma}$, which will be essential for the analysis. The property is illustrated in Figure 2 (left).

The second assumption requires the function to be $p$-improvable with $p$ lower-bounded uniformly over $\mathcal{X}_{a}^{b}$.

A2. Let $f$ be a measurable function. We assume that there exist $\mathcal{X}_{a}^{b}$ and $p^{\mathrm{limit}}>p_{\mathrm{target}}$ such that for any $m\in\mathcal{X}_{a}^{b}$ and any $\Sigma\in\mathcal{S}_{\kappa}$, the objective function $f$ is $p$-improvable for some $p\geqslant p^{\mathrm{limit}}$, i.e.,

(5) $\liminf_{\bar{\sigma}\downarrow 0}p^{\mathrm{lower}}_{(a,b]}(\bar{\sigma})\geqslant p^{\mathrm{limit}}\enspace.$

The property is illustrated in Figure 2 (right). For a continuous function, this assumption implies in particular that $\mathcal{X}_{a}^{b}$ does not contain any local optimum. Such an assumption is required to obtain global convergence [24, Theorem 2] even without any covariance matrix adaptation (i.e., with $\kappa=1$), and it can be understood intuitively: if we have a point that is $p$-improvable with $p<p_{\mathrm{target}}$ and that is not a local minimum of the function, then, starting with a small step-size, the success-based step-size control may keep decreasing the step-size at such a point, and the $(1+1)\text{-ES}_{\kappa}$ will prematurely converge to a point that is not a local optimum.

If A1 is satisfied with balls centered at the optimum $x^{*}$ of the function $f$, then it is easy to see that for all $x\in\mathcal{X}_{a}^{b}$

(6) $C_{\ell}f_{\mu}(x)\leqslant\lVert x-x^{*}\rVert\leqslant C_{u}f_{\mu}(x)\enspace.$

If the balls are not centered at the optimum, we still have the one-sided inequality $\lVert x-x^{*}\rVert\leqslant 2C_{u}f_{\mu}(x)$, since both $x$ and $x^{*}$ belong to the enclosing ball $\bar{\mathcal{B}}_{u}$ of radius $C_{u}f_{\mu}(x)$. Hence, the expected first hitting time of $f_{\mu}(m_{t})$ to $[0,\epsilon]$ translates into an upper bound on the expected first hitting time of $\lVert m_{t}-x^{*}\rVert$ to $[0,2C_{u}\epsilon]$.

We remark that A1 and A2 being satisfied for $a=0$ allows us to include some non-differentiable functions with non-convex sublevel sets, as illustrated in Figure 2.

We now give two examples of function classes that satisfy A1 and A2, including classes on which linear convergence of numerical optimization algorithms is typically analyzed. The first class consists of quadratically bounded functions. It includes all strongly convex functions with Lipschitz continuous gradient, as well as some non-convex functions. The second class consists of positively homogeneous functions. The level sets of a positively homogeneous function are all geometrically similar around $x^{*}$.

A3. We assume that $f=g\circ h$, where $g$ is a strictly increasing function and $h$ is measurable, continuously differentiable with the unique critical point $x^{*}$, and quadratically bounded around $x^{*}$, i.e., for some $L_{u}\geqslant L_{\ell}>0$,

(7) $(L_{\ell}/2)\lVert x-x^{*}\rVert^{2}\leqslant h(x)-h(x^{*})\leqslant(L_{u}/2)\lVert x-x^{*}\rVert^{2}\enspace.$

A4. We assume that $f=g\circ h$, where $g$ is a strictly increasing function and $h$ is continuously differentiable and positively homogeneous with a unique optimum $x^{*}$, i.e., for all $\gamma>0$ and all $x\in\mathbb{R}^{d}$,

(8) $h(x^{*}+\gamma x)=h(x^{*})+\gamma\left(h(x^{*}+x)-h(x^{*})\right)\enspace.$

The following lemmas show that these assumptions imply A1 and A2. The proofs of the lemmas are presented in Section B.1 and Section B.2, respectively.

Lemma 2.10.

Let $f$ satisfy A3. Then, $f$ satisfies A1 and A2 with $a=0$, $b=\infty$, $C_{\ell}=\frac{1}{V_{d}}\sqrt{\frac{L_{\ell}}{L_{u}}}$ and $C_{u}=\frac{1}{V_{d}}\sqrt{\frac{L_{u}}{L_{\ell}}}$.

Lemma 2.11.

Let $f$ be positively homogeneous satisfying A4. Then the suboptimality function $f_{\mu}(x)$ is proportional to $h(x)-h(x^{*})$, and $f$ satisfies A1 and A2 for $a=0$ and $b=\infty$ with $C_{u}=\sup\{\lVert x-x^{*}\rVert:f_{\mu}(x)=1\}$ and $C_{\ell}=\inf\{\lVert x-x^{*}\rVert:f_{\mu}(x)=1\}$.

3 Methodology: Additive Drift on Unbounded Continuous Domains

3.1 First Hitting Time

We start with the generic definition of the first hitting time of a stochastic process $\{X_{t}:t\geqslant 0\}$.

Definition 3.1 (First hitting time).

Let $\{X_{t}:t\geqslant 0\}$ be a sequence of real-valued random variables adapted to the natural filtration $\{\mathcal{F}_{t}:t\geqslant 0\}$ with initial condition $X_{0}=\beta_{0}\in\mathbb{R}$. For $\beta<\beta_{0}$, the first hitting time $T_{\beta}^{X}$ of $X_{t}$ to the set $(-\infty,\beta]$ is defined as $T_{\beta}^{X}=\inf\{t:X_{t}\leqslant\beta\}$.

The first hitting time is the number of iterations that the stochastic process requires to reach the target level $\beta<\beta_{0}$ for the first time. In our situation, $X_{t}=\lVert m_{t}-x^{*}\rVert$ measures the distance from the current solution $m_{t}$ to the target point $x^{*}$ (typically a global or local optimum) after $t$ iterations. Then $\beta=\epsilon>0$ defines the target accuracy, and $T_{\epsilon}^{X}$ is the runtime of the algorithm until it finds the $\epsilon$-neighborhood $\mathcal{B}(x^{*},\epsilon)$. The first hitting time $T_{\epsilon}^{X}$ is a random variable, as $m_{t}$ is a random variable. In this paper, we focus on the expected first hitting time $\mathbb{E}[T_{\epsilon}^{X}]$. We want to derive lower and upper bounds on this expected hitting time that relate to the linear convergence of $m_{t}$ towards $x^{*}$. Such bounds take the following form: there exist $C_{T},\tilde{C}_{T}\in\mathbb{R}$ and $C_{R}>0$, $\tilde{C}_{R}>0$ such that for any $0<\epsilon\leqslant\beta_{0}$

(9) $\tilde{C}_{T}+\frac{\log\left(\lVert m_{0}-x^{*}\rVert/\epsilon\right)}{\tilde{C}_{R}}\leqslant\mathbb{E}[T_{\epsilon}^{X}\,|\,\mathcal{F}_{0}]\leqslant C_{T}+\frac{\log\left(\lVert m_{0}-x^{*}\rVert/\epsilon\right)}{C_{R}}\enspace.$

That is, the time to reach the target accuracy scales logarithmically with the ratio between the initial accuracy $\lVert m_{0}-x^{*}\rVert$ and the target accuracy $\epsilon$. The first pair of constants, $C_{T}$ and $\tilde{C}_{T}$, captures the transient time, which is the time that adaptive algorithms typically spend on adaptation. The second pair of constants, $C_{R}$ and $\tilde{C}_{R}$, reflects the speed of convergence (logarithmic convergence rate). Intuitively, assuming that $C_{R}$ and $\tilde{C}_{R}$ are close, the distance to the optimum decreases in each step at a rate of approximately $\exp(-C_{R})\approx\exp(-\tilde{C}_{R})$. While the upper bound informs us about (linear) convergence, the lower bound helps in understanding whether the upper bound is tight.

Alternatively, linear convergence can be defined as the following property: there exists $C>0$ such that

(10) $\limsup_{t\to\infty}\frac{1}{t}\log\frac{\|m_{t}-x^{*}\|}{\|m_{0}-x^{*}\|}\leqslant-C\quad\text{almost surely.}$

When we have equality in the previous statement, we say that $\exp(-C)$ is the convergence rate.

Figure 1 (right plot) visualizes three runs of the (1+1)-ES on a function with spherical level sets, started with different initial step-sizes. First of all, we clearly observe linear convergence: the first hitting time of $\mathcal{B}(x^{*},\epsilon)$ scales linearly with $\log(1/\epsilon)$ for sufficiently small $\epsilon>0$. Second, the convergence speed is independent of the initial condition. Therefore, we expect universal constants $C_{R}$ and $\tilde{C}_{R}$ independent of the initial state. Last, depending on the initial step-size, the transient time can vary. If the initial step-size is too large or too small, the algorithm makes no progress in terms of $\lVert m_{t}-x^{*}\rVert$ until the step-size is well adapted. Therefore, $C_{T}$ and $\tilde{C}_{T}$ depend on the initial condition, with a logarithmic dependency on the initial multiplicative mismatch.

3.2 Bounds of the Hitting Time via Drift Conditions

We are going to use drift analysis, which consists in deducing properties of a sequence $\{X_{t}:t\geqslant 0\}$ (adapted to a natural filtration $\{\mathcal{F}_{t}:t\geqslant 0\}$) from its drift, defined as $\mathbb{E}[X_{t+1}\mid\mathcal{F}_{t}]-X_{t}$ [29]. Drift analysis has been widely used to analyze hitting times of evolutionary algorithms defined on discrete search spaces (mainly on binary search spaces) [32, 33, 11, 46, 20, 19]. Though developed mainly for finite search spaces, the drift theorems naturally generalize to continuous domains [42, 44]. Indeed, Jägersküpper's work [34, 36, 37] is based on the same idea, although the link to drift analysis was implicit.

Since many drift conditions have been developed for analyzing algorithms on discrete domains, the domain of $X_{t}$ is often implicitly assumed to be bounded. However, this assumption is violated in our situation, where we will use $X_{t}=\log\big(f_{\mu}(m_{t})\big)$ as the quality measure, which takes values in $\mathbb{R}\cup\{-\infty\}$ and is meant to approach $-\infty$. We refer to [4] for additional details. In general, translating expected progress into a hitting-time bound requires bounding the tail of the progress distribution, as formalized in [29].

To control the tails of the drift distribution, we construct a stochastic process $\{Y_{t}:t\geqslant 0\}$ iteratively as follows: $Y_{0}=X_{0}$ and

(11) $Y_{t+1}=Y_{t}+\max\big\{X_{t+1}-X_{t},-A\big\}1\{T_{\beta}^{X}>t\}-B1\{T_{\beta}^{X}\leqslant t\}$

for some $A\geqslant B>0$ and $\beta<\beta_{0}$, with $X_{0}=\beta_{0}$. The process clips $X_{t+1}-X_{t}$ from below at the constant $-A$ ($A>0$); we introduce the indicator $1\{T_{\beta}^{X}>t\}$ for a technical reason. The process thus disregards progress larger than $A$, and it fixes to $B$ the progress of every step from the hitting time onward. This construction is formalized in the following theorem, which is our main mathematical tool to derive an upper bound on the expected first hitting time of the $(1+1)\text{-ES}_{\kappa}$ in the form of (9).

Theorem 3.2.

Let $\{X_{t}:t\geqslant 0\}$ be a sequence of real-valued random variables adapted to a filtration $\{\mathcal{F}_{t}:t\geqslant 0\}$ with $X_{0}=\beta_{0}\in\mathbb{R}$. For $\beta<\beta_{0}$, let $T^{X}_{\beta}=\inf\{t:X_{t}\leqslant\beta\}$ be the first hitting time of the set $(-\infty,\beta]$. Define a stochastic process $\{Y_{t}:t\geqslant 0\}$ iteratively through (11) with $Y_{0}=X_{0}$ for some $A\geqslant B>0$, and let $T^{Y}_{\beta}=\inf\{t:Y_{t}\leqslant\beta\}$ be the first hitting time of the set $(-\infty,\beta]$ by $Y_{t}$. If $Y_{t}$ is integrable, i.e., $\mathbb{E}\left[|Y_{t}|\right]<\infty$, and

(12) $\mathbb{E}\left[\max\left\{X_{t+1}-X_{t},-A\right\}1\{T_{\beta}^{X}>t\}\,\big|\,\mathcal{F}_{t}\right]\leqslant-B1\{T_{\beta}^{X}>t\}\enspace,$

then the expectation of $T^{X}_{\beta}$ satisfies

(13) $\mathbb{E}\left[T^{X}_{\beta}\right]\leqslant\mathbb{E}\left[T^{Y}_{\beta}\right]\leqslant\frac{A+\beta_{0}-\beta}{B}\enspace.$

Proof 3.3 (Proof of Theorem 3.2).

We consider the stopped process $\bar{Y}_{t}=Y_{\min\{t,T^{Y}_{\beta}\}}$. Then, we have $Y_{t}=\bar{Y}_{t}$ for $t\leqslant T^{Y}_{\beta}$ and $\bar{Y}_{t}\geqslant Y_{T^{Y}_{\beta}}$ for all $t$.

We will prove that

(14) $\mathbb{E}[\bar{Y}_{t+1}\mid\mathcal{F}_{t}]\leqslant\bar{Y}_{t}-B1\{T^{Y}_{\beta}>t\}\enspace.$

We start from

(15) $\mathbb{E}[\bar{Y}_{t+1}\mid\mathcal{F}_{t}]=\mathbb{E}[\bar{Y}_{t+1}1\{T^{Y}_{\beta}\leqslant t\}\mid\mathcal{F}_{t}]+\mathbb{E}[\bar{Y}_{t+1}1\{T^{Y}_{\beta}>t\}\mid\mathcal{F}_{t}]$

and bound the two terms separately. First,

(16) $\mathbb{E}[\bar{Y}_{t+1}1\{T^{Y}_{\beta}\leqslant t\}\mid\mathcal{F}_{t}]=\mathbb{E}[\bar{Y}_{t}1\{T^{Y}_{\beta}\leqslant t\}\mid\mathcal{F}_{t}]=\bar{Y}_{t}1\{T^{Y}_{\beta}\leqslant t\}\enspace,$

where we have used that $1\{T_{\beta}^{X}>t\}$, $Y_{t}$, $1\{T_{\beta}^{Y}>t\}$, and $\bar{Y}_{t}$ are all $\mathcal{F}_{t}$-measurable. Second,

(17) $\mathbb{E}[\bar{Y}_{t+1}1\{T^{Y}_{\beta}>t\}\mid\mathcal{F}_{t}]=\mathbb{E}[Y_{t+1}\mid\mathcal{F}_{t}]1\{T^{Y}_{\beta}>t\}\leqslant\big(Y_{t}-B1\{T^{X}_{\beta}>t\}-B1\{T^{X}_{\beta}\leqslant t\}\big)1\{T^{Y}_{\beta}>t\}=(\bar{Y}_{t}-B)1\{T^{Y}_{\beta}>t\}\enspace,$

where we have used condition Eq. 12. Hence, by injecting Eq. 16 and Eq. 17 into Eq. 15, we obtain Eq. 14. Taking the expectation of Eq. 14, we deduce $\mathbb{E}[\bar{Y}_{t+1}]\leqslant\mathbb{E}[\bar{Y}_{t}]-B\Pr[T_{\beta}^{Y}>t]$. Following the same approach as [44, Theorem 1], since $T^{Y}_{\beta}$ is a random variable taking values in $\mathbb{N}$, its expectation can be written as $\mathbb{E}[T^{Y}_{\beta}]=\sum_{t=0}^{+\infty}\Pr[T^{Y}_{\beta}>t]$, and thus it holds that

$B\mathbb{E}\left[T^{Y}_{\beta}\right]\;\stackrel{\tilde{t}\to\infty}{\longleftarrow}\;\sum_{t=0}^{\tilde{t}}B\Pr\left[T^{Y}_{\beta}>t\right]\leqslant\sum_{t=0}^{\tilde{t}}\Big(\mathbb{E}[\bar{Y}_{t}]-\mathbb{E}[\bar{Y}_{t+1}]\Big)=\mathbb{E}[\bar{Y}_{0}]-\mathbb{E}[\bar{Y}_{\tilde{t}}]\enspace.$

Since $Y_{t+1}\geqslant Y_{t}-A$, we have $Y_{T^{Y}_{\beta}}\geqslant\beta-A$. Given that $\bar{Y}_{\tilde{t}}\geqslant Y_{T^{Y}_{\beta}}$, we deduce that $\mathbb{E}[\bar{Y}_{\tilde{t}}]\geqslant\beta-A$ for all $\tilde{t}$. With $\mathbb{E}[\bar{Y}_{0}]=\beta_{0}$, we obtain

$\mathbb{E}\left[T^{Y}_{\beta}\right]\leqslant(A/B)+B^{-1}(\beta_{0}-\beta)\enspace.$

Because $\beta<X_{t}\leqslant Y_{t}$ for $0\leqslant t<T^{X}_{\beta}$, we have $T^{X}_{\beta}\leqslant T^{Y}_{\beta}$, implying $\mathbb{E}[T^{X}_{\beta}]\leqslant\mathbb{E}[T^{Y}_{\beta}]$. This completes the proof.

This theorem can be intuitively understood. Assume, for the sake of simplicity, a process $X_{t}$ such that $X_{t+1}\geqslant X_{t}-A$. Then (12) states that the process progresses in expectation by at least $B$ per step. The theorem concludes that the expected time needed to reach a value smaller than $\beta$ when started at $\beta_{0}$ equals $(\beta_{0}-\beta)/B$ (what we would get for a deterministic algorithm) plus $A/B$. This last term is due to the stochastic nature of the algorithm. It is minimized when $A$ is as close as possible to $B$, which corresponds to a highly concentrated process.
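A toy simulation (not from the paper) makes the bound (13) tangible: for i.i.d. steps uniform on $[-2B,0]$ and $A\geqslant 2B$, the truncation in (12) is inactive and the expected drift is exactly $-B$, so the empirical mean hitting time must stay below $(A+\beta_{0}-\beta)/B$.

import numpy as np

rng = np.random.default_rng(3)
A, B = 1.0, 0.25
beta0, beta = 10.0, 0.0

def hitting_time():
    x, t = beta0, 0
    while x > beta:
        x += rng.uniform(-2 * B, 0.0)  # E[step] = -B and step >= -2B >= -A
        t += 1
    return t

T = np.array([hitting_time() for _ in range(10_000)])
print(T.mean(), (A + beta0 - beta) / B)  # empirical mean (approx 41) <= bound 44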

Jägersküpper [36, Theorem 2] established a general lower bound on the expected first hitting time of the (1+1)-ES. We borrow the same idea to prove the following general theorem for a lower bound on the expected first hitting time, which generalizes [37, Lemma 12]. See Theorem 2.3 in [4] for its proof.

Theorem 3.4.

Let $\{X_{t}:t\geqslant 0\}$ be a sequence of integrable real-valued random variables adapted to a filtration $\{\mathcal{F}_{t}:t\geqslant 0\}$ such that $X_{0}=\beta_{0}$, $X_{t+1}\leqslant X_{t}$, and $\mathbb{E}[X_{t+1}\mid\mathcal{F}_{t}]-X_{t}\geqslant-C$ for some $C>0$. For $\beta<\beta_{0}$, we define $T^{X}_{\beta}=\min\{t:X_{t}\leqslant\beta\}$. Then the expected hitting time is lower bounded by $\mathbb{E}\left[T_{\beta}^{X}\right]\geqslant-(1/2)+(4C)^{-1}(\beta_{0}-\beta)$.

4 Main Result: Expected First Hitting Time Bound

4.1 Mathematical Modeling of the Algorithm

In the sequel, we analyze the process $\{\theta_{t}:t\geqslant 0\}$, where $\theta_{t}=(m_{t},\sigma_{t},\Sigma_{t})\in\mathbb{R}^{d}\times\mathbb{R}_{>}\times\mathcal{S}_{\kappa}$, generated by the $(1+1)\text{-ES}_{\kappa}$ algorithm. We assume from now on that the optimized objective function $f$ is measurable with respect to the Borel $\sigma$-algebra. We equip the state-space $\mathcal{X}=\mathbb{R}^{d}\times\mathbb{R}_{>}\times\mathcal{S}_{\kappa}$ with its Borel $\sigma$-algebra, denoted $\mathcal{B}(\mathcal{X})$.

4.2 Preliminaries

We present two preliminary results. In Assumption A1, we assume that for $m\in\mathcal{X}_{a}^{b}$, we can include a ball of radius $C_{\ell}f_{\mu}(m)$ into the sublevel set $S_{0}(m)$ and embed $S_{0}(m)$ into a ball of radius $C_{u}f_{\mu}(m)$. This allows us to upper and lower bound the probability of success, for all $m\in\mathcal{X}_{a}^{b}$ and all $\Sigma\in\mathcal{S}_{\kappa}$, by the probability of sampling inside balls of radius $C_{u}f_{\mu}(m)$ and $C_{\ell}f_{\mu}(m)$ with appropriate centers. From this we can upper-bound $p^{\mathrm{upper}}_{(a,b]}(\bar{\sigma})$ by a function of $\bar{\sigma}$; similarly, we can lower-bound $p^{\mathrm{lower}}_{(a,b]}(\bar{\sigma})$ by a function of $\bar{\sigma}$. The corresponding proof is given in Section B.3.

Proposition 4.1.

Suppose that $f$ satisfies A1. Consider the lower and upper success probabilities $p^{\mathrm{lower}}_{(a,b]}$ and $p^{\mathrm{upper}}_{(a,b]}$ defined in Definition 2.6. Then

(18) $p^{\mathrm{upper}}_{(a,b]}(\bar{\sigma})\leqslant\kappa^{d/2}\Phi\left(\bar{\mathcal{B}}\left(0,\frac{C_{u}}{\bar{\sigma}\kappa^{1/2}}\right);0,I\right)$

(19) $p^{\mathrm{lower}}_{(a,b]}(\bar{\sigma})\geqslant\kappa^{-d/2}\Phi\left(\bar{\mathcal{B}}\left(\frac{(2C_{u}-C_{\ell})\kappa^{1/2}}{\bar{\sigma}}e_{1},\ \frac{C_{\ell}\kappa^{1/2}}{\bar{\sigma}}\right);0,I\right)\enspace,$

where $e_{1}=(1,0,\dots,0)$.
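Both right-hand sides can be evaluated numerically: for $z\sim\mathcal{N}(0,I_{d})$, the squared distance $\lVert z-ce_{1}\rVert^{2}$ follows a noncentral chi-square distribution with $d$ degrees of freedom and noncentrality $c^{2}$. The following sketch uses placeholder values for $\kappa$, $C_{u}$, and $C_{\ell}$.

import numpy as np
from scipy.stats import chi2, ncx2

d, kappa, C_u, C_ell = 10, 2.0, 1.0, 0.5  # placeholder constants

def rhs_18(sigma_bar):
    """Right-hand side of (18): ball centered at 0, central chi-square."""
    R = C_u / (sigma_bar * np.sqrt(kappa))
    return kappa ** (d / 2) * chi2.cdf(R ** 2, df=d)

def rhs_19(sigma_bar):
    """Right-hand side of (19): ball shifted by c e_1, noncentral chi-square."""
    c = (2 * C_u - C_ell) * np.sqrt(kappa) / sigma_bar
    R = C_ell * np.sqrt(kappa) / sigma_bar
    return kappa ** (-d / 2) * ncx2.cdf(R ** 2, df=d, nc=c ** 2)

for sb in (0.05, 0.1, 0.3):
    print(sb, rhs_19(sb), rhs_18(sb))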

We use the previous proposition to establish the next lemma, which guarantees the existence of a finite range of normalized step-sizes that keeps the success probability within some range $(p_{u},p_{\ell})$ independently of $m$ and $\Sigma$, and provides a lower bound on the success probability with rate $r$ when the normalized step-size is in this range. Its proof is provided in Section B.4.

Lemma 4.2.

Assume that $f$ satisfies A1 and A2 for some $0\leqslant a<b\leqslant\infty$. Then, for any $p_{u}$ and $p_{\ell}$ satisfying $0<p_{u}<p_{\mathrm{target}}<p_{\ell}<p^{\mathrm{limit}}$, the constants

$\bar{\sigma}_{\ell}=\sup\left\{\bar{\sigma}>0:p^{\mathrm{lower}}_{(a,b]}(\bar{\sigma})\geqslant p_{\ell}\right\}\qquad\text{and}\qquad\bar{\sigma}_{u}=\inf\left\{\bar{\sigma}>0:p^{\mathrm{upper}}_{(a,b]}(\bar{\sigma})\leqslant p_{u}\right\}$

exist as positive finite values. Let $\ell\leqslant\bar{\sigma}_{\ell}$ and $u\geqslant\bar{\sigma}_{u}$ be such that $u/\ell\geqslant\alpha_{\uparrow}/\alpha_{\downarrow}$. Then, for $r\in[0,1]$, $p^{*}_{r}$ defined as

(20) $p^{*}_{r}:=\inf_{\ell\leqslant\bar{\sigma}\leqslant u}\inf_{m\in\mathcal{X}_{a}^{b}}\inf_{\Sigma\in\mathcal{S}_{\kappa}}p^{\mathrm{succ}}_{r}(\bar{\sigma};m,\Sigma)$

is lower bounded by the positive number

(21) $\min_{\ell\leqslant\bar{\sigma}\leqslant u}\kappa^{-d/2}\Phi\left(\mathcal{B}\left(\frac{(2C_{u}-(1-r)C_{\ell})\kappa^{1/2}}{\bar{\sigma}}e_{1},\ \frac{(1-r)C_{\ell}\kappa^{1/2}}{\bar{\sigma}}\right);0,I\right)\enspace.$

4.3 Potential Function

Lemma 4.2 divides the domain of the normalized step-size into three disjoint subsets: $\bar{\sigma}\in(0,\ell)$ is the too-small-step-size regime, where $p^{\mathrm{succ}}_{0}(\bar{\sigma};m,\Sigma)\geqslant p_{\ell}$ for all $m\in\mathcal{X}_{a}^{b}$ and $\Sigma\in\mathcal{S}_{\kappa}$; $\bar{\sigma}\in(u,\infty)$ is the too-large-step-size regime, where $p^{\mathrm{succ}}_{0}(\bar{\sigma};m,\Sigma)\leqslant p_{u}$ for all $m\in\mathcal{X}_{a}^{b}$ and $\Sigma\in\mathcal{S}_{\kappa}$; and $\bar{\sigma}\in[\ell,u]$ is the reasonable regime, where the success probability with rate $r$ is lower bounded by Eq. 21. Since $p_{\mathrm{target}}\in[p_{u},p_{\ell}]$, the normalized step-size is expected to be maintained in the reasonable range.

Our potential function is defined as follows. In light of Lemma 4.2, we can take $\ell\leqslant\bar{\sigma}_{\ell}$ and $u\geqslant\bar{\sigma}_{u}$ such that $u/\ell\geqslant\alpha_{\uparrow}/\alpha_{\downarrow}$. With some constant $v>0$, we define our potential function as

(22) $V(\theta)=\log(f_{\mu}(m))+\max\left\{0,\ v\log\left[\frac{\alpha_{\uparrow}\ell f_{\mu}(m)}{\sigma}\right],\ v\log\left[\frac{\sigma}{\alpha_{\downarrow}uf_{\mu}(m)}\right]\right\}\enspace.$

The rationale behind the second term on the right-hand side is as follows. The second and third terms inside the max are positive only if the normalized step-size $\bar{\sigma}=\sigma/f_{\mu}(m)$ is smaller than $\ell\alpha_{\uparrow}$ or greater than $u\alpha_{\downarrow}$, respectively. The potential value is $\log f_{\mu}(m)$ if the normalized step-size is in $[\ell\alpha_{\uparrow},u\alpha_{\downarrow}]$, and it is penalized if the normalized step-size is too small or too large. We need this penalization for the following reason. If the normalized step-size is too small, the success probability is close to $1/2$ at non-critical points (assuming $f=g\circ h$ with $h$ continuously differentiable), but the progress per step is very small, because the step-size directly controls the progress, measured for instance as $\|m_{t+1}-m_{t}\|=\sigma_{t}\|\mathcal{N}(0,\Sigma_{t})\|1\{f(x_{t})\leqslant f(m_{t})\}$. If the normalized step-size is too large, the success probability is close to zero and no progress is made with high probability. If we used $\log f_{\mu}(m)$ as the potential function instead of $V(\theta)$, the progress would be arbitrarily small in such situations, which prevents the application of drift arguments. The above potential function penalizes such situations and guarantees a certain progress in the penalized quantity, since the step-size will be increased or decreased, respectively, with high probability, leading to a certain decrease of $V(\theta)$. Figure 1 illustrates that $\log(f_{\mu}(m))$ cannot work alone as a potential function while $V(\theta)$ does: when we start from a too small or too large step-size, $\log(f_{\mu}(m))$ looks constant (dotted lines in green and blue). Only when the step-size is started at $1$ do we see progress in $\log(f_{\mu}(m))$. Also, the step-size can always get arbitrarily worse, with a very small probability, which forces us to handle the case of a badly adapted step-size properly. Yet the simulation of $V(\theta)$ shows that in all three situations (small, large, and well-adapted step-size compared to the distance to the optimum), we observe a geometric decrease of $V(\theta)$.
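For reference, here is a direct transcription of the potential function (22) in Python; the constants in the usage example are those reported for Figure 1 (right).

import numpy as np

def potential(f_mu_m, sigma, ell, u, v, alpha_up, alpha_down):
    """V(theta) from (22): log f_mu(m) plus a penalty for ill-adapted step-sizes."""
    return np.log(f_mu_m) + max(
        0.0,
        v * np.log(alpha_up * ell * f_mu_m / sigma),    # sigma_bar below ell * alpha_up
        v * np.log(sigma / (alpha_down * u * f_mu_m)))  # sigma_bar above u * alpha_down

# Constants of Figure 1 (right): v = 4/d, ell = alpha_up^(-10), u = alpha_down^(-10).
d, a_up, a_dn = 10, np.exp(0.1), np.exp(-0.025)
print(potential(f_mu_m=1e-3, sigma=1.0, ell=a_up ** -10, u=a_dn ** -10,
                v=4 / d, alpha_up=a_up, alpha_down=a_dn))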

4.4 Upper Bound of the First Hitting Time

We are now ready to establish that the potential function defined in (22) satisfies the (truncated) drift condition of Theorem 3.2. This will in turn imply an upper bound on the expected hitting time of $f_{\mu}(m)$ to $[0,\epsilon]$, provided $a\leqslant\epsilon$. The proof follows the same line of argumentation as the proof of [4, Proposition 4.2], which was restricted to the case of spherical functions. It was generalized under similar assumptions as in this paper, but for a fixed covariance matrix equal to the identity, in [47, Proposition 6]. The detailed proof is given in Section B.5.

Proposition 4.3.

Consider the $(1+1)\text{-ES}_{\kappa}$ described in Algorithm 1 with state $\theta_{t}=(m_{t},\sigma_{t},\Sigma_{t})$. Assume that the minimized objective function $f$ satisfies A1 and A2 for some $0\leqslant a<b\leqslant\infty$. Let $p_{u}$ and $p_{\ell}$ be constants satisfying $0<p_{u}<p_{\mathrm{target}}<p_{\ell}<p^{\mathrm{limit}}$ and $p_{\ell}+p_{u}=2p_{\mathrm{target}}$. Then, there exist $\ell\leqslant\bar{\sigma}_{\ell}$ and $u\geqslant\bar{\sigma}_{u}$ such that $u/\ell\geqslant\alpha_{\uparrow}/\alpha_{\downarrow}$, where $\bar{\sigma}_{\ell}$ and $\bar{\sigma}_{u}$ are defined in Lemma 4.2. For any $A>0$, taking $v$ satisfying $0<v<\min\left\{1,\ \frac{A}{\log(1/\alpha_{\downarrow})},\ \frac{A}{\log(\alpha_{\uparrow})}\right\}$ and the potential function Eq. 22, we have

(23) $\mathbb{E}\left[\max\{V(\theta_{t+1})-V(\theta_{t}),-A\}1\{m_{t}\in\mathcal{X}_{a}^{b}\}\mid\mathcal{F}_{t}\right]\leqslant-B1\{m_{t}\in\mathcal{X}_{a}^{b}\}\enspace,$

where

(24) $B=\min\left\{Ap^{*}_{r}-v\log\left(\frac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\right),\ v\frac{p_{\ell}-p_{u}}{2}\log\left(\frac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\right)\right\}\enspace,$

and $p^{*}_{r}=\inf_{\bar{\sigma}\in[\ell,u]}\inf_{m\in\mathcal{X}_{a}^{b}}\inf_{\Sigma\in\mathcal{S}_{\kappa}}p^{\mathrm{succ}}_{r}(\bar{\sigma};m,\Sigma)$ with $r=1-\exp\left(-\frac{A}{1-v}\right)$. Moreover, for any $A>0$ there exists $v$ such that $0<B<A$.

We apply Theorem 3.2 along with Proposition 4.3 to derive the expected first hitting time bound. To do so, we need to confirm that the prerequisite of the theorem is satisfied, namely the integrability of the process $\{Y_{t}:t\geqslant 0\}$ defined in Eq. 11 with $X_{t}=V(\theta_{t})$.

Lemma 4.4.

Let $\{\theta_{t}:t\geqslant 0\}$ be the sequence of parameters $\theta_{t}=(m_{t},\sigma_{t},\Sigma_{t})$ generated by the $(1+1)\text{-ES}_{\kappa}$ with initial condition $\theta_{0}=(m_{0},\sigma_{0},\Sigma_{0})$, optimizing a measurable function $f$. Set $X_{t}=V(\theta_{t})$ with $V$ as defined in Eq. 22 and define the process $Y_{t}$ as in Theorem 3.2. Then, for any $A>0$, $\{Y_{t}:t\geqslant 0\}$ is integrable, i.e., $\mathbb{E}[\lvert Y_{t}\rvert]<\infty$ for each $t$.

Proof 4.5 (Proof of Lemma 4.4).

The increment $Y_{t+1}-Y_{t}=\max\big\{V(\theta_{t+1})-V(\theta_{t}),-A\big\}1\{T_{\beta}^{X}>t\}-B1\{T_{\beta}^{X}\leqslant t\}$ is by construction bounded from below by $-A$ (recall that $A\geqslant B$). It is also bounded from above by a constant: from the proof of Proposition 4.3, it is easy to find an upper bound, say $C$, of the truncated one-step change $\Delta_{t}$ appearing there, without using A1 and A2. Let $D=\max\{A,C\}$. Then, by recursion, $\lvert Y_{t}\rvert\leqslant\lvert Y_{0}\rvert+\lvert Y_{t}-Y_{0}\rvert\leqslant\lvert Y_{0}\rvert+Dt$. Hence $\mathbb{E}[\lvert Y_{t}\rvert]\leqslant\lvert Y_{0}\rvert+Dt<\infty$ for all $t$.

Finally, we derive the expected first hitting time of logfμ(mt)\log f_{\mu}(m_{t}).

Theorem 4.6.

Consider the same situation as described in Proposition 4.3. Let T_{\epsilon}=\min\{t:f_{\mu}(m_{t})\leqslant\epsilon\} be the first hitting time of f_{\mu}(m_{t}) to [0,\epsilon]. Choose \epsilon such that a\leqslant\epsilon<f_{\mu}(m_{0})\leqslant b, where a and b appear in Definition 2.6. If m_{0}\in\mathcal{X}_{a}^{b}, the expected first hitting time is upper bounded by \mathbb{E}[T_{\epsilon}]\leqslant\big{(}V(\theta_{0})-\log(\epsilon)+A\big{)}/B for A>B>0 described in Proposition 4.3, where V(\theta) is the potential function defined in Eq. 22. Equivalently, we have \mathbb{E}[T_{\epsilon}]\leqslant C_{T}+C_{R}^{-1}\log(f_{\mu}(m_{0})/\epsilon), where

CT\displaystyle C_{T} =AB+vBmax{0,log(αfμ(m0)σ0),log(σ0αufμ(m0))},\displaystyle=\frac{A}{B}+\frac{v}{B}\max\left\{0,\log\left(\frac{\alpha_{\uparrow}\ell f_{\mu}(m_{0})}{\sigma_{0}}\right),\log\left(\frac{\sigma_{0}}{\alpha_{\downarrow}uf_{\mu}(m_{0})}\right)\right\}\enspace, CR\displaystyle C_{R} =B.\displaystyle=B\enspace.

Moreover, the above result yields an upper bound of the expected first hitting time of mtx\lVert m_{t}-x^{*}\rVert to [0,2Cuϵ][0,2C_{u}\epsilon].

Proof 4.7.

Theorem 3.2, combined with Proposition 4.3 and Lemma 4.4, bounds the expected first hitting time of V(\theta_{t}) to (-\infty,\log(\epsilon)] by \big{(}V(\theta_{0})-\log(\epsilon)+A\big{)}/B. Since \log f_{\mu}(m_{t})\leqslant V(\theta_{t}), T_{\epsilon} is bounded by the first hitting time of V(\theta_{t}) to (-\infty,\log(\epsilon)], and the inequality is preserved when taking expectations. The last claim follows directly from the inequality \lVert x-x^{*}\rVert\leqslant 2C_{u}f_{\mu}(x), which holds under A1.

Theorem 4.6 provides an upper bound on the expected hitting time of the (1+1)\text{-ES}_{\kappa} with success-based step-size adaptation, which establishes linear convergence towards the global optimum x^{*} on functions satisfying A1 and A2 with a=0. Moreover, for b=\infty, this bound holds for all initial search points m_{0}. If a>0, the bound in Theorem 4.6 does not translate into linear convergence, but we still obtain an upper bound on the expected first hitting time of the target accuracy \epsilon\geqslant a. This is useful for understanding the behavior of the (1+1)\text{-ES}_{\kappa} on multimodal functions and on functions with a degenerate Hessian matrix at the optimum.
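For concreteness, the following Python sketch implements the analyzed scheme in the special case \kappa=1, i.e., with \Sigma fixed to the identity. It is an illustration, not a verbatim transcription of Algorithm 1; the step-size multipliers \alpha_{\uparrow}=\exp(4/d) and \alpha_{\downarrow}=\alpha_{\uparrow}^{-1/4}, which realize p_{\mathrm{target}}=1/5, are the ones used in the experiments of Appendix A.

```python
import numpy as np

def one_plus_one_es(f, m0, sigma0, budget=100_000, target=1e-10, seed=0):
    """(1+1)-ES with success-based step-size adaptation and Sigma = I."""
    rng = np.random.default_rng(seed)
    m = np.asarray(m0, dtype=float)
    d = len(m)
    alpha_up = np.exp(4.0 / d)           # increase factor on success
    alpha_down = alpha_up ** (-1 / 4)    # decrease factor on failure
    sigma, fm = float(sigma0), f(m)
    for t in range(budget):
        x = m + sigma * rng.standard_normal(d)  # candidate ~ N(m, sigma^2 I)
        fx = f(x)
        if fx <= fm:                     # success: accept x, enlarge sigma
            m, fm, sigma = x, fx, sigma * alpha_up
        else:                            # failure: keep m, shrink sigma
            sigma *= alpha_down
        if fm <= target:
            break
    return m, fm, t + 1
```

For example, one_plus_one_es(lambda x: float(x @ x), np.ones(10), 1.0) minimizes the sphere function in dimension 10 down to the target accuracy.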

4.5 Lower Bound of the First Hitting Time

We derive a general lower bound on the expected first hitting time of \lVert m_{t}-x^{*}\rVert to [0,\epsilon]. The following results hold for an arbitrary measurable function f and for a (1+1)\text{-ES}_{\kappa} with an arbitrary \sigma-control mechanism. The following lemma provides a lower bound on the expected one-step progress measured by the logarithm of the distance to the optimum.

Lemma 4.8.

We consider the process {θt:t0}\{\theta_{t}:t\geqslant 0\} generated by a (1+1)-ESκ\text{-ES}_{\kappa} algorithm with an arbitrary step-size adaptation mechanism and an arbitrary covariance matrix update optimizing an arbitrary measurable function ff. We assume d2d\geqslant 2 and κt=Cond(Σt)κ\kappa_{t}=\operatorname{Cond}(\Sigma_{t})\leqslant\kappa. We consider the natural filtration t\mathcal{F}_{t}. Then, the expected single-step progress is lower-bounded by

(25) 𝔼[min(log(mt+1x/mtx),0)t]κtd2/d.\mathbb{E}[\min(\log(\lVert m_{t+1}-x^{*}\rVert/\lVert m_{t}-x^{*}\rVert),0)\mid\mathcal{F}_{t}]\geqslant-\kappa_{t}^{\frac{d}{2}}/d\enspace.

Proof 4.9 (Proof of Lemma 4.8).

Note first that log(mt+1x/mtx)=log(xtx/mtx)1{f(xt)f(mt)}\log(\lVert m_{t+1}-x^{*}\rVert/\lVert m_{t}-x^{*}\rVert)=\log(\lVert x_{t}-x^{*}\rVert/\lVert m_{t}-x^{*}\rVert)1\left\{\scriptstyle{f(x_{t})\leqslant f(m_{t})}\right\}. This value can be positive since f(xt)f(mt)f(x_{t})\leqslant f(m_{t}) does not imply xtxmtx\lVert x_{t}-x^{*}\rVert\leqslant\lVert m_{t}-x^{*}\rVert in general. Clipping the positive part to zero, we obtain a lower bound, which is the RHS of the above equality times the indicator 1{xtxmtx}1\left\{\scriptstyle{\lVert x_{t}-x^{*}\rVert\leqslant\lVert m_{t}-x^{*}\rVert}\right\}. Since the quantity is non-positive, dropping the indicator 1{f(xt)f(mt)}1\left\{\scriptstyle{f(x_{t})\leqslant f(m_{t})}\right\} only decreases the lower bound. Hence, we have min(log(mt+1x/mtx),0)log(xtx/mtx)1{xtxmtx}\min(\log(\lVert m_{t+1}-x^{*}\rVert/\lVert m_{t}-x^{*}\rVert),0)\geqslant\log(\lVert x_{t}-x^{*}\rVert/\lVert m_{t}-x^{*}\rVert)1\left\{\scriptstyle{\lVert x_{t}-x^{*}\rVert\leqslant\lVert m_{t}-x^{*}\rVert}\right\}. Then,

𝔼[min(log(mt+1x)log(mtx),0)t]𝔼[log(xtx/mtx)1{xtxmtx}t].\mathbb{E}[\min(\log(\lVert m_{t+1}-x^{*}\rVert)-\log(\lVert m_{t}-x^{*}\rVert),0)\mid\mathcal{F}_{t}]\\ \geqslant\mathbb{E}[\log(\lVert x_{t}-x^{*}\rVert/\lVert m_{t}-x^{*}\rVert)1\left\{\scriptstyle{\lVert x_{t}-x^{*}\rVert\leqslant\lVert m_{t}-x^{*}\rVert}\right\}\mid\mathcal{F}_{t}]\enspace.

We rewrite the lower bound of the drift. The RHS of the above inequality is the integral of \log(\lVert x-x^{*}\rVert/\lVert m_{t}-x^{*}\rVert) over the domain \bar{\mathcal{B}}(x^{*},\lVert m_{t}-x^{*}\rVert) under the probability measure \Phi(\cdot;m_{t},\sigma_{t}^{2}\Sigma_{t}). Performing a variable change (through rotation and scaling) so that m_{t}-x^{*} becomes e_{1}=(1,0,\cdots,0) and letting \tilde{\sigma}_{t}=\sigma_{t}/\lVert m_{t}-x^{*}\rVert, we can further rewrite it as the integral of \log(\lVert x\rVert) over \bar{\mathcal{B}}(0,1) under \Phi(\cdot;e_{1},\tilde{\sigma}_{t}^{2}\Sigma_{t}). With \kappa_{t}=\operatorname{Cond}(\Sigma_{t}), we have \varphi(\cdot;e_{1},\tilde{\sigma}_{t}^{2}\Sigma_{t})\leqslant\kappa_{t}^{d/2}\varphi(\cdot;e_{1},\kappa_{t}\tilde{\sigma}_{t}^{2}\mathrm{I}), see Lemma B.1. Since \log(\lVert x\rVert)\leqslant 0 on \bar{\mathcal{B}}(0,1), we obtain the lower bound \mathbb{E}[\log(\lVert x_{t}-x^{*}\rVert/\lVert m_{t}-x^{*}\rVert)1\{\lVert x_{t}-x^{*}\rVert\leqslant\lVert m_{t}-x^{*}\rVert\}\mid\mathcal{F}_{t}]\geqslant\kappa_{t}^{d/2}\int_{\bar{\mathcal{B}}(0,1)}\log(\lVert x\rVert)\varphi(x;e_{1},\kappa_{t}\tilde{\sigma}_{t}^{2}\mathrm{I})\mathrm{d}x. The RHS equals -\kappa_{t}^{d/2} times the expected single-step progress of the (1+1)-ES on the spherical function at m_{t}=e_{1} and \sigma=\sqrt{\kappa_{t}}\tilde{\sigma}_{t}, which is shown in the proof of [4, Lemma 4.4] to be at most 1/d for d\geqslant 2. This completes the proof.

The following theorem proves that the expected first hitting time of the (1+1)\text{-ES}_{\kappa} is \Omega(\log(\lVert m_{0}-x^{*}\rVert/\epsilon)) for any measurable function f, implying that it cannot converge faster than linearly. In the case \kappa=1 the lower runtime bound becomes \Omega(d\log(\lVert m_{0}-x^{*}\rVert/\epsilon)), meaning that the runtime scales at least linearly with respect to d. The proof is a direct application of Lemma 4.8 to Theorem 3.4.

Theorem 4.10.

We consider the process \{\theta_{t}:t\geqslant 0\} generated by a (1+1)\text{-ES}_{\kappa} described in Algorithm 1 and assume that f is a measurable function with d\geqslant 2. Let T_{\epsilon}=\inf\{t:\lVert m_{t}-x^{*}\rVert\leqslant\epsilon\} be the first hitting time of [0,\epsilon] by \lVert m_{t}-x^{*}\rVert. Then, the expected first hitting time is lower bounded by \mathbb{E}[T_{\epsilon}]\geqslant-(1/2)+\frac{d}{4\kappa^{d/2}}\log(\lVert m_{0}-x^{*}\rVert/\epsilon). The bound holds for arbitrary step-size adaptation mechanisms. If A1 holds, it also gives a lower bound for the expected first hitting time of f_{\mu}(m_{t}) to [0,2C_{\ell}\epsilon].

Proof 4.11 (Proof of Theorem 4.10).

Let X_{t}=\log\lVert m_{t}-x^{*}\rVert for t\geqslant 0. Define Y_{t} iteratively as Y_{0}=X_{0} and Y_{t+1}=Y_{t}+\min(X_{t+1}-X_{t},0). Then, it is easy to see that Y_{t}\leqslant X_{t} and Y_{t+1}\leqslant Y_{t} for all t\geqslant 0. Note that \mathbb{E}[Y_{t+1}-Y_{t}\mid\mathcal{F}_{t}]=\mathbb{E}[\min(X_{t+1}-X_{t},0)\mid\mathcal{F}_{t}]=\mathbb{E}[\min(\log(\lVert m_{t+1}-x^{*}\rVert/\lVert m_{t}-x^{*}\rVert),0)\mid\mathcal{F}_{t}], where the right-most side is lower bounded in light of Lemma 4.8. Then, applying Theorem 3.4, we obtain the lower bound. The last statement directly follows from \lVert x-x^{*}\rVert\leqslant 2C_{\ell}f_{\mu}(x) under A1.

4.6 Almost Sure Linear Convergence

In addition to the expected first hitting time bound, we can deduce from Proposition 4.3 almost sure linear convergence, as stated in the following proposition.

Proposition 4.12.

Consider the same situation as described in Proposition 4.3, where a=0 and 0<b\leqslant\infty. Then, for any m_{0}\in\mathcal{X}_{0}^{b}, \sigma_{0}>0 and \Sigma_{0}\in\mathcal{S}_{\kappa}, we have

(26) Pr[lim supt1tlogfμ(mt)B]=Pr[lim supt1tlogmtxB]=1,\Pr\left[\limsup_{t\to\infty}\frac{1}{t}\log f_{\mu}(m_{t})\leqslant-B\right]=\Pr\left[\limsup_{t\to\infty}\frac{1}{t}\log\|m_{t}-x^{*}\|\leqslant-B\right]=1\enspace,

where B>0B>0 is as defined in Proposition 4.3. Hence almost sure linear convergence holds at a rate exp(C)\exp(-C) such that exp(C)exp(B)\exp(-C)\leqslant\exp(-B).

Proof 4.13 (Proof of Proposition 4.12).

Let V be defined in (22). Let Y_{0}=V(\theta_{0}) and Y_{t+1}=Y_{t}+\max(-A,V(\theta_{t+1})-V(\theta_{t})). Define Z_{t}=Y_{t}-\mathbb{E}_{t-1}[Y_{t}] for t\geqslant 1. Then, \{Z_{t}\} is a martingale difference sequence with respect to the filtration \{\mathcal{F}_{t}\} generated by \{\theta_{t}\}. Since the increments of Y_{t} dominate those of V(\theta_{t}), we have \frac{1}{t}\log f_{\mu}(m_{t})\leqslant\frac{1}{t}V(\theta_{t})\leqslant\frac{1}{t}Y_{t}, and from Proposition 4.3 we obtain

Yt=𝔼t1[Yt]+Zt=Yt1+𝔼t1[YtYt1]+ZtYt1B+Zt.\displaystyle Y_{t}=\mathbb{E}_{t-1}[Y_{t}]+Z_{t}=Y_{t-1}+\mathbb{E}_{t-1}[Y_{t}-Y_{t-1}]+Z_{t}\leqslant Y_{t-1}-B+Z_{t}\enspace.

By repeatedly applying the above inequality and dividing by t, we obtain \frac{1}{t}Y_{t}\leqslant-B+\frac{1}{t}Y_{0}+\frac{1}{t}\sum_{i=1}^{t}Z_{i}, where \lim_{t\to\infty}\frac{1}{t}Y_{0}=0 and \sum_{i=1}^{t}Z_{i} is a martingale. In light of the strong law of large numbers for martingales [15], if \sum_{t=1}^{\infty}\mathbb{E}[Z_{t}^{2}]/t^{2}<\infty, we have \lim_{t\to\infty}\frac{1}{t}\sum_{i=1}^{t}Z_{i}=0 almost surely. By the definition of V(\theta_{t}) and the working mechanism of the (1+1)\text{-ES}_{\kappa}, we have V(\theta_{i})-V(\theta_{i-1})\leqslant v\log(\alpha_{\uparrow}/\alpha_{\downarrow}). Since the conditional variance is bounded by the conditional second moment, \mathbb{E}[Z_{i}^{2}]=\mathbb{E}[(Y_{i}-\mathbb{E}_{i-1}[Y_{i}])^{2}]\leqslant\mathbb{E}[\max(-A,V(\theta_{i})-V(\theta_{i-1}))^{2}]\leqslant\max(A,v\log(\alpha_{\uparrow}/\alpha_{\downarrow}))^{2}, and the summability condition holds because \sum_{t=1}^{\infty}1/t^{2}<\infty. Hence, we have \limsup_{t\to\infty}\frac{1}{t}\log f_{\mu}(m_{t})\leqslant-B+\lim_{t\to\infty}\frac{1}{t}Y_{0}+\lim_{t\to\infty}\frac{1}{t}\sum_{i=1}^{t}Z_{i}=-B almost surely. Along with \lVert x-x^{*}\rVert\leqslant 2C_{u}f_{\mu}(x), we obtain Equation 26.

4.7 Wrap-up of the Results: Global Linear Convergence

As a corollary of the lower bound from Theorem 4.10, the upper bound from Theorem 4.6, Proposition 4.12 stating almost sure linear convergence, and the fact that the different assumptions discussed in Section 2.3 imply A1 and A2, we summarize our linear convergence results in the following theorem.

Theorem 4.14 (Global Linear Convergence).

We consider the (1+1)-ESκ\text{-ES}_{\kappa} optimizing an objective function ff. Suppose either

  • (a)

    ff satisfies A1 and A2 for a=0a=0, plimit>ptargetp^{\mathrm{limit}}>p^{\mathrm{target}}, and m0𝒳0bm_{0}\in\mathcal{X}_{0}^{b}; or

  • (b)

    ff satisfies either A3 or A4, ptarget<1/2p^{\mathrm{target}}<1/2, and m0dm_{0}\in\mathbb{R}^{d}.

Then, for any \sigma_{0}>0 and \Sigma_{0}\in\mathcal{S}_{\kappa}, the expected hitting time \mathbb{E}[T_{\epsilon}] of \lVert m_{t}-x^{*}\rVert to [0,\epsilon] is \Theta\big{(}\log(\lVert m_{0}-x^{*}\rVert/\epsilon)\big{)} for all \epsilon>0. Moreover, both f_{\mu}(m_{t}) and \lVert m_{t}-x^{*}\rVert converge linearly almost surely, i.e.

Pr[lim supt1tlogfμ(mt)B]=Pr[lim supt1tlogmtxB]=1,\Pr\left[\limsup_{t\to\infty}\frac{1}{t}\log f_{\mu}(m_{t})\leqslant-B\right]=\Pr\left[\limsup_{t\to\infty}\frac{1}{t}\log\lVert m_{t}-x^{*}\rVert\leqslant-B\right]=1\enspace,

where B>0B>0 is as defined in Proposition 4.3. The convergence rate exp(C)\exp(-C) is thus upper-bounded by exp(B)\exp(-B).

4.8 Tightness in the Sphere Function Case

Now we consider a specific convex quadratic function, namely the sphere function f(x)=\frac{1}{2}\lVert x\rVert^{2}, for which the spatial suboptimality function equals f_{\mu}(x)=V_{d}\lVert x\rVert. In Theorem 4.14 we have established that the expected hitting time of a ball of radius \epsilon for the (1+1)\text{-ES}_{\kappa} is \Theta(\log(\lVert m_{0}-x^{*}\rVert/\epsilon)). Yet, this statement does not give information on how the constants hidden in the \Theta-notation scale with the dimension. In particular, the convergence rate of the algorithm is upper-bounded by \exp(-B), where B is given in (24), see Theorem 4.6. In this section, we estimate precisely the scaling of B in Proposition 4.3 with respect to the dimension and compare it with the general lower bound of the expected first hitting time given in Theorem 4.10. We then conclude that the bound is tight with respect to the scaling with d in the case of the sphere function.

Let us assume κ=1\kappa=1, that is, we consider the (1+1)-ES without covariance matrix adaptation (Σ=I\Sigma=I). Then, p(a,b]lower(σ¯)=p(a,b]upper(σ¯)=prsucc(σ¯;m,Σ)p^{\mathrm{lower}}_{(a,b]}(\bar{\sigma})=p^{\mathrm{upper}}_{(a,b]}(\bar{\sigma})=p_{r}^{\mathrm{succ}}(\bar{\sigma};m,\Sigma), where the right-most side is independent of mm and Σ\Sigma as described in Lemma 2.4. This means that the success probability is solely controlled by the normalized step-size σ¯\bar{\sigma}.

The following proposition states that the convergence speed is \Omega(1/d); hence the expected first hitting time scales as O(d\log(f_{\mu}(m_{0})/\epsilon)). The proof is provided in Section B.6.

Proposition 4.15.

For A=1/dA=1/d, ptargetΘ(1)p_{\mathrm{target}}\in\Theta(1) and log(α/α)ω(1/d)\log(\alpha_{\uparrow}/\alpha_{\downarrow})\in\omega(1/d), we have BΩ(1/d)B\in\Omega(1/d).

The two conditions on the choice of \alpha_{\uparrow} and \alpha_{\downarrow}, namely p_{\mathrm{target}}=\log(1/\alpha_{\downarrow})/\log(\alpha_{\uparrow}/\alpha_{\downarrow})\in\Theta(1) and \log(\alpha_{\uparrow}/\alpha_{\downarrow})\in\omega(1/d), are understood as follows. The first condition requires the target success probability p_{\mathrm{target}} to be independent of d. In the 1/5 success rule, \alpha_{\uparrow} and \alpha_{\downarrow} are set so that p_{\mathrm{target}}=1/5 independently of d. The second condition requires the step-size increase and decrease factors to satisfy \log(\alpha_{\uparrow})\in\omega(1/d) and \log(1/\alpha_{\downarrow})\in\omega(1/d). Note that on the sphere function the normalized step-size \bar{\sigma}\propto\sigma/\lVert m-x^{*}\rVert is kept around a constant value during the search, which implies that the convergence speeds of \lVert m-x^{*}\rVert and \sigma must agree. Therefore, the speed of the step-size adaptation must not be too small if the \Theta(d) scaling of the expected first hitting time is to be achieved.
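For illustration (our example; the proposition does not prescribe a particular choice), take \alpha_{\uparrow}=\exp(4/\sqrt{d}) and \alpha_{\downarrow}=\exp(-1/\sqrt{d}). Then p_{\mathrm{target}}=\log(1/\alpha_{\downarrow})/\log(\alpha_{\uparrow}/\alpha_{\downarrow})=(1/\sqrt{d})/(5/\sqrt{d})=1/5\in\Theta(1) and \log(\alpha_{\uparrow}/\alpha_{\downarrow})=5/\sqrt{d}\in\omega(1/d), so both conditions of Proposition 4.15 are satisfied.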

Proposition 4.15 and Theorem 4.6 imply \mathbb{E}[T_{\epsilon}]\in O(d\log(\lVert m_{0}\rVert/\epsilon)), and Theorem 4.10 implies \mathbb{E}[T_{\epsilon}]\in\Omega(d\log(\lVert m_{0}\rVert/\epsilon)). Together they yield \mathbb{E}[T_{\epsilon}]\in\Theta(d\log(\lVert m_{0}\rVert/\epsilon)). This result shows i) that the runtime of the (1+1)-ES on the sphere function is proportional to d as long as \log(\alpha_{\uparrow}/\alpha_{\downarrow})\in\omega(1/d), and ii) that our methodology can deliver a tight runtime bound in some cases. The result is formally stated as follows.

Theorem 4.16.

The (1+1)-ES (Algorithm 1) with κ=1\kappa=1 and ptarget<1/2p^{\mathrm{target}}<1/2 converges globally and linearly in terms of logmtx\log\lVert m_{t}-x^{*}\rVert from any starting point m0dm_{0}\in\mathbb{R}^{d}, σ0>0\sigma_{0}>0, and Σ0=I\Sigma_{0}=I on any function f(x)=g(xx)f(x)=g(\lVert x-x^{*}\rVert), where gg is a strictly increasing function. Moreover, if ptargetΘ(1)p^{\mathrm{target}}\in\Theta(1) and log(α/α)ω(1/d)\log(\alpha_{\uparrow}/\alpha_{\downarrow})\in\omega(1/d), the expected first hitting time TϵT_{\epsilon} of logmtx\log\lVert m_{t}-x^{*}\rVert to (,log(ϵ)](-\infty,\log(\epsilon)] is Θ(dlog(m0/ϵ))\Theta(d\log(\lVert m_{0}\rVert/\epsilon)) and the almost sure convergence rate is upper-bounded by exp(Θ(1/d))\exp(-\Theta(1/d)).

Since the lower bound holds for an arbitrary σ\sigma-adaptation mechanism, the above result not only implies that our upper bound is tight, but it also implies that the success-based σ\sigma-control mechanism achieves the best possible convergence rate except for a constant factor on the spherical function.
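The \Theta(d\log(\lVert m_{0}\rVert/\epsilon)) scaling is easy to probe empirically. The following self-contained Python sketch (an illustration under the parameter choice of Appendix A, not part of the formal argument) measures the hitting time of the (1+1)-ES with \kappa=1 on the sphere function and prints the ratio T_{\epsilon}/(d\log(\lVert m_{0}\rVert/\epsilon)), which should stabilize around a constant as d grows:

```python
import numpy as np

def hitting_time(d, eps=1e-8, seed=0):
    """Iterations until ||m|| <= eps on the sphere function, starting
    from ||m0|| = 1, for the (1+1)-ES with the one-fifth success rule."""
    rng = np.random.default_rng(seed)
    alpha_up, alpha_down = np.exp(4.0 / d), np.exp(-1.0 / d)
    m, sigma, t = np.ones(d) / np.sqrt(d), 1.0, 0
    while m @ m > eps ** 2:
        x = m + sigma * rng.standard_normal(d)
        if x @ x <= m @ m:               # success on f(x) = ||x||^2
            m, sigma = x, sigma * alpha_up
        else:
            sigma *= alpha_down
        t += 1
    return t

for d in (5, 10, 20, 40, 80):
    print(d, hitting_time(d) / (d * np.log(1.0 / 1e-8)))  # roughly constant
```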

5 Discussion

We have established the almost sure global linear convergence of the (1+1)\text{-ES}_{\kappa}, together with a bound on the expected hitting time of an \epsilon-neighborhood of the solution. Assumption A1 has been the key to obtaining the expected first hitting time bound of the (1+1)\text{-ES}_{\kappa} in the form of (9). The convergence results hold on a wide class of functions, which includes

  • (i)

    strongly convex functions with Lipschitz continuous gradient, on which linear convergence of numerical optimization algorithms is usually analyzed,

  • (ii)

    continuously differentiable positively homogeneous functions, on which previous linear convergence results have been established, and

  • (iii)

    functions with non-smooth level sets as illustrated in Figure 2.

Because the analyzed algorithms are invariant to strictly monotonic transformations of the objective function, all results that hold on f also hold on g\circ f, where g:{\rm Im}(f)\to\mathbb{R} is a strictly increasing transformation, which can thus introduce discontinuities into the objective function. In contrast to the previous result establishing the convergence of CMA-ES [18] by adding a step that enforces a sufficient decrease (which works well for direct search methods but is unnatural for ESs), we did not need to modify the adaptation mechanism of the (1+1)-ES to achieve our convergence proofs. We believe that this is crucial, since it allows our analysis to reflect the main mechanism that makes the algorithm work well in practice.

Theorem 4.16 proves that we can derive a tight convergence rate with Proposition 4.3 on the sphere function in the case where κ=1\kappa=1, i.e., without covariance matrix adaptation. This partially supports the utility of our methodology. However, its derivation relies on the fact that both the level sets of the objective function and the equal-density curves of the sampling distribution are isotropic, and hence does not generalize immediately. Moreover, the lower bound (Theorem 4.10) seems to be loose even for κ=1\kappa=1 on convex quadratic functions, where we empirically observe that the logarithmic convergence rate scales like Θ(1/Cond(f))\Theta(1/\operatorname{Cond}(\nabla\nabla f)), see Figure 1, while its dependency on the dimension is tight.

A better lower bound on the expected first hitting time and a handy way to estimate the convergence rate are relevant directions for future work. Further directions are as follows:

Proving linear convergence of the (1+1)\text{-ES}_{\kappa} does not reveal the benefits of the (1+1)\text{-ES}_{\kappa} over the (1+1)-ES without covariance matrix adaptation. The motivation for introducing the covariance matrix is to improve the convergence rate and to broaden the class of functions on which linear convergence is exhibited. Neither is achieved in this paper.

On convex quadratic functions, we empirically observe that the covariance matrix approaches a stable distribution that is closely concentrated around the inverse Hessian up to a scalar factor, and the convergence speed on all convex quadratic functions is equal to that on the sphere function (see Figure 1). This behavior is not described by our result.

Covariance matrix adaptation is also important for optimizing functions with non-smooth level sets. On continuously differentiable functions, we can always set \alpha_{\uparrow} and \alpha_{\downarrow} so that p=\frac{\log(1/\alpha_{\downarrow})}{\log(\alpha_{\uparrow}/\alpha_{\downarrow})}<p^{\mathrm{limit}}=1/2. This is the rationale behind the 1/5 success rule, where p=1/5. Indeed, p=1/5 is known to approximate the optimal situation on the sphere function, where the expected one-step progress is maximized [51]. Therefore, one does not need to tune these parameters in a problem-specific manner. However, if the objective is not continuously differentiable and the level sets are non-smooth, then p^{\mathrm{limit}} is in general smaller than 1/2. For example, it can be as low as p^{\mathrm{limit}}=1/2^{d} on f(x)=\lVert x\rVert_{\infty}=\max_{i=1,\dots,d}\lvert x_{i}\rvert. Without an appropriate adaptation of the covariance matrix the success probability will be smaller than p=1/5, and one must tune \alpha_{\uparrow} and \alpha_{\downarrow} in order to converge to the optimum, which requires information about p^{\mathrm{limit}}. By adapting the covariance matrix appropriately, the success probability can be increased arbitrarily close to 1/2 (by elongating steps in the direction of the success domain), and \alpha_{\uparrow} and \alpha_{\downarrow} do not require tuning.
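The 1/2^{d} figure can be checked numerically: near a corner of the cube-shaped level set of f(x)=\lVert x\rVert_{\infty}, a small isotropic step succeeds only if all d coordinates move inward, which happens with probability about (1/2)^{d}. A small Monte Carlo sketch (illustrative; m=(1,\dots,1) is a corner of the level set \{f\leqslant 1\}):

```python
import numpy as np

def success_prob(d, sigma=1e-3, n=200_000, seed=1):
    """Estimate Pr[f(m + sigma*z) < f(m)] for f(x) = max_i |x_i| with
    m = (1,...,1) and z standard normal; sigma is small on purpose."""
    rng = np.random.default_rng(seed)
    y = 1.0 + sigma * rng.standard_normal((n, d))  # coordinates of m + sigma*z
    return float(np.mean(np.max(np.abs(y), axis=1) < 1.0))

for d in (2, 4, 8):
    print(d, success_prob(d), 0.5 ** d)  # estimate vs. the predicted 1/2^d
```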

To achieve a reasonable convergence rate bound and broaden the class of functions on which linear convergence is exhibited, one needs to find another potential function VV that may penalize a high condition number Cond(f(mt)Σt)\operatorname{Cond}(\nabla\nabla f(m_{t})\Sigma_{t}) and replace the definitions of pupperp^{\mathrm{upper}} and plowerp^{\mathrm{lower}} accordingly. This point is left for future work.

Acknowledgement

We gratefully acknowledge support by Dagstuhl seminar 17191 “Theory of Randomized Search Heuristics”. We would like to thank Per Kristian Lehre, Carsten Witt, and Johannes Lengler for valuable discussions and advice on drift theory. Y. A. is supported by JSPS KAKENHI Grant Number 19H04179.

References

  • [2] M.A. Abramson, C. Audet, J.E. Dennis Jr, and S. Le Digabel, OrthoMADS: A deterministic MADS instance with orthogonal directions, SIAM Journal on Optimization, 20 (2009), pp. 948–966.
  • [3] Y. Akimoto, Analysis of a natural gradient algorithm on monotonic convex-quadratic-composite functions, in GECCO, 2012, pp. 1293–1300.
  • [4] Y. Akimoto, A. Auger, and T. Glasmachers, Drift theory in continuous search spaces: expected hitting time of the (1+ 1)-ES with 1/5 success rule, in GECCO, 2018, pp. 801–808.
  • [5] S. Alvernaz and J. Togelius, Autoencoder-augmented neuroevolution for visual doom playing, in IEEE CIG, 2017, pp. 1–8.
  • [6] D. V. Arnold and N. Hansen, Active covariance matrix adaptation for the (1+1)-CMA-ES, in GECCO, 2010, pp. 385–392.
  • [7] C. Audet and J.E. Dennis Jr, Mesh adaptive direct search algorithms for constrained optimization, SIAM Journal on Optimization, 17 (2006), pp. 188–217.
  • [8] A. Auger and N. Hansen, Linear convergence on positively homogeneous functions of a comparison based step-size adaptive randomized search: the (1+1)-ES with generalized one-fifth success rule, 2013, https://arxiv.org/abs/1310.8397.
  • [9] A. Auger and N. Hansen, Linear convergence of comparison-based step-size adaptive randomized search via stability of Markov chains, SIAM Journal on Optimization, 26 (2016), pp. 1589–1624.
  • [10] A. S. Bandeira, K. Scheinberg, and L. N. Vicente, Convergence of trust-region methods based on probabilistic models, SIAM Journal on Optimization, 24 (2014), pp. 1238–1264.
  • [11] B. Baritompa and M. Steel, Bounds on absorption times of directionally biased random sequences, Random Structures & Algorithms, 9 (1996), pp. 279–293.
  • [12] P. Bontrager, A. Roy, J. Togelius, N. Memon, and A. Ross, DeepMasterPrints: Generating MasterPrints for Dictionary Attacks via Latent Variable Evolution, in IEEE BTAS, 2018, pp. 1–9.
  • [13] S. Bubeck, Convex optimization: Algorithms and complexity, 2014, https://arxiv.org/abs/1405.4980.
  • [14] C. Cartis and K. Scheinberg, Global convergence rate analysis of unconstrained optimization methods based on probabilistic models, Mathematical Programming, 169 (2018), pp. 337–375.
  • [15] Y. S. Chow, On a strong law of large numbers for martingales, Ann. Math. Statist., 38 (1967), p. 610.
  • [16] A. R. Conn, K. Scheinberg, and L. N. Vicente, Introduction to Derivative-Free Optimization, SIAM, 2009.
  • [17] L. Devroye, The compound random search, in International Symposium on Systems Engineering and Analysis, 1972, pp. 195–110.
  • [18] Y. Diouane, S. Gratton, and L. N. Vicente, Globally convergent evolution strategies, Mathematical Programming, 152 (2015), pp. 467–490.
  • [19] B. Doerr and L. A. Goldberg, Adaptive drift analysis, Algorithmica, 65 (2013), pp. 224–250.
  • [20] B. Doerr, D. Johannsen, and C. Winzen, Multiplicative drift analysis, Algorithmica, 64 (2012), pp. 673–697.
  • [21] Y. Dong, H. Su, B. Wu, Z. Li, W. Liu, T. Zhang, and J. Zhu, Efficient decision-based black-box adversarial attacks on face recognition, in CVPR, 2019.
  • [22] G. Fujii, M. Takahashi, and Y. Akimoto, CMA-ES-based structural topology optimization using a level set boundary expression—application to optical and carpet cloaks, Computer Methods in Applied Mechanics and Engineering, 332 (2018), pp. 624 – 643.
  • [23] T. Geijtenbeek, M. Van De Panne, and A. F. Van Der Stappen, Flexible muscle-based locomotion for bipedal creatures, ACM Transactions on Graphics (TOG), 32 (2013), pp. 1–11.
  • [24] T. Glasmachers, Global convergence of the (1+1) Evolution Strategy to a critical point, Evolutionary Computation, 28 (2020), pp. 27–53.
  • [25] D. Golovin, J. Karro, G. Kochanski, C. Lee, X. Song, and Q. Zhang, Gradientless descent: High-dimensional zeroth-order optimization, in ICLR, 2020.
  • [26] S. Gratton, C. W. Royer, L. N. Vicente, and Z. Zhang, Direct search based on probabilistic descent, SIAM Journal on Optimization, 25 (2015), pp. 1515–1541.
  • [27] S. Gratton, C. W. Royer, L. N. Vicente, and Z. Zhang, Complexity and global rates of trust-region methods based on probabilistic models, IMA Journal of Numerical Analysis, 38 (2017), pp. 1579–1597.
  • [28] D. Ha and J. Schmidhuber, Recurrent world models facilitate policy evolution, in NeurIPS, 2018, pp. 2450–2462.
  • [29] B. Hajek, Hitting-time and occupation-time bounds implied by drift analysis with applications, Advances in Applied probability, 14 (1982), pp. 502–525.
  • [30] N. Hansen, A. Auger, R. Ros, S. Finck, and P. Pošík, Comparing results of 31 algorithms from the black-box optimization benchmarking bbob-2009, in GECCO, 2010, pp. 1689–1696.
  • [31] N. Hansen and A. Ostermeier, Completely derandomized self-adaptation in evolution strategies, Evolutionary Computation, 9 (2001), pp. 159–195.
  • [32] J. He and X. Yao, Drift analysis and average time complexity of evolutionary algorithms, Artificial intelligence, 127 (2001), pp. 57–85.
  • [33] J. He and X. Yao, A study of drift analysis for estimating computation time of evolutionary algorithms, Natural Computing, 3 (2004), pp. 21–35.
  • [34] J. Jägersküpper, Analysis of a simple evolutionary algorithm for minimization in Euclidean spaces, Automata, Languages and Programming, 2003, pp. 188–188.
  • [35] J. Jägersküpper, Rigorous runtime analysis of the (1+1)-ES: 1/5-rule and ellipsoidal fitness landscapes, in FOGA, 2005, pp. 260–281.
  • [36] J. Jägersküpper, How the (1+1)-ES using isotropic mutations minimizes positive definite quadratic forms, Theoretical Computer Science, 361 (2006), pp. 38–56.
  • [37] J. Jägersküpper, Algorithmic analysis of a basic evolutionary algorithm for continuous optimization, Theoretical Computer Science, 379 (2007), pp. 329–347.
  • [38] S. Kern, S. D. Müller, N. Hansen, D. Büche, J. Ocenasek, and P. Koumoutsakos, Learning probability distributions in continuous evolutionary algorithms–a comparative review, Natural Computing, 3 (2004), pp. 77–112.
  • [39] J. Konečnỳ and P. Richtárik, Simple complexity analysis of simplified direct search, 2014, https://arxiv.org/abs/1410.0390.
  • [40] I. Kriest, V. Sauerland, S. Khatiwala, A. Srivastav, and A. Oschlies, Calibrating a global three-dimensional biogeochemical ocean model (mops-1.0), Geoscientific Model Development, 10 (2017), p. 127.
  • [41] J. Larson, M. Menickelly, and S. M. Wild, Derivative-free optimization methods, Acta Numerica, 28 (2019), pp. 287–404.
  • [42] P. K. Lehre and C. Witt, General drift analysis with tail bounds, 2013, https://arxiv.org/abs/1307.2559.
  • [43] J. Lengler, Drift analysis, in Theory of Evolutionary Computation, Springer, 2020, pp. 89–131.
  • [44] J. Lengler and A. Steger, Drift analysis and evolutionary algorithms revisited, 2016, https://arxiv.org/abs/1608.03226.
  • [45] P. MacAlpine, S. Barrett, D. Urieli, V. Vu, and P. Stone, Design and optimization of an omnidirectional humanoid walk: A winning approach at the RoboCup 2011 3D simulation competition, in AAAI, 2012.
  • [46] B. Mitavskiy, J. Rowe, and C. Cannings, Theoretical analysis of local search strategies to optimize network communication subject to preserving the total number of links, International Journal of Intelligent Computing and Cybernetics, 2 (2009), pp. 243–284.
  • [47] D. Morinaga and Y. Akimoto, Generalized drift analysis in continuous domain: linear convergence of (1+ 1)-ES on strongly convex functions with lipschitz continuous gradients, in FOGA, 2019, pp. 13–24.
  • [48] A. Nemirovski, Information-based complexity of convex programming, Lecture Notes, (1995).
  • [49] Y. Nesterov, Lectures on convex optimization, vol. 137, Springer, 2018.
  • [50] C. Paquette and K. Scheinberg, A stochastic line search method with convergence rate analysis, 2018, https://arxiv.org/abs/1807.07994.
  • [51] I. Rechenberg, Evolutionsstrategie: Optimierung technisher Systeme nach Prinzipien der biologischen Evolution, Frommann-Holzboog, 1973.
  • [52] I. Rechenberg, Evolutionsstrategie ’94, Frommann-Holzboog, 1994.
  • [53] L. M. Rios and N. V. Sahinidis, Derivative-free optimization: a review of algorithms and comparison of software implementations, Journal of Global Optimization, 56 (2013), pp. 1247–1293.
  • [54] M. Schumer and K. Steiglitz, Adaptive step size random search, Automatic Control, IEEE Transactions on, 13 (1968), pp. 270–276.
  • [55] S. U. Stich, C. L. Muller, and B. Gartner, Optimization of convex functions with random pursuit, SIAM Journal on Optimization, 23 (2013), pp. 1284–1309.
  • [56] S. U. Stich, C. L. Müller, and B. Gärtner, Variable metric random pursuit, Mathematical Programming, 156 (2016), pp. 549–579.
  • [57] J. Uhlendorf, A. Miermont, T. Delaveau, G. Charvin, F. Fages, S. Bottani, G. Batt, and P. Hersen, Long-term model predictive control of gene expression at the population and single-cell levels, Proceedings of the National Academy of Sciences, 109 (2012), pp. 14271–14276.
  • [58] V. Volz, J. Schrum, J. Liu, S. M. Lucas, A. Smith, and S. Risi, Evolving Mario levels in the latent space of a deep convolutional generative adversarial network, in GECCO, 2018, pp. 221–228.

Appendix A Some Numerical Results

We present experiments with five algorithms on two convex quadratic functions. We compare the (1+1)-ES, the (1+1)-CMA-ES, simplified direct search [39], random pursuit [55], and gradientless descent [25].

All algorithms were started at the initial search point x_{0}=\frac{1}{\sqrt{d}}(1,\dots,1)\in\mathbb{R}^{d}. We implemented the algorithms as follows, with their parameters tuned where necessary. The ES always uses the setting \alpha_{\uparrow}=\exp(4/d) and \alpha_{\downarrow}=\alpha_{\uparrow}^{-1/4} for step-size adaptation. We set the constant c in the sufficient decrease condition of simplified direct search to \frac{1}{10}, and we employed the standard basis vectors as well as their negatives as candidate directions; in each iteration we looped over the set of directions in random order, which greatly boosted performance over a fixed order. Random pursuit was implemented with a golden section line search in the range [-2\sigma,2\sigma] with a rather loose target precision of \sigma/2, where \sigma is either the initial step size or the length of the previous step. For gradientless descent we used the initial step size as the maximal step size and defined a target precision of 10^{-10}; this target is reached by the ES in all cases. The experiments are designed to demonstrate several different effects (a sketch of the ellipsoid function used in (d) is given after this list):

  • (a) We perform all experiments in d=10 and d=50 dimensions to investigate dimension-dependent effects.

  • (b) We investigate best-case performance by running the algorithms on the sphere function \lVert x\rVert^{2}, i.e., on the separable convex quadratic function with minimal condition number. The initial step size is set to \sigma_{0}=1. All algorithms have a budget of 100d function evaluations.

  • (c) We investigate the dependency of the performance on initial parameter settings by repeating the same experiment as above, but with an initial step size of \sigma_{0}=\frac{1}{1000}. All algorithms have a budget of 700d function evaluations.

  • (d) We investigate the dependence on problem difficulty by running the algorithms on an ellipsoid problem with a moderate condition number of \kappa_{f}=100. The eigenvalues of the Hessian are evenly distributed on a log-scale. We use \sigma_{0}=1 as in the first experiment. All algorithms have a budget of 500d function evaluations.
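As referenced in item (d) above, the test function can be realized as follows; this is our illustrative sketch, since the text only fixes the condition number and the log-uniform spread of the Hessian eigenvalues (the exact function of Figure 1 is defined there):

```python
import numpy as np

def ellipsoid(x, cond=100.0):
    """Convex quadratic f(x) = sum_i w_i * x_i^2 whose Hessian eigenvalues
    are evenly spread on a log scale between 1 and `cond` (d >= 2)."""
    x = np.asarray(x, dtype=float)
    d = len(x)
    w = cond ** (np.arange(d) / (d - 1))  # w_0 = 1, ..., w_{d-1} = cond
    return float(np.sum(w * x ** 2))
```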

Figure 3: Comparison of the (1+1)-ES with and without covariance matrix adaptation with three well-analyzed derivative-free optimization algorithms on two convex quadratic functions. The left column of plots shows the performance on the sphere function \lVert x\rVert^{2} in dimensions 10 (top) and 50 (bottom). The middle column shows the same problem, but the initial step size is smaller by a factor of 1000 (and the horizontal axis differs), simulating that the distance to the optimum was under-estimated. The right column shows the performance on the ellipsoid function (defined in Figure 1). The plots show the evolution of the best-so-far function value (on a logarithmic scale), with five individual runs (thin curves) as well as median performance (bold curves).

The experimental results are presented in Figure 3.

Interpretation. We observe only moderate dimension-dependent effects, besides the expected linear increase of the runtime. We see robust performance of the ES, in particular with covariance matrix adaptation. The second experiment demonstrates the practical importance of the ability to grow the step size: the ES is essentially unaffected by wrong initial parameter settings, while gradientless descent and simplified direct search are strongly affected (which can be understood directly from the algorithms themselves). This property does not show up in convergence rates and is therefore often (but not always) neglected in algorithm design. The last experiment clearly demonstrates the benefit of variable-metric methods like CMA-ES. It should be noted that variable-metric techniques can be incorporated into most existing algorithms. This is rarely done, though, with random pursuit being a notable exception [56].

Appendix B Proofs

B.1 Proof of Lemma 2.10

Since f_{\mu} is invariant to g, without loss of generality we assume f(x)=h(x)-h(x^{*}) in this proof. Inequality Eq. 7 implies that f(y)\leqslant f(x)\Rightarrow(L_{\ell}/2)\lVert y-x^{*}\rVert^{2}\leqslant f(x), meaning that \{y:f(y)\leqslant f(x)\}\subseteq\bar{\mathcal{B}}\Big{(}x^{*},\sqrt{\frac{f(x)}{L_{\ell}/2}}\Big{)}. Since f_{\mu}(x) is the d-th root of the volume of the left-hand side of the above relation, we find f_{\mu}(x)\leqslant\mu^{\frac{1}{d}}\Big{(}\bar{\mathcal{B}}\Big{(}x^{*},\sqrt{\frac{f(x)}{L_{\ell}/2}}\Big{)}\Big{)}=V_{d}\sqrt{\frac{f(x)}{L_{\ell}/2}}. Analogously, we obtain \mathcal{B}\Big{(}x^{*},\sqrt{\frac{f(x)}{L_{u}/2}}\Big{)}\subseteq\{y:f(y)<f(x)\} and f_{\mu}(x)\geqslant V_{d}\sqrt{\frac{f(x)}{L_{u}/2}}. From these inequalities, we obtain \{y:f(y)\leqslant f(x)\}\subseteq\bar{\mathcal{B}}\Big{(}x^{*},\sqrt{\frac{L_{u}}{L_{\ell}}}\frac{f_{\mu}(x)}{V_{d}}\Big{)} and \mathcal{B}\Big{(}x^{*},\sqrt{\frac{L_{\ell}}{L_{u}}}\frac{f_{\mu}(x)}{V_{d}}\Big{)}\subseteq\{y:f(y)<f(x)\}. This implies A1 for \mathcal{X}_{0}^{\infty}. A2 is immediately implied by Proposition 2.8. This completes the proof.

B.2 Proof of Lemma 2.11

We first prove that A1 holds for a=0a=0 and b=b=\infty with Cu=sup{xx:fμ(x)=1}C_{u}=\sup\{\lVert x-x^{*}\rVert:f_{\mu}(x)=1\} and C=inf{xx:fμ(x)=1}C_{\ell}=\inf\{\lVert x-x^{*}\rVert:f_{\mu}(x)=1\} and they are finite.

It is easy to see that the spatial suboptimality function f_{\mu}(x) is proportional to h(x)-h(x^{*}). Let f_{\mu}(x)=c(h(x)-h(x^{*})) for some c>0. Then, f_{\mu} is also a homogeneous function. Since it is homogeneous, A1 reduces to the existence of an open and a closed ball with radii C_{\ell} and C_{u}, respectively, satisfying the conditions described in the assumption for f_{\mu}(m)=1. Such constants are obtained by C_{u}=\sup\{\lVert x-x^{*}\rVert:f_{\mu}(x)=1\} and C_{\ell}=\inf\{\lVert x-x^{*}\rVert:f_{\mu}(x)=1\}.

Due to the continuity of f there exists an open ball \mathcal{B} around x^{*} such that h(x)<h(x^{*})+1/c for all x\in\mathcal{B}. Then, it holds that f_{\mu}(x)<1 for all x\in\mathcal{B}. This implies that C_{\ell} is no smaller than the radius of \mathcal{B}, which is positive. Hence, C_{\ell}>0.

We show the finiteness of C_{u} by a contradiction argument. Suppose C_{u}=\infty. Then, there is a direction v such that f_{\mu}(x^{*}+Mv)\leqslant 1 for an arbitrarily large M>0. Since f_{\mu} is homogeneous, we have f_{\mu}(x^{*}+v)\leqslant 1/M, and this must hold for any M>0. This implies f_{\mu}(x^{*}+v)=c(h(x^{*}+v)-h(x^{*}))=0, which contradicts the assumption that x^{*} is the unique global optimum. Hence, C_{u}<\infty.

The above argument proves that A1 holds with the above constants for a=0a=0 and b=b=\infty. Proposition 2.8 proves A2.

B.3 Proof of Proposition 4.1

For a given m𝒳abm\in\mathcal{X}_{a}^{b}, there is a closed ball ¯u\bar{\mathcal{B}}_{u} such that S0(m)¯uS_{0}(m)\subseteq\bar{\mathcal{B}}_{u}, see Figure 2. We have

p(a,b]upper(σ¯)\textstyle p^{\mathrm{upper}}_{(a,b]}(\bar{\sigma}) =supm𝒳absupΣ𝒮κS0(m)φ(x;m,(fμ(m)σ¯)2Σ)𝑑x\textstyle=\sup_{m\in\mathcal{X}_{a}^{b}}\sup_{\Sigma\in\mathcal{S}_{\kappa}}\int_{S_{0}(m)}\varphi\big{(}x;m,\left(f_{\mu}(m)\bar{\sigma}\right)^{2}\Sigma\big{)}dx
supm𝒳absupΣ𝒮κ¯uφ(x;m,(fμ(m)σ¯)2Σ)𝑑x(1).\textstyle\leqslant\sup_{m\in\mathcal{X}_{a}^{b}}\sup_{\Sigma\in\mathcal{S}_{\kappa}}\underbrace{\int_{\bar{\mathcal{B}}_{u}}\varphi\big{(}x;m,\left(f_{\mu}(m)\bar{\sigma}\right)^{2}\Sigma\big{)}dx}_{(*1)}\enspace.

The integral is maximized if the ball is centered at mm. By a variable change (xxmx\leftarrow x-m),

(1)\textstyle(*1) xCufμ(m)φ(x;0,(fμ(m)σ¯)2Σ)𝑑x=xCu/σ¯φ(x;0,Σ)𝑑x\textstyle\leqslant\int_{\lVert x\rVert\leqslant C_{u}f_{\mu}(m)}\varphi\big{(}x;0,\left(f_{\mu}(m)\bar{\sigma}\right)^{2}\Sigma\big{)}dx=\int_{\lVert x\rVert\leqslant C_{u}/\bar{\sigma}}\varphi(x;0,\Sigma)dx
κd/2Φ(¯(0,Cuσ¯κ1/2);0,I).\textstyle\leqslant\kappa^{d/2}\Phi\left(\bar{\mathcal{B}}\left(0,\frac{C_{u}}{\bar{\sigma}\kappa^{1/2}}\right);0,\mathrm{I}\right)\enspace.

Here we used \Phi\big{(}\bar{\mathcal{B}}(0,r);0,\Sigma\big{)}\leqslant\kappa^{d/2}\Phi\left(\bar{\mathcal{B}}\left(0,\kappa^{-1/2}r\right);0,\mathrm{I}\right) for any r>0, which is proven in Lemma B.1 below. The right-most side (RMS) of the above inequality is independent of m. This proves Eq. 18.

Similarly, there are balls \mathcal{B}_{\ell} and ¯u\bar{\mathcal{B}}_{u} such that S0(m)¯u\mathcal{B}_{\ell}\subseteq S_{0}(m)\subseteq\bar{\mathcal{B}}_{u}. We have

p(a,b]lower(σ¯)\textstyle p^{\mathrm{lower}}_{(a,b]}(\bar{\sigma}) =infm𝒳abinfΣ𝒮κS0(m)φ(x;m,(fμ(m)σ¯)2Σ)𝑑x\textstyle=\inf_{m\in\mathcal{X}_{a}^{b}}\inf_{\Sigma\in\mathcal{S}_{\kappa}}\int_{S_{0}(m)}\varphi\big{(}x;m,\left(f_{\mu}(m)\bar{\sigma}\right)^{2}\Sigma\big{)}dx
infm𝒳abinfΣ𝒮κφ(x;m,(fμ(m)σ¯)2Σ)𝑑x(2).\textstyle\geqslant\inf_{m\in\mathcal{X}_{a}^{b}}\inf_{\Sigma\in\mathcal{S}_{\kappa}}\underbrace{\int_{\mathcal{B}_{\ell}}\varphi\big{(}x;m,\left(f_{\mu}(m)\bar{\sigma}\right)^{2}\Sigma\big{)}dx}_{(*2)}\enspace.

The integral is minimized if the ball \mathcal{B}_{\ell} lies on the opposite side of m within the ball \bar{\mathcal{B}}_{u}, see Figure 2. By a variable change (moving m to the origin) and letting e_{m}=m/\lVert m\rVert,

(2)\textstyle(*2) x((2CuC)fμ(m))emCfμ(m)φ(x;0,(fμ(m)σ¯)2Σ)𝑑x\textstyle\geqslant\int_{\lVert x-((2C_{u}-C_{\ell})f_{\mu}(m))e_{m}\rVert\leqslant C_{\ell}f_{\mu}(m)}\varphi\big{(}x;0,\left(f_{\mu}(m)\bar{\sigma}\right)^{2}\Sigma\big{)}dx
=x((2CuC)/σ¯)emC/σ¯φ(x;0,Σ)𝑑x\textstyle=\int_{\lVert x-((2C_{u}-C_{\ell})/\bar{\sigma})e_{m}\rVert\leqslant C_{\ell}/\bar{\sigma}}\varphi(x;0,\Sigma)dx
κd/2Φ(¯(((2CuC)κ1/2σ¯)em,Cκ1/2σ¯);0,I).\textstyle\geqslant\kappa^{-d/2}\Phi\left(\bar{\mathcal{B}}\left(\left(\frac{(2C_{u}-C_{\ell})\kappa^{1/2}}{\bar{\sigma}}\right)e_{m},\frac{C_{\ell}\kappa^{1/2}}{\bar{\sigma}}\right);0,\mathrm{I}\right)\enspace.

Here we used Φ(¯(c,r);0,Σ)κd/2Φ(¯(κ1/2c,κ1/2r);0,I)\Phi\big{(}\bar{\mathcal{B}}(c,r);0,\Sigma\big{)}\geqslant\kappa^{-d/2}\Phi\big{(}\bar{\mathcal{B}}(\kappa^{1/2}c,\kappa^{1/2}r);0,\mathrm{I}\big{)} for any cdc\in\mathbb{R}^{d} and r>0r>0 (Lemma B.1). The RMS of the above inequality is independent of mm as its value is constant over all unit vectors eme_{m}. Replacing eme_{m} with e1e_{1}, we have Eq. 19.

Lemma B.1.

For all Σ𝒮κ\Sigma\in\mathcal{S}_{\kappa}, κd/2φ(x;0,κ1I)φ(x;0,Σ)κd/2φ(x;0,κI)\kappa^{-d/2}\varphi\left(x;0,\kappa^{-1}\mathrm{I}\right)\leqslant\varphi\big{(}x;0,\Sigma\big{)}\leqslant\kappa^{d/2}\varphi\left(x;0,\kappa\mathrm{I}\right) and κd/2Φ((κc,κr);0,I)Φ((c,r);0,Σ)κd/2Φ((c/κ,r/κ);0,I)\kappa^{-d/2}\Phi\left(\mathcal{B}(\sqrt{\kappa}c,\sqrt{\kappa}r);0,\mathrm{I}\right)\leqslant\Phi\big{(}\mathcal{B}(c,r);0,\Sigma\big{)}\leqslant\kappa^{d/2}\Phi\left(\mathcal{B}(c/\sqrt{\kappa},r/\sqrt{\kappa});0,\mathrm{I}\right).

Proof B.2.

For \Sigma\in\mathcal{S}_{\kappa}, we have \det(\Sigma)=1 and \operatorname{Cond}(\Sigma)=\lambda_{\max}(\Sigma)/\lambda_{\min}(\Sigma)\leqslant\kappa. Since \det(\Sigma)=1 and \det(\Sigma)=\prod_{i=1}^{d}\lambda_{i}(\Sigma), we have \lambda_{\max}(\Sigma)\geqslant 1\geqslant\lambda_{\min}(\Sigma). Therefore, we have \lambda_{\min}(\Sigma)\geqslant\lambda_{\max}(\Sigma)/\kappa\geqslant\kappa^{-1} and \lambda_{\max}(\Sigma)\leqslant\kappa\lambda_{\min}(\Sigma)\leqslant\kappa. Then we obtain \kappa^{-1}x^{\mathrm{T}}\mathrm{I}x\leqslant x^{\mathrm{T}}\Sigma^{-1}x\leqslant\kappa x^{\mathrm{T}}\mathrm{I}x. With this inequality we have

φ(x;0,Σ)=(2π)d/2exp(xTΣ1x/2)(2π)d/2exp(xTIx/(2κ))=κd/2(2πκ)d/2exp(xTIx/(2κ))=κd/2φ(x;0,κI).\textstyle\varphi\big{(}x;0,\Sigma\big{)}=(2\pi)^{-d/2}\exp(-x^{\mathrm{T}}\Sigma^{-1}x/2)\leqslant(2\pi)^{-d/2}\exp(-x^{\mathrm{T}}\mathrm{I}x/(2\kappa))\\ \textstyle=\kappa^{d/2}(2\pi\kappa)^{-d/2}\exp(-x^{\mathrm{T}}\mathrm{I}x/(2\kappa))=\kappa^{d/2}\varphi\big{(}x;0,\kappa\mathrm{I}\big{)}\enspace.

Analogously, we obtain φ(x;0,Σ)κd/2φ(x;0,κ1I)\varphi\big{(}x;0,\Sigma\big{)}\geqslant\kappa^{-d/2}\varphi\big{(}x;0,\kappa^{-1}\mathrm{I}\big{)}. Taking the integral over (c,r)\mathcal{B}(c,r), we obtain the second statement.
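As a quick numerical sanity check of Lemma B.1 (an illustration only; it uses SciPy and a randomly generated \Sigma with \det(\Sigma)=1 and \operatorname{Cond}(\Sigma)\leqslant\kappa):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(3)
d, kappa = 4, 10.0
# Random SPD Sigma with det(Sigma) = 1 and Cond(Sigma) <= kappa.
lam = kappa ** rng.uniform(-0.5, 0.5, d)   # eigenvalue ratio at most kappa
lam /= lam.prod() ** (1.0 / d)             # rescale so that det(Sigma) = 1
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
Sigma = Q @ np.diag(lam) @ Q.T

x = rng.standard_normal(d)
lower = kappa ** (-d / 2) * mvn.pdf(x, mean=np.zeros(d), cov=np.eye(d) / kappa)
upper = kappa ** (d / 2) * mvn.pdf(x, mean=np.zeros(d), cov=kappa * np.eye(d))
assert lower <= mvn.pdf(x, mean=np.zeros(d), cov=Sigma) <= upper
```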

B.4 Proof of Lemma 4.2

The upper bound of p^{\mathrm{upper}}_{(a,b]} given in (18) is strictly decreasing in \bar{\sigma} and converges to zero as \bar{\sigma} goes to infinity. This guarantees the existence of \bar{\sigma}_{u} as a finite value. The existence of \bar{\sigma}_{\ell}>0 is obvious under A2. A1 guarantees that there exists an open ball \mathcal{B}_{\ell} with radius C_{\ell}(1-r)f_{\mu}(m) such that \mathcal{B}_{\ell}\subseteq\{x\in\mathbb{R}^{d}\mid f_{\mu}(x)<(1-r)f_{\mu}(m)\}. Then, analogously to the proof of Proposition 4.1, the success probability with rate r is lower bounded by

(27) prsucc(σ¯;m,Σ)κd/2Φ((((2Cu(1r)C)κ1/2σ¯)e1,(1r)Cκ1/2σ¯);0,I).\textstyle p^{\mathrm{succ}}_{r}(\bar{\sigma};m,\Sigma)\geqslant\kappa^{-d/2}\Phi\left(\mathcal{B}\left(\left(\frac{(2C_{u}-(1-r)C_{\ell})\kappa^{1/2}}{\bar{\sigma}}\right)e_{1},\frac{(1-r)C_{\ell}\kappa^{1/2}}{\bar{\sigma}}\right);0,\mathrm{I}\right).

The probability is independent of mm, positive, and continuous in σ¯[,u]\bar{\sigma}\in[\ell,u]. Therefore the minimum is attained. This completes the proof.

B.5 Proof of Proposition 4.3

First, we remark that m_{t}\in\mathcal{X}_{a}^{b} is equivalent to the condition a<f_{\mu}(m_{t})\leqslant b. If f_{\mu}(m_{t})\leqslant a or f_{\mu}(m_{t})>b, both sides of Eq. 23 are zero, hence the inequality is trivial. In the following we assume that m_{t}\in\mathcal{X}_{a}^{b}.

For the sake of simplicity we introduce log+(x)=log(x)1{x1}\log^{+}(x)=\log(x)1\left\{\scriptstyle{x\geqslant 1}\right\}. We rewrite the potential function as

(28) V(θt)=\textstyle V(\theta_{t})= log(fμ(mt))+vlog+(αfμ(mt)σt)+vlog+(σtαufμ(mt)).\textstyle\log\left(f_{\mu}(m_{t})\right)+v\log^{+}\left(\frac{\alpha_{\uparrow}\ell f_{\mu}(m_{t})}{\sigma_{t}}\right)+v\log^{+}\left(\frac{\sigma_{t}}{\alpha_{\downarrow}uf_{\mu}(m_{t})}\right)\enspace.

The potential function at time t+1t+1 can be written as

V(θt+1)=logfμ(mt+1)+vlog+fμ(mt+1)σt1{σt+1>σt}P2+vlog+αfμ(mt)ασt1{σt+1<σt}P3+vlog+ασtαufμ(mt+1)1{σt+1>σt}P4+vlog+σtufμ(mt)1{σt+1<σt}P5.\textstyle V(\theta_{t+1})=\log f_{\mu}(m_{t+1})+\underbrace{v\log^{+}\frac{\ell f_{\mu}(m_{t+1})}{\sigma_{t}}1\left\{\scriptstyle{\sigma_{t+1}>\sigma_{t}}\right\}}_{P_{2}}+\underbrace{v\log^{+}\frac{\alpha_{\uparrow}\ell f_{\mu}(m_{t})}{\alpha_{\downarrow}\sigma_{t}}1\left\{\scriptstyle{\sigma_{t+1}<\sigma_{t}}\right\}}_{P_{3}}\\ \textstyle+\underbrace{v\log^{+}\frac{\alpha_{\uparrow}\sigma_{t}}{\alpha_{\downarrow}uf_{\mu}(m_{t+1})}1\left\{\scriptstyle{\sigma_{t+1}>\sigma_{t}}\right\}}_{P_{4}}+\underbrace{v\log^{+}\frac{\sigma_{t}}{uf_{\mu}(m_{t})}1\left\{\scriptstyle{\sigma_{t+1}<\sigma_{t}}\right\}}_{P_{5}}\enspace.

We want to estimate the conditional expectation

(29) 𝔼[max{V(θt+1)V(θt),A}θt].\mathbb{E}\left[\max\{V(\theta_{t+1})-V(\theta_{t})\,,\,-A\}\mid\theta_{t}\right].

We partition the possible values of \theta_{t} into three sets: first the set of \theta_{t} such that \sigma_{t}<\ell f_{\mu}(m_{t}) (\sigma_{t} is small), second the set of \theta_{t} such that \sigma_{t}>uf_{\mu}(m_{t}) (\sigma_{t} is large), and last the set of \theta_{t} such that \ell f_{\mu}(m_{t})\leqslant\sigma_{t}\leqslant uf_{\mu}(m_{t}) (reasonable \sigma_{t}). In the following, we bound Eq. 29 in each of the three cases; in the end, our bound B equals the minimum of the three bounds obtained (the small and large \sigma_{t} cases yield the same bound, resulting in the two terms of Eq. 24).

Reasonable σt\sigma_{t} case: fμ(mt)σt[1u,1]\frac{f_{\mu}(m_{t})}{\sigma_{t}}\in\left[\frac{1}{u},\frac{1}{\ell}\right]. In case of success, where 1{σt+1>σt}=11\left\{\scriptstyle{\sigma_{t+1}>\sigma_{t}}\right\}=1, we have fμ(mt+1)/σt+1fμ(mt)/(ασt)1/(α)f_{\mu}(m_{t+1})/\sigma_{t+1}\leqslant f_{\mu}(m_{t})/(\alpha_{\uparrow}\sigma_{t})\leqslant 1/(\alpha_{\uparrow}\ell), implying that P2P_{2} is always 0. Similarly, in case of failure, fμ(mt+1)/σt+1=fμ(mt)/(ασt)1/(αu)f_{\mu}(m_{t+1})/\sigma_{t+1}=f_{\mu}(m_{t})/(\alpha_{\downarrow}\sigma_{t})\geqslant 1/(\alpha_{\downarrow}u) and we find that P5P_{5} is always zero. We rearrange P3P_{3} and P4P_{4} into

P3\textstyle P_{3} =vlog+(αfμ(mt)ασt)1{σt+1<σt},\textstyle=v\log^{+}\left(\frac{\alpha_{\uparrow}\ell f_{\mu}(m_{t})}{\alpha_{\downarrow}\sigma_{t}}\right)1\left\{\scriptstyle{\sigma_{t+1}<\sigma_{t}}\right\}\enspace,
P4\textstyle P_{4} =v[log(ασtαufμ(mt))log(fμ(mt+1)fμ(mt))]1{αufμ(mt+1)ασt<1}1{σt+1>σt}.\textstyle=v\left[\log\left(\frac{\alpha_{\uparrow}\sigma_{t}}{\alpha_{\downarrow}uf_{\mu}(m_{t})}\right)-\log\left(\frac{f_{\mu}(m_{t+1})}{f_{\mu}(m_{t})}\right)\right]1\left\{\scriptstyle{\frac{\alpha_{\downarrow}uf_{\mu}(m_{t+1})}{\alpha_{\uparrow}\sigma_{t}}<1}\right\}1\left\{\scriptstyle{\sigma_{t+1}>\sigma_{t}}\right\}\enspace.

Then, the one-step change Δt=V(θt+1)V(θt)\Delta_{t}=V(\theta_{t+1})-V(\theta_{t}) is upper bounded by

(30) Δt(1v1{αufμ(mt)ασt<1}1{σt+1>σt})log(fμ(mt+1)fμ(mt))+vlog+(αfμ(mt)ασt)1{σt+1<σt}+vlog+(ασtαufμ(mt))1{σt+1>σt}(1v)logfμ(mt+1)fμ(mt)+vlog+αfμ(mt)ασt1{σt+1<σt}+vlog+ασtαufμ(mt)1{σt+1>σt}.\textstyle\Delta_{t}\leqslant\left(1-v1\left\{\scriptstyle{\frac{\alpha_{\downarrow}uf_{\mu}(m_{t})}{\alpha_{\uparrow}\sigma_{t}}<1}\right\}1\left\{\scriptstyle{\sigma_{t+1}>\sigma_{t}}\right\}\right)\log\left(\frac{f_{\mu}(m_{t+1})}{f_{\mu}(m_{t})}\right)\\ \textstyle+v\log^{+}\left(\frac{\alpha_{\uparrow}\ell f_{\mu}(m_{t})}{\alpha_{\downarrow}\sigma_{t}}\right)1\left\{\scriptstyle{\sigma_{t+1}<\sigma_{t}}\right\}+v\log^{+}\left(\frac{\alpha_{\uparrow}\sigma_{t}}{\alpha_{\downarrow}uf_{\mu}(m_{t})}\right)1\left\{\scriptstyle{\sigma_{t+1}>\sigma_{t}}\right\}\\ \textstyle\leqslant(1-v)\log\frac{f_{\mu}(m_{t+1})}{f_{\mu}(m_{t})}+v\log^{+}\frac{\alpha_{\uparrow}\ell f_{\mu}(m_{t})}{\alpha_{\downarrow}\sigma_{t}}1\left\{\scriptstyle{\sigma_{t+1}<\sigma_{t}}\right\}+v\log^{+}\frac{\alpha_{\uparrow}\sigma_{t}}{\alpha_{\downarrow}uf_{\mu}(m_{t})}1\left\{\scriptstyle{\sigma_{t+1}>\sigma_{t}}\right\}\enspace.

The truncated one-step change max{Δt,A}\max\{\Delta_{t}\,,\,-A\} is upper bounded by

(31) max{Δt,A}(1v)max{log(fμ(mt+1)fμ(mt)),A1v}+vlog+(αfμ(mt)ασt)1{σt+1<σt}+vlog+(ασtαufμ(mt))1{σt+1>σt}.\textstyle\max\{\Delta_{t}\,,\,-A\}\leqslant(1-v)\max\left\{\log\left(\frac{f_{\mu}(m_{t+1})}{f_{\mu}(m_{t})}\right)\,,\,-\frac{A}{1-v}\right\}\\ \textstyle+v\log^{+}\left(\frac{\alpha_{\uparrow}\ell f_{\mu}(m_{t})}{\alpha_{\downarrow}\sigma_{t}}\right)1\left\{\scriptstyle{\sigma_{t+1}<\sigma_{t}}\right\}+v\log^{+}\left(\frac{\alpha_{\uparrow}\sigma_{t}}{\alpha_{\downarrow}uf_{\mu}(m_{t})}\right)1\left\{\scriptstyle{\sigma_{t+1}>\sigma_{t}}\right\}\enspace.

To consider the expectation of the above upper bound, we need to compute the expectation of the maximum of \log\left(\frac{f_{\mu}(m_{t+1})}{f_{\mu}(m_{t})}\right) and -\frac{A}{1-v}. For y\leqslant 0 and z\in\mathbb{R} we have \max(y,z)=y1\left\{\scriptstyle{y>z}\right\}+z1\left\{\scriptstyle{y\leqslant z}\right\}\leqslant z1\left\{\scriptstyle{y\leqslant z}\right\}, and \log\left(\frac{f_{\mu}(m_{t+1})}{f_{\mu}(m_{t})}\right)\leqslant 0 holds by elitist selection. Applying this and taking the conditional expectation, an upper bound for the conditional expectation of \max\left\{\log\left(\frac{f_{\mu}(m_{t+1})}{f_{\mu}(m_{t})}\right)\,,\,-\frac{A}{1-v}\right\} is -\frac{A}{1-v} times the probability of \log\left(\frac{f_{\mu}(m_{t+1})}{f_{\mu}(m_{t})}\right) being no greater than -\frac{A}{1-v}. The latter condition is equivalent to f_{\mu}(m_{t+1})\leqslant(1-r)f_{\mu}(m_{t}), corresponding to successes with rate r=1-\exp\left(-\frac{A}{1-v}\right) or better. That is,

(32) (1v)𝔼[max{log(fμ(mt+1)fμ(mt)),A1v}]Aprsucc(σtfμ(mt);mt,Σt).\textstyle(1-v)\mathbb{E}\left[\max\left\{\log\left(\frac{f_{\mu}(m_{t+1})}{f_{\mu}(m_{t})}\right)\,,\,-\frac{A}{1-v}\right\}\right]\\ \leqslant-Ap^{\mathrm{succ}}_{r}\left(\frac{\sigma_{t}}{f_{\mu}(m_{t})};m_{t},\Sigma_{t}\right)\enspace.

Note also that the expected value of 1{σt+1>σt}1\left\{\scriptstyle{\sigma_{t+1}>\sigma_{t}}\right\} is the success probability, namely, p0succ(σtfμ(mt);mt,Σt)p^{\mathrm{succ}}_{0}\left(\frac{\sigma_{t}}{f_{\mu}(m_{t})};m_{t},\Sigma_{t}\right). We obtain an upper bound for the conditional expectation of max{Δt,A}\max\{\Delta_{t}\,,\,-A\} in the case of reasonable σt\sigma_{t} as

(33) 𝔼[max{Δt,A}|θt]Aprsucc(σtfμ(mt);mt,Σt)+(log(αα)+log(fμ(mt)σt)0)v(1p0succ(σtfμ(mt);mt,Σt))+(log(αα)+log(σtufμ(mt))0)vp0succ(σtfμ(mt);mt,Σt)Apr+vlog(αα).\textstyle\mathbb{E}\left[\max\{\Delta_{t}\,,\,-A\}|\theta_{t}\right]\leqslant-Ap^{\mathrm{succ}}_{r}\left(\frac{\sigma_{t}}{f_{\mu}(m_{t})};m_{t},\Sigma_{t}\right)\\ \textstyle+\bigg{(}\log\left(\frac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\right)+\underbrace{\log\left(\frac{\ell f_{\mu}(m_{t})}{\sigma_{t}}\right)}_{\leqslant 0}\bigg{)}v\left(1-p^{\mathrm{succ}}_{0}\left(\frac{\sigma_{t}}{f_{\mu}(m_{t})};m_{t},\Sigma_{t}\right)\right)\\ \textstyle+\bigg{(}\log\left(\frac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\right)+\underbrace{\log\left(\frac{\sigma_{t}}{uf_{\mu}(m_{t})}\right)}_{\leqslant 0}\bigg{)}vp^{\mathrm{succ}}_{0}\left(\frac{\sigma_{t}}{f_{\mu}(m_{t})};m_{t},\Sigma_{t}\right)\leqslant-Ap^{*}_{r}+v\log\left(\frac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\right)\enspace.

Small σt\sigma_{t} case: fμ(mt)σt>1\frac{f_{\mu}(m_{t})}{\sigma_{t}}>\frac{1}{\ell}. If fμ(mt)>σt\ell f_{\mu}(m_{t})>\sigma_{t}, the 2nd summand in Eq. 28 is positive. Moreover, if σt+1<σt\sigma_{t+1}<\sigma_{t}, we have fμ(mt+1)=fμ(mt)>σt>σt+1\ell f_{\mu}(m_{t+1})=\ell f_{\mu}(m_{t})>\sigma_{t}>\sigma_{t+1} and hence the 2nd summand in Eq. 28 is positive for V(θt+1)V(\theta_{t+1}) as well. If σt+1>σt\sigma_{t+1}>\sigma_{t}, any regime can happen. Then, V(θt+1)V(θt)=V(\theta_{t+1})-V(\theta_{t})=

$\begin{aligned}
&=\log\frac{f_{\mu}(m_{t+1})}{f_{\mu}(m_{t})}-v\log\frac{\alpha_{\uparrow}\ell f_{\mu}(m_{t})}{\sigma_{t}}+v\log\frac{\ell f_{\mu}(m_{t+1})}{\sigma_{t}}\,1\big\{\tfrac{\ell f_{\mu}(m_{t+1})}{\sigma_{t}}>1\big\}\,1\{\sigma_{t+1}>\sigma_{t}\}\\
&\quad+v\log\frac{\alpha_{\uparrow}\ell f_{\mu}(m_{t})}{\alpha_{\downarrow}\sigma_{t}}\,1\big\{\tfrac{\alpha_{\uparrow}\ell f_{\mu}(m_{t})}{\alpha_{\downarrow}\sigma_{t}}>1\big\}\,1\{\sigma_{t+1}<\sigma_{t}\}\\
&\quad+v\log\frac{\alpha_{\uparrow}\sigma_{t}}{\alpha_{\downarrow}u f_{\mu}(m_{t+1})}\,1\big\{\tfrac{\alpha_{\downarrow}u f_{\mu}(m_{t+1})}{\alpha_{\uparrow}\sigma_{t}}<1\big\}\,1\{\sigma_{t+1}>\sigma_{t}\}\\
&=\log\left(\frac{f_{\mu}(m_{t+1})}{f_{\mu}(m_{t})}\right)\left[1+v\left(1\big\{\tfrac{\ell f_{\mu}(m_{t+1})}{\sigma_{t}}>1\big\}-1\big\{\tfrac{\alpha_{\downarrow}u f_{\mu}(m_{t+1})}{\alpha_{\uparrow}\sigma_{t}}<1\big\}\right)1\{\sigma_{t+1}>\sigma_{t}\}\right]\\
&\quad-v\log\left(\frac{\alpha_{\downarrow}u f_{\mu}(m_{t})}{\alpha_{\uparrow}\sigma_{t}}\right)1\big\{\tfrac{\alpha_{\downarrow}u f_{\mu}(m_{t+1})}{\alpha_{\uparrow}\sigma_{t}}<1\big\}\,1\{\sigma_{t+1}>\sigma_{t}\}\\
&\quad-v\log\left(\frac{\ell f_{\mu}(m_{t})}{\sigma_{t}}\right)\left[1-1\big\{\tfrac{\ell f_{\mu}(m_{t+1})}{\sigma_{t}}>1\big\}1\{\sigma_{t+1}>\sigma_{t}\}-1\big\{\tfrac{\alpha_{\uparrow}\ell f_{\mu}(m_{t})}{\alpha_{\downarrow}\sigma_{t}}>1\big\}1\{\sigma_{t+1}<\sigma_{t}\}\right]\\
&\quad-v\left(\log(\alpha_{\uparrow})-\log\big(\tfrac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\big)1\big\{\tfrac{\alpha_{\uparrow}\ell f_{\mu}(m_{t})}{\alpha_{\downarrow}\sigma_{t}}>1\big\}1\{\sigma_{t+1}<\sigma_{t}\}\right)\enspace.
\end{aligned}$

On the RHS of the above equality, the first term is guaranteed to be non-positive since $v\in(0,1)$. The second and third terms are non-positive as well since $\frac{\alpha_{\downarrow}u f_{\mu}(m_{t})}{\alpha_{\uparrow}\sigma_{t}}>\frac{\alpha_{\downarrow}u}{\alpha_{\uparrow}\ell}\geqslant 1$ and $\frac{\ell f_{\mu}(m_{t})}{\sigma_{t}}>1$. Replacing the indicator $1\big\{\frac{\alpha_{\uparrow}\ell f_{\mu}(m_{t})}{\alpha_{\downarrow}\sigma_{t}}>1\big\}$ with $1$ in the last term provides an upper bound. Altogether, we obtain

$\Delta_{t}=V(\theta_{t+1})-V(\theta_{t})\leqslant-v\left(\log(\alpha_{\uparrow})-\log(\alpha_{\uparrow}/\alpha_{\downarrow})\,1\{\sigma_{t+1}<\sigma_{t}\}\right)\enspace.$

Note that the RHS is no smaller than $-A$, since it is lower bounded by $-v\log(\alpha_{\uparrow})$ and $v\leqslant A/\log(\alpha_{\uparrow})$. Then, the conditional expectation of $\max\{\Delta_{t},-A\}$ is

(34) $\mathbb{E}\left[\max\{\Delta_{t},-A\}\,\middle|\,\mathcal{F}_{t}\right]\leqslant-v\left(\log\big(\tfrac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\big)\,p^{\mathrm{succ}}_{0}\left(\tfrac{\sigma_{t}}{f_{\mu}(m_{t})};m_{t},\Sigma_{t}\right)+\log(\alpha_{\downarrow})\right)\leqslant-v\left(\log\big(\tfrac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\big)\,p_{\ell}+\log(\alpha_{\downarrow})\right)=-v\log\big(\tfrac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\big)\left(p_{\ell}-p_{\mathrm{target}}\right)=-v\,\frac{p_{\ell}-p_{u}}{2}\log\big(\tfrac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\big)\enspace.$

Here we used $\mathbb{E}[1\{\sigma_{t+1}<\sigma_{t}\}\mid\mathcal{F}_{t}]=1-p_{0}^{\mathrm{succ}}\left(\frac{\sigma_{t}}{f_{\mu}(m_{t})};m_{t},\Sigma_{t}\right)$ for the first inequality, $p^{\mathrm{succ}}_{0}\left(\frac{\sigma_{t}}{f_{\mu}(m_{t})};m_{t},\Sigma_{t}\right)>p_{\ell}$ for the second inequality, and $p_{\mathrm{target}}=\log\left(\frac{1}{\alpha_{\downarrow}}\right)\big/\log\left(\frac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\right)$ and $p_{\mathrm{target}}=(p_{u}+p_{\ell})/2$ for the last two equalities.
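As a concrete check of the relation $p_{\mathrm{target}}=\log(1/\alpha_{\downarrow})/\log(\alpha_{\uparrow}/\alpha_{\downarrow})$ (an illustration, not part of the proof): choosing, say, $\alpha_{\uparrow}=2$ and $\alpha_{\downarrow}=2^{-1/4}$ yields

$p_{\mathrm{target}}=\frac{\log 2^{1/4}}{\log 2^{5/4}}=\frac{1/4}{5/4}=\frac{1}{5}\enspace,$

which recovers the classical one-fifth success rule as a special case of the generalized rule analyzed here.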

Large $\sigma_{t}$ case: $\frac{f_{\mu}(m_{t})}{\sigma_{t}}<\frac{1}{u}$. Since $\frac{f_{\mu}(m_{t+1})}{\sigma_{t+1}}\leqslant\frac{f_{\mu}(m_{t})}{\alpha_{\downarrow}\sigma_{t}}<\frac{1}{\alpha_{\downarrow}u}$, the 3rd summand in Eq. 28 is positive in both $V(\theta_{t})$ and $V(\theta_{t+1})$. For the 2nd summand in Eq. 28, recall that $\alpha_{\uparrow}\ell f_{\mu}(m_{t})/\sigma_{t}<\alpha_{\uparrow}\ell/u\leqslant\alpha_{\downarrow}<1$ since we have assumed that $u/\ell\geqslant\alpha_{\uparrow}/\alpha_{\downarrow}$. Hence, for $V(\theta_{t})$ the 2nd summand in Eq. 28 is zero. Also, $\alpha_{\uparrow}\ell f_{\mu}(m_{t+1})/\sigma_{t+1}\leqslant\alpha_{\uparrow}\ell/(\alpha_{\downarrow}u)=(\alpha_{\uparrow}/\alpha_{\downarrow})\,\ell/u\leqslant 1$, and thus for $V(\theta_{t+1})$ the 2nd summand in Eq. 28 also equals $0$. We obtain

$V(\theta_{t+1})-V(\theta_{t})=(1-v)\big(\log\left(f_{\mu}(m_{t+1})\right)-\log\left(f_{\mu}(m_{t})\right)\big)+v\log\left(\sigma_{t+1}/\sigma_{t}\right)\enspace.$

The first term on the RHS is guaranteed to be non-positive since $v<1$, yielding $\Delta_{t}\leqslant v\log(\sigma_{t+1}/\sigma_{t})$. On the other hand,

$\begin{aligned}
v\log(\sigma_{t+1}/\sigma_{t})&=v\left(\log(\alpha_{\uparrow})1\{\sigma_{t+1}>\sigma_{t}\}+\log(\alpha_{\downarrow})1\{\sigma_{t+1}<\sigma_{t}\}\right)\\
&=v\left(\log(\alpha_{\uparrow}/\alpha_{\downarrow})1\{\sigma_{t+1}>\sigma_{t}\}-\log(1/\alpha_{\downarrow})\right)\\
&\geqslant-v\log(1/\alpha_{\downarrow})\geqslant-A\enspace,
\end{aligned}$

where the last inequality comes from the prerequisite $v\leqslant A/\log(1/\alpha_{\downarrow})$. Hence,

$\max\{\Delta_{t},-A\}\leqslant\max\{v\log(\sigma_{t+1}/\sigma_{t}),-A\}=v\log(\sigma_{t+1}/\sigma_{t})\enspace.$

Then, the conditional expectation of $\max\{\Delta_{t},-A\}$ is

(35) $\mathbb{E}\left[\max\{\Delta_{t},-A\}\,\middle|\,\theta_{t}\right]\leqslant v\left(\log(\alpha_{\downarrow})+\log\big(\tfrac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\big)\,p^{\mathrm{succ}}_{0}\left(\tfrac{\sigma_{t}}{f_{\mu}(m_{t})};m_{t},\Sigma_{t}\right)\right)\leqslant v\left(\log(\alpha_{\downarrow})+\log\big(\tfrac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\big)\,p_{u}\right)=v\log\big(\tfrac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\big)\left(p_{u}-p_{\mathrm{target}}\right)=-v\,\frac{p_{\ell}-p_{u}}{2}\log\big(\tfrac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\big)\enspace.$

Here we used $p^{\mathrm{succ}}_{0}\left(\frac{\sigma_{t}}{f_{\mu}(m_{t})};m_{t},\Sigma_{t}\right)\leqslant p_{u}$ for the second inequality.
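The identity underlying Eqs. 34 and 35, namely $\mathbb{E}[v\log(\sigma_{t+1}/\sigma_{t})]=v\log(\alpha_{\uparrow}/\alpha_{\downarrow})(p^{\mathrm{succ}}-p_{\mathrm{target}})$ for a step with success probability $p^{\mathrm{succ}}$, can be checked numerically. The following minimal sketch (illustrative only; the constants are arbitrary choices, not values from the paper) simulates the multiplicative step-size update and compares the empirical mean of $\log(\sigma_{t+1}/\sigma_{t})$ with the closed form.

import numpy as np

rng = np.random.default_rng(0)
alpha_up, alpha_down = 2.0, 2.0 ** (-1 / 4)  # arbitrary illustrative values
p_succ = 0.3                                 # assumed success probability
p_target = np.log(1 / alpha_down) / np.log(alpha_up / alpha_down)  # = 1/5 here

# sigma_{t+1} = alpha_up * sigma_t on success, alpha_down * sigma_t otherwise
success = rng.random(10**6) < p_succ
log_ratio = np.where(success, np.log(alpha_up), np.log(alpha_down))

empirical = log_ratio.mean()
closed_form = np.log(alpha_up / alpha_down) * (p_succ - p_target)
print(empirical, closed_form)  # the two values agree up to Monte Carlo error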

Conclusion. Inequalities Eqs. 33, 34 and 35 together cover all possible cases (reasonable, small, and large $\sigma_{t}$), and we hence obtain Eq. 24.

Finally, we prove the positivity of $B$ for an arbitrary $A>0$. Lemma 4.2 guarantees the positivity of $p_{r}^{*}$ for any choice of $A$, since $r=1-\exp(-A/(1-v))\in(0,1)$ for any $A>0$ and $v<1$. Therefore, $Ap_{r}^{*}>0$ for any $A$ and $v\leqslant\min(1,\,A/\log(1/\alpha_{\downarrow}),\,A/\log(\alpha_{\uparrow}))$. Moreover, $p^{*}_{r}$ remains strictly positive for any $A>0$ as $v$ becomes small. Therefore, one can take a sufficiently small $v$ that satisfies $Ap_{r}^{*}>v\log(\alpha_{\uparrow}/\alpha_{\downarrow})$, so that the first term in the minimum in Eq. 24 is positive. The second term therein is clearly positive for $v>0$. This completes the proof.
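To illustrate this final choice of $v$ (an illustration, not part of the proof), the sketch below scans $v$ over its admissible range and verifies that both terms in the minimum defining $B$ in Eq. 24 are positive once $v$ is small enough. Here $p_{r}^{*}$ is treated as a fixed positive number, which Lemma 4.2 guarantees but does not compute; all constants are assumed for illustration.

import numpy as np

# illustrative constants only (not values from the paper); chosen consistently:
# p_target = log(1/alpha_down)/log(alpha_up/alpha_down) = (p_ell + p_u)/2 = 1/5
A = 0.1
alpha_up, alpha_down = 2.0, 2.0 ** (-1 / 4)
p_star_r = 0.05          # stand-in for p_r^*, positive by Lemma 4.2
p_ell, p_u = 0.3, 0.1    # lower/upper bounds on the success probability

v_max = min(1.0, A / np.log(1 / alpha_down), A / np.log(alpha_up))
for v in np.linspace(v_max / 100, v_max, 5):
    term1 = A * p_star_r - v * np.log(alpha_up / alpha_down)
    term2 = v * (p_ell - p_u) / 2 * np.log(alpha_up / alpha_down)
    print(f"v={v:.5f}  term1={term1:+.5f}  term2={term2:+.5f}  B>={min(term1, term2):+.5f}")
# term2 > 0 for every v > 0, while term1 > 0 once
# v < A * p_star_r / log(alpha_up / alpha_down); both hold for small enough v.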

B.6 Proof of Proposition 4.15

Consider $d\geqslant 2$. We set $A=1/d$. We bound $B$ from below by taking a specific value of $v\in(0,\,\min(1,\,A/\log(1/\alpha_{\downarrow}),\,A/\log(\alpha_{\uparrow})))$ instead of taking the supremum over $v$. Our candidate is $v=\frac{Ap^{\prime}}{\log(\alpha_{\uparrow}/\alpha_{\downarrow})}\frac{2}{2+p_{\ell}-p_{u}}$, where $p^{\prime}=\inf_{\bar{\sigma}\in[\ell,u]}p_{r^{\prime}}(\bar{\sigma})$ and $r^{\prime}=1-\exp\big(-A\big(1-\frac{1}{d\log(\alpha_{\uparrow}/\alpha_{\downarrow})}\big)^{-1}\big)$. It holds that $v<\frac{1}{d\log(\alpha_{\uparrow}/\alpha_{\downarrow})}$, and hence $r^{\prime}>r$, from which we obtain $p^{\prime}<p^{*}$.

We bound the terms in Eq. 24 as $Ap^{*}-v\log(\alpha_{\uparrow}/\alpha_{\downarrow})=\frac{p^{\prime}}{d}\left(\frac{p^{*}}{p^{\prime}}-\frac{2}{2+p_{\ell}-p_{u}}\right)\geqslant\frac{p^{\prime}}{d}\left(\frac{p_{\ell}-p_{u}}{2+p_{\ell}-p_{u}}\right)$ and $v\frac{p_{\ell}-p_{u}}{2}\log\left(\frac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\right)=\frac{p^{\prime}}{d}\frac{p_{\ell}-p_{u}}{2+p_{\ell}-p_{u}}$. Therefore, we have $B\geqslant\frac{p^{\prime}}{d}\frac{p_{\ell}-p_{u}}{2+p_{\ell}-p_{u}}$. Note that one can take $p_{\ell}-p_{u}\in\Theta(1)$ since the only condition is $p_{\mathrm{target}}=(p_{\ell}+p_{u})/2\in\Theta(1)$. To obtain $B\in\Omega(1/d)$, it is sufficient to show $p^{\prime}\in\Theta(1)$ as $d\to\infty$.
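For transparency, substituting the candidate $v$ into $v\log(\alpha_{\uparrow}/\alpha_{\downarrow})$ gives

$v\log\left(\frac{\alpha_{\uparrow}}{\alpha_{\downarrow}}\right)=Ap^{\prime}\,\frac{2}{2+p_{\ell}-p_{u}}=\frac{p^{\prime}}{d}\,\frac{2}{2+p_{\ell}-p_{u}}\enspace,$

so that $Ap^{*}-v\log(\alpha_{\uparrow}/\alpha_{\downarrow})=\frac{p^{*}}{d}-\frac{p^{\prime}}{d}\frac{2}{2+p_{\ell}-p_{u}}=\frac{p^{\prime}}{d}\left(\frac{p^{*}}{p^{\prime}}-\frac{2}{2+p_{\ell}-p_{u}}\right)$, and the displayed inequality then uses $p^{*}\geqslant p^{\prime}$ together with $1-\frac{2}{2+p_{\ell}-p_{u}}=\frac{p_{\ell}-p_{u}}{2+p_{\ell}-p_{u}}$.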

Fix $p_{\ell}$ and $p_{u}$ independently of $d$. In light of Lemma 3.1 in [4], $p_{0}:\mathbb{R}_{>}\to(0,1/2)$ is continuous and strictly decreasing from $1/2$ to $0$ for all $d\in\mathbb{N}$. Therefore, for each $d\in\mathbb{N}$ there exists an inverse map $p_{0}^{-1}:(0,1/2)\to\mathbb{R}_{>}$. Define $\hat{\sigma}_{\ell}^{d}=dV_{d}\,p_{0}^{-1}(p_{\ell})$ and $\hat{\sigma}_{u}^{d}=dV_{d}\,p_{0}^{-1}(p_{u})$ for each $d\in\mathbb{N}$. It follows from Lemma 3.2 in [4] that $p_{0}^{\mathrm{lim}}:\hat{\sigma}\mapsto\lim_{d\to\infty}p_{0}(\hat{\sigma}/(dV_{d}))$ is also strictly decreasing, hence invertible; the existence of this limit is also proved in [4]. We let $\hat{\sigma}_{\ell}^{\infty}=(p_{0}^{\mathrm{lim}})^{-1}(p_{\ell})$ and $\hat{\sigma}_{u}^{\infty}=(p_{0}^{\mathrm{lim}})^{-1}(p_{u})$. Because of the pointwise convergence of $p_{0}(\hat{\sigma}/(dV_{d}))$ to $p_{0}^{\mathrm{lim}}(\hat{\sigma})$, we have $\hat{\sigma}_{\ell}^{d}\to\hat{\sigma}_{\ell}^{\infty}$ and $\hat{\sigma}_{u}^{d}\to\hat{\sigma}_{u}^{\infty}$ as $d\to\infty$. Hence, for any $\hat{u}>\hat{\sigma}_{u}^{\infty}$ and $\hat{\ell}<\hat{\sigma}_{\ell}^{\infty}$ with $\hat{u}/\hat{\ell}\geqslant\alpha_{\uparrow}/\alpha_{\downarrow}$, there exists $D\in\mathbb{N}$ such that for all $d\geqslant D$ we have $\hat{u}>\hat{\sigma}_{u}^{d}$ and $\hat{\ell}<\hat{\sigma}_{\ell}^{d}$. We now fix $\hat{u}$ and $\hat{\ell}$ in this way, which amounts to selecting $u=dV_{d}\hat{u}$ and $\ell=dV_{d}\hat{\ell}$.
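As a numerical illustration of this dimension scaling (not part of the proof), suppose $p_{0}$ is read as the probability of sampling a strictly improving offspring on the spherical function $f(x)=\|x\|$ with $\Sigma_{t}=I$; this reading, as well as the simplified normalization $\sigma=\hat{\sigma}\|m\|/d$ used below in place of the $\hat{\sigma}/(dV_{d})$ scaling, is an assumption on our part. The sketch estimates the success probability for growing $d$, exhibiting the monotonicity in $\hat{\sigma}$ and the convergence as $d\to\infty$ invoked above.

import numpy as np

rng = np.random.default_rng(1)

def success_prob(sigma_hat, d, n=20_000):
    # Monte Carlo estimate of P(||m + sigma * N(0, I)|| < ||m||) on the sphere,
    # with m = e_1 and the normalization sigma = sigma_hat * ||m|| / d
    m = np.zeros(d); m[0] = 1.0
    sigma = sigma_hat / d
    z = rng.standard_normal((n, d))
    return np.mean(np.linalg.norm(m + sigma * z, axis=1) < 1.0)

for d in (2, 10, 100, 1000):
    print(d, [round(success_prob(s, d), 3) for s in (0.5, 1.0, 2.0, 4.0)])
# each row is decreasing in sigma_hat, and the rows stabilize as d grows,
# mirroring the monotonicity and pointwise convergence used in the proof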

We have $\lim_{d\to\infty}dr^{\prime}=1$ since $\lim_{d\to\infty}d\log(\alpha_{\uparrow}/\alpha_{\downarrow})=\infty$, and hence, according to Lemma 3.2 in [4], we have

$\begin{aligned}
\liminf_{d\to\infty}p^{\prime}&=\liminf_{d\to\infty}\min_{\bar{\sigma}\in[\ell,u]}\left\{p_{r^{\prime}}(\bar{\sigma})\right\}=\liminf_{d\to\infty}\min_{\hat{\sigma}\in[\hat{\ell},\hat{u}]}p_{r^{\prime}}\left(\frac{\hat{\sigma}}{dV_{d}}\right)\\
&\overset{(\star)}{=}\min_{\hat{\sigma}\in[\hat{\ell},\hat{u}]}\lim_{d\to\infty}p_{r^{\prime}}\left(\frac{\hat{\sigma}}{dV_{d}}\right)=\min_{\hat{\sigma}\in[\hat{\ell},\hat{u}]}\Psi\left(-\frac{1}{\hat{\sigma}}-\frac{\hat{\sigma}}{2}\right)\enspace,
\end{aligned}$

where the equality $(\star)$ follows from the pointwise convergence of $p_{r^{\prime}}$ to $\lim_{d\to\infty}p_{r^{\prime}}$ and the continuity of $p_{r^{\prime}}$ and $\lim_{d\to\infty}p_{r^{\prime}}$ (see Footnote 2). Since the last minimum is a strictly positive constant independent of $d$, we conclude $p^{\prime}\in\Theta(1)$ and hence $B\in\Omega(1/d)$.

Footnote 2: Let $\{f_{n}:n\geqslant 1\}$ be a sequence of continuous functions on $\mathbb{R}$ and $f$ be a continuous function such that $f$ is the pointwise limit $\lim_{n}f_{n}(x)=f(x)$ of the sequence. Since they are continuous, the minimizers of $f_{n}$ and $f$ over a compact set $[\ell,u]$ exist. Let $x_{n}=\operatorname{argmin}f_{n}(x)$ and $x^{*}=\operatorname{argmin}f(x)$, where the $\operatorname{argmin}$ is taken over $x\in[\ell,u]$ and we pick one minimizer if there is more than one. It is easy to see that $f_{n}(x_{n})\leqslant f_{n}(x^{*})$, hence $\liminf_{n}f_{n}(x_{n})\leqslant\liminf_{n}f_{n}(x^{*})=f(x^{*})$. Let $\{n_{i}:i\geqslant 1\}$ be the sub-sequence of indices such that $\liminf_{n}f_{n}(x_{n})=\lim_{i}f_{n_{i}}(x_{n_{i}})$. Since $\{x_{n_{i}}:i\geqslant 1\}$ is a bounded sequence, the Bolzano–Weierstraß theorem provides a convergent sub-sequence $\{x_{n_{i_{k}}}:k\geqslant 1\}$, whose limit we denote by $x_{*}$. Of course we have $\liminf_{n}f_{n}(x_{n})=\lim_{k}f_{n_{i_{k}}}(x_{n_{i_{k}}})$. Due to the continuity of $\{f_{n}:n\geqslant 1\}$ and the pointwise convergence to $f$, we have $\lim_{k}f_{n_{i_{k}}}(x_{n_{i_{k}}})=\lim_{k}f_{n_{i_{k}}}(x_{*})=f(x_{*})$. Therefore, $\liminf_{n}f_{n}(x_{n})=f(x_{*})\leqslant f(x^{*})$. Since $x^{*}$ is the minimizer of $f$ in $[\ell,u]$ and $x_{*}\in[\ell,u]$, it must hold that $f(x_{*})\geqslant f(x^{*})$. Hence, $\liminf_{n}f_{n}(x_{n})=f(x^{*})$. This completes the proof.
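Returning to the limit expression: assuming $\Psi$ denotes the standard normal cumulative distribution function (our reading of the notation), the limiting constant $\min_{\hat{\sigma}\in[\hat{\ell},\hat{u}]}\Psi(-\tfrac{1}{\hat{\sigma}}-\tfrac{\hat{\sigma}}{2})$ is straightforward to evaluate numerically; a minimal sketch over an arbitrary illustrative interval:

import numpy as np
from scipy.stats import norm

l_hat, u_hat = 0.5, 4.0  # arbitrary illustrative interval [l_hat, u_hat]
sigma = np.linspace(l_hat, u_hat, 10_001)
vals = norm.cdf(-1.0 / sigma - sigma / 2.0)
print(vals.min())  # strictly positive and independent of d

# -1/s - s/2 is maximized at s = sqrt(2) (where 1/s^2 - 1/2 = 0), so the
# expression is unimodal and the minimum is attained at an endpoint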