Trust-region algorithms: probabilistic complexity and intrinsic noise with applications to subsampling techniques
S. Bellavia,
G. Gurioli,
B. Morini and
Ph. L. Toint
Dipartimento di Ingegneria Industriale, Università degli Studi di Firenze, Italy. Member of the INdAM Research Group GNCS. Email: stefania.bellavia@unifi.it
Dipartimento di Matematica e Informatica “Ulisse Dini”, Università degli Studi di Firenze, Italy. Member of the INdAM Research Group GNCS. Email: gianmarco.gurioli@unifi.it
Dipartimento di Ingegneria Industriale, Università degli Studi di Firenze, Italy. Member of the INdAM Research Group GNCS. Email: benedetta.morini@unifi.it
Namur Center for Complex Systems (naXys), University of Namur, 61, rue de Bruxelles, B-5000 Namur, Belgium. Email: philippe.toint@unamur.be
(21 October 2021)
Abstract
A trust-region algorithm is presented for finding approximate
minimizers of smooth unconstrained functions whose values and derivatives are
subject to random noise. It is shown that, under suitable probabilistic
assumptions, the new method finds (in expectation) an $\epsilon$-approximate minimizer of
arbitrary order $q \geq 1$ in at most $\mathcal{O}\big(\epsilon^{-(q+1)}\big)$
inexact evaluations of the function and its derivatives,
providing the first such result for general optimality orders.
The impact of intrinsic noise limiting the validity of the assumptions is also
discussed and it is shown that difficulties are unlikely to occur in
the first-order version of the algorithm for sufficiently large gradients. Conversely,
should these assumptions fail for specific realizations, then “degraded”
optimality guarantees are shown to hold when failure occurs. These
conclusions are then discussed and illustrated in the context of subsampling
methods for finite-sum optimization.
1 Introduction
This paper is concerned with trust-region methods for solving the
unconstrained optimization problem
$$\min_{x \in \mathbb{R}^n} f(x), \qquad (1.1)$$
where we assume that the values of the objective function and its
derivatives are computed subject to random noise. Our objective is
twofold. Firstly, we introduce a version of the deterministic method proposed
in [10] which is able to handle the random context and
provide, under reasonable probabilistic assumptions, a sharp evaluation
complexity bound (in expectation) for arbitrary optimality order. Secondly, we
investigate the effect of intrinsic noise (that is noise whose level cannot be
assumed to vanish) on a first-order version of our algorithm and prove
“degraded” optimality, should this noise limit the validity of our
assumptions. The new results are then detailed and illustrated in the framework of
finite-sum minimization using subsampling.
Minimization algorithms using adaptive steplength and allowing for random
noise in the objective function or derivatives’ evaluations have already
generated a significant literature
(e.g. [1, 13, 6, 16, 7, 4, 2]).
We focus here on trust-region methods, which are methods in which a trial step
is computed by approximately minimizing a model of the objective function in a
“trust region” where this model is deemed sufficiently accurate. The trial
step is then accepted or rejected depending on whether a sufficient improvement in
objective function value predicted by the model is obtained or not, the radius
of trust-region being then reduced in the latter case. We refer the reader to
[15] for an in-depth coverage of this class of algorithms and
to [17] for a more recent survey. Trust-region methods involving
stochastic errors in function/derivatives values were considered in particular in
[1, 12] and [7, 13],
the latter being the only methods (to the authors’
knowledge) handling random perturbations in both the objective function and
its derivatives. The complexity analysis of the STORM (STochastic Optimization with Random Models)
algorithm described in
[7, 13] is based on supermartingales and makes probabilistic
assumptions on the accuracy of these evaluations which become tighter when the
trust-region radius becomes small. It also hinges on the definition of a
monotonically decreasing “merit function” associated with the stochastic
process corresponding to the algorithm. The method proposed in this paper can
be viewed as an alternative in the same context, but differs from the STORM
approach in several aspects.
The first is that the method discussed here uses a model whose degree is
chosen adaptively at each iteration, requiring the (noisy) evaluation of
higher derivatives only when necessary. The second is that its scope is not
limited to searching for first- and second-order approximate minimizers, but
is capable of computing them to arbitrary optimality order. The third is that
the probabilistic accuracy
conditions on the derivatives’ estimates no longer depend on the trust-region
radius, but rather on the predicted reduction in objective function
values, which may be less sensitive to problem conditioning. Finally,
its evaluation complexity analysis makes no use of
a merit function of the type used in [7].
In [5], the impact of intrinsic random noise on the
evaluation complexity of a deterministic “noise-aware” trust-region
algorithm for unconstrained nonlinear optimization was
investigated and contrasted with that of an inexact version where noise is
fully controllable. The current paper considers the question in the more general
probabilistic framework.
Even if the analysis presented below does not depend in any way on the
choice of the optimality order $q$, the authors are well aware that, while
requests for optimality of orders $q \in \{1, 2\}$ lead to
practical, implementable algorithms, this may no longer be the case for
$q > 2$. For high orders, the methods discussed in the paper therefore constitute an
“idealized” setting (in which complicated subproblems can be approximately
solved without affecting the evaluation complexity) and thus indicate the
limits of achievable results.
The paper is organized as follows. After introducing the new stochastic trust-region algorithm
in Section 2, its evaluation complexity
analysis is presented in Section 3.
Section 4 is then devoted to an in-depth discussion of the impact
of noise on the first-order instantiation of the algorithm, with a particular
emphasis on the case where noise is generated by subsampling in finite-sum
minimization context. Conclusions and perspectives are
finally proposed in Section 5.
Because our contribution borrows ideas from [4],
themselves being partly inspired by [12], repeating some material
from these sources is necessary to keep our argument understandable. We have
however done our best to limit this repetition as much as possible.
Basic notations. Unless otherwise specified, $\|\cdot\|$ denotes the standard
Euclidean norm for vectors and matrices. For a general symmetric tensor $T$
of order $p$, we define $\|T\| \stackrel{\mathrm{def}}{=} \max_{\|v\|=1} |T[v]^p|$,
the induced Euclidean norm. We also denote by $\nabla_x^i f(x)$ the $i$-th
order derivative tensor of $f$ evaluated at $x$ and note that such a tensor is
always symmetric for any $i \geq 2$. $\nabla_x^0 f(x)$ is a synonym for $f(x)$.
$\lceil \alpha \rceil$ denotes the smallest integer not smaller than $\alpha$.
Moreover, given a set $\mathcal{X}$, $|\mathcal{X}|$ denotes its cardinality,
$\mathbb{1}_{\mathcal{X}}$ refers to its indicator function and $\mathcal{X}^c$ indicates its complement.
All stochastic quantities live in a probability space denoted by $(\Omega, \mathcal{F}, \mathbb{P})$, with the probability measure $\mathbb{P}$ and the $\sigma$-algebra $\mathcal{F}$
containing subsets of $\Omega$. We never explicitly define $\Omega$, but specify it through random variables.
$\mathbb{P}[A]$ finally denotes the probability of an
event $A$ and $\mathbb{E}[X]$ the expectation of a random variable $X$.
2 A trust-region minimization method for problems with
randomly
perturbed function values and derivatives
We make the following assumptions on the optimization problem (1.1).
AS.1
The function $f$ is $q$-times continuously
differentiable in $\mathbb{R}^n$, for some $q \geq 1$. Moreover, its $j$-th
order derivative tensor is Lipschitz continuous for $j \in \{1, \ldots, q\}$ in the
sense that, for each such $j$, there exists a constant $L_j \geq 0$
such that, for all $x, y \in \mathbb{R}^n$,
$$\|\nabla_x^j f(x) - \nabla_x^j f(y)\| \leq L_j \|x - y\|. \qquad (2.1)$$
AS.2
$f$ is bounded below in $\mathbb{R}^n$, that is there exists a constant
$f_{\mathrm{low}}$ such that $f(x) \geq f_{\mathrm{low}}$ for all $x \in \mathbb{R}^n$.
AS.2 ensures that the minimization problem (1.1) is well-posed.
AS.1 is a standard assumption in evaluation complexity analysis.(1)
(1) It is well-known that requesting (2.1) to hold for all $x, y \in \mathbb{R}^n$ is strong. The weakest form of AS.1 which we could use in what follows is to require (2.1) to hold for all $x_k$ (the iterates of the minimization algorithm we are about to describe) and all $x_k + \alpha s_k$ (where $s_k$ is the associated step and $\alpha$ is arbitrary in [0,1]). However, ensuring this condition a priori, although maybe possible for specific applications, is hard in general, especially for a non-monotone algorithm with a random element.
It is important because we
consider algorithms that are able to exploit all available derivatives of $f$
and, as in many minimization methods, our approach is based on
using the Taylor expansions of $f$ (now of degree $j$ for $j \in \{1, \ldots, q\}$) given by
$$T_{f,j}(x, s) \stackrel{\mathrm{def}}{=} f(x) + \sum_{\ell=1}^{j} \frac{1}{\ell!} \nabla_x^{\ell} f(x)[s]^{\ell}, \qquad (2.2)$$
which, under AS.1, satisfy the error bound
$$\big| f(x+s) - T_{f,j}(x, s) \big| \leq \frac{L_j}{(j+1)!} \|s\|^{j+1}. \qquad (2.3)$$
At a given iterate $x_k$ of our algorithm, we will be interested in finding a
step $s$ which makes the Taylor decrements
$$\Delta T_{f,j}(x_k, s) \stackrel{\mathrm{def}}{=} T_{f,j}(x_k, 0) - T_{f,j}(x_k, s) \qquad (2.4)$$
large (note that $\Delta T_{f,j}(x_k, s)$ is independent of $f(x_k)$). When this is
possible, we anticipate from the approximating properties of the Taylor
expansion that some significant decrease is also possible in . Conversely,
if cannot be made large in a neighbourhood of , we
must be close to an approximate minimizer. More formally, we define, for some
$j \in \{1, \ldots, q\}$ and some optimality radius $\delta \in (0, 1]$,
the measure
$$\phi_{f,j}^{\delta}(x) \stackrel{\mathrm{def}}{=} \max_{\|d\| \leq \delta} \Delta T_{f,j}(x, d), \qquad (2.5)$$
that is the maximal decrease in $T_{f,j}$ achievable in a ball of radius $\delta$
centered at $x$. (The practical purpose of introducing $\delta$ is to avoid
unnecessary computations, as discussed below.) We then define $x$ to be a $q$-th order
$(\epsilon, \delta)$-approximate minimizer (for some accuracy requests
$\epsilon = (\epsilon_1, \ldots, \epsilon_q) \in (0, 1]^q$) if and only if
$$\phi_{f,j}^{\delta}(x) \leq \epsilon_j \frac{\delta^j}{j!} \quad \text{for all } j \in \{1, \ldots, q\} \qquad (2.6)$$
(a vector $d$ solving the optimization problem defining $\phi_{f,j}^{\delta}(x)$
in (2.5) is called an optimality displacement) [8, 10].
In other words, a -th order -approximate
minimizer is a point from which no significant decrease of the Taylor
expansions of degree one to can be obtained in a ball of optimality radius
. This notion is coherent with standard optimality measures for low
orders(2) and has the advantage of
being well-defined and continuous in $x$ for every order.
Note that $\phi_{f,j}^{\delta}(x)$ is always non-negative.
(2) It is easy to verify that, irrespective of $\delta$, (2.6) holds for $j = 1$ if and only if $\|\nabla_x^1 f(x)\| \leq \epsilon_1$ and that, if $\nabla_x^1 f(x) = 0$, (2.6) holds for $j = 2$ if and only if $\lambda_{\min}(\nabla_x^2 f(x)) \geq -\epsilon_2$.
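For $j = 1$, the measure (2.5) is available in closed form: the degree-one Taylor decrement $-\nabla_x^1 f(x)^T d$ is maximized over the ball $\|d\| \leq \delta$ at $d = -\delta \nabla_x^1 f(x)/\|\nabla_x^1 f(x)\|$, giving $\phi_{f,1}^{\delta}(x) = \delta \|\nabla_x^1 f(x)\|$. A small numerical sketch (the helper name is ours, not the paper's) checks this closed form against a brute-force maximization over sampled displacements:

```python
import numpy as np

def phi_1(grad, delta):
    """First-order instance of the optimality measure (2.5):
    max over ||d|| <= delta of the decrement -grad @ d,
    attained at d = -delta * grad / ||grad||."""
    return delta * np.linalg.norm(grad)

rng = np.random.default_rng(0)
g, delta = rng.standard_normal(3), 0.5

# brute-force check: sample many displacements on the sphere of radius delta
D = rng.standard_normal((200_000, 3))
D = delta * D / np.linalg.norm(D, axis=1, keepdims=True)
brute = np.max(-D @ g)   # best sampled decrease, slightly below phi_1(g, delta)
```

For higher degrees no such closed form exists in general, which is one reason the algorithm below only asks for an approximate optimality displacement.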
This paper is concerned with the case where the values of the objective
function and of its derivatives are subject to random
noise and can only be computed inexactly (our assumptions on random noise will
be detailed below). Our notational convention will be to
denote inexact quantities with an overbar, so $\overline{f}(x, \xi)$ and
$\overline{\nabla_x^j f}(x, \xi)$ denote inexact values of $f(x)$ and $\nabla_x^j f(x)$, where $\xi$ is a random variable causing inexactness. Thus (2.2)
and (2.4) are unavailable, and we have to consider
and the associated decrement
(2.7)
instead. For simplicity, we will often omit to mention the dependence of inexact
values on the random variable in what follows, so (2.7)
is rewritten as
(2.8)
This in turn would require that we measure optimality using
(2.9)
instead of (2.5). However, computing this exact global maximizer may be
costly, so we choose to replace the computation of (2.9) by an
approximation, that is with the computation of an optimality displacement
with such that
for some constant .
We state the Trust-Region with Noisy Evaluations (TRNE) algorithm as Algorithm 2.1,
using all the ingredients we have described. The trust region radius at iteration is denoted by
instead of the standard notation .
Algorithm 2.1: The TRNE algorithm

Step 0: Initialisation. A criticality order , a starting point and accuracy levels are given. For a given constant , define (2.10). The constants , , , and an initial trust-region radius are also given. Set .

Step 1: Derivatives estimation. Set . For ,
1. Compute derivatives’ estimates and find an optimality displacement with such that (2.11).
2. If (2.12), go to Step 2 with . Set .

Step 2: Step computation. If , set and . Otherwise, compute a step such that and (2.13).

Step 3: Function decrease estimation. Compute the estimate of .

Step 4: Test of acceptance. Compute (2.14). If (successful iteration), then set ; otherwise (unsuccessful iteration) set .

Step 5: Trust-region radius update. Set
Increment by one and go to Step 1.
A feature of the TRNE algorithm is that it uses an
adaptive strategy (in Step 1) to choose the model’s degree in view of the
desired accuracy and optimality order. Indeed, the model of the objective
function used to compute the step is , whose degree
can vary from one iteration to the next, depending on the “order of (inexact)
optimality” achieved at (as determined by Step 1). Also observe that,
if the trust-region radius is small (that is ), the
optimality displacement is an approximate global minimizer of
the model within the trust region, which justifies the choice
in this case. If , the step computation is
allowed to be fairly approximate as the only requirement for a step in the
trust region is (2.13). This can be interpreted as a
generalization of the familiar notions of “Cauchy” and “eigen” points
(see [15, Chapter 6]). In addition, note that, while nothing
guarantees that , the mechanism of the
algorithm ensures that .
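The adaptive choice of the model degree in Step 1 can be sketched as follows for $q = 2$: derivative estimates of increasing order are computed, and the loop stops at the first order whose inexact optimality measure is sufficiently large. The threshold test and all names below are illustrative stand-ins, since test (2.12) and its constants are not reproduced here:

```python
import numpy as np

def taylor_decrement(derivs, d):
    """Inexact Taylor decrement in the spirit of (2.8), here up to degree 2:
    -g @ d - 0.5 * d @ H @ d, using whatever estimates are available."""
    dec = -derivs[0] @ d
    if len(derivs) > 1:
        dec -= 0.5 * d @ derivs[1] @ d
    return dec

def choose_order(estimate_deriv, x, q, eps, delta, rng):
    """Sketch of Step 1: stop at the first order j whose approximate
    optimality measure exceeds a threshold (a stand-in for test (2.12))."""
    derivs, phi = [], 0.0
    # crude optimality displacements: sample directions on the delta-sphere
    D = rng.standard_normal((5000, x.size))
    D = delta * D / np.linalg.norm(D, axis=1, keepdims=True)
    for j in range(1, q + 1):
        derivs.append(estimate_deriv(x, j))       # (noisy) j-th derivative
        phi = max(taylor_decrement(derivs, d) for d in D)
        if phi > eps[j - 1] * delta ** j:         # simplified threshold
            return j, phi                         # model degree for Step 2
    return q, phi

# saddle-like point: zero gradient, negative curvature -> degree 2 is chosen
exact = {1: np.zeros(2), 2: np.diag([-1.0, 1.0])}
j, phi = choose_order(lambda x, j: exact[j], np.zeros(2), q=2,
                      eps=[0.1, 0.1], delta=1.0, rng=np.random.default_rng(0))
```

On this test point the first-order measure vanishes, so the sketch escalates to a degree-two model, mirroring the algorithm's use of higher (noisy) derivatives only when necessary.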
The TRNE algorithm generates a random process. Randomness occurs
because of the random noise present in the Taylor decreases and objective
function values, the former resulting itself from the randomly perturbed
derivatives values and, as the algorithm proceeds, from the random
realizations of the iterates and steps . In the following analysis,
uppercase letters denote random quantities, while lowercase ones denote realizations of these random quantities. Thus,
given , , , etc.
In particular, we distinguish
•
, the value at a (deterministic) of the exact Taylor
decrement, that is of the Taylor decrement using the exact values of its derivatives at ;
•
, the value at a (deterministic) of an inexact Taylor
decrement, that is of a Taylor decrement using the inexact values of
its derivatives (at ) resulting from the realization of random
noise;
•
, the random variable corresponding to the exact Taylor
decrement taken at the random variables ;
•
, the random variable giving the value of the Taylor
decrement using randomly perturbed derivatives, taken at
the random variables .
Analogously, and denote the random variables associated with
the estimates of and , with their realizations and .
Similarly, the iterates , as well as the trust-region radii , the
indices , the
optimality radii , displacements
and the steps are random variables while
, , , , , and denote their realizations.
Hence, the TRNE algorithm generates the random process
(2.15)
in which (the initial guess) and (the initial trust-region
radius) are deterministic quantities, and where
2.1 The probabilistic setting
We now state our probabilistic assumptions on the TRNE algorithm. For , our assumption on the past is formalized by considering the
-algebra induced by the random variables , ,…,
and , , , , …, , and let
be that induced by , ,…, and ,
, …, , , with .
We first define an event ensuring that the model is accurate enough at
iteration . At the end of Step 2 of this iteration and given
, we now define,
(2.16)
The event occurs when the -th order optimality measure
() at iteration is meaningful, while occurs when
this is the case for the model decrease. At first sight, these events may
seem only vaguely related to the accuracy of the function’s derivatives but a
closer examination gives the following sufficient condition for to
happen.
Lemma 2.2
At iteration of any realization, the inequalities
defining the event are satisfied if,
for and (2.17) and (2.18), where (2.19).
Proof.
If (2.18) holds, we have that, for every and
,
with , , given in (2.19),
(2.20)
where we have used the bound .
Now note that the definition of
in (2.9), (2.20) for
and (2.11) imply that, for any ,
Hence the inequality in the definition of holds for
. The proof of the inequalities defining is
analogous to that of (2.20). We have from (2.17) that
(2.21)
where we have again used the bound .
This result immediately suggests a few comments.
•
The conditions (2.17)-(2.18) are merely sufficient,
not necessary. In particular, they ignore any possible cancellation of errors
between terms of the Taylor expansion of different degree.
•
We note that (2.17)-(2.18) require the -th
derivative to be relatively accurate along a finite and limited set of
directions, independent of problem dimension.
•
Since and are bounded by , we also note that the accuracy required by these conditions
decreases when the degree increases. Moreover, for a fixed degree,
the request is weaker for small displacements (a typical situation when a
solution is approached) than for large ones.
•
Requiring
(2.22)
instead of (2.18) is of course again sufficient to ensure the desired conclusions.
These conditions are reminiscent of the conditions required in
[7] for the STORM algorithm with , namely that,
for some constant and all in the trust-region
,
This latter condition is however stronger
than (2.17)–(2.18)
because it insists on a uniform accuracy guarantee in the full-dimensional
trust region.
Having considered the accuracy of the model, we now turn to that on the
objective function’s values. At the end of Step 3 of the $k$-th iteration,
we define the event
(2.23)
where and . This occurs when the differences in function values used in
the course of iteration are reasonably accurate relative to the model decrease
obtained in that iteration. Note that, because of the triangle inequality,
so that must occur if both terms on the right-hand side are bounded
above by .
Combining accuracy requests on model and function values, we define
(2.24)
and say that iteration is accurate if
and the iteration is inaccurate if
.
Moreover, we say that the iteration has accurate model if
and that iteration has accurate function estimates if .
Finally we let
We will verify in what follows that the TRNE algorithm does progress towards
an approximate minimizer satisfying (2.6) as long as the following holds.
AS.3
There exists such that ,
(2.25)
where is the event .
We notice that due to the tower property for conditional expectations
Assuming AS.3 is not unreasonable as it merely requires that an accurate model and accurate functions
“happen more often than not”, and that the discrepancy between true and
inexact function values at successful iterations does not, on average, prevent
decrease of the objective function. If either of these
conditions fails, it is easy to imagine that the TRNE algorithm could be completely
hampered by noise and/or diverge completely. Because the last condition in
(2.25) is less intuitive, we now show that it can be realistic in the
specific context where reasonable assumptions are made on the (possibly
extended) cumulative distribution of the error on the function decreases
(conditioned on ).
Theorem 2.3
Let be a differentiable monotone
increasing random function which is measurable for and such that (2.27), (2.28), (2.29) and (2.30) hold for . Then (2.31) holds for each such that
Proof.
Consider , an arbitrary realization of the stochastic process defined
by the TRNE algorithm.
Suppose first that .
We then deduce that
(2.32)
Assume therefore that
(2.33)
for some . To further simplify notations, set
(2.34)
If we define ,
the definition of successful iterations, (2.14) and the
triangle inequality then imply that, if
then
(2.35)
This in turn ensures that
(2.36)
Moreover, we have that
where we used the fact that whenever
happens, (2.33) to derive the second equality, and the bound
to obtain the final inequality.
Now, (2.30) implies that, for
where ,
and thus
Then, employing (2.27)–(2.30), and integrating by parts
Note that the assumptions of the theorem are for instance satisfied for
the exponential case where for and measurable
for . We will return to this result in Section 4 and
discuss there the condition that should be
sufficiently large.
3 Worst-case evaluation complexity
We now turn to the evaluation complexity analysis for the TRNE algorithm, whose aim is to derive
a bound on the expected number of iterations for which optimality fails. This
number is given by
(3.1)
We first state a crucial lower bound on the model decrease, in the spirit of
[10, Lemma 3.4].
Lemma 3.1
Consider any realization of the algorithm and assume that occurs.
Assume that (2.6) fails at iteration .
Then, there exists a such that
at Step 1 of the iteration. Moreover, (3.2) holds, where (3.3).
Proof.
We proceed by contradiction and assume that
(3.4)
for all . Since occurs, we have that, for all ,
which contradicts the assumption that (2.6) does not hold for
and . The bound (3.2)
directly results from
where we have used (2.13) to derive the first inequality
and the definition (3.3) to obtain the equality.
The
rightmost inequality in (3.3) trivially follows from the negation of (3.4) and
(2.10).
We now search for conditions ensuring that the iteration is successful.
For simplicity of notation, given , , as in (2.1),
we define
(3.5)
Lemma 3.2
Suppose that AS.1 holds. Consider any realization of the algorithm and
suppose that (2.6) does not hold for and and that
occurs. If (3.6)
holds,
then iteration is successful.
Proof.
First, note that the minimum in (3.6) is attained at since and . Suppose now that (3.6) holds, which implies that . Let be as in Lemma 3.1, and
denote , .
Using (2.14), the triangle inequality and ,
we obtain
Invoking (2.3), the bound ,
(3.5), (3.2) and we get
The following crucial lower bound on for
accurate iterations can now be proved.
Lemma 3.3
Suppose that AS.1 holds. Consider any realization of the algorithm and
suppose that (2.6) does not hold for and ,
occurs, and that with defined in (3.6). Then (3.8) holds, where is defined by (3.9), with defined in (3.6).
Proof.
Let be as in Lemma 3.1. By
(3.2), (3.3) we obtain
If then and the bound implies
Thus (3.8) holds by definition
of and the fact that .
If , then .
The proof is completed by noting that the form of in (3.6)
gives that .
3.1 Bounding the expected number of steps with
We now return to the general stochastic process generated by the TRNE algorithm
and bound the expected number of steps in from above. For this
purpose, let us define, for all ,
the events
be the number of steps, in the stochastic process induced by the TRNE algorithm and before
, such that or ,
respectively. In what follows we suppose that AS.1–AS.2 hold.
An upper bound on can be derived as follows.
(i)
We apply [12, Lemma 2.2] to deduce that, for any
and for all realizations of Algorithm 2.1, one has that
(3.11)
(ii)
Both
and
belong to , because the random variable is fully determined
by the first iterations of the TRNE algorithm. Setting we can rely on
[12, Lemma ] (with ), whose proof is detailed in the appendix, to deduce that
(3.12)
(iii)
As a consequence, given that Lemma 3.2
ensures that each iteration where occurs and
is successful, we have that
in which the last inequality follows from (3.11), with
. Taking expectation in the above inequality, using
(3.12) and recalling the rightmost definition in (3.10),
we obtain, as in [12, Lemma 2.3], that
(3.13)
3.2 Bounding the expected number of steps with
For analyzing , where is defined in
(3.10), we now introduce the following variables.
Definition 1
Consider the random process (2.15) generated by
the TRNE algorithm and define:
(3.14)
Observe that is the “closure” of in
that the inequality in its definition is no longer strict.
We immediately notice that an upper bound on
is available, once an upper bound on is
known, since
(3.15)
Using again [12, Lemma ] (with
) to give an upper bound on
, we obtain the following result, whose proof is detailed in the appendix.
Lemma 3.4
[12, Lemma 2.6]
Suppose that AS.1–AS.3 hold and let , be defined as in Definition 1 in the context of the stochastic process (2.15)
generated by the TRNE algorithm. Then, (3.16)
Turning to the upper bound for , we observe that
(3.17)
Hence, bounding can be achieved by providing upper bounds on
and . Regarding the latter, we first
note that the process induced by the TRNE algorithm ensures that
is increased by a factor on successful steps and decreased by the
same factor on unsuccessful ones.
Consequently, based on [12, Lemma ], we obtain the
following bound.
Lemma 3.5
For any and for
all realizations of Algorithm 2.1, we have that
From the inequality stated in the previous lemma with , recalling Definition
1 and taking expectations, we therefore obtain that
(3.18)
An upper bound on is given by the following lemma.
Lemma 3.6
Suppose that AS.1, AS.2 and AS.3 hold.
Then we have that(3.19)where is defined in (3.9).
Proof.
Consider any realization of the TRNEalgorithm.
i)
If iteration is successful and the
functions are accurate (i.e., )
then (2.14), (2.10) and (2.23) imply that
(3.20)
Thus
(3.21)
where . Moreover, if also occurs
with (i.e., if ) and
(2.6) fails for and , we may then use
(3.8) to deduce from (3.20) that
While inequalities (3.18) and (3.19) provide upper bounds on
and , as desired, the first still
depends on , which has to be bounded from above as well. This
can be done by following [12] once more:
Definition 1, (3.16) and (3.17) directly imply that
(3.31)
and hence
(3.32)
follows from (3.18) (remember that ). Thus, the
right-hand side in (3.16) is in turn bounded above because of (3.17), (3.18),
(3.32) and (3.19), giving
(3.33)
This inequality, together with (3.15) and (3.16), finally
gives the desired bound
(3.34)
We can now express our final complexity result in full.
Theorem 3.7
Suppose that AS.1–AS.3 hold. Then (3.35) holds, with , and defined as in
(3.1), (3.6) and (3.9), respectively.
Proof.
Recalling the definitions (3.10) and the bound (3.13), we obtain that
This bound and the inequality yield the desired result.
We note that the evaluation bound given by
(3.35) is known to be sharp in order for
trust-region methods using exact evaluations of functions and derivatives (see
[11, Theorem 12.2.6]), which implies that
Theorem 3.7 is also sharp in order.
We conclude this section by noting that alternatives to the second
part of (2.25) do exist. For instance, we could assume that
for some . This condition can be used to replace the second part
of (2.25) to ensure (3.26) in the proof of Lemma 3.6 and all
subsequent arguments.
4 The impact of noise for first-order minimization
While the above theory covers inexact evaluations of the objective function
and its derivatives, it does rely on AS.3. Thus, as long as
the inexactness/noise on these values remains small enough for this assumption
to hold, the iterates of the TRNE algorithm ultimately
reach an approximate local minimizer. There are however practical
applications, such as the minimization of finite sums using sampling strategies
(discussed in more detail below), where AS.3 may be unrealistic because of
noise intrinsic to the application. We already
saw that, under the assumptions of Theorem 2.3, a large enough
value of is sufficient for ensuring the third
condition in AS.3, but we also know from (2.23), (2.34), (2.35) and the definition of
that, at successful iterations,
whenever holds. Thus a large is only
possible if either is large or fails. But a large
is impossible close to a
(global) minimizer, and thus either (and AS.3) fails, or the guarantee
that the third condition of AS.3 holds vanishes when approaching a minimum.
Clearly, the above theory does not say anything about what happens for the
algorithm once AS.3 fails due to intrinsic noise. Of course, this does not
mean that it will not proceed meaningfully, but we cannot guarantee it.
In order to improve our understanding of what can happen, we need to
consider particular realizations of the iterative process where AS.3 fails.
This is the objective of this section, where we focus on the instantiation
TR1NE of TRNE for first-order optimality.
Fortunately, limiting one’s ambition to first order results in substantial
simplifications in the TRNE algorithm.
We first note that the mechanism of Step 1 of TRNE (whose purpose is to
determine ) is no longer necessary since must be equal to one if
only (approximate) gradients are available, so we can implicitly set
and immediately branch to the step computation. This in turn simplifies
to
irrespective of the value of , and (2.13) automatically
holds. We thus observe that the simplified algorithm no longer needs
(since neither nor
needs to be effectively calculated), that the computed step is the global minimizer of the model within the trust region and that the
constant (used in Step 1 and at the start of Step 2 of the TRNE algorithm) is no
longer necessary. The resulting streamlined TR1NE algorithm is stated
as Algorithm 4.1.
Algorithm 4.1: The TR1NE algorithm

Step 0: Initialisation. A starting point , a maximum radius and an accuracy level are given. The initial trust-region radius is also given. For a given constant , define . Set .

Step 1: Derivatives estimation. Compute the derivative estimate .

Step 2: Step computation. Set .

Step 3: Function decrease estimation. Compute the estimate of .

Step 4: Test of acceptance. Compute
If (successful iteration), then set ; otherwise (unsuccessful iteration) set .

Step 5: Trust-region radius update. Set
Increment by one and go to Step 1.
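The resulting loop is short enough to sketch in full. The version below uses additive Gaussian noise on the function and gradient evaluations and illustrative constants (`gamma`, `eta`); it is a sketch of the first-order trust-region mechanism under these assumptions, not the paper's exact TR1NE specification:

```python
import numpy as np

def tr1ne_sketch(f, grad, x0, eps=1e-4, delta0=1.0, delta_max=10.0,
                 gamma=2.0, eta=0.25, noise=0.0, max_iter=500, seed=0):
    """Minimal first-order trust-region loop with inexact evaluations."""
    rng = np.random.default_rng(seed)
    x, delta = np.asarray(x0, dtype=float), delta0

    def fbar(z):                      # noisy function value estimate
        return f(z) + noise * rng.standard_normal()

    for _ in range(max_iter):
        g = grad(x) + noise * rng.standard_normal(x.size)  # noisy gradient
        ng = np.linalg.norm(g)
        if ng <= eps:                 # inexact first-order measure is small
            break
        s = -delta * g / ng           # model minimizer on the ball (Step 2)
        rho = (fbar(x) - fbar(x + s)) / (delta * ng)       # ratio (Step 4)
        if rho >= eta:                # successful: accept and enlarge
            x, delta = x + s, min(gamma * delta, delta_max)
        else:                         # unsuccessful: shrink the radius
            delta /= gamma
    return x

# noiseless sanity check on a convex quadratic
x_star = tr1ne_sketch(lambda z: z @ z, lambda z: 2.0 * z, [2.0, 1.0])
```

With `noise = 0` the loop behaves like a standard first-order trust-region method and drives the gradient norm below `eps`; with intrinsic noise, the acceptance ratio eventually becomes unreliable once the true decrease is comparable to the noise level, which is precisely the regime discussed in this section.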
The definition of the event in (2.16) ensures that
implies when first-order models are considered,
and thus, using also (2.23), that and then reduce to
and
respectively. Observe now that, because of the triangle inequality,
is true whenever the event
(4.1)
holds, and, since
,
it also follows that is true whenever the event
(4.2)
holds. As a consequence,
(4.3)
Our analysis of the impact of noise on the TR1NE algorithm starts by
considering a relatively general form for error distributions (as we did
in Theorem 2.3) and we then specialize our arguments to the
particular case of finite sum minimization with subsampling.
4.1 Failure of AS.3 for general error distributions
At a generic iteration ,
suppose that and , are continuous and increasing random functions
from to which are measurable for and such that ,
and,
(4.4)
For the sake of simplicity, assume in AS.3
and let and such that and
, and . Then,
(4.5)
(4.6)
Define
(4.7)
and note that is measurable for . Then (4.5) and (4.6) ensure that
Theorem 4.2
Let be as in (4.7). Then, for each iteration of
the TR1NE algorithm, (4.14)
Proof.
For any realization of the TRNE algorithm we have that
Therefore, (where is the realization of ) and
ensure that
Then, implies , where the events
, and are defined in
(4.10)-(4.12), and
. We
therefore have that
Theorem 4.3
Let be defined by (4.7). Then, for each iteration of
the TR1NE algorithm, (4.19). Moreover, if is a realization for which
, then (4.20)
Proof.
The proof is similar to that of Theorem 4.2 and is
given in the appendix for completeness.
Theorems 4.2 and 4.3 indicate that the
assumptions made in AS.3 about and are likely to be
satisfied as long as the gradients remain sufficiently large, allowing the
TR1NE algorithm to iterate meaningfully. Conversely,
they show that, should these assumptions fail for a particular realization of
the algorithm because of a high level of intrinsic noise, “degraded”
versions of first-order optimality conditions given by (4.16) and
(4.20) nevertheless hold when this failure occurs.
4.2 A subsampling example
We finally illustrate how intrinsic noise might affect our probabilistic
framework through an example.
Suppose that
$$f(x) = \frac{1}{N} \sum_{i=1}^{N} f_i(x), \qquad (4.21)$$
where the $f_i$ are functions from $\mathbb{R}^n$ to $\mathbb{R}$ having Lipschitz continuous
gradients and where $N$ is so large that computing the complete value of
$f$ or its derivatives is impractical. Such a situation occurs for
instance in algorithms for deep-learning, an application of growing
importance. A well-known strategy to obtain approximations of the desired
values at an iterate $x_k$ is to sample the $f_i$ and compute the sample
averages, that is
$$\overline{f}(x_k) = \frac{1}{|\mathcal{B}_k|} \sum_{i \in \mathcal{B}_k} f_i(x_k) \quad \text{and} \quad \overline{\nabla_x^1 f}(x_k) = \frac{1}{|\mathcal{G}_k|} \sum_{i \in \mathcal{G}_k} \nabla_x^1 f_i(x_k), \qquad (4.22)$$
where $\mathcal{B}_k$ and $\mathcal{G}_k$ are realizations of random
“batches”, that is randomly selected(3) subsets of
$\{1, \ldots, N\}$. ((3) With uniform distribution.) Observe that Step 3 of the TR1NE algorithm
computes the estimate , which we assume, in the
context of (4.21), to be
(4.23)
(using a single batch for both the function estimates). Observe that our choice to make and
dependent on implies that their random counterparts and
are measurable for (clearly we could have chosen a
more complicated dependence on the past of the random process). The mean-value
theorem then yields that, for some
in the segment ,
Note that one expects the right-hand side of this inequality
to be quite small when the trust-region radius is small or when convergence to
a local minimizer occurs and is a reasonable
approximation of . To simplify our illustration, we assume,
for the rest of this section, that there exists a constant
such that for any , for every realization
,
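The sample-average estimates (4.22) can be sketched in a few lines; the component functions, batch size and uniform sampling routine below are illustrative assumptions rather than the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def subsampled_estimates(fs, grads, x, batch_size, rng):
    """Sample-average estimates of the objective and its gradient over a
    uniformly drawn batch, as in (4.22); fs and grads hold the component
    functions f_i and their gradients (illustrative stand-ins)."""
    n = len(fs)
    batch = rng.choice(n, size=batch_size, replace=False)  # uniform batch
    f_est = sum(fs[i](x) for i in batch) / batch_size
    g_est = sum(grads[i](x) for i in batch) / batch_size
    return f_est, g_est

# toy finite sum: f_i(x) = (x - c_i)^2, so grad f_i(x) = 2 (x - c_i)
centers = np.linspace(-1.0, 1.0, 100)
fs = [lambda x, c=c: (x - c) ** 2 for c in centers]
grads = [lambda x, c=c: 2.0 * (x - c) for c in centers]

f_est, g_est = subsampled_estimates(fs, grads, 0.5, batch_size=30, rng=rng)
```

Since the batch is drawn uniformly without replacement, each estimate is unbiased for the full average, with variance shrinking as the batch size grows.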
Returning to the random process and using the Bernstein concentration
inequality, it follows from [3, Relation (7.8)] that,
for any and deterministic ,
(4.24)
Similarly,
(4.25)
for some constant . One also checks that, since and
are measurable for , so are and .
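A bound of the form (4.24) is a Bernstein-type concentration inequality for sums of bounded, zero-mean terms. As a minimal sketch, one can check the classical Bernstein inequality for bounded i.i.d. variables empirically (a stand-in for the specific relation of [3]; the distribution and constants below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# i.i.d. zero-mean terms bounded by b, in a Bernstein-type setting
n, b, t, trials = 200, 1.0, 0.15, 20000
samples = rng.uniform(-b, b, size=(trials, n))  # mean 0, variance b^2/3

# empirical frequency of the tail event |sample mean| >= t
empirical = np.mean(np.abs(samples.mean(axis=1)) >= t)

# classical Bernstein bound: P(|S_n / n| >= t) <= 2 exp(-n t^2 / (2 s^2 + 2 b t / 3))
sigma2 = b ** 2 / 3.0
bernstein = 2.0 * np.exp(-n * t ** 2 / (2.0 * sigma2 + 2.0 * b * t / 3.0))
```

The empirical tail frequency stays below the Bernstein bound, which is itself small for moderate sample sizes.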
One then easily verifies that is an increasing function of ,
and hence is decreasing. Letting
, we immediately obtain that conditions (2.27) and (2.28)
hold. Let us now analyze condition (2.29) and consider any realization
, where . Note that
when
(4.26)
and is large enough so that
(4.27)
Hence for all
and
In addition, since is decreasing and non-negative, we have that
proving that (2.29) also holds for the arbitrary realization . We
may therefore apply Theorem 2.3 provided is sufficiently
large (while the bound given by (4.28) is adequate for our proof, this
inequality can be pessimistic: for instance, if we set
and , the numerically computed
value of the left-hand side is 0.0556 while that of the right-hand side is 0.1818)
to ensure (4.27) and (4.26), and conclude that, under these conditions,
(2.31) holds whenever
We can also apply the analysis in Section 4.1 with
A short calculation shows that
and
, where and are
defined below (4.4). Then,
Theorems 4.2 and 4.3 hold with .
We finally illustrate the impact of intrinsic noise on the (admittedly ad-hoc)
problem of minimizing
(4.29)
where is a noise level and where
for some large integer . Suppose furthermore that the and
are computed by black-box routines, therefore
hiding their relationships.
Consider an iterate at the start of iteration of an arbitrary
realization of the TRNE algorithm (with given and ) applied
to this problem. We verify that, for ,
for all . As a consequence, is the unique global minimizer of
.
Suppose, for the rest of this section, that
,
, and that
and , the cardinalities of
these two sets, are known parameters. We deduce from (4.22) that
(4.31)
Thus is a zero-mean random variable with values in
, depending on the randomly chosen batch
of size . Using the hypergeometric distribution, it is
possible to show that is (in probability) a
decreasing function of .
Moreover, the use of standard tail bounds [14] reveals that, for any
,
(4.32)
in turn indicating that whenever
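The tail bound (4.32) obtained from [14] is of Hoeffding type for the hypergeometric distribution. A small numerical check, with illustrative population size, success count and deviation threshold, confirms both the bound and the claimed decrease of the tail probability with the batch size:

```python
from math import comb, exp

def hypergeom_tail(N, K, n, k):
    """P[X >= k] for X ~ Hypergeometric(N, K, n): n draws without
    replacement from a population of N items, K of which are 'successes'."""
    denom = comb(N, n)
    return sum(comb(K, j) * comb(N - K, n - j)
               for j in range(k, min(K, n) + 1)) / denom

N, K = 1000, 500          # illustrative population and success count
p, t = K / N, 0.1         # success proportion and deviation threshold
tails, bounds = [], []
for n in (50, 100, 200):  # increasing batch sizes
    k = int((p + t) * n) + 1
    tails.append(hypergeom_tail(N, K, n, k))
    # Chvatal's Hoeffding-type bound [14]: P[X >= (p + t) n] <= exp(-2 t^2 n)
    bounds.append(exp(-2.0 * t ** 2 * n))
```

Each exact tail probability lies below the corresponding bound, and both decrease as the batch size grows, consistent with the discussion above.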
Occurrence of and .
Let us now examine under what conditions the events and
occur for a specific realization of
, and consider the occurrence of first.
Because the minimum of first-order models in a ball of radius must
occur on the boundary, we choose
so that
Thus the quantity may be
interpreted as the local noise relative to the model decrease.
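A short derivation of this boundary claim, in standard trust-region notation (the symbols here are our assumption):

```latex
% First-order (linear) model over the trust region:
%   m_k(s) = f(x_k) + g_k^\top s,  \qquad \|s\| \le \Delta_k .
% Cauchy--Schwarz gives g_k^\top s \ge -\|g_k\| \, \|s\| \ge -\|g_k\| \, \Delta_k,
% with equality at the boundary point
\[
  s_k = -\,\Delta_k \,\frac{g_k}{\|g_k\|},
  \qquad\text{so that}\qquad
  m_k(0) - m_k(s_k) = \Delta_k\,\|g_k\| .
\]
```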
Taking (4.30) and (4.33) into account, occurs, in any
realization, whenever
(4.34)
that is
This condition may be quite weak, as shown in the left picture of
Figure 4.1, where the shape of the left-hand side of (4.34)
is shown for increasing values (magenta for 0.5, blue for 4/3 and cyan
for 4) of the local noise level as a function
of , and where the lower bound is shown as a
red horizontal dashed line. The corresponding ranges of acceptable values of
are shown below the horizontal axis (in matching
colours). The one-sided nature of the inequality defining is
apparent in the picture, where restrictions on the acceptable values of
only occur for positive values. This reflects the fact that
the model may be quite inaccurate and yet produce a decrease which is large
enough for the condition to hold.
Figure 4.1:
An illustration of conditions (4.34) (left) and
(4.36) (right) as a function of ,
for and local relative noise levels (magenta), (blue) and (cyan). Acceptable ranges for
are shown below the horizontal axis in matching colours.
The constraints on , and thus on , become
more stringent when considering the occurrence of . Since, for
any realization,
,
we deduce from (4.31) that
and
(4.35)
One then verifies that occurs whenever
(4.36)
The acceptable values for are illustrated in the right
picture of Figure 4.1, which
shows the shape of the central term in (4.36) using
the same conventions as for the
left picture, except that now the acceptable part of the curves lies between
the lower and upper bounds resulting from (4.36) (again shown as
dashed red lines).
A short calculation reveals that (4.36) is equivalent to requiring
This therefore defines intervals around the origin, whose
widths clearly decrease with the local relative noise level.
Because is (in probability) a decreasing function of
, this indicates that must increase with
, that is when the local relative noise is large.
Occurrence of .
A similar reasoning holds when considering the event .
Given (4.23), we have that
Thus, if is small (e.g., if the optimum is close),
then satisfying (4.38) requires the left-hand side of this inequality
to be small, imposing a strong requirement on , while the inequality
is more easily satisfied if is large, irrespective of the batch sizes.
Note that, in the first case (i.e., when is small), the requirement on
is stronger for smaller .
Occurrence of (2.31).
Given (4.32) and (4.35), we see from
Theorem 2.3 that (2.31) holds whenever
where we have used the definition of the erf function to derive the last
equality. Thus, as , guaranteeing (2.31)
requires a larger for small values of , that is
when the optimum is approached.
5 Conclusions and perspectives
We have considered a trust-region method for unconstrained minimization
inspired by [10] which is adapted to handle randomly perturbed
function and derivatives values and is capable of finding approximate
minimizers of arbitrary order. Exploiting ideas of
[12, 7], we have shown that its evaluation
complexity is (in expectation) of the same order in the requested accuracy as
that known for related deterministic methods
[7, 10].
In [5], the authors have considered the effect of
intrinsic noise on the complexity of a deterministic, noise-tolerant variant of
the trust-region algorithm. This important question is handled here by considering
specific realizations of the algorithm under reasonable assumptions on the
cumulative distribution of errors in the evaluations of the objective function
and its derivatives. We have shown that, for such realizations, a first-order
version of our trust-region algorithm still provides “degraded” optimality
guarantees, should intrinsic noise cause the assumptions used for the
complexity analysis to fail. We have specialized and illustrated those results
in the case of sampling-based finite-sum minimization, a context of particular
interest in deep-learning applications.
We have so far developed and analyzed “noise-aware” deterministic and
stochastic algorithms for unconstrained optimization. Clearly, considering the
constrained case is a natural extension of the type of analysis presented
here.
References
[1]
A. S. Bandeira, K. Scheinberg, and L. N. Vicente.
Convergence of trust-region methods based on probabilistic models.
SIAM Journal on Optimization, 24(3):1238–1264, 2014.
[2]
S. Bellavia and G. Gurioli.
Complexity analysis of a stochastic cubic regularisation method under
inexact gradient evaluations and dynamic Hessian accuracy.
Optimization, (to appear), 2021, https://doi.org/10.1080/02331934.2021.1892104.
[3]
S. Bellavia, G. Gurioli, B. Morini, and Ph. L. Toint.
Adaptive regularization algorithms with inexact evaluations for
nonconvex optimization.
SIAM Journal on Optimization, 29(4):2881–2915, 2019.
[4]
S. Bellavia, G. Gurioli, B. Morini, and Ph. L. Toint.
Adaptive regularization for nonconvex optimization using inexact function values and randomly perturbed derivatives.
Journal of Complexity, 68, Article number 101591, 2022.
[5]
S. Bellavia, G. Gurioli, B. Morini, and Ph. L. Toint.
The impact of noise on evaluation complexity: The deterministic
trust-region case.
arXiv:2104.02519, 2021.
[6]
A. Berahas, L. Cao, and K. Scheinberg.
Global convergence rate analysis of a generic line search algorithm
with noise.
SIAM Journal on Optimization, 31(2):1489–1518, 2021.
[7]
J. Blanchet, C. Cartis, M. Menickelly, and K. Scheinberg.
Convergence rate analysis of a stochastic trust region method via
supermartingales.
INFORMS Journal on Optimization, 1(2):92–119, 2019.
[8]
C. Cartis, N. I. M. Gould, and Ph. L. Toint.
Second-order optimality and beyond: characterization and evaluation complexity in convexly constrained nonlinear optimization.
Foundations of Computational Mathematics, 18:1073–1107, 2020.
[9]
C. Cartis, N. I. M. Gould, and Ph. L. Toint.
Sharp worst-case evaluation complexity bounds for arbitrary-order
nonconvex optimization with inexpensive constraints.
SIAM Journal on Optimization, 30(1):513–541, 2020.
[10]
C. Cartis, N. I. M. Gould, and Ph. L. Toint.
Strong evaluation complexity of an inexact trust-region algorithm for
arbitrary-order unconstrained nonconvex optimization.
arXiv:2011.00854, 2020.
[11]
C. Cartis, N. I. M. Gould, and Ph. L. Toint.
Evaluation complexity of algorithms for nonconvex optimization.
MPS-SIAM Series on Optimization. SIAM, Philadelphia, USA, to appear,
2021.
[12]
C. Cartis and K. Scheinberg.
Global convergence rate analysis of unconstrained optimization
methods based on probabilistic models.
Mathematical Programming, Series A, 159(2):337–375, 2018.
[13]
R. Chen, M. Menickelly, and K. Scheinberg.
Stochastic optimization using a trust-region method and random
models.
Mathematical Programming, Series A, 169(2):447–487, 2018.
[14]
V. Chvátal.
The tail of the hypergeometric distribution.
Discrete Mathematics, 25:285–287, 1979.
[15]
A. R. Conn, N. I. M. Gould, and Ph. L. Toint.
Trust-Region Methods.
MPS-SIAM Series on Optimization. SIAM, Philadelphia, USA, 2000.
[16]
C. Paquette and K. Scheinberg.
A stochastic line search method with convergence rate analysis.
SIAM Journal on Optimization, 30(1):349–376, 2020.
[17]
Y. Yuan.
Recent advances in trust region algorithms.
Mathematical Programming, Series A, 151(1):249–281, 2015.