
Quadratic and Cubic Regularisation Methods with Inexact Function and Random Derivatives for Finite-Sum Minimisation

Stefania Bellavia Department of Industrial Engineering
Università degli Studi di Firenze
Firenze, Italy
stefania.bellavia@unifi.it
   Gianmarco Gurioli Scuola Internazionale Superiore di
Studi Avanzati (SISSA)
Trieste, Italy
ggurioli@sissa.it
   Benedetta Morini Department of Industrial Engineering
Università degli Studi di Firenze
Firenze, Italy
benedetta.morini@unifi.it
   Philippe L. Toint Namur Center for Complex Systems (naXys)
University of Namur
Namur, Belgium
philippe.toint@unamur.be
Abstract

This paper focuses on regularisation methods using models up to the third order to search for up to second-order critical points of a finite-sum minimisation problem. The variant presented belongs to the framework of [3]: it employs random models whose accuracy is guaranteed with a sufficiently large prefixed probability, together with deterministic inexact function evaluations within a prescribed level of accuracy. Without assuming unbiased estimators, the expected number of iterations is $\mathcal{O}\bigl(\epsilon_{1}^{-2}\bigr)$ or $\mathcal{O}\bigl(\epsilon_{1}^{-3/2}\bigr)$ when searching for a first-order critical point using a second- or third-order model, respectively, and $\mathcal{O}\bigl(\max[\epsilon_{1}^{-3/2},\epsilon_{2}^{-3}]\bigr)$ when seeking second-order critical points with a third-order model, where $\epsilon_{j}$, $j\in\{1,2\}$, is the $j$th-order tolerance. These results match the worst-case optimal complexity of the deterministic counterpart of the method. Preliminary numerical tests for first-order optimality in the context of nonconvex binary classification in imaging, with and without Artificial Neural Networks (ANNs), are presented and discussed.

Index Terms:
evaluation complexity, regularization methods, inexact functions and derivatives, stochastic analysis

I Introduction

We consider adaptive regularisation methods to compute an approximate qq-th order, q{1,2}q\in\{1,2\}, local minimum of the finite-sum minimisation problem

minxnf(x)=defminxni=1Nfi(x),\min_{x\in\mathbb{R}^{n}}f(x)\stackrel{{\scriptstyle\rm def}}{{=}}\min_{x\in\mathbb{R}^{n}}\sum_{i=1}^{N}f_{i}(x), (1)

where fi:nf_{i}:\mathbb{R}^{n}\rightarrow\mathbb{R}, i{1,,N}i\in\{1,...,N\}.

Problem (1) has recently received considerable attention, since it includes a variety of applications, such as least-squares approximation and Machine Learning, and covers optimization problems in imaging. In fact, Convolutional Neural Networks (CNNs) have been successfully employed for classification in imaging and for recovering an image from a set of noisy measurements, and compare favourably with well-assessed iterative reconstruction methods combined with suitable regularization. Investigating the link between traditional approaches and CNNs is an active area of research, as is the efficient solution of the nonconvex optimization problems which arise from training the networks on large databases of images and can be cast in the form (1) with $N$ a large positive integer (see e.g., [2, 9, 28, 31, 39]).

The wide range of methods used in the literature to solve (1) can be classified as first-order methods, requiring only the gradient of the objective $f$, and second-order procedures, where the Hessian is also needed. Although first-order schemes are generally characterised by a simple and low-cost iteration, their performance can be seriously hindered by ill-conditioning and their success is highly dependent on the fine-tuning of hyper-parameters. In addition, the objective function in (1) can be nonconvex, with a variety of local minima and/or saddle points. For this reason, second-order strategies using curvature information have recently been employed as an instrument for escaping from saddle points more easily (see, e.g., [13, 19, 35, 7]). Clearly, the per-iteration cost is expected to be higher than for first-order methods, since second-order derivative information is needed. By contrast, second-order methods have been shown to be significantly more resilient to ill-conditioned and badly-scaled problems, and less sensitive to the choice and tuning of hyper-parameters [7, 35].

In this paper we propose methods using derivative information up to order two, building on the Adaptive Regularization (AR) approach. We stress that methods employing cubic models show optimal complexity: a first- or second-order stationary point can be reached in at most $\mathcal{O}\bigl(\epsilon_{1}^{-3/2}\bigr)$ or $\mathcal{O}\bigl(\max[\epsilon_{1}^{-3/2},\epsilon_{2}^{-3}]\bigr)$ iterations, respectively, with $\epsilon_{1}$, $\epsilon_{2}$ being the order-dependent requested accuracy thresholds (see, e.g., [6, 11, 19, 20, 22]). Allowing inexactness in the function and/or derivative evaluations of AR methods, while still preserving convergence properties and optimal complexity, has been a challenge in recent years [5, 23, 36, 18, 27, 35, 37, 41]. A first approach is to impose suitable accuracy requirements that $f$ and the derivatives have to fulfill deterministically at each iteration. However, in machine learning applications, function and derivative evaluations are generally approximated by using uniformly randomly selected subsets of terms in the sums, and the accuracy levels can be satisfied only within a certain probability [6, 5, 23, 36, 37, 7, 18]. This suggests a stochastic analysis of the expected worst-case number of iterations needed to find a first- or second-order critical point [18, 41]. We pursue this approach and our contributions are as follows. We elaborate on [3], where adaptive regularisation methods with random models have been proposed for computing strong approximate minimisers of any order for constrained smooth optimization. We focus here on problems of the form (1) and on methods from this class using models based on adding a regularisation term to a truncated Taylor series of order up to two. While gradient and Hessian are subject to random noise, function values are required to be approximated within a deterministic level of accuracy. Our approach is particularly suited for applications where evaluating derivatives is more expensive than performing function evaluations. This is for instance the case of deep neural network training (see, e.g., [26]). We discuss a matrix-free implementation in the case where gradient and Hessian are approximated via sampling with adaptive accuracy requirements. The outlined procedure retains the optimal worst-case complexity, and our analysis complements that in [18], as we cover the approximation of second-order minimisers.

I-1 Notations

We use $\|\cdot\|$ to indicate the 2-norm (of matrices and vectors). $\nabla_{x}^{0}f(x)$ is a synonym for $f(x)$, given $f:\mathbb{R}^{n}\rightarrow\mathbb{R}$. $\mathbb{E}[X]$ denotes the expected value of a random variable $X$. In addition, given a random event $E$, $Pr(E)$ denotes the probability of $E$. All inexact quantities are denoted by an overbar.

II Employing inexact evaluations

II-A Preliminaries.

Given 1qp21\leq q\leq p\leq 2, we make the following assumptions on (1).

AS.1

There exists a constant flowf_{\rm low} such that f(x)flowf(x)\geq f_{\rm low} for all xnx\in\mathbb{R}^{n}.

AS.2

f𝒞p()f\in\mathcal{C}^{p}(\mathcal{B}) with \mathcal{B} a convex neighbourhood of n\mathbb{R}^{n}. Moreover, there exists a nonnegative constant LpL_{p}, such that, for all x,yx,y\in\mathcal{B}:
xpf(x)xpf(y)Lpxy\|\nabla_{x}^{p}f(x)-\nabla_{x}^{p}f(y)\|\leq L_{p}\|x-y\|.

Consequently, the pp-order Taylor expansions of ff centered at xx with increment ss are well-defined and are given by:

Tf,1(x,s)=deff(x)ΔTf,1(x,s),T_{f,1}(x,s)\stackrel{{\scriptstyle\rm def}}{{=}}f(x)-\Delta T_{f,1}(x,s),

with ΔTf,1(x,s)=defxf(x)s\Delta T_{f,1}(x,s)\stackrel{{\scriptstyle\rm def}}{{=}}-\nabla_{x}f(x)^{\top}s, and

Tf,2(x,s)=deff(x)ΔTf,2(x,s),T_{f,2}(x,s)\stackrel{{\scriptstyle\rm def}}{{=}}f(x)-\Delta T_{f,2}(x,s),

where

ΔTf,2(x,s)=defxf(x)s12sx2f(x)s.\Delta T_{f,2}(x,s)\stackrel{{\scriptstyle\rm def}}{{=}}-\nabla_{x}f(x)^{\top}s-\frac{1}{2}s^{\top}\nabla_{x}^{2}f(x)s. (2)

II-B Optimality conditions

To obtain complexity results for approximating first- and second-order minimisers, we use a compact formulation to characterise the optimality conditions. As in [16], given the order of accuracy $q\in\{1,2\}$ and the tolerance vector $\epsilon=\epsilon_{1}$, if $q=1$, or $\epsilon=(\epsilon_{1},\epsilon_{2})$, if $q=2$, we say that $x\in\mathbb{R}^{n}$ is a $q$-th order $\epsilon$-approximate minimiser for (1) if

ϕf,j(x)ϵjj for j=1,,q,\phi_{f,j}(x)\leq\frac{\epsilon_{j}}{j}\textrm{ for }j=1,...,q, (3)

where

ϕf,j(x)=deff(x)globmind1Tf,j(x,d)=globmaxd1ΔTf,j(x,d).\phi_{f,j}(x)\stackrel{{\scriptstyle\rm def}}{{=}}f(x)-\operatorname*{globmin}_{\|d\|\leq 1}T_{f,j}(x,d)=\operatorname*{globmax}_{\|d\|\leq 1}\Delta T_{f,j}(x,d). (4)

The optimality measure $\phi_{f,j}$ is a nonnegative (continuous) function that can be used as a measure of closeness to $q$-th order stationary points [16]. For $\epsilon_{1}=\epsilon_{2}=0$, it reduces to the known first- and second-order optimality conditions, respectively. Indeed, assuming that $q=1$, we have from (3)–(4) with $j=1$ that $\phi_{f,1}(x)=\|\nabla_{x}f(x)\|=0$. If $q=2$, (3)–(4) further imply that $\phi_{f,2}(x)=\max_{\|d\|\leq 1}\left(-\frac{1}{2}d^{\top}\nabla_{x}^{2}f(x)d\right)=0$, which is the same as requiring the positive semi-definiteness of $\nabla_{x}^{2}f(x)$.
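To make the optimality measure concrete, here is a minimal Python sketch (ours, assuming NumPy and a small dense problem) that evaluates $\phi_{f,1}$ and, for the zero-gradient case discussed above, $\phi_{f,2}$; the function names are purely illustrative.

```python
import numpy as np

def phi_1(grad):
    # phi_{f,1}(x) = max_{||d||<=1} ( -grad^T d ) = ||grad||_2
    return np.linalg.norm(grad)

def phi_2_zero_grad(hess):
    # phi_{f,2}(x) when grad f(x) = 0:
    # max_{||d||<=1} ( -0.5 d^T H d ) = max(0, -lambda_min(H)) / 2,
    # so phi_{f,2}(x) = 0 iff H is positive semi-definite.
    lam_min = np.linalg.eigvalsh(hess)[0]
    return max(0.0, -lam_min) / 2.0

# Example: a Hessian with a negative eigenvalue gives phi_2 > 0 (not second-order stationary).
H = np.diag([2.0, -0.5])
print(phi_1(np.array([1.0, -3.0])), phi_2_zero_grad(H))  # 3.162..., 0.25
```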

II-C The Inexact Adaptive Regularisation (IARqpIAR_{qp}) algorithm

We now define our Inexact Adaptive Regularisation ($IAR_{qp}$) scheme, whose purpose is to find a $q$-th order $\epsilon$-approximate minimiser of (1) (see (3)) using a model of order $p+1$, for $1\leq q\leq p\leq 2$.

The scheme, sketched in Algorithm 1, is defined in analogy with the Adaptive Regularization approach (see [17]), but now uses the inexact values $\overline{f}(x_{k})$, $\overline{f}(x_{k}+s)$, $\overline{\nabla_{x}^{j}f}(x_{k})$ instead of $f(x_{k})$, $f(x_{k}+s)$, $\nabla_{x}^{j}f(x_{k})$, $j=1,...,q$, respectively. At iteration $k$, the model

mk(s)\displaystyle m_{k}(s) ={ΔT¯f,1(xk,s)+σk2s2,ifp=1,ΔT¯f,2(xk,s)+σk6s3,ifp=2.\displaystyle=\left\{\begin{array}[]{ll}-\overline{\Delta T}_{f,1}(x_{k},s)+\frac{\sigma_{k}}{2}\|s\|^{2},&\textrm{if}~{}~{}p=1,\\ -\overline{\Delta T}_{f,2}(x_{k},s)+\frac{\sigma_{k}}{6}\|s\|^{3},&\textrm{if}~{}~{}p=2.\end{array}\right. (7)

is built at Step 1 and approximately minimised, at Step 2, finding a step sks_{k} such that

mk(sk)mk(0)=0m_{k}(s_{k})\leq m_{k}(0)=0 (8)

and

ϕ¯mk,j(sk)=maxd1ΔT¯mk,j(sk,d)θϵjj.\overline{\phi}_{m_{k},j}(s_{k})=\max_{\|d\|\leq 1}\overline{\Delta T}_{m_{k},j}(s_{k},d)\leq\theta\frac{\epsilon_{j}}{j}.

for j=1,,qj=1,...,q, and some θ(0,12)\theta\in(0,\frac{1}{2}). Taking into account that,

ΔT¯mk,1(sk,d)\displaystyle\overline{\Delta T}_{m_{k},1}(s_{k},d) =\displaystyle= sm¯(sk)d\displaystyle-\overline{\nabla_{s}m}(s_{k})^{\top}d
ΔT¯mk,2(sk,d)\displaystyle\overline{\Delta T}_{m_{k},2}(s_{k},d) =\displaystyle= sm¯(sk)d12ds2m¯(sk)d,\displaystyle-\overline{\nabla_{s}m}(s_{k})^{\top}d-\frac{1}{2}d^{\top}\overline{\nabla_{s}^{2}m}(s_{k})d,

we have

ϕ¯mk,1(sk)\displaystyle\overline{\phi}_{m_{k},1}(s_{k}) =\displaystyle= maxd1(sm¯(sk)d)\displaystyle\max_{\|d\|\leq 1}\left(-\overline{\nabla_{s}m}(s_{k})^{\top}d\right) (9)
=\displaystyle= smk(sk)θϵ1\displaystyle\|\nabla_{s}m_{k}(s_{k})\|\leq\theta\epsilon_{1}

for p=1,2p=1,2 and

ϕ¯mk,2(sk)=maxd1(sm¯(sk)d12ds2m¯(sk)d)θϵ2\overline{\phi}_{m_{k},2}(s_{k})=\max_{\|d\|\leq 1}\left(-\overline{\nabla_{s}m}(s_{k})^{\top}d-\frac{1}{2}d^{\top}\overline{\nabla_{s}^{2}m}(s_{k})d\right)\leq\theta\epsilon_{2}

for q=p=2q=p=2.

Algorithm 1 The IARqpIAR_{qp} Algorithm, 1qp21\leq q\leq p\leq 2.

Step 0: Initialization.

An initial point x0nx_{0}\in\mathbb{R}^{n}, initial regulariser σ0>0\sigma_{0}>0, tolerances ϵj\epsilon_{j}, j=1,,qj=1,...,q, and constants θ(0,12)\theta\in(0,\frac{1}{2}), η(0,1)\eta\in(0,1), γ>1\gamma>1, α(0,1)\alpha\in(0,1), ω0=min[12αη,1σ0]\omega_{0}=\min\left[\frac{1}{2}\alpha\eta,\frac{1}{\sigma_{0}}\right], σmin(0,σ0)\sigma_{\min}\in(0,\sigma_{0}) are given. Set k=0k=0.

Step 1: Model definition.

Build the approximate gradient xf¯(xk)\overline{\nabla_{x}f}(x_{k}). If p=2p=2, compute also the Hessian x2f¯(xk)\overline{\nabla_{x}^{2}f}(x_{k}). Compute the model mk(s)m_{k}(s) as defined in (7).

Step 2: Step calculation.

Compute a step sks_{k} satisfying (8)–(9), for j=1,,qj=1,...,q. If ΔT¯f,p(xk,sk)=0\overline{\Delta T}_{f,p}(x_{k},s_{k})=0, go to Step 4.

Step 3: Function approximations.

Compute f¯(xk)\overline{f}(x_{k}) and f¯(xk+sk)\overline{f}(x_{k}+s_{k}) satisfying (12)–(13).

Step 4: Test of acceptance.

Set

ρk={f¯(xk)f¯(xk+sk)ΔT¯f,p(xk,sk) if ΔT¯f,p(xk,sk)>0, otherwise.\rho_{k}=\left\{\begin{array}[]{ll}\frac{\overline{f}(x_{k})-\overline{f}(x_{k}+s_{k})}{\overline{\Delta T}_{f,p}(x_{k},s_{k})}&\textrm{ if ~{}}\overline{\Delta T}_{f,p}(x_{k},s_{k})>0,\\ -\infty&\textrm{ otherwise.}\end{array}\right.

If ρkη\rho_{k}\geq\eta (successful iteration), then set xk+1=xk+skx_{k+1}=x_{k}+s_{k}; otherwise (unsuccessful iteration) set xk+1=xkx_{k+1}=x_{k}.

Step 5: Regularisation parameter update.

Set

σk+1={max[σmin,1γσk],ifρkη,γσk,ifρk<η.\sigma_{k+1}=\left\{\begin{array}[]{ll}\max\left[\sigma_{\min},\frac{1}{\gamma}\sigma_{k}\right],&\textrm{if}\rho_{k}\geq\eta,\\ \gamma\sigma_{k},&\textrm{if}\rho_{k}<\eta.\end{array}\right. (10)

Step 6: Relative accuracy update.

Set

ωk+1=min[12αη,1σk+1].\omega_{k+1}=\min\left[\frac{1}{2}\alpha\eta,\frac{1}{\sigma_{k+1}}\right]. (11)

Increment kk by one and go to Step 1.
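For illustration, a minimal Python sketch (ours) of the acceptance test and parameter updates of Steps 4–6; the quantities produced by Steps 1–3 are assumed to be supplied by the caller, and the default parameter values follow those used later in Section IV-A.

```python
def accept_and_update(f_bar_xk, f_bar_trial, delta_T, sigma_k,
                      eta=0.8, gamma=2.0, sigma_min=1e-5, alpha=0.5):
    """One pass through Steps 4-6 of Algorithm 1 (a sketch).
       delta_T is the model decrease \\bar{\\Delta T}_{f,p}(x_k, s_k)."""
    # Step 4: test of acceptance.
    rho_k = (f_bar_xk - f_bar_trial) / delta_T if delta_T > 0 else -float("inf")
    successful = rho_k >= eta
    # Step 5: regularisation parameter update (10).
    sigma_next = max(sigma_min, sigma_k / gamma) if successful else gamma * sigma_k
    # Step 6: relative accuracy update (11).
    omega_next = min(0.5 * alpha * eta, 1.0 / sigma_next)
    return successful, sigma_next, omega_next
```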

The existence of such a step can be proved as in Lemma 4.4 of [16] (see also [3]). We note that the model definition in (7) does not depend on the approximate value of $f$ at $x_{k}$. The ratio $\rho_{k}$, depending on inexact function values and model, is then computed at Step 4 and affects the acceptance of the trial point $x_{k}+s_{k}$. Its magnitude also influences the update of the regularisation parameter $\sigma_{k}$ at Step 5. While the gradient $\overline{\nabla_{x}f}(x_{k})$ and Hessian $\overline{\nabla_{x}^{2}f}(x_{k})$ approximations can be seen as random estimates, the values $\overline{f}(x_{k})$, $\overline{f}(x_{k}+s_{k})$ are required to be deterministically computed to satisfy

|f¯(xk)f(xk)|\displaystyle\left|\overline{f}(x_{k})-f(x_{k})\right| \displaystyle\leq ωkΔT¯f,p(xk,sk),\displaystyle\omega_{k}\overline{\Delta T}_{f,p}(x_{k},s_{k}),~{}~{}~{}~{}~{}~{} (12)
|f¯(xk+sk)f(xk+sk)|\displaystyle\left|\overline{f}(x_{k}+s_{k})-f(x_{k}+s_{k})\right| \displaystyle\leq ωkΔT¯f,p(xk,sk),\displaystyle\omega_{k}\overline{\Delta T}_{f,p}(x_{k},s_{k}),~{}~{}~{}~{}~{}~{} (13)

in which $\omega_{k}$ is iteratively defined at Step 6. As for the implementation of the algorithm, we note that $\overline{\phi}_{m_{k},1}(s_{k})=\|\nabla_{s}m_{k}(s_{k})\|$, while $\overline{\phi}_{m_{k},2}(s_{k})$ can be computed via a standard trust-region method at a cost comparable to that of computing the left-most eigenvalue of the Hessian. The approximate minimisation of the cubic model (7) at each iteration can be seen, for $p=2$, as the main computational issue in the AR framework. However, an approximate minimiser can be computed via matrix-free approaches accessing the Hessian only through matrix-vector products. A number of procedures have been proposed, ranging from Lanczos-type iterations, where the minimisation is carried out over nested, lower dimensional, Krylov subspaces [21], up to minimisation via gradient descent (see, e.g., [1, 14, 15]) or the Barzilai-Borwein gradient method [10]. Hessian-vector products can be approximated by finite differences, at the cost of only two gradient evaluations [10, 15]. All these matrix-free implementations remain relevant if $\nabla_{x}^{2}f(x_{k})$ is defined via subsampling, proceeding as in Section IV. Interestingly, back-propagation-like methods in deep learning also allow the computation of Hessian-vector products at a similar cost [33, 38].
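As an illustration of the matrix-free option just mentioned, a sketch (ours) of the forward-difference Hessian-vector product, which costs two gradient evaluations (one if the gradient at $x_k$ is already available):

```python
import numpy as np

def hess_vec_fd(grad_fun, x, v, h=1e-6):
    """Approximate (Hessian of f at x) @ v via a forward difference of the gradient:
       H(x) v ~ (grad(x + h v) - grad(x)) / h."""
    return (grad_fun(x + h * v) - grad_fun(x)) / h
```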

II-D Probabilistic assumptions on IARqpIAR_{qp}

In what follows, all random quantities are denoted by capital letters, while small letters denote their realisations. We refer to the random model $M_{k}$ at iteration $k$, while $m_{k}=M_{k}(\zeta_{k})$ is its realisation, with $\zeta_{k}$ being a random sample taken from a context-dependent probability space. As a consequence, the iterates $X_{k}$, as well as the regularisers $\Sigma_{k}$, the steps $S_{k}$ and the accuracy parameters $\Omega_{k}$, are the random variables such that $x_{k}=X_{k}(\zeta_{k})$, $\sigma_{k}=\Sigma_{k}(\zeta_{k})$, $s_{k}=S_{k}(\zeta_{k})$ and $\omega_{k}=\Omega_{k}(\zeta_{k})$. For the sake of brevity, we will omit $\zeta_{k}$ in what follows. Due to the randomness of the model construction at Step 1, the $IAR_{qp}$ algorithm induces a random process formalised by $\{X_{k},S_{k},M_{k},\Sigma_{k},\Omega_{k}\}$ ($x_{0}$ and $\sigma_{0}$ are deterministic quantities). For $k\geq 0$, we formalise the conditioning on the past by using $\mathcal{A}_{k-1}^{M}$, the $\hat{\sigma}$-algebra induced by the random variables $M_{0}$, $M_{1}$, ..., $M_{k-1}$, with $\mathcal{A}_{-1}^{M}=\hat{\sigma}(x_{0})$. We also denote by $d_{k,j}$ and $\overline{d}_{k,j}$ the arguments of the maxima in the definitions of $\phi_{m_{k},j}(s_{k})$ and $\overline{\phi}_{m_{k},j}(s_{k})$, respectively. We say that iteration $k$ is accurate if

xf¯(Xk)xf(Xk)ΩkΔT¯k,min6τk, for {1,,p},\|\overline{\nabla_{x}^{\ell}f}(X_{k})-\nabla_{x}^{\ell}f(X_{k})\|\leq\Omega_{k}\frac{\overline{\Delta T}_{k,\min}}{6\tau_{k}^{\ell}},\textrm{ for }\ell\in\{1,...,p\}, (14)

where

τk\displaystyle\tau_{k} =def\displaystyle\stackrel{{\scriptstyle\rm def}}{{=}} max[Sk,maxj=1,,q[Dk,j,D¯k,j]]\displaystyle\max\left[\|S_{k}\|,\max_{j=1,...,q}[\|D_{k,j}\|,\|\overline{D}_{k,j}\|]\right]
ΔT¯k,min\displaystyle\overline{\Delta T}_{k,\min} =def\displaystyle\stackrel{{\scriptstyle\rm def}}{{=}} min[ΔT¯f,p(Xk,Sk),\displaystyle\min\Bigl{[}\overline{\Delta T}_{f,p}(X_{k},S_{k}),
minj=1,,q[ΔT¯mk,j(Sk,Dk,j),\displaystyle~{}~{}~{}~{}~{}~{}\min_{j=1,...,q}\Bigl{[}\overline{\Delta T}_{m_{k},j}(S_{k},D_{k,j}),
ΔT¯mk,j(Sk,D¯k,j)]].\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\overline{\Delta T}_{m_{k},j}(S_{k},\overline{D}_{k,j})\Bigr{]}\Bigr{]}.

We emphasize that the above accuracy requirements are adaptive. At variance with the trust-region methods of [12, 18, 24], the above conditions do not require the model to be fully linear or quadratic in a ball centered at xkx_{k} of radius at least sk\|s_{k}\|. As standard in related papers [12, 18, 24, 32], we assume a lower bound on the probability of the model to be accurate at the kk-th iteration.

AS.3

For all k0k\geq 0, conditioned to 𝒜k1M\mathcal{A}_{k-1}^{M}, we assume that (14) is satisfied with probability at least p(12,1]p_{*}\in(\frac{1}{2},1] independent of kk.

We stress that the inequalities in (14) can be satisfied via uniform subsampling with probability of success at least $(1-t_{\ell})$, using the operator Bernstein inequality (see Subsection IV-A and Section 7.2 in [6]), and this provides AS.3 with $p_{*}=\prod_{\ell=1}^{p}(1-t_{\ell})$.

III Worst-case complexity analysis

III-A Stopping time

Before starting with the worst-case evaluation complexity of the $IAR_{qp}$ algorithm, the following clarifications are important. For each $k\geq 1$ we assume that the computation of $s_{k-1}$, and thus of the trial point $x_{k-1}+s_{k-1}$, is deterministic once the inexact model $m_{k-1}(s)$ is known, and that (12)-(13) at Step 3 of the algorithm are enforced deterministically; therefore, $\rho_{k-1}$ and the fact that iteration $k-1$ is successful are deterministic outcomes of the realisation of the (random) inexact model. Hence,

Nϵ=inf{k0ϕf,j(Xk)ϵjj,j=1,,q}N_{\epsilon}=\inf\Bigl{\{}k\geq 0~{}\mid~{}\phi_{f,j}(X_{k})\leq\frac{\epsilon_{j}}{j},~{}j=1,...,q\Bigr{\}}

can be seen as a family of hitting times depending on $\epsilon$ and corresponding to the number of iterations required until (3) is met for the first time. The assumptions made at the beginning of this subsection imply that the variables $X_{k-1}+S_{k-1}$ and the event $\{X_{k}=X_{k-1}+S_{k-1}\}$, occurring when iteration $k-1$ is successful, are measurable with respect to $\mathcal{A}_{k-1}^{M}$. Our aim is then to derive an upper bound on the expected number of iterations $\mathbb{E}[N_{\epsilon}]$ needed by the $IAR_{qp}$ algorithm, in the worst case, to reach an $\epsilon$-approximate $q$-th-order-necessary minimiser, as in (3).

III-B Deriving expected upper bounds on NϵN_{\epsilon}

A crucial property for our complexity analysis is to show that, when the model (7) is accurate and iteration $k$ is successful but $\|\nabla_{x}f(x_{k+1})\|>\epsilon_{1}$, then $\|s_{k}\|^{p+1}$ is bounded below by $\psi(\sigma_{k})\epsilon_{1}^{\frac{p+1}{p-q+1}}$, in which $\psi(\sigma)$ is a decreasing function of $\sigma$. This is central in [18] for proving the worst-case complexity bound for first-order optimality. By virtue of the compact formulation (4) of the optimality measure, for $p=2$, the lower bound on $\|s_{k}\|^{3}$ also holds for second-order optimality and a suitable power of $\epsilon_{2}$.

Lemma 1.

Suppose that AS.2 holds and consider any realisation of the algorithm. Suppose that (14) also occurs, that iteration kk is successful and that, for some j=1,,qj=1,...,q, (3) fails for xk+1x_{k+1}. Then skp+1ψ(σk)ϵjp+1pq+1\|s_{k}\|^{p+1}\geq\psi(\sigma_{k})\epsilon_{j}^{\frac{p+1}{p-q+1}}, with

ψ(σ)=min[1,((12θ)(3q)q(Lf,2+σ))p+1pq+1].\psi(\sigma)=\min\left[1,\left(\frac{(1-2\theta)(3-q)}{q(L_{f,2}+\sigma)}\right)^{\frac{p+1}{p-q+1}}\right]. (15)

We refer to Lemma 3.4 in [3] for a complete proof. In order to develop the complexity analysis, building on [18] and using results from the previous subsection, we first identify a threshold $\sigma_{s}$ for the regulariser, above which each iteration of the algorithm with an accurate model is successful. We also give some preliminary bounds (see [3]).

Lemma 2.

Suppose that AS.2 holds. For any realisation of the algorithm, if iteration kk is such that (14) occurs and

σkσs=defmax[2σ0,Lf,p+31η],\sigma_{k}\geq\sigma_{s}\stackrel{{\scriptstyle\rm def}}{{=}}\max\left[2\sigma_{0},\frac{L_{f,p}+3}{1-\eta}\right], (16)

then iteration kk is successful.

Lemma 3.

Let Assumptions AS.1–AS.3 hold. For all realisations of the IARqpIAR_{qp}, let NASN_{AS}, NΛN_{\Lambda} and NΛcN_{\Lambda^{c}} represent the number of accurate successful iterations with Σkσs\Sigma_{k}\leq\sigma_{s}, the number of iterations with Σk<σs\Sigma_{k}<\sigma_{s} and the number of iterations with Σkσs\Sigma_{k}\geq\sigma_{s}, respectively. Then,

𝔼[NΛc]\displaystyle\mathbb{E}[N_{\Lambda^{c}}] \displaystyle\leq 12p𝔼[Nϵ],\displaystyle\frac{1}{2p_{*}}\mathbb{E}[N_{\epsilon}],
𝔼[NAS]\displaystyle\mathbb{E}[N_{AS}] \displaystyle\leq 6(f0flow)ησmin(1α)ψ(σs)maxj=1,,q[ϵjp+1pq+1]+1,\displaystyle\frac{6(f_{0}-f_{\rm low})}{\eta\sigma_{\min}(1-\alpha)\psi(\sigma_{s})}\max_{j=1,...,q}\left[\epsilon_{j}^{-\frac{p+1}{p-q+1}}\right]+1,
𝔼[NΛ]\displaystyle\mathbb{E}[N_{\Lambda}] \displaystyle\leq 12p1[2𝔼[NAS]+logγ(σsσ0)].\displaystyle\frac{1}{2p_{*}-1}\left[2\mathbb{E}[N_{AS}]+\log_{\gamma}\Bigl{(}\frac{\sigma_{s}}{\sigma_{0}}\Bigr{)}\right].

The proofs of the first and third bounds of the previous lemma follow the reasoning of [18], while the proof of the second bound is given in [3]. Finally, the fact that $\mathbb{E}[N_{\epsilon}]=\mathbb{E}[N_{\Lambda}]+\mathbb{E}[N_{\Lambda^{c}}]$ allows us to state our complexity result.

Theorem 4.

Under the assumptions of Lemma 3,

𝔼[Nϵ]\displaystyle\mathbb{E}[N_{\epsilon}] \displaystyle\leq 2p(2p1)2[12(f0flow)ησmin(1α)ψ(σs)maxj=1,,q[ϵjp+1pq+1]\displaystyle\frac{2p_{*}}{(2p_{*}-1)^{2}}\Bigl{[}\frac{12(f_{0}-f_{\rm low})}{\eta\sigma_{\min}(1-\alpha)\psi(\sigma_{s})}\max_{j=1,...,q}\left[\epsilon_{j}^{-\frac{p+1}{p-q+1}}\right] (17)
+logγ(σsσ0)+2].\displaystyle~{}+\log_{\gamma}\left(\frac{\sigma_{s}}{\sigma_{0}}\right)+2\Bigr{]}.

Theorem 4 shows that for $q=p=1$ the $IAR_{1}$ algorithm reaches the hitting time in at most $\mathcal{O}(\epsilon_{1}^{-2})$ iterations. Thus, it matches the complexity bound of gradient-type methods for nonconvex problems employing exact gradients. Moreover, $IAR_{qp}$ needs $\mathcal{O}(\epsilon_{1}^{-3/2})$ iterations when $q=1$ and $p=2$ are considered, while $\mathcal{O}\bigl(\max[\epsilon_{1}^{-3/2},\epsilon_{2}^{-3}]\bigr)$ iterations are needed in the worst case to approximate a second-order optimality point (case $p=q=2$). Therefore, for $1\leq q\leq p\leq 2$, Theorem 4 generalises the complexity bounds stated in [16, Theorem 5.5] to the case where evaluations of $f$ and its derivatives are inexact, under probabilistic assumptions on the accuracy of the latter. Since [16, Theorems 6.1 and 6.4] show that the complexity bounds are sharp in the order of the tolerances $\epsilon_{1}$ and $\epsilon_{2}$ for exact evaluations and Lipschitz continuous derivatives of $f$, this is also the case for the bounds of Theorem 4.

IV Preliminary numerical tests

We here present preliminary numerical results on the $IAR_{qp}$ algorithm with models of order up to three ($p\in\{1,2\}$) and first-order optimality ($q=1$). Numerical tests are performed within the context of nonconvex finite-sum minimisation problems for the binary digit detection of two datasets coming from the imaging community. Specifically, we consider the MNIST dataset [29], usually considered for classifying the 10 handwritten digits, renamed here MNIST-B and used to separate even digits from odd digits, and the GISETTE database [30], to detect the highly confusable handwritten digits “4” and “9”.
The overall binary classification learning procedure focuses on the two macro-steps reported below.

  • The training phase. Given a training set {(ai,yi)}i=1N\{(a_{i},y_{i})\}_{i=1}^{N} of NN features aida_{i}\in\mathbb{R}^{d} and corresponding binary labels yi{0,1}y_{i}\in\{0,1\}, we solve the following minimisation problem:

    minxnf(x)=minxn1Ni=1Nfi(x)=defminxn1Ni=1N(yinet(ai;x))2,\begin{split}\min_{x\in\mathbb{R}^{n}}f(x)&=\min_{x\in\mathbb{R}^{n}}\frac{1}{N}\sum_{i=1}^{N}{f_{i}(x)}\\ &\stackrel{{\scriptstyle\rm def}}{{=}}\min_{x\in\mathbb{R}^{n}}\frac{1}{N}\sum_{i=1}^{N}\left(y_{i}-net\left(a_{i};x\right)\right)^{2},\end{split} (18)

    in which the objective function is the so-called training loss. We use a feed-forward Artificial Neural Network (ANN) $net:a\in\mathbb{R}^{d}\rightarrow(0,1)$, defined by the parameter vector $x\in\mathbb{R}^{n}$, as the model for predicting the values of the labels, and the square loss as a measure of the error on such predictions; this loss is minimised by approximately solving (18), in order to obtain the parameter vector $x^{*}$ to be used for label predictions on the testing set of the following step. We use the notation $(d_{1},....,d_{h})$ to denote a feed-forward ANN with $h$ hidden layers where the $j$th hidden layer, $j\in\{1,...,h\}$, has $d_{j}$ neurons. The numbers of neurons of the input layer and the output layer are constrained to be $d$ and $1$, respectively. The $\tanh$ is considered as the activation function for each hidden layer, while the sigmoid $\sigma$ constitutes the activation at the output layer. Therefore, if zero bias and no hidden layers ($h=0$) are considered, $d=n$ and $net(a;x)$ reduces to $\sigma(a^{\top}x)=1/(1+e^{-a^{\top}x})$, which constitutes our no-net model (see the sketch after this list).

  • The testing phase. Once the training phase is completed, a number NTN_{T} of testing data {a¯i,y¯i}i=1NT\{\overline{a}_{i},\overline{y}_{i}\}_{i=1}^{N_{T}} is used to validate the computed model. The values net(a¯i;x)net(\overline{a}_{i};x^{*}) are used to predict the testing labels y¯i\overline{y}_{i}, i{1,,NT}i\in\{1,...,N_{T}\}, computing the corresponding error (testing loss), measured by 1NTi=1NT(y¯inet(a¯i;x))2.\frac{1}{N_{T}}\sum_{i=1}^{N_{T}}\left(\overline{y}_{i}-net(\overline{a}_{i};x^{*})\right)^{2}.
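As anticipated in the training-phase item above, the following sketch (ours, assuming NumPy) spells out the terms of (18) and their gradients for the no-net model $net(a;x)=\sigma(a^{\top}x)$, averaged over a subsample; it is the kind of building block used by the subsampled estimators of Section IV-A.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def subsampled_loss_and_grad(x, A, y, sample):
    """Square loss (18) and its gradient for the 'no net' model net(a; x) = sigmoid(a^T x),
       averaged over the indices in `sample` (a subset of {0, ..., N-1})."""
    As, ys = A[sample], y[sample]
    p = sigmoid(As @ x)          # predictions in (0, 1)
    r = ys - p                   # residuals y_i - net(a_i; x)
    f = np.mean(r ** 2)
    # d/dx (y_i - sigmoid(a_i^T x))^2 = -2 r_i p_i (1 - p_i) a_i
    g = As.T @ (-2.0 * r * p * (1.0 - p)) / len(sample)
    return f, g
```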

IV-A Implementation issues and results

The implementation of the $IAR_{qp}$ algorithm follows the specifications below. The cubic regularisation parameter is initialised as $\sigma_{0}=10^{-1}$, its minimum value is $\sigma_{\min}=10^{-5}$ and the initial guess vector $x_{0}=(0,...,0)^{\top}\in\mathbb{R}^{n}$ is considered for all runs. Moreover, the parameters $\alpha=0.5$, $\eta=0.8$ and $\gamma=2$ are used. The estimators of the true objective function and derivatives are obtained by averaging over a subsample of $\{1,\ldots,N\}$. More specifically, the approximations of the objective function values and of the first and second derivatives take the form:

f¯k(xk)=1|𝒟k,1|i𝒟k,1fi(xk),f¯k(xk+sk)=1|𝒟k,2|i𝒟k,2fi(xk+sk),\begin{split}&\overline{f}_{k}(x_{k})=\frac{1}{|{\cal D}_{k,1}|}\sum_{i\in{\cal D}_{k,1}}f_{i}(x_{k}),\\ &\overline{f}_{k}(x_{k}+s_{k})=\frac{1}{|{\cal D}_{k,2}|}\sum_{i\in{\cal D}_{k,2}}f_{i}(x_{k}+s_{k}),\end{split} (19)
f¯(xk)=1|𝒢k|i𝒢kfi¯(xk),\displaystyle\overline{\nabla f}(x_{k})=\frac{1}{|{\cal G}_{k}|}\sum_{i\in{\cal G}_{k}}\overline{\nabla f_{i}}(x_{k}), (20)
2f¯(xk)=1|k|ik2fi¯(xk),\displaystyle\overline{\nabla^{2}f}(x_{k})=\frac{1}{|{\cal H}_{k}|}\sum_{i\in{\cal H}_{k}}\overline{\nabla^{2}f_{i}}(x_{k}), (21)

where 𝒟k,1{\cal D}_{k,1}, 𝒟k,2{\cal D}_{k,2}, 𝒢k{\cal G}_{k}, k{\cal H}_{k} are subsets of {1,2,,N}\{1,2,\ldots,N\} with sample sizes |𝒟k,1|,|𝒟k,2|,|𝒢k|,|k||\mathcal{D}_{k,1}|,|\mathcal{D}_{k,2}|,|\mathcal{G}_{k}|,|{\cal H}_{k}|. These sample sizes are adaptively chosen by the procedure in order to satisfy (12)–(14) with probability at least 1t1-t. We underline that the enforcement of (12)–(13) on function evaluations is relaxed in our experiments and that such accuracy requirements are now supposed to hold in probability. Given the prefixed absolute accuracies ν,k\nu_{\ell,k} for the derivative of order \ell at iteration kk and tt a prescribed probability of failure, by using results in [40] and [6], the approximations given in (19)–(21) satisfy ({1,2}\ell\in\{1,2\}):

Pr[f¯(xk)f(xk)ν,k]\displaystyle Pr\Big{[}\|\overline{\nabla^{\ell}f}(x_{k})-\nabla^{\ell}f(x_{k})\|\leq\nu_{\ell,k}\Big{]} \displaystyle\geq 1t,\displaystyle 1-t, (22)
Pr[|f¯k(xk+sk)f(xk+sk)|ν0,k]\displaystyle Pr\Big{[}|\overline{f}_{k}(x_{k}+s_{k})-f(x_{k}+s_{k})|\leq\nu_{0,k}\Big{]} \displaystyle\geq 1t,\displaystyle 1-t, (23)
Pr[|f¯k(xk)f(xk)|ν0,k]\displaystyle Pr\Big{[}|\overline{f}_{k}(x_{k})-f(x_{k})|\leq\nu_{0,k}\Big{]} \displaystyle\geq 1t,\displaystyle 1-t, (24)

provided that

|𝒟k,j|min{N,4κf,0(x)ν0,k(2κf,0(x)ν0,k+13)log(2t)},|\mathcal{D}_{k,j}|\geq\min\left\{N,\left\lceil\frac{4\kappa_{f,0}(x)}{\nu_{0,k}}\left(\frac{2\kappa_{f,0}(x)}{\nu_{0,k}}+\frac{1}{3}\right)\log\left(\frac{2}{t}\right)\right\rceil\right\}, (25)
x={xk,ifj=1,xk+sk,ifj=2,x=\left\{\begin{array}[]{ll}{}x_{k},&\mbox{if}~{}j=1,\\ {}x_{k}+s_{k},&\mbox{if}~{}j=2,\end{array}\right.
|𝒢k|min{N,4κf,1(xk)ν1,k(2κf,1(xk)ν1,k+13)log(n+1t)},|\mathcal{G}_{k}|\geq\min\left\{N,\left\lceil\frac{4\kappa_{f,1}(x_{k})}{\nu_{1,k}}\left(\frac{2\kappa_{f,1}(x_{k})}{\nu_{1,k}}+\frac{1}{3}\right)\log\left(\frac{n+1}{t}\right)\right\rceil\right\},\ (26)
|k|min{N,4κf,2(xk)ν2,k(2κf,2(xk)ν2,k+13)log(2nt)}.|\mathcal{H}_{k}|\geq\min\left\{N,\left\lceil\frac{4\kappa_{f,2}(x_{k})}{\nu_{2,k}}\left(\frac{2\kappa_{f,2}(x_{k})}{\nu_{2,k}}+\frac{1}{3}\right)\log\left(\frac{2n}{t}\right)\right\rceil\right\}. (27)

The nonnegative constants $\{\kappa_{f,\ell}\}_{\ell=0}^{2}$ in (25)–(27) should be such that (see, e.g., [6]), for $x\in\mathbb{R}^{n}$ and all $\ell\in\{0,1,2\}$, $\max_{i\in\{1,...,N\}}\|\nabla^{\ell}f_{i}(x)\|\leq\kappa_{f,\ell}(x)$. Since their estimation can be challenging, we consider here a constant $\kappa\stackrel{{\rm def}}{{=}}\kappa_{f,\ell}$, for all $\ell\in\{0,1,2\}$, setting its value experimentally in order to control the growth of the sample sizes (25)–(27) throughout the run of the algorithm. In particular, $\kappa=8\cdot 10^{-4}$ and $\kappa=3\cdot 10^{-2}$ are considered by $IAR_{1}$ on the GISETTE and MNIST-B dataset, respectively. Moreover, in (25)-(27), we used $t=0.2$.
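Since the lower bounds (25)–(27) share the same form, a small helper (ours) makes their computation explicit; `log_term` stands for $\log(2/t)$, $\log((n+1)/t)$ or $\log(2n/t)$ in the function, gradient and Hessian case, respectively.

```python
import math

def bernstein_sample_size(N, kappa, nu, log_term):
    """Sample size prescribed by (25)-(27):
       min{ N, ceil( (4 kappa / nu) * (2 kappa / nu + 1/3) * log_term ) }."""
    ratio = kappa / nu
    return min(N, math.ceil(4.0 * ratio * (2.0 * ratio + 1.0 / 3.0) * log_term))

# e.g., gradient sample size (26):
# size_Gk = bernstein_sample_size(N, kappa, nu_1k, math.log((n + 1) / t))
```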

We stress that (12)-(14) amount to setting $\nu_{0,k}=\omega_{k}\overline{\Delta T}_{f,p}(x_{k},s_{k})$ and $\nu_{\ell,k}=\Omega_{k}\frac{\overline{\Delta T}_{k,\min}}{6\tau_{k}^{\ell}}$, for $\ell\in\{1,...,p\}$. The computation of both $\overline{\Delta T}_{f,p}(x_{k},s_{k})$ and $\overline{\Delta T}_{k,\min}$ requires the knowledge of the step $s_{k}$, which is available at Step 3 when the function estimators are built, but not yet at Step 1 when the approximations of the gradient (and the Hessian) have to be computed. Thus, a practical implementation of Step 1 calls for an inner loop where the accuracy requirement is progressively reduced in order to meet (14). This process terminates since, for sufficiently small accuracy requirements, the sample sizes in (26)-(27) reach the full sample $N$. To make this point clear, a detailed description of Steps 1 and 2 of the $IAR_{qp}$ algorithm with $q=p=1$, renamed for brevity $IAR_{1}$, is given below.

Algorithm 2 Detailed Steps 1–2 of the IAR1IAR_{1} algorithm.

Step 1: Model definition.

Initialise ε1,0=κε\varepsilon_{1,0}=\kappa_{\varepsilon} and set i=0i=0. Do

  1. compute $\overline{\nabla f}(x_{k})$ with $\|\overline{\nabla f}(x_{k})-\nabla f(x_{k})\|\leq\varepsilon_{1,i}$;

  2. if $\varepsilon_{1,i}\leq\omega\|\overline{\nabla f}(x_{k})\|$, go to Step 2;

  3. set $\varepsilon_{1,i+1}=\gamma_{\varepsilon}\varepsilon_{1,i}$, increment $i$ by one and return to item 1 in this enumeration.

Step 2: Step calculation.

Set

sk=f¯(xk)/σk,ΔT¯f,1(xk,sk)=f¯(xk)2σk.s_{k}=-\overline{\nabla f}(x_{k})/\sigma_{k},\quad\overline{\Delta T}_{f,1}(x_{k},s_{k})=\frac{\|\overline{\nabla f}(x_{k})\|^{2}}{\sigma_{k}}.
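A minimal Python sketch (ours) of the accuracy loop in Step 1 above; `grad_estimate(x, eps)` is an assumed helper returning a subsampled gradient accurate to within `eps` (e.g., with the sample size (26)), and the defaults reproduce the values $\kappa_{\varepsilon}=\gamma_{\varepsilon}=0.5$ reported below.

```python
import numpy as np

def build_inexact_gradient(x_k, omega_k, grad_estimate, kappa_eps=0.5, gamma_eps=0.5):
    """Step 1 of Algorithm 2 (a sketch): tighten the requested accuracy until
       eps <= omega_k * ||approximate gradient|| holds for the returned estimate."""
    eps = kappa_eps
    for _ in range(60):                      # safeguard; the loop ends at the latest
        g_bar = grad_estimate(x_k, eps)      # once the full sample is used (Sec. IV-A)
        if eps <= omega_k * np.linalg.norm(g_bar):
            break
        eps *= gamma_eps                     # gamma_eps < 1: tighten the accuracy
    return g_bar, eps
```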

Regarding the $IAR_{qp}$ algorithm with $q=1$ and $p=2$, hereafter called $IAR_{2}$, a loop similar to that in Step 1 of Algorithm 2 is needed, which also involves the step computation. The approximate gradient and Hessian are computed via (20) and (21), using predicted accuracy requirements $\varepsilon_{1,0}$ and $\varepsilon_{2,0}$, respectively. Then, the step $s_{k}$ is computed and, if (14) is not satisfied, the accuracy requirements are progressively decreased and the step is recomputed until (14) is satisfied for the first time. The values $\kappa_{\epsilon}=\gamma_{\epsilon}=0.5$ are used to initialise and decrease the accuracy requirements in both the $IAR_{1}$ and $IAR_{2}$ algorithms.

As one may easily see, the step $s_{k}$ taken in the $IAR_{1}$ algorithm (see the form of the model (7) and Step 2 in Algorithm 2) is simply the global minimiser of the model. Therefore, for any successful iteration $k$:

xk+1=xk+sk=xk1σkf¯(xk),x_{k+1}=x_{k}+s_{k}=x_{k}-\frac{1}{\sigma_{k}}\overline{\nabla f}(x_{k}), (28)

which can be viewed as the general iteration of a gradient descent method under inexact gradient evaluations with an adaptive choice of the learning rate, given by the inverse of the quadratic regulariser $\sigma_{k}>0$ (i.e., minus the coefficient of $\overline{\nabla f}(x_{k})$ in (28)).

On the other hand, in the $IAR_{2}$ procedure, an approximate minimiser $s_{k}$ of the cubic model $m_{k}$ in (7) has to be computed. Such an approximate minimiser has to satisfy $m_{k}(s_{k})\leq m_{k}(0)$ and (9), i.e. $\|\nabla_{s}m_{k}(s_{k})\|\leq\theta\epsilon_{1}$. In our implementation, we used $\theta=0.5$, $\epsilon_{1}=10^{-3}$, and the computation of $s_{k}$ has been carried out using the Barzilai-Borwein method [34] with a non-monotone line-search following [25, Algorithm 1] and using parameter values as suggested therein. The Hessian-vector product required at each iteration of the Barzilai-Borwein procedure (to evaluate the model gradient) is approximated via finite differences of $\overline{\nabla f}$.
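The following sketch (ours) illustrates this model minimisation: a Barzilai-Borwein iteration with a simple non-monotone backtracking safeguard applied to the cubic model in (7), accessing the Hessian only through the (possibly finite-difference) product `hess_vec`; the steplength safeguards and line-search constants are illustrative and do not reproduce [25, Algorithm 1] exactly.

```python
import numpy as np

def minimise_cubic_model(g_bar, hess_vec, sigma, tol, max_it=500, memory=10):
    """Approximately minimise m(s) = g_bar^T s + 0.5 s^T H s + (sigma/6)||s||^3,
       starting from s = 0 (so m_k(s_k) <= m_k(0) holds), until ||grad m(s)|| <= tol."""
    def m_val(s):
        return g_bar @ s + 0.5 * s @ hess_vec(s) + sigma / 6.0 * np.linalg.norm(s) ** 3

    def m_grad(s):
        return g_bar + hess_vec(s) + sigma * np.linalg.norm(s) * s

    s = np.zeros_like(g_bar)
    grad, alpha, history = m_grad(s), 1.0, [m_val(s)]
    for _ in range(max_it):
        if np.linalg.norm(grad) <= tol:
            break
        d, t = -grad, alpha
        ref = max(history[-memory:])                   # non-monotone reference value
        while m_val(s + t * d) > ref + 1e-4 * t * (grad @ d):
            t *= 0.5                                   # backtracking line search
        s_new = s + t * d
        grad_new = m_grad(s_new)
        ds, dy = s_new - s, grad_new - grad
        alpha = np.clip((ds @ ds) / max(ds @ dy, 1e-12), 1e-10, 1e10)  # BB1 steplength
        s, grad = s_new, grad_new
        history.append(m_val(s))
    return s
```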

To assess the computational cost of the $IAR_{1}$ and $IAR_{2}$ algorithms, we follow the approach in [35] for the definition of the cost measure. Specifically, at the generic iteration $k$, we count the $N$ forward propagations needed to evaluate the objective in (18) at $x_{k}$ as a unit Cost Measure (CM), while the evaluation of the approximate gradient at the same point requires an extra cost of $\left(2|\mathcal{G}_{k}\smallsetminus(\mathcal{G}_{k}\cap\mathcal{D}_{k,1})|+|\mathcal{G}_{k}\cap\mathcal{D}_{k,1}|\right)/N$ CM. Moreover, at each iteration of the Barzilai-Borwein procedure of the $IAR_{2}$ algorithm, the approximation of the Hessian-vector product by finite differences calls for the computation of $\overline{\nabla f}(x_{k}+hv)$ ($h\in\mathbb{R}^{+}$), leading to additional $|\mathcal{H}_{k}|$ forward and backward propagations, at the price of the weighted cost $2|\mathcal{H}_{k}|/N$ CM, and a potential extra cost of $|\mathcal{H}_{k}\smallsetminus(\mathcal{G}_{k}\cap\mathcal{H}_{k})|/N$ CM to compute $\nabla f_{i}(x_{k})$, for $i\in\mathcal{H}_{k}\smallsetminus(\mathcal{G}_{k}\cap\mathcal{H}_{k})$. A budget of 80 and 100 CM has been considered as the stopping rule for our runs on the MNIST-B and GISETTE datasets, respectively.
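A small helper (ours) making the gradient cost accounting explicit; `G_k` and `D_k1` are the index sets $\mathcal{G}_{k}$ and $\mathcal{D}_{k,1}$ as Python sets, and the interpretation in the comments is our reading of the weighting above.

```python
def gradient_extra_cost(G_k, D_k1, N):
    """Extra CM charged for the subsampled gradient at x_k:
       (2 * |G_k minus (G_k intersect D_{k,1})| + |G_k intersect D_{k,1}|) / N,
       i.e. (in our reading) a forward and a backward pass for each index whose
       forward pass was not already charged in D_{k,1}, and a backward pass only
       for the shared indices."""
    shared = len(G_k & D_k1)
    return (2 * (len(G_k) - shared) + shared) / N
```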
We now solve the binary classification task described at the beginning of Section IV on the MNIST-B and GISETTE datasets, described in Table I. We test the $IAR_{1}$ algorithm on both datasets without net, while different ANN structures are considered for MNIST-B via $IAR_{1}$. A comparison between $IAR_{2}$ and $IAR_{1}$ is performed on the GISETTE dataset. The accuracy, in terms of the binary classification rate on the testing set, for each of the tested methods and ANN structures, is reported in Table II, averaging over 20 runs.

Dataset  | Training $N$ | $d$  | Testing $N_T$
GISETTE  | 4800         | 5000 | 1200
MNIST-B  | 60000        | 784  | 10000
TABLE I: MNIST-B and GISETTE datasets. Size of the training set ($N$), problem dimension ($d$) and size of the testing set ($N_T$).
Dataset  | $IAR_1$ (no net) | $IAR_1$, shallow net $(15)$ | $IAR_1$, 2-hidden-layer net $(15,2)$ | $IAR_2$ (no net)
GISETTE  | 87.75%           |                             |                                      | 94.67%
MNIST-B  | 87.37%           | 88.42%                      | 89.23%                               |
TABLE II: MNIST-B and GISETTE datasets. List of the considered methods and architectures with the binary classification rate on the testing set; mean values over 20 runs.

We immediately note that, for the series of tests performed using $IAR_{1}$, adding hidden layers with a small number of neurons seems to be effective, since the accuracy of the binary classification on the testing set increases. In order to avoid overfitting, no net is considered on the GISETTE dataset, because the number of parameters $n=d$ already exceeds the number of training samples $N$ (see Table I) in the case of no hidden layers. We further notice that the second-order procedure provides, for the GISETTE dataset, a higher classification accuracy. Therefore, on this test, adding second-order information is beneficial. To get more insight, in Figures 1–2 and Figure 5 we plot the training and testing loss against CM, while Figures 3–4 and Figure 6 report the sample sizes used by the tested procedures. Concerning the MNIST-B dataset, Figures 1–2 show that the training and testing losses obtained by $IAR_{1}$ decrease to roughly the same values at termination, meaning that the training phase has been particularly effective and thus the method is able to generalise, on the testing data, the ability to predict labels learnt in the training phase.

Figure 1: MNIST-B dataset. Training loss (left) and testing loss (right) by means of $IAR_1$ without net, against CM.

Figure 2: MNIST-B dataset. Each row corresponds to the performance of $IAR_1$ with a different ANN architecture: shallow net $(15)$ (first row) and 2-hidden-layer net $(15,2)$ (second row). Training loss (left) and testing loss (right) against CM.

Figures 3–4, reporting the adaptive sample sizes (in percentage) used for the function and gradient estimates against CM within the $IAR_{1}$ algorithm, show that the sample size for estimating the objective function oscillates for a while, eventually settling at the full sample size. The same is true for the corresponding gradient estimates, even if the growth toward the full sample is slower and the trend often remains below this threshold, especially for the MNIST-B dataset with ANN.

Figure 3: MNIST-B dataset. Sample sizes used by $IAR_1$ with no net, against CM.

Figure 4: MNIST-B dataset. Each column corresponds to the sample sizes used by $IAR_1$ with a different ANN architecture against CM: shallow net $(15)$ (left) and 2-hidden-layer net $(15,2)$ (right).

Moving to the GISETTE dataset, Figure 5 makes clear that the use of second-order information enables the method to obtain lower values of the training and testing loss with the same computational effort, and this yields higher classification accuracy. We also note (see Figure 6) that $IAR_{2}$ only occasionally employs 75% of the samples to approximate the Hessian matrix, and in most of the iterations the Hessian approximation is obtained by averaging over less than 20% of the samples.

Figure 5: GISETTE dataset. Training loss (top) and testing loss (bottom) by means of $IAR_1$ and $IAR_2$ without net, against CM.

Figure 6: GISETTE dataset. Sample sizes used by $IAR_1$ and $IAR_2$ against CM.

V Conclusions and perspectives

The final expected bounds in (17) are sharp in the order of the tolerances $\epsilon_{1},\epsilon_{2}$. The effect of inaccurate evaluations is thus limited to scaling the optimal complexity that we would otherwise derive from the deterministic analysis (see, e.g., Theorem 5.5 in [16] and Theorem 4.2 in [5]) by a factor which depends on the probability $p_{*}$ of the model being accurate. Finally, promising preliminary numerical results on nonconvex finite-sum binary classification for imaging datasets confirm the expected behaviour of the method. From a theoretical perspective, the inclusion of inexact function evaluations subject to random noise is at the moment an open and challenging issue.

Acknowledgment

INdAM-GNCS partially supported the first, second and third authors under Progetti di Ricerca 2019 and 2020.

References

  • [1] Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., Ma, T.: Finding approximate local minima faster than gradient descent. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1195–1199 (2017).
  • [2] Bellavia, S., Bianconcini, T., Krejić, N., Morini, B.: Subsampled first-order optimization methods with applications in imaging. In: Handbook of Mathematical Models and Algorithms in Computer Vision and Imaging. Springer, to appear.
  • [3] Bellavia, S., Gurioli, G., Morini, B., Toint, Ph. L.: High-order Evaluation Complexity of a Stochastic Adaptive Regularization Algorithm for Nonconvex Optimization Using Inexact Function Evaluations and Randomly Perturbed Derivatives. arXiv:2005.04639 (2021).
  • [4] Bellavia, S., Gurioli, G.: Complexity Analysis of a Stochastic Cubic Regularisation Method under Inexact Gradient Evaluations and Dynamic Hessian Accuracy. To appear, DOI:10.1080/02331934.2021.1892104 (2021).
  • [5] Bellavia, S., Gurioli, G., Morini, B. Adaptive Cubic Regularization Methods with Dynamic Inexact Hessian Information and Applications to Finite-Sum Minimization. IMA Journal of Numerical Analysis 41(1), 764–799 (2021).
  • [6] Bellavia, S., Gurioli, G., Morini, B., Toint, Ph. L.: Adaptive Regularization Algorithms with Inexact Evaluations for Nonconvex Optimization. SIAM Journal on Optimization 29(4), 2881–2915 (2019).
  • [7] Berahas, A. S., Bollapragada, R., Nocedal, J.: An investigation of Newton-sketch and subsampled Newton methods. Optimization Methods and Software 35(4), 1–20 (2020).
  • [8] Berahas, A., Cao, L., Scheinberg, K.: Global convergence rate analysis of a generic line search algorithm with noise. arXiv:1910.04055 (2019).
  • [9] Bertocchi, C., Chouzenoux, E., Corbineau, M. C., Pesquet, J. C., Prato, M.: Deep unfolding of a proximal interior point method for image restoration. Inverse Problems 36, 034005 (2020).
  • [10] Bianconcini, T., Liuzzi, G., Morini, B., Sciandrone, M.: On the use of iterative methods in cubic regularization for unconstrained optimization. Computational Optimization and Applications 60(1), 35–57 (2015).
  • [11] Birgin, E. G., Gardenghi, J. L., Martínez, J. M., Santos, S. A., Toint, Ph. L.: Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Mathematical Programming Ser. A 163(1–2), 359–368 (2017).
  • [12] Blanchet, J., Cartis, C., Menickelly, M., Scheinberg, K.: Convergence rate analysis of a stochastic trust region method via supermartingales. INFORMS Journal on Optimization 1(2), 92–119 (2019).
  • [13] Bottou, L. , Curtis, F. E., Nocedal, J.: Optimization Methods for Large-Scale Machine Learning. SIAM Review 60(2), 223–311 (2018).
  • [14] Carmon, Y., Duchi, J. C.: Gradient descent efficiently finds the cubic-regularized non-convex Newton step. arXiv preprint arXiv:1612.00547 (2016).
  • [15] Carmon, Y., Duchi, J. C., Hinder, O., Sidford, A.: Accelerated methods for nonconvex optimization. SIAM Journal on Optimization 28(2), 1751–1772 (2018).
  • [16] Cartis, C., Gould, N. I. M., Toint, Ph. L.: Strong evaluation complexity bounds for arbitrary-order optimization of nonconvex nonsmooth composite functions. arXiv:2001.10802 (2020).
  • [17] Cartis, C., Gould, N. I. M., Toint, Ph. L.: Sharp worst-case evaluation complexity bounds for arbitrary-order nonconvex optimization with inexpensive constraints. SIAM Journal on Optimization 30(1), 513–541 (2020).
  • [18] Cartis, C., Scheinberg, K.: Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Mathematical Programming Ser. A 159(2), 337–375 (2018).
  • [19] Cartis, C., Gould, N. I. M., Toint, Ph. L.: Complexity bounds for second-order optimality in unconstrained optimization. J. Complex. 28(1), 93–108 (2012).
  • [20] Cartis, C., Gould, N. I. M., Toint, Ph. L. : An adaptive cubic regularisation algorithm for nonconvex optimization with convex constraints and its function-evaluation complexity. IMA Journal of Numerical Analysis 32(4), 1662–1695 (2012).
  • [21] Cartis, C., Gould, N. I. M., Toint, Ph. L.: Adaptive cubic overestimation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Mathematical Programming Ser. A 127, 245–295 (2011).
  • [22] Cartis, C., Gould, N. I. M., Toint, Ph. L.: Adaptive cubic overestimation methods for unconstrained optimization. Part II: worst-case function and derivative-evaluation complexity. Mathematical Programming Ser. A 130(2), 295–319 (2011).
  • [23] Chen, X., Jiang, B., Lin, T., Zhang, S.: On Adaptive Cubic Regularized Newton’s Methods for Convex Optimization via Random Sampling. arXiv:1802.05426 (2018).
  • [24] Chen, R., Menickelly, M., Scheinberg, K.: Stochastic optimization using a trust-region method and random models. Mathematical Programming Ser. A 169(2), 447–487 (2018).
  • [25] di Serafino, D., Ruggiero, V., Toraldo, G., Zanni, L.: On the steplength selection in gradient methods for unconstrained optimization. Applied Mathematics and Computation 318, 176–195 (2018).
  • [26] Goodfellow, I., Bengio, Y., Courville, A.: Deep learning, MIT press (2016).
  • [27] Kohler, J. M., Lucchi, A.: Sub-sampled cubic regularization for non-convex optimization. Proceedings of the 34th International Conference on Machine Learning 70, 1895–1904 (2017).
  • [28] Jin, K. H., McCann, M. T., Froustey, E., Unser, M.: Deep convolutional neural network for inverse problems in imaging. IEEE Transactions on Image Processing 26, 4509–4522 (2017).
  • [29] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324 (1998).
  • [30] M. Lichman. UCI machine learning repository. https://archive.ics.uci.edu/ml/index.php (2013).
  • [31] McCann, M. T., Jin, K. H., Unser, M.: Convolutional neural networks for inverse problems in imaging: a review. IEEE Signal Processing Magazine 34, 85–95 (2017).
  • [32] Paquette, C., Scheinberg, K.: A stochastic line search method with convergence rate analysis. SIAM Journal on Optimization 30(1), 349–376 (2020).
  • [33] Pearlmutter, B. A.: Fast exact multiplication by the Hessian. Neural computation 6(1), 147–160 (1994).
  • [34] Raydan, M.: The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM Journal on Optimization 7(1), 26–33 (1997).
  • [35] Xu, P., Roosta-Khorasani, F., Mahoney, M. W.: Second-Order Optimization for Non-Convex Machine Learning: An Empirical Study. In: Proceedings of the 2020 SIAM International Conference on Data Mining, pp.199–207 (2020).
  • [36] Xu, P., Roosta-Khorasani, F., Mahoney, M. W.: Newton-type methods for non-convex optimization under inexact Hessian information. Mathematical Programming 184, 35–70 (2020).
  • [37] Yao, Z., Xu, P., Roosta-Khorasani, F., Mahoney, M. W.: Inexact Non-Convex Newton-type Methods. arXiv:1802.06925 (2018).
  • [38] Schraudolph, N. N.: Fast curvature matrix-vector products for second-order gradient descent. Neural computation 14(7), 1723–1738 (2002).
  • [39] Shanmugamani, R., Deep Learning for Computer Vision: Expert techniques to train advanced neural networks using TensorFlow and Keras, Packt Publishing (2018).
  • [40] Tropp, J.: An Introduction to Matrix Concentration Inequalities. Number 8,1-2 in Foundations and Trends in Machine Learning. Now Publishing, Boston, USA, 2015.
  • [41] Zhou, D., Xu, P., Gu, Q.: Stochastic Variance-Reduced Cubic Regularization Methods. Journal of Machine Learning Research 20, 1–47 (2019).