
Optimization and Learning With Nonlocal Calculus

Sriram Nagaraj
sriram.nagaraj.atl@gmail.com
With the Federal Reserve Bank of Atlanta. Any views or opinions expressed in this article are those of the author alone and not necessarily those of the Federal Reserve Bank of Atlanta or the Federal Reserve System. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. Declarations of interest: none.
Abstract

Nonlocal models have recently had a major impact in nonlinear continuum mechanics and are used to describe physical systems/processes which cannot be accurately described by classical, calculus based “local” approaches. In part, this is due to their multiscale nature that enables aggregation of micro-level behavior to obtain a macro-level description of singular/irregular phenomena such as peridynamics, crack propagation, anomalous diffusion and transport phenomena. At the core of these models are nonlocal differential operators, including nonlocal analogs of the gradient/Hessian. This paper initiates the use of such nonlocal operators in the context of optimization and learning. We define and analyze the convergence properties of nonlocal analogs of (stochastic) gradient descent and Newton’s method on Euclidean spaces. Our results indicate that as the nonlocal interactions become less noticeable, the optima corresponding to nonlocal optimization converge to the “usual” optima. At the same time, we argue that nonlocal learning is possible in situations where standard calculus fails. As a stylized numerical example of this, we consider the problem of non-differentiable parameter estimation on a non-smooth translation manifold and show that our nonlocal gradient descent recovers the unknown translation parameter from a non-differentiable objective function.

Keywords: Nonlocal operators, optimization, machine learning

1 Introduction

Nonlocality is an important paradigm in understanding the behavior of a system $X$ undergoing spatio-temporal or structural changes, particularly when such changes are abrupt or singular. A system exhibits nonlocality if the changes occurring at some point $x\in X$ are influenced by the interactions between $x$ and its neighbors. In contrast, a local phenomenon is one where changes at $x$ depend solely on $x$. In order to perform predictive analysis on nonlocal systems, an important step is to develop nonlocal analytic/computational tools, i.e., nonlocal calculus and corresponding nonlocal descriptive models.

A generic description of a nonlocal operator is a mapping $F$ from some topological space $X$ into a set $Y$ such that the value of $F$ at a point $x\in X$ depends not only on $x$ but also on points in some neighborhood $U(x)$. In contrast, for a local operator $L$, its value at $x$ depends solely on $x\in X$. Some of these nonlocal analyses lend themselves to a multiscale viewpoint, where instead of a single nonlocal operator $F:X\rightarrow Y$, we have a family of such operators $F_{n}:X\rightarrow Y$ indexed by a scale parameter $n\in\mathbb{R}$. A common theme is the convergence of the nonlocal operators $F_{n}$ to a local counterpart, under suitable assumptions. In other words, the scale parameter $n$ controls the nonlocality in such a way that, asymptotically, as $n\rightarrow\infty$, the nonlocal operators $F_{n}\rightarrow F^{*}$, where $F^{*}:X\rightarrow Y$ is a local operator. The exact nature of convergence, i.e., the topology of the function spaces in question, naturally has an important role to play in such a situation.

The prototypical class of nonlocal operators is the class of integral operators. If $\Omega$ is an open subset of a Euclidean space, and $f:\Omega\rightarrow\mathbb{R}$ a suitable function, then we can consider the operator

$$(Tf)(x)=\int_{\Omega}f(y)K(x,y)\,dy \tag{1}$$

for an appropriate kernel $K(\cdot,\cdot):\Omega\times\Omega\rightarrow\mathbb{R}$. A generic enough $K$ can capture the nonlocal interaction between points of $\Omega$, or, said differently, the modeler has the flexibility of choosing the kernel $K$ generic enough to model the phenomena under investigation. We also can consider a sequence of such kernels $K_{n}$ which give rise to the integral operators

$$(T_{n}f)(x)=\int_{\Omega}f(y)K_{n}(x,y)\,dy. \tag{2}$$

Assuming that the $K_{n}$ converge, in some sense, to a kernel $K$, we can conclude, with reasonable assumptions, the convergence of the corresponding operators $T_{n}$ to $T$ (see [6]). In case $K(x,y)=\delta(x-y)$, where $\delta(\cdot)$ is the Dirac mass at the origin, we have, at least formally, that the $T_{n}$ converge to the identity, i.e., $Tf(x)=f(x)$. This observation underpins most of our further discussions.
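As a concrete illustration of this localization, the following minimal sketch (ours, not from the original text) convolves a test function with Gaussian kernels $K_{n}(x,y)=\rho_{n}(x-y)$ of shrinking width and checks numerically that $T_{n}f\rightarrow f$ on a grid; the specific kernel scaling (variance $1/n$) is an assumption made for the example.

```python
import numpy as np

# Test function on Omega = (0, 1) and a grid for the quadrature.
xs = np.linspace(0.0, 1.0, 2001)
f = np.sin(2 * np.pi * xs) + 0.3 * xs

def T_n(f_vals, n):
    """Apply (T_n f)(x) = int f(y) K_n(x, y) dy with K_n(x, y) = rho_n(x - y),
    a Gaussian of variance 1/n (an assumed scaling), by trapezoidal quadrature."""
    diff = xs[:, None] - xs[None, :]                       # x - y on the grid
    K = np.sqrt(n / (2 * np.pi)) * np.exp(-n * diff**2 / 2)
    return np.trapz(K * f_vals[None, :], xs, axis=1)

for n in (10, 100, 1000, 10000):
    err = np.max(np.abs(T_n(f, n)[200:-200] - f[200:-200]))  # ignore boundary truncation
    print(f"n={n:6d}   max interior error |T_n f - f| = {err:.4f}")
```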

While classical, differential operator (gradient, Hessian etc.) based methods work well under benign conditions, they cannot be used with singular domains and/or functions, since partial derivatives with the required regularity do not exist. The nonlocal calculus we study here is based on integral operators. The integral operators of nonlocal models come equipped with so-called “interaction kernels" and can be defined for irregular domains or non-differentiable functions. As such, they do not suffer from singularity issues to the extent of their differential counterparts ([3, 22, 19]) and have been applied in practical problems with great success. In addition to being applicable to singular problems, nonlocal operators can be shown to converge to their local counterparts under suitably natural conditions: nonlocal gradients (resp. Hessians) converge to the usual gradient (resp. Hessian) when the latter exists (see [10, 35, 30]).

1.1 Contributions

Our primary contributions are the theory and algorithmic applications of nonlocal differential operators, mainly nonlocal first and second order (gradient and Hessian) operators to perform optimization tasks. Specifically, we develop nonlocal analogs of gradient (including stochastic) descent, as well as Newton’s method. We establish convergence results for these methods that show how the optima obtained from nonlocal optimization converge to the standard gradient/Hessian based optima, as the nonlocal interaction effects become less pronounced. While our definition of nonlocal optimization methods is valid for irregular problems, our analysis will assume the required regularity of the objective functions in order to compare the local and nonlocal approaches. As a stylized numerical application of our nonlocal learning framework, we apply our nonlocal gradient descent procedure to the problem of parameter estimation on a non-differentiable “translation" manifold as studied via multiscale smoothing in [51] and [50]. In our framework, this problem can be solved without resorting to a multiscale analysis of the manifold; instead, the nonlocal gradient operators we use can be viewed as providing the required smoothing. While we have emphasized the use of nonlocal operators to non-differentiable optimization as an alternative to traditional subgradient methods ([43]), the scope of the theory is much wider. Indeed, one may use nonlocal methods to model long range interactions ([37]), phase transitions ([15]) etc.

1.2 Prior Work

Nonlocal models have, in recent years, been fruitfully deployed in a wide variety of application areas: in image processing ([25]) nonlocal operators are used for total variation (TV) inpainting and also image denoising ([11, 33, 32, 52]) with the nonlocal means framework. Computational/engineering mechanics has profited greatly in recent years with a plethora of nonlocal models. Plasticity theory ([8]) and fracture mechanics ([27, 21]) are two prominent examples. Indeed, the field of peridynamics (see [46, 45, 39]) is a prime consumer of nonlocal analysis. Nonlocal approaches have also featured in crowd dynamics [14], and cell biology ([1]). A nonlocal theory of vector calculus has been developed along with several numerical implementations (see [20, 4, 18]). The nonlocal approach has also found its way into deep learning ([49]). On the theoretical side, nonlocal and fractional differential operators have been studied in the context of nonlocal diffusion equations ([5] and references therein) as well as purely in the context of nonlocal gradient/Hessian operators ([35, 30, 10, 9, 29]). The use of fractional calculus via Caputo derivatives and Riemann-Liouville fractional integrals for gradient descent (GD) and backpropagation has been considered in [54, 38, 12, 42, 53]. Our work uses instead the related notion of nonlocal calculus from [35, 30, 10, 18] as its basis for developing not only a nonlocal (stochastic) GD, but also a nonlocal Newton’s method, and is backed by rigorous convergence theory. In addition, our nonlocal stochastic GD approach interprets, in a very natural way, the nonlocal interaction kernel as a conditional probability density. The multiple definitions of fractional and nonlocal calculi are related (see for e.g. [16]).

1.3 Notation and Review

In all that follows, $\Omega\subset\mathbb{R}^{D}$ will denote a simply connected bounded domain (i.e., a bounded, connected, simply connected, open subset of $\mathbb{R}^{D}$) with a smooth, compact boundary $\partial\Omega$, where $D$ is a fixed positive integer. $\Omega$ will be endowed with the subspace topology of $\mathbb{R}^{D}$. We will consistently denote by $x,y,z$ points in $\Omega$. Subscripts of vectorial (resp. tensorial) quantities will denote the corresponding components. For example, $x_{i}$ or $A_{i,j}$ will denote the $i^{\text{th}}$ component of the vector $x$ or the $(i,j)^{\text{th}}$ entry of the matrix $A$, respectively, for $i,j=1,\ldots,D$. For vectors $x,y\in\mathbb{R}^{D}$, the notation $x\otimes y\in\mathbb{R}^{D\times D}$ denotes the outer product of $x$ and $y$. The letters $i,j,k,n,m,N$ will denote positive integers while $t,s,p,\epsilon,\delta,M,K$ denote arbitrary real quantities. By $B_{\epsilon}(x)$, we mean an open ball centered at $x$ with radius $\epsilon>0$. Furthermore, $u,v,f,g,\rho:\Omega\rightarrow\mathbb{R}$ will in general denote real-valued functions with domain $\Omega$. All integrals will be in the sense of Lebesgue, and a.e. means almost every or almost everywhere with respect to Lebesgue measure. Given an arbitrary collection of subsets $\mathcal{S}$ of $\mathbb{R}^{D}$, the $\sigma$-algebra generated by $\mathcal{S}$ will be denoted by $\sigma(\mathcal{S})$.

Given $u:\Omega\rightarrow\mathbb{R}$, we define the support $\text{supp}(u)\subset\Omega$ as

$$\text{supp}(u)=\overline{\{x\in\Omega:|u(x)|>0\ \text{a.e.}\}},$$

where the overbar denotes closure in $\mathbb{R}^{D}$. A multi-index $\bar{n}=(n_{1},\ldots,n_{D})$ is an ordered $D$-tuple of non-negative integers $n_{i}$, $i=1,\ldots,D$, and we let $|\bar{n}|:=\sum_{i=1}^{D}n_{i}$. For a suitably regular function $u$, we denote by

$$\frac{\partial^{\bar{n}}u}{\partial x^{\bar{n}}}:=\frac{\partial^{n_{1}}}{\partial x_{1}^{n_{1}}}\,\frac{\partial^{n_{2}}}{\partial x_{2}^{n_{2}}}\cdots\frac{\partial^{n_{D}}u}{\partial x_{D}^{n_{D}}}.$$

If we consider all multi-indices $\bar{n}$ with $|\bar{n}|=1$, we can identify $\frac{\partial^{\bar{n}}u}{\partial x^{\bar{n}}}$ with the usual gradient $\nabla u$ of $u$. In this case, we denote the $i^{\text{th}}$ component $\frac{\partial u}{\partial x_{i}}$ of $\nabla u$ as $u_{x_{i}}$ for $i=1,\ldots,D$. Likewise, the Hessian $Hu$ of $u$ at a point $x\in\Omega$ is the $D\times D$ matrix $(Hu(x))_{i,j}=\frac{\partial^{2}u}{\partial x_{i}\partial x_{j}}(x)$. We shall use the following function spaces:

$$\begin{aligned}
L^{p}(\Omega,\mathbb{R}^{k})&=\Big\{u:\Omega\rightarrow\mathbb{R}^{k}:\int_{\Omega}|u|^{p}<\infty\Big\},\quad 1\leq p<\infty,\\
L^{\infty}(\Omega)&=\{u:\Omega\rightarrow\mathbb{R}:\text{ess sup}(u)<\infty\},\ \text{where ess sup}(u):=\text{essential supremum of }u,\\
W^{m,p}(\Omega)&=\Big\{u\in L^{p}(\Omega):\frac{\partial^{\bar{n}}u}{\partial x^{\bar{n}}}\in L^{p}(\Omega),\ 0\leq|\bar{n}|\leq m,\ 1\leq p\leq\infty\Big\},\\
C^{m}(\Omega)&:=\{u:\Omega\rightarrow\mathbb{R}:u\text{ is }m\text{-times continuously differentiable}\},\\
C^{m}_{c}(\Omega)&:=\{u:\Omega\rightarrow\mathbb{R}:u\in C^{m}(\Omega)\text{ with compact support}\},\\
Lip(\Omega,M)&:=\{u:\Omega\rightarrow\mathbb{R}:u\text{ is Lipschitz continuous with constant }M\}.
\end{aligned} \tag{3}$$

We let $L^{p}(\Omega,\mathbb{R}^{1}):=L^{p}(\Omega)$. Note that each of the above spaces $\mathcal{X}$ is a normed linear space. In particular, $H^{1}(\Omega):=W^{1,2}(\Omega)$ is a Hilbert space with the inner product $(u,v)_{H^{1}(\Omega)}=(u,v)_{L^{2}(\Omega)}+(\nabla u,\nabla v)_{L^{2}(\Omega)}$. The subspace $H_{0}^{1}(\Omega)$ of $H^{1}(\Omega)$ consists of those $u\in H^{1}(\Omega)$ that vanish (i.e., have zero trace) on the boundary $\partial\Omega$. Note that the space $C_{0}^{1}(\Omega)$ consisting of continuously differentiable functions vanishing on $\partial\Omega$ is dense in $H^{1}_{0}(\Omega)$ (see [2]). Next, $|\cdot|$ denotes the absolute value of real quantities, $\|\cdot\|$ will denote the Euclidean norm of $\mathbb{R}^{D}$, and $\|\cdot\|_{\mathcal{X}}$ will denote the norm in any of the function spaces $\mathcal{X}$ above.

Following [35, 10], we will fix a sequence of radial functions $\rho_{n}:\mathbb{R}^{D}\rightarrow\mathbb{R}$, $n=1,2,\ldots$, which satisfy the following properties:

$$\begin{cases}\rho_{n}\geq 0,\\[2pt] \int_{\mathbb{R}^{D}}\rho_{n}(x)\,dx=1,\\[2pt] \lim_{n\rightarrow\infty}\int_{\|x\|>\delta}\rho_{n}(x)\,dx=0\quad\forall\,\delta>0.\end{cases} \tag{4}$$

Note that by radial we mean that $\rho_{n}(x)=\rho_{n}(Ox)$ for any orthogonal matrix $O:\mathbb{R}^{D}\rightarrow\mathbb{R}^{D}$ with unit determinant. Alternatively, this means that $\rho_{n}(x)=\hat{\rho}_{n}(\|x\|)$ for some function $\hat{\rho}_{n}:\mathbb{R}\rightarrow\mathbb{R}$. We can specify (but will not insist) that the $\rho_{n}$ have compact support or even be of class $C^{\infty}_{c}(\Omega)$. Note that for our analysis of the nonlocal stochastic gradient descent method, we shall interpret the $\rho_{n}$ as a sequence of probability densities on $\mathbb{R}^{D}$ that approach the Dirac density $\delta_{0}$ centered at the origin. Prototypical examples of $\rho_{n}$ are given by sequences of Gaussian densities with decreasing variance (the non-compact support case) and the so-called bump functions with shrinking support and increasing height (the compact support case). In both these cases, we can show that the sequence of $\rho_{n}$ converges as Schwartz distributions to the Dirac density $\delta_{0}$. Details can be found in [23]. Finally, the set of all radial probability densities on $\mathbb{R}^{D}$, i.e., those $\rho:\mathbb{R}^{D}\rightarrow\mathbb{R}$ that satisfy the first two conditions of equation 4, will be denoted by $\mathcal{P}$.
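For concreteness, the following minimal sketch (ours, not part of the original text) builds such a sequence from compactly supported bump functions supported on $(-1/n,1/n)$, an assumed construction, and numerically checks the conditions of equation 4 in one dimension.

```python
import numpy as np

def bump_rho(n, grid=np.linspace(-1.0, 1.0, 200001)):
    """Compactly supported 'bump' density on R (D = 1): supported on (-1/n, 1/n),
    normalized numerically so that it integrates to one (an assumed construction)."""
    r = n * grid
    vals = np.where(np.abs(r) < 1, np.exp(-1.0 / np.clip(1 - r**2, 1e-12, None)), 0.0)
    return grid, vals / np.trapz(vals, grid)

for n in (1, 4, 16):
    x, rho = bump_rho(n)
    mass = np.trapz(rho, x)                               # condition 2: total mass ~ 1
    inner = np.abs(x) <= 0.1
    tail = mass - np.trapz(rho[inner], x[inner])          # condition 3 with delta = 0.1
    print(f"n={n:3d}  peak height={rho.max():8.2f}  mass={mass:.4f}  tail mass={max(tail, 0.0):.2e}")
```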

1.4 Nonlocal Operators

Let us now focus on the main objects of study in this paper, nonlocal differential operators. We note that there is no unique notion of “the" nonlocal gradient (resp. Hessian). Indeed, differing choices of kernels may give rise to differing notions of gradients (resp. Hessians).

1.4.1 First Order Operators

As done in [35], for any radial density $\rho\in\mathcal{P}$, we can consider the linear operators $\frac{\partial_{\rho}}{\partial_{\rho}x_{i}}$ for $i=1,\ldots,D$ defined as:

$$\frac{\partial_{\rho}u}{\partial_{\rho}x_{i}}(x):=D\lim_{\epsilon\rightarrow 0}\int_{\Omega\cap B_{\epsilon}^{c}(x)}\frac{u(x)-u(y)}{\|x-y\|}\,\frac{x_{i}-y_{i}}{\|x-y\|}\,\rho(x-y)\,dy, \tag{5}$$

where the domain of $\frac{\partial_{\rho}}{\partial_{\rho}x_{i}}$ is the set of all functions for which the principal value integral on the right hand side of equation 5 exists (see [35]).

Stacking the $\frac{\partial_{\rho}u}{\partial_{\rho}x_{i}}$ as a vector yields a nonlocal gradient:

$$\nabla_{\rho}u(x):=D\lim_{\epsilon\rightarrow 0}\int_{\Omega\cap B_{\epsilon}^{c}(x)}\frac{u(x)-u(y)}{\|x-y\|}\,\frac{x-y}{\|x-y\|}\,\rho(x-y)\,dy, \tag{6}$$

in other words, $\frac{\partial_{\rho}u}{\partial_{\rho}x_{i}}(x)=(\nabla_{\rho}u(x))_{i}$ for $i=1,\ldots,D$. Purely formally, if we allow $\rho=\delta_{0}$, the Dirac delta distribution centered at $0$, then the “nonlocal” gradient defined above coincides (in the sense of distributions) with the usual “local” gradient. However, we would then be operating in the highly irregular realm of distributions, far removed from the space of functions that are more in line with the applications in mind. Nevertheless, much of the analysis in this paper revolves around the case of a “well-behaved” $\rho$ approximating $\delta_{0}$, so that $\nabla_{\rho}u$ approximates the true gradient $\nabla u$.

If $u\in C^{1}(\bar{\Omega})$, or more generally, as [35] note (see also [10]), if $u\in W^{1,p}(\Omega)$, the map

$$y\mapsto\frac{|u(y)-u(x)|}{\|x-y\|}\,\rho(x-y)$$

is in $L^{1}(\Omega)$ for a.e. $x\in\Omega$. In these cases, the principal value integral in equation 6 exists and coincides with the Lebesgue integral of $\frac{u(x)-u(y)}{\|x-y\|}\frac{x-y}{\|x-y\|}\rho(x-y)$ over all of $\Omega$. Thus, if $u\in C^{1}(\bar{\Omega})$ (or $u\in W^{1,p}(\Omega)$), then we may drop the limit in equation 6 and write:

$$\nabla_{\rho}u(x):=D\int_{\Omega}\frac{u(x)-u(y)}{\|x-y\|}\,\frac{x-y}{\|x-y\|}\,\rho(x-y)\,dy. \tag{7}$$

At times, we may wish to evaluate the integral on a subset $\Omega^{*}\subset\Omega$. In this case, we write $\nabla_{\rho}^{\Omega^{*}}u(x)$ to denote the corresponding nonlocal gradient:

$$\nabla_{\rho}^{\Omega^{*}}u(x):=D\int_{\Omega^{*}}\frac{u(x)-u(y)}{\|x-y\|}\,\frac{x-y}{\|x-y\|}\,\rho(x-y)\,dy. \tag{8}$$

If $\Omega^{*}=\Omega$, we simply denote this as $\nabla_{\rho}u(x)$.

The nonlocal gradient $\nabla_{\rho}u$ defined via $\rho$ above exists for $u\in W^{1,p}$, a space which contains functions that are only weakly differentiable and for which the standard (strong) gradient $\nabla u$ need not exist.

Specializing to the fixed sequence $\rho_{n}$ defined in equation 4, we write $\frac{\partial_{n}u}{\partial_{n}x_{i}}(x):=\frac{\partial_{\rho_{n}}u}{\partial_{\rho_{n}}x_{i}}(x)$ or, more succinctly, $u^{n}_{x_{i}}$, and similarly $\nabla_{n}u(x):=\nabla_{\rho_{n}}u(x)$. We refer to the subscript $n$ as the “scale of nonlocality”. It is easy to verify that, unlike the usual local gradient $\nabla(\cdot)$, the nonlocal $\nabla_{n}(\cdot)$ does not satisfy the product rule. The mode and topology of convergence of the sequence $\nabla_{n}u$ to $\nabla u$ as $n\rightarrow\infty$ has been investigated by many authors (see [25, 20, 35]).
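As a quick numerical illustration of this localization, the following minimal sketch (ours, with an assumed Gaussian kernel of variance $1/n$ standing in for $\rho_{n}$) evaluates equation 7 by quadrature for $D=1$, $\Omega=(0,1)$, and a smooth test function, and compares $\nabla_{n}u$ with the classical derivative.

```python
import numpy as np

def nonlocal_grad_1d(u, x, n, grid=np.linspace(0.0, 1.0, 4001)):
    """Quadrature sketch of eq. (7) for D = 1 on Omega = (0, 1), with a Gaussian
    kernel rho_n of variance 1/n (an assumed scaling, not prescribed by the paper)."""
    y = grid[np.abs(grid - x) > 1e-8]                 # the integrand is bounded near y = x
    rho = np.sqrt(n / (2 * np.pi)) * np.exp(-n * (x - y) ** 2 / 2)
    return np.trapz((u(x) - u(y)) / np.abs(x - y) * np.sign(x - y) * rho, y)

u  = lambda t: np.sin(2 * np.pi * t)
du = lambda t: 2 * np.pi * np.cos(2 * np.pi * t)
x0 = 0.3
for n in (10, 100, 1000):
    print(f"n={n:5d}   nonlocal gradient={nonlocal_grad_1d(u, x0, n):+.4f}   u'(x0)={du(x0):+.4f}")
```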

1.4.2 Second Order Operators

Just as a first order nonlocal gradient operator has been defined earlier, we can define second order nonlocal differential operators as well. We consider nonlocal Hessians. In this case, even for a fixed scale of nonlocality $n$ and associated density (kernel) $\rho_{n}(\cdot)$, one has several choices for defining a nonlocal Hessian. We list below four possibilities, and specify the $(i,j)^{\text{th}}$ entry of the nonlocal Hessian matrices for $i,j=1,\ldots,D$. For the time being, we assume enough regularity for $u$ in order for the Hessians to make sense. We will specialize to $u\in C^{2}_{c}(\Omega)$ and $u\in W^{2,p}(\Omega)$ in section 4. Recall the notation $u^{n}_{x_{i}}(x)=\frac{\partial_{n}u}{\partial_{n}x_{i}}(x)$.

$$\begin{cases}
(H_{n,m}^{1})_{i,j}(x):=D\int_{\Omega}\frac{u^{n}_{x_{i}}(x)-u^{n}_{x_{i}}(y)}{\|x-y\|}\,\frac{x_{j}-y_{j}}{\|x-y\|}\,\rho_{m}(x-y)\,dy:=\frac{\partial_{m}u^{n}_{x_{i}}}{\partial_{m}x_{j}}(x):=(u_{x_{i}}^{n})_{x_{j}}^{m},\\[6pt]
(H_{n}^{2})_{i,j}(x):=D\int_{\Omega}\frac{u_{x_{i}}(x)-u_{x_{i}}(y)}{\|x-y\|}\,\frac{x_{j}-y_{j}}{\|x-y\|}\,\rho_{n}(x-y)\,dy:=\frac{\partial_{n}u_{x_{i}}}{\partial_{n}x_{j}}(x):=(u_{x_{i}})_{x_{j}}^{n},\\[6pt]
(H_{n}^{3})_{i,j}(x):=\frac{\partial u^{n}_{x_{i}}}{\partial x_{j}}(x):=(u_{x_{i}}^{n})_{x_{j}},\\[6pt]
(H_{n}^{4})_{i,j}(x):=\frac{D(D+1)}{2}\int_{\mathbb{R}^{D}}\frac{u(x+h)-2u(x)+u(x-h)}{\|h\|^{2}}\,\frac{\big(h\otimes h-\frac{\|h\|^{2}}{D+2}I_{D}\big)_{i,j}}{\|h\|^{2}}\,\rho_{n}(h)\,dh.
\end{cases} \tag{9}$$

Of the four Hessians above, $H_{n}^{4}$ is explicitly defined and analyzed in great detail by [30]. It is interesting to note that $H_{n}^{4}$ is a nonlocal analog of the usual second-order central difference approximation to the second derivative ([31]).

1.5 Paper Organization

After this introductory section, we consider $1^{\text{st}}$ order nonlocal optimization in section 2. Section 3 deals with a stylized numerical example of the $1^{\text{st}}$ order theory. In section 4, we develop the second order nonlocal optimization theory and the nonlocal analog of Newton’s method. We conclude in section 5. The proofs of all theoretical results are provided in the supplementary appendix, section 6.

2 $1^{\text{st}}$ Order Nonlocal Theory and Methods

We now analyze the $1^{\text{st}}$ order nonlocal optimization theory and associated numerical methods. Recall that $\nabla_{n}u$ exists for $u\in W^{1,p}$, even if the classical gradient $\nabla u$ is not defined, and we also know from [35] that if, in addition, $u\in C^{1}(\bar{\Omega})$ (so that the usual gradient $\nabla u$ also exists), then $\nabla_{n}u\rightarrow\nabla u$ locally uniformly. Thus, if $x^{*}\in\Omega$ is a local minimizer of $u$, then $\nabla u(x^{*})=0$ and we have that $\nabla_{n}u(x^{*})\rightarrow\nabla u(x^{*})=0$, from which we can immediately conclude the following.

Proposition 2.1.

Let $x^{*}\in\Omega$ be a local minimum of $u\in C^{1}(\bar{\Omega})$. Then given an $\epsilon>0$, there is an $N>0$ such that $\|\nabla_{n}u(x^{*})\|<\epsilon$ for all $n>N$.

While we shall not be using the following result directly in the development of nonlocal optimization methods, it is nonetheless an important analog of the local case. Here $\nabla_{n}^{\Omega^{*}}u$ refers to the restriction of $\nabla_{n}u$ to a measurable subset $\Omega^{*}\subset\Omega$ (see the supplementary appendix for more details).

Theorem 2.2.

Let $x^{*}\in\Omega$ be an a.e. local minimum of $u\in W^{1,p}(\Omega)$. For each $n\in\mathbb{N}$ there exists a measurable subset $\Omega^{*}=\Omega^{*}(n)\subset\Omega$ such that $\nabla_{n}^{\Omega^{*}}u(x^{*})=0$.

2.1 Nonlocal Taylor’s Theorem

Consider first the standard affine approximant

$$A(x_{0},x):=u(x_{0})+(x-x_{0})^{T}\nabla u(x_{0})$$

to $u(x)\in C^{1}(\Omega)$ at $x_{0}$, with the corresponding remainder term $r(x_{0},x)=u(x)-A(x_{0},x)$. The nonlocal analog of the affine approximant is

$$A_{n}(x_{0},x):=u(x_{0})+(x-x_{0})^{T}\nabla_{n}u(x_{0})$$

and the corresponding nonlocal remainder term is $r_{n}(x_{0},x)=u(x)-A_{n}(x_{0},x)$. Standard calculus dictates that the remainder term $r(x_{0},x)=o(\|x-x_{0}\|)$, and by uniqueness of the derivative, we can conclude that the nonlocal remainder $r_{n}(x_{0},x)$ will in general not be $o(\|x-x_{0}\|)$. Indeed, the standard (local) affine approximant is asymptotically the unique best fit linear polynomial and hence, the nonlocal affine approximant must necessarily be sub-optimal. It is therefore important to understand how $r_{n}(x_{0},x)$ relates to $r(x_{0},x)$ as $n\rightarrow\infty$, and this is the objective of the next result.

Proposition 2.3.
  • Let $u\in C_{c}^{1}(\Omega)$. Then $A_{n}(x_{0},x)\rightarrow A(x_{0},x)$ and $r_{n}\rightarrow r$ uniformly as $n\rightarrow\infty$.

  • Let $u\in H_{0}^{1}(\Omega)$. Then $\|r_{n}(\cdot,\cdot)-r(\cdot,\cdot)\|_{L^{2}(\Omega\times\Omega)}\rightarrow 0$ as $n\rightarrow\infty$.

2.2 Nonlocal Gradient Descent

We define the nonlocal gradient descent as follows. We first fix the scale of nonlocality $n$, and initialize $x^{0,n}\in\Omega$. We then perform the update:

$$x^{k+1,n}=x^{k,n}-\alpha_{n}^{k}\nabla_{n}u(x^{k,n}). \tag{10}$$

Although one can easily conceive of an adaptive procedure with $n$ varying with the step $k$, we emphasize that for our analysis, the scale $n$ of nonlocality does not change between the steps of the descent. Henceforth, we denote by $\nabla u^{k}:=\nabla u(x^{k})$ and $\nabla_{n}u^{k,n}:=\nabla_{n}u(x^{k,n})$. The step sizes $\alpha_{n}^{k}$ and the initial point $x^{0,n}\in\Omega$ are chosen so that the iterates $x^{k,n}$ all remain in $\Omega$. We emphasize that nonlocal gradient descent is applicable so long as $\nabla_{n}u$ exists, in particular, even if $\nabla u$ does not exist.
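To make the last point concrete, here is a minimal sketch (ours) of the update in equation 10 applied to the non-differentiable function $u(t)=|t-0.6|$ on $\Omega=(0,1)$, for which $\nabla u$ does not exist at the minimizer but $\nabla_{n}u$ does; the Gaussian kernel of variance $1/n$, the fixed step size, and the iteration count are assumptions made for the example.

```python
import numpy as np

def nonlocal_grad_1d(u, x, n, grid=np.linspace(0.0, 1.0, 4001)):
    # Quadrature for eq. (7), D = 1, with an assumed Gaussian kernel of variance 1/n.
    y = grid[np.abs(grid - x) > 1e-8]
    rho = np.sqrt(n / (2 * np.pi)) * np.exp(-n * (x - y) ** 2 / 2)
    return np.trapz((u(x) - u(y)) / np.abs(x - y) * np.sign(x - y) * rho, y)

u = lambda t: np.abs(t - 0.6)          # non-differentiable at the minimizer t* = 0.6
n, alpha, x = 400, 0.05, 0.1           # scale of nonlocality, step size, initial point
for k in range(60):
    x = x - alpha * nonlocal_grad_1d(u, x, n)   # the update of eq. (10)
print(f"iterate after 60 steps: {x:.3f}   (minimizer: 0.600)")
```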

2.2.1 Suboptimal Stepsize

Consider first the sub-optimal step size situation, i.e., we use the same step size for both the local and nonlocal descent methods. This stepsize does not have to be the one obtained by linesearch. We will later consider the more involved situation of a nonlocal stepsize, or a stepsize that changes with $n$, the scale of nonlocality. We thus now consider the situation with $\alpha_{n}^{k}=\alpha^{k}$ and compare the two descent methods:

$$\begin{cases}x^{k+1}=x^{k}-\alpha^{k}\nabla u^{k},\\ x^{k+1,n}=x^{k,n}-\alpha^{k}\nabla_{n}u^{k,n},\end{cases} \tag{11}$$

with the same initial condition $x^{0}=x^{0,n}$ for all $n$. Note that if $u\in Lip(\Omega,M)$, then we can conclude that the sequence of nonlocal iterates is bounded if the step sizes $\alpha^{k}$ are bounded and decreasing fast enough:

Proposition 2.4.

If the step sizes $\alpha^{k}$ are chosen such that the sum $\sum_{m=1}^{K}\alpha^{m}<1$ for any $K>0$, then we can conclude that the sequence $\{x^{k,n}\}$ is bounded.

In particular, proposition 2.4 and the Heine-Borel Theorem imply that we are guaranteed (at least) a convergent subsequence of the sequence $\{x^{k,n}\}$. For the remainder of this subsection, we assume, for purposes of our comparative analysis of the local and nonlocal gradient descent methods, that $u\in C^{2}(\Omega)\cap Lip(\Omega,M)$. Thus, the sequence of nonlocal iterates $\{x^{k,n}\}$ in this case is bounded, and hence, there is a compact $V\subset\Omega$ such that $x^{k,n}\in V$ for all $k$. We are interested in studying the quantity $\|x^{k}-x^{k,n}\|$ as $n$ increases. We have the following result.

Theorem 2.5.

Let $u\in C^{2}(\Omega)\cap Lip(\Omega,M)$ and let $\epsilon>0$ be given. Assume that nonlocal gradient descent is initialized with $x^{0}$ for all $n$, and with the same sequence of step sizes $\alpha^{k}$ for all $n$, satisfying $\sum_{m=1}^{K}\alpha^{m}<1$ for any $K>0$. Then, for each $k$, there is an $N>0$ such that for all $n>N$, we have $\|x^{m}-x^{m,n}\|<\epsilon$ for $m=1,\ldots,k+1$.

Theorem 2.5 implies that, if the objective function admits a gradient, then for $n$ large enough, the iterates of nonlocal gradient descent get arbitrarily close to the iterates of the standard gradient descent. The important point to note is that while nonlocal gradient descent may be applicable in situations where the classical gradient is unavailable, we would not then have a convergence analysis comparing the nonlocal gradient descent with the local analog. However, even in such cases, as we noted earlier, proposition 2.4 guarantees that we have a convergent subsequence of iterates of the nonlocal gradient descent method.

2.2.2 Optimal Stepsize

We now consider the case of optimal stepsize obtained by linesearch. We fix a maximal stepsize range of $I=[0,A]$ for some $A>0$. As with the suboptimal stepsize case, we assume that $u\in C^{2}(\Omega)\cap Lip(\Omega,M)$ so that the sequence of nonlocal iterates $x^{k,n}$ is bounded. Fix a compact $V\subset\Omega$ such that $x^{k,n}\in V$ for all $k$. In the classical gradient descent approach, exact linesearch for the optimal stepsize at iteration $k$ involves finding $\alpha^{k}=\text{argmin}_{\alpha\in I}\,u(x^{k}-\alpha\nabla u^{k})$, i.e., we perform:

$$x^{k+1}=x^{k}-\alpha^{k}\nabla u^{k},$$

where

$$\alpha^{k}=\text{argmin}_{\alpha\in I}\,u(x^{k}-\alpha\nabla u^{k}).$$

We adapt this now to the nonlocal case. The steps of nonlocal gradient descent with exact linesearch for the optimal stepsize take the form:

$$x^{k+1,n}=x^{k,n}-\alpha^{k,n}\nabla_{n}u^{k,n}, \tag{12}$$

where

$$\alpha^{k,n}=\text{argmin}_{\alpha\in I}\,u(x^{k,n}-\alpha\nabla_{n}u^{k,n}), \tag{13}$$

and again we assume that $x^{0}=x^{0,n}$ for all $n$. Note that $A$ must be chosen in such a way that $x^{k,n}\in\Omega$ for all $k$. Also, let us assume for simplicity that $\alpha^{k,n}$ is unique.
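In practice, the argmin in equation 13 can be approximated by any scalar search over $I=[0,A]$; the sketch below (ours) uses a plain dense grid, which is only an illustrative stand-in for a proper linesearch routine.

```python
import numpy as np

def exact_linesearch(u, x, g, A=1.0, num=200):
    """Approximate argmin over alpha in [0, A] of u(x - alpha * g) by grid evaluation;
    g is the (local or nonlocal) gradient at x."""
    alphas = np.linspace(0.0, A, num)
    values = np.array([u(x - a * g) for a in alphas])
    return alphas[np.argmin(values)]

# Example usage with a given gradient oracle `grad` (e.g., a nonlocal gradient):
# alpha_k = exact_linesearch(u, x_k, grad(x_k)); x_k = x_k - alpha_k * grad(x_k)
```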

Theorem 2.6.

Let $u\in C^{2}(\Omega)\cap Lip(\Omega,M)$. Assume that the optimal stepsize version of nonlocal gradient descent is initialized with $x^{0}$ for all $n$. Then, for each $k$, there exists a subsequence $\{\alpha^{k,n_{m_{1}}(k)}\}$ of $\{\alpha^{k,n}\}_{n=1}^{\infty}$ such that $\alpha^{k,n_{m_{1}}(k)}\rightarrow\alpha^{k}$ as $n_{m_{1}}(k)\rightarrow\infty$, and a subsequence $\{x^{k,n_{m_{2}}(k)}\}$ of $\{x^{k,n}\}_{n=1}^{\infty}$ such that $x^{k,n_{m_{2}}(k)}\rightarrow x^{k}$ as $n_{m_{2}}(k)\rightarrow\infty$.

2.3 Nonlocal Stochastic Gradient Descent

In this section, we consider the nonlocal version of stochastic gradient descent (SGD). A generic descent algorithm with a proxy-gradient $g^{k}\in\mathbb{R}^{D}$ takes the form:

$$x^{k+1}=x^{k}-\alpha^{k}g^{k}. \tag{14}$$

After $K$ iterations, we output the averaged vector $\bar{x}=\frac{1}{K}\sum_{k=1}^{K}x^{k}$. If the expected value of $g^{k}$ equals the true gradient $\nabla u^{k}$, then, roughly, the difference between the expected value of $u(\bar{x})$ and the true minimum $u(x^{*})$ approaches $0$ as the number of iterations approaches $\infty$. This assumes that we sample $g^{k}$ from a distribution such that $\mathbb{E}(g^{k}\,|\,x^{k})=\nabla u(x^{k})$, or, more generally, $\mathbb{E}(g^{k}\,|\,x^{k})\in\partial u(x^{k})$, where $\partial u(x^{k})$ is the subdifferential set of $u$ at $x^{k}$ (see [41]).

As motivation for our nonlocal stochastic gradient descent analysis, let $X,Y$ be $\Omega$-valued random variables with a conditional density $p_{Y|X}(y\,|\,x)=\rho(x-y)$. Here, $\rho(\cdot)$ is a radial density in the class $\mathcal{P}$ defined in section 1.3. Note that since we interpret $\rho(x-y)$ to be $p_{Y|X}(y\,|\,x)$, we can choose a density $p_{X}(x)$ and arrive at a joint distribution $p_{X,Y}(x,y)=\rho(x-y)p_{X}(x)$, and, upon integration, arrive at the marginal distribution $p_{Y}(y)$ of $Y$. Let us denote by

$$k_{u}(x,y):=D\,\frac{u(x)-u(y)}{\|x-y\|}\,\frac{x-y}{\|x-y\|} \tag{15}$$

(the integrand of equation 7, including the dimensional factor $D$). If we let $\mathbb{E}_{\rho}[\cdot]$ denote the expectation with respect to the density $\rho$, then it is evident that

$$\nabla_{\rho}u(x)=\int k_{u}(x,y)\,\rho(x-y)\,dy:=\mathbb{E}_{\rho}[k_{u}(X,Y)\,|\,X=x], \tag{16}$$

or, if we interpret $\nabla_{\rho}u(X)$ as a random variable, we can write $\nabla_{\rho}u(X)=\mathbb{E}_{\rho}[k_{u}(X,Y)\,|\,X]$. We will set $g^{k}=k_{u}(x^{k},y)$, where $y$ is sampled from the conditional density $p_{Y|X}(\cdot\,|\,x^{k})=\rho(x^{k}-\cdot)$, and perform the standard SGD with the knowledge that $\mathbb{E}_{\rho}[g^{k}\,|\,x^{k}]=\nabla_{\rho}u(x^{k})$. We can therefore consider a sequence of nonlocal SGD schemes with kernels $\rho_{n}(\cdot)$, and as $n\rightarrow\infty$, we prove that the nonlocal optima converge to the usual SGD optima. In the following, we assume $u$ is a convex, Lipschitz function with Lipschitz constant $M$.
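The identity in equation 16 is easy to check by simulation. The following minimal sketch (ours) draws $Y\,|\,X=x$ from a Gaussian $\rho_{n}$ of variance $1/n$ (an assumed kernel, ignoring the truncation to $\Omega$) and compares the Monte Carlo average of $k_{u}(x,Y)$ with the classical gradient of a smooth test function, which $\nabla_{\rho_{n}}u$ approximates for large $n$.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 2, 200                                      # dimension, scale of nonlocality
u = lambda x: np.sum((x - 0.5) ** 2, axis=-1)      # smooth convex test function
grad_u = lambda x: 2.0 * (x - 0.5)

def k_u(x, y):
    """k_u(x, y) of eq. (15), vectorized over samples y."""
    d = x - y
    r = np.linalg.norm(d, axis=-1, keepdims=True)
    return D * ((u(x) - u(y))[..., None] / r) * (d / r)

x = np.array([0.2, 0.8])
Y = x + rng.normal(scale=1.0 / np.sqrt(n), size=(200_000, D))   # Y | X = x  ~  rho_n(x - .)
print("Monte Carlo E[k_u(x, Y) | X = x] :", k_u(x, Y).mean(axis=0))
print("classical gradient  grad u(x)   :", grad_u(x))
```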

Lemma 2.7.

Assume $u$ is convex. Given $\epsilon>0$, there is an $N>0$ such that for all $n>N$, we have

$$u(y)-u(x)\geq(y-x)^{T}\nabla_{n}u(x)-\epsilon.$$

Lemma 2.7 motivates the following (well-known, see [36, 26, 48, 7]) generalization of the notion of the subderivative set of a function.

Definition 1.

Let $\epsilon>0$ be given. A vector $z\in\mathbb{R}^{D}$ is an $\epsilon$-subgradient of $u$ at $x$ if $u(y)-u(x)\geq(y-x)^{T}z-\epsilon$ for all $y\in\Omega$. The collection of all $\epsilon$-subgradients of $u$ at $x$ is called the $\epsilon$-subdifferential set of $u$ at $x$ and is denoted by $\partial_{\epsilon}u(x)$.

It is obvious from the definition of $\partial_{\epsilon}u(x)$ and Lemma 2.7 that given an $\epsilon>0$, there is an $N>0$ such that for all $n>N$, we have that $\nabla_{n}u(x)\in\partial_{\epsilon}u(x)$, and in particular, $\nabla u(x)\in\partial_{\epsilon}u(x)$ for any $\epsilon>0$. If we choose to run SGD with an $\epsilon$-subgradient $g^{k}$ in place of a usual subgradient, we call this an $\epsilon$-stochastic subgradient descent ($\epsilon$-SGD) algorithm.

Theorem 2.8.

Let $u$ be a convex, $M$-Lipschitz function and let $x^{*}\in\arg\min_{x:\|x\|\leq B}u(x)$. Let $\epsilon>0$ be given. If we run an $\epsilon$-subgradient descent algorithm on $u$ for $K$ steps with $\alpha=\sqrt{\frac{B^{2}}{M^{2}K}}$, then the output vector $\bar{x}$ satisfies

$$u(\bar{x})-u(x^{*})\leq\frac{BM}{\sqrt{K}}+\epsilon.$$

Furthermore, for every $\hat{\epsilon}>\epsilon$, to achieve $u(\bar{x})-u(x^{*})\leq\hat{\epsilon}$, it suffices to run the SGD algorithm for a number of iterations that satisfies

$$K\geq\frac{B^{2}M^{2}}{(\hat{\epsilon}-\epsilon)^{2}}.$$

We come now to the main result of this section. The result and proof are modeled after Theorem 14.8 of [41].

Algorithm 1: $\epsilon$-Stochastic Subgradient Descent
1: Inputs:
2:      Scalar $\alpha,\epsilon>0$, integer $K>0$
3: Initialize:
4:      $x^{1}\leftarrow 0$
5: for $k=1,\ldots,K$ do
6:      Choose $g^{k}$ at random from a distribution such that $\mathbb{E}[g^{k}\,|\,x^{k}]\in\partial_{\epsilon}u(x^{k})$
7:      $x^{k+1}\leftarrow x^{k}-\alpha g^{k}$
8: end for
9: Output $\bar{x}=\frac{1}{K}\sum_{k=1}^{K}x^{k}$
Theorem 2.9.

Let $B,M>0$. Let $u$ be a convex function and let $x^{*}\in\arg\min_{x:\|x\|\leq B}u(x)$. Let $\epsilon>0$ be given. Assume that $\epsilon$-SGD is run for $K$ iterations with $\alpha=\sqrt{\frac{B^{2}}{M^{2}K}}$. Assume also that for all $k$, we have $\|g^{k}\|\leq M$ with probability $1$. Then,

$$\mathbb{E}[u(\bar{x})]-u(x^{*})\leq\frac{BM}{\sqrt{K}}+\epsilon.$$

Thus, for every $\hat{\epsilon}>\epsilon$, running $\epsilon$-SGD for $K$ iterations with $K\geq\frac{B^{2}M^{2}}{(\hat{\epsilon}-\epsilon)^{2}}$ ensures that $\mathbb{E}[u(\bar{x})]-u(x^{*})\leq\hat{\epsilon}$.

Note that Theorem 2.9 shows that the estimate of $\mathbb{E}[u(\bar{x})]-u(x^{*})$ is composed of two parts: $\frac{BM}{\sqrt{K}}$ corresponds to the usual SGD estimate as in [41], while the $\epsilon$ term corresponds to the effects of nonlocal interactions. We can now specialize the general $\epsilon$-SGD theory above to the case of the nonlocal gradient operator. Recall our observation that given an $\epsilon>0$, we have $\nabla_{n}u(x)\in\partial_{\epsilon}u(x)$ for $n$ large enough. Thus, we can set $g^{k}=k_{u}(x^{k},y)$ to be the $\epsilon$-subgradients used in $\epsilon$-SGD and sample $y$ from the conditional density $p_{Y|X}(y\,|\,x^{k})=\rho_{n}(x^{k}-y)$. We then have a nonlocal $\epsilon$-SGD procedure.
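A minimal runnable sketch of this nonlocal $\epsilon$-SGD procedure is given below (ours): the $\epsilon$-subgradient oracle of Algorithm 1 is instantiated by $g^{k}=k_{u}(x^{k},y)$ with $y$ drawn from a Gaussian $\rho_{n}$ of assumed variance $1/n$; the test objective and the constants $B$, $M$, $K$ are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
D, n, K = 2, 400, 5000
c = np.array([0.3, 0.7])                        # minimizer of the test objective
u = lambda x: np.sum(np.abs(x - c))             # convex, Lipschitz, non-differentiable
B, M = 1.0, np.sqrt(D)                          # radius bound and Lipschitz constant
alpha = np.sqrt(B**2 / (M**2 * K))              # step size as in Theorems 2.8 and 2.9

def g_oracle(x):
    """epsilon-subgradient oracle: g = k_u(x, y) of eq. (15) with y ~ rho_n(x - .)."""
    y = x + rng.normal(scale=1.0 / np.sqrt(n), size=D)
    d = x - y
    return D * (u(x) - u(y)) / np.dot(d, d) * d

x, x_bar = np.zeros(D), np.zeros(D)
for k in range(K):
    x = x - alpha * g_oracle(x)                 # lines 6-7 of Algorithm 1
    x_bar += x / K                              # running average for line 9
print("averaged output x_bar:", np.round(x_bar, 3), "   target minimizer:", c)
```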

3 Stylized Application: Non-differentiable Parameter Estimation

As a numerical example of the theory developed in section 2, we consider the case of parameter estimation of non-differentiable parametric mappings, motivated by [51, 50], but restrict our attention to the case of a compact $1$-dimensional Euclidean parameter space $\Theta=[0,1]$. Let $\mathcal{M}$ be the $1$-dimensional manifold consisting of translations of a rectangular pulse as shown in figure 1. $\mathcal{M}$ is therefore the $1$-dimensional manifold $\{f(x-\theta):\theta\in\Theta\}$ viewed as a submanifold of $L^{2}([0,1])$, where $f(x)$ is the template pulse in figure 1 (green dashed line with support $[0,0.125]$). Note that while $\Theta$ is a flat Euclidean space, its image $\mathcal{M}\subset L^{2}([0,1])$ is a topological manifold. It is known (see [17, 51]) that if $\mathcal{M}$ is endowed with the $L^{2}([0,1])$ metric, then the map $\theta\mapsto f(x-\theta):=f_{\theta}$ is non-differentiable. Indeed, [17, 51] observe that

$$\frac{\|f_{\theta_{1}}-f_{\theta_{2}}\|_{L^{2}([0,1])}}{|\theta_{1}-\theta_{2}|}\propto C\,|\theta_{1}-\theta_{2}|^{-\frac{1}{2}}. \tag{17}$$

Given a template point $g\in\mathcal{M}$ corresponding to an unknown $\theta^{*}$, the objective is to recover an estimate $\hat{\theta}$ of $\theta^{*}$ by minimizing the function $\mathcal{E}(\theta)=\|f_{\theta}-g\|_{L^{2}([0,1])}$. The approach taken in [51] is to “smoothen” the manifold by convolution with a Gaussian kernel and apply the standard Newton’s method to the smoothed manifold. We can directly address this problem without smoothing the manifold. Indeed, it is clear from the estimate (17) that $\frac{|\mathcal{E}(\theta_{1})-\mathcal{E}(\theta_{2})|}{|\theta_{1}-\theta_{2}|}\rho_{n}(\theta_{1}-\theta_{2})$ is integrable as a function of $\theta_{2}$ for any fixed $\theta_{1}$, and hence the nonlocal gradient of $\mathcal{E}(\theta)$ exists. We can therefore apply our nonlocal gradient descent to this problem to arrive at an estimate $\hat{\theta}$ of $\theta^{*}$. Note that we can view the scale of nonlocality $n$ as playing the role of a scale parameter vis-a-vis the approach of [51]; however, our approach can be viewed as smoothing the tangent space, rather than the manifold itself. We apply our “vanilla” nonlocal gradient descent with a learning rate $\alpha\in(0,1]$. Experiments were done on a 64-bit Linux laptop with $3.8$ GB RAM and a 2.60GHz Intel Core i5 CPU. We consider the case of both bump function and Gaussian kernels serving as $\rho_{n}$ in the nonlocal gradient definition, with $n=1,2,3$ (see figure 2). The initial and “ground truth” pulses corresponding to $\theta^{0}=0.1$ and $\theta^{*}=0.5$ are shown in figure 2. Given the extreme nonconvexity of $\mathcal{E}(\theta)$, we adapted the nonlocal gradient descent presented earlier by halving the learning rate whenever the ratio of successive gradient magnitudes between iterations exceeds $2.5$. The convergence for both bump function and Gaussian kernels is rapid, as seen in figure 3. The Gaussian kernel case gracefully converges: increasing $n$ results in more rapid convergence, as expected. However, the effect of numerical instability due to vanishing gradients is clear in the case of the bump function kernel: increasing $n$ results in nonmonotone convergence. Nevertheless, the efficacy of the nonlocal approach is apparent from this example, and extensions to higher dimensions are certainly possible, although the cost of dense quadrature-based integration like the one used here becomes prohibitive as the dimension of the parameter space increases.
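To see why the nonlocal gradient is well defined here while the classical one is not, the following minimal sketch (ours) builds the translation family numerically, reproduces the blow-up of the classical difference quotient predicted by (17), and evaluates a finite nonlocal gradient of $\mathcal{E}$ at $\theta^{*}$; the Gaussian kernel with assumed variance $1/n$ and the grid sizes are choices made for the example, not the paper's exact setup.

```python
import numpy as np

ts = np.linspace(0.0, 1.0, 4001)                                    # grid for [0, 1]
f = lambda th: ((ts >= th) & (ts < th + 0.125)).astype(float)       # translated rectangular pulse f_theta
g = f(0.5)                                                          # template point, theta* = 0.5
E = lambda th: np.sqrt(np.trapz((f(th) - g) ** 2, ts))              # E(theta) = ||f_theta - g||_{L^2}

# The classical difference quotient blows up like |dtheta|^(-1/2), cf. eq. (17):
for dth in (1e-1, 1e-2, 1e-3):
    print(f"dtheta={dth:.0e}   |E(0.5+dtheta)-E(0.5)|/dtheta = {abs(E(0.5 + dth) - E(0.5)) / dth:8.2f}")

# ... while the nonlocal gradient of E (eq. (7), D = 1, assumed Gaussian kernel) is finite:
def nonlocal_grad_E(th, n, num=1601):
    th2 = np.linspace(0.0, 0.875, num)                              # keep the pulse inside [0, 1]
    th2 = th2[np.abs(th2 - th) > 1e-9]
    rho = np.sqrt(n / (2 * np.pi)) * np.exp(-n * (th - th2) ** 2 / 2)
    kern = (E(th) - np.array([E(t) for t in th2])) / np.abs(th - th2) * np.sign(th - th2)
    return np.trapz(kern * rho, th2)

print("nonlocal gradient of E at theta* = 0.5 (n = 20):", round(nonlocal_grad_E(0.5, 20), 4))
```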

Figure 1: Points on $\mathcal{M}$.
Figure 2: (a) Bump function kernels; (b) Gaussian kernels.
Figure 3: (a) Convergence with bump function kernel; (b) Convergence with Gaussian kernel.

4 $2^{\text{nd}}$ Order Nonlocal Theory and Methods

We now turn our attention to second order methods, in particular Newton’s method. Recall from equation 9 that we have (at least) 4 distinct choices for defining a nonlocal analog of the Hessian of a function at a point. It is natural to consider the relationship between the four choices, and analyze the modes of convergence of each to the “usual" local Hessian operator. The following results make this relationship clear.

Theorem 4.1.
  • Let $u\in C^{2}(\bar{\Omega})$. Assume that there exists an $N>0$ such that for all $n>N$, $u_{x_{i}}^{n}\in C^{1}(\bar{\Omega})$. Then, for $n$ fixed with $n>N$, we have $(u_{x_{i}}^{n})_{x_{j}}^{m}\rightarrow(u_{x_{i}}^{n})_{x_{j}}$ locally uniformly as $m\rightarrow\infty$, that is, $(H_{n,m}^{1})_{i,j}\xrightarrow{m\rightarrow\infty,\,n\,(\text{fixed})>N}(H_{n}^{3})_{i,j}$ locally uniformly. If $u\in C^{2}_{c}(\Omega)$, the convergence is uniform.

  • Let $u\in C^{2}(\bar{\Omega})$. As $n\rightarrow\infty$, $(u_{x_{i}})^{n}_{x_{j}}\rightarrow(u_{x_{i}})_{x_{j}}=\frac{\partial^{2}u}{\partial x_{i}\partial x_{j}}$ locally uniformly, that is, $(H_{n}^{2})_{i,j}\xrightarrow{n\rightarrow\infty}(H)_{i,j}$ locally uniformly. If $u\in C^{2}_{c}(\Omega)$, the convergence is uniform.

We say a sequence $u_{n}\overset{H_{0}^{1}(\Omega)}{\rightharpoonup}u$ (read: $u_{n}$ converges weakly to $u$ in $H_{0}^{1}(\Omega)$) with $u_{n},u\in L^{2}(\Omega)$ if $(u_{n},v)_{L^{2}(\Omega)}\rightarrow(u,v)_{L^{2}(\Omega)}$ for all $v\in H^{1}_{0}(\Omega)$ as $n\rightarrow\infty$. We have the following theorem.

Theorem 4.2.

Let $u\in C^{2}(\Omega)$. Then:

  • As $n\rightarrow\infty$, we have $(u_{x_{i}}^{n})_{x_{j}}\overset{H_{0}^{1}(\Omega)}{\rightharpoonup}(u_{x_{i}})_{x_{j}}=\frac{\partial^{2}u}{\partial x_{i}\partial x_{j}}$, that is, $(H_{n}^{3})_{i,j}\overset{H_{0}^{1}(\Omega)}{\rightharpoonup}(H)_{i,j}$.

  • Combined with the hypotheses of theorem 4.1, we have:

$$(H_{n,m}^{1})_{i,j}\xrightarrow{m\rightarrow\infty,\,n\,(\text{fixed})>N}(H_{n}^{3})_{i,j}\overset{H_{0}^{1}(\Omega)}{\rightharpoonup}(H)_{i,j},$$

with the convergence of $(H_{n,m}^{1})_{i,j}$ to $(H_{n}^{3})_{i,j}$ being locally uniform if $u\in C^{2}(\Omega)$ and uniform if $u\in C^{2}_{c}(\Omega)$.

The proofs of theorems 4.1 and 4.2 are found in section 6. While we have shown weak convergence of $(H_{n,m}^{1})_{i,j}$ to $(H)_{i,j}$ as $m,n\rightarrow\infty$, we can show that, under additional hypotheses on the radial functions $\rho_{n}$, we have an upper bound on how the sequence $(H^{1}_{n,m})_{i,j}$ deviates from strong (uniform) convergence.

Theorem 4.3.

Assume, in addition to the existing hypotheses on $\rho_{m}$, that $\int_{\mathbb{R}^{D}}\frac{\rho_{m}(x)}{\|x\|}\,dx<M(m)$. Let $u\in C^{2}(\bar{\Omega})$. Then, given an $\epsilon>0$, there is an $N>0$ such that for all $n>N$, we have that

$$\|(H_{n,m}^{1})_{i,j}-c_{j}^{m}(x)\,(H)_{i,j}\|_{L^{\infty}(\Omega)}\leq 2D\,\epsilon\lim_{m\rightarrow\infty}M(m).$$

Thus, if $M(m)$ is independent of $m$, we have local uniform convergence of $H_{n,m}^{1}$ to $H$. Otherwise, the $L^{\infty}$ error grows as $M(m)$.

We note that theorem 4.3 provides a worst-case upper bound. Indeed, for the case of the Gaussian and “bump” function densities we discussed earlier, a simple scaling argument shows that the corresponding constants $M(m)$ of theorem 4.3 grow with $m$, and therefore, the $L^{\infty}(\Omega)$ error is possibly unbounded. However, in practice, it seems plausible that there is a subclass of $C^{2}(\Omega)$ functions for which the $L^{\infty}(\Omega)$ error between $H_{n,m}^{1}$ and the “usual” local Hessian is bounded. We do not go further into the analysis of finding the optimal class of functions for which the error is bounded.

4.1 Nonlocal Newton’s Method

We now consider a nonlocal version of Newton’s method. Recall the standard Newton iterations to minimize $u(x)$ for $x\in\Omega$ with stepsize $\beta^{k}$:

$$x^{k+1}=x^{k}-\beta^{k}(Hu^{k})^{-1}\nabla u^{k}, \tag{18}$$

where $Hu^{k}=Hu(x^{k})$ is the usual Hessian of $u$ evaluated at $x^{k}$, and, as before, $\nabla u^{k}=\nabla u(x^{k})$ is the gradient of $u$ evaluated at $x^{k}$. We shall consider the case of $\beta^{k}$ having been chosen by some line-search type method, sufficient to ensure a descent property at each step. The important point is that we consider the same $\beta^{k}$ for the nonlocal case:

$$x^{k+1,n}=x^{k,n}-\beta^{k}(H_{n}u^{k,n})^{-1}\nabla_{n}u^{k,n}, \tag{19}$$

where $H_{n}$ is the nonlocal Hessian $H^{4}_{n}$ defined earlier, and $x^{0}=x^{0,n}$ for all $n$. As we did for gradient descent, in order to compare the nonlocal and local approaches, we will assume that the functions we wish to minimize are sufficiently regular. Again, we note that the nonlocal methods are still valid for possibly non-differentiable functions, so long as the nonlocal operators are well defined. We choose to do our analysis with $H^{4}_{n}$; while we can certainly consider the other possible nonlocal Hessians for numerical computations, we cannot guarantee theoretical convergence properties, due to only weak convergence of the other Hessians to the standard Hessian. Also, by [30], we can ensure uniform convergence of $H^{4}_{n}$ to $H$, though we tacitly expand the domain of $u$ from $\Omega$ to all of $\mathbb{R}^{D}$. Thus, if $u$ has compact support $K\subset\Omega$, we extend the domain of $u$ from $\Omega$ to all of $\mathbb{R}^{D}$ by zero, while maintaining the regularity of $u$ in the extension to all of $\mathbb{R}^{D}$. We continue to denote by $u$ the extension of $u$ to all of $\mathbb{R}^{D}$.
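A minimal Monte Carlo sketch of the nonlocal Newton update (19) is given below (ours). It uses Gaussian kernels of assumed variance $1/n$, $\beta^{k}=1$, a smooth quadratic test objective on all of $\mathbb{R}^{D}$ (ignoring the compact-support extension just described), and sampling-based estimates of $\nabla_{n}u$ and $H^{4}_{n}u$ following equations 7 and 9.

```python
import numpy as np

rng = np.random.default_rng(2)
D, n, N = 2, 100, 200_000                          # dimension, scale of nonlocality, MC samples
A = np.array([[3.0, 0.5], [0.5, 1.0]])             # SPD matrix of the quadratic test objective
b = np.array([1.0, -2.0])
u = lambda x: 0.5 * np.einsum('...i,ij,...j->...', x, A, x) - x @ b
x_star = np.linalg.solve(A, b)                     # exact minimizer, for reference only

def nl_grad(x):
    """Monte Carlo estimate of the nonlocal gradient, eq. (7)."""
    h = rng.normal(scale=1.0 / np.sqrt(n), size=(N, D))
    r2 = np.sum(h ** 2, axis=1)
    return D * np.mean(((u(x) - u(x - h)) / r2)[:, None] * h, axis=0)

def nl_hess(x):
    """Monte Carlo estimate of the nonlocal Hessian H_n^4, eq. (9)."""
    h = rng.normal(scale=1.0 / np.sqrt(n), size=(N, D))
    r2 = np.sum(h ** 2, axis=1)
    d2 = (u(x + h) - 2.0 * u(x) + u(x - h)) / r2                       # symmetric second differences
    outer = h[:, :, None] * h[:, None, :] / r2[:, None, None] - np.eye(D) / (D + 2)
    return 0.5 * D * (D + 1) * np.mean(d2[:, None, None] * outer, axis=0)

x = np.array([2.0, 2.0])
for k in range(8):
    x = x - np.linalg.solve(nl_hess(x), nl_grad(x))                    # update (19) with beta^k = 1
    print(f"iter {k+1}: distance to minimizer = {np.linalg.norm(x - x_star):.2e}")
```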

Suppose $x^{*}\in\Omega$ is a minimizer of $u(x)$. We assume that $Hu(x)$ is continuous, i.e., $u(x)$ is at least $C^{2}(\mathbb{R}^{D})$. Also, we assume $(Hu(x^{*}))^{-1}$ exists, and therefore, by [13, 40], there is a neighborhood $U$ of $x^{*}$ such that $(Hu(x))^{-1}$ exists on $U$ and is continuous at $x^{*}$. We let $f(x)=(Hu(x))^{-1}\nabla u(x)$ on $U$ and assume enough regularity of $u$ to allow $f\in C^{1}_{c}(U)$. We also assume that $(Hu(x))^{-1}$ is uniformly continuous on $U$. We can now state the following result, whose proof can be found in section 6.

Theorem 4.4.

Assume that $u$ has compact support $K\subset\mathbb{R}^{D}$, that $(Hu(x))^{-1}$ is uniformly continuous on a neighborhood $U$ of $x^{*}$, and enough regularity of $u$ to allow $f\in C^{1}_{c}(U)$. Then, given $\epsilon>0$, there exists an $N>0$ so that for $n>N$, we have

$$\|x^{k}-x^{k,n}\|<\epsilon.$$

We can now compare the usual quadratic convergence of Newton’s method with that of the nonlocal counterpart.

Theorem 4.5.

Assume the hypotheses of theorem 4.4. Then, given $\epsilon>0$, there exists an $N>0$ so that for $n>N$ and $k$ large enough, we have

$$\|x^{*}-x^{k,n}\|<M\|x^{*}-x^{k-1}\|^{2}+\epsilon,$$

for some $M$ depending on the regularity of $u$.

Proof.

We write $\|x^{*}-x^{k,n}\|\leq\|x^{*}-x^{k}\|+\|x^{k}-x^{k,n}\|$. By the standard quadratic convergence properties of Newton’s method ([13]), we have $\|x^{*}-x^{k}\|\leq M\|x^{*}-x^{k-1}\|^{2}$ for a suitable $M$ depending on the regularity of $u$. Next, by theorem 4.4, there is an $N>0$ such that for $n>N$, the second term satisfies $\|x^{k}-x^{k,n}\|<\epsilon$. Combining these two estimates, we conclude that $\|x^{*}-x^{k,n}\|<M\|x^{*}-x^{k-1}\|^{2}+\epsilon$, completing the proof. ∎

It is interesting to note that the “error” in approximating the full Hessian by the nonlocal one has a clear effect on the convergence of the nonlocal Newton iterations. The term $M\|x^{*}-x^{k-1}\|^{2}$ in the proof of theorem 4.5 corresponds to the “usual” quadratic convergence of the standard Newton iterations. However, the additional $\epsilon$ term corresponds to the error made in the approximation of the true Hessian by the nonlocal one. Indeed, $\epsilon$ can be made smaller by choosing a larger $n$ in the nonlocal Hessian. However, we are not guaranteed clean quadratic convergence in the nonlocal case; indeed, the rate of convergence to $x^{*}$ is always limited by $\epsilon$, i.e., by the effect of nonlocal interactions in computing the nonlocal Hessian.

5 Concluding Remarks

While we have developed basic first and second order nonlocal optimization methods, much remains to be done, both theoretically and numerically. For instance, one can transfer nonlocal optimization to regularized objective functions, constrained optimization and quasi-Newton methods. A bottleneck on the computational side is the quadrature used for integration. Sparse grid methods going back to Smolyak ([47]) can be employed for higher dimensional problems. Nonlocal gradients can also be incorporated into deep learning models during backpropagation and would be particularly useful for data that exhibits singular behavior. Our analysis assumes the parameter space $\Theta$ is Euclidean, and extending it to the case of $\Theta$ having nonzero curvature is a non-trivial open problem. Nevertheless, the results presented here can serve as a first step towards such expansions.

References

  • [1] Mathematical models for cell migration: a nonlocal perspective. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 11 2019.
  • [2] Robert A. Adams and John J. F. Fournier. Sobolev Spaces. Academic Press, 2 edition, Jun 2003.
  • [3] Bacim Alali and Robert Lipton. Multiscale dynamics of heterogeneous media in the peridynamic formulation. Journal of Elasticity, 106:71–103, 04 2010.
  • [4] Bacim Alali, Kuo Liu, and Max Gunzburger. A generalized nonlocal vector calculus. Zeitschrift für angewandte Mathematik und Physik, 66, 03 2015.
  • [5] Fuensanta Andreu-Vaillo, Jose M. Mazon, Julio D. Rossi, and J. Julian Toledo-melero. Nonlocal Diffusion Problems. American Mathematical Society (Mathematical Surveys and Monographs (Book 165) ), USA, 2010.
  • [6] Kendall E. Atkinson. The Numerical Solution of Integral Equations of the Second Kind. Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, 1997.
  • [7] A. Auslender and M. Teboulle. Interior gradient and epsilon-subgradient descent methods for constrained convex minimization. Mathematics of Operations Research, 29:1–26, 2004.
  • [8] Zdenek P. Bazant and Milan Jirasek. Nonlocal integral formulations of plasticity and damage: Survey of progress. Journal of Engineering Mechanics, 128(11):1119–1149, 2002.
  • [9] C. Bjorland, Luis Caffarelli, and Alessio Figalli. Non-local gradient dependent operators. Advances in Mathematics, 230:1859–1894, 07 2012.
  • [10] J. Bourgain, H. Brezis, and P. Mironescu. Another look at sobolev spaces. Optimal control and partial differential equations, IOS Press, A volume in honour of A. Benssoussan’s 60th birthday(Menaldi, J.L., et al. (eds.)):439–455, 2001.
  • [11] A. Buades, B. Coll, and J. . Morel. A non-local algorithm for image denoising. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 2, pages 60–65 vol. 2, 2005.
  • [12] Yuquan Chen, Qing Gao, Yiheng Wei, and Yong Wang. Study on fractional order gradient methods. Applied Mathematics and Computation, 314:310 – 321, 2017.
  • [13] Edwin K. P. Chong and Stanislaw H. Zak. An Introduction to Optimization, Fourth Edition. Wiley-Interscience Series in Discrete Mathematics and Optimization, John Wiley and Sons, Inc., USA, 2013.
  • [14] Rinaldo M. Colombo and Magali Lecureux-Mercier. Nonlocal crowd dynamics models for several populations. Acta Mathematica Scientia, 32(1):177–196, 2012. Mathematics Dedicated to professor Constantine M. Dafermos on the occasion of his 70th birthday.
  • [15] Matteo Cozzi, Serena Dipierro, and Enrico Valdinoci. Nonlocal phase transitions in homogeneous and periodic media. Journal of Fixed Point Theory and Applications, 19:387–405, 11 2016.
  • [16] Marta D’Elia, Mamikon Gulian, Hayley Olson, and George Em Karniadakis. A unified theory of fractional, nonlocal, and weighted nonlocal vector calculus. arXiv preprint, arXiv:2005.07686v2 [math.AP], 2020.
  • [17] David Donoho and Carrie Grimes. Image manifolds which are isometric to euclidean space. Journal of Mathematical Imaging and Vision, 23(1):5–24, 12 2005.
  • [18] Qiang Du. Nonlocal Modeling, Analysis, and Computation. Society for Industrial and Applied Mathematics (CBMS-NSF Regional Conference Series in Applied Mathematics), USA, 2019.
  • [19] Qiang Du, Max Gunzburger, R. Lehoucq, and Kun Zhou. Analysis and approximation of nonlocal diffusion problems with volume constraints. SIAM Review, 54:667–696, 04 2011.
  • [20] Qiang Du, Max Gunzburger, R. Lehoucq, and Kun Zhou. A nonlocal vector calculus, nonlocal volume constrained problems, and nonlocal balance laws. Mathematical Models and Methods in Applied Sciences, 23(3):493–540, 2013.
  • [21] Ravindra Duddu and Haim Waisman. A nonlocal continuum damage mechanics approach to simulation of creep fracture in ice sheets. Computational Mechanics, 51(6):961–974, June 2013.
  • [22] Etienne Emmrich, Richard Lehoucq, and Dimitri Puhst. Peridynamics: A nonlocal continuum theory. Lecture Notes in Computational Science and Engineering, 89, 09 2013.
  • [23] Lawrence C. Evans. Partial Differential Equations. American Mathematical Society, 2 edition, 2010.
  • [24] Andrzej Fryszkowski. Fixed Point Theory for Decomposable Sets (Topological Fixed Point Theory and Its Applications Book 2). Springer, 1 edition, 2004.
  • [25] Guy Gilboa and Stanley J. Osher. Nonlocal operators with applications to image processing. Multiscale Modeling and Simulation, 7(3):1005–1028, 2008.
  • [26] X. L. Guo, C. J. Zhao, and Z. W. Li. On generalized ϵ\epsilon-subdifferential and radial epiderivative of set-valued mappings. Optimization Letters, 8(5):1707–1720, 2014.
  • [27] Milan Jirásek. Nonlocal models for damage and fracture: Comparison of approaches. International Journal of Solids and Structures, 35(31):4133–4145, 1998.
  • [28] Olav Kallenberg. Foundations of Modern Probability. Probability and Its Applications. Springer 2nd edition, 2002.
  • [29] Hwi Lee and Qiang Du. Nonlocal gradient operators with a nonspherical interaction neighborhood and their applications. ESAIM: Mathematical Modelling and Numerical Analysis, 54, 08 2019.
  • [30] Jan. Lellmann, Konstantinos. Papafitsoros, Carola. Schonlieb, and Daniel. Spector. Analysis and application of a nonlocal hessian. SIAM Journal on Imaging Sciences, 8(4):2161–2202, 2015.
  • [31] Randall LeVeque. Finite Difference Methods for Ordinary and Partial Differential Equations: Steady-State and Time-Dependent Problems (Classics in Applied Mathematics). Society for Industrial and Applied Mathematics, USA, 2007.
  • [32] A. Maleki, M. Narayan, and R. Baraniuk. Suboptimality of nonlocal means on images with sharp edges. In 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 299–305, 2011.
  • [33] Arian Maleki, Manjari Narayan, and Richard G. Baraniuk. Anisotropic nonlocal means denoising. Applied and Computational Harmonic Analysis, 35(3):452 – 482, 2013.
  • [34] G.F Dal Maso. Introduction to Γ\Gamma-Convergence. Birkhauser (Progress in Nonlinear Differential Equations and Their Applications (8)), Jan 1993.
  • [35] Tadele Mengesha and Daniel Spector. Localization of nonlocal gradients in various topologies. Calculus of Variations and Partial Differential Equations, 52(1):253–279, Jan 2015.
  • [36] R. Diaz Millan and M. Penton Machado. Inexact proximal ϵ\epsilon-subgradient methods for composite convex optimization problems. Journal of Global Optimization, 75:1029–1060, 2019.
  • [37] Mario Di Paola and Massimiliano Zingales. Long-range cohesive interactions of non-local continuum faced by fractional calculus. International Journal of Solids and Structures, 45(21):5642–5659, 2008.
  • [38] Y. Pu, J. Zhou, Y. Zhang, N. Zhang, G. Huang, and P. Siarry. Fractional extreme value adaptive training method: Fractional steepest descent approach. IEEE Transactions on Neural Networks and Learning Systems, 26(4):653–662, 2015.
  • [39] Srujan Rokkam, Max Gunzburger, Michael Brothers, Nam Phan, and Kishan Goel. A nonlocal peridynamics modeling approach for corrosion damage and crack propagation. Theoretical and Applied Fracture Mechanics, 101:373 – 387, 2019.
  • [40] David L. Russell. Optimization Theory. Mathematics Lecture Note Series. W. A. Benjamin, New York, 1970.
  • [41] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, USA, 2014.
  • [42] Dian Sheng, Yiheng Wei, Yuquan Chen, and Yong Wang. Convolutional neural networks with fractional order gradient method. Neurocomputing, 2019.
  • [43] N.Z. Shor, K.C. Kiwiel, and A. Ruszczynski. Minimization Methods for Non-Differentiable Functions. Springer Series in Computational Mathematics. Springer Berlin Heidelberg, 2012.
  • [44] Waclaw Sierpinski. Sur les fonctions d’ensemble additives et continues. Fundamenta Mathematicae, 3:240–246, 1922.
  • [45] S. Silling and R. Lehoucq. Convergence of peridynamics to classical elasticity theory. Journal of Elasticity, 93:13–37, 10 2008.
  • [46] S.A. Silling. Reformulation of elasticity theory for discontinuities and long-range forces. Journal of the Mechanics and Physics of Solids, 48(1):175 – 209, 2000.
  • [47] S. A. Smolyak. Quadrature and interpolation formulas for tensor products of certain classes of functions. Dokl. Akad. Nauk SSSR, 148:1042–1045, 1963.
  • [48] M. V. Solodov and B. F. Svaiter. A hybrid approximate extragradient-proximal point algorithm using the enlargement of a maximal monotone operator. Set-Valued Analysis, 7:323–345, 1999.
  • [49] Yunzhe Tao, Qi Sun, Qiang Du, and Wei Liu. Nonlocal neural networks, nonlocal diffusion and nonlocal modeling. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, page 494–504, Red Hook, NY, USA, 2018. Curran Associates Inc.
  • [50] M. B. Wakin, D. L. Donoho, Hyeokho Choi, and R. G. Baraniuk. High-resolution navigation on non-differentiable image manifolds. In Proceedings. (ICASSP ’05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005., volume 5, pages v/1073–v/1076 Vol. 5, March 2005.
  • [51] Michael Wakin, David Donoho, Hyeokho Choi, and Richard Baraniuk. The multiscale structure of non-differentiable image manifolds. Proc SPIE, 5914, 08 2005.
  • [52] J. Wang, Y. Guo, Y. Ying, Y. Liu, and Q. Peng. Fast non-local algorithm for image denoising. In 2006 International Conference on Image Processing, pages 1429–1432, 2006.
  • [53] Jian Wang, Yanqing Wen, Yida Gou, Zhenyun Ye, and Hua Chen. Fractional-order gradient descent learning of BP neural networks with Caputo derivative. Neural Networks, 89:19–30, 2017.
  • [54] Yiheng Wei, Yu Kang, Weidi Yin, and Yong Wang. Generalization of the gradient method with fractional order gradient direction. Journal of the Franklin Institute, 357(4):2514 – 2532, 2020.

6 Appendix

6.1 Nonlocal Gradients and Hessians: More Background

We provide here some more details on nonlocal gradients and Hessians. Recall that if uC1(Ω¯)u\in C^{1}(\bar{\Omega}), or more generally, as [35] note (see also [10]), if uW1,p(Ω)u\in W^{1,p}(\Omega), then the map

y|u(y)u(x)|xyρ(xy)L1(Ω)y\mapsto\frac{|u(y)-u(x)|}{\|x-y\|}\rho(x-y)\in L^{1}(\Omega)

for a.e. x\in\Omega. Thus, in these cases, the principal value integral in equation 6 exists and coincides with the Lebesgue integral of \frac{u(x)-u(y)}{\|x-y\|}\frac{x-y}{\|x-y\|}\rho(x-y) over all of \Omega.

Indeed, Lemma 2.1 of [35] states:

Proposition 6.1.

(Lemma 2.1 of [35]) Let u\in W^{1,p}(\Omega). Then \|\nabla_{\rho}u\|_{L^{p}(\Omega)}\leq C\|\rho\|_{L^{1}(\Omega)}\|\nabla u\|_{L^{p}(\Omega)}, where C=C(p,D,\Omega).

Recall that we write ρΩu(x)\nabla_{\rho}^{\Omega^{*}}u(x) to denote the corresponding nonlocal gradient on ΩΩ\Omega^{*}\subset\Omega:

ρΩu(x):=DΩu(x)u(y)xyxyxyρ(xy)𝑑y.\nabla_{\rho}^{\Omega^{*}}u(x):=D\int_{\Omega^{*}}\frac{u(x)-u(y)}{\|x-y\|}\frac{x-y}{\|x-y\|}\rho(x-y)dy. (20)

If the sequence of ρn\rho_{n} approaches a limit, albeit perhaps as distributions, we can expect the nonlocal gradients n\nabla_{n} to converge to a limiting operator as well. For example, as we remarked earlier, in case the ρn\rho_{n} approach δ0\delta_{0}, we can presume that the n\nabla_{n} approach \nabla as well. The mode and topology of convergence of the sequence nu\nabla_{n}u to u\nabla u as nn\rightarrow\infty has been investigated by many authors. In [25], it is formally shown that

nunxi(x)=uxi(x)+error,\frac{\partial_{n}u}{\partial_{n}x_{i}}(x)=\frac{\partial u}{\partial x_{i}}(x)+error,

while [20] show for uH1(D)u\in H^{1}(\mathbb{R}^{D}) (or more generally for H1H^{1} tensors) the convergence of the nonlocal gradient in L2(Ω)L^{2}(\Omega):

nuu in L2(Ω,D) as n.\nabla_{n}u\rightarrow\nabla u\text{ in $L^{2}(\Omega,\mathbb{R}^{D})$ as }n\rightarrow\infty.
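
To make this localization concrete, the following minimal Python sketch (our own illustration, not an experiment from the paper) approximates the one-dimensional instance of equation 20 by a Riemann sum, using a rescaled box kernel \rho_{n}(z)=\frac{n}{2}\mathbb{1}_{\{|z|<1/n\}}; the kernel family, the quadrature spacing and the test function are illustrative assumptions.

```python
# Illustrative sketch only: 1D nonlocal gradient of equation (20) with a box kernel
# rho_n(z) = (n/2) * 1_{|z| < 1/n}; the test function and grid spacing are assumptions.
import numpy as np

def nonlocal_gradient_1d(u, x, n, h=1e-5):
    # Riemann sum over a grid offset by h/2 so that y never coincides with x.
    y = np.arange(x - 1.0, x + 1.0, h) + 0.5 * h
    rho_n = 0.5 * n * (np.abs(x - y) < 1.0 / n)          # normalized box kernel
    integrand = (u(x) - u(y)) * (x - y) / np.abs(x - y) ** 2 * rho_n
    return np.sum(integrand) * h

u = lambda t: np.sin(3.0 * t)                            # u'(x) = 3 cos(3 x)
for n in (4, 16, 64, 256):
    print(n, nonlocal_gradient_1d(u, 0.5, n), 3.0 * np.cos(1.5))
```

As n grows, the printed values approach the classical derivative u'(0.5), in line with the localization results recalled below.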

The localization of nonlocal gradients in multiple different topologies has been studied in [35], and we restate the following theorem from [35].

Theorem 6.2.

(Theorem 1.1 of [35])

  • Let uC1(Ω¯)u\in C^{1}(\bar{\Omega}). Then nuu\nabla_{n}u\rightarrow\nabla u locally uniformly as nn\rightarrow\infty. If uCc1(Ω)u\in C_{c}^{1}(\Omega), then the convergence is uniform.

  • Let u\in W^{1,p}(\Omega). Then \nabla_{n}u\rightarrow\nabla u in L^{p}(\Omega,\mathbb{R}^{D}) as n\rightarrow\infty.

We record the following theorem from [30] regarding a similar result in the Hessian case.

Theorem 6.3.

(Theorems 1.4 and 1.5 of [30])

  • Let uCc2(D)u\in C_{c}^{2}(\mathbb{R}^{D}), then Hn4uHuH_{n}^{4}u\rightarrow Hu in Lp(D,D×D)L^{p}(\mathbb{R}^{D},\mathbb{R}^{D\times D}) as nn\rightarrow\infty for 1p1\leq p\leq\infty.

  • Let u\in W^{2,p}(\Omega). Then H_{n}^{4}u\rightarrow Hu in L^{p}(\mathbb{R}^{D},\mathbb{R}^{D\times D}) as n\rightarrow\infty for 1\leq p<\infty.

Note that in the H^{4}_{n} case, the authors of [30] define and analyze the Hessian on all of \mathbb{R}^{D}, because the definition of H^{4}_{n} involves translations u(x)\mapsto u(x+h); they nevertheless obtain convergence both in the L^{p}(\mathbb{R}^{D}) and in the uniform L^{\infty}(\mathbb{R}^{D}) norms. In section 4, we will show weak convergence of the remaining Hessians H_{n,m}^{1},H_{n}^{2} and H_{n}^{3} to H as m,n\rightarrow\infty. The above theorems are among the main results we draw upon in our analysis. In the sequel, when we speak of “the” nonlocal gradient \nabla_{n} or “the” nonlocal Hessians H^{k}_{n}, with k=1,\ldots,4, we mean the ones defined above, associated with a fixed sequence \rho_{n}(\cdot).

6.2 Proofs

6.3 Proof of Theorem 2.2

In order to prove theorem 2.2, we will need the following lemma.

Lemma 6.4.

Let (X,\sigma(\mathcal{X}),\mu) be a finite, non-atomic measure space, i.e., X is a set, \sigma(\mathcal{X}) is the \sigma-algebra generated by a collection \mathcal{X} of subsets of X, and \mu is a non-atomic measure on X with \mu(X)<\infty. Let f\in L^{1}_{+}(\mu), i.e., f is \mu-integrable, with f\geq 0 almost everywhere. Then for any M with 0\leq M\leq\int_{X}fd\mu, there is an A\in\sigma(\mathcal{X}) such that \int_{A}fd\mu=M.

Proof.

The proof relies on Sierpiński's theorem on non-atomic measures (see [44], [24]). Let M_{f}=\int_{X}fd\mu. Define \lambda(A)=\int_{A}fd\mu for any A\in\sigma(\mathcal{X}). Since f\in L^{1}_{+}(\mu) and \mu is non-atomic, \lambda:\sigma(\mathcal{X})\rightarrow[0,M_{f}] is a finite, non-atomic measure: any \lambda-atom would meet \{f>0\} in a set of positive \mu-measure, and splitting that set by the non-atomicity of \mu would produce a subset of strictly intermediate \lambda-measure. By Sierpiński's theorem, such a measure attains every value between 0 and its total mass. Thus, for any M\in[0,M_{f}] there exists an A\in\sigma(\mathcal{X}) with \lambda(A)=\int_{A}fd\mu=M. ∎

We now prove theorem 2.2.

Proof.

Define, for each i=1,\ldots,D and any x\in\Omega,

k_{u,i}^{n}(x^{*},x)=D\frac{u(x^{*})-u(x)}{\|x^{*}-x\|}\frac{x^{*}_{i}-x_{i}}{\|x^{*}-x\|}\rho_{n}(x^{*}-x), (21)

and

{Bi+={xD:xixi},Bi={xD:xi>xi}.\begin{cases}B_{i}^{+}&=\{x\in\mathbb{R}^{D}:x_{i}^{*}\leq x_{i}\},\\ B_{i}^{-}&=\{x\in\mathbb{R}^{D}:x_{i}^{*}>x_{i}\}.\\ \end{cases} (22)

Notice that B_{i}^{+}\cap B_{i}^{-}=\emptyset and that x^{*}\in B_{i}^{+}. Now, by assumption, u\in W^{1,p}(\Omega) and hence k_{u,i}^{n}(x^{*},x) is integrable with respect to x, i.e., k_{u,i}^{n}(x^{*},\cdot)\in L^{1}(\Omega). Thus, for any \epsilon>0 there is a \delta>0 such that for any ball B_{\delta} with volume vol(B_{\delta})<\delta we have

|Bδku,in(x,x)𝑑x|Bδ|ku,in(x,x)|𝑑x<ϵ.|\int_{B_{\delta}}k_{u,i}^{n}(x^{*},x)dx|\leq\int_{B_{\delta}}|k_{u,i}^{n}(x^{*},x)|dx<\epsilon.

Pick such a ball BδΩB_{\delta}\subset\Omega with xBδx^{*}\in B_{\delta}. Define

{Bδ,i+=BδBi+,Bδ,i=BδBi.\begin{cases}B_{\delta,i}^{+}&=B_{\delta}\cap B_{i}^{+},\\ B_{\delta,i}^{-}&=B_{\delta}\cap B_{i}^{-}.\\ \end{cases} (23)

Note that xBδ,i+x^{*}\in B_{\delta,i}^{+} and that Bδ=Bδ,i+Bδ,iB_{\delta}=B_{\delta,i}^{+}\cup B_{\delta,i}^{-}.

It is easy to see, based on how we have set things up, that for varying x\in B_{\delta} we have k_{u,i}^{n}(x^{*},x)\leq 0 a.e. on B_{\delta,i}^{-}, while k_{u,i}^{n}(x^{*},x)\geq 0 a.e. on B_{\delta,i}^{+}. Indeed, since x^{*} is an a.e. local minimum, u(x^{*})-u(x)\leq 0 a.e., so the sign of k_{u,i}^{n}(x^{*},x) is determined by that of x^{*}_{i}-x_{i}. On B_{\delta,i}^{+}, we have x^{*}_{i}-x_{i}\leq 0, so that k_{u,i}^{n}(x^{*},x)\geq 0 a.e., while on B_{\delta,i}^{-}, we have x^{*}_{i}-x_{i}\geq 0, so that k_{u,i}^{n}(x^{*},x)\leq 0 a.e.

Therefore, we see that

{Bδ,i+ku,in(x,x)𝑑x0,Bδ,iku,in(x,x)𝑑x0,\begin{cases}\int_{B_{\delta,i}^{+}}k_{u,i}^{n}(x^{*},x)dx&\geq 0,\\ \int_{B_{\delta,i}^{-}}k_{u,i}^{n}(x^{*},x)dx&\leq 0,\\ \end{cases} (24)

and, since BδB_{\delta} is the disjoint union Bδ,i+Bδ,iB_{\delta,i}^{+}\cup B_{\delta,i}^{-}

Bδku,in(x,x)𝑑x=Bδ,i+ku,in(x,x)𝑑x+Bδ,iku,in(x,x)𝑑x.\int_{B_{\delta}}k_{u,i}^{n}(x^{*},x)dx=\int_{B_{\delta,i}^{+}}k_{u,i}^{n}(x^{*},x)dx+\int_{B_{\delta,i}^{-}}k_{u,i}^{n}(x^{*},x)dx. (25)

If Bδku,in(x,x)𝑑x=0\int_{B_{\delta}}k_{u,i}^{n}(x^{*},x)dx=0, then setting Ω=Bδ\Omega^{*}=B_{\delta}, we are done. Otherwise, we consider two cases.

Case 1: Bδku,in(x,x)𝑑x<0\int_{B_{\delta}}k_{u,i}^{n}(x^{*},x)dx<0

This is of course equivalent to

Bδ,i+ku,in(x,x)𝑑x<Bδ,iku,in(x,x)𝑑x.\int_{B_{\delta,i}^{+}}k_{u,i}^{n}(x^{*},x)dx<-\int_{B_{\delta,i}^{-}}k_{u,i}^{n}(x^{*},x)dx.

We shall shortly appeal to lemma 6.4. Let X=Bδ,iX=B_{\delta,i}^{-} and σ()\sigma(\mathcal{B}^{-}) be the σ\sigma-algebra generated by open subsets of Bδ,iB_{\delta,i}^{-}. We have that σ()\sigma(\mathcal{B}^{-})\subset\mathcal{B}, where \mathcal{B} is the Borel σ\sigma-algebra on D\mathbb{R}^{D}. Let μ\mu be the Lebesgue measure on \mathcal{B} restricted to σ()\sigma(\mathcal{B}^{-}), so that μ(Bδ,i)<δ<\mu(B_{\delta,i}^{-})<\delta<\infty. If we let f(x)=ku,in(x,x)f(x)=-k_{u,i}^{n}(x^{*},x), then fL+1(μ)f\in L^{1}_{+}(\mu) and f0f\geq 0 almost everywhere. Finally, set

M:=Bδ,i+ku,in(x,x)𝑑xM:=\int_{B_{\delta,i}^{+}}k_{u,i}^{n}(x^{*},x)dx

and Mf:=Bδ,iku,in(x,x)dx>MM_{f}:=\int_{B_{\delta,i}^{-}}-k_{u,i}^{n}(x^{*},x)dx>M. The conditions of lemma 6.4 being satisfied, we are guaranteed a measurable ABδ,iA\subset B_{\delta,i}^{-} with Af𝑑μ=M\int_{A}fd\mu=M. By definition, this means that Aku,in(x,x)dx=Bδ,i+ku,in(x,x)𝑑x\int_{A}-k_{u,i}^{n}(x^{*},x)dx=\int_{B_{\delta,i}^{+}}k_{u,i}^{n}(x^{*},x)dx. Thus,

\int_{A}k_{u,i}^{n}(x^{*},x)dx+\int_{B_{\delta,i}^{+}}k_{u,i}^{n}(x^{*},x)dx=0.

Setting \Omega^{*}=A\cup B_{\delta,i}^{+}, we have \nabla_{n}^{\Omega^{*}}u(x^{*})=0.

Case 2: Bδku,in(x,x)𝑑x>0\int_{B_{\delta}}k_{u,i}^{n}(x^{*},x)dx>0

This is equivalent to

Bδ,i+ku,in(x,x)𝑑x>Bδ,iku,in(x,x)𝑑x.\int_{B_{\delta,i}^{+}}k_{u,i}^{n}(x^{*},x)dx>-\int_{B_{\delta,i}^{-}}k_{u,i}^{n}(x^{*},x)dx.

In this case, let X=Bδ,i+X=B_{\delta,i}^{+} and σ(+)\sigma(\mathcal{B}^{+}) be the σ\sigma-algebra generated by open subsets of Bδ,i+B_{\delta,i}^{+} containing xx^{*}. We have that σ(+)\sigma(\mathcal{B}^{+})\subset\mathcal{B} and μ\mu is the Lebesgue measure on \mathcal{B} restricted to σ(+)\sigma(\mathcal{B}^{+}), so that μ(Bδ,i+)<δ<\mu(B_{\delta,i}^{+})<\delta<\infty. If we let f(x)=ku,in(x,x)f(x)=k_{u,i}^{n}(x^{*},x), then fL+1(μ)f\in L^{1}_{+}(\mu) and f0f\geq 0 almost everywhere. Similar to the previous case, we set

M:=Bδ,iku,in(x,x)𝑑xM:=-\int_{B_{\delta,i}^{-}}k_{u,i}^{n}(x^{*},x)dx

and Mf:=Bδ,i+ku,in(x,x)𝑑x>MM_{f}:=\int_{B_{\delta,i}^{+}}k_{u,i}^{n}(x^{*},x)dx>M. As before, the conditions of lemma 6.4 are satisfied, and we are guaranteed a measurable ABδ,i+A\subset B_{\delta,i}^{+} with Af𝑑μ=M\int_{A}fd\mu=M. By definition, this means that Aku,in(x,x)𝑑x=Bδ,iku,in(x,x)𝑑x\int_{A}k_{u,i}^{n}(x^{*},x)dx=-\int_{B_{\delta,i}^{-}}k_{u,i}^{n}(x^{*},x)dx. Thus,

\int_{A}k_{u,i}^{n}(x^{*},x)dx+\int_{B_{\delta,i}^{-}}k_{u,i}^{n}(x^{*},x)dx=0.

Setting \Omega^{*}=A\cup B_{\delta,i}^{-}, we have \nabla_{n}^{\Omega^{*}}u(x^{*})=0. ∎ Note that the \Omega^{*} constructed in the above proof implicitly depends on n due to the dependence of k_{u,i}^{n}(x^{*},x) on n.

6.4 Proof of Proposition 2.3

Proof.

Case 1, uniform convergence: uCc1(Ω).u\in C_{c}^{1}(\Omega). Since rnr=A(x0,x)An(x0,x)r_{n}-r=A(x_{0},x)-A_{n}(x_{0},x), we show rnrr_{n}\rightarrow r uniformly, from which it follows that An(x0,x)A(x0,x)A_{n}(x_{0},x)\rightarrow A(x_{0},x) uniformly. We can estimate, using Cauchy-Schwarz,

|rnr|=|A(x0,x)An(x0,x)|=|(xx0)T(u(x0)nu(x0))|xx0u(x0)nu(x0)Mu(x0)nu(x0),\begin{split}|r_{n}-r|&=|A(x_{0},x)-A_{n}(x_{0},x)|\\ &=|(x-x_{0})^{T}(\nabla u(x_{0})-\nabla_{n}u(x_{0}))|\\ &\leq\|x-x_{0}\|\|\nabla u(x_{0})-\nabla_{n}u(x_{0})\|\\ &\leq M\|\nabla u(x_{0})-\nabla_{n}u(x_{0})\|,\end{split} (26)

where, since x,x0Ωx,x_{0}\in\Omega and Ω\Omega is bounded, we have xx0<M\|x-x_{0}\|<M for some M>0M>0 independent of x0,xx_{0},x. Since uCc1(Ω)u\in C_{c}^{1}(\Omega), we know nuu\nabla_{n}u\rightarrow\nabla u uniformly by Theorem 1.1 of [35]. Thus, given ϵ>0\epsilon>0, there is an N>0N>0 such that for all n>Nn>N, we have u(x0)nu(x0)<ϵM\|\nabla u(x_{0})-\nabla_{n}u(x_{0})\|<\frac{\epsilon}{M}. Putting this into the earlier estimate yields |rnr|<ϵ|r_{n}-r|<\epsilon for n>Nn>N.

Case 2, L^{2}(\Omega\times\Omega) convergence: u\in H_{0}^{1}(\Omega). We know that C^{1}_{0}(\Omega) is dense in H_{0}^{1}(\Omega) (see [2]). Hence, we select a \tilde{u}\in C^{1}_{0}(\Omega) such that \|u-\tilde{u}\|_{H^{1}_{0}(\Omega)}<\epsilon and we consider the standard (local) and nonlocal affine approximants of \tilde{u}, which we denote by \tilde{r} and \tilde{r}_{n} respectively. As before, we have

|r~nr~|xx0u~(x0)nu~(x0).\begin{split}|\tilde{r}_{n}-\tilde{r}|&\leq\|x-x_{0}\|\|\nabla\tilde{u}(x_{0})-\nabla_{n}\tilde{u}(x_{0})\|.\end{split}

Now, as we noted earlier, we have that xx0<M\|x-x_{0}\|<M for some M>0M>0 independent of x0,xx_{0},x due to Ω\Omega being bounded. Thus, holding xx constant, we have,

Ω|r~n(,x)r~(,x)|2𝑑x0M2Ωu~(x0)nu~(x0)2𝑑x0=M2u~(x0)nu~(x0)L2(Ω)2.\begin{split}\int_{\Omega}|\tilde{r}_{n}(\cdot,x)-\tilde{r}(\cdot,x)|^{2}dx_{0}&\leq M^{2}\int_{\Omega}\|\nabla\tilde{u}(x_{0})-\nabla_{n}\tilde{u}(x_{0})\|^{2}dx_{0}\\ &=M^{2}\|\nabla\tilde{u}(x_{0})-\nabla_{n}\tilde{u}(x_{0})\|^{2}_{L^{2}(\Omega)}.\end{split} (27)

Assume that the measure of Ω\Omega is m=m(Ω)m=m(\Omega). Then, integrating again with respect to xx we have

ΩΩ|r~n(,)r~(,)|2𝑑x0𝑑x(Mm)2Ωu~(x0)nu~(x0)2𝑑x0=(Mm)2u~(x0)nu~(x0)L2(Ω)2.\begin{split}\int_{\Omega}\int_{\Omega}|\tilde{r}_{n}(\cdot,\cdot)-\tilde{r}(\cdot,\cdot)|^{2}dx_{0}dx&\leq(M\,m)^{2}\int_{\Omega}\|\nabla\tilde{u}(x_{0})-\nabla_{n}\tilde{u}(x_{0})\|^{2}dx_{0}\\ &=(M\,m)^{2}\|\nabla\tilde{u}(x_{0})-\nabla_{n}\tilde{u}(x_{0})\|^{2}_{L^{2}(\Omega)}.\end{split} (28)

By Theorem 1.1 of [35], u~(x0)nu~(x0)L2(Ω)0\|\nabla\tilde{u}(x_{0})-\nabla_{n}\tilde{u}(x_{0})\|_{L^{2}(\Omega)}\rightarrow 0 as nn\rightarrow\infty, so we conclude that given ϵ>0\epsilon>0, we can find an N>0N>0 such that

u~(x0)nu~(x0)L2(Ω)<ϵMm\|\nabla\tilde{u}(x_{0})-\nabla_{n}\tilde{u}(x_{0})\|_{L^{2}(\Omega)}<\frac{\epsilon}{M\,m}

when n>Nn>N. Thus, we have for n>Nn>N, that

r~r~nL2(Ω×Ω)<ϵ.\|\tilde{r}-\tilde{r}_{n}\|_{L^{2}(\Omega\times\Omega)}<\epsilon. (29)

We now write

rrnL2(Ω×Ω)rr~L2(Ω×Ω)+r~r~nL2(Ω×Ω)+r~nrnL2(Ω×Ω),\|r-r_{n}\|_{L^{2}(\Omega\times\Omega)}\leq\|r-\tilde{r}\|_{L^{2}(\Omega\times\Omega)}+\|\tilde{r}-\tilde{r}_{n}\|_{L^{2}(\Omega\times\Omega)}+\|\tilde{r}_{n}-r_{n}\|_{L^{2}(\Omega\times\Omega)},

and estimate each of the three terms in the above equation. Now,

rr~L2(Ω×Ω)22m2(u(x)u~(x)L2(Ω)2+u(x0)u~(x0)L2(Ω)2+M2u(x0)u~(x0)L2(Ω)2).\begin{split}\|r-\tilde{r}\|^{2}_{L^{2}(\Omega\times\Omega)}&\leq 2m^{2}(\|u(x)-\tilde{u}(x)\|_{L^{2}(\Omega)}^{2}\\ &+\|u(x_{0})-\tilde{u}(x_{0})\|^{2}_{L^{2}(\Omega)}+M^{2}\|\nabla u(x_{0})-\nabla\tilde{u}(x_{0})\|^{2}_{L^{2}(\Omega)}).\end{split}

By the density assumption, we have that uu~L2(Ω)<ϵ\|u-\tilde{u}\|_{L^{2}(\Omega)}<\epsilon and uu~L2(Ω)<ϵ\|\nabla u-\nabla\tilde{u}\|_{L^{2}(\Omega)}<\epsilon, and hence, we conclude that

rr~L2(Ω×Ω)m4+2M2ϵ.\|r-\tilde{r}\|_{L^{2}(\Omega\times\Omega)}\leq m\sqrt{4+2M^{2}}\epsilon. (30)

Similarly, we have

rnr~nL2(Ω×Ω)22m2(u(x)u~(x)L2(Ω)2+u(x0)u~(x0)L2(Ω)2+M2nu(x0)nu~(x0)L2(Ω)2).\begin{split}\|r_{n}-\tilde{r}_{n}\|^{2}_{L^{2}(\Omega\times\Omega)}&\leq 2m^{2}(\|u(x)-\tilde{u}(x)\|_{L^{2}(\Omega)}^{2}\\ &+\|u(x_{0})-\tilde{u}(x_{0})\|^{2}_{L^{2}(\Omega)}+M^{2}\|\nabla_{n}u(x_{0})-\nabla_{n}\tilde{u}(x_{0})\|^{2}_{L^{2}(\Omega)}).\end{split}

We also have

nu(x0)nu~(x0)L2(Ω)nu(x0)u(x0)L2(Ω)+u(x0)u~(x0)L2(Ω)+u~(x0)nu~(x0)L2(Ω).\begin{split}\|\nabla_{n}u(x_{0})-\nabla_{n}\tilde{u}(x_{0})\|_{L^{2}(\Omega)}&\leq\|\nabla_{n}u(x_{0})-\nabla u(x_{0})\|_{L^{2}(\Omega)}\\ &+\|\nabla u(x_{0})-\nabla\tilde{u}(x_{0})\|_{L^{2}(\Omega)}+\|\nabla\tilde{u}(x_{0})-\nabla_{n}\tilde{u}(x_{0})\|_{L^{2}(\Omega)}.\end{split} (31)

By theorem 1.1 of [35], there exists an N1>0N_{1}>0 such that each of the three terms in equation 31 can be bounded by ϵ3\frac{\epsilon}{3}, for all n>N1n>N_{1}, which yields

nu(x0)nu~(x0)L2(Ω)<ϵ,\|\nabla_{n}u(x_{0})-\nabla_{n}\tilde{u}(x_{0})\|_{L^{2}(\Omega)}<\epsilon, (32)

for all n>N1n>N_{1}, which in turn implies that

rnr~nL2(Ω×Ω)m4+2M2ϵ.\|r_{n}-\tilde{r}_{n}\|_{L^{2}(\Omega\times\Omega)}\leq m\sqrt{4+2M^{2}}\epsilon. (33)

Putting equations 29, 30 and 33 together, we have, for n>max(N,N1)n>\text{max}(N,N_{1}), that

rrnL2(Ω×Ω)<(1+2m4+2M2)ϵ,\|r-r_{n}\|_{L^{2}(\Omega\times\Omega)}<(1+2m\sqrt{4+2M^{2}})\epsilon,

and since ϵ\epsilon was arbitrary, we conclude that rn(,)r(,)L2(Ω×Ω)0\|r_{n}(\cdot,\cdot)-r(\cdot,\cdot)\|_{L^{2}(\Omega\times\Omega)}\rightarrow 0 as nn\rightarrow\infty. ∎

6.5 Proof of Proposition 2.4

Proof.

Let uLip(Ω,M)u\in Lip(\Omega,M). We know from Lemma 2.2 of [35] that nuL(Ω)DMρnL1(Ω)\|\nabla_{n}u\|_{L^{\infty}(\Omega)}\leq DM\|\rho_{n}\|_{L^{1}(\Omega)} and since ρnL1(Ω)ρnL1(D)=1\|\rho_{n}\|_{L^{1}(\Omega)}\leq\|\rho_{n}\|_{L^{1}(\mathbb{R}^{D})}=1, we conclude that

nuL(Ω)DM.\|\nabla_{n}u\|_{L^{\infty}(\Omega)}\leq DM.

Now,

xk+1,nxk,n+αknuk,nxk,n+DMαk.\|x^{k+1,n}\|\leq\|x^{k,n}\|+\alpha^{k}\|\nabla_{n}u^{k,n}\|\leq\|x^{k,n}\|+DM\alpha^{k}.

Iterating this observation, we get

xk+1,nx0+DM(m=1kαm).\|x^{k+1,n}\|\leq\|x^{0}\|+DM(\sum_{m=1}^{k}\alpha^{m}).

By assumption, we have m=1Kαm<1\sum_{m=1}^{K}\alpha^{m}<1 for any K>0K>0. Thus, xk+1,nx0+DM\|x^{k+1,n}\|\leq\|x^{0}\|+DM, and the sequence of iterates {xk+1,n}\{x^{k+1,n}\} is bounded. ∎
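
As a numerical sanity check of the boundedness argument above, the following sketch runs the nonlocal descent iteration x^{k+1,n}=x^{k,n}-\alpha^{k}\nabla_{n}u(x^{k,n}) in one dimension with a summable step-size sequence. The Lipschitz objective u(x)=|x| (so M=1, D=1), the box kernel and the quadrature are our own illustrative assumptions, not choices made in the paper.

```python
# Illustrative sketch only: bounded iterates of nonlocal gradient descent with a
# summable step-size sequence, for u(x) = |x| (M = 1, D = 1) and a box kernel rho_n.
import numpy as np

def grad_n(u, x, n, h=1e-4):
    y = np.arange(x - 1.0, x + 1.0, h) + 0.5 * h          # quadrature grid, y != x
    rho_n = 0.5 * n * (np.abs(x - y) < 1.0 / n)           # normalized box kernel
    return np.sum((u(x) - u(y)) * (x - y) / np.abs(x - y) ** 2 * rho_n) * h

u = lambda t: np.abs(t)                                   # convex, 1-Lipschitz, kink at 0
x0, n = 2.0, 32
x = x0
for k in range(30):
    alpha_k = 0.5 ** (k + 1)                              # step sizes with sum < 1
    x = x - alpha_k * grad_n(u, x, n)
    assert abs(x) <= abs(x0) + 1.0 * 1.0                  # ||x^{k+1,n}|| <= ||x^0|| + D*M
print("final iterate:", x)
```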

6.6 Proof of Theorem 2.5

Proof.

The proof is by induction on kk.

Base case:

Let k=0k=0. Now, x0=x0,nx^{0}=x^{0,n}, so

x1x1,n=α0(nu0u0).x^{1}-x^{1,n}=\alpha^{0}(\nabla_{n}u^{0}-\nabla u^{0}).

By [35], we know that nuu\nabla_{n}u\rightarrow\nabla u as nn\rightarrow\infty uniformly for all xVx\in V. Thus, given ϵ>0\epsilon>0 there is an N>0N>0 so that for all n>Nn>N, we have nu0u0<ϵα0\|\nabla_{n}u^{0}-\nabla u^{0}\|<\frac{\epsilon}{\alpha^{0}}. Thus, x1x1,n<ϵ\|x^{1}-x^{1,n}\|<\epsilon for n>Nn>N.

Induction Step:

Having completed the base case, we assume the result to be true for k=Kk=K, and we prove the estimate for k=K+1k=K+1. Again,

xK+1xK+1,n=xKxK,n+αK(nuK,nuK).x^{K+1}-x^{K+1,n}=x^{K}-x^{K,n}+\alpha^{K}(\nabla_{n}u^{K,n}-\nabla u^{K}).

Thus,

xK+1xK+1,nxKxK,n(1)+αK(nuK,nuK)(2).\|x^{K+1}-x^{K+1,n}\|\leq\underbrace{\|x^{K}-x^{K,n}\|}_{(1)}+\alpha^{K}\underbrace{\|(\nabla_{n}u^{K,n}-\nabla u^{K})\|}_{(2)}.

By the induction hypothesis, there is an N0>0N_{0}>0 such that term (1)(1) is bounded by ϵ3\frac{\epsilon}{3} for n>N0n>N_{0}.

For the second term, we have

nuK,nuKnuK,nuK,n+uK,nuK.\|\nabla_{n}u^{K,n}-\nabla u^{K}\|\leq\|\nabla_{n}u^{K,n}-\nabla u^{K,n}\|+\|\nabla u^{K,n}-\nabla u^{K}\|.

First, by the uniform convergence of \nabla_{n}u\rightarrow\nabla u, there is an N_{1} such that for n>N_{1}, we have

nuK,nuK,n<ϵ3αK.\|\nabla_{n}u^{K,n}-\nabla u^{K,n}\|<\frac{\epsilon}{3\alpha^{K}}.

Second, by the mean-value theorem applied to uK,nuK\nabla u^{K,n}-\nabla u^{K}, we have that

uK,nuKCxK,nxK,\|\nabla u^{K,n}-\nabla u^{K}\|\leq C\|x^{K,n}-x^{K}\|,

where C=HuL(V)<C=\|Hu\|_{L^{\infty}(V)}<\infty. Indeed, since uC2(Ω)u\in C^{2}(\Omega) the Hessian is continuous on the compact set VV and hence is uniformly bounded. By the induction hypothesis, there is an N2N_{2} such that xK,nxK<ϵ3CαK\|x^{K,n}-x^{K}\|<\frac{\epsilon}{3C\alpha^{K}} for all n>N2n>N_{2}. Choosing N=max(N0,N1,N2)N=\max({N_{0},N_{1},N_{2}}), we have that for n>N=N(α0,,αK)n>N=N(\alpha^{0},\ldots,\alpha^{K}), xK+1xK+1,n<ϵ\|x^{K+1}-x^{K+1,n}\|<\epsilon, and the induction step is complete. ∎

6.7 Proof of Theorem 2.6

In the proof of the theorem, we shall need the notion of Γ\Gamma-convergence in the context of metric spaces (see [34] for more details).

Definition 2.

Let XX be a metric space (or, more generally, a first countable topological space) and Fn:X¯F_{n}:X\rightarrow\overline{\mathbb{R}} be a sequence of extended real valued functions on XX. Then the sequence FnF_{n} is said to Γ\Gamma-converge to the Γ\Gamma-limit F:X¯F:X\to\overline{\mathbb{R}} (written FnΓFF_{n}\xrightarrow{\text{$\Gamma$}}F) if the following two conditions hold:

  • For every sequence xnXx_{n}\in X such that xnxXx_{n}\rightarrow x\in X as nn\rightarrow\infty, we have F(x)liminfnFn(xn),F(x)\leq\lim\inf_{n\rightarrow\infty}F_{n}(x_{n}),

  • For all xXx\in X, there is a sequence of xnXx_{n}\in X with xnxx_{n}\rightarrow x (called a Γ\Gamma-realizing sequence) such that F(x)limsupnFn(xn).F(x)\geq\lim\sup_{n\rightarrow\infty}F_{n}(x_{n}).

The notion of Γ\Gamma-convergence is an essential ingredient for the study of convergence of minimizers. Indeed, if FnΓFF_{n}\xrightarrow{\text{$\Gamma$}}F, and xnx_{n} is a minimizer of FnF_{n}, then it is true that every limit point of the sequence xnx_{n} is a minimizer of FF. Moreover, if FnFF_{n}\rightarrow F uniformly, then FnΓFlscF_{n}\xrightarrow{\text{$\Gamma$}}F^{lsc}, where FlscF^{lsc} is the lower semi-continuous envelope of FF, i.e., the largest lower semi-continuous function bounded above by FF. In particular, if FnF_{n} and FF happen to be continuous, then Flsc=FF^{lsc}=F, and therefore, if FnFF_{n}\rightarrow F uniformly then FnΓFlsc=FF_{n}\xrightarrow{\text{$\Gamma$}}F^{lsc}=F (see [34]).
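
The following toy computation of ours (the family F_{n} is an arbitrary illustrative choice) shows the quoted fact at work: F_{n}(x)=(x-1)^{2}+\sin(nx)/n converges uniformly to the continuous function F(x)=(x-1)^{2} on [0,2], and the grid minimizers of F_{n} accordingly approach the minimizer x^{*}=1 of F.

```python
# Illustrative toy only: uniform convergence of continuous F_n to F and the resulting
# convergence of (grid) minimizers, in line with the Gamma-convergence remarks above.
import numpy as np

xs = np.linspace(0.0, 2.0, 200_001)
for n in (5, 50, 500, 5000):
    Fn = (xs - 1.0) ** 2 + np.sin(n * xs) / n             # converges uniformly to (x-1)^2
    print(n, xs[np.argmin(Fn)])                           # grid minimizer, tends to x* = 1
```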

Proof.

The proof is by induction on kk.

Base case:

We consider k=0k=0. Recall that

α0,n=argminαIu(x0,nαnu0,n)=argminαIu(x0αnu0),\alpha^{0,n}=\operatorname*{arg\,min}_{\alpha\in I}\,u(x^{0,n}-\alpha\nabla_{n}u^{0,n})=\operatorname*{arg\,min}_{\alpha\in I}\,u(x^{0}-\alpha\nabla_{n}u^{0}),

since x0=x0,nx^{0}=x^{0,n}. Now let

ϕn(α)=x0αnu0,\phi_{n}(\alpha)=x^{0}-\alpha\nabla_{n}u^{0},

and

ϕ(α)=x0αu0.\phi(\alpha)=x^{0}-\alpha\nabla u^{0}.

We first notice that ϕn,ϕ\phi_{n},\phi are continuous and ϕnϕ\phi_{n}\rightarrow\phi uniformly as nn\rightarrow\infty. Indeed,

|ϕn(α)ϕ(α)|=α|u0nu0|,|\phi_{n}(\alpha)-\phi(\alpha)|=\alpha|\nabla u^{0}-\nabla_{n}u^{0}|,

and, by the uniform α\alpha-independent convergence nu0u0\nabla_{n}u^{0}\rightarrow\nabla u^{0} (see [35]), we conclude that

ϕn(α)ϕ(α)L(Ω)=αu0nu0L(Ω)0\|\phi_{n}(\alpha)-\phi(\alpha)\|_{L^{\infty}(\Omega)}=\alpha\|\nabla u^{0}-\nabla_{n}u^{0}\|_{L^{\infty}(\Omega)}\rightarrow 0

as nn\rightarrow\infty for any αI\alpha\in I. Now, since II is compact and u,ϕn,ϕu,\phi_{n},\phi are continuous, the composition uϕnu\circ\phi_{n} converges uniformly to uϕu\circ\phi.

By our remarks about Γ\Gamma-convergence earlier, we know that uniform convergence of a sequence of continuous functions implies Γ\Gamma-convergence, which in turn implies convergence of limits of minimizers. Thus, limit points of the sequence of minimizers {α0,n}\{\alpha^{0,n}\} of uϕnu\circ\phi_{n} are minimizers of uϕu\circ\phi, i.e., α0\alpha^{0}. In other words, there is a subsequence {α0,nm(0)}\{\alpha^{0,n_{m}(0)}\} of {α0,n}\{\alpha^{0,n}\} such that

α0,nm(0)α0\alpha^{0,n_{m}(0)}\rightarrow\alpha^{0}

as nm(0)n_{m}(0)\rightarrow\infty.

\begin{split}\|x^{1,n_{m}(0)}-x^{1}\|&=\|\alpha^{0}\nabla u^{0}-\alpha^{0,n_{m}(0)}\nabla_{n_{m}(0)}u^{0,n_{m}(0)}\|\\ &=\|\alpha^{0}\nabla u^{0}-\alpha^{0,n_{m}(0)}\nabla_{n_{m}(0)}u^{0}\|\\ &\leq\alpha^{0}\|\nabla u^{0}-\nabla_{n_{m}(0)}u^{0}\|+|\alpha^{0}-\alpha^{0,n_{m}(0)}|\,\|\nabla_{n_{m}(0)}u^{0}\|\\ &\leq A\|\nabla u^{0}-\nabla_{n_{m}(0)}u^{0}\|+DM|\alpha^{0}-\alpha^{0,n_{m}(0)}|,\end{split} (34)

because x0=x0,nm(0)x^{0}=x^{0,n_{m}(0)} so that,

\begin{split}\nabla_{n_{m}(0)}u^{0,n_{m}(0)}&=\nabla_{n_{m}(0)}u(x^{0,n_{m}(0)})\\ &=\nabla_{n_{m}(0)}u(x^{0})\\ &=\nabla_{n_{m}(0)}u^{0}.\\ \end{split} (35)

The third line of equation 34 is the triangle inequality, and the last line uses \alpha^{0}\in I=[0,A] together with the bound \|\nabla_{n_{m}(0)}u^{0}\|\leq\|\nabla_{n_{m}(0)}u\|_{L^{\infty}(\Omega)}\leq DM from Lemma 2.2 of [35] (compare equation 37 below). Since \alpha^{0,n_{m}(0)}\rightarrow\alpha^{0} as n_{m}(0)\rightarrow\infty, for any \epsilon>0 there is an N_{1}>0 such that

|\alpha^{0}-\alpha^{0,n_{m}(0)}|<\frac{\epsilon}{2DM}

for n_{m}(0)>N_{1}. Likewise, there is an N_{2}>0 such that

\|\nabla u^{0}-\nabla_{n_{m}(0)}u^{0,n_{m}(0)}\|=\|\nabla u^{0}-\nabla_{n_{m}(0)}u^{0}\|<\frac{\epsilon}{2A}

for n_{m}(0)>N_{2}, since \nabla_{n}u\rightarrow\nabla u as n\rightarrow\infty. Choosing N=\max\{N_{1},N_{2}\}, we see \|x^{1}-x^{1,n_{m}(0)}\|<\epsilon. The base case is proved.

Induction Step:

We assume the result is true for k=K1k=K-1. We prove it for k=Kk=K. Similar to the base case, we let

ϕK,n(α)=xK,nαnuK,n\phi_{K,n}(\alpha)=x^{K,n}-\alpha\nabla_{n}u^{K,n}

and

ϕK(α)=xKαuK.\phi_{K}(\alpha)=x^{K}-\alpha\nabla u^{K}.

We now show that by the induction hypothesis, there is a subsequence ϕK,nm(K)\phi_{K,n_{m}(K)} of ϕK,n\phi_{K,n} such that ϕK,nm(K)ϕK\phi_{K,n_{m}(K)}\rightarrow\phi_{K} uniformly. Indeed,

ϕK,n(α)ϕK(α)=xK,nxK+α(uKnuK,n).\phi_{K,n}(\alpha)-\phi_{K}(\alpha)=x^{K,n}-x^{K}+\alpha(\nabla u^{K}-\nabla_{n}u^{K,n}).

Thus,

ϕK,n(α)ϕK(α)xK,nxK+αuKnuK,n,\|\phi_{K,n}(\alpha)-\phi_{K}(\alpha)\|\leq\|x^{K,n}-x^{K}\|+\alpha\|\nabla u^{K}-\nabla_{n}u^{K,n}\|,

so that

ϕK,n(α)ϕK(α)xK,nxK(1)+α(uKuK,n(2)+uK,nnuK,n(3)).\|\phi_{K,n}(\alpha)-\phi_{K}(\alpha)\|\leq\underbrace{\|x^{K,n}-x^{K}\|}_{(1)}+\alpha(\underbrace{\|\nabla u^{K}-\nabla u^{K,n}\|}_{(2)}+\underbrace{\|\nabla u^{K,n}-\nabla_{n}u^{K,n}\|}_{(3)}).

By the induction hypothesis, there is a subsequence xK,nm(K)x^{K,n_{m}(K)} of xK,nx^{K,n} such that xK,nm(K)xKx^{K,n_{m}(K)}\rightarrow x^{K} as nm(K)n_{m}(K)\rightarrow\infty. Thus, given an ϵ>0\epsilon>0, there is an N1>0N_{1}>0 such that for all nm(K)>N1n_{m}(K)>N_{1},

xK,nm(k)xK<ϵ3.\|x^{K,n_{m}(k)}-x^{K}\|<\frac{\epsilon}{3}.

Next, by the uniform convergence of nuu\nabla_{n}u\rightarrow\nabla u, we have that there exists an N2>0N_{2}>0 such that

uK,nm(K)nm(K)uK,nm(K)<ϵ3\|\nabla u^{K,n_{m}(K)}-\nabla_{n_{m}(K)}u^{K,n_{m}(K)}\|<\frac{\epsilon}{3}

for all nm(K)>N2n_{m}(K)>N_{2}. As in the suboptimal stepsize case, we now apply the mean value theorem to uKuK,nm(K)\nabla u^{K}-\nabla u^{K,n_{m}(K)}. We have that

uKuK,nm(K)CxKxK,nm(K)<Cϵ3\|\nabla u^{K}-\nabla u^{K,n_{m}(K)}\|\leq C\|x^{K}-x^{K,n_{m}(K)}\|<C\frac{\epsilon}{3}

where C=HuL(V)<C=\|Hu\|_{L^{\infty}(V)}<\infty. Indeed, since uC2(Ω)u\in C^{2}(\Omega), the Hessian is continuous on the compact set VV containing the iterates xk,xk,nx^{k},x^{k,n} and hence is uniformly bounded. Putting all this together, and noting that αI=[0,A]\alpha\in I=[0,A] we see that

ϕK,nm(K)(α)ϕK(α)<ϵ3(1+α(C+1))ϵ3(1+A(C+1))\|\phi_{K,n_{m}(K)}(\alpha)-\phi_{K}(\alpha)\|<\frac{\epsilon}{3}(1+\alpha(C+1))\leq\frac{\epsilon}{3}(1+A(C+1))

for nm(K)>max{N1,N2}n_{m}(K)>\max{\{N_{1},N_{2}\}}. Thus, ϕK,nm(K)(α)ϕK(α)\phi_{K,n_{m}(K)}(\alpha)\rightarrow\phi_{K}(\alpha) uniformly on the compact interval II. Therefore, the composition uϕK,nm(K)u\circ\phi_{K,n_{m}(K)} converges uniformly to uϕKu\circ\phi_{K} since uu is continuous. This implies that uϕK,nm(K)u\circ\phi_{K,n_{m}(K)} Γ\Gamma-converges to uϕKu\circ\phi_{K}, so that every limit point of the collection of minimizers {αK,nm(K)}\{\alpha^{K,n_{m}(K)}\} of ϕK,nm(K)\phi_{K,n_{m}(K)} is a (and since we have assumed uniqueness of the linesearch minimizer, we can say the) minimizer αK\alpha^{K} of ϕK\phi_{K}. There is therefore a further subsequence of {αK,nm(K)}\{\alpha^{K,n_{m}(K)}\}, which we will denote by {αK,nm1(K)}\{\alpha^{K,n_{m_{1}}(K)}\}, such that αK,nm1(K)αK\alpha^{K,n_{m_{1}}(K)}\rightarrow\alpha^{K}. This proves the first half of the result. We still need to show that there exists a subsequence xK+1,nm2(K)x^{K+1,n_{m_{2}}(K)} of xK+1,nx^{K+1,n} such that xK+1,nm2(K)xK+10\|x^{K+1,n_{m_{2}}(K)}-x^{K+1}\|\rightarrow 0 as nm2(K)n_{m_{2}}(K)\rightarrow\infty.

To this end, consider

xK+1,nxK+1=xK,nxK+αKuKαK,nnuK,nxK,nxK+αKuKαK,nnuK,n.\begin{split}\|x^{K+1,n}-x^{K+1}\|&=\|x^{K,n}-x^{K}+\alpha^{K}\nabla u^{K}-\alpha^{K,n}\nabla_{n}u^{K,n}\|\\ &\leq\|x^{K,n}-x^{K}\|+\|\alpha^{K}\nabla u^{K}-\alpha^{K,n}\nabla_{n}u^{K,n}\|.\\ \end{split} (36)

We write

αKuKαK,nnuK,n=αKuKαKnuK,n+αKnuK,nαK,nnuK,nαKuKnuK,n+|αKαK,n|nuK,nαKuKnuK,n+|αKαK,n|DM.\begin{split}\|\alpha^{K}\nabla u^{K}-\alpha^{K,n}\nabla_{n}u^{K,n}\|&=\|\alpha^{K}\nabla u^{K}-\alpha^{K}\nabla_{n}u^{K,n}+\alpha^{K}\nabla_{n}u^{K,n}-\alpha^{K,n}\nabla_{n}u^{K,n}\|\\ &\leq\alpha^{K}\|\nabla u^{K}-\nabla_{n}u^{K,n}\|+|\alpha^{K}-\alpha^{K,n}|\|\nabla_{n}u^{K,n}\|\\ &\leq\alpha^{K}\|\nabla u^{K}-\nabla_{n}u^{K,n}\|+|\alpha^{K}-\alpha^{K,n}|DM.\end{split} (37)

The last inequality follows since u\in C^{2}(\Omega)\cap Lip(\Omega,M) and, by Lemma 2.2 of [35], we have \|\nabla_{n}u^{K,n}\|\leq\|\nabla_{n}u\|_{L^{\infty}(\Omega)}\leq DM\|\rho_{n}\|_{L^{1}(\Omega)}\leq DM.

Now,

uKnuK,nuKuK,n+uK,nnuK,n.\|\nabla u^{K}-\nabla_{n}u^{K,n}\|\leq\|\nabla u^{K}-\nabla u^{K,n}\|+\|\nabla u^{K,n}-\nabla_{n}u^{K,n}\|.

By the mean value theorem, the first term

uKuK,nCxKxK,n\|\nabla u^{K}-\nabla u^{K,n}\|\leq C\|x^{K}-x^{K,n}\|

with C=HuL(V)<C=\|Hu\|_{L^{\infty}(V)}<\infty due to uC2(Ω)u\in C^{2}(\Omega). The second term

uK,nnuK,n0\|\nabla u^{K,n}-\nabla_{n}u^{K,n}\|\rightarrow 0

as nn\rightarrow\infty by uniform convergence of nu\nabla_{n}u to u\nabla u on VV. Thus, putting all the above together, we see that

xK+1,nxK+1(1+αKC)xK,nxK+αKuK,nnuK,n+|αK,nαK|DM.\|x^{K+1,n}-x^{K+1}\|\leq(1+\alpha^{K}C)\|x^{K,n}-x^{K}\|+\alpha^{K}\|\nabla u^{K,n}-\nabla_{n}u^{K,n}\|+|\alpha^{K,n}-\alpha^{K}|DM.

Now, by our induction hypothesis, there is a subsequence xK,nm(K)x^{K,n_{m}(K)} of xK,nx^{K,n} such that xK,nm(K)xKx^{K,n_{m}(K)}\rightarrow x^{K} as nm(K)n_{m}(K)\rightarrow\infty. Thus, given an ϵ>0\epsilon>0, there is an N1>0N_{1}>0 such that for all nm(K)>N1n_{m}(K)>N_{1},

xK,nm(k)xK<ϵ3(1+αKC).\|x^{K,n_{m}(k)}-x^{K}\|<\frac{\epsilon}{3(1+\alpha^{K}C)}.

Next, by what was proved earlier, there is a subsequence \alpha^{K,n_{m_{1}}(K)} of \alpha^{K,n_{m}(K)} and an N_{2}>0 such that for n_{m_{1}}(K)>N_{2},

|αK,nm1(K)αK|<ϵ3DM.|\alpha^{K,n_{m_{1}}(K)}-\alpha^{K}|<\frac{\epsilon}{3DM}.

Finally, there exists an N3N_{3} such that

uK,nm(K)nm(K)uK,nm(K)<ϵ3αK.\|\nabla u^{K,n_{m}(K)}-\nabla_{n_{m}(K)}u^{K,n_{m}(K)}\|<\frac{\epsilon}{3\alpha^{K}}.

Define a new subsequence {nm2(K)}\{n_{m_{2}}(K)\} to be the diagonal of {nm(K)}\{n_{m}(K)\} and {nm1(K)}\{n_{m_{1}}(K)\}. Then, for nm2(K)>max{N1,N2,N3}n_{m_{2}}(K)>\max\{N_{1},N_{2},N_{3}\}, we have

\begin{split}\|x^{K+1,n_{m_{2}}(K)}-x^{K+1}\|&\leq(1+\alpha^{K}C)\|x^{K,n_{m_{2}}(K)}-x^{K}\|\\ &+\alpha^{K}\|\nabla u^{K,n_{m_{2}}(K)}-\nabla_{n_{m_{2}}(K)}u^{K,n_{m_{2}}(K)}\|+|\alpha^{K,n_{m_{2}}(K)}-\alpha^{K}|DM\\ &<\epsilon.\\ \end{split} (38)

The subsequence xK+1,nm2(K)x^{K+1,n_{m_{2}}(K)} is the required subsequence of xK+1,nx^{K+1,n}, and the induction is complete. ∎

6.8 Proof of Lemma 2.7

Proof.

By convexity of uu, we have that

u(y)u(x)(yx)Tu(x), or,u(x)u(y)(xy)Tu(x).\begin{split}u(y)-u(x)&\geq(y-x)^{T}\nabla u(x),\text{ or,}\\ u(x)-u(y)&\leq(x-y)^{T}\nabla u(x).\\ \end{split} (39)

On a compact neighborhood KK of xx, nuu\nabla_{n}u\rightarrow\nabla u uniformly by theorem 1.1 of [35]. Thus, nu(x)u(x)\nabla_{n}u(x)\rightarrow\nabla u(x) as nn\rightarrow\infty. For any λD\lambda\in\mathbb{R}^{D}, this implies that we have λTnu(x)λTu(x)\lambda^{T}\nabla_{n}u(x)\rightarrow\lambda^{T}\nabla u(x). Choosing λ=(yx)\lambda=(y-x), we have (yx)Tnu(x)(yx)Tu(x)(y-x)^{T}\nabla_{n}u(x)\rightarrow(y-x)^{T}\nabla u(x). Thus, given ϵ>0\epsilon>0, there is an N>0N>0 such that for all n>Nn>N, we have |(yx)Tnu(x)(yx)Tu(x)|<ϵ|(y-x)^{T}\nabla_{n}u(x)-(y-x)^{T}\nabla u(x)|<\epsilon, so that ϵ+(yx)Tnu(x)<(yx)Tu(x)<ϵ+(yx)Tnu(x)-\epsilon+(y-x)^{T}\nabla_{n}u(x)<(y-x)^{T}\nabla u(x)<\epsilon+(y-x)^{T}\nabla_{n}u(x). Thus, u(y)u(x)(yx)Tu(x)>(yx)Tnu(x)ϵ.u(y)-u(x)\geq(y-x)^{T}\nabla u(x)>(y-x)^{T}\nabla_{n}u(x)-\epsilon.

6.9 Proof of Theorem 2.8

As a motivating discussion of the nonlocal stochastic gradient descent method, we consider a simple example. If f:[0,1]f:[0,1]\rightarrow\mathbb{R} is a differentiable function, we can consider, for any x,y[0,1]x,y\in[0,1], and xyx\neq y, the difference quotient

kf(x,y)=f(x)f(y)xy,k_{f}(x,y)=\frac{f(x)-f(y)}{x-y},

which, by the mean value theorem, equals f(ξ)f^{\prime}(\xi) for some ξ[x,y]\xi\in[x,y]. If we now think of X,YX,Y as [0,1][0,1]-valued random variables, then we have that the difference quotient kf(X,Y)k_{f}(X,Y) is also a random variable. Thus, it is reasonable to expect that if the the conditional distribution of YY given XX is centered sufficiently closely around XX, then

𝔼[kf(X,Y)|X]f(X).\mathbb{E}[k_{f}(X,Y)|X]\approx f^{\prime}(X).

While this approximation depends on the exact nature of pY|X(y|x)p_{Y|X}(y|x), it nevertheless motivates our interpretation of kf(X,Y)k_{f}(X,Y) as being a nonlocal subgradient.
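
The following Monte Carlo sketch makes the heuristic above concrete for f(t)=t^{3}; the Gaussian proposal for Y, its widths and the clipping to [0,1] are illustrative assumptions rather than constructions from the paper.

```python
# Illustrative Monte Carlo sketch: E[k_f(X, Y) | X = x] approaches f'(x) when the
# conditional law of Y is concentrated near x. Gaussian proposal and clipping assumed.
import numpy as np

rng = np.random.default_rng(0)
f = lambda t: t ** 3                       # f'(x) = 3 x^2
x = 0.6
for sigma in (0.3, 0.1, 0.03, 0.01):
    y = np.clip(x + sigma * rng.standard_normal(200_000), 0.0, 1.0)
    y = y[np.abs(y - x) > 1e-9]            # discard the rare samples with y == x
    kf = (f(x) - f(y)) / (x - y)           # difference quotient k_f(x, Y)
    print(sigma, kf.mean(), 3 * x ** 2)    # sample mean approaches f'(0.6) = 1.08
```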

We recall a relevant result from [41].

Corollary 6.5.

(Corollary 14.2 of [41]) Let uu be a convex, MM-Lipschitz function and let xargminx:xBu(x).x^{*}\in\arg\min_{x:\|x\|\leq B}u(x). If we run the GD algorithm on uu for KK steps with α=B2M2K\alpha=\sqrt{\frac{B^{2}}{M^{2}K}} then the output vector x¯\bar{x} satisfies

u(x¯)u(x)BMK.u(\bar{x})-u(x^{*})\leq\frac{BM}{\sqrt{K}}.

Furthermore, for every ϵ>0\epsilon>0, to achieve u(x¯)u(x)ϵ,u(\bar{x})-u(x^{*})\leq\epsilon, it suffices to run the GD algorithm for a number of iterations that satisfies

KB2M2ϵ2.K\geq\frac{B^{2}M^{2}}{\epsilon^{2}}.
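
Before turning to the nonlocal version, the following small experiment of ours checks the rate in Corollary 6.5 for the convex, 1-Lipschitz function u(x)=|x| with B=M=1, using (sub)gradient descent with the stated constant step size and the averaged output \bar{x}; the objective and starting point are illustrative choices.

```python
# Illustrative check (not from the paper) of the BM/sqrt(K) rate of Corollary 6.5 for
# u(x) = |x|, B = M = 1: subgradient steps with constant step size and iterate averaging.
import numpy as np

B, M = 1.0, 1.0
u = lambda t: abs(t)
for K in (10, 100, 1000, 10000):
    alpha = np.sqrt(B ** 2 / (M ** 2 * K))
    x, iterates = 1.0, []
    for _ in range(K):
        iterates.append(x)
        x = x - alpha * np.sign(x)          # sign(x) is a subgradient of |x| (0 at x = 0)
    x_bar = float(np.mean(iterates))
    print(K, u(x_bar) - u(0.0), B * M / np.sqrt(K))   # suboptimality vs. the bound
```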

We now prove Theorem 2.8, which we view as a nonlocal version of Corollary 14.2 of [41] mentioned above.

Proof.

By convexity of u and Jensen's inequality,

\begin{split}u(\bar{x})-u(x^{*})&=u\Big(\frac{1}{K}\sum_{k=1}^{K}x^{k}\Big)-u(x^{*})\\ &\leq\frac{1}{K}\sum_{k=1}^{K}u(x^{k})-u(x^{*})\\ &=\frac{1}{K}\sum_{k=1}^{K}\big(u(x^{k})-u(x^{*})\big).\\ \end{split} (40)

By the choice of ϵ\epsilon-subgradient gkϵu(xk)g^{k}\in\partial_{\epsilon}u(x^{k}) of uu, we have for every kk, that

u(xk)u(x)(xkx)Tgk+ϵ,u(x^{k})-u(x^{*})\leq(x^{k}-x^{*})^{T}g^{k}+\epsilon, (41)

which upon summing and dividing by KK yields

u(x¯)u(x)1Kk=1K(xkx)Tgk+ϵ.u(\bar{x})-u(x^{*})\leq\frac{1}{K}\sum_{k=1}^{K}(x^{k}-x^{*})^{T}g^{k}+\epsilon. (42)

We appeal now to lemma 14.1 of [41] to conclude that 1Kk=1K(xkx)TgkBMK\frac{1}{K}\sum_{k=1}^{K}(x^{k}-x^{*})^{T}g^{k}\leq\frac{BM}{\sqrt{K}}, which results in the final estimate

u(x¯)u(x)BMK+ϵ.u(\bar{x})-u(x^{*})\leq\frac{BM}{\sqrt{K}}+\epsilon. (43)

Now, given an ϵ^>ϵ\hat{\epsilon}>\epsilon, it is clear that picking KB2M2(ϵ^ϵ)2K\geq\frac{B^{2}M^{2}}{(\hat{\epsilon}-\epsilon)^{2}} results in u(x¯)u(x)ϵ^u(\bar{x})-u(x^{*})\leq\hat{\epsilon}. ∎

6.10 Proof of Theorem 2.9

Proof.

The proof follows that of Theorem 14.8 of [41]. Let g^{1:k} denote the sequence of \epsilon-subgradients g^{1},\ldots,g^{k}, and let \mathbb{E}_{g^{1:k}}[\cdot] denote the expected value of the bracketed quantity over the draws g^{1},\ldots,g^{k}. Now, u(\bar{x})-u(x^{*})\leq\frac{1}{K}\sum_{k=1}^{K}\big(u(x^{k})-u(x^{*})\big), and therefore \mathbb{E}_{g^{1:K}}[u(\bar{x})-u(x^{*})]\leq\mathbb{E}_{g^{1:K}}[\frac{1}{K}\sum_{k=1}^{K}\big(u(x^{k})-u(x^{*})\big)]. By Lemma 14.1 of [41], we have \mathbb{E}_{g^{1:K}}[\frac{1}{K}\sum_{k=1}^{K}(g^{k})^{T}(x^{k}-x^{*})]\leq\frac{BM}{\sqrt{K}}.

We show now that

𝔼g1:K[1K(k=1Ku(xk)u(x))]𝔼g1:K[1Kk=1K(gk)T(xkx)]+ϵ,\mathbb{E}_{g^{1:K}}[\frac{1}{K}(\sum_{k=1}^{K}u(x^{k})-u(x^{*}))]\leq\mathbb{E}_{g^{1:K}}[\frac{1}{K}\sum_{k=1}^{K}(g^{k})^{T}(x^{k}-x^{*})]+\epsilon, (44)

which would establish the result.

Now,

𝔼g1:K[1Kk=1K(gk)T(xkx)]=1Kk=1K𝔼g1:K[(gk)T(xkx)]=1Kk=1K𝔼g1:k[(gk)T(xkx)],\begin{split}\mathbb{E}_{g^{1:K}}[\frac{1}{K}\sum_{k=1}^{K}(g^{k})^{T}(x^{k}-x^{*})]&=\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{g^{1:K}}[(g^{k})^{T}(x^{k}-x^{*})]\\ &=\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}_{g^{1:k}}[(g^{k})^{T}(x^{k}-x^{*})],\\ \end{split} (45)

where the first equality is by linearity of expectation, and the second equality is due to the fact that the quantity (g^{k})^{T}(x^{k}-x^{*}) depends only on iterates up to and including k, and not beyond k. Next, by the law of total expectation, we have

𝔼g1:k[(gk)T(xkx)]=𝔼g1:k1[𝔼g1:k[(gk)T(xkx)|g1:k1]].\begin{split}\mathbb{E}_{g^{1:k}}[(g^{k})^{T}(x^{k}-x^{*})]&=\mathbb{E}_{g^{1:k-1}}[\mathbb{E}_{g^{1:k}}[(g^{k})^{T}(x^{k}-x^{*})|g^{1:k-1}]].\\ \end{split} (46)

Conditioned on g1:k1g^{1:k-1}, the quantity xkx^{k} is deterministic, and hence

𝔼g1:k1[𝔼g1:k[(gk)T(xkx)|g1:k1]]=𝔼g1:k1[(𝔼gk[gk|g1:k1])T(xkx)].\begin{split}\mathbb{E}_{g^{1:k-1}}[\mathbb{E}_{g^{1:k}}[(g^{k})^{T}(x^{k}-x^{*})|g^{1:k-1}]]&=\mathbb{E}_{g^{1:k-1}}[(\mathbb{E}_{g^{k}}[g^{k}|g^{1:k-1}])^{T}(x^{k}-x^{*})].\\ \end{split} (47)

We now make the crucial observation that 𝔼gk[gk|g1:k1]ϵu(xk)\mathbb{E}_{g^{k}}[g^{k}|g^{1:k-1}]\in\partial_{\epsilon}u(x^{k}). Indeed, we have noted that xkx^{k} is deterministic given g1:k1g^{1:k-1}, so that 𝔼gk[gk|g1:k1]=𝔼gk[𝔼gk[gk|g1:k1]|xk]\mathbb{E}_{g^{k}}[g^{k}|g^{1:k-1}]=\mathbb{E}_{g^{k}}[\mathbb{E}_{g^{k}}[g^{k}|g^{1:k-1}]|x^{k}]. However, since xkx^{k} depends on g1:k1g^{1:k-1}, the Doob-Dynkin lemma (see [28]) implies that the sigma algebra σ(xk)\sigma(x^{k}) generated by xkx^{k} is a subalgebra of σ(g1:k1)\sigma(g^{1:k-1}), the sigma algebra generated by g1:k1g^{1:k-1}. Therefore, by the tower property of conditional expectation, we have that 𝔼gk[𝔼gk[gk|g1:k1]|xk]=𝔼gk[gk|xk]\mathbb{E}_{g^{k}}[\mathbb{E}_{g^{k}}[g^{k}|g^{1:k-1}]|x^{k}]=\mathbb{E}_{g^{k}}[g^{k}|x^{k}]. This implies that 𝔼gk[gk|g1:k1]ϵu(xk)\mathbb{E}_{g^{k}}[g^{k}|g^{1:k-1}]\in\partial_{\epsilon}u(x^{k}) since by construction we know that 𝔼gk[gk|xk]ϵu(xk)\mathbb{E}_{g^{k}}[g^{k}|x^{k}]\in\partial_{\epsilon}u(x^{k}). Thus, by the ϵ\epsilon-subgradient property of the gkg^{k}, we have that 𝔼g1:k1[(𝔼gk[gk|g1:k1])T(xkx)]𝔼g1:k1[u(xk)u(x)]ϵ.\mathbb{E}_{g^{1:k-1}}[(\mathbb{E}_{g^{k}}[g^{k}|g^{1:k-1}])^{T}(x^{k}-x^{*})]\geq\mathbb{E}_{g^{1:k-1}}[u(x^{k})-u(x^{*})]-\epsilon.

Thus, we have shown that 𝔼g1:K[(gk)T(xkx)]𝔼g1:k1[u(xk)u(x)]ϵ=𝔼g1:K[u(xk)u(x)]ϵ.\mathbb{E}_{g^{1:K}}[(g^{k})^{T}(x^{k}-x^{*})]\geq\mathbb{E}_{g^{1:k-1}}[u(x^{k})-u(x^{*})]-\epsilon=\mathbb{E}_{g^{1:K}}[u(x^{k})-u(x^{*})]-\epsilon. Summing over k=1,,Kk=1,\ldots,K and dividing by KK we obtain equation 44, and thereby we have established the result. ∎

6.11 Proof of Theorem 4.1

We now provide the proof of theorem 4.1. The proof is a direct consequence of Theorem 1.1 of [35].

Proof.

Convergence of (H_{n,m}^{1})_{i,j}\xrightarrow{m\rightarrow\infty,\,n\,(\text{fixed})>N}(H_{n}^{3})_{i,j}: Let K\subset\Omega be compact. Assume that an N has been chosen large enough so as to ensure that u_{x_{i}}^{n}\in C^{1}(\bar{\Omega}) for all n>N. Fix such an n>N. Then, applying Theorem 1.1 of [35], we obtain that (u_{x_{i}}^{n})_{x_{j}}^{m}\rightarrow(u_{x_{i}}^{n})_{x_{j}} as m\rightarrow\infty, i.e., (H_{n,m}^{1})_{i,j}\xrightarrow{m\rightarrow\infty,\,n\,(\text{fixed})>N}(H_{n}^{3})_{i,j} locally uniformly on K. If u\in C^{2}_{c}(\Omega), the uniform convergence of (H_{n,m}^{1})_{i,j}\xrightarrow{m\rightarrow\infty,\,n\,(\text{fixed})>N}(H_{n}^{3})_{i,j} follows from the analogous result of Theorem 1.1 of [35].

Convergence of (Hn2)i,jn(H)i,j(H_{n}^{2})_{i,j}\xrightarrow{n\rightarrow\infty}(H)_{i,j}: The proof of this case is exactly like that of the previous case obtained by replacing uxinu_{x_{i}}^{n} with uxiu_{x_{i}}. With the analogous set-up as before, we have uxiC1(Ω¯)u_{x_{i}}\in C^{1}(\bar{\Omega}). Thus, applying Theorem 1.1 of [35], we conclude that (uxi)xjn(uxi)xj(u_{x_{i}})_{x_{j}}^{n}\rightarrow(u_{x_{i}})_{x_{j}} locally uniformly. If uCc2(Ω),u\in C^{2}_{c}(\Omega), the uniform convergence of (Hn2)i,j(H)i,j(H_{n}^{2})_{i,j}\rightarrow(H)_{i,j} follows from the analogous result of Theorem 1.1 of [35].

6.12 Proof of Theorem 4.2

We now provide the proof of theorem 4.2.

Proof.

Let vH01(Ω)v\in H^{1}_{0}(\Omega). We show that ((uxin)xj(uxi)xj,v)L20((u^{n}_{x_{i}})_{x_{j}}-(u_{x_{i}})_{x_{j}},v)_{L^{2}}\rightarrow 0 as nn\rightarrow\infty.

Now, integrating by parts we obtain,

((u^{n}_{x_{i}})_{x_{j}}-(u_{x_{i}})_{x_{j}},v)_{L^{2}}=-((u^{n}_{x_{i}})-(u_{x_{i}}),v_{x_{j}})_{L^{2}},

where the boundary term is zero due to vv coming from H01H^{1}_{0} and having zero trace. We can now estimate:

\begin{split}|((u^{n}_{x_{i}})_{x_{j}}-(u_{x_{i}})_{x_{j}},v)_{L^{2}}|&=|((u^{n}_{x_{i}})-(u_{x_{i}}),v_{x_{j}})_{L^{2}}|\\ &\leq\|u^{n}_{x_{i}}-u_{x_{i}}\|_{L^{2}}\|v_{x_{j}}\|_{L^{2}}\\ &\leq\|u^{n}_{x_{i}}-u_{x_{i}}\|_{L^{2}}\|v\|_{H^{1}_{0}},\\ \end{split} (48)

where the first inequality above is Cauchy-Schwarz, while the second follows from the definition of the H01H^{1}_{0} norm. Now, since uC2(Ω¯)u\in C^{2}(\bar{\Omega}), in particular, we have that uH2(Ω)u\in H^{2}(\Omega) and therefore uxiH1(Ω)u_{x_{i}}\in H^{1}(\Omega). By [35], Theorem 1.1, we have that uxinuxiL20\|u^{n}_{x_{i}}-u_{x_{i}}\|_{L^{2}}\rightarrow 0 as nn\rightarrow\infty for uxiH1(Ω)u_{x_{i}}\in H^{1}(\Omega). Therefore, given ϵ>0\epsilon>0, there is an N>0N>0 such that for all n>Nn>N, we have uxinuxiL2<ϵv\|u^{n}_{x_{i}}-u_{x_{i}}\|_{L^{2}}<\frac{\epsilon}{\|v\|}. Thus, |((uxin)xj(uxi)xj,v)L2|<ϵ|((u^{n}_{x_{i}})_{x_{j}}-(u_{x_{i}})_{x_{j}},v)_{L^{2}}|<\epsilon for n>Nn>N. Since ϵ\epsilon is arbitrary, we have proved ((uxin)xj(uxi)xj,v)L20((u^{n}_{x_{i}})_{x_{j}}-(u_{x_{i}})_{x_{j}},v)_{L^{2}}\rightarrow 0 as nn\rightarrow\infty.

The remainder of Theorem 4.2 is a reorganization of the statements of Theorem 4.1 along with what was just proved. ∎

6.13 Proof of Theorem 4.3

The proof follows that of Theorem 1.1 of [35]. We record here the following Lemma from [35] that we will need in our proof as well.

Lemma 6.6.

(Lemma 3.1 of [35]) Let

cni(x):=Ω(xiyi)2xy2ρn(xy)𝑑y.c_{n}^{i}(x):=\int_{\Omega}\frac{(x_{i}-y_{i})^{2}}{\|x-y\|^{2}}\rho_{n}(x-y)dy.

Then Dcni(x)1Dc_{n}^{i}(x)\rightarrow 1 pointwise as nn\rightarrow\infty, and the convergence is uniform on compact sets.
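
A quick Monte Carlo check of this lemma (with an illustrative Gaussian kernel family of width 1/n and \Omega=(-1,1)^{2}, neither of which is prescribed by the paper) is sketched below: for small n an appreciable amount of kernel mass falls outside \Omega, while Dc_{n}^{i}(x) approaches 1 as n grows.

```python
# Illustrative Monte Carlo check of Lemma 6.6 in D = 2 with a Gaussian kernel of
# width 1/n (an assumption) and Omega = (-1, 1)^2; x lies in the interior of Omega.
import numpy as np

rng = np.random.default_rng(1)
D = 2
x = np.array([0.3, -0.2])
for n in (1, 2, 4, 16, 64):
    y = x + rng.standard_normal((500_000, D)) / n        # samples from rho_n(x - .)
    inside = np.all(np.abs(y) < 1.0, axis=1)             # restrict the integral to Omega
    ratios = inside * (x[0] - y[:, 0]) ** 2 / np.sum((x - y) ** 2, axis=1)
    print(n, D * ratios.mean())                          # D * c_n^1(x), tends to 1
```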

We now provide the proof of theorem 4.3.

Proof.

For x\in K, with K\subset\Omega compact, we show that (u_{x_{i}}^{n})_{x_{j}}^{m}\rightarrow(u_{x_{i}})_{x_{j}} uniformly as m,n\rightarrow\infty. Let

Ji,jn,m=|DΩuxin(x)uxin(y)xyxjyjxyρm(xy)𝑑yDcjm(x)(uxi)xj|,J_{i,j}^{n,m}=|D\int_{\Omega}\frac{u_{x_{i}}^{n}(x)-u_{x_{i}}^{n}(y)}{\|x-y\|}\frac{x_{j}-y_{j}}{\|x-y\|}\rho_{m}(x-y)dy-Dc_{j}^{m}(x)(u_{x_{i}})_{x_{j}}|, (49)

where we have used the fact that (uxi)xj(u_{x_{i}})_{x_{j}} exists due to uC2(Ω¯)u\in C^{2}(\bar{\Omega}). Now, since Dcjm(x)1Dc_{j}^{m}(x)\rightarrow 1 uniformly as mm\rightarrow\infty by lemma 6.6, we have that Dcjm(x)(uxi)xj(uxi)xj,mDc_{j}^{m}(x)(u_{x_{i}})_{x_{j}}\rightarrow(u_{x_{i}})_{x_{j}},\,m\rightarrow\infty uniformly.

We show that there exists an N>0N>0 such that for n>Nn>N fixed, we have Ji,jn,m0J_{i,j}^{n,m}\rightarrow 0 uniformly for xKx\in K as mm\rightarrow\infty, which implies that (uxin)xjm(uxi)xj(u_{x_{i}}^{n})_{x_{j}}^{m}\rightarrow(u_{x_{i}})_{x_{j}} uniformly as mm\rightarrow\infty, i.e., (Hn,m1)i,jm,n(fixed)>N(H)i,j(H_{n,m}^{1})_{i,j}\xrightarrow{m\rightarrow\infty,\,n\,(\text{fixed})>N}(H)_{i,j}.

Now, we have by definition

cjm(x):=Ω(xjyj)2xy2ρm(xy)𝑑y,c_{j}^{m}(x):=\int_{\Omega}\frac{(x_{j}-y_{j})^{2}}{\|x-y\|^{2}}\rho_{m}(x-y)dy,

so that

Ji,jn,m=|D(Ωuxin(x)uxin(y)xyxjyjxyρm(xy)𝑑yΩ(xjyj)2xy2(uxi)xjρm(xy)𝑑y)|=|D(Ωuxin(x)uxin(y)(xjyj)(uxi)xjxyxjyjxyρm(xy)𝑑y)|.\begin{split}J_{i,j}^{n,m}&=|D(\int_{\Omega}\frac{u_{x_{i}}^{n}(x)-u_{x_{i}}^{n}(y)}{\|x-y\|}\frac{x_{j}-y_{j}}{\|x-y\|}\rho_{m}(x-y)dy-\int_{\Omega}\frac{(x_{j}-y_{j})^{2}}{\|x-y\|^{2}}(u_{x_{i}})_{x_{j}}\rho_{m}(x-y)dy)|\\ &=|D(\int_{\Omega}\frac{u_{x_{i}}^{n}(x)-u_{x_{i}}^{n}(y)-(x_{j}-y_{j})(u_{x_{i}})_{x_{j}}}{\|x-y\|}\frac{x_{j}-y_{j}}{\|x-y\|}\rho_{m}(x-y)dy)|.\\ \end{split} (50)

Now, by Theorem 1.1 of [35], we have that for xKx\in K, uxinuxiu_{x_{i}}^{n}\rightarrow u_{x_{i}} uniformly. Thus, given ϵ>0\epsilon>0, there exists an Nϵ>0N_{\epsilon}>0 independent of xKx\in K such that for all n>Nϵn>N_{\epsilon},

uxinuxiL(Ω)<ϵ,\|u_{x_{i}}^{n}-u_{x_{i}}\|_{L^{\infty}(\Omega)}<\epsilon,

or, for any x,yKx,y\in K, |uxin(x)uxi(x)|<ϵ|u_{x_{i}}^{n}(x)-u_{x_{i}}(x)|<\epsilon and |uxi(y)uxin(y)|<ϵ|u_{x_{i}}(y)-u_{x_{i}}^{n}(y)|<\epsilon, which upon combining together, result in

2ϵ+uxi(x)uxi(y)<uxin(x)uxin(y)<2ϵ+uxi(x)uxi(y).-2\epsilon+u_{x_{i}}(x)-u_{x_{i}}(y)<u_{x_{i}}^{n}(x)-u_{x_{i}}^{n}(y)<2\epsilon+u_{x_{i}}(x)-u_{x_{i}}(y). (51)

Putting 51 into 50, we obtain

Ji,jn,m|D(Ω2ϵ+uxi(x)uxi(y)(xjyj)(uxi)xjxyxjyjxyρm(xy)𝑑y)||D(Ω2ϵ(xjyj)xy2ρm(xy)𝑑y)|(1)+|D(Ωuxi(x)uxi(y)(xjyj)(uxi)xjxyxjyjxyρm(xy)𝑑y)|(2).\begin{split}J_{i,j}^{n,m}&\leq|D(\int_{\Omega}\frac{2\epsilon+u_{x_{i}}(x)-u_{x_{i}}(y)-(x_{j}-y_{j})(u_{x_{i}})_{x_{j}}}{\|x-y\|}\frac{x_{j}-y_{j}}{\|x-y\|}\rho_{m}(x-y)dy)|\\ &\leq\underbrace{|D(\int_{\Omega}\frac{2\epsilon(x_{j}-y_{j})}{\|x-y\|^{2}}\rho_{m}(x-y)dy)|}_{(1)}\\ &+\underbrace{|D(\int_{\Omega}\frac{u_{x_{i}}(x)-u_{x_{i}}(y)-(x_{j}-y_{j})(u_{x_{i}})_{x_{j}}}{\|x-y\|}\frac{x_{j}-y_{j}}{\|x-y\|}\rho_{m}(x-y)dy)|}_{(2)}.\end{split} (52)

Term (1)(1) in equation 52 can be estimated as:

|D(Ω2ϵ(xjyj)xy2ρm(xy)𝑑y)|2DϵΩρm(xy)xy𝑑y2DMϵ,|D(\int_{\Omega}\frac{2\epsilon(x_{j}-y_{j})}{\|x-y\|^{2}}\rho_{m}(x-y)dy)|\leq 2D\epsilon\int_{\Omega}\frac{\rho_{m}(x-y)}{\|x-y\|}dy\leq 2DM\epsilon, (53)

since, by the assumption on ρm\rho_{m}, we have Ωρm(xy)xy𝑑yDρm(xy)xy𝑑y<M(m)\int_{\Omega}\frac{\rho_{m}(x-y)}{\|x-y\|}dy\leq\int_{\mathbb{R}^{D}}\frac{\rho_{m}(x-y)}{\|x-y\|}dy<M(m). We now focus on term (2)(2) in equation 52.

Since by assumption u_{x_{i}}\in C^{1}(\bar{\Omega}), i.e., (u_{x_{i}})_{x_{j}}\in C^{0}(\bar{\Omega}), we can find, given an \epsilon_{2}>0, a \delta>0 small enough so that \|x-y\|<\delta implies |(u_{x_{i}})_{x_{j}}(x)-(u_{x_{i}})_{x_{j}}(y)|<\epsilon_{2}.

Assume an ϵ2\epsilon_{2} is given and the corresponding δ\delta has been found. Thus, we can write

|D(Ωuxi(x)uxi(y)(xjyj)(uxi)xjxyxjyjxyρm(xy)𝑑y)|=|D(Bδ(x)uxi(x)uxi(y)(xjyj)(uxi)xjxyxjyjxyρm(xy)dy+xy>δuxi(x)uxi(y)(xjyj)(uxi)xjxyxjyjxyρm(xy))||D(Bδ(x)uxi(x)uxi(y)(xjyj)(uxi)xjxyxjyjxyρm(xy))dy|+2Dδ(uxiL+uxiL)xy>δρm(xy)𝑑y.\begin{split}&|D(\int_{\Omega}\frac{u_{x_{i}}(x)-u_{x_{i}}(y)-(x_{j}-y_{j})(u_{x_{i}})_{x_{j}}}{\|x-y\|}\frac{x_{j}-y_{j}}{\|x-y\|}\rho_{m}(x-y)dy)|\\ &=|D(\int_{B_{\delta}(x)}\frac{u_{x_{i}}(x)-u_{x_{i}}(y)-(x_{j}-y_{j})(u_{x_{i}})_{x_{j}}}{\|x-y\|}\frac{x_{j}-y_{j}}{\|x-y\|}\rho_{m}(x-y)dy\\ &+\int_{\|x-y\|>\delta}\frac{u_{x_{i}}(x)-u_{x_{i}}(y)-(x_{j}-y_{j})(u_{x_{i}})_{x_{j}}}{\|x-y\|}\frac{x_{j}-y_{j}}{\|x-y\|}\rho_{m}(x-y))|\\ &\leq|D(\int_{B_{\delta}(x)}\frac{u_{x_{i}}(x)-u_{x_{i}}(y)-(x_{j}-y_{j})(u_{x_{i}})_{x_{j}}}{\|x-y\|}\frac{x_{j}-y_{j}}{\|x-y\|}\rho_{m}(x-y))dy|\\ &+\frac{2D}{\delta}(\|u_{x_{i}}\|_{L^{\infty}}+\|\nabla u_{x_{i}}\|_{L^{\infty}})\int_{\|x-y\|>\delta}\rho_{m}(x-y)dy.\\ \end{split} (54)

By the mean value theorem, yBδ(x),η(x,y)Bδ(x)\forall y\in B_{\delta}(x),\,\exists\eta(x,y)\in B_{\delta}(x) such that

uxi(x)uxi(y)=(uxi)xj(η(x,y))(xjyj).u_{x_{i}}(x)-u_{x_{i}}(y)=(u_{x_{i}})_{x_{j}}(\eta(x,y))\cdot(x_{j}-y_{j}).

Putting in this equality into equation 54, we obtain:

|D(Bδ(x)uxi(x)uxi(y)(xjyj)(uxi)xjxyxjyjxyρm(xy))dy||D(Bδ(x)(uxi)xj(η(x,y))(uxi)xjxy(xjyj)2xyρm(xy))dy||Dϵ2Bδ(x)ρm(xy)𝑑y|.\begin{split}&|D(\int_{B_{\delta}(x)}\frac{u_{x_{i}}(x)-u_{x_{i}}(y)-(x_{j}-y_{j})(u_{x_{i}})_{x_{j}}}{\|x-y\|}\frac{x_{j}-y_{j}}{\|x-y\|}\rho_{m}(x-y))dy|\\ &\leq|D(\int_{B_{\delta}(x)}\frac{(u_{x_{i}})_{x_{j}}(\eta(x,y))-(u_{x_{i}})_{x_{j}}}{\|x-y\|}\frac{(x_{j}-y_{j})^{2}}{\|x-y\|}\rho_{m}(x-y))dy|\\ &\leq|D\epsilon_{2}\int_{B_{\delta}(x)}\rho_{m}(x-y)dy|.\\ \end{split} (55)

By the assumption on the radial densities \rho_{m}, we have that \int_{B_{\delta}(x)}\rho_{m}(x-y)dy<1 and \int_{\|x-y\|>\delta}\rho_{m}(x-y)dy\rightarrow 0 as m\rightarrow\infty. Putting equations 52, 53, 54 and 55 together, we obtain \lim_{m\rightarrow\infty}\sup_{x\in K}J_{i,j}^{n,m}\leq 2D\epsilon\lim_{m\rightarrow\infty}M(m)+D\epsilon_{2} for n>N_{\epsilon}. As \epsilon and \epsilon_{2} were arbitrary, we conclude that for n>N_{\epsilon} the L^{\infty}(K) error \limsup_{m\rightarrow\infty}\|(H_{n,m}^{1})_{i,j}-H_{i,j}\|_{L^{\infty}(K)} is bounded by 2D\epsilon\lim_{m\rightarrow\infty}M(m)+D\epsilon_{2}, which can be made arbitrarily small.

If uCc2(Ω)u\in C_{c}^{2}(\Omega), and M(m)M(m) is independent of mm, we can conclude uniform convergence on Ω\Omega using an argument similar to that of Theorem 1.1 of [35] adapted to the current situation. Indeed, if uCc2(Ω)u\in C^{2}_{c}(\Omega) then supp(u)\text{supp}(u) is a proper closed subset of Ω\Omega. Thus, the distance d:=dist(Ω,supp(u))d:=dist(\partial\Omega,\text{supp}(u)) between Ω\partial\Omega and supp(u)\text{supp}(u) is strictly positive. If

Ωd2:={xΩ:dist({x},Ω)>d2},\Omega_{\frac{d}{2}}:=\{x\in\Omega:dist(\{x\},\partial\Omega)>\frac{d}{2}\},

then supxΩd2Ji,jn,m0,m\sup_{x\in\Omega_{\frac{d}{2}}}J^{n,m}_{i,j}\rightarrow 0,\,m\rightarrow\infty. For points xΩΩd2cx\in\Omega\cap\Omega_{\frac{d}{2}}^{c}, we have

Ji,jn,m2Dd(uL+uL)z>d2ρm(z)𝑑z0,m.J^{n,m}_{i,j}\leq\frac{2D}{d}(\|u\|_{L^{\infty}}+\|\nabla u\|_{L^{\infty}})\int_{\|z\|>\frac{d}{2}}\rho_{m}(z)dz\rightarrow 0,\,m\rightarrow\infty.

6.14 Proof of Theorem 4.4

Proof.

By Theorem 1.1 of [35], we have \nabla_{n}u\rightarrow\nabla u locally uniformly, and extending u by zero outside of \Omega, we have uniform convergence on all of \mathbb{R}^{D}. By Theorem 1.4 of [30], we have H_{n}u\rightarrow Hu uniformly on \mathbb{R}^{D}. By the assumption that (Hu(x))^{-1} is uniformly continuous on U, it follows that (H_{n}u(x))^{-1} converges uniformly to (Hu(x))^{-1} on U. Therefore, (H_{n}u)^{-1}\nabla_{n}u converges uniformly to (Hu)^{-1}\nabla u on U. We now prove the result by induction.

Base case k=0k=0:

We have:

{x1=x0β0(Hu0)1u0,x1,n=x0,nβ0(Hnu0,n)1nu0,n,\begin{cases}x^{1}&=x^{0}-\beta^{0}(Hu^{0})^{-1}\nabla u^{0},\\ x^{1,n}&=x^{0,n}-\beta^{0}(H_{n}u^{0,n})^{-1}\nabla_{n}u^{0,n},\\ \end{cases} (56)

so that

x1x1,n=β0(Hnu0,n)1nu0,n(Hu0)1u0.\|x^{1}-x^{1,n}\|=\beta^{0}\|(H_{n}u^{0,n})^{-1}\nabla_{n}u^{0,n}-(Hu^{0})^{-1}\nabla u^{0}\|.

Thus, for x^{0,n}=x^{0}\in U, given an \epsilon>0, there is an N>0 such that \|(H_{n}u^{0,n})^{-1}\nabla_{n}u^{0,n}-(Hu^{0})^{-1}\nabla u^{0}\|<\frac{\epsilon}{\beta^{0}} for all n>N. We can therefore conclude that \|x^{1}-x^{1,n}\|<\epsilon for n>N, and the base case is complete.

Induction case k=Kk=K:

Assume that the result is true for k=K1k=K-1, and we proceed to prove the induction step k=Kk=K. Let ϵ>0\epsilon>0 be given. Consider

xK+1xK+1,n=(xKxK,n)+βK((HnuK,n)1nuK,n(HuK)1uK).x^{K+1}-x^{K+1,n}=(x^{K}-x^{K,n})+\beta^{K}((H_{n}u^{K,n})^{-1}\nabla_{n}u^{K,n}-(Hu^{K})^{-1}\nabla u^{K}). (57)

We have

xK+1xK+1,nxKxK,n(1)+βK(HnuK,n)1nuK,n(HuK)1uK(2).\|x^{K+1}-x^{K+1,n}\|\leq\underbrace{\|x^{K}-x^{K,n}\|}_{(1)}+\beta^{K}\underbrace{\|(H_{n}u^{K,n})^{-1}\nabla_{n}u^{K,n}-(Hu^{K})^{-1}\nabla u^{K}\|}_{(2)}. (58)

By the induction hypothesis, there is an N1>0N_{1}>0 such that for n>N1n>N_{1} we have xKxK,n<ϵ2.\|x^{K}-x^{K,n}\|<\frac{\epsilon}{2}. We now consider term (2)(2) of equation 58. We write:

\begin{split}\|(H_{n}u^{K,n})^{-1}\nabla_{n}u^{K,n}-(Hu^{K})^{-1}\nabla u^{K}\|&\leq\underbrace{\|(H_{n}u^{K,n})^{-1}\nabla_{n}u^{K,n}-(Hu^{K,n})^{-1}\nabla u^{K,n}\|}_{(1)}\\ &+\underbrace{\|(Hu^{K,n})^{-1}\nabla u^{K,n}-(Hu^{K})^{-1}\nabla u^{K}\|}_{(2)}.\\ \end{split} (59)

Since (Hnu)1nu(Hu)1u(H_{n}u)^{-1}\nabla_{n}u\rightarrow(Hu)^{-1}\nabla u uniformly on UU, we have that there exists an N2>0N_{2}>0 such that (HnuK,n)1nuK,n(HuK,n)1uK,n<ϵ4βK\|(H_{n}u^{K,n})^{-1}\nabla_{n}u^{K,n}-(Hu^{K,n})^{-1}\nabla u^{K,n}\|<\frac{\epsilon}{4\beta^{K}} for n>N2n>N_{2}. Next, we use the assumption that f=(Hu)1uCc1(U)f=(Hu)^{-1}\nabla u\in C^{1}_{c}(U). This implies, by the mean value theorem that (HuK,n)1uK,n(HuK)1uKCxK,nxK\|(Hu^{K,n})^{-1}\nabla u^{K,n}-(Hu^{K})^{-1}\nabla u^{K}\|\leq C\|x^{K,n}-x^{K}\| where C=fL<.C=\|\nabla f\|_{L^{\infty}}<\infty. Therefore, there exists an N3>0N_{3}>0 such that for n>N3n>N_{3}, we have xK,nxK<ϵ4CβK\|x^{K,n}-x^{K}\|<\frac{\epsilon}{4C\beta^{K}}. Putting these estimates together into equation 59 and 58, we get xK+1xK+1,n<ϵ\|x^{K+1}-x^{K+1,n}\|<\epsilon for n>N=max{N1,N2,N3}n>N=\max\{N_{1},N_{2},N_{3}\}. The induction step is proved. ∎
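
As a closing illustration, the sketch below runs a one-dimensional analog of the update in equation 56 with \beta^{k}\equiv 1, approximating the nonlocal Hessian by a composition of nonlocal derivatives in the spirit of H^{1}_{n,m}. The box kernels, quadrature grids and the objective u(t)=e^{t}-2t (minimizer x^{*}=\log 2) are our own illustrative assumptions, not the paper's setup; the iterates approach x^{*} up to the quadrature and nonlocality error.

```python
# Illustrative sketch only: 1D nonlocal Newton iteration in the spirit of equation (56),
# with the Hessian replaced by a composed nonlocal derivative (as in H^1_{n,m}).
import numpy as np

def grad_n(f, x, n, h=2e-4):
    # 1D nonlocal gradient with a box kernel supported on (x - 1/n, x + 1/n).
    y = np.arange(x - 1.0 / n, x + 1.0 / n, h) + 0.5 * h
    return float(np.sum((f(x) - f(y)) * (x - y) / np.abs(x - y) ** 2 * (0.5 * n)) * h)

def hess_nn(f, x, n):
    # Composed nonlocal second derivative: nonlocal gradient of the nonlocal gradient.
    g = lambda t: np.array([grad_n(f, float(ti), n) for ti in np.atleast_1d(t)])
    return grad_n(g, x, n)

u = lambda t: np.exp(t) - 2.0 * t                         # minimizer at x* = log 2
x, beta, n = 2.0, 1.0, 64
for k in range(6):
    x = x - beta * grad_n(u, x, n) / hess_nn(u, x, n)     # 1D analog of equation (56)
    print(k, x, abs(x - np.log(2.0)))                     # approaches log 2 ~ 0.6931
```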