
Kalman-based Stochastic Gradient Method with Stop Condition and Insensitivity to Conditioning

Vivak Patel, University of Chicago, Department of Statistics (vp314@uchicago.edu). Questions, comments, or corrections to this document may be directed to that email address.
Abstract

Modern proximal and stochastic gradient descent (SGD) methods are believed to efficiently minimize large composite objective functions, but such methods have two algorithmic challenges: (1) a lack of fast or justified stop conditions, and (2) sensitivity to the objective function’s conditioning. In response to the first challenge, modern proximal and SGD methods guarantee convergence only after multiple epochs, but such a guarantee renders proximal and SGD methods infeasible when the number of component functions is very large or infinite. In response to the second challenge, second order SGD methods have been developed, but they are marred by the complexity of their analysis. In this work, we address these challenges on the limited, but important, linear regression problem by introducing and analyzing a second order proximal/SGD method based on Kalman Filtering (kSGD). Through our analysis, we show kSGD is asymptotically optimal, develop a fast algorithm for very large, infinite or streaming data sources with a justified stop condition, prove that kSGD is insensitive to the problem’s conditioning, and develop a unique approach for analyzing the complex second order dynamics. Our theoretical results are supported by numerical experiments on three regression problems (linear, nonparametric wavelet, and logistic) using three large publicly available datasets. Moreover, our analysis and experiments lay a foundation for embedding kSGD in multiple epoch algorithms, extending kSGD to other problem classes, and developing parallel and low memory kSGD implementations.

keywords:
Data Assimilation, Parameter Estimation, Kalman Filter, Proximal Methods, Stochastic Gradient Descent, Composite Objective Functions
AMS:
62L12, 62J05, 68Q32, 90C53, 93E35

1 Introduction

In the data sciences, proximal and stochastic gradient descent (SGD) methods are widely applied to minimize objective functions over a parameter $\beta\in\mathbb{R}^{n}$ of the form:

\sum_{k=1}^{N}f_{k}(\beta),\quad f_{i}:\mathbb{R}^{n}\to\mathbb{R},\quad\forall i=1,\ldots,N (1)

When the number of component functions, $N$, is large, proximal and SGD methods are believed to require less memory and fewer computing resources than classical methods (e.g., gradient descent, quasi-Newton) [5, 10]. However, the computational benefits of proximal and SGD methods are reduced by two algorithmic challenges: (1) the difficulty of developing computationally fast or justified stop conditions, and (2) the difficulty of overcoming sensitivity to the conditioning of the objective function [10, 35, 14, 23, 22, 42, 37, 8, 26].

In response to the challenge of developing computationally fast, justified stop conditions, many proximal and SGD methods guarantee convergence up to a user-specified probability by requiring multiple epochs — full passes over all $N$ component functions — to be completed, where the number of epochs increases as the user-specified probability increases [35, 14, 23, 22, 42, 37, 8, 26]. As a result, these proximal and SGD methods become inefficient when high communication costs are incurred for completing a single epoch, which is the case when $N$ is very large, streaming or infinite [5].

In response to the challenge of overcoming sensitivity to the conditioning of the objective function, second order SGD methods have been developed. Unfortunately, second order SGD methods are marred by the complexity of their analysis owing to the nonlinear dynamics of their second order approximations, the stochastic nature of the gradients, and the difficulty of analyzing tuning parameters [1, 36, 6, 8]. For example, in the method of [1], the second order estimate averages weighted new information with a previous second order estimate. However, initializing the second order estimate and tuning the weight parameter are difficult in practice [36, 6], and are not accounted for in the analysis [1]. In order to account for the tuning parameters, one successful second order SGD analysis strategy is an adapted classical Lyapunov approach [8, 26]. However, the Lyapunov approach falls short of demonstrating that these second order SGD methods are insensitive to the conditioning of the objective function. For example, the Lyapunov-based convergence theorem in [8] suggests that the convergence rate depends on the conditioning of the Hessian of the objective function times the conditioning of the BFGS estimates. Such a Lyapunov-based theoretical guarantee offers no improvement over what is achieved with the usual SGD [27, 7], and clashes with what we intuitively expect from a second order method.

In this paper, we progress towards addressing these gaps by generalizing proximal and SGD methods using the principles of the Kalman Filter [20]; we refer to this generalization as Kalman-based stochastic gradient descent (kSGD). We analyze kSGD’s properties on the class of linear regression problems, which we concede is a limited class, but is an inherently important class as discussed further in §2. Moreover, we use a probabilistic approach to analyze kSGD’s stochastic second order dynamics and tuning parameter choices, thereby avoiding the difficulties associated with the Lyapunov approach. As a consequence of our analysis,

  1. 1.

    We justify a computationally fast stop condition for the kSGD method, and demonstrate its effectiveness on several examples, thereby addressing the first algorithmic challenge faced by proximal and SGD methods.

  2. 2.

    We prove that the kSGD method is insensitive to the conditioning of the objective function, thereby addressing the second algorithmic challenge faced by proximal and SGD methods.

  3. 3.

    We show that the kSGD method asymptotically recovers the optimal stochastic gradient estimator for the linear regression problem.

  4. 4.

    We create a provably convergent, fast and simple algorithm for solving the linear regression problem, which is robust to the choice of tuning parameters and which can be applied to very large, streaming or infinite data sources.

  5. 5.

    We lay a foundation for embedding kSGD in multiple epoch algorithms, extending kSGD to other problem classes, and developing parallel and low memory kSGD implementations.

We organize the rest of the paper as follows. In §2, we formulate the linear regression problem, and discuss its importance. In §3, we construct an optimal proximal/SGD method from which we derive kSGD. In §4, we prove that kSGD approaches the optimal proximal/SGD method, we prove that kSGD is insensitive to the conditioning of the problem, we analyze the impact of tuning parameters on convergence, and we construct an adaptive tuning parameter strategy. In §5, we present a kSGD algorithm with a stop condition, and an adaptive tuning parameter selection algorithm. In §6, we compare the computational and memory complexities of kSGD with the usual SGD, and SQN [8]. In §7, we support the mathematical arguments herein using three numerical examples: a linear regression problem using a large data set provided by the Center for Medicare & Medicaid Services (CMS), a nonparametric wavelet regression problem using a large data set of gas sensor readings, and a logistic regression problem using a moderately sized data set of adult incomes. In §8, we summarize this work.

2 The Linear Regression Problem & Its Importance

First, the linear regression problem is formulated, and then its importance is discussed. In order to formulate the linear regression problem, the linear model must be specified:

Assumption 1.

Suppose that $(X,Y),(X_{1},Y_{1}),(X_{2},Y_{2}),\ldots\in\mathbb{R}^{n}\times\mathbb{R}$ are independent, identically distributed, and $\exists\beta^{*}\in\mathbb{R}^{n}$ such that:

Y_{i}=X_{i}^{T}\beta^{*}+\epsilon_{i}

where $\epsilon_{i}$ are independent, identically distributed mean zero random variables with variance $\sigma^{2}\in(0,\infty)$, and are independent of all $X_{i}$.

Remark 1.

The model does not assume a distribution for the errors, $\epsilon_{i}$; hence, the results presented will hold even if the model is misspecified, with a reinterpretation of $\sigma^{2}$ as the limiting mean squared residual. In addition, if the model has heteroscedasticity, the convergence of the kSGD parameter estimate to $\beta^{*}$ will still hold in the results below as long as the supremum over all variances is bounded.

Informally, the linear regression problem is the task of determining $\beta^{*}$ from the data $(X_{1},Y_{1}),(X_{2},Y_{2}),\ldots$. To formalize this, the linear regression problem can be restated as minimizing a loss function over the data, which is ideally (see [3])

d(X,Y,\beta)=\frac{1}{2}\left(Y-\beta^{T}X\right)^{2}

up to a positive multiplicative constant and conditioned on $X$. Because the ideal linear regression loss, $d(X,Y,\beta)$, is a function of the random variables $(X,Y)$, it is also a random variable. In order to simplify optimizing over the random variable $d(X,Y,\beta)$, the linear regression objective function is taken to be the expected value of $d(X,Y,\beta)$:

D(\beta):=\mathbb{E}\left[d(X,Y,\beta)\right]=D(\beta^{*})+\frac{1}{2}(\beta-\beta^{*})^{T}Q_{*}(\beta-\beta^{*}) (2)

where $Q_{*}=\mathbb{E}\left[XX^{T}\right]$.

Thus, the linear regression problem is the task of minimizing $D(\beta)$. However, on its own, the linear regression problem is ill-posed for several reasons. First, despite the simple form of the linear regression objective, it cannot be constructed because the distribution of $(X,Y)$ is rarely known. To account for this, the linear regression objective function is replaced with the approximation

\hat{D}(\beta)=\frac{1}{N}\sum_{i=1}^{N}d(X_{i},Y_{i},\beta) (3)

which is exactly of the form (1). Second, the linear regression objective's Hessian, $Q_{*}$, may not be well-specified, that is, $Q_{*}\not\prec\infty$. One way to ensure that $Q_{*}\prec\infty$ is to require

\lambda_{\max}(Q_{*})\leq\mathbf{tr}\left[Q_{*}\right]=\mathbb{E}\left[\mathbf{tr}\left[XX^{T}\right]\right]=\mathbb{E}\left[\left\|X\right\|_{2}^{2}\right]<\infty

This requirement is collected in the next assumption:

Assumption 2.

$X\in L^{2}$. That is, $\mathbb{E}\left[\left\|X\right\|_{2}^{2}\right]<\infty$.

For some results, Assumption 2 is strengthened by:

Assumption 3.

$X\in L^{\infty}$. That is, $\left\|X\right\|_{\infty}<\infty$ almost surely.

Third, from an optimization perspective, the minimizer of the linear regression objective must satisfy second order sufficiency conditions (Theorem 2.4 in [28]); that is, $Q_{*}$ must be positive definite. This can be ensured by the weaker requirement:

Assumption 4.

The linear span of the image of $X$ is $\mathbb{R}^{n}$. Specifically, for all unit vectors $v\in\mathbb{R}^{n}$, $\mathbb{P}\left[|X^{T}v|=0\right]<1$.

To see why, suppose there is a unit vector $v\in\mathbb{R}^{n}$ such that $0\geq v^{T}Q_{*}v$. Then $0\geq v^{T}Q_{*}v=\mathbb{E}\left[(X^{T}v)^{2}\right]$. Hence, $X^{T}v=0$ almost surely, which contradicts Assumption 4.

Remark 2.

We suspect that Assumption 4 can be weakened by allowing $X$ to be orthogonal to a subspace. As a result, $\beta^{*}$ would no longer be unique, but would be specified by a hyperplane. To account for this in the analysis, the error measure would need to be replaced by the minimum distance between the estimate and this hyperplane, and the estimation sequence would need to be considered on the lower dimensional manifold specified by the image space of $X$.

At first glance, the linear regression problem seems to be a limited problem class, and one that has already been sufficiently addressed; however, solving modern, very large linear regression problems is still a topic under active research (e.g. [13, 17]). One example of very large linear regression problems comes from recent physics studies in which background noise models are estimated from large simulated data sets using maximum likelihood estimation [9, 19], which can be recast as solving a sequence of very large linear regression problems [41]. Another example comes from genome-wide association studies in which multiple very large linear regression problems arise directly [38, 2]. For such linear regression problems, the high communication time of reading the entire data set renders multiple epoch algorithms impractical. One recourse is to apply the usual SGD; however, SGD will often stall before converging to a minimizer. Thus, kSGD becomes a viable alternative for solving such problems as it only requires a single pass through large data sets, and does not stall. These concepts are demonstrated on a linear regression problem on CMS medical claims payments in §7.1.

Moreover, the linear regression problem not only encompasses the usual linear model of Assumption 1, but also the normal means models, which include many nonparametric regression models (see [29], Ch. 7 in [40]). For example, the linear regression problem in (3) includes B-spline regression, for which $X_{i}$ is replaced by a vector-valued function evaluated at $X_{i}$ [31]. The linear regression problem's applicability to normal means models is demonstrated on a nonparametric wavelet regression problem on gas sensor reading data in §7.2.
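To make the normal means connection concrete, the following toy Python sketch (an illustration only; the polynomial basis is a stand-in for the B-spline or wavelet bases used in §7.2, and the degree is arbitrary) shows how a scalar covariate is mapped to a fixed basis vector so that the nonparametric problem takes the linear form of Assumption 1:

import numpy as np

def basis_features(t, degree=5):
    # Map a scalar covariate t to a fixed basis vector; a polynomial basis is
    # used here only as a stand-in for a B-spline or wavelet basis.
    return np.array([t**j for j in range(degree + 1)])

# Each observation (t_i, Y_i) becomes (X_i, Y_i) with X_i = basis_features(t_i),
# and the kSGD recursion of Section 3 is applied to X_i without modification.
x_i = basis_features(0.3)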

Finally, the linear regression problem analysis is an essential step in generalizing the kSGD theory to other problem classes. To be explicit, a common pattern in nonlinear programming is to model the objective function locally as a quadratic, and determine the next iterate by minimizing this local model [4, 28]. Indeed, for objective functions with an underlying statistical model, the local quadratic model can be formalized using the theory of local asymptotic normality (see Ch. 5 in [39], Ch. 6 in [24]); thus, understanding the behavior of kSGD in a quadratic model is an essential step in extending the analysis of kSGD to other problem classes. This principle is demonstrated on a logistic regression problem for modeling adult income categories in §7.3.

Now that we have established the linear regression problem, we construct an optimal stochastic gradient estimator from which we will derive kSGD.

3 An Optimal Estimator & kSGD

Let $\mathcal{F}_{k}=\sigma(X_{1},\ldots,X_{k})$, the $\sigma$-algebra of the random variables $X_{1},\ldots,X_{k}$, and consider the following general update scheme

\beta_{k+1}=\beta_{k}+G_{k+1}(Y_{k+1}-\beta_{k}^{T}X_{k+1}) (4)

where $G_{k+1}$ is a random variable in $\mathbb{R}^{n}$ and is measurable with respect to $\mathcal{F}_{k+1}$. Using Assumption 1, (4) can be rewritten as

\beta_{k+1}=\beta_{k}-G_{k+1}X_{k+1}^{T}(\beta_{k}-\beta^{*})+G_{k+1}\epsilon_{k+1}

We will choose an optimal $G_{k+1}$ in the sense that it minimizes the $l^{2}$ error between $\beta_{k+1}$ and $\beta^{*}$ given $\mathcal{F}_{k+1}$. Noting that $G_{k+1}$ is measurable with respect to $\mathcal{F}_{k+1}$, and using the independence, first moment and second moment properties of $\epsilon_{k}$:

\begin{aligned}
\mathbb{E}\left[\left.\left\|\beta_{k+1}-\beta^{*}\right\|^{2}\right|\mathcal{F}_{k+1}\right]
&=\mathbf{tr}\left[\mathbb{E}\left[\left.(\beta_{k}-\beta^{*})(\beta_{k}-\beta^{*})^{T}\right|\mathcal{F}_{k}\right]\right]\\
&\quad-\mathbf{tr}\left[G_{k+1}X_{k+1}^{T}\mathbb{E}\left[\left.(\beta_{k}-\beta^{*})(\beta_{k}-\beta^{*})^{T}\right|\mathcal{F}_{k}\right]\right]\\
&\quad-\mathbf{tr}\left[\mathbb{E}\left[\left.(\beta_{k}-\beta^{*})(\beta_{k}-\beta^{*})^{T}\right|\mathcal{F}_{k}\right]X_{k+1}G_{k+1}^{T}\right]\\
&\quad+\mathbf{tr}\left[G_{k+1}X_{k+1}^{T}\mathbb{E}\left[\left.(\beta_{k}-\beta^{*})(\beta_{k}-\beta^{*})^{T}\right|\mathcal{F}_{k}\right]X_{k+1}G_{k+1}^{T}\right]\\
&\quad+\sigma^{2}\mathbf{tr}\left[G_{k+1}G_{k+1}^{T}\right]
\end{aligned}

We now write $\mathcal{M}_{k}=\mathbb{E}\left[\left.(\beta_{k}-\beta^{*})(\beta_{k}-\beta^{*})^{T}\right|\mathcal{F}_{k}\right]$, which gives:

\begin{aligned}
\mathbb{E}\left[\left.\left\|\beta_{k+1}-\beta^{*}\right\|^{2}\right|\mathcal{F}_{k+1}\right]
&=\mathbf{tr}\left[\mathcal{M}_{k}\right]-\mathbf{tr}\left[G_{k+1}X_{k+1}^{T}\mathcal{M}_{k}\right]-\mathbf{tr}\left[\mathcal{M}_{k}X_{k+1}G_{k+1}^{T}\right]\\
&\quad+\mathbf{tr}\left[G_{k+1}X_{k+1}^{T}\mathcal{M}_{k}X_{k+1}G_{k+1}^{T}\right]+\sigma^{2}\mathbf{tr}\left[G_{k+1}G_{k+1}^{T}\right]
\end{aligned}

Differentiating with respect to $G_{k+1}$, setting the derivative to zero, and solving for $G_{k+1}$, we have that:

G_{k+1}=\frac{\mathcal{M}_{k}X_{k+1}}{\sigma^{2}+X_{k+1}^{T}\mathcal{M}_{k}X_{k+1}} (5)

Moreover, the derivation gives us an update scheme for $\mathcal{M}_{k}$ as well:

\begin{aligned}
\mathcal{M}_{k+1}&=\mathcal{M}_{k}-G_{k+1}X_{k+1}^{T}\mathcal{M}_{k}-\mathcal{M}_{k}X_{k+1}G_{k+1}^{T}+(\sigma^{2}+X_{k+1}^{T}\mathcal{M}_{k}X_{k+1})G_{k+1}G_{k+1}^{T}\\
&=\mathcal{M}_{k}-2\frac{\mathcal{M}_{k}X_{k+1}X_{k+1}^{T}\mathcal{M}_{k}}{\sigma^{2}+X_{k+1}^{T}\mathcal{M}_{k}X_{k+1}}+\frac{(\sigma^{2}+X_{k+1}^{T}\mathcal{M}_{k}X_{k+1})\mathcal{M}_{k}X_{k+1}X_{k+1}^{T}\mathcal{M}_{k}}{(\sigma^{2}+X_{k+1}^{T}\mathcal{M}_{k}X_{k+1})^{2}}
\end{aligned}

To summarize, we have the optimal stochastic gradient update scheme:

\beta_{k+1}=\beta_{k}+\frac{\mathcal{M}_{k}X_{k+1}}{\sigma^{2}+X_{k+1}^{T}\mathcal{M}_{k}X_{k+1}}(Y_{k+1}-\beta_{k}^{T}X_{k+1}) (6)
\mathcal{M}_{k+1}=\left(I-\frac{\mathcal{M}_{k}X_{k+1}X_{k+1}^{T}}{\sigma^{2}+X_{k+1}^{T}\mathcal{M}_{k}X_{k+1}}\right)\mathcal{M}_{k} (7)

Because $\sigma^{2}$ and $\mathcal{M}_{k}$ are not known, we take the kSGD method to be:

\beta_{k}=\beta_{k-1}+\frac{M_{k-1}X_{k}}{\gamma_{k}^{2}+X_{k}^{T}M_{k-1}X_{k}}\left(Y_{k}-\beta_{k-1}^{T}X_{k}\right) (8)
M_{k}=\left(I-\frac{M_{k-1}X_{k}X_{k}^{T}}{\gamma_{k}^{2}+X_{k}^{T}M_{k-1}X_{k}}\right)M_{k-1} (9)

where $\beta_{0}\in\mathbb{R}^{n}$ is arbitrary, and $M_{0}$ can be any positive definite matrix, but for simplicity, we will take it to be the identity. The sequence $\{\gamma_{k}^{2}\}$ replaces the unknown value $\sigma^{2}$, and its elements will be referred to as tuning parameters. We refer to $\beta_{k}$ as a parameter estimate, $\mathcal{M}_{k}$ as the true covariance of the parameter estimate, and $M_{k}$ as the estimated covariance of the parameter estimate.
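For concreteness, the following is a minimal NumPy sketch of the recursion (8)-(9); the synthetic data, the choice $M_{0}=I$, and the constant tuning parameter $\gamma_{k}^{2}=\sigma^{2}$ are illustrative assumptions, not part of the method's specification.

import numpy as np

def ksgd_step(beta, M, x, y, gamma_sq):
    # One kSGD update, following (8)-(9): x is the covariate X_k, y the
    # response Y_k, and gamma_sq the tuning parameter gamma_k^2.
    Mx = M @ x                                   # M_{k-1} X_k
    denom = gamma_sq + x @ Mx                    # gamma_k^2 + X_k^T M_{k-1} X_k
    residual = y - beta @ x                      # Y_k - beta_{k-1}^T X_k
    beta_new = beta + (Mx / denom) * residual    # update (8)
    M_new = M - np.outer(Mx, Mx) / denom         # update (9), using symmetry of M
    return beta_new, M_new

# Example: recover a synthetic beta* from streaming data.
rng = np.random.default_rng(0)
n, sigma = 5, 2.0
beta_star = rng.normal(size=n)
beta, M = np.zeros(n), np.eye(n)                 # beta_0 arbitrary, M_0 = I
for k in range(20000):
    x = rng.normal(size=n)
    y = x @ beta_star + sigma * rng.normal()
    beta, M = ksgd_step(beta, M, x, y, gamma_sq=sigma**2)
print(np.linalg.norm(beta - beta_star), np.trace(M))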

kSGD's derivation raises several natural questions about the impact of the $M_{k}$ and $\{\gamma_{k}^{2}\}$ substitutions on the behavior of kSGD:

  1. 1.

    Does the estimated covariance, $M_{k}$, approximate the true covariance, $\mathcal{M}_{k}$? Indeed, if $M_{k}$ approximates $\mathcal{M}_{k}$ then it is clear that kSGD approximates the optimal stochastic gradient estimator, and, because $\mathbf{tr}\left[\mathcal{M}_{k}\right]=\mathbb{E}\left[\left.\left\|\beta_{k}-\beta^{*}\right\|_{2}^{2}\right|\mathcal{F}_{k}\right]$, $M_{k}$ can be used as a measure of the error between $\beta_{k}$ and $\beta^{*}$. In §4.1, we prove that $M_{k}$ will bound $\mathcal{M}_{k}$ arbitrarily well from above and below in the limit up to a multiplicative constant depending on $\{\gamma_{k}^{2}\}$ (Theorem 3). Additionally, we show that $M_{k}\to 0$, thereby proving that $\mathbb{E}\left[\left.\left\|\beta_{k}-\beta^{*}\right\|_{2}^{2}\right|\mathcal{F}_{k}\right]\to 0$ (Corollary 4). Consequently, $M_{k}$ can be used as a stop condition (a short sketch follows this list); thus, kSGD addresses the first algorithmic challenge faced by proximal and SGD methods.

  2. 2.

    Given that the batch minimizer converges to $\beta^{*}$ at a rate $\mathcal{O}\left[\sigma^{2}n/k\right]$, and this convergence rate has no dependence on the conditioning of the problem (see [27], Ch. 5 in [39]), how does kSGD's convergence compare? In §4.2, we show that kSGD's convergence rate is comparable to the batch minimizer's convergence rate, and this rate has no dependence on the conditioning of the problem (Theorem 5, Corollary 6). Consequently, kSGD addresses the second algorithmic challenge faced by proximal and SGD methods.

  3. 3.

    Finally, what role do the tuning parameters, $\{\gamma_{k}^{2}\}$, play in the convergence? In §4.3, $\{\gamma_{k}^{2}\}$ are determined to play two roles: (1) $\{\gamma_{k}^{2}\}$ moderate how tightly $M_{k}$ will bound $\mathcal{M}_{k}$, and (2) $\{\gamma_{k}^{2}\}$ moderate how quickly $M_{k}\to 0$. Specifically, when $\gamma_{k}^{2}$ are within a few orders of magnitude of $\sigma^{2}$, the estimated covariance, $M_{k}$, will tightly bound the true covariance, $\mathcal{M}_{k}$, and when $\gamma_{k}^{2}$ are small, $M_{k}\to 0$ quickly. Unfortunately, if $\sigma^{2}$ is very large, these two tuning parameter roles conflict. This conflict is reconciled with an adaptive tuning parameter strategy motivated in Theorem 8, and constructed, in a sense, in §4.4.
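As a preview of the stop condition referenced in the first question above, the following sketch (assumptions: a constant tuning parameter, $M_{0}=I$, and an iterable stream of $(X_{k},Y_{k})$ pairs) stops the recursion (8)-(9) once $\mathbf{tr}\left[M_{k}\right]$ falls below a tolerance; by Theorem 3, $\mathbb{E}\left[\left.\left\|\beta_{k}-\beta^{*}\right\|_{2}^{2}\right|\mathcal{F}_{k}\right]$ is eventually bounded above by approximately $(\sigma^{2}/\delta^{2})\,\mathbf{tr}\left[M_{k}\right]$.

import numpy as np

def ksgd_until_tolerance(data_stream, n, gamma_sq, tol):
    # Run the kSGD recursion (8)-(9) over a (possibly infinite) stream of
    # (x, y) pairs and stop once tr(M_k) falls below tol.
    beta, M = np.zeros(n), np.eye(n)
    k = 0
    for x, y in data_stream:
        k += 1
        Mx = M @ x
        denom = gamma_sq + x @ Mx
        beta = beta + (Mx / denom) * (y - beta @ x)
        M = M - np.outer(Mx, Mx) / denom
        if np.trace(M) <= tol:       # justified stop condition
            break
    return beta, M, k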

4 Analysis

In the analysis below, we use the following conventions and notation. Recalling that $\mathcal{F}_{k}=\sigma(X_{1},\ldots,X_{k})$, we consider two types of error in our analysis:

e_{k}=\mathbb{E}\left[\beta_{k}|\mathcal{F}_{k}\right]-\beta^{*} (10)

and

E_{k}=\beta_{k}-\beta^{*} (11)

which are related by $e_{k}=\mathbb{E}\left[\left.E_{k}\right|\mathcal{F}_{k}\right]$.

4.1 Convergence of the Estimated Covariance and Estimated Parameter

Informally, the main result of this section, Theorem 3, states that $M_{k}$ bounds $\mathcal{M}_{k}$ from above and below arbitrarily well in the limit. To establish Theorem 3, we will need two basic calculations collected in Lemmas 1 and 2. In the calculations below, we recall that $M_{0}$ is taken to be the identity for simplicity.

Lemma 1.

For $j=1,\ldots,k+1$, if $0<\gamma_{j}^{2}<\infty$ then $M_{k+1}$ is a symmetric, positive definite matrix and

M_{k+1}^{-1}=M_{k}^{-1}+\frac{1}{\gamma_{k+1}^{2}}X_{k+1}X_{k+1}^{T}
Proof.

By the Sherman-Morrison-Woodbury matrix identity:

M_{1}=I-\frac{X_{1}X_{1}^{T}}{\gamma_{1}^{2}+X_{1}^{T}X_{1}}=\left(I+\frac{1}{\gamma_{1}^{2}}X_{1}X_{1}^{T}\right)^{-1}

Hence, $M_{1}^{-1}=I+\frac{1}{\gamma_{1}^{2}}X_{1}X_{1}^{T}$. So $M_{1}$ is symmetric and positive definite. Suppose this is true up to some $k$. By induction and using the Sherman-Morrison-Woodbury matrix identity, we conclude our result:

M_{k+1}=M_{k}-\frac{M_{k}X_{k+1}X_{k+1}^{T}M_{k}}{\gamma_{k+1}^{2}+X_{k+1}^{T}M_{k}X_{k+1}}=\left(M_{k}^{-1}+\frac{1}{\gamma_{k+1}^{2}}X_{k+1}X_{k+1}^{T}\right)^{-1}
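A quick numerical check of the identity used in Lemma 1 (illustrative only; the matrix playing the role of $M_{k}$, the vector playing the role of $X_{k+1}$, and the value of $\gamma_{k+1}^{2}$ are generated arbitrarily):

import numpy as np

rng = np.random.default_rng(1)
n, gamma_sq = 4, 0.5
A = rng.normal(size=(n, n))
M = A @ A.T + np.eye(n)           # an arbitrary symmetric positive definite M_k
x = rng.normal(size=n)            # plays the role of X_{k+1}
Mx = M @ x
lhs = M - np.outer(Mx, Mx) / (gamma_sq + x @ Mx)            # update (9)
rhs = np.linalg.inv(np.linalg.inv(M) + np.outer(x, x) / gamma_sq)
print(np.allclose(lhs, rhs))      # True: M_{k+1}^{-1} = M_k^{-1} + X X^T / gamma^2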

Lemma 2.

For $j=1,\ldots,k+1$, if $0<\gamma_{j}^{2}<\infty$ then

M_{k+1}^{-1}E_{k+1}=M_{k}^{-1}E_{k}+X_{k+1}\frac{\epsilon_{k+1}}{\gamma_{k+1}^{2}}\quad\text{ and }\quad M_{k+1}^{-1}E_{k+1}=E_{0}+\sum_{j=1}^{k+1}X_{j}\frac{\epsilon_{j}}{\gamma_{j}^{2}}

and, recalling $\mathcal{M}_{k}=\mathbb{E}\left[\left.E_{k}E_{k}^{T}\right|\mathcal{F}_{k}\right]$ where $\mathcal{F}_{k}=\sigma(X_{1},\ldots,X_{k})$,

\mathcal{M}_{k+1}=M_{k+1}E_{0}E_{0}^{T}M_{k+1}+M_{k+1}\left(\sum_{j=1}^{k+1}\frac{\sigma^{2}}{\gamma_{j}^{2}}\frac{1}{\gamma_{j}^{2}}X_{j}X_{j}^{T}\right)M_{k+1}
Proof.

Using (8) and Assumption 1:

E_{k+1}=\left(I-\frac{M_{k}X_{k+1}X_{k+1}^{T}}{\gamma_{k+1}^{2}+X_{k+1}^{T}M_{k}X_{k+1}}\right)E_{k}+M_{k}X_{k+1}\frac{\epsilon_{k+1}}{\gamma_{k+1}^{2}+X_{k+1}^{T}M_{k}X_{k+1}} (12)

Now, premultiplying $E_{k}$ by $M_{k}M_{k}^{-1}$ and using (9):

\begin{aligned}
E_{k+1}&=M_{k+1}M_{k}^{-1}E_{k}+M_{k}X_{k+1}\frac{\epsilon_{k+1}}{\gamma_{k+1}^{2}+X_{k+1}^{T}M_{k}X_{k+1}}\\
&=M_{k+1}\left(M_{k}^{-1}E_{k}+M_{k+1}^{-1}M_{k}X_{k+1}\frac{\epsilon_{k+1}}{\gamma_{k+1}^{2}+X_{k+1}^{T}M_{k}X_{k+1}}\right)
\end{aligned}

Applying the recurrence relation in Lemma 1:

\begin{aligned}
E_{k+1}&=M_{k+1}\left(M_{k}^{-1}E_{k}+X_{k+1}\frac{\epsilon_{k+1}}{\gamma_{k+1}^{2}+X_{k+1}^{T}M_{k}X_{k+1}}+X_{k+1}\frac{\epsilon_{k+1}}{\gamma_{k+1}^{2}}\frac{X_{k+1}^{T}M_{k}X_{k+1}}{\gamma_{k+1}^{2}+X_{k+1}^{T}M_{k}X_{k+1}}\right)\\
&=M_{k+1}\left(M_{k}^{-1}E_{k}+X_{k+1}\frac{\epsilon_{k+1}}{\gamma_{k+1}^{2}}\right)
\end{aligned}

Using the recursion, we have that $E_{k+1}=M_{k+1}\left(E_{0}+\sum_{j=1}^{k+1}X_{j}\frac{\epsilon_{j}}{\gamma_{j}^{2}}\right)$. Thus, calculating $E_{k+1}E_{k+1}^{T}$, taking its conditional expectation with respect to $\mathcal{F}_{k+1}=\sigma(X_{1},\ldots,X_{k+1})$, and recalling that $\{\epsilon_{j}\}$ are independent of $\{X_{j}\}$ by Assumption 1, we establish

\mathcal{M}_{k+1}=M_{k+1}E_{0}E_{0}^{T}M_{k+1}+M_{k+1}\left(\sum_{j=1}^{k+1}\frac{\sigma^{2}}{\gamma_{j}^{2}}\frac{1}{\gamma_{j}^{2}}X_{j}X_{j}^{T}\right)M_{k+1}

Lemmas 1 and 2 suggest a natural condition on the tuning parameters in order to ensure that these calculations hold for all $k$; namely, we require for some $\delta^{2}$ and $\Delta^{2}$,

0<\delta^{2}\leq\inf_{k}\gamma_{k}^{2}\leq\sup_{k}\gamma_{k}^{2}\leq\Delta^{2}<\infty (13)

Using this condition and Lemmas 1 and 2, we can also bound $\mathcal{M}_{k}$ by $M_{k}$ for all $k\in\mathbb{N}$:

\begin{aligned}
M_{k+1}E_{0}E_{0}^{T}M_{k+1}+\frac{\sigma^{2}}{\Delta^{2}}M_{k+1}\left(\sum_{j=1}^{k+1}\frac{1}{\gamma_{j}^{2}}X_{j}X_{j}^{T}\right)M_{k+1}&\preceq\mathcal{M}_{k+1}\\
&\preceq M_{k+1}E_{0}E_{0}^{T}M_{k+1}+\frac{\sigma^{2}}{\delta^{2}}M_{k+1}\left(\sum_{j=1}^{k+1}\frac{1}{\gamma_{j}^{2}}X_{j}X_{j}^{T}\right)M_{k+1}
\end{aligned}

Using Lemma 1,

\begin{aligned}
M_{k+1}E_{0}E_{0}^{T}M_{k+1}+\frac{\sigma^{2}}{\Delta^{2}}M_{k+1}\left(M_{k+1}^{-1}-I\right)M_{k+1}&\preceq\mathcal{M}_{k+1}\\
&\preceq M_{k+1}E_{0}E_{0}^{T}M_{k+1}+\frac{\sigma^{2}}{\delta^{2}}M_{k+1}\left(M_{k+1}^{-1}-I\right)M_{k+1}
\end{aligned}

Using $0\preceq M_{k+1}E_{0}E_{0}^{T}M_{k+1}\preceq M_{k+1}^{2}\left\|E_{0}\right\|_{2}^{2}$, we have

\frac{\sigma^{2}}{\Delta^{2}}M_{k+1}-\frac{\sigma^{2}}{\Delta^{2}}M_{k+1}^{2}\preceq\mathcal{M}_{k+1}\preceq\left\|E_{0}\right\|_{2}^{2}M_{k+1}^{2}+\frac{\sigma^{2}}{\delta^{2}}M_{k+1} (14)

Because the true covariance, $\mathcal{M}_{k}$, is a measure of the error between the estimated parameter, $\beta_{k}$, and the true parameter, $\beta^{*}$, we want $\mathcal{M}_{k}\to 0$, which inequality (14) suggests occurs if $M_{k}\to 0$. The next theorem formalizes this claim.

Theorem 3.

If Assumptions 1, 2, 4 hold, and the tuning parameters satisfy (13), then $M_{k}\to 0$ almost surely. Moreover, for all $\epsilon>0$, almost surely there exists a $K\in\mathbb{N}$ such that for $k\geq K$

\frac{1-\epsilon}{\Delta^{2}}M_{k}\preceq\frac{1}{\sigma^{2}}\mathcal{M}_{k}\preceq\frac{1+\epsilon}{\delta^{2}}M_{k}
Remark 3.

There is a conventional difference between "almost surely there exists a $K$" and "there exists a $K$ almost surely." The first statement implies that for each outcome, $\omega$, on a probability one set there is a $K(\omega)$. The latter statement implies $\exists K$ for all $\omega$. Thus, "almost surely there exists a $K$" is a weaker result than "there exists a $K$ almost surely", but it is stronger than convergence in probability.

Proof.

To show that $M_{k}\to 0$, it is equivalent to prove that $\lambda_{\max}(M_{k})$, the maximum eigenvalue of $M_{k}$, goes to 0. This is equivalent to showing that the minimum eigenvalue of $M_{k}^{-1}$, $\lambda_{\min}(M_{k}^{-1})=\lambda_{\max}(M_{k})^{-1}$, diverges to infinity. Moreover, by Lemma 1 and the Courant-Fischer Principle (Ch. 4 in [11]), $\lambda_{\min}(M_{k}^{-1})$ is a non-decreasing sequence. Hence, it is sufficient to show that a subsequence diverges to infinity, which we will define using the following stochastic process. Define the stochastic process $\{S_{k}:k+1\in\mathbb{N}\}$ by $S_{0}=0$ and

S_{k}=\min\left\{m>S_{k-1}:\mathrm{span}[X_{S_{k-1}+1},\ldots,X_{m}]=\mathbb{R}^{n}\right\} (15)

We will now show that the sequence $\{\lambda_{\min}(M_{S_{k}}^{-1})\}$ diverges. By Lemma 1,

M_{S_{k+1}}^{-1}=M_{S_{k}}^{-1}+\sum_{s=S_{k}+1}^{S_{k+1}}\frac{1}{\gamma_{s}^{2}}X_{s}X_{s}^{T}\succeq M_{S_{k}}^{-1}+\frac{1}{\Delta^{2}}\sum_{s=S_{k}+1}^{S_{k+1}}X_{s}X_{s}^{T}\succeq I+\frac{1}{\Delta^{2}}\sum_{j=0}^{k}\mathcal{X}_{j}

where

\mathcal{X}_{k}=\sum_{j=S_{k}+1}^{S_{k+1}}X_{j}X_{j}^{T}\quad\forall k+1\in\mathbb{N} (16)

By the Courant-Fischer Principle,

\lambda_{\min}(M_{S_{k+1}}^{-1})\geq 1+\frac{1}{\Delta^{2}}\sum_{j=0}^{k}\lambda_{\min}(\mathcal{X}_{j}) (17)

Thus, we are left with showing that $\sum_{j=0}^{\infty}\lambda_{\min}(\mathcal{X}_{j})$ diverges almost surely. To this end, we will show that $\lambda_{\min}(\mathcal{X}_{j})$ will be greater than some $\alpha>0$ infinitely often. As a result, the sum must diverge to infinity. To show this, we will use a standard Borel-Cantelli argument (see Section 2.3 in [12]). Consider the events $A_{j}=\{\lambda_{\min}(\mathcal{X}_{j})\geq\alpha\}$ where the choice of $\alpha$ comes from Lemma 14, in which $\inf_{j}\mathbb{E}\left[\lambda_{\min}(\mathcal{X}_{j})\right]\geq\alpha>0$. For such an $\alpha$, $\mathbb{P}\left[A_{j}\right]>0$, else we would have a contradiction. By Lemma 13, we have that the $A_{j}$ are independent and that $\mathbb{P}\left[A_{j}\right]=\mathbb{P}\left[A_{0}\right]>0$ for all $j\in\mathbb{N}$. Thus, by the (Second) Borel-Cantelli Lemma (Theorem 2.3.5, [12]),

\sum_{j=0}^{\infty}\mathbb{P}\left[A_{j}\right]=\infty

Hence, $\lambda_{\min}(\mathcal{X}_{j})\geq\alpha$ infinitely often, and thus $\lambda_{\min}(M_{S_{k+1}}^{-1})\to\infty$ as $k\to\infty$ almost surely. Now, let $\epsilon>0$; then almost surely there is a $K\in\mathbb{N}$ such that if $k\geq K$,

\lambda_{\max}(M_{k})\leq\min\left\{\epsilon,\frac{\sigma^{2}\epsilon}{\left\|E_{0}\right\|^{2}\delta^{2}}\right\}

Applying this to inequality (14), we can conclude the result.   ∎

As a simple corollary to Theorem 3, we prove that $E_{k}$ and $e_{k}$ converge to 0 in some sense.

Corollary 4.

If Assumptions 1, 2, and 4 hold, and the tuning parameters satisfy (13), then

  1. 1.

    $\mathbb{E}\left[\left.\left\|E_{k}\right\|_{2}\right|\mathcal{F}_{k}\right]\to 0$ as $k\to\infty$ almost surely.

  2. 2.

    $\left\|e_{k}\right\|_{2}\to 0$ as $k\to\infty$ almost surely.

Proof.

Note, by the Cauchy-Schwarz Inequality:

\mathbb{E}\left[\left.\left\|E_{k}\right\|_{2}\right|\mathcal{F}_{k}\right]^{2}\leq\mathbb{E}\left[\left.\left\|E_{k}\right\|_{2}^{2}\right|\mathcal{F}_{k}\right]=\mathbf{tr}\left[\mathcal{M}_{k}\right]

By (14) and Theorem 3, $\mathbf{tr}\left[\mathcal{M}_{k}\right]\leq\left\|E_{0}\right\|_{2}^{2}\mathbf{tr}\left[M_{k}^{2}\right]+\frac{\sigma^{2}}{\delta^{2}}\mathbf{tr}\left[M_{k}\right]\to 0$ as $k\to\infty$ almost surely. By Jensen's Inequality, $\left\|e_{k}\right\|_{2}\leq\mathbb{E}\left[\left.\left\|E_{k}\right\|_{2}\right|\mathcal{F}_{k}\right]$ almost surely; thus, $\left\|e_{k}\right\|_{2}\to 0$ almost surely.   ∎

4.2 Insensitivity to Conditioning

The main result of this section, Theorem 5, states that $M_{k}$ approximates a scaling of the inverse Hessian. As a consequence, Corollary 6 states that kSGD's convergence rate does not depend on the conditioning of the Hessian, thereby addressing the second algorithmic challenge faced by proximal and SGD methods. Moreover, Corollary 6 states that kSGD becomes arbitrarily close to the batch convergence rate in the limit (see [27], Ch. 5 in [39]).

Theorem 5.

If Assumptions 1, 3, and 4 hold, and $\gamma_{k}^{2}=\gamma^{2}\in(0,\infty)$ for all $k\in\mathbb{N}$, then for $\epsilon,\upsilon\in(0,1)$ the event

\mathcal{E}_{k,\epsilon}=\left\{\frac{\gamma^{2}(1-\epsilon)}{k}Q_{*}^{-1}\preceq M_{k}\preceq\frac{\gamma^{2}(1+\epsilon)}{k}Q_{*}^{-1}\right\}

has probability at least $1-\mathcal{O}\left[k^{-3+\upsilon}\right]$.

Proof.

Let $\{\epsilon_{k}\}$ be a non-negative sequence. Define $\mathcal{Y}_{k}=X_{k}X_{k}^{T}-Q_{*}$ and $\bar{\mathcal{Y}}_{k}=\frac{1}{k}\sum_{j=1}^{k}\mathcal{Y}_{j}$, and define the event

\begin{aligned}
\mathcal{B}_{k}&=\left\{M_{k}^{-1}\preceq\frac{k}{\gamma^{2}}Q_{*}+I(1-\epsilon_{k}k)\right\}\cup\left\{\frac{k}{\gamma^{2}}Q_{*}+I(1+\epsilon_{k}k)\preceq M_{k}^{-1}\right\}\\
&=\left\{\bar{\mathcal{Y}}_{k}\preceq-\epsilon_{k}\gamma^{2}I\right\}\cup\left\{\epsilon_{k}\gamma^{2}I\preceq\bar{\mathcal{Y}}_{k}\right\}\qquad\text{(use Lemma 1)}\\
&=\bigcap_{\left\|u\right\|_{2}=1}\left\{\left|u^{T}\bar{\mathcal{Y}}_{k}u\right|\geq\epsilon_{k}\gamma^{2}\right\}\\
&\subseteq\left\{\left|v^{T}\bar{\mathcal{Y}}_{k}v\right|>\epsilon_{k}\gamma^{2}\right\}
\end{aligned}

for an arbitrary unit vector $v\in\mathbb{R}^{n}$. Note that, by construction, the $v^{T}\mathcal{Y}_{j}v$ are independent, mean zero random variables. By Lemma 15, $-nC^{2}\leq v^{T}\mathcal{Y}_{j}v\leq nC^{2}$ almost surely, where $C=\left\|X\right\|_{\infty}$. Therefore, by Markov's Inequality and Lemma 16,

\mathbb{P}\left[\mathcal{B}_{k}\right]\leq\mathbb{P}\left[\left|v^{T}\bar{\mathcal{Y}}_{k}v\right|>\epsilon_{k}\gamma^{2}\right]=\mathcal{O}\left[\frac{n^{6}C^{12}}{\epsilon_{k}^{6}\gamma^{12}k^{3}}\right] (18)

We now turn to relating $\mathcal{E}_{k,\epsilon}$ to $\mathcal{B}_{k}$. Let $\tilde{M}_{k}=Q_{*}^{1/2}M_{k}Q_{*}^{1/2}$. Then,

\begin{aligned}
\mathcal{B}_{k}&=\left\{M_{k}^{-1}\preceq\frac{k}{\gamma^{2}}Q_{*}+I(1-\epsilon_{k}k)\right\}\cup\left\{\frac{k}{\gamma^{2}}Q_{*}+I(1+\epsilon_{k}k)\preceq M_{k}^{-1}\right\}\\
&=\left\{M_{k}\preceq\frac{\gamma^{2}}{k}\left(Q_{*}+I\left[\frac{\gamma^{2}}{k}+\gamma^{2}\epsilon_{k}\right]\right)^{-1}\right\}\cup\left\{\frac{\gamma^{2}}{k}\left(Q_{*}+I\left[\frac{\gamma^{2}}{k}-\gamma^{2}\epsilon_{k}\right]\right)^{-1}\preceq M_{k}\right\}\\
&=\left\{\tilde{M}_{k}\preceq\frac{\gamma^{2}}{k}\left(I+\gamma^{2}Q_{*}^{-1}\left[\frac{1}{k}+\epsilon_{k}\right]\right)^{-1}\right\}\cup\left\{\frac{\gamma^{2}}{k}\left(I+\gamma^{2}Q_{*}^{-1}\left[\frac{1}{k}-\epsilon_{k}\right]\right)^{-1}\preceq\tilde{M}_{k}\right\}\\
&\supseteq\left\{\tilde{M}_{k}\preceq\frac{\gamma^{2}}{k}\frac{\lambda_{\min}(Q_{*})}{\lambda_{\min}(Q_{*})+\gamma^{2}/k+\gamma^{2}\epsilon_{k}}I\right\}\cup\left\{\frac{\gamma^{2}}{k}\frac{\lambda_{\max}(Q_{*})}{\lambda_{\max}(Q_{*})+\gamma^{2}/k-\gamma^{2}\epsilon_{k}}I\preceq\tilde{M}_{k}\right\}
\end{aligned}

Now take $\epsilon_{k}=\frac{nC^{2}}{\gamma^{2}k^{\upsilon}}$ where $\upsilon\in(0,1)$. Then, for every $\epsilon>0$ there is a $K\in\mathbb{N}$ such that for $k\geq K$

1-\epsilon\leq\frac{\lambda_{\min}(Q_{*})}{\lambda_{\min}(Q_{*})+\gamma^{2}/k+\gamma^{2}\epsilon_{k}}\leq\frac{\lambda_{\max}(Q_{*})}{\lambda_{\max}(Q_{*})+\gamma^{2}/k-\gamma^{2}\epsilon_{k}}\leq 1+\epsilon

Therefore,

\mathcal{B}_{k}\supseteq\left\{\tilde{M}_{k}\preceq\frac{\gamma^{2}(1-\epsilon)}{k}I\right\}\cup\left\{\frac{\gamma^{2}(1+\epsilon)}{k}I\preceq\tilde{M}_{k}\right\}=\mathcal{E}_{k,\epsilon}^{C}

Using (18), $\mathbb{P}\left[\mathcal{E}_{k,\epsilon}\right]\geq 1-\mathcal{O}\left[k^{-3+\upsilon}\right]$.   ∎

Remark 4.

Given Assumption 3, it is likely that with a bit more work the probability of $1-\mathcal{O}\left[k^{-3+\upsilon}\right]$ can be extended to $1-\mathcal{O}\left[k^{-A+\upsilon}\right]$ for arbitrary $A\in\mathbb{N}_{>3}$. Also, we can extend this result to a convergence in expectation by using the fact that $0\preceq M_{k}\preceq I$.
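The following Monte Carlo sketch illustrates Theorem 5 numerically (the diagonal $Q_{*}$, the dimensions, and the sample size are illustrative assumptions): even when $Q_{*}$ is poorly conditioned, $(k/\gamma^{2})M_{k}$ approaches $Q_{*}^{-1}$ and the parameter error remains small.

import numpy as np

rng = np.random.default_rng(2)
n, gamma_sq, sigma = 4, 1.0, 1.0
scales = np.array([1.0, 10.0, 100.0, 1000.0])    # condition number of Q_* is 10^6
Q_star = np.diag(scales**2)                      # Q_* = E[X X^T]
beta_star = rng.normal(size=n)

beta, M = np.zeros(n), np.eye(n)
k_max = 50000
for k in range(k_max):
    x = scales * rng.normal(size=n)
    y = x @ beta_star + sigma * rng.normal()
    Mx = M @ x
    denom = gamma_sq + x @ Mx
    beta = beta + (Mx / denom) * (y - beta @ x)
    M = M - np.outer(Mx, Mx) / denom

print(np.max(np.abs(k_max / gamma_sq * M - np.linalg.inv(Q_star))))  # near zero
print(np.linalg.norm(beta - beta_star))          # small despite the conditioning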

Corollary 6.

If Assumptions 1, 3, and 4 hold, and $\gamma_{k}^{2}=\gamma^{2}\in(0,\infty)$ for all $k\in\mathbb{N}$, then for any $\epsilon,\upsilon\in(0,1)$ the event

\frac{n\sigma^{2}(1-\epsilon)}{k}\leq\mathbb{E}\left[\left.D(\beta_{k})\right|\mathcal{F}_{k}\right]-D(\beta^{*})\leq\frac{n\sigma^{2}(1+\epsilon)}{k}

occurs with probability at least $1-\mathcal{O}\left[k^{-3+\upsilon}\right]$.

Proof.

Let $\epsilon^{\prime}=\epsilon/4$. On the event $\mathcal{E}_{k,\epsilon^{\prime}}$, we have a uniform bound on the rate at which $M_{k}\to 0$. Thus, on the event $\mathcal{E}_{k,\epsilon^{\prime}}$, we can strengthen Theorem 3 to: $\exists K\in\mathbb{N}$ almost surely such that for any $k\geq K$

\frac{1-\epsilon^{\prime}}{\gamma^{2}}M_{k}\preceq\frac{1}{\sigma^{2}}\mathcal{M}_{k}\preceq\frac{1+\epsilon^{\prime}}{\gamma^{2}}M_{k}

Combining this with Theorem 5, on the event $\mathcal{E}_{k,\epsilon^{\prime}}$

\frac{\gamma^{2}(1-\epsilon^{\prime})^{2}}{k}Q_{*}^{-1}\preceq(1-\epsilon^{\prime})M_{k}\preceq\frac{\gamma^{2}}{\sigma^{2}}\mathcal{M}_{k}\preceq(1+\epsilon^{\prime})M_{k}\preceq\frac{\gamma^{2}(1+\epsilon^{\prime})^{2}}{k}Q_{*}^{-1}

Noting that $1-\epsilon\leq(1-\epsilon^{\prime})^{2}\leq(1+\epsilon^{\prime})^{2}\leq 1+\epsilon$, the event

\frac{\sigma^{2}(1-\epsilon)}{k}I\preceq Q_{*}^{1/2}\mathcal{M}_{k}Q_{*}^{1/2}\preceq\frac{\sigma^{2}(1+\epsilon)}{k}I

contains $\mathcal{E}_{k,\epsilon^{\prime}}$. Now, using (2),

\begin{aligned}
\mathbb{E}\left[\left.D(\beta_{k})\right|\mathcal{F}_{k}\right]-D(\beta^{*})&=\mathbb{E}\left[\left.(\beta_{k}-\beta^{*})^{T}Q_{*}(\beta_{k}-\beta^{*})\right|\mathcal{F}_{k}\right]\\
&=\mathbf{tr}\left[Q_{*}^{1/2}\mathbb{E}\left[\left.(\beta_{k}-\beta^{*})(\beta_{k}-\beta^{*})^{T}\right|\mathcal{F}_{k}\right]Q_{*}^{1/2}\right]\\
&=\mathbf{tr}\left[Q_{*}^{1/2}\mathcal{M}_{k}Q_{*}^{1/2}\right]
\end{aligned}

Therefore, on a set containing $\mathcal{E}_{k,\epsilon^{\prime}}$,

\frac{n\sigma^{2}(1-\epsilon)}{k}\leq\mathbb{E}\left[\left.D(\beta_{k})\right|\mathcal{F}_{k}\right]-D(\beta^{*})\leq\frac{n\sigma^{2}(1+\epsilon)}{k}

Now that kSGD has been shown to address the two algorithmic challenges faced by proximal and SGD methods, we turn our attention to carefully analyzing the effect that the tuning parameters have on convergence.

4.3 Conditions on Tuning Parameters

The tuning parameter condition, (13), raises two natural questions:

  1. 1.

    Is the tuning parameter condition, (13), the necessary condition on tuning parameters in order to guarantee convergence?

  2. 2.

    Given the wide range of possible tuning parameters, is there an optimal strategy for choosing the tuning parameters?

As will be shown in Theorem 7, the first question can be answered negatively. For example, Theorem 7 suggests that if the method converges, the tuning parameters could have been $\gamma_{k}^{2}=k^{p}$ for $p\in(0,1]$, which is an example not covered by condition (13).

Theorem 7.

Suppose $\{\gamma_{k}^{2}\}$ are selected a priori. If Assumptions 1 and 4 hold, $\mathbb{E}\left[\left\|X\right\|_{2}^{4}\right]<\infty$, and for any $e_{0}$ we have that $e_{k}\to 0$, then $\sum_{k=1}^{\infty}\gamma_{k}^{-2}$ diverges almost surely.

Proof.

Using (12), premultiplying $E_{k}$ by $M_{k}M_{k}^{-1}$, and taking the conditional expectation with respect to $\mathcal{F}_{k+1}$:

e_{k+1}=M_{k+1}M_{k}^{-1}e_{k}

Repeatedly applying the recursion, $e_{k+1}=M_{k+1}e_{0}$. Thus, $e_{k+1}\to 0$ for any $e_{0}$ is equivalent to $\lambda_{\max}(M_{k})\to 0$. Note:

\lambda_{\max}(M_{k})=\frac{1}{\lambda_{\min}(M_{k}^{-1})}\geq\frac{1}{\mathbf{tr}\left[M_{k}^{-1}\right]}

We will prove that if $\sum_{j=1}^{\infty}\gamma_{j}^{-2}<\infty$ then the supremum of the trace of $M_{k}^{-1}$ is finite, and therefore $\lambda_{\max}(M_{k})$ is bounded away from zero, which gives a contradiction. That is, we will show, using Lemma 1, that the supremum over all $k$ of

\mathbf{tr}\left[M_{k}^{-1}\right]=\mathbf{tr}\left[I+\sum_{j=1}^{k}\frac{1}{\gamma_{j}^{2}}X_{j}X_{j}^{T}\right]=n+\sum_{j=1}^{k}\frac{\left\|X_{j}\right\|_{2}^{2}}{\gamma_{j}^{2}}

is almost surely finite. The main tool used is Kolmogorov's Three Series Theorem (Theorem 2.5.4 in [12]). Suppose that $\sum_{j=1}^{\infty}\gamma_{j}^{-2}<\infty$, and let $A>0$. By Markov's Inequality,

\begin{aligned}
\sum_{j=1}^{\infty}\mathbb{P}\left[\left\|X_{j}\right\|_{2}^{2}>A\gamma_{j}^{2}\right]&\leq\sum_{j=1}^{\infty}\frac{\mathbb{E}\left[\left\|X_{j}\right\|_{2}^{2}\right]}{\gamma_{j}^{2}A}<\infty\\
\sum_{j=1}^{\infty}\frac{\mathbb{E}\left[\left\|X_{j}\right\|_{2}^{2}\textbf{1}\left[\left\|X_{j}\right\|_{2}^{2}\leq A\gamma_{j}^{2}\right]\right]}{\gamma_{j}^{2}}&\leq\sum_{j=1}^{\infty}\frac{\mathbb{E}\left[\left\|X_{j}\right\|_{2}^{2}\right]}{\gamma_{j}^{2}}<\infty\\
\sum_{j=1}^{\infty}\frac{\mathrm{var}\left(\left\|X_{j}\right\|_{2}^{2}\textbf{1}\left[\left\|X_{j}\right\|_{2}^{2}\leq A\gamma_{j}^{2}\right]\right)}{\gamma_{j}^{4}}&\leq\sum_{j=1}^{\infty}\frac{\mathbb{E}\left[\left\|X_{j}\right\|_{2}^{4}\right]}{\gamma_{j}^{4}}<\infty
\end{aligned}

Thus, by Kolmogorov's Three Series Theorem, $\sup_{k}\mathbf{tr}\left[M_{k}^{-1}\right]<\infty$. Hence, $e_{k}\not\to 0$ and we have a contradiction.   ∎

Remark 5.

Using a condition such as "$\sum_{k=1}^{\infty}\gamma_{k}^{-2}$ diverges" or "the sum over any subsequence diverges" in place of (13) seems appealing. Although $\mathcal{M}_{k}$ can be upper bounded by $M_{k}$ using such a condition and Lemma 2, the main theoretical difficulty occurs in proving that $\sum_{k=1}^{\infty}\mathbb{P}\left[A_{k}\right]=\infty$ in the proof of Theorem 3.

The second question can be understood by examining the two roles the tuning parameters have in Theorem 3. First, the tuning parameters determine how quickly $M_{k}^{-1}$ diverges: if the tuning parameters take on very small values, then, in light of (17), $\lambda_{\min}(M_{k}^{-1})$ will diverge quickly. Therefore, the tuning parameters should be selected to be a fixed small value when the goal is to converge quickly. Second, the tuning parameter bounds, $\delta^{2}$ and $\Delta^{2}$, determine how tightly $M_{k}$ bounds $\mathcal{M}_{k}$: if $\delta^{2}$ and $\Delta^{2}$ are close to $\sigma^{2}$, then $M_{k}$ will tightly bound $\mathcal{M}_{k}$ from below and above. Therefore, from an algorithmic perspective, if the tuning parameter bounds are close to $\sigma^{2}$, then $M_{k}$ will be a better stop condition.

In the case when $\sigma^{2}$ is of moderate size, the tuning parameters can be selected to be small, thereby ensuring both fast convergence and that $M_{k}$ is a strong stop condition. On the other hand, if $\sigma^{2}$ is large, the tuning parameters cannot simultaneously deliver fast convergence and ensure that $M_{k}$ is a strong stop condition. These opposing tuning parameter goals can be reconciled with the following result.

Theorem 8.

Suppose Assumptions 1, 2, and 4 hold, and $\sigma^{2}\in(\delta^{2},\Delta^{2})$ for some $\Delta^{2}\geq\delta^{2}>0$. If there exists a sequence of tuning parameters satisfying (13) and satisfying, for any $\epsilon\in(0,1)$, that almost surely $\exists K\in\mathbb{N}$ such that for $k\geq K$, $|\gamma_{k}^{2}-\sigma^{2}|<\epsilon\sigma^{2}$, then almost surely there exists a $K^{\prime}\in\mathbb{N}$ such that for $k\geq K^{\prime}$

\mathcal{M}_{k}\preceq\frac{1+\epsilon}{1-\epsilon}M_{k}
Proof.

Let $\epsilon\in(0,1)$. Then, by assumption, almost surely there exists a $K\in\mathbb{N}$ for which $\gamma_{k}^{2}\geq\sigma^{2}(1-\epsilon)$ whenever $k\geq K$. From Lemma 2, with $k\geq K$:

\begin{aligned}
\mathcal{M}_{k}&=M_{k}E_{0}E_{0}^{T}M_{k}+M_{k}\left(\sum_{j=1}^{k}\frac{\sigma^{2}}{\gamma_{j}^{2}}\frac{1}{\gamma_{j}^{2}}X_{j}X_{j}^{T}\right)M_{k}\\
&=M_{k}E_{0}E_{0}^{T}M_{k}+M_{k}\left(\sum_{j=1}^{K-1}\left(\frac{\sigma^{2}}{\gamma_{j}^{2}}-1\right)\frac{1}{\gamma_{j}^{2}}X_{j}X_{j}^{T}\right)M_{k}\\
&\quad+M_{k}\left(\sum_{j=1}^{K-1}\frac{1}{\gamma_{j}^{2}}X_{j}X_{j}^{T}+\sum_{j=K}^{k}\frac{\sigma^{2}}{\gamma_{j}^{2}}\frac{1}{\gamma_{j}^{2}}X_{j}X_{j}^{T}\right)M_{k}\\
&\preceq M_{k}^{2}\left\|E_{0}\right\|_{2}^{2}+\left(\frac{\sigma^{2}}{\delta^{2}}-1\right)M_{k}M_{K-1}^{-1}M_{k}\qquad(\sigma^{2}\geq\delta^{2}\text{ by assumption})\\
&\quad+M_{k}\left(\sum_{j=1}^{K-1}\frac{1}{\gamma_{j}^{2}}X_{j}X_{j}^{T}+\frac{1}{1-\epsilon}\sum_{j=K}^{k}\frac{1}{\gamma_{j}^{2}}X_{j}X_{j}^{T}\right)M_{k}\\
&\preceq M_{k}^{2}\left\|E_{0}\right\|_{2}^{2}+\left(\frac{\sigma^{2}}{\delta^{2}}-1\right)M_{k}M_{K-1}^{-1}M_{k}+\frac{1}{1-\epsilon}M_{k}
\end{aligned}

By Theorem 3, $M_{k}\to 0$ almost surely. Thus, almost surely there is a $K^{\prime}\in\mathbb{N}$, which we can take larger than $K$, such that if $k\geq K^{\prime}$ then

M_{k}^{2}\left\|E_{0}\right\|_{2}^{2}+\left(\frac{\sigma^{2}}{\delta^{2}}-1\right)M_{k}M_{K-1}^{-1}M_{k}\preceq\frac{\epsilon}{1-\epsilon}M_{k}

Combining the inequalities gives the result.   ∎

The construction of such a tuning parameter strategy is quite difficult because a sequence which almost surely converges to $\sigma^{2}$ will never be known a priori. In practice, the construction of $\gamma_{k}^{2}$ will have to depend on the data $(X_{1},Y_{1}),(X_{2},Y_{2}),\ldots$; therefore, a data dependent tuning parameter strategy will introduce measurability issues in Theorem 8. Thus, we only use Theorem 8 as motivation for constructing a tuning parameter strategy which estimates $\sigma^{2}$. The construction of such tuning parameters is the content of the next subsection.

4.4 Adaptive Choice of Tuning Parameters

We will define a sequence of tuning parameters $\{\gamma_{k}^{2}\}$ which satisfy condition (13) and incrementally estimate $\sigma^{2}$ when $\sigma^{2}\in(\delta^{2},\Delta^{2})$. This choice of tuning parameters will be determined by a two-step process:

  1. 1.

    By defining a sequence of unbounded estimators $\{\xi_{k}^{2}\}$ which converge to $\sigma^{2}$ in some sense.

  2. 2.

    Then, by defining $\gamma_{k}^{2}=\phi_{\tau}(\xi_{k}^{2})$ for a $\tau\in\mathbb{N}$ where

    \phi_{\tau}(x)=\tau\textbf{1}\left[x>\tau\right]+\tau^{-1}\textbf{1}\left[x<\tau^{-1}\right]+x\textbf{1}\left[\tau^{-1}\leq x\leq\tau\right] (19)

    to ensure that condition (13) is satisfied.

Remark 6.

The choice of $\phi_{\tau}$ imposes $\delta^{2}=\tau^{-1}$ and $\Delta^{2}=\tau$ in (13). The use of a single parameter, $\tau$, for the bounds is strictly a matter of convenience. Using different lower and upper bounds is completely satisfactory from a theoretical perspective and can be a strategy to negotiate the trade-offs in choosing the tuning parameters. We state this case in §5.

We will use the following general form for the estimators $\{\xi_{k}^{2}\}$:

\xi_{1}^{2}=r_{1}^{2}\quad\text{and}\quad\xi_{k+1}^{2}=\frac{1}{k+1}f_{k+1}r_{k+1}^{2}+\left(1-\frac{1}{k+1}\right)\xi_{k}^{2} (20)

where the residuals, $r_{k}$, satisfy $r_{k}=Y_{k}-X_{k}^{T}\beta_{k-1}$, and the hyperparameters, $f_{k}$, are non-negative. One natural choice for the hyperparameters is $f_{k}=1$ for all $k$. Indeed, such a choice has the following nice theoretical guarantee, which is a consequence of Proposition 10 below.
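A minimal sketch of this adaptive strategy (with the simple choice $f_{k}=1$, and with $\tau$, $\sigma$, and the synthetic data as illustrative assumptions) combines the estimator (20), the clipping map $\phi_{\tau}$ of (19), and the recursion (8)-(9):

import numpy as np

def phi_tau(x, tau):
    # The clipping map (19): project x onto the interval [1/tau, tau].
    return float(np.clip(x, 1.0 / tau, tau))

rng = np.random.default_rng(3)
n, sigma, tau = 5, 3.0, 100.0
beta_star = rng.normal(size=n)
beta, M, xi_sq = np.zeros(n), np.eye(n), None
for k in range(1, 20001):
    x = rng.normal(size=n)
    y = x @ beta_star + sigma * rng.normal()
    r = y - beta @ x                             # residual r_k = Y_k - X_k^T beta_{k-1}
    xi_sq = r * r if k == 1 else (r * r) / k + (1 - 1 / k) * xi_sq   # (20) with f_k = 1
    gamma_sq = phi_tau(xi_sq, tau)               # gamma_k^2 = phi_tau(xi_k^2)
    Mx = M @ x
    denom = gamma_sq + x @ Mx
    beta = beta + (Mx / denom) * r               # update (8)
    M = M - np.outer(Mx, Mx) / denom             # update (9)
print(xi_sq, sigma**2)                           # xi_k^2 should be close to sigma^2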

Corollary 9.

If Assumptions 1, 2, and 4 hold and $f_{k}=1$ for all $k$, then $\lim_{k\to\infty}\mathbb{E}\left[\left.\xi_{k}^{2}\right|\mathcal{F}_{k}\right]=\sigma^{2}$ almost surely. Moreover, almost surely:

\phi_{\tau}(\sigma^{2})\leq\liminf_{k\to\infty}\mathbb{E}\left[\left.\gamma_{k}^{2}\right|\mathcal{F}_{k}\right]\leq\limsup_{k\to\infty}\mathbb{E}\left[\left.\gamma_{k}^{2}\right|\mathcal{F}_{k}\right]\leq\phi_{\tau}(\sigma^{2}+\tau^{-1})

Although this result calls into question why the $f_{k}$ are even considered in (20), such a reading is misleading: if the initial residuals, $r_{1}^{2},r_{2}^{2},\ldots$, deviate from $\sigma^{2}$ significantly, then a large number of cases must be assimilated in order for the estimator $\xi_{k}^{2}$ to converge to $\sigma^{2}$. A common strategy to overcome this is to shrink the initial residuals — that is, to introduce forgetting factors — which converge to unity. That is, we consider a positive sequence $f_{k}\leq 1$ such that $f_{k}\to 1$. However, because $f_{k}$ can be designed, this procedure begs the question of how and when $f_{k}$ should approach $1$. Indeed, the only guidance which can be given on this choice is to make $f_{k}$ depend on $M_{k}$: once $M_{k}$ is sufficiently small, we have an assurance that the $r_{k}^{2}$ should be faithful estimators of $\sigma^{2}$, and so $f_{k}$ should be near $1$ in this regime. Since the $M_{k}$ are measurable with respect to $\mathcal{F}_{k}$, a more flexible condition is to allow $f_{k}$ to be measurable with respect to $\mathcal{F}_{k}$ as well.

Remark 7.

Despite restricting $f_{k}$ to be $\mathcal{F}_{k}$ measurable, there are still many strategies for choosing $f_{k}$. For example, $f_{k}$ can be set using a hard or soft thresholding strategy depending on $\mathbf{tr}\left[M_{k}\right]$. Regardless, the strategy for choosing $f_{k}$ should depend on the problem, and an uninformed choice is ill-advised unless a sufficient amount of data is being processed.
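For instance, a hard-thresholding choice of $f_{k}$ as described in Remark 7 could look like the following sketch (the threshold and the small default value are illustrative assumptions, not prescriptions):

import numpy as np

def forgetting_factor(M, threshold=1e-2, small=0.1):
    # Downweight residuals while tr(M_k) is still large (the estimate may be far
    # from beta*); once tr(M_k) falls below the threshold, use f_k = 1. Since
    # tr(M_k) -> 0 almost surely (Theorem 3), f_k -> 1 almost surely, consistent
    # with the requirements collected in Assumption 5 below.
    return 1.0 if np.trace(M) < threshold else small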

An additional strategy is to delay the index at which $\xi_{1}^{2}$ is calculated. To be specific, we can define a stopping time $V$ with respect to $\mathcal{F}_{k}$ such that $V$ is finite almost surely, and let $r_{k}=Y_{V+k}-X_{V+k}^{T}\beta_{V+k-1}$. Indeed, the process will then only be started once a specific condition has been met, such as $\mathbf{tr}\left[M_{k}\right]$ falling below a threshold. This offers more flexibility than manipulating $f_{k}$ alone. These considerations on $f_{k}$ and $V$ are collected in the following assumption.

Assumption 5.

$V$ is a stopping time with respect to $\mathcal{F}_{k}=\sigma(X_{1},\ldots,X_{k})$ such that $V$ is almost surely finite with stopped $\sigma$-algebra $\mathcal{F}_{V}$ (p. 156 in [12]). Also, take $f_{k}$ to satisfy:

  1. 1.

    $f_{k}\in[0,1]$ and $f_{k}\to 1$ as $k\to\infty$ almost surely.

  2. 2.

    $f_{k}$ are $\mathcal{F}_{V+k}$ measurable.

and let $r_{k}=Y_{V+k}-X_{V+k}^{T}\beta_{V+k-1}$, $\xi_{k}^{2}$ be defined as in (20), and $\gamma_{V+k}^{2}=\phi_{\tau}(\xi_{k}^{2})$.

We are now ready to show that a tuning parameter strategy using Assumption 5 does indeed approximate $\sigma^{2}$ in the limit.

Proposition 10.

If Assumptions 1, 2, 4, and 5 hold then

\lim_{k\to\infty}\mathbb{E}\left[\left.\xi_{k}^{2}\right|\mathcal{F}_{V+k}\right]=\sigma^{2}

almost surely. Moreover, almost surely:

\phi_{\tau}(\sigma^{2})\leq\liminf_{k\to\infty}\mathbb{E}\left[\left.\gamma_{V+k}^{2}\right|\mathcal{F}_{V+k}\right]\leq\limsup_{k\to\infty}\mathbb{E}\left[\left.\gamma_{V+k}^{2}\right|\mathcal{F}_{V+k}\right]\leq\phi_{\tau}(\sigma^{2}+\tau^{-1})
Proof.

Applying (20) repeatedly,

\begin{aligned}
\xi_{k+1}^{2}&=\sum_{l=1}^{\infty}\xi_{k+1}^{2}\textbf{1}\left[V=l\right]\\
&=\frac{1}{k+1}\sum_{l=1}^{\infty}\sum_{j=1}^{k+1}f_{j}\left(Y_{l+j}-X_{l+j}^{T}\beta_{l+j-1}\right)^{2}\textbf{1}\left[V=l\right]\\
&=\frac{1}{k+1}\sum_{j=1}^{k+1}f_{j}\sum_{l=1}^{\infty}\left(\epsilon_{l+j}+X_{l+j}^{T}(\beta^{*}-\beta_{l+j-1})\right)^{2}\textbf{1}\left[V=l\right]\qquad\text{(Assumption 1)}
\end{aligned}

Taking the conditional expectation, using the Conditional Monotone Convergence Theorem (Theorem 5.1.2 in [12]), and using the facts that $f_{j}$ and $\textbf{1}\left[V=l\right]$ are measurable with respect to $\mathcal{F}_{V+k+1}$ by construction:

\begin{aligned}
&\mathbb{E}\left[\left.\xi_{k+1}^{2}\right|\mathcal{F}_{V+k+1}\right]\\
&\quad=\frac{1}{k+1}\sum_{j=1}^{k+1}f_{j}\sum_{l=1}^{\infty}\mathbb{E}\left[\left.\left(\epsilon_{l+j}+X_{l+j}^{T}(\beta^{*}-\beta_{l+j-1})\right)^{2}\textbf{1}\left[V=l\right]\right|\mathcal{F}_{V+k+1}\right]\\
&\quad=\frac{1}{k+1}\sum_{j=1}^{k+1}f_{j}\sum_{l=1}^{\infty}\Big(\textbf{1}\left[V=l\right]\mathbb{E}\left[\left.\epsilon_{l+j}^{2}\right|\mathcal{F}_{V+k+1}\right]+\textbf{1}\left[V=l\right]X_{l+j}^{T}\mathcal{M}_{l+j-1}X_{l+j}\\
&\quad\qquad+2\,\textbf{1}\left[V=l\right]\mathbb{E}\left[\left.\epsilon_{l+j}X_{l+j}^{T}(\beta^{*}-\beta_{l+j-1})\right|\mathcal{F}_{V+k+1}\right]\Big)\\
&\quad=\frac{1}{k+1}\sum_{j=1}^{k+1}f_{j}\sum_{l=1}^{\infty}\left(\textbf{1}\left[V=l\right]\sigma^{2}+\textbf{1}\left[V=l\right]X_{l+j}^{T}\mathcal{M}_{l+j-1}X_{l+j}\right)\\
&\quad=\frac{1}{k+1}\sum_{j=1}^{k+1}\left(f_{j}\sigma^{2}+f_{j}X_{V+j}^{T}\mathcal{M}_{V+j-1}X_{V+j}\right)
\end{aligned}

where we use the facts that (1) $\mathbb{E}\left[\left.(\beta^{*}-\beta_{l+j-1})(\beta^{*}-\beta_{l+j-1})^{T}\right|\mathcal{F}_{V+k+1}\right]=\mathcal{M}_{l+j-1}$ since $\mathcal{F}_{V+j-1}\subset\mathcal{F}_{V+k-1}$ and we are restricting to the event $\{V=l\}$, and (2) the $\epsilon_{j}$ are independent of $\mathcal{F}_{k}=\sigma(X_{1},\ldots,X_{k})$; thus, the $\epsilon_{j}$ are independent of $\mathcal{F}_{V+k}$. Using this equality, we will establish a lower and upper bound on $\mathbb{E}\left[\left.\xi_{j}^{2}\right|\mathcal{F}_{V+j}\right]$.

Lower Bound. Since $\mathcal{M}_{V+j}\succeq 0$, we have $\mathbb{E}\left[\left.\xi_{k+1}^{2}\right|\mathcal{F}_{V+k+1}\right]\geq\frac{\sigma^{2}}{k+1}\sum_{j=1}^{k+1}f_{j}$. Let $\epsilon>0$. Because $V$ is almost surely finite and $f_{k}\to 1$ almost surely, almost surely there is a $K\in\mathbb{N}$ such that if $k\geq K$ then $f_{k}\geq(1-\epsilon)$. Therefore,

\liminf_{k\to\infty}\mathbb{E}\left[\left.\xi_{k+1}^{2}\right|\mathcal{F}_{V+k+1}\right]\geq\sigma^{2}(1-\epsilon)

almost surely.

Upper Bound. For an upper bound, using the same ϵ>0\epsilon>0 as for the lower bound, because VV is almost surely finite and by Theorem 3, almost surely there is a KK\in\mathbb{N} such that for kKk\geq K

V+kτσ2(1+ϵ)MV+kandλmax(MV+k)<ϵτσ2(1+ϵ)𝔼[X12]\mathcal{M}_{V+k}\preceq\tau\sigma^{2}(1+\epsilon)M_{V+k}\quad\text{and}\quad\lambda_{\max}(M_{V+k})<\frac{\epsilon}{\tau\sigma^{2}(1+\epsilon)\mathbb{E}\left[\left\|X_{1}\right\|^{2}\right]}

Since fk1f_{k}\leq 1:

𝔼[ξk2|V+k]\displaystyle\mathbb{E}\left[\left.\xi_{k}^{2}\right|\mathcal{F}_{V+k}\right]
σ2+1kj=K+1kXV+jTV+j1XV+j+1kj=1KXV+jTV+j1XV+j\displaystyle\leq\sigma^{2}+\frac{1}{k}\sum_{j={K+1}}^{k}X_{V+j}^{T}\mathcal{M}_{V+j-1}X_{V+j}+\frac{1}{k}\sum_{j=1}^{K}X_{V+j}^{T}\mathcal{M}_{V+j-1}X_{V+j}
σ2+τσ2(1+ϵ)kj=K+1kXV+jTMV+j1XV+j+1kj=1KXV+jTV+j1XV+j\displaystyle\leq\sigma^{2}+\frac{\tau\sigma^{2}(1+\epsilon)}{k}\sum_{j=K+1}^{k}X_{V+j}^{T}M_{V+j-1}X_{V+j}+\frac{1}{k}\sum_{j=1}^{K}X_{V+j}^{T}\mathcal{M}_{V+j-1}X_{V+j}
σ2+ϵ𝔼[X12]1kj=K+1kXV+j2+1kj=1KXV+jTV+j1XV+j\displaystyle\leq\sigma^{2}+\frac{\epsilon}{\mathbb{E}\left[\left\|X_{1}\right\|^{2}\right]}\frac{1}{k}\sum_{j=K+1}^{k}\left\|X_{V+j}\right\|^{2}+\frac{1}{k}\sum_{j=1}^{K}X_{V+j}^{T}\mathcal{M}_{V+j-1}X_{V+j}

As kk\to\infty, the third term will vanish and the second term will converge to ϵ\epsilon almost surely by the Strong Law of Large Numbers. Therefore, almost surely

σ2(1ϵ)lim infk𝔼[ξk+12|V+k+1]lim supk𝔼[ξk+12|V+k+1]σ2+ϵ\sigma^{2}(1-\epsilon)\leq\liminf_{k\to\infty}\mathbb{E}\left[\left.\xi_{k+1}^{2}\right|\mathcal{F}_{V+k+1}\right]\leq\limsup_{k\to\infty}\mathbb{E}\left[\left.\xi_{k+1}^{2}\right|\mathcal{F}_{V+k+1}\right]\leq\sigma^{2}+\epsilon

Since ϵ>0\epsilon>0 is arbitrary, it follows that the limit of the sequence exists and is equal to σ2\sigma^{2} almost surely.

Bounds on γV+k2\gamma_{V+k}^{2}. Define

ντ(x)={τx>τxxτ and ψτ(x)={xx>τ1τ1xτ1\nu_{\tau}(x)=\begin{cases}\tau&x>\tau\\ x&x\leq\tau\end{cases}\quad\text{ and }\quad\psi_{\tau}(x)=\begin{cases}x&x>\tau^{-1}\\ \tau^{-1}&x\leq\tau^{-1}\end{cases}

From this definition, we have that for all xx\in\mathbb{R}, ντϕτψτ\nu_{\tau}\leq\phi_{\tau}\leq\psi_{\tau}. In the next statement, for the maximum of a set of values, we use \vee, and for the minimum we use \wedge. Applying this relationship to γV+k2\gamma_{V+k}^{2}:

𝔼[ντ(ξk2)|V+k]τ1𝔼[γV+k2|V+k]𝔼[ψτ(ξk2)|V+k]τ\mathbb{E}\left[\left.\nu_{\tau}(\xi_{k}^{2})\right|\mathcal{F}_{V+k}\right]\vee\tau^{-1}\leq\mathbb{E}\left[\left.\gamma_{V+k}^{2}\right|\mathcal{F}_{V+k}\right]\leq\mathbb{E}\left[\left.\psi_{\tau}(\xi_{k}^{2})\right|\mathcal{F}_{V+k}\right]\wedge\tau

First, we consider the right hand side:

𝔼[ψτ(ξk2)|V+k]\displaystyle\mathbb{E}\left[\left.\psi_{\tau}(\xi_{k}^{2})\right|\mathcal{F}_{V+k}\right] =𝔼[ξk21[ξk2>τ1]|V+k]+τ1[ξk2<τ1|V+k]\displaystyle=\mathbb{E}\left[\left.\xi_{k}^{2}\textbf{1}\left[\xi_{k}^{2}>\tau^{-1}\right]\right|\mathcal{F}_{V+k}\right]+\tau^{-1}\mathbb{P}\left[\left.\xi_{k}^{2}<\tau^{-1}\right|\mathcal{F}_{V+k}\right]
𝔼[ξk2|V+k]+τ1\displaystyle\leq\mathbb{E}\left[\left.\xi_{k}^{2}\right|\mathcal{F}_{V+k}\right]+\tau^{-1}

Applying the first part of Proposition 10, lim supk𝔼[ψτ(ξk2)|V+k]σ2+τ1\limsup_{k\to\infty}\mathbb{E}\left[\left.\psi_{\tau}(\xi_{k}^{2})\right|\mathcal{F}_{V+k}\right]\leq\sigma^{2}+\tau^{-1} almost surely. For the reverse inequality:

𝔼[ντ(ξk2)|V+k]\displaystyle\mathbb{E}\left[\left.\nu_{\tau}(\xi_{k}^{2})\right|\mathcal{F}_{V+k}\right] =τ[ξk2>τ|V+k]+𝔼[ξk21[ξk2<τ]|V+k]\displaystyle=\tau\mathbb{P}\left[\left.\xi_{k}^{2}>\tau\right|\mathcal{F}_{V+k}\right]+\mathbb{E}\left[\left.\xi_{k}^{2}\textbf{1}\left[\xi_{k}^{2}<\tau\right]\right|\mathcal{F}_{V+k}\right]
𝔼[ξk2|V+k]𝔼[(ξk2τ)+|V+k]\displaystyle\geq\mathbb{E}\left[\left.\xi_{k}^{2}\right|\mathcal{F}_{V+k}\right]-\mathbb{E}\left[\left.\left(\xi_{k}^{2}-\tau\right)_{+}\right|\mathcal{F}_{V+k}\right]
𝔼[ξk2|V+k]𝔼[(ξk2σ2)+|V+k](σ2τ)+\displaystyle\geq\mathbb{E}\left[\left.\xi_{k}^{2}\right|\mathcal{F}_{V+k}\right]-\mathbb{E}\left[\left.\left(\xi_{k}^{2}-\sigma^{2}\right)_{+}\right|\mathcal{F}_{V+k}\right]-\left(\sigma^{2}-\tau\right)_{+}
𝔼[ξk2|V+k]𝔼[|ξk2σ2||V+k](σ2τ)+\displaystyle\geq\mathbb{E}\left[\left.\xi_{k}^{2}\right|\mathcal{F}_{V+k}\right]-\mathbb{E}\left[\left.\left|\xi_{k}^{2}-\sigma^{2}\right|\right|\mathcal{F}_{V+k}\right]-\left(\sigma^{2}-\tau\right)_{+}

By the first part of the result, the first term converges to σ2\sigma^{2}, and we are left with showing that the second term converges to 0.

𝔼[|ξk2σ2||V+k]\displaystyle\mathbb{E}\left[\left.\left|\xi_{k}^{2}-\sigma^{2}\right|\right|\mathcal{F}_{V+k}\right]
𝔼[|1kj=1kfjϵV+j2σ2||V+k]+2k𝔼[j=1kfj|ϵV+jXV+jT(ββV+j1)||V+k]\displaystyle\leq\mathbb{E}\left[\left.\left|\frac{1}{k}\sum_{j=1}^{k}f_{j}\epsilon_{V+j}^{2}-\sigma^{2}\right|\right|\mathcal{F}_{V+k}\right]+\frac{2}{k}\mathbb{E}\left[\left.\sum_{j=1}^{k}f_{j}\left|\epsilon_{V+j}X_{V+j}^{T}(\beta^{*}-\beta_{V+j-1})\right|\right|\mathcal{F}_{V+k}\right]
+1kj=1kfjXV+jTV+j1XV+j\displaystyle\quad+\frac{1}{k}\sum_{j=1}^{k}f_{j}X_{V+j}^{T}\mathcal{M}_{V+j-1}X_{V+j}

For the first term, we note that (1) the argument is bounded by 1kj=1kϵV+j2+σ2\frac{1}{k}\sum_{j=1}^{k}\epsilon_{V+j}^{2}+\sigma^{2}, whose conditional expectation is 2σ22\sigma^{2} by an identical conditioning argument as in the first part of the proof, and (2) because VV is finite almost surely and by the strong law of large numbers, almost surely limk1kj=1kfjϵV+j2=σ2\lim_{k\to\infty}\frac{1}{k}\sum_{j=1}^{k}f_{j}\epsilon_{V+j}^{2}=\sigma^{2}. Therefore, by the conditional Dominated Convergence Theorem, the first term converges to 0 almost surely.

For the cross term

2k𝔼[j=1kfj|ϵV+jXV+jT(ββV+j1)||V+k]\displaystyle\frac{2}{k}\mathbb{E}\left[\left.\sum_{j=1}^{k}f_{j}\left|\epsilon_{V+j}X_{V+j}^{T}(\beta^{*}-\beta_{V+j-1})\right|\right|\mathcal{F}_{V+k}\right]
2kj=1k𝔼[|ϵV+j|XV+j2ββV+j12|V+k]\displaystyle\quad\leq\frac{2}{k}\sum_{j=1}^{k}\mathbb{E}\left[\left.|\epsilon_{V+j}|\left\|X_{V+j}\right\|_{2}\left\|\beta^{*}-\beta_{V+j-1}\right\|_{2}\right|\mathcal{F}_{V+k}\right] (Cauchy-Schwarz)
2kj=1kl=1𝔼[|ϵl+j|Xl+j2βl+j1β21[V=l]|V+k1]\displaystyle\quad\leq\frac{2}{k}\sum_{j=1}^{k}\sum_{l=1}^{\infty}\mathbb{E}\left[\left.|\epsilon_{l+j}|\left\|X_{l+j}\right\|_{2}\left\|\beta_{l+j-1}-\beta^{*}\right\|_{2}\textbf{1}\left[V=l\right]\right|\mathcal{F}_{V+k-1}\right]

where, in the last line, the expectation and summation are interchanged by the conditional Monotone Convergence Theorem. Using an identical conditioning argument as in the first part of the proof, and noting that 𝔼[|ϵj|]2𝔼[|ϵj|2]=σ2\mathbb{E}\left[|\epsilon_{j}|\right]^{2}\leq\mathbb{E}\left[|\epsilon_{j}|^{2}\right]=\sigma^{2} by Jensen's inequality, the cross term is bounded by

2k𝔼[j=1kfj|ϵV+jXV+jT(ββV+j1)||V+k]2σkj=1kXV+j2𝐭𝐫[V+j1]1/2\frac{2}{k}\mathbb{E}\left[\left.\sum_{j=1}^{k}f_{j}\left|\epsilon_{V+j}X_{V+j}^{T}(\beta^{*}-\beta_{V+j-1})\right|\right|\mathcal{F}_{V+k}\right]\leq\frac{2\sigma}{k}\sum_{j=1}^{k}\left\|X_{V+j}\right\|_{2}\mathbf{tr}\left[\mathcal{M}_{V+j-1}\right]^{1/2}

Using the same argument as for the upper bound of 𝔼[ξk2|V+k]\mathbb{E}\left[\left.\xi_{k}^{2}\right|\mathcal{F}_{V+k}\right], we have that for any ϵ>0\epsilon>0

lim supk2k𝔼[j=1k|ϵV+jXV+jT(ββV+j1)||V+k]ϵ\limsup_{k\to\infty}\frac{2}{k}\mathbb{E}\left[\left.\sum_{j=1}^{k}\left|\epsilon_{V+j}X_{V+j}^{T}(\beta^{*}-\beta_{V+j-1})\right|\right|\mathcal{F}_{V+k}\right]\leq\epsilon

Therefore, the limit exists and is 0. To summarize:

lim infk𝔼[ντ(ξk2)|V+k]σ2(σ2τ)+=σ2τ\liminf_{k\to\infty}\mathbb{E}\left[\left.\nu_{\tau}(\xi_{k}^{2})\right|\mathcal{F}_{V+k}\right]\geq\sigma^{2}-\left(\sigma^{2}-\tau\right)_{+}=\sigma^{2}\wedge\tau

Combining this lower bound on \nu_{\tau} and the earlier upper bound on \psi_{\tau} with the inequalities relating them to \mathbb{E}\left[\left.\gamma_{V+k}^{2}\right|\mathcal{F}_{V+k}\right], and using the definition of \phi_{\tau}, we obtain the stated bounds \phi_{\tau}(\sigma^{2}) and \phi_{\tau}(\sigma^{2}+\tau^{-1}) almost surely, which completes the proof.   ∎

5 Algorithms

Using update equations (8) and (9), we state Algorithm 1, a provably convergent, fast, and simple algorithm for solving the linear regression problem. The algorithm has a computationally fast, justified stop condition (see Theorem 3), is insensitive to the problem's conditioning (see Theorem 6), converges for a robust choice of tuning parameters (see condition (13) and Theorem 7), and can be used on very large, infinite, or streaming data sources.

Input: β\beta arbitrary initialization, rule for choosing γk2\gamma_{k}^{2}, error tolerance ϵ>0\epsilon>0
MIM\leftarrow I
k0k\leftarrow 0
while 𝐭𝐫[M]>ϵ\mathbf{tr}\left[M\right]>\epsilon do
   Read next observation: (X,Y)(X,Y)
   Compute vMXv\leftarrow MX
   Update γk2\gamma_{k}^{2} according to the user-defined rule
   Compute denominator: sγk2+XTvs\leftarrow\gamma_{k}^{2}+X^{T}v
   Update parameter estimate: ββ+v(YXTβ)/s\beta\leftarrow\beta+v(Y-X^{T}\beta)/s
   Update covariance estimate: M(IvXTs)MM\leftarrow(I-\frac{vX^{T}}{s})M
   Increment kk+1k\leftarrow k+1
 
end while
Output: Estimates: (β,M)\left(\beta,M\right)
Algorithm 1 A kSGD Algorithm
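For concreteness, the following is a minimal Python sketch of Algorithm 1 (the experiments in §7 used a Julia implementation); the names data_stream and gamma_sq_rule are illustrative, and the rank-one form of the covariance update relies on M remaining symmetric.

```python
import numpy as np

def ksgd(data_stream, n, gamma_sq_rule, tol=1e-3):
    """Minimal sketch of Algorithm 1 for streaming linear regression.

    data_stream yields (x, y) pairs with x an n-vector and y a scalar;
    gamma_sq_rule(k, M, r) returns the tuning parameter gamma_k^2 given the
    iteration counter, the covariance estimate, and the current residual.
    """
    beta = np.zeros(n)           # arbitrary initialization
    M = np.eye(n)                # covariance estimate, M <- I
    k = 0
    for x, y in data_stream:
        if np.trace(M) <= tol:   # stop condition justified by Theorem 3
            break
        v = M @ x                            # v <- M X
        r = y - x @ beta                     # residual Y - X^T beta
        gamma_sq = gamma_sq_rule(k, M, r)    # user-defined tuning parameter
        s = gamma_sq + x @ v                 # denominator gamma_k^2 + X^T v
        beta = beta + v * r / s              # parameter update, cf. (8)
        # (I - v X^T / s) M equals M - v v^T / s because M stays symmetric
        M = M - np.outer(v, v) / s           # covariance update, cf. (9)
        k += 1
    return beta, M
```

For example, passing gamma_sq_rule = lambda k, M, r: 1.0 / (k + 1) reproduces, up to the indexing of k, the kSGD-1 choice \gamma_{k}^{2}=k^{-1} used in §7.1.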

When σ2\sigma^{2} is very large, the stop condition can be improved by using a tuning parameter strategy which asymptotically estimates σ2\sigma^{2} (see Theorem 8). Motivated by this result, a class of adaptive tuning parameter strategies was constructed (§4.4) and shown to converge to σ2\sigma^{2} in conditional expectation (see Proposition 10). One strategy from this class is stated in Algorithm 2.
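For reference, the expansion used in the proof of Proposition 10 suggests that update (20) can be applied recursively (a sketch; see (20) for the precise definition):

\xi_{k+1}^{2}=\frac{k\,\xi_{k}^{2}+f_{k+1}\left(Y_{V+k+1}-X_{V+k+1}^{T}\beta_{V+k}\right)^{2}}{k+1}

This is the form used in the sketch following Algorithm 2.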

Input: counter kk, lower tolerance LL, upper tolerance UU, threshold TT, 𝐭𝐫[Mk]\mathbf{tr}\left[M_{k}\right], estimator ξk2\xi_{k}^{2}, residual rk+1r_{k+1}
kk+1k\leftarrow k+1
Calculate fk[1+exp(𝐭𝐫[Mk1]T)]1f_{k}\leftarrow\left[1+\exp(\mathbf{tr}\left[M_{k-1}\right]-T)\right]^{-1}
Update ξk2\xi_{k}^{2} by (20)
γk2min{U,max{L,ξk2}}\gamma_{k}^{2}\leftarrow\min\left\{U,\max\{L,\xi_{k}^{2}\}\right\}
Output: (ξk2,γk2,k)\left(\xi_{k}^{2},\gamma_{k}^{2},k\right)
Algorithm 2 Adaptive Tuning Parameter Soft-Threshold Sub-routine
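A minimal Python sketch of this sub-routine follows; the recursive update of \xi_{k}^{2} is the form inferred above from the proof of Proposition 10, and the clamp on the exponent is a numerical safeguard that is not part of the pseudocode.

```python
import math

def adaptive_gamma_sq(k, L, U, T, trace_M, xi_sq, residual):
    """Sketch of the Algorithm 2 sub-routine (soft-threshold adaptive tuning).

    k is the counter, L and U are the lower and upper tolerances, T is the
    threshold, trace_M is tr[M_k] from kSGD, xi_sq is the estimate xi_k^2,
    and residual is r_{k+1} = Y_{k+1} - X_{k+1}^T beta_k.
    """
    k = k + 1
    z = min(trace_M - T, 700.0)                          # clamp to avoid overflow
    f_k = 1.0 / (1.0 + math.exp(z))                      # soft threshold on the trace
    xi_sq = ((k - 1) * xi_sq + f_k * residual ** 2) / k  # recursive form of (20)
    gamma_sq = min(U, max(L, xi_sq))                     # clip the estimate to [L, U]
    return xi_sq, gamma_sq, k
```

Wrapping this sub-routine so that it tracks k, xi_sq, and tr[M] between calls yields a gamma_sq_rule compatible with the kSGD sketch above.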

6 Complexity Comparison

Table 1: A comparison of the number of data points assimilated, gradient evaluations, Hessian evaluations, floating point operations and memory requirements per iteration between SGD, kSGD, and SQN. For SQN, MM is the number of curvature correction pairs stored, bb is the size of batch gradients, bhb_{h} is the size of batch Hessians, and LL is the number of iterations between BFGS updates.
Method Data Assim. Gradient Hessian FP Ops Memory
SGD 1 1 0 𝒪[n]\mathcal{O}\left[n\right] 𝒪[n]\mathcal{O}\left[n\right]
kSGD 1 1 0 𝒪[n2]\mathcal{O}\left[n^{2}\right] 𝒪[n2]\mathcal{O}\left[n^{2}\right]
SQN b+bh/Lb+b_{h}/L bb bh/Lb_{h}/L 𝒪[(M+b)n+(bh/L)n2]\mathcal{O}\left[(M+b)n+(b_{h}/L)n^{2}\right] 𝒪[Mn]\mathcal{O}\left[Mn\right]

Table 1 reports several common metrics for assessing stochastic optimization algorithms. One notable property in Table 1 is that kSGD has the highest memory requirements. Therefore, when nn is large, kSGD becomes infeasible; this can be addressed by making kSGD a low-memory method, but we do not consider this further in this paper. Another notable property is that the floating point costs per data point assimilated appear to be approximately the same for kSGD and SQN. However, the comparability of these two values depends on bhb_{h}: for many statistical regression problems, bh<nb_{h}<n would lead to rank-deficient estimates of the Hessian, which would in turn lead to poorly scaled BFGS updates to the inverse Hessian. Therefore, bhb_{h} should be taken to be larger than nn to ensure full-rank estimates of the Hessian. In this case, SQN requires at least 𝒪(n3/L)\mathcal{O}(n^{3}/L) floating point operations per data point assimilated. Since LL is recommended to be 1010 or 2020 [8], SQN has a greater computational cost than kSGD when n>20n>20.

7 Numerical Experiments

Experiments were performed on three problems: a linear regression problem on medical claims payments from CMS [16, 30], an additive non-parametric Haar wavelet regression problem on gas sensor readings [32, 15], and a logistic regression problem on adult income classification [33, 21]. For each problem, the dimension of the unknown parameter (nn), the number of observations (NN), the condition number (κ\kappa) of the Hessian at the minimizer, and the optimization methods implemented on the problem are collected in Table 2.

Remark 8.

For the linear and Haar wavelet regression problems, the Hessian does not depend on the parameter, and so it can be calculated directly. For the logistic regression problem, the minimizer was first calculated using generalized Gauss-Newton (GN) [41] and confirmed by checking that the composite gradient at the minimizer had a Euclidean norm no larger than 101010^{-10}; the Hessian was then calculated at this approximate minimizer.

Table 2: A tabulation of the number of parameters (nn), number of observations (NN), condition number (κ\kappa) of the Hessian at the minimizer, and optimization methods implemented for each of the three problem types. Note, the Haar wavelet regression problem’s maximum eigenvalue was 28.728.7, but its smallest eigenvalue was within numerical precision of zero.
Problem NN nn κ\kappa Methods
CMS-Linear 2,801,6602,801,660 3434 2.44×1062.44\times 10^{6} kSGD, SGD, SQN
Gas-Haar 4,177,0044,177,004 1,2631,263 - kSGD, SGD, SQN
Income-Logistic 30,16230,162 2929 1.96×10241.96\times 10^{24} kSGD, SGD, SQN, GN

For each method, intermediate parameter values, elapsed compute time, and data points assimilated (ADP) were periodically stored. Once the method terminated, the objective function was calculated at each stored parameter value using the entire data set. The methods are compared along two dimensions:

  1. Efficiency: the number of data points assimilated (ADP) to achieve the objective function value. The higher the efficiency of a method, the less information it needs to minimize the objective function. Thus, higher efficiency methods require fewer data points or fewer epochs in order to achieve the same objective function value in comparison to a lower efficiency method.

  2. Effort: the elapsed time (in seconds) to achieve the objective function value. This metric is a proxy for the cost of gradient evaluations, Hessian evaluations, floating point operations, and I/O latencies. Thus, higher effort methods require more resources or more time in order to achieve the same objective function value in comparison to a lower effort method.

The methods were implemented in the Julia programming language (v0.4.5). For the linear and logistic regression problems, the methods were run on an Intel i5 (3.33GHz) CPU with 3.7 GB of memory; for the Haar wavelet regression problem, the methods were run on an Intel X5650 (2.67GHz) CPU with 10 GB of memory.

Remark 9.

The objective function for the linear and Haar wavelet regression is the mean of the residuals squared (MRS). Therefore, for these problems, the results are discussed in terms of MRS.

7.1 Linear Regression for CMS Payment Data

We modeled the medical claims payment as a linear combination of the patient’s sex, age and place of service. Because the explanatory variables were categorical, there were n=34n=34 parameters. The optimal MRS was determined to be 38,142.638,142.6 using an incremental QR algorithm [25].

The three methods, SGD, kSGD and SQN, were initialized at zero. For SGD, the learning rate was taken to be of the general form

η(k,p,c1,c2,c3)=c11[kc2]+c3(kc2)p1[k>c2]\eta(k,p,c_{1},c_{2},c_{3})=c_{1}\textbf{1}\left[k\leq c_{2}\right]+\frac{c_{3}}{(k-c_{2})^{p}}\textbf{1}\left[k>c_{2}\right] (21)

where kk is the ADP, p(0.5,1]p\in(0.5,1] (see [34, 27]), c1[0,)c_{1}\in[0,\infty), c2[0,]c_{2}\in[0,\infty] (see [5]), and c3(0,)c_{3}\in(0,\infty). SGD was implemented with learning rates over a grid of values for pp, c1c_{1}, c2c_{2} and c3c_{3}. The best learning rate, (p=0.75,c1=0.01,c2=105,c3=1)(p=0.75,c_{1}=0.01,c_{2}=10^{5},c_{3}=1), achieved the smallest MRS. Note that this learning rate took only 0.010.01 seconds longer per epoch than the fastest learning rate. For SQN, the parameters bb, bhb_{h} and LL were allowed to vary over the recommended values b=100,1000b=100,1000, bh=300,600b_{h}=300,600, L=10,20L=10,20 [8], MM was allowed to take values 10,20,3410,20,34, and the learning rate constant, cc, was allowed to vary over a grid of positive numbers. The best set of parameters, (b=1000,bh=300,L=20,M=34,c=2)(b=1000,b_{h}=300,L=20,M=34,c=2), came within 2%2\% of the optimal MRS with the smallest ADP and in the least amount of time. The tuning parameters for kSGD, summarized in Table 3, were selected to reflect the concepts discussed in §4.3 and were not determined based on any results from running the method.
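As an illustration, a minimal Python sketch of schedule (21), evaluated at the best reported setting (the grid search itself is omitted):

```python
def sgd_learning_rate(k, p, c1, c2, c3):
    """Schedule (21): a constant rate c1 for the first c2 data points
    assimilated, then a polynomially decaying rate c3 / (k - c2)^p."""
    return c1 if k <= c2 else c3 / (k - c2) ** p

# Best reported setting for the CMS linear regression problem
eta = lambda k: sgd_learning_rate(k, p=0.75, c1=0.01, c2=1e5, c3=1.0)
```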

Table 3: Tuning parameter selection for kSGD method. kSGD-1 uses a tuning parameter based on Theorem 7. kSGD-2 uses a tuning parameter based on (7). kSGD-3 uses a tuning parameter to increase the speed of convergence based on the discussion in §4.3.
Label γk2\gamma_{k}^{2}
kSGD-1 k1k^{-1}
kSGD-2 38,00038,000
kSGD-3 0.00010.0001

Figure 1 visualizes the differences between SGD, kSGD and SQN in terms of efficiency and effort. In terms of efficiency, kSGD-1, kSGD-3 and SQN are comparable, whereas kSGD-2 and SGD perform quite poorly in comparison. For kSGD-2, this behavior is to be expected for large choices of γk2\gamma_{k}^{2} as discussed below. For SGD, despite an optimal choice in the learning rate, it still does not come close to the optimal MRS after a single epoch. Even when SGD is allowed to complete multiple epochs so that its total elapsed time is greater than kSGD’s single epoch elapsed time (Figure 1, right), SGD does not make meaningful improvements towards the optimal MRS. Indeed, this is to be expected as the rate of convergence of SGD is 𝒪(σ2κ2/k)\mathcal{O}(\sigma^{2}\kappa^{2}/k) where kk is the ADP [7], and, therefore, SGD requires approximately 𝒪(109)\mathcal{O}(10^{9}) epochs to converge to the minimizer. We also see in Figure 1 (right) that kSGD-1 and kSGD-3 require much less effort to calculate the minimizer in comparison to SQN.

Fig. 1: A comparison of the performance of SGD, kSGD and SQN for the linear regression problem. Left: In terms of efficiency, kSGD-1, kSGD-3 and SQN are comparable, whereas kSGD-2 and SGD perform quite poorly in comparison. Right: kSGD-1 and kSGD-3 produce nearly optimal estimates within the first 50 seconds, which is approximately the amount of time SGD needs to complete one epoch. Also, kSGD-1 and kSGD-3 require less effort than SQN.

Figure 2 visualizes the behavior of the covariance estimates for each of the three kSGD tuning parameter choices. We highlight the sluggish behavior of MkM_{k} for kSGD-2, which underscores one of the ideas discussed in §4.3: if σ2\sigma^{2} is large, choosing γk2\gamma_{k}^{2} to approximate σ2\sigma^{2} for all kk will slow down the convergence of the algorithm. Another important property is that, despite some variability, 𝐭𝐫[Mk]\mathbf{tr}\left[M_{k}\right] is reflective of the decay in MRS; this property empirically reinforces the result in Theorem 3, and the claim that MkM_{k} can be used as a stop condition in practice.

Fig. 2: A comparison of the MRS and trace of the covariance. The rapid decay of kSGD-1’s and kSGD-3’s covariance is reflected in the rapid decay of their MRS. On the other hand, kSGD-2’s covariance is decaying slowly and this too is reflected in the slow decay of the MRS.

7.2 Nonparametric Wavelet Regression on Gas Sensor Readings

We modeled the ethylene concentration in a mixture of Methane, Ethylene and Air as an additive non-parametric function of time and sixteen gas sensor voltage readings. Because the response and explanatory variables were in bounded intervals, the function space was approximated using Haar wavelets without shifts [18]. The resolution for the time component was 99, and the resolution for each gas sensor was 33, which resulted in features and a parameter of dimension n=1,263n=1,263.

Remark 10.

Because of their high cost of calculation, the features were computed in advance of running the methods.

Again, SGD, kSGD and SQN were implemented and initialized at zero. For SGD, using the same criteria as described in §7.1, the best learning rate was found to be (p=0.8,c1=0.0,c2=0.0,c3=1)(p=0.8,c_{1}=0.0,c_{2}=0.0,c_{3}=1). For SQN, regardless of the choice of parameters (over a grid larger than the one used in §7.1), the BFGS estimates quickly became unstable and caused the parameter estimate to diverge. For kSGD, the method was implemented with γk2=0.0001\gamma_{k}^{2}=0.0001.

Figure 3 compares SGD and kSGD. Although kSGD is much more efficient than SGD, it is remarkably slower than SGD. For this problem, this difference in effort can be reduced by using sparse matrix techniques since at most 7474 of the 1,2631,263 components in each feature vector are non-zero; however, for dense problems this issue can only be resolved by parallelizing the floating point operations at each iteration.
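To illustrate the first remedy, the following is a hypothetical Python sketch of a single kSGD step that exploits a sparse feature vector; the variable names are illustrative, and only the matrix-vector product and inner products benefit, since the rank-one covariance update remains dense.

```python
import numpy as np

def ksgd_sparse_step(beta, M, idx, x_vals, y, gamma_sq):
    """One kSGD update when the feature vector X is nonzero only at positions
    idx with values x_vals. Forming v = M X touches only len(idx) columns of M,
    reducing that cost from O(n^2) to O(n * nnz); the rank-one update of M is
    still O(n^2).
    """
    v = M[:, idx] @ x_vals              # v = M X using only the nonzero entries
    r = y - x_vals @ beta[idx]          # residual Y - X^T beta
    s = gamma_sq + x_vals @ v[idx]      # gamma^2 + X^T M X
    beta = beta + v * r / s             # dense parameter update
    M = M - np.outer(v, v) / s          # dense rank-one covariance update
    return beta, M
```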

Fig. 3: A comparison of SGD and kSGD on the Haar wavelet regression problem. Left: kSGD is more efficient than SGD. Right: kSGD requires significantly more effort.

7.3 Logistic Regression on Adult Incomes

We modeled the two income classes using a logistic model with eight demographic explanatory variables. Four of the demographic variables were continuous and the remaining four were categorical, which resulted in n=29n=29 parameters.

SGD, kSGD, SQN and GN were implemented and initialized at zero. For SGD, the best learning rate was found to be (p=0.5,c1=0.0,c2=0.0,c3=0.01)(p=0.5,c_{1}=0.0,c_{2}=0.0,c_{3}=0.01). For SQN, the best parameter set was found to be (b=1000,bh=300,L=10,M=29,c=10)(b=1000,b_{h}=300,L=10,M=29,c=10). For GN, there were no tuning parameters. kSGD was adapted to solve the GN subproblems up to a specified threshold on the trace of the covariance estimate. After one subproblem was solved, the threshold was decreased by a fixed factor, which was arbitrarily selected to be 5. The threshold was arbitrarily initialized at 15, and γk2\gamma_{k}^{2} was started at 0.00010.0001 and increased by a factor of 1010 until the method did not fail; the method succeeded when γk2=0.1\gamma_{k}^{2}=0.1.
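For concreteness, the following is a hypothetical Python sketch of this scheme; the Gauss-Newton subproblem is formed as the standard weighted least-squares problem in the working responses, and the warm start of each subproblem at the current iterate, the reset of M to the identity, and the single randomized pass per subproblem are assumptions not specified above (the experiments themselves used a Julia implementation).

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gn_ksgd_logistic(X, y, gamma_sq=0.1, threshold=15.0, shrink=5.0, outer_iters=10):
    """kSGD used to incompletely solve Gauss-Newton subproblems for logistic
    regression: each subproblem is a weighted least-squares problem in the
    working responses, and kSGD runs on it until tr[M] falls below the current
    threshold, after which the threshold is divided by `shrink`.
    """
    N, n = X.shape
    beta = np.zeros(n)
    for _ in range(outer_iters):
        mu = sigmoid(X @ beta)
        w = np.clip(mu * (1.0 - mu), 1e-10, None)   # Fisher weights
        z = X @ beta + (y - mu) / w                 # working responses
        Xw = X * np.sqrt(w)[:, None]                # sqrt-weighted features
        zw = z * np.sqrt(w)
        b, M = beta.copy(), np.eye(n)               # warm start; reset covariance
        for i in np.random.permutation(N):          # at most one pass over the data
            if np.trace(M) <= threshold:
                break
            x = Xw[i]
            v = M @ x
            s = gamma_sq + x @ v
            b = b + v * (zw[i] - x @ b) / s
            M = M - np.outer(v, v) / s
        beta = b
        threshold /= shrink
    return beta
```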

Figure 4 visualizes the efficiency and effort of the four methods. In general, the behavior of kSGD, SGD and SQN follow the trends in the two preceding examples. An interesting feature is that kSGD had greater efficiency and required less effort than GN. This is due to the fact that kSGD incompletely solves the GN subproblem away from the minimizer, while GN solves the subproblem exactly at each iteration. However, it is important to note that the choice of kSGD tuning parameters is not as straightforward for the logistic regression problem as it is for the linear regression problem, and appropriate choices will not be discussed further in this paper.

Fig. 4: A comparison of SGD, kSGD, SQN and GN on the logistic regression problem. Left: kSGD is more efficient than all three methods. Right: kSGD requires less effort than all three methods.

8 Conclusion

We developed and analyzed kSGD on the limited but important class of linear regression problems. Through our analysis, we obtain a method which (1) asymptotically recovers an optimal stochastic update method, (2) converges for a robust choice of tuning parameters, (3) is insensitive to the problem's conditioning, and (4) has a computationally efficient stop condition. As a result of our analysis, we translated this method into a simple algorithm for estimating linear regression parameters; we then successfully implemented this algorithm for solving linear, non-parametric wavelet and logistic regression problems using real data. Moreover, our analysis provides a novel strategy for analyzing the convergence of second order SGD and proximal methods, which leads to theoretical results that correspond to intuitive expectations. Finally, our analysis provides a foundation for embedding kSGD in multiple epoch algorithms, extending kSGD to a larger class of problems, and developing parallel and low memory kSGD algorithms.

Appendix A Renewal Process

We show that the stochastic process {Sk:k+1}\{S_{k}:k+1\in\mathbb{N}\} defined in (15) is a renewal process (see Section 4.4 in [12]). We define the inter-arrival times, Tk=SkSk1T_{k}=S_{k}-S_{k-1} for all kk\in\mathbb{N}.

Lemma 11.

If X1,X2,X_{1},X_{2},\ldots are independent and identically distributed, then T1,T2,T_{1},T_{2},\ldots are independent and identically distributed.

Proof.

Let k,jk,j\in\mathbb{N}.

[Tkj]\displaystyle\mathbb{P}\left[T_{k}\geq j\right] =s=1[Tkj|Sk1=s][Sk1=s]\displaystyle=\sum_{s=1}^{\infty}\mathbb{P}\left[T_{k}\geq j|S_{k-1}=s\right]\mathbb{P}\left[S_{k-1}=s\right]
=s=1[span[Xs+1,,Xs+j]=n|Sk1=s][Sk1=s]\displaystyle=\sum_{s=1}^{\infty}\mathbb{P}\left[span[X_{s+1},\ldots,X_{s+j}]=\mathbb{R}^{n}|S_{k-1}=s\right]\mathbb{P}\left[S_{k-1}=s\right]

Note that σ(Xs+1,,Xs+j)\sigma(X_{s+1},\ldots,X_{s+j}) is independent of σ(X1,,Xs)\sigma(X_{1},\ldots,X_{s}), and so:

[Tkj]\displaystyle\mathbb{P}\left[T_{k}\geq j\right] =s=1[span[Xs+1,,Xs+j]=n][Sk1=s]\displaystyle=\sum_{s=1}^{\infty}\mathbb{P}\left[span[X_{s+1},\ldots,X_{s+j}]=\mathbb{R}^{n}\right]\mathbb{P}\left[S_{k-1}=s\right]

Finally, X1,,XjX_{1},\ldots,X_{j} has the same distribution as Xs+1,,Xs+jX_{s+1},\ldots,X_{s+j}. Therefore:

[Tkj]\displaystyle\mathbb{P}\left[T_{k}\geq j\right] =s=1[span[X1,,Xj]=n][Sk1=s]\displaystyle=\sum_{s=1}^{\infty}\mathbb{P}\left[span[X_{1},\ldots,X_{j}]=\mathbb{R}^{n}\right]\mathbb{P}\left[S_{k-1}=s\right]
=[T1j]s=1[Sk1=s]\displaystyle=\mathbb{P}\left[T_{1}\geq j\right]\sum_{s=1}^{\infty}\mathbb{P}\left[S_{k-1}=s\right]
=[T1j]\displaystyle=\mathbb{P}\left[T_{1}\geq j\right]

We have established that T1,T2,T_{1},T_{2},\ldots are identically distributed. Now let k1<k2<<krk_{1}<k_{2}<\cdots<k_{r} be positive integers with rr\in\mathbb{N}. Let j1,,jrj_{1},\ldots,j_{r}\in\mathbb{N}.

[Tkr=jr,,Tk1=j1]\displaystyle\mathbb{P}\left[T_{k_{r}}=j_{r},\ldots,T_{k_{1}}=j_{1}\right]
=s=1[Tkr=jr|Skr1=s,Tkr1=jr1,,Tk1=j1]\displaystyle=\sum_{s=1}^{\infty}\mathbb{P}\left[T_{k_{r}}=j_{r}|S_{k_{r}-1}=s,T_{k_{r-1}}=j_{r-1},\ldots,T_{k_{1}}=j_{1}\right]
×[Skr1=s,Tkr1=jr1,,Tk1=j1]\displaystyle\quad\times\mathbb{P}\left[S_{k_{r}-1}=s,T_{k_{r-1}}=j_{r-1},\ldots,T_{k_{1}}=j_{1}\right]

Just as above, σ(Xs+1,,Xs+jr)\sigma(X_{s+1},\ldots,X_{s+j_{r}}) is independent of σ(X1,,Xs)\sigma(X_{1},\ldots,X_{s}) and so:

[Tkr=jr,,Tk1=j1]\displaystyle\mathbb{P}\left[T_{k_{r}}=j_{r},\ldots,T_{k_{1}}=j_{1}\right]
=s=1[Tkr=jr][Skr1=s,Tkr1=jr1,,Tk1=j1]\displaystyle=\sum_{s=1}^{\infty}\mathbb{P}\left[T_{k_{r}}=j_{r}\right]\mathbb{P}\left[S_{k_{r}-1}=s,T_{k_{r-1}}=j_{r-1},\ldots,T_{k_{1}}=j_{1}\right]
=[Tkr=jr]s=1[Skr1=s,Tkr1=jr1,,Tk1=j1]\displaystyle=\mathbb{P}\left[T_{k_{r}}=j_{r}\right]\sum_{s=1}^{\infty}\mathbb{P}\left[S_{k_{r}-1}=s,T_{k_{r-1}}=j_{r-1},\ldots,T_{k_{1}}=j_{1}\right]
=[Tkr=jr][Tkr1=jr1,,Tk1=j1]\displaystyle=\mathbb{P}\left[T_{k_{r}}=j_{r}\right]\mathbb{P}\left[T_{k_{r-1}}=j_{r-1},\ldots,T_{k_{1}}=j_{1}\right]

Applying the argument recursively, we have established independence.   ∎

Lemma 12.

If X1,X2,X_{1},X_{2},\ldots are independent, identically distributed, and satisfy Assumption 4, then 𝔼[T1]<\mathbb{E}\left[T_{1}\right]<\infty and 𝔼[Sk]=k𝔼[T1]\mathbb{E}\left[S_{k}\right]=k\mathbb{E}\left[T_{1}\right] for all kk.

Proof.

We prove that T1T_{1} is a sum of nn random variables, each stochastically dominated by a geometric random variable, and so its expectation is finite. Let 𝒮m=span[X1,,Xm]\mathcal{S}_{m}=span[X_{1},\ldots,X_{m}]. In this notation, we have that T1=inf{m>0:𝒮m=ℝn}T_{1}=\inf\{m>0:\mathcal{S}_{m}=\mathbb{R}^{n}\}. We will now decompose T1T_{1} into P1,,PnP_{1},\ldots,P_{n}, where Pk=inf{m>Pk1:dim(𝒮m)=k}P_{k}=\inf\{m>P_{k-1}:\dim(\mathcal{S}_{m})=k\} with P0=0P_{0}=0. By construction:

0=P0<P1<<Pn=T10=P_{0}<P_{1}<\cdots<P_{n}=T_{1}

Let U1,,UnU_{1},\ldots,U_{n} denote the inter-arrival times defined by Uk=PkPk1U_{k}=P_{k}-P_{k-1}. On the event that Uk=jU_{k}=j, we have that vn\exists v\in\mathbb{R}^{n} with v=1\left\|v\right\|=1 such that 𝒮Pk1+j1\mathcal{S}_{P_{k-1}+j-1} is orthogonal to vv, and by Assumption 4, p=1[X1Tv=0]>0p=1-\mathbb{P}\left[X_{1}^{T}v=0\right]>0. Then, [Uk=j](1p)j1\mathbb{P}\left[U_{k}=j\right]\leq(1-p)^{j-1}. Thus, 𝔼[Uk]<\mathbb{E}\left[U_{k}\right]<\infty. Therefore, 𝔼[T1]=𝔼[U1]+𝔼[U2]++𝔼[Un]<\mathbb{E}\left[T_{1}\right]=\mathbb{E}\left[U_{1}\right]+\mathbb{E}\left[U_{2}\right]+\cdots+\mathbb{E}\left[U_{n}\right]<\infty. Now, by Lemma 11, T1,,TkT_{1},\ldots,T_{k} are independent and identically distributed. Therefore, 𝔼[Sk]=𝔼[T1]++𝔼[Tk]=k𝔼[T1]\mathbb{E}\left[S_{k}\right]=\mathbb{E}\left[T_{1}\right]+\cdots+\mathbb{E}\left[T_{k}\right]=k\mathbb{E}\left[T_{1}\right].   ∎

Lemma 13.

If X1,X2,X_{1},X_{2},\ldots are independent and identically distributed, then 𝒳0,𝒳1,\mathcal{X}_{0},\mathcal{X}_{1},\ldots defined in (16) are independent and identically distributed. In particular, the eigenvalues of 𝒳0,𝒳1,\mathcal{X}_{0},\mathcal{X}_{1},\ldots are independent and identically distributed.

Proof.

Let AA be a measurable set.

[𝒳kA]\displaystyle\mathbb{P}\left[\mathcal{X}_{k}\in A\right]
=s=1t=1[Xs+1Xs+1T++Xs+tXs+tTA|Tk+1=t,Sk=s]\displaystyle=\sum_{s=1}^{\infty}\sum_{t=1}^{\infty}\mathbb{P}\left[X_{s+1}X_{s+1}^{T}+\cdots+X_{s+t}X_{s+t}^{T}\in A|T_{k+1}=t,S_{k}=s\right]
×[Tk+1=t|Sk=s][Sk=s]\displaystyle\quad\times\mathbb{P}\left[T_{k+1}=t|S_{k}=s\right]\mathbb{P}\left[S_{k}=s\right]
=s=1t=1[X1X1T++XtXtTA|T1=t][T1=t][Sk=s]\displaystyle=\sum_{s=1}^{\infty}\sum_{t=1}^{\infty}\mathbb{P}\left[X_{1}X_{1}^{T}+\cdots+X_{t}X_{t}^{T}\in A|T_{1}=t\right]\mathbb{P}\left[T_{1}=t\right]\mathbb{P}\left[S_{k}=s\right]
=[𝒳0A]\displaystyle=\mathbb{P}\left[\mathcal{X}_{0}\in A\right]

Thus, 𝒳0,𝒳1,\mathcal{X}_{0},\mathcal{X}_{1},\ldots are identically distributed. By the independence of X1,X2,X_{1},X_{2},\ldots and the independence of T1,T2,T_{1},T_{2},\ldots, which is established in Lemma 11, 𝒳0,𝒳1,\mathcal{X}_{0},\mathcal{X}_{1},\ldots are independent because they are functions of independent random variables. Finally, since the eigenvalues of 𝒳k\mathcal{X}_{k} can be calculated using the Courant-Fischer Principle, we have that they too are independent and identically distributed.   ∎

Lemma 14.

If X1,X2,X_{1},X_{2},\ldots are independent and identically distributed, and satisfy Assumptions 2 and 4, then α>0\exists\alpha>0 such that for all kk, 𝔼[λmin(𝒳k)]α>0\mathbb{E}\left[\lambda_{\min}(\mathcal{X}_{k})\right]\geq\alpha>0.

Proof.

By Lemma 13, we need only consider 𝒳0\mathcal{X}_{0}. Suppose there exists a vnv\in\mathbb{R}^{n} with v=1\left\|v\right\|=1 such that 𝔼[vT𝒳0v]=0\mathbb{E}\left[v^{T}\mathcal{X}_{0}v\right]=0. Note, by Assumption 2 and Cauchy-Schwarz, the expectation is well defined. Since 𝒳00\mathcal{X}_{0}\succeq 0 by construction, almost surely

0=vT𝒳0v=s=1S1(XsTv)20=v^{T}\mathcal{X}_{0}v=\sum_{s=1}^{S_{1}}(X_{s}^{T}v)^{2}

Thus, XsTv=0X_{s}^{T}v=0 for all s=1,,S1s=1,\ldots,S_{1}. However, by construction, Assumption 4 and Lemma 12, X1,,XS1X_{1},\ldots,X_{S_{1}} span n\mathbb{R}^{n} and S1<S_{1}<\infty almost surely, hence there is an sS1s\leq S_{1} such that XsTv0X_{s}^{T}v\neq 0 almost surely. Therefore, we have a contradiction.   ∎

Appendix B Some Properties of LL^{\infty} Random Variables

In this section, we establish some useful properties of LL^{\infty} random variables.

Lemma 15.

Suppose XLX\in L^{\infty} is a random variable taking values in n\mathbb{R}^{n}. Then, for C=XC=\left\|X\right\|_{\infty},

0XXTnC2I and nC2IXXT𝔼[XXT]nC2I0\preceq XX^{T}\preceq nC^{2}I\quad\text{ and }\quad-nC^{2}I\preceq XX^{T}-\mathbb{E}\left[XX^{T}\right]\preceq nC^{2}I

almost surely.

Proof.

The lower bound is straightforward. For the upper bound, let vv be any unit vector. By Cauchy-Schwarz:

vTXXTv\displaystyle v^{T}XX^{T}v =(XTv)2X22v22nC2v22nC2\displaystyle=(X^{T}v)^{2}\leq\left\|X\right\|_{2}^{2}\left\|v\right\|_{2}^{2}\leq nC^{2}\left\|v\right\|_{2}^{2}\leq nC^{2}

where C=XC=\left\|X\right\|_{\infty}. Thus, the first set of inequalities holds almost surely. Moreover, the first set of inequalities implies 0𝔼[XXT]nC2I0\preceq\mathbb{E}\left[XX^{T}\right]\preceq nC^{2}I. Hence,

𝔼[XXT]\displaystyle-\mathbb{E}\left[XX^{T}\right] XXT𝔼[XXT]nC2I𝔼[XXT]\displaystyle\preceq XX^{T}-\mathbb{E}\left[XX^{T}\right]\preceq nC^{2}I-\mathbb{E}\left[XX^{T}\right]
nC2I\displaystyle-nC^{2}I XXT𝔼[XXT]nC2I\displaystyle\preceq XX^{T}-\mathbb{E}\left[XX^{T}\right]\preceq nC^{2}I   ∎

Lemma 16.

Let Z1,Z2,,ZkZ_{1},Z_{2},\ldots,Z_{k} be mean zero, independent random variables with Z1==Zk=D>0\left\|Z_{1}\right\|_{\infty}=\cdots=\left\|Z_{k}\right\|_{\infty}=D>0. Then

𝔼[(j=1kZj)6]𝒪[D6k3]\mathbb{E}\left[\left(\sum_{j=1}^{k}Z_{j}\right)^{6}\right]\leq\mathcal{O}\left[D^{6}k^{3}\right]
Proof.

Note that 𝔼[Zj]=0\mathbb{E}\left[Z_{j}\right]=0 and the ZjZ_{j} are independent. Hence, in the expansion of the sixth power, any monomial containing a factor with exponent one has zero expectation. So, we need only count the monomials whose factors all have exponent at least two:

  1. There are kk terms of the form (Zj)6(Z_{j})^{6}.

  2. There are (k2)\binom{k}{2} terms of the form (Zj)2(Zi)4(Z_{j})^{2}(Z_{i})^{4} with 6!4!2!=15\frac{6!}{4!2!}=15 ways of choosing the exponents.

  3. There are (k2)\binom{k}{2} terms of the form (Zj)3(Zi)3(Z_{j})^{3}(Z_{i})^{3} with 6!3!3!=20\frac{6!}{3!3!}=20 ways of choosing the exponents.

  4. There are (k3)\binom{k}{3} terms of the form (Zj)2(Zi)2(Zl)2(Z_{j})^{2}(Z_{i})^{2}(Z_{l})^{2} with 6!2!2!2!=90\frac{6!}{2!2!2!}=90 ways of choosing the exponents.

Since DZjD-D\leq Z_{j}\leq D almost surely by assumption, each such monomial has expectation bounded in absolute value by D6D^{6}. Combining the counts above and bounding \binom{k}{2}\leq k^{2} and \binom{k}{3}\leq k^{3}, we have: 𝔼[(j=1kZj)6](k+35k2+90k3)D6\mathbb{E}\left[\left(\sum_{j=1}^{k}Z_{j}\right)^{6}\right]\leq\left(k+35k^{2}+90k^{3}\right)D^{6}.   ∎

Acknowledgements

This work is supported by the Department of Statistics at the University of Chicago. I would also like to express my deep gratitude to Mihai Anitescu for his insightful feedback on the many drafts of this work and his generous patience and guidance throughout the research and publication process. I would also like to thank Madhukanta Patel, who, despite her illnesses and advanced age, scolded me if I momentarily stopped working on this manuscript while sitting at her bedside. I am also deeply grateful to the reviewers for their detailed readings and useful criticisms.

References

  • [1] Shun-Ichi Amari, Hyeyoung Park, and Kenji Fukumizu, Adaptive method of realizing natural gradient learning for multilayer perceptrons, Neural Computation, 12 (2000), pp. 1399–1409.
  • [2] Yurii S Aulchenko, Maksim V Struchalin, and Cornelia M van Duijn, Probabel package for genome-wide association analysis of imputed data, BMC bioinformatics, 11 (2010), p. 1.
  • [3] Derek Bean, Peter J Bickel, Noureddine El Karoui, and Bin Yu, Optimal m-estimation in high-dimensional regression, Proceedings of the National Academy of Sciences, 110 (2013), pp. 14563–14568.
  • [4] Dimitri P Bertsekas, Nonlinear Programming, Athena scientific, 1999.
  • [5] Dimitri P. Bertsekas, Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization: A Survey, 2010, ch. 4.
  • [6] Antoine Bordes, Léon Bottou, and Patrick Gallinari, Sgd-qn: Careful quasi-newton stochastic gradient descent, The Journal of Machine Learning Research, 10 (2009), pp. 1737–1754.
  • [7] Olivier Bousquet and Léon Bottou, The tradeoffs of large scale learning, in Advances in Neural Information Processing Systems, 2008, pp. 161–168.
  • [8] RH Byrd, SL Hansen, Jorge Nocedal, and Y Singer, A stochastic quasi-newton method for large-scale optimization, SIAM Journal on Optimization, 26 (2016), pp. 1008–1031.
  • [9] ATLAS collaboration, TA Collaboration, et al., Observation of an excess of events in the search for the standard model higgs boson in the gamma-gamma channel with the atlas detector, ATLAS-CONF-2012-091, 2012.
  • [10] Patrick L Combettes and Jean-Christophe Pesquet, Proximal splitting methods in signal processing, in Fixed-point algorithms for inverse problems in science and engineering, Springer, 2011, pp. 185–212.
  • [11] Richard Courant and David Hilbert, Method of Mathematical Physics, vol I, New York: Interscience, 1953.
  • [12] Rick Durrett, Probability: Theory and Examples, Cambridge University Press, 2010.
  • [13] Diego Fabregat-Traver, Yurii S Aulchenko, and Paolo Bientinesi, Solving sequences of generalized least-squares problems on multi-threaded architectures, Applied Mathematics and Computation, 234 (2014), pp. 606–617.
  • [14] Olivier Fercoq and Peter Richtárik, Accelerated, parallel, and proximal coordinate descent, SIAM Journal on Optimization, 25 (2015), pp. 1997–2023.
  • [15] Jordi Fonollosa, Sadique Sheik, Ramón Huerta, and Santiago Marco, Reservoir computing compensates slow response of chemosensor arrays exposed to fast varying gas concentrations in continuous monitoring, Sensors and Actuators B: Chemical, 215 (2015), pp. 618–629.
  • [16] Centers for Medicare & Medicaid, Basic stand alone carrier line items public use files, 2010.
  • [17] Alvaro Frank, Diego Fabregat-Traver, and Paolo Bientinesi, Large-scale linear regression: Development of high-performance routines, arXiv preprint arXiv:1504.07890, (2015).
  • [18] Alfred Haar, On the theory of orthogonal function systems, Math. Ann, 69 (1910), pp. 331–371.
  • [19] Piotr Jaranowski and Andrzej Królak, Gravitational-wave data analysis. formalism and sample applications: The gaussian case, Living Rev. Relativity, 15 (2012).
  • [20] Rudolph Emil Kalman, A new approach to linear filtering and prediction problems, Journal of Fluids Engineering, 82 (1960), pp. 35–45.
  • [21] Ron Kohavi, Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid., in KDD, vol. 96, Citeseer, 1996, pp. 202–207.
  • [22] Jakub Konečnỳ, Jie Liu, Peter Richtárik, and Martin Takáč, ms2gd: Mini-batch semi-stochastic gradient descent in the proximal setting, Selected Topics in Signal Processing, IEEE Journal on, 10 (2016), pp. 242–255.
  • [23] Jakub Konečnỳ and Peter Richtárik, Semi-stochastic gradient descent methods, arXiv preprint arXiv:1312.1666, (2013).
  • [24] Lucien Le Cam and Grace Lo Yang, Asymptotics in statistics: some basic concepts, Springer Science & Business Media, 2012.
  • [25] Alan J Miller, Algorithm as 274: Least squares routines to supplement those of gentleman, Applied Statistics, (1992), pp. 458–478.
  • [26] Philipp Moritz, Robert Nishihara, and Michael I Jordan, A linearly-convergent stochastic l-bfgs algorithm, arXiv preprint arXiv:1508.02087, (2015).
  • [27] Noboru Murata, A statistical study of on-line learning, Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, (1998), pp. 63–92.
  • [28] Jorge Nocedal and Stephen Wright, Numerical optimization, Springer Science & Business Media, 2006.
  • [29] Michael Nussbaum, Asymptotic equivalence of density estimation and gaussian white noise, The Annals of Statistics, (1996), pp. 2399–2430.
  • [30] Belgium Network of Open Source Analytical Consultants, Biglm on your big data in open source r, it just works – similar as in sas, 2012.
  • [31] Jeffrey S Racine, A primer on regression splines, URL: http://cran.r-project.org/web/packages/crs/vignettes/splineprimer.pdf, (2014).
  • [32] UCI Machine Learning Repository, Gas sensor array under dynamic gas mixtures data set, 2015.
  • [33] UCI Machine Learning Respository, Adult data set, 1996.
  • [34] Herbert Robbins and Sutton Monro, A stochastic approximation method, The Annals of Mathematical Statistics, (1951), pp. 400–407.
  • [35] Nicolas L Roux, Mark Schmidt, and Francis R Bach, A stochastic gradient method with an exponential convergence rate for finite training sets, in Advances in Neural Information Processing Systems, 2012, pp. 2663–2671.
  • [36] Nicol N Schraudolph, Jin Yu, and Simon Günter, A stochastic quasi-newton method for online convex optimization, in International Conference on Artificial Intelligence and Statistics, 2007, pp. 436–443.
  • [37] Shai Shalev-Shwartz and Tong Zhang, Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization, Mathematical Programming, (2014), pp. 1–41.
  • [38] Tanya M Teslovich, Kiran Musunuru, Albert V Smith, Andrew C Edmondson, Ioannis M Stylianou, Masahiro Koseki, James P Pirruccello, Samuli Ripatti, Daniel I Chasman, Cristen J Willer, et al., Biological, clinical and population relevance of 95 loci for blood lipids, Nature, 466 (2010), pp. 707–713.
  • [39] Aad W Van der Vaart, Asymptotic statistics, vol. 3, Cambridge University Press, 2000.
  • [40] Larry Wasserman, All of nonparametric statistics, Springer Science & Business Media, 2006.
  • [41] Robert WM Wedderburn, Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method, Biometrika, 61 (1974), pp. 439–447.
  • [42] Lin Xiao and Tong Zhang, A proximal stochastic gradient method with progressive variance reduction, SIAM Journal on Optimization, 24 (2014), pp. 2057–2075.