
Stacking as Accelerated Gradient Descent

Naman Agarwal
Google DeepMind
namanagarwal@google.com
   Pranjal Awasthi
Google Research
pranjalawasthi@google.com
   Satyen Kale
Google Research
satyenkale@google.com
   Eric Zhao
UC Berkeley, Google Research
eric.zh@berkeley.edu
  
Abstract

Stacking, a heuristic technique for training deep residual networks by progressively increasing the number of layers and initializing new layers by copying parameters from older layers, has proven quite successful in improving the efficiency of training deep neural networks. In this paper, we propose a theoretical explanation for the efficacy of stacking: viz., stacking implements a form of Nesterov’s accelerated gradient descent. The theory also covers simpler models such as the additive ensembles constructed in boosting methods, and provides an explanation for a similar widely-used practical heuristic for initializing the new classifier in each round of boosting. We also prove that for certain deep linear residual networks, stacking does provide accelerated training, via a new potential function analysis of the Nesterov’s accelerated gradient method which allows errors in updates. We conduct proof-of-concept experiments to validate our theory as well.

1 Introduction

Deep learning architectures are ubiquitous today and have been responsible for tremendous technological advances in machine learning. However, until 2006, training deep architectures was extremely challenging. The deep learning revolution of the past couple of decades was ushered in with the discovery that a classical technique, viz. greedy layer-wise pretraining, can be used to train general deep architectures (see [13, Section 15.1] for a historical account). Previously, only deep architectures with special structure, such as convolutions or recurrences, were known to be feasible to train. Greedy layer-wise pretraining is a very intuitive technique in which a deep network is built and trained in a stagewise manner. Starting with a small network that is easy to train, the technique prescribes adding new layers over a number of stages, and in each stage training the newly added layers (and, potentially, the older layers as well) for a certain number of training steps. This process continues until the desired model depth is reached.

More modern developments such as residual connections [19] and normalization layers [20] have made it possible to directly train deep networks without using greedy layer-wise pretraining. However, in recent years, the tremendous success of deep learning architectures based on transformers [35] in domains such as language modeling and computer vision [27, 5, 9, 34] has led to the trend of scaling model capacity with ever-increasing model sizes for improved performance [8, 1, 33]. This endeavour comes at a significant cost, as model training may take months and require several million dollars of compute resources [8]. As a result, there has been a surge of recent work aimed at faster training of large transformer-based models. These works include methods for sparse training such as mixture-of-experts (MoE) models [30, 21], methods for approximate sparse attention mechanisms [6], and better optimization methods [31, 23, 15].

In the effort to reduce the massive costs of training these giant transformer models, greedy layer-wise pretraining has re-emerged as a very effective strategy in recent times. Specifically, a technique for initializing the new layers known as stacking [14, 26] has been shown to be very effective in speeding up the training of deep transformer models. Stacking prescribes a heuristic for initializing the newly added layers: they should be initialized by copying parameters from the previously trained layers. [14] proposed to double the model depth at each stage by stacking an exact copy of the current model on top. [26] argue that doubling the model depth may be suboptimal and that a better approach is gradual stacking, where a few new layers (say 3-4) are added during each stage. These layers are initialized by copying the topmost layers of the existing model. See Figure 1 for an illustration of this technique for training deep transformer models.

Figure 1: Stacking for stagewise training language models. In each stage, a new transformer block is added, initialized with the parameters of the top block from the previous stage, and then trained for a certain number of steps.

The classical greedy layer-wise pretraining strategy does not prescribe a specific initialization for the new layers; in general, they are initialized randomly in some standard fashion. Stacking initialization provides a clear benefit over random initialization: Figure 2 shows one example of this effect, for training the BERT Base [10] model with 4 stages of stagewise training.

Figure 2: Stacking init vs random init for stagewise training of BERT Base model. Four stages are used with 168,750 steps in each stage. Stage boundaries are marked by vertical dashed lines. Stacking init provides a clear benefit over random init.

Structurally, greedy layer-wise pretraining resembles another classical technique, viz. boosting. In boosting, an additive ensemble of classifiers is constructed via greedy stagewise training of classifiers in the same greedy manner. Boosting algorithms such as AdaBoost [12] and Gradient Boosting [11] have found tremendous practical application, especially when using decision trees as base classifiers (e.g. XGBoost [7]). A heuristic similar to stacking has also found practical application in boosting algorithms. The heuristic is to initialize each new classifier (e.g. a decision tree) by copying over the just-trained classifier and then updating it using new training data. This process is illustrated in Figure 3. Due to the similarity with stacking for training deep transformer models, in the rest of the paper we use “stacking” to also refer to this initialization strategy in the context of boosting.

Figure 3: Stacking for boosting. In each stage, a new classifier is added, initialized with the parameters of the last trained classifier from the previous stage, and then trained for a certain number of steps.

While stacking based methods lead to impressive speedups in the training of transformer models and of additive models in boosting, we currently do not have a good theoretical understanding of why this is the case. Recently, [26] provided a theoretical explanation based on the assumption that each transformer block is a good few-shot learner. This assumption, along with a few others, is then used to conclude that copying parameters as stacking does leads to fast learning. However, the assumptions made in that paper are fairly strong and hard to verify.

In this work, we make progress on developing a theoretical understanding of the efficacy of stacking by studying it from an optimization perspective. In particular, our main contribution is that when viewed from the perspective of function optimization, stacking speeds up stagewise training by enabling a form of the accelerated gradient descent method (AGD) developed by Nesterov [25]. In other words, each stage of the stacking-initialized stagewise training procedure will reduce the training loss at an accelerated rate.

In contrast, we also show that without using any form of initialization, or in other words, initializing the new block/classifier to implement the zero function, stagewise training simply recovers usual (non-accelerated) gradient descent, whereas random initialization recovers stochastic gradient descent on a smoothed version of the loss. Hence, stacking initialization accelerates stagewise training over zero or random initialization. In more detail, our contributions are as follows:

  1. We propose a general theoretical framework for learning a prediction function $F$ via an ensemble, i.e., a sequence of functions $(f_{1},f_{2},\ldots,f_{T})$, trained in a greedy stagewise manner. The generality of our framework lets us unify classical approaches such as boosting [12, 11] that build the ensemble in an additive manner, and modern approaches that build the ensemble via stagewise training of residual function compositions (e.g., ResNets [19] and Transformer models [35]).

  2. Our proposed framework lets us formally establish the connection between the initialization strategy used for building the ensemble and the convergence properties of the resulting overall learning procedure. In particular, we show that the zero initialization strategy recovers vanilla functional gradient descent for both the additive (i.e., boosting) and residual compositional forms of learning, whereas random initialization recovers stochastic functional gradient descent (on a smoothed loss) for both types of models. Furthermore, in the case of additive models, the popular stacking initialization exactly recovers Nesterov's accelerated functional gradient descent. The consequence is that for $T$ stages of boosting with stacking initialization, the loss decreases at a rate of $O(T^{-2})$ for smooth losses, or $\exp(-\Omega(T/\sqrt{\kappa}))$ for smooth and strongly convex losses with condition number $\kappa$, as opposed to rates of $O(T^{-1})$ and $\exp(-\Omega(T/\kappa))$, respectively, for zero initialization.

  3. For the case of compositional models, we show that stacking initialization results in updates that look remarkably similar to Nesterov's accelerated functional gradient descent. Proving an accelerated rate in the general non-parametric functional setting seems intractable, so we analyze stacking in a special parametric setting of deep linear networks with a convex loss function. In this setting we prove (Theorem 3.1) that stacking initialization quantitatively leads to the same kind of convergence benefits over vanilla gradient descent as is observed for Nesterov's accelerated method. At the core of our proof is a novel potential function analysis of Nesterov's method with errors in the momentum term, which may be of independent interest (cf. Lemma 3.2).

  4. We perform proof-of-concept experiments (in Section 4) to validate our theory on synthetic and real-world data.

1.1 Related work

Boosting is a classical technique for constructing additive ensembles via greedy stagewise training, and has a long and rich history of work. We refer the interested reader to the excellent textbook of [28] for the literature on this topic.

The idea of training deep residual networks in a layer-wise manner has been explored in many prior works. In earlier studies [18, 4], the focus was on greedily adding trained layers to the model while keeping the bottom layers frozen, followed by a final fine-tuning step in which the entire network is trained. In recent years, progressive or gradual stacking [14, 16, 32, 26] has emerged as a powerful way to train deep networks, especially transformer-based architectures.

The empirical insight of [14] was that the attention patterns in neighboring layers of trained transformer models show remarkable similarity. Hence, copying the parameters from the previous layer provides a better initialization for the optimization procedure. As mentioned previously, [26] developed the gradual stacking approach based on the assumption that trained transformer blocks are good few-shot learners, and showed that gradual stacking leads to significant wall-clock improvements during training.

2 Stagewise training as functional gradient descent

Preliminaries.

We consider a fairly general supervised learning setting. Denote the input space by $\mathcal{X}$ and the output space by $\mathcal{Y}$. Examples $(x,y)\in\mathcal{X}\times\mathcal{Y}$ are drawn from a distribution $D$ (which may simply be the empirical data distribution in the case of empirical risk minimization). We aim to model the input-output relationship via predictions in $\mathbb{R}^{d}$, for some dimension parameter $d$. Given an example $(x,y)\in\mathcal{X}\times\mathcal{Y}$, the quality of a prediction $\widehat{y}\in\mathbb{R}^{d}$ is measured via a loss function $\ell:\mathbb{R}^{d}\times\mathcal{Y}\to\mathbb{R}$. Predictions are computed using functions $f:\mathcal{X}\to\mathbb{R}^{d}$. We will assume that the predictor functions $f$ are square integrable with respect to $D$, i.e., $\operatorname{\mathbb{E}}_{(x,y)\sim D}[\|f(x)\|^{2}]<\infty$. The space of such functions forms a Hilbert space, denoted $\mathcal{L}_{2}$, with the inner product defined as $\langle f,g\rangle=\operatorname{\mathbb{E}}_{(x,y)\sim D}[\langle f(x),g(x)\rangle]$. Unless specified otherwise, all functions in the subsequent discussion will be assumed to be in $\mathcal{L}_{2}$. The loss function can then naturally be extended to predictor functions $f\in\mathcal{L}_{2}$ by defining, with some abuse of notation, $\ell(f):=\operatorname{\mathbb{E}}_{(x,y)\sim D}[\ell(f(x),y)]$. The goal of training is to obtain a function $f\in\mathcal{L}_{2}$ that minimizes $\ell(f)$.

In the rest of this section, we perform the analysis in a purely functional setting, which affords a convenient analysis. However, we note that, in practice, functions are parameterized (say by neural networks) and hence update rules for functions may not always be realizable via the specific parameterization used. The functional setting allows us to sidestep realizability issues and to focus on the conceptual message that stacking initialization enables accelerated updates.

We now define a general ensemble learning setup within the above setting. In this setup, we aim to approximate the minimizer of $\ell$ on $\mathcal{L}_{2}$ via an ensemble, which is a sequence of functions $(f_{1},f_{2},\ldots,f_{T})$, where $T>0$ is a given parameter defining the size of the ensemble. The functions in the ensemble are typically "simple" in the sense that they are chosen from a class of functions that is easy to optimize over. A predictor function can be obtained from an ensemble $(f_{1},f_{2},\dots,f_{T})$ by aggregating its constituent functions into a single function $F_{T}:\mathcal{X}\to\mathbb{R}^{d}$. The loss of an ensemble can then be defined (again with some abuse of notation) in terms of its aggregation as $\ell((f_{1},f_{2},\ldots,f_{T})):=\ell(F_{T})$. Two specific aggregation operators we consider are the following:

  1. Addition (e.g., boosting): a summation over the ensemble outputs, $F_{T}=f_{1}+f_{2}+\cdots+f_{T}$.

  2. Residual composition (e.g., deep residual neural networks): the composed function $F_{T}=(I+f_{T})\circ(I+f_{T-1})\circ\cdots\circ(I+f_{1})$, where the domain is $\mathcal{X}=\mathbb{R}^{d}$ and $I:\mathbb{R}^{d}\to\mathbb{R}^{d}$ is the identity mapping. (A minimal code sketch of both aggregations follows this list.)
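Both aggregation operators can be written down directly. The following minimal numpy sketch (our own illustration; the function names and the linear component functions are our choices) builds both aggregations from a list of component functions:

```python
import numpy as np

def aggregate_additive(fs):
    """Additive ensemble: F_T(x) = f_1(x) + ... + f_T(x)."""
    return lambda x: sum(f(x) for f in fs)

def aggregate_residual(fs):
    """Residual composition: F_T = (I + f_T) o ... o (I + f_1)."""
    def F(x):
        h = x
        for f in fs:          # apply blocks bottom-up, each adding a residual update
            h = h + f(h)
        return h
    return F

# Example with simple linear component functions f_i(x) = A_i x on R^3.
rng = np.random.default_rng(0)
fs = [(lambda x, A=0.1 * rng.standard_normal((3, 3)): A @ x) for _ in range(4)]
x = rng.standard_normal(3)
print(aggregate_additive(fs)(x), aggregate_residual(fs)(x))
```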

Greedy stagewise training.

Stagewise training is a simple greedy procedure to train ensembles in a progressive manner. Suppose we have already obtained a (partial) ensemble $(f_{1},f_{2},\ldots,f_{t})$. Then the next function in the ensemble, $f_{t+1}$, is ideally obtained by minimizing the loss of the new ensemble, i.e., $f_{t+1}=\arg\min_{f}\ell((f_{1},f_{2},\ldots,f_{t},f))$.

However, in practice, this ideal is hard to implement, and instead two heuristics are commonly used: (a) the new function to be trained is initialized in some carefully chosen manner, and (b) the optimization above is done using early stopping, i.e. a few steps of gradient descent, which ensures that the new function stays close to initialization. We analyze these heuristics in a functional optimization setting as follows.

First, we assume that the function $f_{t+1}$ to be trained is initialized at some carefully chosen value $f_{t+1}^{0}$. For notational convenience, we denote the aggregation of the ensemble $(f_{1},f_{2},\ldots,f_{t},f_{t+1}^{0})$ by $F_{t+1}^{0}$ and that of the generic ensemble $(f_{1},f_{2},\ldots,f_{t},f)$ by $F$.

Next, we note that an exact analysis of early stopping quickly becomes technically intractable. Instead, for a theoretical analysis, we model the heuristic of early stopping by using $\ell_{2}$ regularization around the initialization and linearizing the loss near the initialization, as follows. It is known (see, e.g., [13, Section 7.8]) that early stopping acts as a form of $\ell_{2}$ regularization which ensures that the trained function $f_{t+1}$ remains close to its initialization $f_{t+1}^{0}$, which in turn implies that $F_{t+1}$ remains close to $F_{t+1}^{0}$. Thus, early stopping can be modeled as minimizing $\ell(F)+\frac{\lambda}{2}\|F-F_{t+1}^{0}\|^{2}$ for some regularization parameter $\lambda$. Further, since the trained function remains close to the initialization, we also approximate $\ell(F)$ by its linearization around the initialization: $\ell(F)\approx\ell(F_{t+1}^{0})+\langle\nabla\ell(F_{t+1}^{0}),F-F_{t+1}^{0}\rangle$. Here, $\nabla\ell(\cdot)$ is the Fréchet derivative, and $\langle\cdot,\cdot\rangle$ denotes the inner product in $\mathcal{L}_{2}$. These considerations lead to the following key modeling assumption.

Assumption 2.1.

The result of the early stopped training is given by

$F_{t+1}=\arg\min_{F\in\mathcal{L}_{2}}\ \ell(F_{t+1}^{0})+\langle\nabla\ell(F_{t+1}^{0}),F-F_{t+1}^{0}\rangle+\frac{\lambda}{2}\|F-F_{t+1}^{0}\|^{2}.$

In other words,

$F_{t+1}=F_{t+1}^{0}-\frac{1}{\lambda}\nabla\ell(F_{t+1}^{0}).$ (1)

We can now consider specific initialization strategies (i.e. zero initialization, random initialization, and stacking initialization) in the context of additive and residual compositional models and see how these initializations lead to various forms of functional gradient descent.

Stagewise training with zero initialization recovers functional gradient descent.

First, consider stagewise training where functions are initialized to be zero functions, i.e., $f_{t+1}^{0}=0$. It is easy to see that with this initialization, for both additive and residual compositional models, we have $F_{t+1}^{0}=F_{t}$. Thus, from (1), we have that the updated ensemble's predictor can be written as

$F_{t+1}=F_{t}-\frac{1}{\lambda}\nabla\ell(F_{t}).$

This exactly describes functional gradient descent with step size $\frac{1}{\lambda}$. In the additive setting this is well known: indeed, boosting can be seen as functional gradient descent [24]. The result for the residual compositional setting appears to be new.
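As a concrete finite-sample illustration (our own construction, not part of the paper's formal setup), one can represent a predictor by its vector of predictions on $n$ training points, so that functional gradient descent becomes ordinary gradient descent on that vector. Under Assumption 2.1 with zero initialization, each stage then performs exactly one gradient step:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, T = 100, 5.0, 50
y = rng.standard_normal(n)      # targets
F = np.zeros(n)                 # F_0: predictions of the empty ensemble

def loss(F):
    """Empirical squared loss of the prediction vector F."""
    return 0.5 * np.sum((F - y) ** 2)

def grad(F):
    """Gradient of the empirical squared loss with respect to F."""
    return F - y

for t in range(T):
    # Zero initialization gives F_{t+1}^0 = F_t, so update (1) is a plain gradient step.
    F = F - (1.0 / lam) * grad(F)

print(loss(F))                  # the loss decreases monotonically across stages
```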

Stagewise training with random initialization recovers stochastic functional gradient descent on smoothed loss.

We now consider stagewise training where functions are initialized randomly, i.e., $f_{t+1}^{0}$ is a randomly drawn function, independent of all randomness up to stage $t$. In the following, we will assume that $\operatorname{\mathbb{E}}[f_{t+1}^{0}]=0$, where the $0$ on the right-hand side denotes the zero function. With this initialization, for both additive and residual compositional models, we have $F_{t+1}^{0}=F_{t}+g_{t}$, where $g_{t}=f_{t+1}^{0}$ for additive models and $g_{t}=f_{t+1}^{0}\circ F_{t}$ for residual compositional models. In either case, note that $\operatorname{\mathbb{E}}[g_{t}]=0$. Now define the loss functional $\ell_{t}(F):=\operatorname{\mathbb{E}}[\ell(F+g_{t})]$. Since $\operatorname{\mathbb{E}}[g_{t}]=0$, we can interpret $\ell_{t}$ as a randomized smoothing of $\ell$, similar to convolving with a Gaussian. Then, from (1), we have that the updated ensemble's predictor can be written as

$F_{t+1}=F_{t}-\frac{1}{\lambda}\left(\nabla\ell(F_{t}+g_{t})-\lambda g_{t}\right).$

Now, note that

$\operatorname{\mathbb{E}}\left[\nabla\ell(F_{t}+g_{t})-\lambda g_{t}\right]=\nabla\ell_{t}(F_{t}).$

In other words, the above update can be seen as a stochastic functional gradient descent step on the smoothed loss function $\ell_{t}$.

Stagewise training with stacking initialization recovers accelerated functional gradient descent.

We now consider stagewise training where functions are initialized in a stacking-like fashion with $f_{t+1}^{0}=f_{t}$, which we will refer to as the stacking initialization. When the ensemble aggregation operator is addition, we have $F_{t+1}^{0}=f_{t}+F_{t}=F_{t}+(F_{t}-F_{t-1})$, and hence (1) implies that the updated ensemble's predictor is

$F_{t+1}=F_{t}+(F_{t}-F_{t-1})-\frac{1}{\lambda}\nabla\ell\big(F_{t}+(F_{t}-F_{t-1})\big).$

The above formula essentially describes Nesterov's accelerated gradient descent, which has the following update rule:

$F_{t+1}=F_{t}+\beta(F_{t}-F_{t-1})-\frac{1}{\lambda}\nabla\ell\big(F_{t}+\beta(F_{t}-F_{t-1})\big).$ (2)

Here, $\beta\in[0,1)$ is a constant that can depend on $t$. In fact, we can exactly recover Nesterov's accelerated gradient descent if we modify the stacking initialization to $f_{t+1}^{0}=\beta f_{t}$. Thus, stacking enables accelerated descent for training additive models.
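The equivalence can be checked numerically. The sketch below (again our own finite-sample illustration, using the prediction-vector view and a squared loss) maintains the additive ensemble explicitly, initializes each new component as $\beta f_{t}$, applies the update from Assumption 2.1, and confirms that the aggregated predictors coincide with the iterates of Nesterov's update (2):

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam, beta, T = 50, 4.0, 0.9, 30
y = rng.standard_normal(n)

def grad(F):
    """Gradient of 0.5 * ||F - y||^2 in the prediction-vector view."""
    return F - y

# Stagewise additive training with stacking initialization f_{t+1}^0 = beta * f_t.
fs = [np.zeros(n)]                               # start from the trivial ensemble f_1 = 0
F = np.zeros(n)                                  # F_1 = f_1
for t in range(T):
    f_init = beta * fs[-1]                       # stacking initialization
    F_init = F + f_init                          # equals F_t + beta * (F_t - F_{t-1})
    f_new = f_init - (1.0 / lam) * grad(F_init)  # early-stopped stage, Assumption 2.1
    fs.append(f_new)
    F = F + f_new

# Nesterov's accelerated update (2) applied directly to the aggregated predictor.
G_prev, G_cur = np.zeros(n), np.zeros(n)
for t in range(T):
    G_look = G_cur + beta * (G_cur - G_prev)
    G_prev, G_cur = G_cur, G_look - (1.0 / lam) * grad(G_look)

print(np.max(np.abs(F - G_cur)))                 # ~0: the two trajectories coincide
```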

When the ensemble aggregation operator is residual composition, stagewise training with the stacking initialization $f_{t+1}^{0}=f_{t}$ results in

$F_{t+1}^{0}=(I+f_{t})\circ F_{t}=F_{t}+f_{t}\circ F_{t}.$

Equation (1) therefore implies that the updated ensemble's predictor is

$F_{t+1}=F_{t}+f_{t}\circ F_{t}-\frac{1}{\lambda}\nabla\ell(F_{t}+f_{t}\circ F_{t}).$ (3)

In contrast, Nesterov's update rule (2), together with the fact that for residual compositional models

$F_{t}-F_{t-1}=(I+f_{t})\circ F_{t-1}-F_{t-1}=f_{t}\circ F_{t-1},$

yields the following equation for $F_{t+1}$:

$F_{t+1}=F_{t}+\beta\,f_{t}\circ F_{t-1}-\frac{1}{\lambda}\nabla\ell(F_{t}+\beta\,f_{t}\circ F_{t-1}).$ (4)

Comparing (3) and (4), barring the minor difference in the $\beta$ parameter, which can easily be rectified as in the additive case by setting $f_{t+1}^{0}=\beta f_{t}$, the major difference is that $f_{t}\circ F_{t}$ replaces $f_{t}\circ F_{t-1}$. Although possibly intractable to prove formally, we believe that the updates in (3) also provide an accelerated convergence rate, since we expect $F_{t-1}$ to be close to $F_{t}$ as the iterates converge to the optimal function.

In the following section, we show that in certain deep linear networks, the above intuition is indeed correct and provide a rigorous proof that stacking provides an accelerated convergence rate.

3 Accelerated convergence of deep linear networks by stacking

To demonstrate that stacking can provide a provably accelerated rate of convergence, we now turn to the narrower setting of training deep residual linear networks, which are fully connected feedforward neural networks with residual connections and without non-linear activations. Such networks are a common subject of study in the theory of deep learning [29, 22, 17]. As they have no non-linear components, deep linear networks effectively compute a linear function, albeit via a parametrization as a product of the weight matrices.

Setup.

Consider again the general supervised learning setting from Section 2 and suppose, as is often the case in modern neural networks, that examples consist of inputs $x\in\mathbb{R}^{d}$ and outputs $y\in\mathcal{Y}$. The loss function $\ell:\mathbb{R}^{d}\times\mathcal{Y}\to\mathbb{R}$ is assumed to be convex in the first argument. Let the samples be drawn from a distribution $D$ over $\mathbb{R}^{d}\times\mathcal{Y}$. Then the expected loss of the linear predictor $x\mapsto Wx$ for a matrix $W\in\mathbb{R}^{d\times d}$ is (with some abuse of notation) $\ell(W):=\operatorname{\mathbb{E}}_{(x,y)\sim D}[\ell(Wx,y)]$. In the following, we will assume the expected loss $\ell(W)$ is $L$-smooth and $\mu$-strongly convex in $W$, by which we mean that the following inequalities hold for any $W,V\in\mathbb{R}^{d\times d}$:

$\ell(W)+\langle\nabla\ell(W),V-W\rangle+\frac{\mu}{2}\lVert W-V\rVert^{2}\leq\ell(V)\leq\ell(W)+\langle\nabla\ell(W),V-W\rangle+\frac{L}{2}\lVert W-V\rVert^{2}.$

Here, for matrices $W,V\in\mathbb{R}^{d\times d}$, $\langle W,V\rangle=\operatorname{Tr}(W^{\top}V)$, and $\lVert W\rVert$ is the Frobenius norm of $W$. The condition number $\kappa$ of the loss is defined as $\kappa:=\frac{L}{\mu}$.

The deep residual neural networks we consider have $t$ layers with weight matrices $w_{1},w_{2},\ldots,w_{t}$, and the function they compute is $x\mapsto W_{t}x$, where

$W_{t}:=(I+w_{t})(I+w_{t-1})\cdots(I+w_{1}).$

Here, $I\in\mathbb{R}^{d\times d}$ is the identity matrix providing the residual connection. The expected loss of the neural network described above on the data is $\ell(W_{t})$.

Derivation of stacking updates.

Suppose we train the deep residual linear network described above using stacking initialization, but incorporating $\beta$-scaling: i.e., to train the $(t+1)$-th layer, its weight matrix is initialized to $w_{t+1}^{0}=\beta w_{t}$ for some constant $\beta\in[0,1]$, and then trained. Following the exact same steps as in the derivation of the stacking updates in the functional setting of Section 2, we end up with the following formula for $W_{t+1}$:

$W_{t+1}=W_{t}+\beta w_{t}W_{t}-\frac{1}{\lambda}\nabla\ell(W_{t}+\beta w_{t}W_{t}).$

When $W_{t-1}$ is non-singular, we have $w_{t}=W_{t}W_{t-1}^{-1}-I=(W_{t}-W_{t-1})W_{t-1}^{-1}$, so the above equation can be rewritten as

$W_{t+1}=W_{t}+\beta(W_{t}-W_{t-1})W_{t-1}^{-1}W_{t}-\frac{1}{\lambda}\nabla\ell\big(W_{t}+\beta(W_{t}-W_{t-1})W_{t-1}^{-1}W_{t}\big).$ (5)

As previously noted, (5) differs in form from Nesterov's AGD method, whose updates would be

$W_{t+1}=W_{t}+\beta(W_{t}-W_{t-1})-\frac{1}{\lambda}\nabla\ell\big(W_{t}+\beta(W_{t}-W_{t-1})\big).$ (6)
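For reference, a single step of each rule can be written as follows. This is a minimal sketch of our own; `grad_loss` stands for any oracle returning $\nabla\ell(W)$, and the step size is taken to be $1/L$ as in the analysis below:

```python
import numpy as np

def stacking_step(W_cur, W_prev, grad_loss, beta, L):
    """One stage of stacking-initialized stagewise training, Eq. (5)."""
    look = W_cur + beta * (W_cur - W_prev) @ np.linalg.inv(W_prev) @ W_cur
    return look - grad_loss(look) / L

def nesterov_step(W_cur, W_prev, grad_loss, beta, L):
    """One step of Nesterov's accelerated gradient method, Eq. (6)."""
    look = W_cur + beta * (W_cur - W_prev)
    return look - grad_loss(look) / L
```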

Accelerated convergence for stacking updates.

Despite the differences from Nesterov's method, we can show that stacking still yields a provably accelerated convergence rate. Let $W^{*}:=\arg\min_{W}\ell(W)$. Suppose that $W^{*}$ is non-singular, i.e., its smallest singular value satisfies $\sigma_{\min}(W^{*})>0$. Theorem 3.1 shows that as long as the first two layers are initialized so that $W_{1}$ and $W_{2}$ are close to optimal, stacking results in a suboptimality gap of $\exp(-\widetilde{\Omega}(T/\sqrt{\kappa}))$ after $T$ stages of stacking. This is of the same order as the rate obtained by Nesterov's acceleration; note that, in comparison, stagewise training with zero initialization would result in a suboptimality gap of $\exp(-\widetilde{\Omega}(T/\kappa))$.

Theorem 3.1.

Consider stagewise training with stacking initialization of a deep residual linear network in the setup described above, with $\beta=\tfrac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$ and $\lambda=L$ (so that the step size $\tfrac{1}{\lambda}$ equals the usual $\tfrac{1}{L}$). Suppose that the first layer weights are initialized so that $W_{1}=V_{0}-\frac{1}{L}\nabla\ell(V_{0})$, where $V_{0}\in\mathbb{R}^{d\times d}$ satisfies $\ell(V_{0})-\ell(W^{*})\leq\delta$ for $\delta\coloneqq\tfrac{\mu\alpha^{2}\sigma_{\min}(W^{*})^{2}}{256\beta^{2}d}$ and $\alpha^{-1}\coloneqq(\kappa-1)\sqrt{2\sqrt{\kappa}(\kappa-1)(\sqrt{\kappa}-3)}$. Then after $T$ stages of stacking, we have

$\ell(W_{T})-\ell(W^{*})\leq\exp(-\widetilde{\Omega}(T/\sqrt{\kappa})).$

The $\widetilde{\Omega}(\cdot)$ notation above hides polylogarithmic dependence on the problem parameters for clarity of presentation. Precise expressions can be found in the proof. The primary insight behind Theorem 3.1 is that Nesterov's accelerated gradient method is relatively robust to perturbations in its update rule. This robustness is formalized below in Lemma 3.2. The lemma is stated in a fairly general, standalone setting since it may be of independent interest.

Lemma 3.2 (Robustness of Nesterov’s accelerated gradient method).

Let $\mathcal{F}$ be a Hilbert space and $\ell:\mathcal{F}\to\mathbb{R}$ an $L$-smooth and $\mu$-strongly convex function to be minimized on $\mathcal{F}$. Consider the iterates $x_{0},y_{0},\dots,x_{T},y_{T}\in\mathcal{F}$ with $x_{0}=y_{0}$ chosen arbitrarily, and the update rules

$y_{t+1}\coloneqq x_{t}-\frac{1}{L}\nabla\ell(x_{t}),$
$x_{t+1}\coloneqq y_{t+1}+\beta(y_{t+1}-y_{t})+\Delta_{t+1},$ (7)

where $\beta:=\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$, $\kappa=\tfrac{L}{\mu}$, and $\Delta_{1},\Delta_{2},\dots,\Delta_{T-1}\in\mathcal{F}$ are error terms such that $\lVert\Delta_{t}\rVert\in O(\kappa^{-2}\lVert y_{t}-y_{t-1}\rVert)$ for all $t\in[T-1]$. Then the number of iterations needed to reach a suboptimality gap of $\varepsilon$ is of order $\widetilde{O}(\sqrt{\kappa}\log(\kappa/\varepsilon))$. Specifically, for any $T\geq 2$,

$\ell(y_{T})-\ell(x^{*})\leq\left(\tfrac{2\sqrt{\kappa}-2}{2\sqrt{\kappa}-1}\right)^{T}\left(\tfrac{L}{2}+\tfrac{8\sqrt{\kappa}\mu}{4(4\sqrt{\kappa}-3)}\right)\lVert x_{0}-x^{*}\rVert^{2}$

and

$\ell(y_{T})-\ell(x^{*})\leq\left(\tfrac{2\sqrt{\kappa}-2}{2\sqrt{\kappa}-1}\right)^{T}\tfrac{8\sqrt{\kappa}}{4\sqrt{\kappa}-3}\,(\ell(y_{0})-\ell(x^{*})).$

We will apply Lemma 3.2 using the correspondence $x_{0}=y_{0}=W_{0}=V_{0}$ and, for all $t\geq 1$, $y_{t}=W_{t}$ and $x_{t}=W_{t}+\beta(W_{t}-W_{t-1})W_{t-1}^{-1}W_{t}$. Lemma 3.2 implies that even though stagewise training with stacking differs from Nesterov's method, their similar form allows us to express the former as a perturbation of the latter. In particular, if we write our stacking update (5) for deep residual linear networks as a perturbation of Nesterov's method, as in (7), the perturbation term is exactly

$\Delta_{t}=\beta(W_{t}-W_{t-1})W_{t-1}^{-1}W_{t}-\beta(W_{t}-W_{t-1})$ (8)

for all $t\in[2,T-1]$. That is, if we examine the sequence of iterates $W_{1},\dots,W_{T}$ produced by stagewise training with stacking initialization, the term $\Delta_{t}$ measures the disagreement between the realized iterate $W_{t+1}$ and the iterate that Nesterov's method would prescribe conditioned on the iterates $W_{1},\dots,W_{t}$ from previous timesteps. We note that, even when these perturbation terms $\Delta_{1},\dots,\Delta_{T-1}$ are small in norm, it is possible for Nesterov's method to describe an iterate sequence that diverges significantly in norm from the iterates realized by stagewise training.

Rewriting (8) as $\Delta_{t}=\beta(W_{t}-W_{t-1})W_{t-1}^{-1}(W_{t}-W_{t-1})$, we immediately see that in order to satisfy the requirement of Lemma 3.2 that $\lVert\Delta_{t}\rVert\in O(\kappa^{-2}\lVert W_{t}-W_{t-1}\rVert)$, we simply need $\lVert\beta W_{t-1}^{-1}(W_{t}-W_{t-1})\rVert\in O(\kappa^{-2})$. This is satisfied when $W_{t}\approx W_{t-1}$ and $W_{t-1}$ is reasonably non-singular. A sufficient condition for this is that the iterates are sufficiently close to the ground-truth solution $W^{*}$ and $W^{*}$ is non-singular, which explains the conditions of Theorem 3.1.
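The perturbation term can be monitored directly along the iterates. A small sketch of this check (our own code; `W_cur` and `W_prev` denote consecutive iterates $W_{t}$ and $W_{t-1}$):

```python
import numpy as np

def perturbation(W_cur, W_prev, beta):
    """Delta_t = beta (W_t - W_{t-1}) W_{t-1}^{-1} (W_t - W_{t-1}), the rewritten Eq. (8)."""
    D = W_cur - W_prev
    return beta * D @ np.linalg.inv(W_prev) @ D

def satisfies_lemma_condition(W_cur, W_prev, beta, alpha):
    """Check ||Delta_t||_F <= alpha * ||W_t - W_{t-1}||_F, as required by Lemma 3.2."""
    D = W_cur - W_prev
    lhs = np.linalg.norm(perturbation(W_cur, W_prev, beta))
    return lhs <= alpha * np.linalg.norm(D)
```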

We now prove Theorem 3.1 formally.

Proof of Theorem 3.1.

As described above, the deep linear networks $W_{1},\dots,W_{T}$, as defined in (5), can be written as iterates of a variant of Nesterov's acceleration (7), where we set the gradient step size $\tfrac{1}{\lambda}$ to the usual $\tfrac{1}{L}$ and the stacking parameter $\beta$ to match the usual momentum parameter setting of $\tfrac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$.

Sufficient claim.

To prove the main result, it suffices to show that $\lVert\Delta_{t}\rVert\leq\alpha\lVert W_{t}-W_{t-1}\rVert$ for all $t\in[T-1]$. With this claim, Lemma 3.2 immediately gives a convergence rate of

$\ell(W_{T})-\ell(W^{*})\leq\left(\tfrac{2\sqrt{\kappa}-2}{2\sqrt{\kappa}-1}\right)^{T}\left(\tfrac{L}{2}+\tfrac{8\sqrt{\kappa}\mu}{4(4\sqrt{\kappa}-3)}\right)\lVert W_{1}-W^{*}\rVert^{2}.$

This gives the desired result $\ell(W_{T})-\ell(W^{*})\in\exp(-\Omega(\tfrac{T}{\sqrt{\kappa}}+\log(\tfrac{T}{\sqrt{\kappa}})))$ as

$\tfrac{1}{C}\left(\ell(W_{T})-\ell(W^{*})\right)\leq\left(1-\tfrac{1}{2\sqrt{\kappa}-1}\right)^{T}\leq\exp(-\Omega(\tfrac{T}{\sqrt{\kappa}}))$

where $C=\left(\tfrac{L}{2}+\tfrac{8\sqrt{\kappa}\mu}{4(4\sqrt{\kappa}-3)}\right)\lVert W_{1}-W^{*}\rVert^{2}\in\exp(-\Omega(\log(\tfrac{T}{\sqrt{\kappa}})))$.

Thus, we need to show that $\lVert\Delta_{t}\rVert\leq\alpha\lVert W_{t}-W_{t-1}\rVert$ for all $t\in[T-1]$. For this, we use the following claim, where $\eta\coloneqq\sqrt{\frac{16\delta}{\mu}}$:

Claim 3.3.

Suppose that $\lVert W_{t-1}-W^{*}\rVert\leq\eta$ and $\lVert W_{t}-W^{*}\rVert\leq\eta$. Then $\lVert\Delta_{t}\rVert\leq\alpha\lVert W_{t}-W_{t-1}\rVert$.

Proof of Claim 3.3.

We have

$\lVert\Delta_{t}\rVert=\lVert\beta(W_{t}-W_{t-1})(W_{t-1}^{-1}W_{t}-I)\rVert\leq\beta\lVert W_{t}-W_{t-1}\rVert\,\lVert W_{t-1}^{-1}W_{t}-I\rVert.$

We can further bound the right-most factor as follows:

$\lVert W_{t-1}^{-1}W_{t}-I\rVert=\lVert W_{t-1}^{-1}(W_{t}-W_{t-1})\rVert\leq\lVert W_{t-1}^{-1}\rVert\,\lVert W_{t}-W_{t-1}\rVert\leq\lVert W_{t-1}^{-1}\rVert\left(\lVert W_{t}-W^{*}\rVert+\lVert W_{t-1}-W^{*}\rVert\right),$ (9)

where the first inequality uses submultiplicativity of the norm and the second the triangle inequality.

The following fact shows that $W_{t-1}$ is indeed far from singular as long as it is close enough to $W^{*}$.

Fact 3.4.

$W_{t-1}$ is invertible and $\lVert W_{t-1}^{-1}\rVert\leq\frac{\sqrt{d}}{\sigma_{\min}(W^{*})-\eta}$.

Proof of Fact 3.4.

Since $\|\cdot\|$ is the Frobenius norm, we can upper bound the singular values of the matrix $W^{*}-W_{t-1}$ by $\sigma_{\max}(W^{*}-W_{t-1})\leq\eta$, where we use $\sigma_{\max}(W)$ to denote the largest singular value of a matrix $W$. We can then use Weyl's inequality to argue that, for any $i\in[d]$, the difference between the $i$-th largest singular values of $W_{t-1}$ and $W^{*}$ is upper bounded by

$\left|\sigma_{i}(W_{t-1})-\sigma_{i}(W^{*})\right|\leq\sigma_{\max}(W_{t-1}-W^{*})\leq\eta.$

The smallest singular value of $W_{t-1}$ is thus at least $\sigma_{\min}(W_{t-1})\geq\sigma_{\min}(W^{*})-\eta$. Since our choice of $\delta$ guarantees that $\eta<\sigma_{\min}(W^{*})$, we have $\sigma_{\min}(W_{t-1})>0$, so $W_{t-1}$ is invertible. This also implies that the largest singular value of $W_{t-1}^{-1}$ is at most $\sigma_{\max}(W_{t-1}^{-1})\leq\frac{1}{\sigma_{\min}(W^{*})-\eta}$. We can therefore upper bound the Frobenius norm of $W_{t-1}^{-1}$, as claimed, by

$\lVert W_{t-1}^{-1}\rVert\leq\sqrt{\sum_{i\in[d]}\sigma_{i}(W_{t-1}^{-1})^{2}}\leq\sqrt{d}\,\sigma_{\max}(W_{t-1}^{-1})\leq\frac{\sqrt{d}}{\sigma_{\min}(W^{*})-\eta}.$

Using Fact 3.4 and (9), we can now bound

$\lVert W_{t-1}^{-1}W_{t}-I\rVert\leq\frac{2\eta\sqrt{d}}{\sigma_{\min}(W^{*})-\eta}.$

Since $\eta=\sqrt{\frac{16\delta}{\mu}}$, we have by the definition of $\delta$ that $\frac{2\eta\sqrt{d}}{\sigma_{\min}(W^{*})-\eta}\leq\tfrac{\alpha}{\beta}$. This concludes our proof of the claim, as

$\lVert\Delta_{t}\rVert\leq\beta\,\lVert W_{t}-W_{t-1}\rVert\,\lVert W_{t-1}^{-1}W_{t}-I\rVert\leq\beta\cdot\tfrac{\alpha}{\beta}\,\lVert W_{t}-W_{t-1}\rVert.$

We can now complete the proof by showing the following claim via induction on $t$:

Claim 3.5.

For all $t\in[T-1]$, we have $\lVert\Delta_{t}\rVert\leq\alpha\lVert W_{t}-W_{t-1}\rVert$ and $\lVert W_{t}-W^{*}\rVert\leq\eta$.

Proof of Claim 3.5.

We proceed by induction on $t$.

Base case: $t=1$.

Note that $W_{0}=V_{0}$. We have

$\tfrac{\mu}{2}\lVert W_{0}-W^{*}\rVert^{2}\leq\ell(W_{0})-\ell(W^{*})\leq\delta,$

which implies that $\lVert W_{0}-W^{*}\rVert\leq\sqrt{\frac{2\delta}{\mu}}\leq\eta$. Furthermore, since $W_{1}=W_{0}-\frac{1}{L}\nabla\ell(W_{0})$ and $\ell$ is $L$-smooth, we have $\ell(W_{1})\leq\ell(W_{0})$, which implies, exactly as above, that $\lVert W_{1}-W^{*}\rVert\leq\eta$. Thus, by Claim 3.3, we conclude that $\lVert\Delta_{1}\rVert\leq\alpha\lVert W_{1}-W_{0}\rVert$.

Inductive step.

Fix $t\in[2,T-1]$ and assume, as our inductive hypothesis, that $\lVert\Delta_{\tau}\rVert\leq\alpha\lVert W_{\tau}-W_{\tau-1}\rVert$ and $\lVert W_{\tau}-W^{*}\rVert\leq\eta$ for all $\tau\in[t-1]$. Due to this inductive hypothesis, we can invoke Lemma 3.2 to observe that

$\tfrac{\mu}{2}\lVert W_{t}-W^{*}\rVert^{2}\leq\ell(W_{t})-\ell(W^{*})\leq\left(\tfrac{2\sqrt{\kappa}-2}{2\sqrt{\kappa}-1}\right)^{t}\tfrac{8\sqrt{\kappa}}{4\sqrt{\kappa}-3}\,(\ell(W_{0})-\ell(W^{*}))\leq 8\delta.$

Thus, we have $\lVert W_{t}-W^{*}\rVert\leq\sqrt{\frac{16\delta}{\mu}}=\eta$. Since $\lVert W_{t-1}-W^{*}\rVert\leq\eta$ by the induction hypothesis, we can now use Claim 3.3 to conclude that $\lVert\Delta_{t}\rVert\leq\alpha\lVert W_{t}-W_{t-1}\rVert$. This completes the inductive proof.

Theorem 3.1 is a local convergence result: it assumes that the first two layers put the network in the vicinity of the optimal solution $W^{*}$. We can also provide a global convergence result by adding a small warmup phase to stacking, that is, by training the first few stages (a number independent of $T$) of the deep linear network without stacking.

Corollary 3.6 (Corollary of Theorem 3.1).

Consider stagewise training of a deep residual linear network in the setup described above, with an initial warmup phase of zero initialization for $\widetilde{O}(\kappa)$ stages followed by stacking initialization for the remaining stages. Then after $T$ total stages, we have

$\ell(W_{T})-\ell(W^{*})\leq\exp(-\widetilde{\Omega}(T/\sqrt{\kappa})+\widetilde{O}(\sqrt{\kappa})).$
Proof.

We first recall the standard convergence rate of gradient descent. In the same setting as Lemma 3.2, where we minimize an $L$-smooth and $\mu$-strongly convex function on some Hilbert space $\mathcal{F}$, we can define gradient descent iterates with the update rule $x_{t+1}=x_{t}-\tfrac{1}{L}\nabla\ell(x_{t})$. For the resulting sequence of iterates $x_{0},\dots,x_{T}$, it is known that $\ell(x_{T})-\ell(x^{*})\leq\exp(-T/\kappa)(\ell(x_{0})-\ell(x^{*}))$; see, e.g., [3].

We further recall that, in the stagewise training of deep residual linear networks, we can write the iterates as $W_{t+1}=W_{t}+w_{t+1}^{0}W_{t}-\tfrac{1}{L}\nabla\ell(W_{t}+w_{t+1}^{0}W_{t})$. In the initial stages, where new layers are initialized with zero weights, i.e., $w_{t+1}^{0}=0$, we recover, as mentioned in Section 2, gradient descent on $\mathbb{R}^{d\times d}$:

$W_{t+1}=W_{t}-\tfrac{1}{L}\nabla\ell(W_{t}).$

Putting these two pieces together, we have $\ell(W_{t})-\ell(W^{*})\leq\exp(-\tfrac{t-1}{\kappa})(\ell(W_{1})-\ell(W^{*}))$. By setting

$T_{0}\geq\kappa\log\left(\tfrac{\ell(W_{1})-\ell(W^{*})}{\delta}\right)+2,$

we have $\ell(W_{T_{0}-1})-\ell(W^{*})\leq\delta$. So by setting $V_{0}=W_{T_{0}-1}$, and noting that $W_{T_{0}}=W_{T_{0}-1}-\frac{1}{L}\nabla\ell(W_{T_{0}-1})$, we can now apply the local convergence result of Theorem 3.1 and conclude that performing $T^{\prime}$ rounds of stacking after $T_{0}$ rounds of warmup leads to the desired loss bound

$\ell(W_{T_{0}+T^{\prime}})-\ell(W^{*})\leq\exp(-\widetilde{\Omega}(T^{\prime}/\sqrt{\kappa}))=\exp(-\widetilde{\Omega}(T/\sqrt{\kappa})+O(T_{0}/\sqrt{\kappa}))=\exp(-\widetilde{\Omega}(T/\sqrt{\kappa})+\widetilde{O}(\sqrt{\kappa})).$
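A minimal sketch of the resulting schedule (our own illustration; `grad_loss` is any gradient oracle for $\ell$, and `T0` is the number of warmup stages, of order $\widetilde{O}(\kappa)$ in the corollary):

```python
import numpy as np

def train_with_warmup(W0, grad_loss, L, beta, T0, T):
    """T0 warmup stages with zero initialization (plain GD), then stacking stages, Eq. (5)."""
    W_prev = W0
    W_cur = W0 - grad_loss(W0) / L              # first stage is always a plain gradient step
    for t in range(1, T):
        if t < T0:                              # warmup: zero initialization recovers GD
            W_next = W_cur - grad_loss(W_cur) / L
        else:                                   # stacking stage with beta-scaled initialization
            look = W_cur + beta * (W_cur - W_prev) @ np.linalg.inv(W_prev) @ W_cur
            W_next = look - grad_loss(look) / L
        W_prev, W_cur = W_cur, W_next
    return W_cur
```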

Key ingredient of proof: robustness of Nesterov’s accelerated gradient descent method.

We now turn to proving Lemma 3.2.

Proof of Lemma 3.2.

In this proof, we will assume that the norm of the perturbation term is bounded by

$\|\Delta_{t}\|\leq\alpha\|y_{t}-y_{t-1}\|,$ (10)

where $\alpha$ is defined in the same way as in Theorem 3.1, namely

$\alpha^{-1}\coloneqq(\kappa-1)\sqrt{2\sqrt{\kappa}(\kappa-1)(\sqrt{\kappa}-3)}.$

We define the coefficients $\tau:=\frac{1}{\sqrt{\kappa}+1}$, $\gamma:=\frac{1}{\sqrt{\kappa}-1}$, and $\rho:=\frac{\mu\gamma}{4(4+\gamma)}=\frac{\mu}{4(4\sqrt{\kappa}-3)}$. Here, $\tau$ can be understood as a momentum parameter, $\gamma$ as (proportional to) the rate of the exponential convergence curve, and $\rho$ as a penalty for the presence of the perturbations $\Delta_{t}$. As is done in all proofs of Nesterov's method, we will define the iterates

$z_{t}:=\tfrac{1}{\tau}x_{t}-\tfrac{1-\tau}{\tau}y_{t}$

for all $t\in[T]$; we refer interested readers to [36, 2] for interpretations of these iterates. We note that although our definition of $z_{T}$ depends on $x_{T}$, and by extension on $\Delta_{T}$, the lemma's claim about $y_{T}$ does not depend on $\Delta_{T}$ at all. Thus, we are free to set $\Delta_{T}=0$, guaranteeing that $\lVert\Delta_{T}\rVert\leq\alpha\lVert y_{T}-y_{T-1}\rVert$.

To prove the statement of this lemma, it suffices to show the following claim.

Claim 3.7.

The following function $\Phi$ is a potential function:

$\Phi(t)=(1+\tfrac{\gamma}{2})^{t}\left[(\ell(y_{t})-\ell(x^{*}))+\tfrac{\mu}{2}\|z_{t}-x^{*}\|^{2}+2\rho\|y_{t}-x^{*}\|^{2}\right].$ (11)

That is, for all $t\in[T]$, $\Phi(t)-\Phi(t-1)\leq 0$.

It suffices to show Claim 3.7 because the consequence $\Phi(T)\leq\Phi(0)$ directly implies the lemma's first statement via

$(1+\tfrac{\gamma}{2})^{T}(\ell(y_{T})-\ell(x^{*}))\leq\ell(x_{0})-\ell(x^{*})+(\tfrac{\mu}{2}+2\rho)\lVert x_{0}-x^{*}\rVert^{2}\leq(\tfrac{L+\mu}{2}+2\rho)\lVert x_{0}-x^{*}\rVert^{2},$

with the last inequality following from the $L$-smoothness of $\ell$. The second statement follows similarly via

$(1+\tfrac{\gamma}{2})^{T}(\ell(y_{T})-\ell(x^{*}))\leq\ell(x_{0})-\ell(x^{*})+(\tfrac{\mu}{2}+2\rho)\lVert x_{0}-x^{*}\rVert^{2}\leq(2+\tfrac{4\rho}{\mu})(\ell(x_{0})-\ell(x^{*})).$

We therefore turn to showing Claim 3.7.

First, we note that the potential function (11) differs from the usual potential function used to prove Nesterov's acceleration, which we denote by $\Phi_{\mathrm{orig}}$:

$\Phi_{\mathrm{orig}}(t)=(1+\gamma)^{t}\left[(\ell(y_{t})-\ell(x^{*}))+\tfrac{\mu}{2}\lVert z_{t}-x^{*}\rVert^{2}\right].$

Fact 3.8 says, roughly, that for any $t\in[T]$ we could recover a large part of the usual proof of Nesterov's acceleration by bounding the difference $\Phi_{\mathrm{orig}}(t)-\Phi_{\mathrm{orig}}(t-1)\leq 0$, if only we could remove the perturbation at timestep $t$, i.e., set $\Delta_{t}=0$ (while keeping the perturbations $\Delta_{1},\dots,\Delta_{t-1}$ from previous timesteps in place).

Fact 3.8.

Fix any $t\in[T]$ and let $\widetilde{z}_{t}\coloneqq\tfrac{1}{\tau}(y_{t}+\beta(y_{t}-y_{t-1}))-\tfrac{1-\tau}{\tau}y_{t}$. Then,

$(1+\gamma)(\ell(y_{t})-\ell(x^{*}))-(\ell(y_{t-1})-\ell(x^{*}))+\tfrac{\mu}{2}\left((1+\gamma)\|\widetilde{z}_{t}-x^{*}\|^{2}-\|z_{t-1}-x^{*}\|^{2}\right)\leq 0.$ (12)

Next, in Fact 3.9, we show that since $\Phi_{\mathrm{orig}}(t)\leq\Phi_{\mathrm{orig}}(t-1)$ holds when the perturbation $\Delta_{t}$ is ignored, we can also argue that $\Phi(t)\leq\Phi(t-1)$ when the perturbation $\Delta_{t}$ is taken into account. That is, Fact 3.9 shows that the left-hand side of (12) upper bounds $\Phi(t)-\Phi(t-1)$, so long as our stated assumption on the perturbation norm $\lVert\Delta_{t}\rVert$ holds.

Fact 3.9.

For any $t\in[T]$, the left-hand side of (12) upper bounds $\Phi(t)-\Phi(t-1)$; equivalently,

$(1+\tfrac{\gamma}{2})\left(\tfrac{\mu}{2}\lVert z_{t}-x^{*}\rVert^{2}+2\rho\lVert y_{t}-x^{*}\rVert^{2}\right)-2\rho\lVert y_{t-1}-x^{*}\rVert^{2}\leq\tfrac{\gamma}{2}(\ell(y_{t})-\ell(x^{*}))+(1+\gamma)\tfrac{\mu}{2}\lVert\widetilde{z}_{t}-x^{*}\rVert^{2}.$ (13)

Plugging Fact 3.9's inequality (13) into Fact 3.8's inequality (12), we recover our main claim that $\Phi(t)-\Phi(t-1)\leq 0$. We conclude by turning to prove Facts 3.8 and 3.9.

Proof of Fact 3.8.

The proof of this fact closely follows [3]'s proof of Nesterov's accelerated gradient method in the smooth and strongly convex setting.

We begin by upper bounding the first summand on the left-hand side of (12). Since $\ell$ is smooth and $y_{t}$ is a gradient step on $\ell$ from $x_{t-1}$, we have $\ell(y_{t})\leq\ell(x_{t-1})-\frac{1}{2L}\lVert\nabla\ell(x_{t-1})\rVert^{2}$ and thus

$(1+\gamma)(\ell(y_{t})-\ell(x^{*}))-(\ell(y_{t-1})-\ell(x^{*}))\leq\ell(x_{t-1})-\ell(y_{t-1})+\gamma(\ell(x_{t-1})-\ell(x^{*}))-\frac{1+\gamma}{2L}\lVert\nabla\ell(x_{t-1})\rVert^{2}.$ (14)

Using the $\mu$-strong convexity of $\ell$, we can further bound part of the right-hand side by

$\ell(x_{t-1})-\ell(y_{t-1})+\gamma(\ell(x_{t-1})-\ell(x^{*}))\leq\langle\nabla\ell(x_{t-1}),x_{t-1}-y_{t-1}\rangle+\gamma\left(\langle\nabla\ell(x_{t-1}),x_{t-1}-x^{*}\rangle-\frac{\mu}{2}\lVert x_{t-1}-x^{*}\rVert^{2}\right).$ (15)

Plugging (15) and the identity $z_{t-1}=\frac{1}{\tau}x_{t-1}-\frac{1-\tau}{\tau}y_{t-1}$ into (14), direct algebra yields an upper bound on the first summand of (12):

$(1+\gamma)(\ell(y_{t})-\ell(x^{*}))-(\ell(y_{t-1})-\ell(x^{*}))\leq\frac{1}{1+\gamma}\langle\nabla\ell(x_{t-1}),\gamma(z_{t-1}-x^{*})+\gamma^{2}(x_{t-1}-x^{*})\rangle-\frac{\mu\gamma}{2}\lVert x_{t-1}-x^{*}\rVert^{2}-\frac{1+\gamma}{2L}\lVert\nabla\ell(x_{t-1})\rVert^{2}.$ (16)

Next, we turn to upper bounding the second summand in (12). Plugging the iterate definitions $y_{t}=x_{t-1}-\frac{1}{L}\nabla\ell(x_{t-1})$ and $z_{t-1}=\frac{1}{\tau}x_{t-1}-\frac{1-\tau}{\tau}y_{t-1}$ into our definition of $\widetilde{z}_{t}=\frac{1}{\tau}\left((1-\tau)y_{t}-(1-2\tau)y_{t-1}\right)$, we can recover the identity

$\widetilde{z}_{t}=\frac{1}{1+\gamma}z_{t-1}+\frac{\gamma}{1+\gamma}x_{t-1}-\frac{\gamma}{\mu(1+\gamma)}\nabla\ell(x_{t-1}).$

Plugging this identity for $\widetilde{z}_{t}$ into the expression $\tfrac{\mu}{2}\left((1+\gamma)\|\widetilde{z}_{t}-x^{*}\|^{2}-\|z_{t-1}-x^{*}\|^{2}\right)$, direct algebra yields the identity

$\tfrac{\mu}{2}\left((1+\gamma)\|\widetilde{z}_{t}-x^{*}\|^{2}-\|z_{t-1}-x^{*}\|^{2}\right)=\frac{1+\gamma}{2L}\lVert\nabla\ell(x_{t-1})\rVert^{2}-\frac{1}{1+\gamma}\langle\nabla\ell(x_{t-1}),\gamma(z_{t-1}-x^{*})+\gamma^{2}(x_{t-1}-x^{*})\rangle+\frac{\mu\gamma}{2}\lVert x_{t-1}-x^{*}\rVert^{2}-\frac{\mu\gamma}{2(1+\gamma)}\lVert z_{t-1}-x_{t-1}\rVert^{2}.$ (17)

Summing (16) and (17) yields the following upper bound on the left-hand side of (12):

$(1+\gamma)(\ell(y_{t})-\ell(x^{*}))-(\ell(y_{t-1})-\ell(x^{*}))+\tfrac{\mu}{2}\left((1+\gamma)\|\widetilde{z}_{t}-x^{*}\|^{2}-\|z_{t-1}-x^{*}\|^{2}\right)\leq-\frac{\mu\gamma}{2(1+\gamma)}\lVert z_{t-1}-x_{t-1}\rVert^{2}\leq 0.$

Proof of Fact 3.9.

To show this fact, we can observe from direct algebra that it suffices to prove the following inequalities:

$\tfrac{\gamma}{2}(\ell(y_{t})-\ell(x^{*}))\geq(2+\tfrac{\gamma}{2})\cdot 2\rho\lVert y_{t}-x^{*}\rVert^{2},$ (18)

$(1+\gamma)\lVert\widetilde{z}_{t}-x^{*}\rVert^{2}\geq(1+\tfrac{\gamma}{2})\lVert z_{t}-x^{*}\rVert^{2}-\tfrac{4\rho}{\mu}(\lVert y_{t-1}-x^{*}\rVert^{2}+\lVert y_{t}-x^{*}\rVert^{2}).$ (19)

The inequality (18) follows directly from the $\mu$-strong convexity of $\ell$ and the identity $\rho=\frac{\mu\gamma}{4(4+\gamma)}$, as

$\tfrac{\gamma}{2}(\ell(y_{t})-\ell(x^{*}))\geq\tfrac{\mu\gamma}{4}\|y_{t}-x^{*}\|^{2}=(2+\tfrac{\gamma}{2})\cdot 2\rho\|y_{t}-x^{*}\|^{2}.$

To show (19), we first recall the general fact that $c(1+\delta)\|v\|^{2}\geq c\|v+w\|^{2}-c(1+\frac{1}{\delta})\|w\|^{2}$ for any $\delta,c>0$. Choosing $v=\widetilde{z}_{t}-x^{*}$, $w=z_{t}-\widetilde{z}_{t}$, $c=1+\tfrac{\gamma}{2}$, and $\delta=\frac{\gamma}{\gamma+2}>0$, this yields

$(1+\gamma)\|\widetilde{z}_{t}-x^{*}\|^{2}\geq(1+\tfrac{\gamma}{2})\|z_{t}-x^{*}\|^{2}-(2+\tfrac{2}{\gamma})(1+\tfrac{\gamma}{2})\|\tfrac{1}{\tau}\Delta_{t}\|^{2}.$ (20)

Using the perturbation norm bound, we can therefore bound

$(2+\tfrac{2}{\gamma})(1+\tfrac{\gamma}{2})\|\tfrac{1}{\tau}\Delta_{t}\|^{2}\leq\tfrac{2\rho}{\mu}\|y_{t}-y_{t-1}\|^{2}\leq\tfrac{4\rho}{\mu}(\|y_{t}-x^{*}\|^{2}+\|y_{t-1}-x^{*}\|^{2}),$

where the last inequality uses the elementary bound $\|a+b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2}$; plugged into (20), this yields (19) as desired. ∎

4 Experiments

In this section we provide some proof-of-concept experiments to validate our theoretical results.

4.1 Deep Linear Networks and Squared Losses

As our main theoretical results in Section 3 apply to the case of deep linear networks, we consider the same function class in our experiments on synthetic data with the squared loss. Formally, the output space is $\mathcal{Y}=\mathbb{R}^{d}$, and for a predictor $x\mapsto Wx$ we consider the loss

$\ell(W)=\mathbb{E}_{(x,y)\sim D}\left[\tfrac{1}{2}\|Wx-y\|^{2}\right].$ (21)

We consider a data distribution $D$ where the samples $(x,y)$ are drawn as follows. Let $W^{*}$ be the ground-truth positive definite matrix and let $\Sigma$ be the data covariance matrix. We first sample $x\sim N(0,\Sigma)$, and then, conditioned on $x$, the output $y$ is generated as $y=W^{*}x+\xi$, where $\xi$ is a mean-zero random variable. We can then write the expected squared loss explicitly as

$\ell(W)=\mathbb{E}_{(x,y)\sim D}\left[\tfrac{1}{2}\|Wx-y\|^{2}\right]=\tfrac{1}{2}\operatorname{Tr}\left((W-W^{*})\Sigma(W-W^{*})^{\top}\right)+\tfrac{1}{2}\operatorname{\mathbb{E}}\left[\|\xi\|^{2}\right].$ (22)

Note that for the case of the squared loss described above, the condition number of the expected loss depends on the covariance matrix $\Sigma$: $\kappa=\frac{\sigma_{\max}(\Sigma)}{\sigma_{\min}(\Sigma)}$.

Stacking updates.

For the specific case of the squared loss, we get the following closed-form expression for the stacking updates (see Eq. (5)):

$W_{t+1}=\left(W_{t}+\beta(W_{t}-W_{t-1})W_{t-1}^{-1}W_{t}\right)\left(I-\tfrac{1}{L}\Sigma\right)+\tfrac{1}{L}W^{*}\Sigma.$ (23)

Here $L$ is the smoothness constant of the loss, which equals the largest singular value of $\Sigma$: $L=\sigma_{\max}(\Sigma)$.

Nesterov’s updates.

Similarly, we get the following closed-form expression for Nesterov's updates (see Eq. (6)) in the case of the squared loss:

$W_{t+1}=\left(W_{t}+\beta(W_{t}-W_{t-1})\right)\left(I-\tfrac{1}{L}\Sigma\right)+\tfrac{1}{L}W^{*}\Sigma.$ (24)

For both the stacking and Nesterov's updates we set $\beta=\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$.

4.2 Synthetic data experiments

We compare the performance of the three updates: vanilla gradient descent, the stacking updates (Eq. 23), and the exact Nesterov updates (Eq. 24). Here, at each stacking stage only the last layer is updated, which matches our theoretical setup faithfully. Later in this section, we also consider the effect of training all the layers in each stacking stage, which is closer to how stacking is applied in practice.

We consider points in $d=20$ dimensions. We generate the ground truth $W^{*}$ to be of the form $I+\sigma S$, where $S$ is a random positive semi-definite matrix of spectral norm $1$ and $\sigma$ is a parameter controlling the closeness of $W^{*}$ to the identity. For a given $\kappa>1$, we generate a random covariance matrix $\Sigma$ with condition number $\kappa$. Finally, we sample the noise $\xi$ in the output from a mean-zero Gaussian with standard deviation $0.1$.
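A compact reproduction sketch of this setup (our own code; the exact construction of $\Sigma$ for a target $\kappa$ and the choice of starting point $W_{0}=I$ are our assumptions) generates the data as described and runs vanilla gradient descent, the stacking updates (23), and Nesterov's updates (24):

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, kappa, T = 20, 0.5, 100.0, 60

# Ground truth W* = I + sigma * S, with S a random PSD matrix of spectral norm 1.
A = rng.standard_normal((d, d))
S = A @ A.T
S /= np.linalg.norm(S, 2)
W_star = np.eye(d) + sigma * S

# Random covariance matrix with condition number kappa (eigenvalues in [1, kappa]).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
eigs = np.linspace(1.0, kappa, d)
Sigma = Q @ np.diag(eigs) @ Q.T
L = eigs.max()                                  # smoothness = sigma_max(Sigma)
beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)

def loss(W):
    """Expected squared loss, Eq. (22), dropping the constant noise term."""
    D = W - W_star
    return 0.5 * np.trace(D @ Sigma @ D.T)

def grad(W):
    """Gradient of the expected squared loss: (W - W*) Sigma."""
    return (W - W_star) @ Sigma

gd       = lambda W, Wp: W - grad(W) / L
nesterov = lambda W, Wp: (W + beta * (W - Wp)) @ (np.eye(d) - Sigma / L) + W_star @ Sigma / L
stacking = lambda W, Wp: (W + beta * (W - Wp) @ np.linalg.inv(Wp) @ W) @ (np.eye(d) - Sigma / L) \
                         + W_star @ Sigma / L

def run(update, T):
    W_prev = np.eye(d)                          # W_0 = I: nonsingular starting point
    W_cur = W_prev - grad(W_prev) / L           # first stage is a plain gradient step
    for _ in range(T - 1):
        W_prev, W_cur = W_cur, update(W_cur, W_prev)
    return loss(W_cur)

print(run(gd, T), run(stacking, T), run(nesterov, T))
```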

Figure 4: Mean squared error (MSE) vs. number of stacking stages. We observe that as the data becomes more ill conditioned both the stacking updates and Nesterov’s updates demonstrate faster convergence than vanilla gradient descent.

Figure 4 shows the performance of the three types of updates as the problem becomes more ill-conditioned, i.e., as a function of $\kappa$. As expected, at small values of the condition number there is no advantage of the stacking updates over vanilla gradient descent. However, for ill-conditioned data the stacking updates converge much faster than gradient descent. We also observe that the convergence of the stacking updates mirrors very closely the convergence behavior of the exact Nesterov's updates.

To further understand the relationship between the stacking updates and Nesterov's updates, Figure 5 shows the performance of the two as the distance of $W^{*}$ from the identity increases. As can be seen from the figure, when $W^{*}$ is farther from the identity the stacking updates behave qualitatively differently from Nesterov's updates during the initial phase: the loss for the stacking updates explodes before converging in later stages. This suggests that in practice there may be a better way to initialize a stacking stage by making the initialization closer to the ideal Nesterov's updates. While in the case of deep linear networks and the squared loss we have a closed-form expression for such an initialization, in general this is a hard problem.

Figure 5: Mean squared error (MSE) vs. number of stacking stages. The figure compares the stacking updates and Nesterov's updates as $W^{*}$ becomes farther from the identity, i.e., as $\sigma$ increases. We observe that for higher values of $\sigma$ the stacking updates display diverging behavior in the initial stages.

Next we consider the case where in each stacking stage we train all the layers of the deep linear network. We use the same data generation procedure as described above. We perform $10$ stages of stacking, where in each stage we perform $2$ steps of gradient descent with a learning rate of $1/L$, where $L$ is the smoothness of the loss function. We train on $1024$ examples with a batch size of $32$ and test on $1024$ examples.

We consider two types of stacking-based initialization schemes. The first, Stacking Init., initializes the next layer's weight matrix $w^{0}_{t+1}$ as $\beta w_{t}$. The second, Nesterov Init., initializes $w^{0}_{t+1}$ such that we recover the precise Nesterov's updates at initialization, i.e., Eq. (24). From the analysis in Section 2, the initialization that achieves this amounts to setting $w^{0}_{t+1}$ to $\beta w_{t}(I+w_{t})^{-1}$.
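Concretely, the two initializations of the newly added layer can be computed as follows (a minimal sketch; the function names are ours):

```python
import numpy as np

def stacking_init(w_t, beta):
    """Stacking Init.: copy the top layer's weights, scaled by beta."""
    return beta * w_t

def nesterov_init(w_t, beta):
    """Nesterov Init.: choose the new layer so that the first update of the
    new stage matches the exact Nesterov update (Eq. (24)) at initialization."""
    d = w_t.shape[0]
    return beta * w_t @ np.linalg.inv(np.eye(d) + w_t)
```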

Figure 6 shows the performance of the two stacking initialization schemes as compared to the random baseline, where we initialize the next layer's weight matrix randomly. We again observe that both stacking schemes outperform the baseline, particularly when the data is ill-conditioned.

Figure 6: Mean squared error (MSE) vs. number of stacking stages when training all the layers.

4.3 Stacking for BERT Base with $\beta$ parameters

Figure 7: Stacking initialization with a trainable $\beta$ parameter multiplying the output of the newly added transformer block. Experimental runs with $\beta$ initialized to $0.99$ and $0.9$ are provided.

The theory developed in Section 2 requires the initialization at the $(t+1)$-th stage to be $f_{t+1}^{0}=\beta f_{t}$ for some $\beta\in[0,1)$. The introduction of $\beta$ is crucial to obtain the accelerated convergence rate of Nesterov's method, but the standard stacking initialization doesn't use a $\beta$ parameter. We performed sanity-check experiments on BERT Base to ensure that introducing the $\beta$ parameter doesn't affect the efficacy of stacking. We introduced a trainable parameter, $\beta$, that multiplies the output of the newly added transformer block in stacking, initialized to the values $0.9$ and $0.99$, which are standard settings for momentum parameters. Figure 7 shows that the introduction of the $\beta$ parameter doesn't hurt the efficacy of stacking. The plot also shows that the final log perplexity improves slightly when using a trainable $\beta$.
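A minimal sketch of this modification is given below; it is not the authors' implementation, and the transformer block itself is abstracted into an arbitrary callable `block`. Following the wording above, the trainable scalar $\beta$ simply multiplies the output of the newly added block:

```python
import copy

class ScaledNewBlock:
    """Wraps a newly stacked block with a trainable scalar beta that multiplies
    the block's output (illustrative sketch; `block` is any callable layer)."""

    def __init__(self, block, beta_init=0.99):
        self.block = block
        self.beta = beta_init  # trained jointly with the block's parameters

    def __call__(self, x):
        # The new block's output is scaled by beta before being passed on.
        return self.beta * self.block(x)

def grow_by_one_block(blocks, beta_init=0.99):
    """Gradual stacking step (sketch): copy the top block as the stacking
    initialization and append it, wrapped with the trainable beta."""
    new_block = copy.deepcopy(blocks[-1])
    blocks.append(ScaledNewBlock(new_block, beta_init))
    return blocks
```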

5 Conclusions and Future Work

This paper develops the theoretical perspective that the effectiveness of stacking initialization, compared to other forms of initialization such as zero or random initialization, stems from the fact that it enables a form of accelerated gradient descent in function space. There are several directions for future work. While this work provides a formal proof of accelerated convergence for a particular parametric setting (deep residual linear networks), such a proof in the general functional setting for deep residual networks remains open and will likely require additional assumptions. From a practical standpoint, a very intriguing and potentially impactful question is whether it is possible to come up with an efficiently implementable initialization scheme that leads to Nesterov's AGD updates exactly for deep residual networks.
