
Stacking as Accelerated Gradient Descent

Naman Agarwal
Google DeepMind
namanagarwal@google.com
   Pranjal Awasthi
Google Research
pranjalawasthi@google.com
   Satyen Kale
Google Research
satyenkale@google.com
   Eric Zhao
UC Berkeley, Google Research
eric.zh@berkeley.edu
  
Abstract

Stacking, a heuristic technique for training deep residual networks by progressively increasing the number of layers and initializing new layers by copying parameters from older layers, has proven quite successful in improving the efficiency of training deep neural networks. In this paper, we propose a theoretical explanation for the efficacy of stacking: viz., stacking implements a form of Nesterov’s accelerated gradient descent. The theory also covers simpler models such as the additive ensembles constructed in boosting methods, and provides an explanation for a similar widely-used practical heuristic for initializing the new classifier in each round of boosting. We also prove that for certain deep linear residual networks, stacking does provide accelerated training, via a new potential function analysis of the Nesterov’s accelerated gradient method which allows errors in updates. We conduct proof-of-concept experiments to validate our theory as well.

1 Introduction

Deep learning architectures are ubiquitous today and have been responsible for tremendous technological advances in machine learning. However, until 2006, training deep architectures was extremely challenging. The deep learning revolution of the past couple of decades was ushered in with the discovery that a classical technique, viz. greedy layer-wise pretraining, can be used to train general deep architectures (see [13, Section 15.1] for a historical account). Previously, only deep architectures with special structure, such as convolutions or recurrences, were known to be feasible to train. Greedy layer-wise pretraining is a very intuitive technique in which a deep network is built and trained in a stagewise manner. Starting with a small network that is easy to train, the technique prescribes adding new layers over a number of stages, and in each stage training the newly added layers (and, potentially, the older layers as well) for a certain number of training steps. This process continues until the desired model depth is reached.

More modern developments such as residual connections [19] and normalization layers [20] have made it possible to directly train deep networks without using greedy layer-wise pretraining. However, in recent years, the tremendous success of deep learning architectures based on transformers [35] in domains such as language modeling and computer vision [27, 5, 9, 34] has led to the trend of scaling model capacity with ever-increasing model sizes for improved performance [8, 1, 33]. This endeavour comes at a significant cost, as model training may take months and require several million dollars of compute resources [8]. As a result, there has been a surge of recent work aimed at faster training of large transformer-based models. These works include methods for sparse training such as mixture-of-experts (MoE) models [30, 21], methods for approximate sparse attention mechanisms [6], and better optimization methods [31, 23, 15].

In the effort to reduce the massive costs of training these giant transformer models, greedy layer-wise pretraining has re-emerged as a very effective strategy in recent times. Specifically, a technique for initializing the new layers known as stacking [14, 26] has been shown to be very effective in speeding up the training of deep transformer models. Stacking prescribes a heuristic for initializing the newly added layers: they should be initialized by copying parameters from the previously trained layers. [14] proposed to double the model depth at each stage by stacking an exact copy of the current model on top. [26] argue that doubling the model depth may be suboptimal and that a better approach is gradual stacking, where a few new layers (say 3-4) are added during each stage. These layers are initialized by copying the topmost layers of the existing model. See Figure 1 for an illustration of this technique for training deep transformer models.

Figure 1: Stacking for stagewise training language models. In each stage, a new transformer block is added, initialized with the parameters of the top block from the previous stage, and then trained for a certain number of steps.

The classical greedy layer-wise pretraining strategy does not prescribe a specific initialization for the new layers; in general, they are initialized randomly in some standard fashion. Stacking initialization provides a clear benefit over random initialization: Figure 2 shows one example of this effect, for training the BERT Base [10] model with 4 stages of stagewise training.

Figure 2: Stacking init vs random init for stagewise training of BERT Base model. Four stages are used with 168,750 steps in each stage. Stage boundaries are marked by vertical dashed lines. Stacking init provides a clear benefit over random init.

Structurally, greedy layer-wise pretraining resembles another classical technique, viz. boosting. In boosting, an additive ensemble of classifiers is constructed via greedy stagewise training of classifiers in the same greedy manner. Boosting algorithms such as AdaBoost [12] and Gradient Boosting [11] have found tremendous practical application, especially when using decision trees as base classifiers (e.g. XGBoost [7]). A heuristic similar to stacking has also found practical application in boosting algorithms. The heuristic is to initialize each new classifier (e.g. a decision tree) by copying over the just-trained classifier and then updating it using new training data. This process is illustrated in Figure 3. Due to the similarity with stacking for training deep transformer models, in the rest of the paper we use “stacking” to also refer to this initialization strategy in the context of boosting.

Figure 3: Stacking for boosting. In each stage, a new classifier is added, initialized with the parameters of the last trained classifier from the previous stage, and then trained for a certain number of steps.

While stacking based methods lead to impressive speedups in the training of transformer models and of additive models in boosting, we currently do not have a good theoretical understanding of why this is the case. Recently, [26] provided a theoretical explanation based on the assumption that each transformer block is a good few-shot learner. This assumption, along with a few others, is then used to conclude that copying parameters as stacking does leads to fast learning. However, the assumptions made in that paper are fairly strong and hard to verify.

In this work, we make progress on developing a theoretical understanding of the efficacy of stacking by studying it from an optimization perspective. In particular, our main contribution is that when viewed from the perspective of function optimization, stacking speeds up stagewise training by enabling a form of the accelerated gradient descent method (AGD) developed by Nesterov [25]. In other words, each stage of the stacking-initialized stagewise training procedure will reduce the training loss at an accelerated rate.

In contrast, we also show that without using any form of initialization, or in other words, initializing the new block/classifier to implement the zero function, stagewise training simply recovers usual (non-accelerated) gradient descent, whereas random initialization recovers stochastic gradient descent on a smoothed version of the loss. Hence, stacking initialization accelerates stagewise training over zero or random initialization. In more detail, our contributions are as follows:

  1. We propose a general theoretical framework for learning a prediction function $F$ via an ensemble, i.e., a sequence of functions $(f_{1},f_{2},\ldots,f_{T})$, trained in a greedy stagewise manner. The generality of our framework lets us unify classical approaches such as boosting [12, 11] that build the ensemble in an additive manner, and modern approaches that build the ensemble via stagewise training of residual function compositions (e.g., ResNets [19] and Transformer models [35]).

  2. Our proposed framework lets us formally establish the connection between the initialization strategy used for building the ensemble and the convergence properties of the resulting overall learning procedure. In particular, we show that the zero initialization strategy recovers vanilla functional gradient descent for both the additive (i.e., boosting) and residual compositional forms of learning, whereas random initialization recovers stochastic functional gradient descent (on a smoothed loss) for both types of models. Furthermore, in the case of additive models, the popular stacking initialization exactly recovers Nesterov's accelerated functional gradient descent. The consequence is that for $T$ stages of boosting with stacking initialization, the loss decreases at a rate of $O(T^{-2})$ for smooth losses, or $\exp(-\Omega(T/\sqrt{\kappa}))$ for smooth and strongly convex losses with condition number $\kappa$, as opposed to rates of $O(T^{-1})$ and $\exp(-\Omega(T/\kappa))$, respectively, for zero initialization.

  3. For the case of compositional models, we show that stacking initialization results in updates that look remarkably similar to Nesterov's accelerated functional gradient descent. Proving an accelerated rate in the general non-parametric functional setting seems intractable, so we analyze stacking in a special parametric setting of deep linear networks with a convex loss function. In this setting we prove (Theorem 3.1) that stacking initialization quantitatively leads to the same kind of convergence benefits over vanilla gradient descent as is observed for Nesterov's accelerated method. At the core of our proof is a novel potential function analysis of Nesterov's method with errors in the momentum term, which may be of independent interest (cf. Lemma 3.2).

  4. We perform proof-of-concept experiments (in Section 4) to validate our theory on synthetic and real-world data.

1.1 Related work

Boosting is a classical technique for constructing additive ensembles via greedy stagewise training, and has a long and rich history of work. We refer the interested reader to the excellent textbook of [28] for the literature on this topic.

The idea of training deep residual networks in a layer-wise manner has been explored in many prior works. In earlier studies [18, 4], the focus was on greedily adding trained layers to the model while keeping the bottom layers frozen, followed by a final fine-tuning step in which the entire network is trained. In recent years, progressive or gradual stacking [14, 16, 32, 26] has emerged as a powerful way to train deep networks, especially transformer-based architectures.

The empirical insight of [14] was that the attention patterns in neighboring layers of trained transformer models show remarkable similarity. Hence, copying the parameters from the previous layer provides a better initialization for the optimization procedure. As mentioned previously, [26] developed the gradual stacking approach based on the assumption that trained transformer blocks are good few-shot learners, and showed that gradual stacking leads to significant wall-clock improvements during training.

2 Stagewise training as functional gradient descent

Preliminaries.

We consider a fairly general supervised learning setting. Denote the input space by $\mathcal{X}$ and the output space by $\mathcal{Y}$. Examples $(x,y)\in\mathcal{X}\times\mathcal{Y}$ are drawn from a distribution $D$ (which may simply be the empirical data distribution in the case of empirical risk minimization). We aim to model the input-output relationship via predictions in $\mathbb{R}^{d}$, for some dimension parameter $d$. Given an example $(x,y)\in\mathcal{X}\times\mathcal{Y}$, the quality of a prediction $\widehat{y}\in\mathbb{R}^{d}$ is measured via a loss function $\ell:\mathbb{R}^{d}\times\mathcal{Y}\to\mathbb{R}$. Predictions are computed using functions $f:\mathcal{X}\to\mathbb{R}^{d}$. We will assume that the predictor functions $f$ are square integrable with respect to $D$, i.e., $\operatorname{\mathbb{E}}_{(x,y)\sim D}[\|f(x)\|^{2}]<\infty$. The space of such functions forms a Hilbert space, denoted $\mathcal{L}_{2}$, with the inner product defined as $\langle f,g\rangle=\operatorname{\mathbb{E}}_{(x,y)\sim D}[\langle f(x),g(x)\rangle]$. Unless specified otherwise, all functions in the subsequent discussion will be assumed to be in $\mathcal{L}_{2}$. The loss function can then naturally be extended to predictor functions $f\in\mathcal{L}_{2}$ by defining, with some abuse of notation, $\ell(f):=\operatorname{\mathbb{E}}_{(x,y)\sim D}[\ell(f(x),y)]$. The goal of training is to obtain a function $f\in\mathcal{L}_{2}$ that minimizes $\ell(f)$.

In the rest of this section, we perform the analysis in a purely functional setting, which affords a convenient analysis. However, we note that, in practice, functions are parameterized (say by neural networks) and hence update rules for functions may not always be realizable via the specific parameterization used. The functional setting allows us to sidestep realizability issues and to focus on the conceptual message that stacking initialization enables accelerated updates.

We now define a general ensemble learning setup within the above setting. In this setup, we aim to approximate the minimizer of $\ell$ on $\mathcal{L}_{2}$ via an ensemble, which is a sequence of functions $(f_{1},f_{2},\ldots,f_{T})$, where $T>0$ is a given parameter defining the size of the ensemble. The functions in the ensemble are typically "simple" in the sense that they are chosen from a class of functions that is easy to optimize over. A predictor function can be obtained from an ensemble $(f_{1},f_{2},\dots,f_{T})$ by aggregating its constituent functions into a single function $F_{T}:\mathcal{X}\to\mathbb{R}^{d}$. The loss of an ensemble can then be defined (again with some abuse of notation) in terms of its aggregation as $\ell((f_{1},f_{2},\ldots,f_{T})):=\ell(F_{T})$. Two specific aggregation operators we consider are the following:

  1. Addition (e.g., boosting): a summation over the ensemble outputs, $F_{T}=f_{1}+f_{2}+\cdots+f_{T}$.

  2. Residual composition (e.g., deep residual neural networks): the composed function $F_{T}=(I+f_{T})\circ(I+f_{T-1})\circ\cdots\circ(I+f_{1})$, where the domain is $\mathcal{X}=\mathbb{R}^{d}$ and $I:\mathbb{R}^{d}\to\mathbb{R}^{d}$ is the identity mapping. (A minimal code sketch of both aggregations follows this list.)
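Both aggregation operators can be written down directly. The following minimal numpy sketch (our own illustration; the function names and the linear component functions are our choices) builds both aggregations from a list of component functions:

```python
import numpy as np

def aggregate_additive(fs):
    """Additive ensemble: F_T(x) = f_1(x) + ... + f_T(x)."""
    return lambda x: sum(f(x) for f in fs)

def aggregate_residual(fs):
    """Residual composition: F_T = (I + f_T) o ... o (I + f_1)."""
    def F(x):
        h = x
        for f in fs:          # apply blocks bottom-up, each adding a residual update
            h = h + f(h)
        return h
    return F

# Example with simple linear component functions f_i(x) = A_i x on R^3.
rng = np.random.default_rng(0)
fs = [(lambda x, A=0.1 * rng.standard_normal((3, 3)): A @ x) for _ in range(4)]
x = rng.standard_normal(3)
print(aggregate_additive(fs)(x), aggregate_residual(fs)(x))
```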

Greedy stagewise training.

Stagewise training is a simple greedy procedure to train ensembles in a progressive manner. Suppose we have already obtained a (partial) ensemble $(f_{1},f_{2},\ldots,f_{t})$. Then the next function in the ensemble, $f_{t+1}$, is ideally obtained by minimizing the loss of the new ensemble, i.e., $f_{t+1}=\arg\min_{f}\ell((f_{1},f_{2},\ldots,f_{t},f))$.

However, in practice, this ideal is hard to implement, and instead two heuristics are commonly used: (a) the new function to be trained is initialized in some carefully chosen manner, and (b) the optimization above is done using early stopping, i.e. a few steps of gradient descent, which ensures that the new function stays close to initialization. We analyze these heuristics in a functional optimization setting as follows.

First, we assume that the function $f_{t+1}$ to be trained is initialized at some carefully chosen value $f_{t+1}^{0}$. For notational convenience, we denote the aggregation of the ensemble $(f_{1},f_{2},\ldots,f_{t},f_{t+1}^{0})$ by $F_{t+1}^{0}$ and that of the generic ensemble $(f_{1},f_{2},\ldots,f_{t},f)$ by $F$.

Next, we note that an exact analysis of early stopping quickly becomes technically intractable. Instead, for a theoretical analysis, we model the heuristic of early stopping by using $\ell_{2}$ regularization around the initialization and linearizing the loss near the initialization, as follows. It is known (see, e.g., [13, Section 7.8]) that early stopping acts as a form of $\ell_{2}$ regularization which ensures that the trained function $f_{t+1}$ remains close to its initialization $f_{t+1}^{0}$, which in turn implies that $F_{t+1}$ remains close to $F_{t+1}^{0}$. Thus, early stopping can be modeled as minimizing $\ell(F)+\frac{\lambda}{2}\|F-F_{t+1}^{0}\|^{2}$ for some regularization parameter $\lambda$. Further, since the trained function remains close to the initialization, we also approximate $\ell(F)$ by its linearization around the initialization: $\ell(F)\approx\ell(F_{t+1}^{0})+\langle\nabla\ell(F_{t+1}^{0}),F-F_{t+1}^{0}\rangle$. Here, $\nabla\ell(\cdot)$ is the Fréchet derivative, and $\langle\cdot,\cdot\rangle$ denotes the inner product in $\mathcal{L}_{2}$. These considerations lead to the following key modeling assumption.

Assumption 2.1.

The result of the early stopped training is given by

$F_{t+1}=\arg\min_{F\in\mathcal{L}_{2}}\ \ell(F_{t+1}^{0})+\langle\nabla\ell(F_{t+1}^{0}),F-F_{t+1}^{0}\rangle+\frac{\lambda}{2}\|F-F_{t+1}^{0}\|^{2}.$

In other words,

$F_{t+1}=F_{t+1}^{0}-\frac{1}{\lambda}\nabla\ell(F_{t+1}^{0}).$ (1)

We can now consider specific initialization strategies (i.e. zero initialization, random initialization, and stacking initialization) in the context of additive and residual compositional models and see how these initializations lead to various forms of functional gradient descent.

Stagewise training with zero initialization recovers functional gradient descent.

First, consider stagewise training where functions are initialized to be zero functions, i.e., $f_{t+1}^{0}=0$. It is easy to see that with this initialization, for both additive and residual compositional models, we have $F_{t+1}^{0}=F_{t}$. Thus, from (1), we have that the updated ensemble's predictor can be written as

$F_{t+1}=F_{t}-\frac{1}{\lambda}\nabla\ell(F_{t}).$

This exactly describes functional gradient descent with step size $\frac{1}{\lambda}$. In the additive setting this is well known: indeed, boosting can be seen as functional gradient descent [24]. The result for the residual compositional setting appears to be new.
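As a concrete finite-sample illustration (our own construction, not part of the paper's formal setup), one can represent a predictor by its vector of predictions on $n$ training points, so that functional gradient descent becomes ordinary gradient descent on that vector. Under Assumption 2.1 with zero initialization, each stage then performs exactly one gradient step:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, T = 100, 5.0, 50
y = rng.standard_normal(n)      # targets
F = np.zeros(n)                 # F_0: predictions of the empty ensemble

def loss(F):
    """Empirical squared loss of the prediction vector F."""
    return 0.5 * np.sum((F - y) ** 2)

def grad(F):
    """Gradient of the empirical squared loss with respect to F."""
    return F - y

for t in range(T):
    # Zero initialization gives F_{t+1}^0 = F_t, so update (1) is a plain gradient step.
    F = F - (1.0 / lam) * grad(F)

print(loss(F))                  # the loss decreases monotonically across stages
```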

Stagewise training with random initialization recovers stochastic functional gradient descent on smoothed loss.

We now consider stagewise training where functions are initialized randomly, i.e., $f_{t+1}^{0}$ is a randomly drawn function, independent of all randomness up to stage $t$. In the following, we will assume that $\operatorname{\mathbb{E}}[f_{t+1}^{0}]=0$, where the $0$ on the right-hand side denotes the zero function. With this initialization, for both additive and residual compositional models, we have $F_{t+1}^{0}=F_{t}+g_{t}$, where $g_{t}=f_{t+1}^{0}$ for additive models and $g_{t}=f_{t+1}^{0}\circ F_{t}$ for residual compositional models. In either case, note that $\operatorname{\mathbb{E}}[g_{t}]=0$. Now define the loss functional $\ell_{t}(F):=\operatorname{\mathbb{E}}[\ell(F+g_{t})]$. Since $\operatorname{\mathbb{E}}[g_{t}]=0$, we can interpret $\ell_{t}$ as a randomized smoothing of $\ell$, similar to convolving with a Gaussian. Then, from (1), we have that the updated ensemble's predictor can be written as

$F_{t+1}=F_{t}-\frac{1}{\lambda}\left(\nabla\ell(F_{t}+g_{t})-\lambda g_{t}\right).$

Now, note that

$\operatorname{\mathbb{E}}\left[\nabla\ell(F_{t}+g_{t})-\lambda g_{t}\right]=\nabla\ell_{t}(F_{t}).$

In other words, the above update can be seen as a stochastic functional gradient descent step on the smoothed loss function $\ell_{t}$.

Stagewise training with stacking initialization recovers accelerated functional gradient descent.

We now consider stagewise training where functions are initialized in a stacking-like fashion with $f_{t+1}^{0}=f_{t}$, which we will refer to as the stacking initialization. When the ensemble aggregation operator is addition, we have $F_{t+1}^{0}=f_{t}+F_{t}=F_{t}+(F_{t}-F_{t-1})$, and hence (1) implies that the updated ensemble's predictor is

$F_{t+1}=F_{t}+(F_{t}-F_{t-1})-\frac{1}{\lambda}\nabla\ell\big(F_{t}+(F_{t}-F_{t-1})\big).$

The above formula essentially describes Nesterov's accelerated gradient descent, which has the following update rule:

$F_{t+1}=F_{t}+\beta(F_{t}-F_{t-1})-\frac{1}{\lambda}\nabla\ell\big(F_{t}+\beta(F_{t}-F_{t-1})\big).$ (2)

Here, $\beta\in[0,1)$ is a constant that can depend on $t$. In fact, we can exactly recover Nesterov's accelerated gradient descent if we modify the stacking initialization to $f_{t+1}^{0}=\beta f_{t}$. Thus, stacking enables accelerated descent for training additive models.
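The equivalence can be checked numerically. The sketch below (again our own finite-sample illustration, using the prediction-vector view and a squared loss) maintains the additive ensemble explicitly, initializes each new component as $\beta f_{t}$, applies the update from Assumption 2.1, and confirms that the aggregated predictors coincide with the iterates of Nesterov's update (2):

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam, beta, T = 50, 4.0, 0.9, 30
y = rng.standard_normal(n)

def grad(F):
    """Gradient of 0.5 * ||F - y||^2 in the prediction-vector view."""
    return F - y

# Stagewise additive training with stacking initialization f_{t+1}^0 = beta * f_t.
fs = [np.zeros(n)]                               # start from the trivial ensemble f_1 = 0
F = np.zeros(n)                                  # F_1 = f_1
for t in range(T):
    f_init = beta * fs[-1]                       # stacking initialization
    F_init = F + f_init                          # equals F_t + beta * (F_t - F_{t-1})
    f_new = f_init - (1.0 / lam) * grad(F_init)  # early-stopped stage, Assumption 2.1
    fs.append(f_new)
    F = F + f_new

# Nesterov's accelerated update (2) applied directly to the aggregated predictor.
G_prev, G_cur = np.zeros(n), np.zeros(n)
for t in range(T):
    G_look = G_cur + beta * (G_cur - G_prev)
    G_prev, G_cur = G_cur, G_look - (1.0 / lam) * grad(G_look)

print(np.max(np.abs(F - G_cur)))                 # ~0: the two trajectories coincide
```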

When the ensemble aggregation operator is residual composition, stagewise training with the stacking initialization $f_{t+1}^{0}=f_{t}$ results in

$F_{t+1}^{0}=(I+f_{t})\circ F_{t}=F_{t}+f_{t}\circ F_{t}.$

Equation (1) therefore implies that the updated ensemble's predictor is

$F_{t+1}=F_{t}+f_{t}\circ F_{t}-\frac{1}{\lambda}\nabla\ell(F_{t}+f_{t}\circ F_{t}).$ (3)

In contrast, Nesterov's update rule (2), together with the fact that for residual compositional models

$F_{t}-F_{t-1}=(I+f_{t})\circ F_{t-1}-F_{t-1}=f_{t}\circ F_{t-1},$

yields the following equation for $F_{t+1}$:

$F_{t+1}=F_{t}+\beta\,f_{t}\circ F_{t-1}-\frac{1}{\lambda}\nabla\ell(F_{t}+\beta\,f_{t}\circ F_{t-1}).$ (4)

Comparing (3) and (4), barring the minor difference in the $\beta$ parameter, which can easily be rectified as in the additive case by setting $f_{t+1}^{0}=\beta f_{t}$, the major difference is that $f_{t}\circ F_{t}$ replaces $f_{t}\circ F_{t-1}$. Although possibly intractable to prove formally, we believe that the updates in (3) also provide an accelerated convergence rate, since we expect $F_{t-1}$ to be close to $F_{t}$ as the iterates converge to the optimal function.

In the following section, we show that in certain deep linear networks, the above intuition is indeed correct and provide a rigorous proof that stacking provides an accelerated convergence rate.

3 Accelerated convergence of deep linear networks by stacking

To demonstrate that stacking can provide a provably accelerated rate of convergence, we now turn to the narrower setting of training deep residual linear networks, which are fully connected feedforward neural networks with residual connections and without non-linear activations. Such networks are a common subject of study in the theory of deep learning [29, 22, 17]. As they have no non-linear components, deep linear networks effectively compute a linear function, albeit via a parametrization as a product of the weight matrices.

Setup.

Consider again the general supervised learning setting from Section 2 and suppose, as is often the case in modern neural networks, that examples consist of inputs $x\in\mathbb{R}^{d}$ and outputs $y\in\mathcal{Y}$. The loss function $\ell:\mathbb{R}^{d}\times\mathcal{Y}\to\mathbb{R}$ is assumed to be convex in the first argument. Let the samples be drawn from a distribution $D$ over $\mathbb{R}^{d}\times\mathcal{Y}$. Then the expected loss of the linear predictor $x\mapsto Wx$ for a matrix $W\in\mathbb{R}^{d\times d}$ is (with some abuse of notation) $\ell(W):=\operatorname{\mathbb{E}}_{(x,y)\sim D}[\ell(Wx,y)]$. In the following, we will assume the expected loss $\ell(W)$ is $L$-smooth and $\mu$-strongly convex in $W$, by which we mean that the following inequalities hold for any $W,V\in\mathbb{R}^{d\times d}$:

$\ell(W)+\langle\nabla\ell(W),V-W\rangle+\frac{\mu}{2}\lVert W-V\rVert^{2}\leq\ell(V)\leq\ell(W)+\langle\nabla\ell(W),V-W\rangle+\frac{L}{2}\lVert W-V\rVert^{2}.$

Here, for matrices $W,V\in\mathbb{R}^{d\times d}$, $\langle W,V\rangle=\operatorname{Tr}(W^{\top}V)$, and $\lVert W\rVert$ is the Frobenius norm of $W$. The condition number $\kappa$ of the loss is defined as $\kappa:=\frac{L}{\mu}$.

The deep residual neural networks we consider have $t$ layers with weight matrices $w_{1},w_{2},\ldots,w_{t}$, and the function they compute is $x\mapsto W_{t}x$, where

$W_{t}:=(I+w_{t})(I+w_{t-1})\cdots(I+w_{1}).$

Here, $I\in\mathbb{R}^{d\times d}$ is the identity matrix providing the residual connection. The expected loss of the neural network described above on the data is $\ell(W_{t})$.

Derivation of stacking updates.

Suppose we train the deep residual linear network described above using stacking initialization, but incorporating $\beta$-scaling: i.e., to train the $(t+1)$-th layer, its weight matrix is initialized to $w_{t+1}^{0}=\beta w_{t}$ for some constant $\beta\in[0,1]$, and then trained. Following the exact same steps as in the derivation of the stacking updates in the functional setting of Section 2, we end up with the following formula for $W_{t+1}$:

$W_{t+1}=W_{t}+\beta w_{t}W_{t}-\frac{1}{\lambda}\nabla\ell(W_{t}+\beta w_{t}W_{t}).$

When $W_{t-1}$ is non-singular, we have $w_{t}=W_{t}W_{t-1}^{-1}-I=(W_{t}-W_{t-1})W_{t-1}^{-1}$, so the above equation can be rewritten as

$W_{t+1}=W_{t}+\beta(W_{t}-W_{t-1})W_{t-1}^{-1}W_{t}-\frac{1}{\lambda}\nabla\ell\big(W_{t}+\beta(W_{t}-W_{t-1})W_{t-1}^{-1}W_{t}\big).$ (5)

As previously noted, (5) differs in form from Nesterov's AGD method, whose updates would be

$W_{t+1}=W_{t}+\beta(W_{t}-W_{t-1})-\frac{1}{\lambda}\nabla\ell\big(W_{t}+\beta(W_{t}-W_{t-1})\big).$ (6)
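For reference, a single step of each rule can be written as follows. This is a minimal sketch of our own; `grad_loss` stands for any oracle returning $\nabla\ell(W)$, and the step size is taken to be $1/L$ as in the analysis below:

```python
import numpy as np

def stacking_step(W_cur, W_prev, grad_loss, beta, L):
    """One stage of stacking-initialized stagewise training, Eq. (5)."""
    look = W_cur + beta * (W_cur - W_prev) @ np.linalg.inv(W_prev) @ W_cur
    return look - grad_loss(look) / L

def nesterov_step(W_cur, W_prev, grad_loss, beta, L):
    """One step of Nesterov's accelerated gradient method, Eq. (6)."""
    look = W_cur + beta * (W_cur - W_prev)
    return look - grad_loss(look) / L
```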

Accelerated convergence for stacking updates.

Despite the differences from Nesterov's method, we can show that stacking still yields a provably accelerated convergence rate. Let $W^{*}:=\arg\min_{W}\ell(W)$. Suppose that $W^{*}$ is non-singular, i.e., its smallest singular value satisfies $\sigma_{\min}(W^{*})>0$. Theorem 3.1 shows that as long as the first two layers are initialized so that $W_{1}$ and $W_{2}$ are close to optimal, stacking results in a suboptimality gap of $\exp(-\widetilde{\Omega}(T/\sqrt{\kappa}))$ after $T$ stages of stacking. This is of the same order as the rate obtained by Nesterov's acceleration; note that, in comparison, stagewise training with zero initialization would result in a suboptimality gap of $\exp(-\widetilde{\Omega}(T/\kappa))$.

Theorem 3.1.

Consider stagewise training with stacking initialization of a deep residual linear network in the setup described above, with $\beta=\tfrac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$ and $\lambda=L$ (so that the step size $\tfrac{1}{\lambda}$ equals the usual $\tfrac{1}{L}$). Suppose that the first layer weights are initialized so that $W_{1}=V_{0}-\frac{1}{L}\nabla\ell(V_{0})$, where $V_{0}\in\mathbb{R}^{d\times d}$ satisfies $\ell(V_{0})-\ell(W^{*})\leq\delta$ for $\delta\coloneqq\tfrac{\mu\alpha^{2}\sigma_{\min}(W^{*})^{2}}{256\beta^{2}d}$ and $\alpha^{-1}\coloneqq(\kappa-1)\sqrt{2\sqrt{\kappa}(\kappa-1)(\sqrt{\kappa}-3)}$. Then after $T$ stages of stacking, we have

$\ell(W_{T})-\ell(W^{*})\leq\exp(-\widetilde{\Omega}(T/\sqrt{\kappa})).$

The $\widetilde{\Omega}(\cdot)$ notation above hides polylogarithmic dependence on the problem parameters for clarity of presentation. Precise expressions can be found in the proof. The primary insight behind Theorem 3.1 is that Nesterov's accelerated gradient method is relatively robust to perturbations in its update rule. This robustness is formalized below in Lemma 3.2. The lemma is stated in a fairly general, standalone setting since it may be of independent interest.

Lemma 3.2 (Robustness of Nesterov’s accelerated gradient method).

Let $\mathcal{F}$ be a Hilbert space and $\ell:\mathcal{F}\to\mathbb{R}$ an $L$-smooth and $\mu$-strongly convex function to be minimized on $\mathcal{F}$. Consider the iterates $x_{0},y_{0},\dots,x_{T},y_{T}\in\mathcal{F}$ with $x_{0}=y_{0}$ chosen arbitrarily, and the update rules

$y_{t+1}\coloneqq x_{t}-\frac{1}{L}\nabla\ell(x_{t}),$
$x_{t+1}\coloneqq y_{t+1}+\beta(y_{t+1}-y_{t})+\Delta_{t+1},$ (7)

where $\beta:=\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$, $\kappa=\tfrac{L}{\mu}$, and $\Delta_{1},\Delta_{2},\dots,\Delta_{T-1}\in\mathcal{F}$ are error terms such that $\lVert\Delta_{t}\rVert\in O(\kappa^{-2}\lVert y_{t}-y_{t-1}\rVert)$ for all $t\in[T-1]$. Then the number of iterations needed to reach a suboptimality gap of $\varepsilon$ is of order $\widetilde{O}(\sqrt{\kappa}\log(\kappa/\varepsilon))$. Specifically, for any $T\geq 2$,

$\ell(y_{T})-\ell(x^{*})\leq\left(\tfrac{2\sqrt{\kappa}-2}{2\sqrt{\kappa}-1}\right)^{T}\left(\tfrac{L}{2}+\tfrac{8\sqrt{\kappa}\mu}{4(4\sqrt{\kappa}-3)}\right)\lVert x_{0}-x^{*}\rVert^{2}$

and

$\ell(y_{T})-\ell(x^{*})\leq\left(\tfrac{2\sqrt{\kappa}-2}{2\sqrt{\kappa}-1}\right)^{T}\tfrac{8\sqrt{\kappa}}{4\sqrt{\kappa}-3}\,(\ell(y_{0})-\ell(x^{*})).$

We will apply Lemma 3.2 using the correspondence $x_{0}=y_{0}=W_{0}=V_{0}$ and, for all $t\geq 1$, $y_{t}=W_{t}$ and $x_{t}=W_{t}+\beta(W_{t}-W_{t-1})W_{t-1}^{-1}W_{t}$. Lemma 3.2 implies that even though stagewise training with stacking differs from Nesterov's method, their similar form allows us to express the former as a perturbation of the latter. In particular, if we write our stacking update (5) for deep residual linear networks as a perturbation of Nesterov's method, as in (7), the perturbation term is exactly

$\Delta_{t}=\beta(W_{t}-W_{t-1})W_{t-1}^{-1}W_{t}-\beta(W_{t}-W_{t-1})$ (8)

for all $t\in[2,T-1]$. That is, if we examine the sequence of iterates $W_{1},\dots,W_{T}$ produced by stagewise training with stacking initialization, the term $\Delta_{t}$ measures the disagreement between the realized iterate $W_{t+1}$ and the iterate that Nesterov's method would prescribe conditioned on the iterates $W_{1},\dots,W_{t}$ from previous timesteps. We note that, even when these perturbation terms $\Delta_{1},\dots,\Delta_{T-1}$ are small in norm, it is possible for Nesterov's method to describe an iterate sequence that diverges significantly in norm from the iterates realized by stagewise training.

Rewriting (8) as $\Delta_{t}=\beta(W_{t}-W_{t-1})W_{t-1}^{-1}(W_{t}-W_{t-1})$, we immediately see that in order to satisfy the requirement of Lemma 3.2 that $\lVert\Delta_{t}\rVert\in O(\kappa^{-2}\lVert W_{t}-W_{t-1}\rVert)$, we simply need $\lVert\beta W_{t-1}^{-1}(W_{t}-W_{t-1})\rVert\in O(\kappa^{-2})$. This is satisfied when $W_{t}\approx W_{t-1}$ and $W_{t-1}$ is reasonably non-singular. A sufficient condition for this is that the iterates are sufficiently close to the ground-truth solution $W^{*}$ and $W^{*}$ is non-singular, which explains the conditions of Theorem 3.1.
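The perturbation term can be monitored directly along the iterates. A small sketch of this check (our own code; `W_cur` and `W_prev` denote consecutive iterates $W_{t}$ and $W_{t-1}$):

```python
import numpy as np

def perturbation(W_cur, W_prev, beta):
    """Delta_t = beta (W_t - W_{t-1}) W_{t-1}^{-1} (W_t - W_{t-1}), the rewritten Eq. (8)."""
    D = W_cur - W_prev
    return beta * D @ np.linalg.inv(W_prev) @ D

def satisfies_lemma_condition(W_cur, W_prev, beta, alpha):
    """Check ||Delta_t||_F <= alpha * ||W_t - W_{t-1}||_F, as required by Lemma 3.2."""
    D = W_cur - W_prev
    lhs = np.linalg.norm(perturbation(W_cur, W_prev, beta))
    return lhs <= alpha * np.linalg.norm(D)
```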

We now prove Theorem 3.1 formally.

Proof of Theorem 3.1.

As described above, the deep linear networks $W_{1},\dots,W_{T}$, as defined in (5), can be written as iterates of a variant of Nesterov's acceleration (7), where we set the gradient step size $\tfrac{1}{\lambda}$ to the usual $\tfrac{1}{L}$ and the stacking parameter $\beta$ to match the usual momentum parameter setting of $\tfrac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$.

Sufficient claim.

To prove the main result, it suffices to show that $\lVert\Delta_{t}\rVert\leq\alpha\lVert W_{t}-W_{t-1}\rVert$ for all $t\in[T-1]$. With this claim, Lemma 3.2 immediately gives a convergence rate of

$\ell(W_{T})-\ell(W^{*})\leq\left(\tfrac{2\sqrt{\kappa}-2}{2\sqrt{\kappa}-1}\right)^{T}\left(\tfrac{L}{2}+\tfrac{8\sqrt{\kappa}\mu}{4(4\sqrt{\kappa}-3)}\right)\lVert W_{1}-W^{*}\rVert^{2}.$

This gives the desired result $\ell(W_{T})-\ell(W^{*})\in\exp(-\Omega(\tfrac{T}{\sqrt{\kappa}}+\log(\tfrac{T}{\sqrt{\kappa}})))$ as

$\tfrac{1}{C}\left(\ell(W_{T})-\ell(W^{*})\right)\leq\left(1-\tfrac{1}{2\sqrt{\kappa}-1}\right)^{T}\leq\exp(-\Omega(\tfrac{T}{\sqrt{\kappa}}))$

where $C=\left(\tfrac{L}{2}+\tfrac{8\sqrt{\kappa}\mu}{4(4\sqrt{\kappa}-3)}\right)\lVert W_{1}-W^{*}\rVert^{2}\in\exp(-\Omega(\log(\tfrac{T}{\sqrt{\kappa}})))$.

Thus, we need to show that $\lVert\Delta_{t}\rVert\leq\alpha\lVert W_{t}-W_{t-1}\rVert$ for all $t\in[T-1]$. For this, we use the following claim, where $\eta\coloneqq\sqrt{\frac{16\delta}{\mu}}$:

Claim 3.3.

Suppose that $\lVert W_{t-1}-W^{*}\rVert\leq\eta$ and $\lVert W_{t}-W^{*}\rVert\leq\eta$. Then $\lVert\Delta_{t}\rVert\leq\alpha\lVert W_{t}-W_{t-1}\rVert$.

Proof of Claim 3.3.

We have

$\lVert\Delta_{t}\rVert=\lVert\beta(W_{t}-W_{t-1})(W_{t-1}^{-1}W_{t}-I)\rVert\leq\beta\lVert W_{t}-W_{t-1}\rVert\,\lVert W_{t-1}^{-1}W_{t}-I\rVert.$

We can further bound the right-most factor as follows:

$\lVert W_{t-1}^{-1}W_{t}-I\rVert=\lVert W_{t-1}^{-1}(W_{t}-W_{t-1})\rVert\leq\lVert W_{t-1}^{-1}\rVert\,\lVert W_{t}-W_{t-1}\rVert\leq\lVert W_{t-1}^{-1}\rVert\left(\lVert W_{t}-W^{*}\rVert+\lVert W_{t-1}-W^{*}\rVert\right),$ (9)

where the first inequality uses submultiplicativity of the norm and the second the triangle inequality.

The following fact shows that $W_{t-1}$ is indeed far from singular as long as it is close enough to $W^{*}$.

Fact 3.4.

$W_{t-1}$ is invertible and $\lVert W_{t-1}^{-1}\rVert\leq\frac{\sqrt{d}}{\sigma_{\min}(W^{*})-\eta}$.

Proof of Fact 3.4.

Since $\|\cdot\|$ is the Frobenius norm, we can upper bound the singular values of the matrix $W^{*}-W_{t-1}$ by $\sigma_{\max}(W^{*}-W_{t-1})\leq\eta$, where we use $\sigma_{\max}(W)$ to denote the largest singular value of a matrix $W$. We can then use Weyl's inequality to argue that, for any $i\in[d]$, the difference between the $i$-th largest singular values of $W_{t-1}$ and $W^{*}$ is upper bounded by

$\left|\sigma_{i}(W_{t-1})-\sigma_{i}(W^{*})\right|\leq\sigma_{\max}(W_{t-1}-W^{*})\leq\eta.$

The smallest singular value of $W_{t-1}$ is thus at least $\sigma_{\min}(W_{t-1})\geq\sigma_{\min}(W^{*})-\eta$. Since our choice of $\delta$ guarantees that $\eta<\sigma_{\min}(W^{*})$, we have $\sigma_{\min}(W_{t-1})>0$, so $W_{t-1}$ is invertible. This also implies that the largest singular value of $W_{t-1}^{-1}$ is at most $\sigma_{\max}(W_{t-1}^{-1})\leq\frac{1}{\sigma_{\min}(W^{*})-\eta}$. We can therefore upper bound the Frobenius norm of $W_{t-1}^{-1}$, as claimed, by

$\lVert W_{t-1}^{-1}\rVert\leq\sqrt{\sum_{i\in[d]}\sigma_{i}(W_{t-1}^{-1})^{2}}\leq\sqrt{d}\,\sigma_{\max}(W_{t-1}^{-1})\leq\frac{\sqrt{d}}{\sigma_{\min}(W^{*})-\eta}.$

Using Fact 3.4 and (9), we can now bound

$\lVert W_{t-1}^{-1}W_{t}-I\rVert\leq\frac{2\eta\sqrt{d}}{\sigma_{\min}(W^{*})-\eta}.$

Since $\eta=\sqrt{\frac{16\delta}{\mu}}$, we have by the definition of $\delta$ that $\frac{2\eta\sqrt{d}}{\sigma_{\min}(W^{*})-\eta}\leq\tfrac{\alpha}{\beta}$. This concludes our proof of the claim, as

$\lVert\Delta_{t}\rVert\leq\beta\,\lVert W_{t}-W_{t-1}\rVert\,\lVert W_{t-1}^{-1}W_{t}-I\rVert\leq\beta\cdot\tfrac{\alpha}{\beta}\,\lVert W_{t}-W_{t-1}\rVert.$

We can now complete the proof by showing the following claim via induction on $t$:

Claim 3.5.

For all $t\in[T-1]$, we have $\lVert\Delta_{t}\rVert\leq\alpha\lVert W_{t}-W_{t-1}\rVert$ and $\lVert W_{t}-W^{*}\rVert\leq\eta$.

Proof of Claim 3.5.

We proceed by induction on $t$.

Base case: $t=1$.

Note that $W_{0}=V_{0}$. We have

$\tfrac{\mu}{2}\lVert W_{0}-W^{*}\rVert^{2}\leq\ell(W_{0})-\ell(W^{*})\leq\delta,$

which implies that $\lVert W_{0}-W^{*}\rVert\leq\sqrt{\frac{2\delta}{\mu}}\leq\eta$. Furthermore, since $W_{1}=W_{0}-\frac{1}{L}\nabla\ell(W_{0})$ and $\ell$ is $L$-smooth, we have $\ell(W_{1})\leq\ell(W_{0})$, which implies, exactly as above, that $\lVert W_{1}-W^{*}\rVert\leq\eta$. Thus, by Claim 3.3, we conclude that $\lVert\Delta_{1}\rVert\leq\alpha\lVert W_{1}-W_{0}\rVert$.

Inductive step.

Fix $t\in[2,T-1]$ and assume, as our inductive hypothesis, that $\lVert\Delta_{\tau}\rVert\leq\alpha\lVert W_{\tau}-W_{\tau-1}\rVert$ and $\lVert W_{\tau}-W^{*}\rVert\leq\eta$ for all $\tau\in[t-1]$. Due to this inductive hypothesis, we can invoke Lemma 3.2 to observe that

$\tfrac{\mu}{2}\lVert W_{t}-W^{*}\rVert^{2}\leq\ell(W_{t})-\ell(W^{*})\leq\left(\tfrac{2\sqrt{\kappa}-2}{2\sqrt{\kappa}-1}\right)^{t}\tfrac{8\sqrt{\kappa}}{4\sqrt{\kappa}-3}\,(\ell(W_{0})-\ell(W^{*}))\leq 8\delta.$

Thus, we have $\lVert W_{t}-W^{*}\rVert\leq\sqrt{\frac{16\delta}{\mu}}=\eta$. Since $\lVert W_{t-1}-W^{*}\rVert\leq\eta$ by the induction hypothesis, we can now use Claim 3.3 to conclude that $\lVert\Delta_{t}\rVert\leq\alpha\lVert W_{t}-W_{t-1}\rVert$. This completes the inductive proof.

Theorem 3.1 is a local convergence result: it assumes that the first two layers put the network in the vicinity of the optimal solution $W^{*}$. We can also provide a global convergence result by adding a small warmup phase to stacking, that is, by training the first few stages (a number independent of $T$) of the deep linear network without stacking.

Corollary 3.6 (Corollary of Theorem 3.1).

Consider stagewise training of a deep residual linear network in the setup described above, with an initial warmup phase of zero initialization for $\widetilde{O}(\kappa)$ stages followed by stacking initialization for the remaining stages. Then after $T$ total stages, we have

$\ell(W_{T})-\ell(W^{*})\leq\exp(-\widetilde{\Omega}(T/\sqrt{\kappa})+\widetilde{O}(\sqrt{\kappa})).$
Proof.

We first recall the standard convergence rate of gradient descent. In the same setting as Lemma 3.2, where we minimize an $L$-smooth and $\mu$-strongly convex function on some Hilbert space $\mathcal{F}$, we can define gradient descent iterates with the update rule $x_{t+1}=x_{t}-\tfrac{1}{L}\nabla\ell(x_{t})$. For the resulting sequence of iterates $x_{0},\dots,x_{T}$, it is known that $\ell(x_{T})-\ell(x^{*})\leq\exp(-T/\kappa)(\ell(x_{0})-\ell(x^{*}))$; see, e.g., [3].

We further recall that, in the stagewise training of deep residual linear networks, we can write the iterates as $W_{t+1}=W_{t}+w_{t+1}^{0}W_{t}-\tfrac{1}{L}\nabla\ell(W_{t}+w_{t+1}^{0}W_{t})$. In the initial stages, where new layers are initialized with zero weights, i.e., $w_{t+1}^{0}=0$, we recover, as mentioned in Section 2, gradient descent on $\mathbb{R}^{d\times d}$:

$W_{t+1}=W_{t}-\tfrac{1}{L}\nabla\ell(W_{t}).$

Putting these two pieces together, we have $\ell(W_{t})-\ell(W^{*})\leq\exp(-\tfrac{t-1}{\kappa})(\ell(W_{1})-\ell(W^{*}))$. By setting

$T_{0}\geq\kappa\log\left(\tfrac{\ell(W_{1})-\ell(W^{*})}{\delta}\right)+2,$

we have $\ell(W_{T_{0}-1})-\ell(W^{*})\leq\delta$. So by setting $V_{0}=W_{T_{0}-1}$, and noting that $W_{T_{0}}=W_{T_{0}-1}-\frac{1}{L}\nabla\ell(W_{T_{0}-1})$, we can now apply the local convergence result of Theorem 3.1 and conclude that performing $T^{\prime}$ rounds of stacking after $T_{0}$ rounds of warmup leads to the desired loss bound

$\ell(W_{T_{0}+T^{\prime}})-\ell(W^{*})\leq\exp(-\widetilde{\Omega}(T^{\prime}/\sqrt{\kappa}))=\exp(-\widetilde{\Omega}(T/\sqrt{\kappa})+O(T_{0}/\sqrt{\kappa}))=\exp(-\widetilde{\Omega}(T/\sqrt{\kappa})+\widetilde{O}(\sqrt{\kappa})).$
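A minimal sketch of the resulting schedule (our own illustration; `grad_loss` is any gradient oracle for $\ell$, and `T0` is the number of warmup stages, of order $\widetilde{O}(\kappa)$ in the corollary):

```python
import numpy as np

def train_with_warmup(W0, grad_loss, L, beta, T0, T):
    """T0 warmup stages with zero initialization (plain GD), then stacking stages, Eq. (5)."""
    W_prev = W0
    W_cur = W0 - grad_loss(W0) / L              # first stage is always a plain gradient step
    for t in range(1, T):
        if t < T0:                              # warmup: zero initialization recovers GD
            W_next = W_cur - grad_loss(W_cur) / L
        else:                                   # stacking stage with beta-scaled initialization
            look = W_cur + beta * (W_cur - W_prev) @ np.linalg.inv(W_prev) @ W_cur
            W_next = look - grad_loss(look) / L
        W_prev, W_cur = W_cur, W_next
    return W_cur
```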

Key ingredient of proof: robustness of Nesterov’s accelerated gradient descent method.

We now turn to proving Lemma 3.2.

Proof of Lemma 3.2.

In this proof, we will assume that the norm of the perturbation term is bounded by

$\|\Delta_{t}\|\leq\alpha\|y_{t}-y_{t-1}\|,$ (10)

where $\alpha$ is defined in the same way as in Theorem 3.1, namely

$\alpha^{-1}\coloneqq(\kappa-1)\sqrt{2\sqrt{\kappa}(\kappa-1)(\sqrt{\kappa}-3)}.$

We define the coefficients $\tau:=\frac{1}{\sqrt{\kappa}+1}$, $\gamma:=\frac{1}{\sqrt{\kappa}-1}$, and $\rho:=\frac{\mu\gamma}{4(4+\gamma)}=\frac{\mu}{4(4\sqrt{\kappa}-3)}$. Here, $\tau$ can be understood as a momentum parameter, $\gamma$ as (proportional to) the rate of the exponential convergence curve, and $\rho$ as a penalty for the presence of the perturbations $\Delta_{t}$. As is done in all proofs of Nesterov's method, we will define the iterates

$z_{t}:=\tfrac{1}{\tau}x_{t}-\tfrac{1-\tau}{\tau}y_{t}$

for all $t\in[T]$; we refer interested readers to [36, 2] for interpretations of these iterates. We note that although our definition of $z_{T}$ depends on $x_{T}$, and by extension on $\Delta_{T}$, the lemma's claim about $y_{T}$ does not depend on $\Delta_{T}$ at all. Thus, we are free to set $\Delta_{T}=0$, guaranteeing that $\lVert\Delta_{T}\rVert\leq\alpha\lVert y_{T}-y_{T-1}\rVert$.

To prove the statement of this lemma, it suffices to show the following claim.

Claim 3.7.

The following function $\Phi$ is a potential function:

$\Phi(t)=(1+\tfrac{\gamma}{2})^{t}\left[(\ell(y_{t})-\ell(x^{*}))+\tfrac{\mu}{2}\|z_{t}-x^{*}\|^{2}+2\rho\|y_{t}-x^{*}\|^{2}\right].$ (11)

That is, for all $t\in[T]$, $\Phi(t)-\Phi(t-1)\leq 0$.

It suffices to show Claim 3.7 because the consequence $\Phi(T)\leq\Phi(0)$ directly implies the lemma's first statement via

$(1+\tfrac{\gamma}{2})^{T}(\ell(y_{T})-\ell(x^{*}))\leq\ell(x_{0})-\ell(x^{*})+(\tfrac{\mu}{2}+2\rho)\lVert x_{0}-x^{*}\rVert^{2}\leq(\tfrac{L+\mu}{2}+2\rho)\lVert x_{0}-x^{*}\rVert^{2},$

with the last inequality following from the $L$-smoothness of $\ell$. The second statement follows similarly via

$(1+\tfrac{\gamma}{2})^{T}(\ell(y_{T})-\ell(x^{*}))\leq\ell(x_{0})-\ell(x^{*})+(\tfrac{\mu}{2}+2\rho)\lVert x_{0}-x^{*}\rVert^{2}\leq(2+\tfrac{4\rho}{\mu})(\ell(x_{0})-\ell(x^{*})).$

We therefore turn to showing Claim 3.7.

First, we note that the potential function (11) differs from the usual potential function used to prove Nesterov's acceleration, which we denote by $\Phi_{\mathrm{orig}}$:

$\Phi_{\mathrm{orig}}(t)=(1+\gamma)^{t}\left[(\ell(y_{t})-\ell(x^{*}))+\tfrac{\mu}{2}\lVert z_{t}-x^{*}\rVert^{2}\right].$

Fact 3.8 says, roughly, that for any $t\in[T]$ we could recover a large part of the usual proof of Nesterov's acceleration by bounding the difference $\Phi_{\mathrm{orig}}(t)-\Phi_{\mathrm{orig}}(t-1)\leq 0$, if only we could remove the perturbation at timestep $t$, i.e., set $\Delta_{t}=0$ (while keeping the perturbations $\Delta_{1},\dots,\Delta_{t-1}$ from previous timesteps in place).

Fact 3.8.

Fix any $t\in[T]$ and let $\widetilde{z}_{t}\coloneqq\tfrac{1}{\tau}(y_{t}+\beta(y_{t}-y_{t-1}))-\tfrac{1-\tau}{\tau}y_{t}$. Then,

$(1+\gamma)(\ell(y_{t})-\ell(x^{*}))-(\ell(y_{t-1})-\ell(x^{*}))+\tfrac{\mu}{2}\left((1+\gamma)\|\widetilde{z}_{t}-x^{*}\|^{2}-\|z_{t-1}-x^{*}\|^{2}\right)\leq 0.$ (12)

Next, in Fact 3.9, we show that since $\Phi_{\mathrm{orig}}(t)\leq\Phi_{\mathrm{orig}}(t-1)$ holds when the perturbation $\Delta_{t}$ is ignored, we can also argue that $\Phi(t)\leq\Phi(t-1)$ when the perturbation $\Delta_{t}$ is taken into account. That is, Fact 3.9 shows that the left-hand side of (12) upper bounds $\Phi(t)-\Phi(t-1)$, so long as our stated assumption on the perturbation norm $\lVert\Delta_{t}\rVert$ holds.

Fact 3.9.

For any $t\in[T]$, the left-hand side of (12) upper bounds $\Phi(t)-\Phi(t-1)$; equivalently,

$(1+\tfrac{\gamma}{2})\left(\tfrac{\mu}{2}\lVert z_{t}-x^{*}\rVert^{2}+2\rho\lVert y_{t}-x^{*}\rVert^{2}\right)-2\rho\lVert y_{t-1}-x^{*}\rVert^{2}\leq\tfrac{\gamma}{2}(\ell(y_{t})-\ell(x^{*}))+(1+\gamma)\tfrac{\mu}{2}\lVert\widetilde{z}_{t}-x^{*}\rVert^{2}.$ (13)

Plugging Fact 3.9's inequality (13) into Fact 3.8's inequality (12), we recover our main claim that $\Phi(t)-\Phi(t-1)\leq 0$. We conclude by turning to prove Facts 3.8 and 3.9.

Proof of Fact 3.8.

The proof of this fact closely follows [3]'s proof of Nesterov's accelerated gradient method in the smooth and strongly convex setting.

We begin by upper bounding the first summand on the left-hand side of (12). Since $\ell$ is smooth and $y_{t}$ is a gradient step on $\ell$ from $x_{t-1}$, we have $\ell(y_{t})\leq\ell(x_{t-1})-\frac{1}{2L}\lVert\nabla\ell(x_{t-1})\rVert^{2}$ and thus

$(1+\gamma)(\ell(y_{t})-\ell(x^{*}))-(\ell(y_{t-1})-\ell(x^{*}))\leq\ell(x_{t-1})-\ell(y_{t-1})+\gamma(\ell(x_{t-1})-\ell(x^{*}))-\frac{1+\gamma}{2L}\lVert\nabla\ell(x_{t-1})\rVert^{2}.$ (14)

Using the $\mu$-strong convexity of $\ell$, we can further bound part of the right-hand side by

$\ell(x_{t-1})-\ell(y_{t-1})+\gamma(\ell(x_{t-1})-\ell(x^{*}))\leq\langle\nabla\ell(x_{t-1}),x_{t-1}-y_{t-1}\rangle+\gamma\left(\langle\nabla\ell(x_{t-1}),x_{t-1}-x^{*}\rangle-\frac{\mu}{2}\lVert x_{t-1}-x^{*}\rVert^{2}\right).$ (15)

Plugging (15) and the identity $z_{t-1}=\frac{1}{\tau}x_{t-1}-\frac{1-\tau}{\tau}y_{t-1}$ into (14), direct algebra yields an upper bound on the first summand of (12):

$(1+\gamma)(\ell(y_{t})-\ell(x^{*}))-(\ell(y_{t-1})-\ell(x^{*}))\leq\frac{1}{1+\gamma}\langle\nabla\ell(x_{t-1}),\gamma(z_{t-1}-x^{*})+\gamma^{2}(x_{t-1}-x^{*})\rangle-\frac{\mu\gamma}{2}\lVert x_{t-1}-x^{*}\rVert^{2}-\frac{1+\gamma}{2L}\lVert\nabla\ell(x_{t-1})\rVert^{2}.$ (16)

Next, we turn to upper bounding the second summand in (12). Plugging the iterate definitions $y_{t}=x_{t-1}-\frac{1}{L}\nabla\ell(x_{t-1})$ and $z_{t-1}=\frac{1}{\tau}x_{t-1}-\frac{1-\tau}{\tau}y_{t-1}$ into our definition of $\widetilde{z}_{t}=\frac{1}{\tau}\left((1-\tau)y_{t}-(1-2\tau)y_{t-1}\right)$, we can recover the identity

$\widetilde{z}_{t}=\frac{1}{1+\gamma}z_{t-1}+\frac{\gamma}{1+\gamma}x_{t-1}-\frac{\gamma}{\mu(1+\gamma)}\nabla\ell(x_{t-1}).$

Plugging this identity for $\widetilde{z}_{t}$ into the expression $\tfrac{\mu}{2}\left((1+\gamma)\|\widetilde{z}_{t}-x^{*}\|^{2}-\|z_{t-1}-x^{*}\|^{2}\right)$, direct algebra yields the identity

$\tfrac{\mu}{2}\left((1+\gamma)\|\widetilde{z}_{t}-x^{*}\|^{2}-\|z_{t-1}-x^{*}\|^{2}\right)=\frac{1+\gamma}{2L}\lVert\nabla\ell(x_{t-1})\rVert^{2}-\frac{1}{1+\gamma}\langle\nabla\ell(x_{t-1}),\gamma(z_{t-1}-x^{*})+\gamma^{2}(x_{t-1}-x^{*})\rangle+\frac{\mu\gamma}{2}\lVert x_{t-1}-x^{*}\rVert^{2}-\frac{\mu\gamma}{2(1+\gamma)}\lVert z_{t-1}-x_{t-1}\rVert^{2}.$ (17)

Summing (16) and (17) yields the following upper bound on the left-hand side of (12):

$(1+\gamma)(\ell(y_{t})-\ell(x^{*}))-(\ell(y_{t-1})-\ell(x^{*}))+\tfrac{\mu}{2}\left((1+\gamma)\|\widetilde{z}_{t}-x^{*}\|^{2}-\|z_{t-1}-x^{*}\|^{2}\right)\leq-\frac{\mu\gamma}{2(1+\gamma)}\lVert z_{t-1}-x_{t-1}\rVert^{2}\leq 0.$

Proof of Fact 3.9.

To show this fact, we can observe from direct algebra that it suffices to prove the following inequalities:

$\tfrac{\gamma}{2}(\ell(y_{t})-\ell(x^{*}))\geq(2+\tfrac{\gamma}{2})\cdot 2\rho\lVert y_{t}-x^{*}\rVert^{2},$ (18)

$(1+\gamma)\lVert\widetilde{z}_{t}-x^{*}\rVert^{2}\geq(1+\tfrac{\gamma}{2})\lVert z_{t}-x^{*}\rVert^{2}-\tfrac{4\rho}{\mu}(\lVert y_{t-1}-x^{*}\rVert^{2}+\lVert y_{t}-x^{*}\rVert^{2}).$ (19)

The inequality (18) follows directly from the $\mu$-strong convexity of $\ell$ and the identity $\rho=\frac{\mu\gamma}{4(4+\gamma)}$, as

$\tfrac{\gamma}{2}(\ell(y_{t})-\ell(x^{*}))\geq\tfrac{\mu\gamma}{4}\|y_{t}-x^{*}\|^{2}=(2+\tfrac{\gamma}{2})\cdot 2\rho\|y_{t}-x^{*}\|^{2}.$

To show (19), we first recall the general fact that $c(1+\delta)\|v\|^{2}\geq c\|v+w\|^{2}-c(1+\frac{1}{\delta})\|w\|^{2}$ for any $\delta,c>0$. Choosing $v=\widetilde{z}_{t}-x^{*}$, $w=z_{t}-\widetilde{z}_{t}$, $c=1+\tfrac{\gamma}{2}$, and $\delta=\frac{\gamma}{\gamma+2}>0$, this yields

$(1+\gamma)\|\widetilde{z}_{t}-x^{*}\|^{2}\geq(1+\tfrac{\gamma}{2})\|z_{t}-x^{*}\|^{2}-(2+\tfrac{2}{\gamma})(1+\tfrac{\gamma}{2})\|\tfrac{1}{\tau}\Delta_{t}\|^{2}.$ (20)

Using the perturbation norm bound, we can therefore bound

$(2+\tfrac{2}{\gamma})(1+\tfrac{\gamma}{2})\|\tfrac{1}{\tau}\Delta_{t}\|^{2}\leq\tfrac{2\rho}{\mu}\|y_{t}-y_{t-1}\|^{2}\leq\tfrac{4\rho}{\mu}(\|y_{t}-x^{*}\|^{2}+\|y_{t-1}-x^{*}\|^{2}),$

where the last inequality uses the elementary bound $\|a+b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2}$; plugged into (20), this yields (19) as desired. ∎

4 Experiments

In this section we provide some proof-of-concept experiments to validate our theoretical results.

4.1 Deep Linear Networks and Squared Losses

As our main theoretical results in Section 3 apply to the case of deep linear networks, we consider the same function class in our experiments on synthetic data with the squared loss. Formally, the output space is $\mathcal{Y}=\mathbb{R}^{d}$, and for a predictor $x\mapsto Wx$ we consider the loss

$\ell(W)=\mathbb{E}_{(x,y)\sim D}\left[\tfrac{1}{2}\|Wx-y\|^{2}\right].$ (21)

We consider a data distribution $D$ where the samples $(x,y)$ are drawn as follows. Let $W^{*}$ be the ground-truth positive definite matrix and let $\Sigma$ be the data covariance matrix. We first sample $x\sim N(0,\Sigma)$, and then, conditioned on $x$, the output $y$ is generated as $y=W^{*}x+\xi$, where $\xi$ is a mean-zero random variable. We can then write the expected squared loss explicitly as

$\ell(W)=\mathbb{E}_{(x,y)\sim D}\left[\tfrac{1}{2}\|Wx-y\|^{2}\right]=\tfrac{1}{2}\operatorname{Tr}\left((W-W^{*})\Sigma(W-W^{*})^{\top}\right)+\tfrac{1}{2}\operatorname{\mathbb{E}}\left[\|\xi\|^{2}\right].$ (22)

Note that for the case of the squared loss described above, the condition number of the expected loss depends on the covariance matrix $\Sigma$: $\kappa=\frac{\sigma_{\max}(\Sigma)}{\sigma_{\min}(\Sigma)}$.

Stacking updates.

For the specific case of the squared loss, we get the following closed-form expression for the stacking updates (see Eq. (5)):

$W_{t+1}=\left(W_{t}+\beta(W_{t}-W_{t-1})W_{t-1}^{-1}W_{t}\right)\left(I-\tfrac{1}{L}\Sigma\right)+\tfrac{1}{L}W^{*}\Sigma.$ (23)

Here $L$ is the smoothness constant of the loss, which equals the largest singular value of $\Sigma$: $L=\sigma_{\max}(\Sigma)$.

Nesterov’s updates.

Similarly, we get the following closed-form expression for Nesterov's updates (see Eq. (6)) in the case of the squared loss:

$W_{t+1}=\left(W_{t}+\beta(W_{t}-W_{t-1})\right)\left(I-\tfrac{1}{L}\Sigma\right)+\tfrac{1}{L}W^{*}\Sigma.$ (24)

For both the stacking and Nesterov's updates we set $\beta=\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$.

4.2 Synthetic data experiments

We compare the performance of the three updates: vanilla gradient descent, the stacking updates (Eq. 23), and the exact Nesterov updates (Eq. 24). Here, at each stacking stage only the last layer is updated, which matches our theoretical setup faithfully. Later in this section, we also consider the effect of training all the layers in each stacking stage, which is closer to how stacking is applied in practice.

We consider points in $d=20$ dimensions. We generate the ground truth $W^{*}$ to be of the form $I+\sigma S$, where $S$ is a random positive semi-definite matrix of spectral norm $1$ and $\sigma$ is a parameter controlling the closeness of $W^{*}$ to the identity. For a given $\kappa>1$, we generate a random covariance matrix $\Sigma$ with condition number $\kappa$. Finally, we sample the noise $\xi$ in the output from a mean-zero Gaussian with standard deviation $0.1$.
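A compact reproduction sketch of this setup (our own code; the exact construction of $\Sigma$ for a target $\kappa$ and the choice of starting point $W_{0}=I$ are our assumptions) generates the data as described and runs vanilla gradient descent, the stacking updates (23), and Nesterov's updates (24):

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, kappa, T = 20, 0.5, 100.0, 60

# Ground truth W* = I + sigma * S, with S a random PSD matrix of spectral norm 1.
A = rng.standard_normal((d, d))
S = A @ A.T
S /= np.linalg.norm(S, 2)
W_star = np.eye(d) + sigma * S

# Random covariance matrix with condition number kappa (eigenvalues in [1, kappa]).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
eigs = np.linspace(1.0, kappa, d)
Sigma = Q @ np.diag(eigs) @ Q.T
L = eigs.max()                                  # smoothness = sigma_max(Sigma)
beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)

def loss(W):
    """Expected squared loss, Eq. (22), dropping the constant noise term."""
    D = W - W_star
    return 0.5 * np.trace(D @ Sigma @ D.T)

def grad(W):
    """Gradient of the expected squared loss: (W - W*) Sigma."""
    return (W - W_star) @ Sigma

gd       = lambda W, Wp: W - grad(W) / L
nesterov = lambda W, Wp: (W + beta * (W - Wp)) @ (np.eye(d) - Sigma / L) + W_star @ Sigma / L
stacking = lambda W, Wp: (W + beta * (W - Wp) @ np.linalg.inv(Wp) @ W) @ (np.eye(d) - Sigma / L) \
                         + W_star @ Sigma / L

def run(update, T):
    W_prev = np.eye(d)                          # W_0 = I: nonsingular starting point
    W_cur = W_prev - grad(W_prev) / L           # first stage is a plain gradient step
    for _ in range(T - 1):
        W_prev, W_cur = W_cur, update(W_cur, W_prev)
    return loss(W_cur)

print(run(gd, T), run(stacking, T), run(nesterov, T))
```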

Figure 4: Mean squared error (MSE) vs. number of stacking stages. We observe that as the data becomes more ill conditioned both the stacking updates and Nesterov’s updates demonstrate faster convergence than vanilla gradient descent.

Figure 4 shows the performance of the three types of updates as the problem becomes more ill-conditioned, i.e., as a function of $\kappa$. As expected, at small values of the condition number there is no advantage of the stacking updates over vanilla gradient descent. However, for ill-conditioned data the stacking updates converge much faster than gradient descent. We also observe that the convergence of the stacking updates mirrors very closely the convergence behavior of the exact Nesterov's updates.

To further understand the relationship between the stacking updates and Nesterov's updates, Figure 5 shows the performance of the two as the distance of $W^{*}$ from the identity increases. As can be seen from the figure, when $W^{*}$ is farther from the identity the stacking updates behave qualitatively differently from Nesterov's updates during the initial phase: the loss for the stacking updates explodes before converging in later stages. This suggests that in practice there may be a better way to initialize a stacking stage by making the initialization closer to the ideal Nesterov's updates. While in the case of deep linear networks and the squared loss we have a closed-form expression for such an initialization, in general this is a hard problem.

Figure 5: Mean squared error (MSE) vs. number of stacking stages. The figure compares the stacking updates and Nesterov's updates as $W^{*}$ becomes farther from the identity, i.e., as $\sigma$ increases. We observe that for higher values of $\sigma$ the stacking updates display diverging behavior in the initial stages.

Next we consider the case where in each stacking stage we train all the layers of the deep linear network. We use the same data generation procedure as described above. We perform $10$ stages of stacking, where in each stage we perform $2$ steps of gradient descent with a learning rate of $1/L$, where $L$ is the smoothness of the loss function. We train on $1024$ examples with a batch size of $32$ and test on $1024$ examples.

We consider two types of stacking-based initialization schemes. The first, Stacking Init., initializes the next layer's weight matrix $w^{0}_{t+1}$ as $\beta w_{t}$. The second, Nesterov Init., initializes $w^{0}_{t+1}$ such that we recover the precise Nesterov's updates at initialization, i.e., Eq. (24). From the analysis in Section 2, the initialization that achieves this amounts to setting $w^{0}_{t+1}$ to $\beta w_{t}(I+w_{t})^{-1}$.
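Concretely, the two initializations of the newly added layer can be computed as follows (a minimal sketch; the function names are ours):

```python
import numpy as np

def stacking_init(w_t, beta):
    """Stacking Init.: copy the top layer's weights, scaled by beta."""
    return beta * w_t

def nesterov_init(w_t, beta):
    """Nesterov Init.: choose the new layer so that the first update of the
    new stage matches the exact Nesterov update (Eq. (24)) at initialization."""
    d = w_t.shape[0]
    return beta * w_t @ np.linalg.inv(np.eye(d) + w_t)
```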

Figure 6 shows the performance of the two stacking initialization schemes as compared to the random baseline, where we initialize the next layer's weight matrix randomly. We again observe that both stacking schemes outperform the baseline, particularly when the data is ill-conditioned.

Figure 6: Mean squared error (MSE) vs. number of stacking stages when training all the layers.

4.3 Stacking for BERT Base with $\beta$ parameters

Figure 7: Stacking initialization with a trainable $\beta$ parameter multiplying the output of the newly added transformer block. Experimental runs with $\beta$ initialized to $0.99$ and $0.9$ are provided.

The theory developed in Section 2 requires the initialization at the $(t+1)$-th stage to be $f_{t+1}^{0}=\beta f_{t}$ for some $\beta\in[0,1)$. The introduction of $\beta$ is crucial to obtain the accelerated convergence rate of Nesterov's method, but the standard stacking initialization doesn't use a $\beta$ parameter. We performed sanity-check experiments on BERT Base to ensure that introducing the $\beta$ parameter doesn't affect the efficacy of stacking. We introduced a trainable parameter, $\beta$, that multiplies the output of the newly added transformer block in stacking, initialized to the values $0.9$ and $0.99$, which are standard settings for momentum parameters. Figure 7 shows that the introduction of the $\beta$ parameter doesn't hurt the efficacy of stacking. The plot also shows that the final log perplexity improves slightly when using a trainable $\beta$.
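A minimal sketch of this modification is given below; it is not the authors' implementation, and the transformer block itself is abstracted into an arbitrary callable `block`. Following the wording above, the trainable scalar $\beta$ simply multiplies the output of the newly added block:

```python
import copy

class ScaledNewBlock:
    """Wraps a newly stacked block with a trainable scalar beta that multiplies
    the block's output (illustrative sketch; `block` is any callable layer)."""

    def __init__(self, block, beta_init=0.99):
        self.block = block
        self.beta = beta_init  # trained jointly with the block's parameters

    def __call__(self, x):
        # The new block's output is scaled by beta before being passed on.
        return self.beta * self.block(x)

def grow_by_one_block(blocks, beta_init=0.99):
    """Gradual stacking step (sketch): copy the top block as the stacking
    initialization and append it, wrapped with the trainable beta."""
    new_block = copy.deepcopy(blocks[-1])
    blocks.append(ScaledNewBlock(new_block, beta_init))
    return blocks
```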

5 Conclusions and Future Work

This paper develops the theoretical perspective that the effectiveness of stacking initialization, compared to other forms of initialization such as zero or random initialization, stems from the fact that it enables a form of accelerated gradient descent in function space. There are several directions for future work. While this work provides a formal proof of accelerated convergence for a particular parametric setting (deep residual linear networks), such a proof in the general functional setting for deep residual networks remains open and will likely require additional assumptions. From a practical standpoint, a very intriguing and potentially impactful question is whether it is possible to come up with an efficiently implementable initialization scheme that leads to Nesterov's AGD updates exactly for deep residual networks.
