Theory on Forgetting and Generalization of Continual Learning

Sen Lin
Department of ECE
The Ohio State University
lin.4282@osu.edu

Peizhong Ju
Department of ECE
The Ohio State University
ju.171@osu.edu

Yingbin Liang
Department of ECE
The Ohio State University
liang.889@osu.edu

Ness Shroff
Department of ECE
The Ohio State University
shroff.11@osu.edu

Equal Contribution

Abstract

Continual learning (CL), which aims to learn a sequence of tasks, has attracted significant recent attention. However, most work has focused on the experimental performance of CL, and theoretical studies of CL are still limited. In particular, there is a lack of understanding on what factors are important and how they affect “catastrophic forgetting” and generalization performance. To fill this gap, our theoretical analysis, under overparameterized linear models, provides the first-known explicit form of the expected forgetting and generalization error. Further analysis of such a key result yields a number of theoretical explanations about how overparameterization, task similarity, and task ordering affect both forgetting and generalization error of CL. More interestingly, by conducting experiments on real datasets using deep neural networks (DNNs), we show that some of these insights even go beyond the linear models and can be carried over to practical setups. In particular, we use concrete examples to show that our results not only explain some interesting empirical observations in recent studies, but also motivate better practical algorithm designs of CL.

1 Introduction

Continual learning (CL) [41] is a learning paradigm where an agent needs to continuously learn a sequence of tasks. To resemble the extraordinary lifelong learning capability of human beings, the agent is expected to learn new tasks more easily based on accumulated knowledge from old tasks, and further improve the learning performance of old tasks by leveraging the knowledge of new tasks. The former is referred to as forward knowledge transfer and the latter as backward knowledge transfer. One major challenge herein is the so-called catastrophic forgetting [36], i.e., the agent easily forgets the knowledge of old tasks when learning new tasks.

Although there have been significant efforts in experimental studies (e.g., [27, 14, 50, 16, 17]) to address the forgetting issue, the theoretical understanding of CL is still at an early stage, and only a few attempts have emerged recently, e.g., [49, 12, 16, 17] (see Section 2 for a more detailed discussion of previous theoretical studies of CL). However, none of these existing theoretical results provides an explicit characterization of forgetting and generalization error that depends only on fundamental system parameters/setups (e.g., number of tasks/samples/parameters, noise level, task similarity/order). Our work provides the first-known explicit theoretical result of this kind, which enables us to comprehensively understand which factors are relevant and how they (precisely) affect forgetting and generalization error of CL.

Our main contributions can be summarized as follows.

First, we provide theoretical results on the expected value of forgetting and overall generalization error in CL, under a linear regression setup with i.i.d. Gaussian features and noise. Our results are in an explicit form that captures a clear dependency on various system parameters/setups. Note that analyzing overparameterized linear models is important in its own right and, as demonstrated in many recent works, is also a first step towards understanding the generalization performance of DNNs, e.g., [9, 5, 23, 40, 21].

Second, we investigate the impact of overparameterization, task similarity, and task ordering on both forgetting and generalization error of CL, which reveals the following important insights: 1) Both forgetting and generalization error can benefit from more parameters in the overparameterized regime. Moreover, benign overfitting exists and is easier to observe with large noise and/or low task similarity. 2) In terms of the impact of task similarity, we show that the generalization error always decreases when tasks become more similar, whereas this ‘monotonicity’ does not always hold for forgetting. Surprisingly, forgetting can even decrease when tasks are less similar under certain scenarios. 3) In order to minimize forgetting, the optimal task order should diversify the learning tasks in the early stage and learn more dissimilar tasks adjacently. This is also corroborated by some special scenarios where the tasks can be divided into multiple categories, and the optimal task order therein alternates between tasks from different categories.

Last but not least, we show that our findings for the linear models are applicable to and can also guide the algorithm designs for CL in practice, by conducting experiments on real datasets with DNNs. Specifically, our analysis of the impact of task similarity is clearly corroborated by the experimental results, which further sheds light on the recent observations [42, 30, 17] that ‘intermediate task similarity’ leads to the worst forgetting in the two-task setup. Experimental results on the impact of task ordering are also consistent with our findings in linear models. More interestingly, inspired by our analysis of knowledge transfer in linear models, we slightly modify a previous method [33] on leveraging task correlation to facilitate forward knowledge transfer, and show that better performance can be achieved by counting more on fresher old tasks. These encouraging results corroborate the benefits of studying the overparameterized linear models to fundamentally demystify CL.

2 Related Work

Empirical studies in CL. CL has attracted much attention in the past decade, and a large number of empirical methods have been proposed to address catastrophic forgetting. In general, the existing methods can be divided into three categories: (1) Regularization-based methods (e.g., [27, 1, 34]), which regularize the modifications on the weights that are important to old tasks when learning the new task; (2) Parameter-isolation based methods (e.g., [46, 50, 48]), which learn a mask to fix the weights that are important to old tasks during the new task learning and further expand the neural network when needed; (3) Memory-based methods, which either store and replay data of old tasks when learning the new task, i.e., experience-replay based methods (e.g., [14, 43, 22]), or store the gradient information of old tasks and learn the new task in the orthogonal direction to old tasks without data replay, i.e., orthogonal-projection based methods (e.g., [18, 44, 33]).

Theoretical studies in CL. [12] and [16] analyzed generalization error and forgetting for the orthogonal gradient descent (OGD) approach [18] based on NTK models, and further proposed variants of OGD to address forgetting. [49] proposed a unified framework for the performance analysis of regularization-based CL methods, by formulating them as a second-order Taylor approximation of the loss function for each task. [4] and [30] studied CL in the teacher-student setup to characterize the impact of task similarity on forgetting performance. [13] and [31] investigated continual representation learning with dynamically expanding feature spaces, and developed provably efficient CL methods with a characterization of the sample complexity. [15] characterized the lower bound of memory in CL using the PAC framework. By investigating the information flow between neural network layers, [2] analyzed the selection of frozen filters based on layer sensitivity to maximize the performance of CL. Nevertheless, none of these existing works shows an explicit form of forgetting and generalization error that depends only on fundamental system parameters/setups (e.g., number of tasks/samples/parameters, noise level, task similarity/order). In contrast, our work is the first to provide such an explicit theoretical result, which enables us to comprehensively understand which factors affect the forgetting and generalization performance of CL, and how.

The most relevant study to our work is [17], which also studied CL in overparameterized linear models. However, our work differs from [17] in three respects: (1) We study and provide the exact forms of both forgetting and generalization error based on the testing loss, while [17] only evaluated forgetting using the training data; (2) Our results characterize the performance of CL in a comprehensive way, through investigating how overparameterization, task similarity and task ordering affect both forgetting and generalization error, while [17] only studied the upper bound of catastrophic forgetting under specific task orderings; (3) Unlike [17], our study is able to explain recent phenomena and guide the algorithmic development in CL with DNNs.

Studies about generalization performance on overparameterized models (benign overfitting). DNNs are usually so overparameterized that they can completely fit all training samples, yet they can still generalize well on unseen test data. This seems to contradict the classical knowledge of the bias-variance trade-off. As a first step towards understanding this mystery, the “benign overfitting” or “double-descent” phenomenon (i.e., the test error decreases again in the overparameterized region with more parameters, so the overfitting is benign for the generalization performance) has been discovered and studied for overfitted solutions of single-task linear regression. For example, some work discovered and studied double-descent with min $\ell_{2}$-norm overfitted solutions [9, 7, 6, 20, 39] or min $\ell_{1}$-norm overfitted solutions [38, 23], while using simple features such as Gaussian or Fourier features. Some other recent work studied the overfitted generalization performance by adopting features that approximate shallow neural networks, for example, random feature (RF) models [37], two-layer neural tangent kernel (NTK) models [3, 45, 24], and three-layer NTK models [25]. All of these studies considered only a single task. In contrast, our work focuses on CL with a sequence of tasks, which brings in many new variables such as task similarity and task ordering.

3 Continual Learning in Linear Models

Consider the standard CL setup where a sequence of tasks $\mathbb{T}=\{1,\dots,T\}$ arrives sequentially in time.

Ground truth. We consider a linear ground truth [9, 17] for each task. Specifically, for task $t$, the output $y\in\mathbb{R}$ is given by

$$y_{t}=\hat{\bm{x}}_{t}^{\top}\hat{\bm{w}}_{t}^{*}+z_{t}, \tag{1}$$

where $\hat{\bm{x}}_{t}\in\mathbb{R}^{s_{t}}$ denotes the feature vector, $\hat{\bm{w}}_{t}^{*}\in\mathbb{R}^{s_{t}}$ denotes the model parameters, and $z_{t}$ is the random noise. Here $s_{t}$ denotes the number of features of the ground truth (i.e., the number of true features). In practice, the true features are unknown in advance. Therefore, when choosing a model to learn a certain task, people usually include more features than necessary so that all possible features are covered. We state this formally in the following assumption. (When Assumption 3.1 does not hold, the derivation techniques for Theorem 4.1 in the next section still apply, with a minor modification that treats the missing features as noise.)

Assumption 3.1.

We index all possible features by $1,2,\cdots$. Let $\mathcal{W}$ denote the set of indices of all the chosen features in the model to be trained, with cardinality $\left|\mathcal{W}\right|=p$. Let $\mathcal{S}_{t}$ denote the set of indices of the $t$-th task’s true features, with cardinality $\left|\mathcal{S}_{t}\right|=s_{t}$. We assume that $\bigcup_{t\in\mathbb{T}}\mathcal{S}_{t}\subseteq\mathcal{W}$.

We next define an expanded ground-truth vector $\bm{w}_{t}^{*}\in\mathbb{R}^{p}$ that expands the original ground-truth vector $\hat{\bm{w}}_{t}^{*}$ from dimension $s_{t}$ to dimension $p$ by filling zeros in the positions $\mathcal{W}\setminus\mathcal{S}_{t}$. Let $\bm{x}_{t}$ be the corresponding features for $\bm{w}_{t}^{*}$. Therefore, the ground truth Equation 1 can be rewritten as

$$y_{t}=\bm{x}_{t}^{\top}\bm{w}_{t}^{*}+z_{t}. \tag{2}$$
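The expansion from Equation 1 to Equation 2 can be illustrated with a minimal sketch. The function name `expand_ground_truth` and the index sets below are hypothetical and chosen only for illustration; the sketch zero-pads a ground-truth vector to the model's feature set and checks that the noise-free output is unchanged.

```python
import numpy as np

def expand_ground_truth(w_hat, true_idx, model_idx):
    """Zero-pad w_hat (indexed by true_idx) to the model's feature set model_idx."""
    position = {feature: k for k, feature in enumerate(model_idx)}
    missing = [f for f in true_idx if f not in position]
    if missing:
        # Assumption 3.1 requires every true feature to be among the chosen ones.
        raise ValueError(f"Assumption 3.1 violated; features not in model: {missing}")
    w = np.zeros(len(model_idx))
    for feature, value in zip(true_idx, w_hat):
        w[position[feature]] = value
    return w

# Hypothetical example: the model uses features {1, 3, 5, 7, 9};
# the task truly depends on features {3, 7} only.
model_idx = [1, 3, 5, 7, 9]
true_idx = [3, 7]
w_hat = np.array([2.0, -1.0])
w_star = expand_ground_truth(w_hat, true_idx, model_idx)

# The noise-free output is the same before and after the expansion.
x = np.random.default_rng(0).standard_normal(len(model_idx))
x_hat = x[[model_idx.index(f) for f in true_idx]]
print(w_star, np.isclose(x_hat @ w_hat, x @ w_star))
```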

Data. For each task $t\in\mathbb{T}$, the training dataset is denoted as $\mathcal{D}_{t}=\{(\bm{x}_{t,j},y_{t,j})\in\mathbb{R}^{p}\times\mathbb{R}\}_{j\in[n_{t}]}$ with sample size $n_{t}$. By stacking the training data as $\bm{X}_{t}\coloneqq[\bm{x}_{t,1}\ \bm{x}_{t,2}\ \cdots\ \bm{x}_{t,n_{t}}]\in\mathbb{R}^{p\times n_{t}}$ and $\bm{y}_{t}\coloneqq[y_{t,1}\ y_{t,2}\ \cdots\ y_{t,n_{t}}]^{\top}\in\mathbb{R}^{n_{t}\times 1}$, Equation 2 can be written as

$$\bm{y}_{t}=\bm{X}_{t}^{\top}\bm{w}_{t}^{*}+\bm{z}_{t}.$$

To simplify our analysis, we consider i.i.d. Gaussian features and noise, which is stated in the following assumption.

Assumption 3.2.

Each element of $\bm{X}_{t}$ for all $t\in\mathbb{T}$ follows the standard Gaussian distribution $\mathcal{N}(0,1)$, and the elements are independent of each other. The noise vectors $\bm{z}_{t}\sim\mathcal{N}(\bm{0},\sigma_{t}^{2}\bm{I}_{n_{t}})$ are independent across all $t\in\mathbb{T}$, where $\sigma_{t}\geq 0$ denotes the noise level.

Learning procedure. We train the model parameters $\bm{w}$ for each task sequentially. Let $\bm{w}_{t}$ denote the result after training for task $t$, which is also the initial point in the model training for task $t+1$. Let $\bm{w}_{0}=\bm{0}$, i.e., task 1 starts training from zero. For each task $t$, the training loss is defined by the mean-squared-error (MSE) with respect to (w.r.t.) $(\bm{X}_{t},\bm{y}_{t})$:

$$\mathcal{L}^{tr}_{t}(\bm{w},\mathcal{D}_{t})=\frac{1}{n_{t}}\|(\bm{X}_{t})^{\top}\bm{w}-\bm{y}_{t}\|_{2}^{2}. \tag{3}$$

When underparameterized (i.e., $n_{t}\geq p$), minimizing Equation 3 has a unique solution (with probability 1). When overparameterized (i.e., $p>n_{t}$), minimizing Equation 3 has an infinite number of solutions that make Equation 3 zero. Among all overfitted solutions, we are particularly interested in the one corresponding to the convergent point of stochastic gradient descent (SGD) for minimizing Equation 3. In fact, it can be shown that such an overfitted solution has the smallest $\ell_{2}$-norm of the change of parameters [19]. In other words, $\bm{w}_{t}$ corresponds to the solution to the following optimization problem:

$$\min_{\bm{w}}\ \|\bm{w}-\bm{w}_{t-1}\|_{2},\quad \text{s.t.}\quad (\bm{X}_{t})^{\top}\bm{w}=\bm{y}_{t}. \tag{4}$$

The constraint in Equation 4 implies that the training loss is exactly zero (i.e., overfitted).
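The learning procedure above can be sketched numerically. The sketch below is illustrative only: it uses the standard closed form for projecting onto the affine set $\{\bm{w}: \bm{X}_{t}^{\top}\bm{w}=\bm{y}_{t}\}$, namely $\bm{w}_{t}=\bm{w}_{t-1}+\bm{X}_{t}(\bm{X}_{t}^{\top}\bm{X}_{t})^{-1}(\bm{y}_{t}-\bm{X}_{t}^{\top}\bm{w}_{t-1})$, rather than running SGD. The function name `min_norm_update`, the random ground truths, and the values of `p`, `n`, `T` and `sigma` are assumptions made for the example.

```python
import numpy as np

def min_norm_update(w_prev, X, y):
    """Solve Equation 4: min ||w - w_prev||_2 s.t. X^T w = y, for X of shape (p, n), p > n."""
    residual = y - X.T @ w_prev
    # X^T X is (n, n) and invertible with probability 1 for Gaussian X with p > n.
    return w_prev + X @ np.linalg.solve(X.T @ X, residual)

rng = np.random.default_rng(0)
p, n, T, sigma = 200, 50, 3, 0.1  # assumed values for illustration
# Hypothetical ground truths, one per task.
w_star = [rng.standard_normal(p) / np.sqrt(p) for _ in range(T)]

w = np.zeros(p)  # w_0 = 0
iterates = []
for t in range(T):
    X = rng.standard_normal((p, n))                         # i.i.d. N(0, 1) features
    y = X.T @ w_star[t] + sigma * rng.standard_normal(n)    # noisy labels
    w = min_norm_update(w, X, y)
    iterates.append(w.copy())
    train_loss = np.mean((X.T @ w - y) ** 2)                # Equation 3
    print(f"task {t + 1}: training loss = {train_loss:.2e}")
```

Each update interpolates the current task's data, so the printed training loss is zero up to floating-point error.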

Performance evaluation. For the described linear system, we use $\mathcal{L}_{t}(\bm{w})$ to denote the model error for task $t$ (it can be proved that the model error defined here is equivalent to the mean-squared-error on noise-free test data):

$$\mathcal{L}_{t}(\bm{w})=\|\bm{w}-\bm{w}_{t}^{*}\|^{2}, \tag{5}$$

which characterizes the generalization performance of $\bm{w}$ on task $t$. As is standard in the empirical studies of CL, e.g., [14, 33], we evaluate the performance of CL on two key metrics, forgetting and overall generalization error, defined as below:

(1) Forgetting: It measures how much ‘knowledge’ of old tasks has been forgotten after learning the current task. Specifically, after learning task $t\in[2,T]$, the average forgetting over all old tasks $i\in[1,t-1]$ is defined as:

$$F_{t}=\frac{1}{t-1}\sum_{i=1}^{t-1}\left(\mathcal{L}_{i}(\bm{w}_{t})-\mathcal{L}_{i}(\bm{w}_{i})\right). \tag{6}$$

In Equation 6, $\mathcal{L}_{i}(\bm{w}_{t})-\mathcal{L}_{i}(\bm{w}_{i})$ denotes the performance difference between $\bm{w}_{i}$ (the result after training task $i$) and $\bm{w}_{t}$ (the result after training task $t$) on test data of task $i$.

(2) Overall generalization error: We evaluate the model generalization performance of the final task model $\bm{w}_{T}$ in terms of the average model error over all tasks:

$$G_{T}=\frac{1}{T}\sum_{i=1}^{T}\mathcal{L}_{i}(\bm{w}_{T}). \tag{7}$$

It is worth noting that the forgetting defined in [17] is based on the training loss, which consequently ignores the generalization performance of the learned models for old tasks. Such a definition is not only inconsistent with the evaluation metric in empirical studies, but also insufficient to capture backward knowledge transfer, because the value of forgetting therein cannot be negative.
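The two metrics in Equations 5 to 7 can be computed directly from a list of iterates. The sketch below is a minimal illustration; the function names and the toy two-dimensional vectors are hypothetical. The second toy case shows a negative forgetting value, i.e., backward knowledge transfer, which a definition based on the training loss cannot produce.

```python
import numpy as np

def model_error(w, w_star):
    """Equation 5: squared distance to the ground truth."""
    return float(np.sum((np.asarray(w) - np.asarray(w_star)) ** 2))

def forgetting(iterates, w_stars, t):
    """Equation 6; t is 1-indexed and iterates[i - 1] holds w_i."""
    w_t = iterates[t - 1]
    diffs = [
        model_error(w_t, w_stars[i]) - model_error(iterates[i], w_stars[i])
        for i in range(t - 1)
    ]
    return sum(diffs) / (t - 1)

def generalization_error(iterates, w_stars):
    """Equation 7: average model error of the final iterate over all tasks."""
    w_T = iterates[-1]
    return sum(model_error(w_T, ws) for ws in w_stars) / len(w_stars)

# Toy case 1: two orthogonal tasks, each learned perfectly in turn.
w_stars = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
iterates = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
f_orth = forgetting(iterates, w_stars, t=2)
g_orth = generalization_error(iterates, w_stars)

# Toy case 2: identical tasks; the second task improves the first (negative forgetting).
same = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]
improving = [np.array([0.5, 0.0]), np.array([1.0, 0.0])]
f_same = forgetting(improving, same, t=2)
print(f_orth, g_orth, f_same)
```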

We further simplify the current setup by assuming that each task has the same number of training samples as well as the same noise level $\sigma$, stated as follows.

Assumption 3.3.

$n_{t}=n$ and $\sigma_{t}=\sigma$ for all $t\in\mathbb{T}$.

Note that Assumption 3.3 is adopted only to make our results (which will be shown in the next section) easy to interpret. In fact, our analysis can be easily generalized to the situation when Assumption 3.3 does not hold.

4 Main Results and Interpretations

Although we use linear models, in order to provide hints on understanding DNNs that are usually heavily overparameterized, we are particularly interested in the performance of CL in the overparameterized region ($p>n$), where we define the overparameterized ratio as $r\coloneqq 1-\frac{n}{p}$. For ease of exposition, we define the following coefficients that will appear in our main theorem:

$$c_{i,j}\coloneqq(1-r)\left(r^{T-i}-r^{j-i}+r^{T-j}\right), \tag{8}$$

where $1\leq i<j\leq T$ are the indices of tasks. Now we are ready to state our main theorem that characterizes the expected value of forgetting and overall generalization error:

Theorem 4.1.

When $p\geq n+2$, we must have

$$\mathbb{E}[F_{T}]=\frac{1}{T-1}\sum_{i=1}^{T-1}\Big[\underbrace{(r^{T}-r^{i})\|\bm{w}_{i}^{*}\|^{2}}_{\text{Term F1}}+\underbrace{\sum_{j>i}^{T}c_{i,j}\|\bm{w}_{i}^{*}-\bm{w}_{j}^{*}\|^{2}}_{\text{Term F2}}+\underbrace{\frac{p\sigma^{2}}{p-n-1}(r^{i}-r^{T})}_{\text{Term F3}}\Big], \tag{9}$$
$$\mathbb{E}[G_{T}]=\underbrace{\frac{r^{T}}{T}\sum_{i=1}^{T}\|\bm{w}_{i}^{*}\|^{2}}_{\text{Term G1}}+\underbrace{\frac{1}{T}\sum_{i=1}^{T}\frac{nr^{T-i}}{p}\sum_{k=1}^{T}\|\bm{w}_{k}^{*}-\bm{w}_{i}^{*}\|^{2}}_{\text{Term G2}}+\underbrace{\frac{p\sigma^{2}}{p-n-1}\left(1-r^{T}\right)}_{\text{Term G3}}. \tag{10}$$

To the best of our knowledge, Theorem 4.1 is the first result that establishes the closed forms of forgetting and overall generalization error of CL in overparameterized linear models. In the rest of the paper, we will see that Theorem 4.1 not only describes how CL performs on the linear system but also provides guidance on applying CL in practice with DNNs and real-world datasets. The proof of Theorem 4.1 is in Section D.3. We also verify the correctness of Theorem 4.1 in Figure 1, where the discrete points indicated by markers (drawn by simulations) are very close to the curves (drawn by Theorem 4.1 and Theorem 4.3).
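As a convenience for experimenting with Equations 9 and 10, the following sketch evaluates both closed forms for a given set of ground truths. It is not the authors' code; the function name `theorem_4_1` and the example values are assumptions. Because the formulas depend on the ground truths only through norms and pairwise gaps, the vectors can be passed in any common dimension (zero-padding to $p$ would not change the result).

```python
import numpy as np

def theorem_4_1(w_stars, n, p, sigma):
    """Evaluate E[F_T] (Eq. 9) and E[G_T] (Eq. 10); requires p >= n + 2."""
    W = np.asarray(w_stars, dtype=float)
    T = W.shape[0]
    if p < n + 2:
        raise ValueError("Theorem 4.1 requires p >= n + 2")
    r = 1.0 - n / p
    sq_norm = np.sum(W ** 2, axis=1)                              # ||w_i*||^2
    dist = np.sum((W[:, None, :] - W[None, :, :]) ** 2, axis=2)   # ||w_i* - w_j*||^2
    noise = p * sigma ** 2 / (p - n - 1)

    forgetting = 0.0
    for i in range(1, T):  # i = 1, ..., T-1 (1-indexed as in the paper)
        f1 = (r ** T - r ** i) * sq_norm[i - 1]
        f2 = sum(
            (1 - r) * (r ** (T - i) - r ** (j - i) + r ** (T - j)) * dist[i - 1, j - 1]
            for j in range(i + 1, T + 1)
        )
        f3 = noise * (r ** i - r ** T)
        forgetting += f1 + f2 + f3
    forgetting /= T - 1

    g1 = r ** T / T * sq_norm.sum()
    g2 = sum(n * r ** (T - i) / p * dist[:, i - 1].sum() for i in range(1, T + 1)) / T
    g3 = noise * (1 - r ** T)
    return float(forgetting), float(g1 + g2 + g3)

# Assumed example: two orthogonal unit-norm tasks, n = 50, p = 100, sigma = 0.5.
F2, G2 = theorem_4_1([[1.0, 0.0], [0.0, 1.0]], n=50, p=100, sigma=0.5)
print(F2, G2)
```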

We can further simplify Equation 9 and Equation 10 by only considering two tasks, so as to better understand Theorem 4.1. The result is shown in the following corollary, which clearly characterizes the dependence on task similarity and different system parameters.

Corollary 4.2.

When $T=2$ and $p\geq n+2$, we must have

$$\mathbb{E}[F_{2}]=(r^{2}-r)\|\bm{w}_{1}^{*}\|^{2}+\frac{n}{p}\|\bm{w}_{2}^{*}-\bm{w}_{1}^{*}\|^{2}+\frac{nr\sigma^{2}}{p-n-1}, \tag{11}$$
$$\mathbb{E}[G_{2}]=\frac{r^{2}}{2}\left(\|\bm{w}_{1}^{*}\|^{2}+\|\bm{w}_{2}^{*}\|^{2}\right)+\frac{1-r^{2}}{2}\|\bm{w}_{1}^{*}-\bm{w}_{2}^{*}\|^{2}+\frac{p\sigma^{2}(1-r^{2})}{p-n-1}. \tag{12}$$
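For readers who want to check the reduction from Theorem 4.1 to Corollary 4.2, the following worked steps substitute $T=2$ into Equations 8 to 10 and use $1-r=\frac{n}{p}$. This is a sketch of the algebra only, not a replacement for the proof.

```latex
\begin{align*}
c_{1,2} &= (1-r)\left(r^{2-1}-r^{2-1}+r^{2-2}\right) = 1-r = \frac{n}{p},\\
\text{Term F3}\big|_{T=2,\,i=1} &= \frac{p\sigma^{2}}{p-n-1}\,(r-r^{2})
  = \frac{p\sigma^{2}}{p-n-1}\cdot r\cdot\frac{n}{p}
  = \frac{nr\sigma^{2}}{p-n-1},\\
\text{Term G2}\big|_{T=2} &= \frac{1}{2}\left(\frac{nr}{p}+\frac{n}{p}\right)\|\bm{w}_{1}^{*}-\bm{w}_{2}^{*}\|^{2}
  = \frac{(1-r)(1+r)}{2}\,\|\bm{w}_{1}^{*}-\bm{w}_{2}^{*}\|^{2}
  = \frac{1-r^{2}}{2}\,\|\bm{w}_{1}^{*}-\bm{w}_{2}^{*}\|^{2}.
\end{align*}
```

Terms F1, G1 and G3 follow by direct substitution of $T=2$.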

Based on Theorem 4.1, we will provide insights on the following three aspects.

(1) Overparameterization (Section 4.1). In order to understand the generalization power of overfitted machine learning models, much attention has focused (e.g., [9, 23, 21]) on studying the impact of overparameterization on single-task learning, whereas how overparameterization affects the performance of CL still remains unclear. Fortunately, the exact forms in Theorem 4.1 provide a way to directly evaluate the impact of overparameterization and the random noise on both forgetting and generalization error in CL.

(2) Task similarity (Section 4.2). Both forgetting and generalization error depend on the optimal model gap between any two tasks, i.e., $\|\bm{w}_{k}^{*}-\bm{w}_{i}^{*}\|^{2}$ for any tasks $k$ and $i$, which defines the task similarity in this work (a smaller gap means higher similarity). Understanding the impact of task similarity helps not only to explain empirical observations but also to guide better designs of CL in practice.

(3) Task ordering (Section 4.3). Given a fixed set of tasks in CL, the learning order of the task sequence clearly plays an important role in affecting both $\mathbb{E}[F_{T}]$ and $\mathbb{E}[G_{T}]$, through the task order-dependent coefficients, e.g., $c_{i,j}$ in Equation 9 and $r^{T-i}$ in Equation 10. For example, if $\|\bm{w}_{i}^{*}\|^{2}$ is the same for all $i\in\mathbb{T}$, then the optimal task ordering to minimize the generalization error is to learn the tasks in a decreasing order of $\sum_{k=1}^{T}\|\bm{w}_{k}^{*}-\bm{w}_{i}^{*}\|^{2}$, i.e., $i<j$ if $\sum_{k=1}^{T}\|\bm{w}_{k}^{*}-\bm{w}_{i}^{*}\|^{2}>\sum_{k=1}^{T}\|\bm{w}_{k}^{*}-\bm{w}_{j}^{*}\|^{2}$. Intuitively, the most dissimilar task should be learnt first in this case. Investigating the impact of task ordering is particularly valuable when the agent can control the task order in CL, in the same spirit as curriculum learning [11].
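The ordering rule for generalization error in item (3) can be checked on a small instance. The sketch below, with hypothetical names and randomly drawn unit-norm ground truths, sorts tasks by decreasing total gap $\sum_{k}\|\bm{w}_{k}^{*}-\bm{w}_{i}^{*}\|^{2}$ and compares Term G2 of Equation 10 against a brute-force search over all orderings. Only Term G2 is evaluated because Terms G1 and G3 do not depend on the order.

```python
import itertools
import numpy as np

def pairwise_sq_dist(W):
    """Matrix of squared gaps ||w_i* - w_j*||^2."""
    return np.sum((W[:, None, :] - W[None, :, :]) ** 2, axis=2)

def term_g2(W, n, p):
    """Term G2 of Eq. 10; row i-1 of W is the ground truth of the task learned i-th."""
    T = W.shape[0]
    r = 1.0 - n / p
    weights = (n / p) * r ** (T - np.arange(1, T + 1))  # n r^{T-i} / p for i = 1..T
    return float(np.dot(weights, pairwise_sq_dist(W).sum(axis=0)) / T)

rng = np.random.default_rng(1)
T, dim, n, p = 5, 8, 50, 100                      # assumed values for illustration
W = rng.standard_normal((T, dim))
W /= np.linalg.norm(W, axis=1, keepdims=True)     # equal norms, as in the text

total_gap = pairwise_sq_dist(W).sum(axis=0)
sorted_order = np.argsort(-total_gap)             # most dissimilar task first
brute_force = min(
    itertools.permutations(range(T)),
    key=lambda perm: term_g2(W[list(perm)], n, p),
)
print("sorted order:", sorted_order.tolist(), "G2 =", term_g2(W[sorted_order], n, p))
print("brute force :", list(brute_force), "G2 =", term_g2(W[list(brute_force)], n, p))
```

The agreement follows from the rearrangement inequality: the weights $r^{T-i}$ increase with $i$, so the largest total gaps should be paired with the earliest positions.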

In what follows, we will delve into the impact of those three crucial factors in order to provide a comprehensive understanding of CL in the linear models.

Figure 1: The trend of forgetting and overall generalization error w.r.t. the number of model parameters, where $T=8$, $n=50$, $\hat{\bm{w}}_{t}^{*}\in\mathbb{R}^{10}$ and $\|\hat{\bm{w}}_{t}^{*}\|^{2}=1$ for all $t\in\mathbb{T}$. The ground truths are the same for all tasks in Subfigures (a) and (c), but are orthogonal in Subfigures (b) and (d), where $\hat{\bm{w}}_{t}^{*}$ equals the $t$-th standard basis vector for all $t\in\mathbb{T}$. The discrete points indicated by markers are calculated by simulation and are the average of $300$ random simulation runs. The curves are drawn by the theoretical expressions in Theorem 4.1 and Theorem 4.3.

4.1 The impact of overparameterization

In this subsection, we show some insights about the impact of overparameterization. Specifically, we will discuss what happens when $p$ changes under a fixed $n$.

1) More parameters can lead to zero forgetting and alleviate the negative impact of task dissimilarity on generalization error. As shown in Theorem 4.1, when $p\to\infty$, we have $\mathbb{E}[F_{T}]\to 0$, and Term G2 also approaches zero. In some special cases, we can further show that Term G2 is monotonically decreasing w.r.t. $p$. A more detailed discussion can be found in Section C.3.

2) Benign overfitting exists and is easier to observe with large noise and/or low task similarity. As we introduced in related work, benign overfitting has recently been discovered and studied in linear models as a first step towards understanding why DNNs can still generalize well even when heavily overparameterized. The concepts of “benign overfitting” and “double-descent” were initially proposed for only a single task. We now show that such a phenomenon also exists in CL with a sequence of tasks.

Notice that Theorem 4.1 is for the overparameterized region. For a precise comparison between the performance of overfitting and underfitting, we present the theoretical result of the underparameterized region in the following theorem.

Theorem 4.3.

When $n\geq p+2$, we must have

$$\mathbb{E}[F_{T}]=\frac{1}{T-1}\sum_{i=1}^{T-1}\|\bm{w}_{T}^{*}-\bm{w}_{i}^{*}\|^{2},$$
$$\mathbb{E}[G_{T}]=\left(\frac{1}{T}\sum_{i=1}^{T-1}\|\bm{w}_{T}^{*}-\bm{w}_{i}^{*}\|^{2}\right)+\frac{p\sigma^{2}}{n-p-1}.$$

We provide an intuitive explanation and rigorous proof of Theorem 4.3 in Section D.8. As shown in Theorem 4.3, $\mathbb{E}[G_{T}]$ becomes larger when the noise level $\sigma$ is larger, and both $\mathbb{E}[F_{T}]$ and $\mathbb{E}[G_{T}]$ become larger when tasks are less similar (i.e., when $\sum_{i=1}^{T-1}\|\bm{w}_{T}^{*}-\bm{w}_{i}^{*}\|^{2}$ is larger). In contrast, in the overfitted situation, Term F2 and Term G2 in Theorem 4.1 (corresponding to task similarity) and Term F3 and Term G3 (corresponding to noise) will go to zero when $p\to\infty$. This indicates that when the noise level is high and/or task similarity is low, the performance of CL in the overparameterized situation is more likely to be better than that in the underparameterized situation, i.e., benign overfitting exists and is easier to observe. This can be observed from Figure 1. For example, the blue curve with markers “$+$” corresponds to the largest noise (compared with other curves in Figure 1(d)) and the lowest task similarity (compared with Figure 1(c)), and it has the deepest descent curve in the overparameterized region ($p>50=n$). This observation indicates that benign overfitting is easier to observe with larger noise and lower task similarity.
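The comparison between the two regimes can be explored with a small sketch that evaluates $\mathbb{E}[G_{T}]$ from Theorem 4.3 when $n\geq p+2$ and from Equation 10 when $p\geq n+2$. The function name and the chosen values ($T=8$, $n=50$, $\sigma=1$, orthogonal unit-norm ground truths, and the grid of $p$) are assumptions for illustration, loosely following the setup described for Figure 1.

```python
import numpy as np

def expected_generalization(W, n, p, sigma):
    """E[G_T] from Theorem 4.3 (n >= p + 2) or Eq. 10 of Theorem 4.1 (p >= n + 2)."""
    W = np.asarray(W, dtype=float)
    T = W.shape[0]
    dist = np.sum((W[:, None, :] - W[None, :, :]) ** 2, axis=2)
    if n >= p + 2:
        # Underparameterized regime: only the gaps to the last task matter.
        return float(dist[T - 1, : T - 1].sum() / T + p * sigma ** 2 / (n - p - 1))
    if p >= n + 2:
        r = 1.0 - n / p
        g1 = r ** T / T * np.sum(W ** 2)
        weights = (n / p) * r ** (T - np.arange(1, T + 1))
        g2 = np.dot(weights, dist.sum(axis=0)) / T
        g3 = p * sigma ** 2 / (p - n - 1) * (1 - r ** T)
        return float(g1 + g2 + g3)
    raise ValueError("the formulas cover only n >= p + 2 or p >= n + 2")

T, n, sigma = 8, 50, 1.0
# Orthogonal unit-norm ground truths; zero-padding to dimension p leaves norms and gaps unchanged.
W = np.eye(T)
for p in (20, 40, 100, 500, 2000):
    print(p, round(expected_generalization(W, n, p, sigma), 3))
```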

3) A descent floor sometimes exists on forgetting and generalization error, especially when tasks are similar and noise is low. In Equation 11, the term $(r^{2}-r)\|\bm{w}_{1}^{*}\|^{2}$ first decreases and then increases as $p$ increases from $n$ to $\infty$ (i.e., $r$ increases from $0$ to $1$), while the remaining two terms decrease as $p$ increases. Thus, when $\|\bm{w}_{2}^{*}-\bm{w}_{1}^{*}\|^{2}$ (task similarity) and $\sigma^{2}$ (noise level) are relatively small, the trend of $F_{2}$ w.r.t. $p$ will be dominated by the first term, where a descent floor of forgetting exists. On the right-hand side of Equation 12, the first term increases as $p$ increases, while the remaining two terms decrease as $p$ increases. Taking the derivative of Equation 12 with respect to $p$, we have

$$\frac{\partial\mathbb{E}[G_{2}]}{\partial p}=\frac{2nr\,{\bm{w}_{1}^{*}}^{\top}\bm{w}_{2}^{*}}{p^{2}}-\sigma^{2}\left(\frac{(n+1)(1-r^{2})}{(p-n-1)^{2}}+\frac{2nr}{(p-n-1)p}\right).$$

Here, since $\frac{1}{p-n-1}$ is very large when $p$ is close to $n$, while decreasing to zero when $p\to\infty$, we can tell that when $\sigma^{2}$ is relatively small w.r.t. ${\bm{w}_{1}^{*}}^{\top}\bm{w}_{2}^{*}$, $\frac{\partial\mathbb{E}[G_{2}]}{\partial p}$ will be negative and then positive as $p$ increases from $n+2$ to $\infty$. In other words, if these two tasks have a positive correlation (i.e., ${\bm{w}_{1}^{*}}^{\top}\bm{w}_{2}^{*}>0$) and the noise is small, there exists a descent floor w.r.t. $p$ on $\mathbb{E}[G_{2}]$. Such a phenomenon can exist in other setups besides the special case of $T=2$. For example, in Figure 1(a)(c), where the ground truth for each task is exactly the same, we can observe a descent floor for the small noise cases $\sigma=0.3$ and $0.1$ (i.e., orange and green curves with markers “$\times$” and “Y”, respectively).
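To inspect the sign change of the derivative and locate the floor numerically, the sketch below evaluates Equation 12 and the derivative expression above on a grid of $p$. The function names and the chosen instance (identical unit-norm tasks, $n=50$, $\sigma=0.1$, grid up to $p=5000$) are assumptions for illustration.

```python
import numpy as np

def expected_g2(p, n, sigma, sq1, sq2, inner):
    """Equation 12 with sq1 = ||w1*||^2, sq2 = ||w2*||^2, inner = <w1*, w2*>."""
    r = 1.0 - n / p
    gap = sq1 + sq2 - 2.0 * inner  # ||w1* - w2*||^2
    return (
        (r ** 2 / 2) * (sq1 + sq2)
        + ((1 - r ** 2) / 2) * gap
        + p * sigma ** 2 * (1 - r ** 2) / (p - n - 1)
    )

def d_expected_g2(p, n, sigma, inner):
    """Derivative of Equation 12 with respect to p."""
    r = 1.0 - n / p
    return 2 * n * r * inner / p ** 2 - sigma ** 2 * (
        (n + 1) * (1 - r ** 2) / (p - n - 1) ** 2 + 2 * n * r / ((p - n - 1) * p)
    )

n, sigma = 50, 0.1
ps = np.arange(n + 2, 5001)
values = expected_g2(ps, n, sigma, 1.0, 1.0, 1.0)   # identical unit-norm tasks
p_floor = int(ps[np.argmin(values)])
print("p at the floor:", p_floor)
print("derivative at p = n + 2:", d_expected_g2(n + 2, n, sigma, 1.0))
print("derivative at p = 200  :", d_expected_g2(200, n, sigma, 1.0))
```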

4.2 The impact of task similarity

Generalization error monotonically decreases with task similarity whereas forgetting may not. Based on Theorem 4.1, it can be seen that the generalization error $G_{T}(\bm{w}_{T})$ decreases when $\|\bm{w}^{*}_{k}-\bm{w}^{*}_{i}\|^{2}$ for any two different tasks $k$ and $i$ decreases, because of the positive coefficients in Term G2 in Equation 10. Intuitively, the generalization error of CL will be smaller if the tasks are more similar to each other. In contrast, the forgetting $F_{T}$ may not change monotonically with $\|\bm{w}^{*}_{k}-\bm{w}^{*}_{i}\|^{2}$, because the coefficients $c_{i,j}$ in Term F2 in Equation 9 can be negative. To verify this result, we consider two different scenarios.

(1) Consider the case where $T=2$. In Equation 11, $\|\bm{w}_{2}^{*}-\bm{w}_{1}^{*}\|^{2}$ captures the task similarity between tasks 1 and 2 in terms of the optimal task models. It is clear that forgetting increases with $\|\bm{w}_{2}^{*}-\bm{w}_{1}^{*}\|^{2}$, i.e., there is less forgetting when the two tasks are more similar.

(2) Consider the case where $T=4$. We first assume that $\|\bm{w}_{i}^{*}\|^{2}=w$ for any task $i\in[1,4]$, considering the overparameterized models [17]. Suppose that task 1 and task 2 share the same set of true features, which is orthogonal to the feature set of both task 3 and task 4, i.e., $\mathcal{S}_{1}=\mathcal{S}_{2}$ and $\mathcal{S}_{1}\cap(\mathcal{S}_{3}\cup\mathcal{S}_{4})=\emptyset$. Note that

$$\|\bm{w}_{i}^{*}-\bm{w}_{j}^{*}\|^{2}=\|\bm{w}_{i}^{*}\|^{2}+\|\bm{w}_{j}^{*}\|^{2}-2\langle\bm{w}_{i}^{*},\bm{w}_{j}^{*}\rangle$$

where $\langle\bm{w}_{i}^{*},\bm{w}_{j}^{*}\rangle=0$ if $\mathcal{S}_{i}\cap\mathcal{S}_{j}=\emptyset$. Therefore, we can control the value of $\|\bm{w}_{1}^{*}-\bm{w}_{2}^{*}\|^{2}$ by changing $\bm{w}_{2}^{*}$, without affecting the value of $\|\bm{w}_{i}^{*}-\bm{w}_{j}^{*}\|^{2}$ for any pair $\{i,j\}\neq\{1,2\}$. Based on Theorem 4.1, it can be shown that $c_{1,2}<0$, such that increasing $\|\bm{w}_{1}^{*}-\bm{w}_{2}^{*}\|^{2}$, i.e., making the tasks less similar, will surprisingly decrease forgetting.
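The sign of $c_{1,2}$ in the $T=4$ scenario can be examined directly from Equation 8. The sketch below uses a hypothetical helper `c` and a few assumed values of $\frac{n}{p}$. Substituting $T=4$, $i=1$, $j=2$ into Equation 8 gives $c_{1,2}=(1-r)\,r\,(r^{2}+r-1)$, so under this expression the coefficient is negative when $r<\frac{\sqrt{5}-1}{2}\approx 0.618$ and positive for larger $r$; the printed values illustrate both cases.

```python
def c(i, j, T, r):
    """Coefficient c_{i,j} of Equation 8, for 1 <= i < j <= T."""
    return (1 - r) * (r ** (T - i) - r ** (j - i) + r ** (T - j))

T = 4
for n_over_p in (0.8, 0.5, 0.2):  # assumed values of n/p for illustration
    r = 1 - n_over_p
    print(f"n/p = {n_over_p}: r = {r:.1f}, c_12 = {c(1, 2, T, r):+.4f}")
```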

4.3 The impact of task ordering

In order to investigate the impact of task ordering on the performance of CL, we assume that $\|\bm{w}_{t}^{*}\|^{2}=w$ for every task $t\in\mathbb{T}$. By ignoring the task order-independent terms in Equation 9 and Equation 10, we focus on the task order-dependent terms, i.e., Term F2 and Term G2.

1) Optimal task ordering of minimizing forgetting tends to arrange dissimilar tasks adjacently in the early stage of the sequence. As shown in Term F2, the optimal task order to minimize forgetting closely hinges upon the value of $c_{i,j}$. Based on Equation 8, $c_{i,j}$ is smaller when (1) $i$ and $j$ are smaller and (2) they are closer. Intuitively, this implies that tasks with larger $\|\bm{w}_{i}^{*}-\bm{w}_{j}^{*}\|^{2}$ should be learnt adjacently with higher priority in CL, in order to minimize the impact of the task dissimilarity on the value of $\tilde{F}_{T}(\bm{w}_{T})$. However, finding the optimal task order for the general case is highly nontrivial due to the complex coupling across $\|\bm{w}_{i}^{*}-\bm{w}_{j}^{*}\|^{2}$ for different tasks. To verify the implication above and better understand the structure of the optimal task order, we study several special cases of the task setups.

(1) [Special case I: One vs Many]  There are two different categories of tasks, where tasks in the same category have the same optimal model; among the entire task set, one special task belongs to Category I while the other tasks belong to Category II. In this case, the optimal task order is captured by the optimal learning order of the special task in Category I. We have the following result to characterize the optimal task order for Special case I.

Proposition 4.4.

Let $i^* \in [1, T]$ denote the optimal order of the special task in Category I to minimize forgetting. Suppose $p \geq n+2$. Then 1) $i^*$ can take any integer value between 2 and $\frac{T}{2}$, depending on the value of $\frac{n}{p}$; 2) $i^*$ is non-decreasing with $\frac{n}{p}$.

As indicated by Proposition 4.4, the special task will be learnt in the first half of the sequence, such that the task diversity in the first half is always larger than in the second half. Besides, as the model capacity increases ($\frac{n}{p} \rightarrow 0$), the order of the special task will move towards the beginning of the sequence, because 1) the model is less concerned about the special task since it is powerful enough to learn different features and 2) the model focuses on the performance of the majority and seeks to learn more tasks from Category II at the end of the sequence for better performance.

Figure 2: Impact of task similarity and task order. (a) When $T=2$, both forgetting and generalization error decrease when the two tasks have more overlapping classes. (b) When $T=4$, forgetting surprisingly increases when the first two tasks have more overlapping classes; ‘forgetting_0’, ‘forgetting_1’ and ‘forgetting_2’ correspond to three cases of the task setups (also in (c) and (d)). (c) One special task and five identical tasks in CL with $T=6$; the task order index shows the order of the special task, and in each case the smallest forgetting is always achieved when the special task is learnt in the first half (we normalize forgetting w.r.t. the worst forgetting in each case). (d) Two categories of tasks in CL with $T=4$; the task order indices 0 and 1 refer to the perfectly alternating orders, one of which always achieves the smallest forgetting among all possible orders. All results are averaged over 3 random seeds.

(2) [Special case II: Equal Occurrence]  There are two different categories ($C_1$ and $C_2$) of tasks, where tasks in the same category have the same optimal model; in particular, the two categories contain the same number of tasks. If task $1 \in C_1$ and task $2 \in C_2$, we denote the task order as $(C_1, C_2)$. The following proposition characterizes the optimal task order in this case:

Proposition 4.5.

Suppose $p \geq n+2$. For $T=4$ and $T=6$, the optimal task order to minimize forgetting is the perfectly alternating order, i.e., $(C_i, C_j, C_i, C_j)$ and $(C_i, C_j, C_i, C_j, C_i, C_j)$, where $i, j \in \{1, 2\}$ and $i \neq j$.

Proposition 4.5 clearly shows that adjacent tasks always belong to different categories in the optimal task order, which leads to a more diverse task learning sequence. Intuitively, the alternating order maximizes the memorization of each category by repeatedly practicing on different tasks. It can be further proved that the perfectly alternating order is also optimal for $T=6$ with three different categories (Section C.4). Based on these results, we expect that such an alternating order may minimize forgetting in more general scenarios where the tasks contain multiple categories with equal cross-category task model distance.

The findings on the optimal task order indeed share similar insights with the surprising impact of task correlation on forgetting mentioned earlier. Intuitively, learning more dissimilar tasks in the early stage facilitates the exploration of a larger feature space and expands the learnt feature space in CL, which can make the learning of similar tasks in the future much easier. Meanwhile, the impact of task similarity among the early tasks continuously diminishes in CL as $T$ increases, as suggested by the coefficients $c_{i,j}$ (which can be smaller for smaller $i$, $j$) in Theorem 4.1. Therefore, the negative impact of learning more dissimilar tasks on forgetting is weaker when they are learnt in the early stage than when they are learnt in the late stage.

2) The optimal task orders for minimizing forgetting and for minimizing generalization error are not always the same. Consider Special case I and Special case II. It can be shown that the optimal task orders for minimizing forgetting and generalization error are different in Special case I but the same in Special case II. This opens up an interesting direction of finding a task order with a balanced impact on forgetting and generalization error. A more detailed discussion can be found in Section C.4.

5 Implications on CL with DNN

So far, we have explored different aspects that affect the performance of CL in overparameterized linear models. More interestingly, we will show next that Theorem 4.1 can also shed light on CL in practice with DNNs, by reflecting on recent empirical observations and guiding improved designs therein. More experimental details are in Appendix A.

5.1 Forgetting is not always monotonic with task similarity

To see if our understanding of the impact of task similarity on forgetting can be carried over to CL with DNN, we conduct experiments on MNIST [29] using a convolutional neural network. More specifically, we consider each task $i$ as a binary classification problem which seeks to decide if an image belongs to a task-specific label subset $Y_i$ of the classes, i.e., $Y_i \subset \{0, \ldots, 9\}$ in MNIST, and we control the task similarity through the degree of class overlapping between the task-specific subsets, e.g., tasks $i$ and $j$ are more similar if the cardinality of $Y_i \cap Y_j$ is larger.
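As a minimal sketch of this task construction (the helper names `binary_labels` and `class_overlap` and the example digits are hypothetical and not taken from the paper's code), the following Python snippet builds binary labels from a task-specific subset and measures similarity by the size of the class overlap:

```python
def binary_labels(digits, label_subset):
    """Return 1 for digits inside the task-specific subset Y_i, else 0."""
    subset = set(label_subset)
    return [1 if d in subset else 0 for d in digits]


def class_overlap(subset_a, subset_b):
    """Task similarity proxy: cardinality of the intersection of two label subsets."""
    return len(set(subset_a) & set(subset_b))


Y1 = (0, 1, 2, 3, 4)
Y2 = (3, 4, 5, 6, 7)
digits = [0, 3, 5, 9]  # hypothetical digit classes of four images

labels_task1 = binary_labels(digits, Y1)
labels_task2 = binary_labels(digits, Y2)
overlap = class_overlap(Y1, Y2)
```

Here the same image can receive different binary labels under different tasks, and a larger `overlap` corresponds to more similar tasks in the sense used above.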

We first consider the case with two tasks, where we fix $Y_1$ for task 1 as $\{0,1,2,3,4\}$ and change $Y_2$ for task 2 to have different numbers of overlapping classes with $Y_1$. As shown in Figure 2(a), both forgetting and generalization error decrease when the number of overlapping classes increases, i.e., the two tasks are more similar, which is indeed consistent with our analysis for the overparameterized linear models for $T=2$. More interestingly, this result also agrees with some recent studies [42, 30, 17], which found that ‘intermediate task similarity’ leads to the worst forgetting in a two-task setup using various notions of task similarity (different from our definition of task similarity using the optimal model gap), through either empirical studies or analyzing the upper bound of forgetting. We can build the connection based on the closed form of forgetting $F_2$ in Equation 11.

Note that in Equation 11

$$\|\bm{w}_2^* - \bm{w}_1^*\|^2 = \|\bm{w}_2^*\|^2 + \|\bm{w}_1^*\|^2 - 2\langle \bm{w}_1^*, \bm{w}_2^* \rangle$$

and we can divide the task correlation into three cases depending on the value of $\langle \bm{w}_1^*, \bm{w}_2^* \rangle$: (1) $\langle \bm{w}_1^*, \bm{w}_2^* \rangle = 0$: the two tasks are orthogonal in the sense that they share no common features, i.e., $\mathcal{S}_1 \cap \mathcal{S}_2 = \emptyset$; (2) $\langle \bm{w}_1^*, \bm{w}_2^* \rangle > 0$: the two tasks share some common features and are ‘positively’ correlated; (3) $\langle \bm{w}_1^*, \bm{w}_2^* \rangle < 0$: the two tasks share some common features but are ‘negatively’ correlated. Compared to the first case, where the two tasks are orthogonal, it can be easily shown that forgetting is worse when the two tasks are negatively correlated even if they share some common features, which indeed corresponds to the ‘intermediate task similarity’ in [42, 30, 17]. The reason is that in this case task 2 updates the model in the opposite direction to the model update of task 1, which inevitably leads to more forgetting in CL. Note that in Figure 2(a), the non-overlapping case means that tasks 1 and 2 are negatively correlated, because in this two-task case an image that is not in $Y_1$ must be in $Y_2$. On the other hand, forgetting can even be negative when the two tasks are positively correlated.
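The three cases can be illustrated numerically. The sketch below uses hypothetical unit-norm two-dimensional vectors (chosen only for illustration, not from the paper) and checks the expansion of the squared model gap: with the norms fixed, a negative inner product yields a larger gap than the orthogonal case, and a positive inner product yields a smaller one.

```python
def inner(u, v):
    """Inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def sq_dist(u, v):
    """Squared Euclidean distance ||u - v||^2."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


w1 = [1.0, 0.0]  # hypothetical optimal model of task 1 (unit norm)
cases = {
    "orthogonal": [0.0, 1.0],  # <w1, w2> = 0
    "positive": [0.6, 0.8],    # <w1, w2> > 0
    "negative": [-0.6, 0.8],   # <w1, w2> < 0
}

gaps = {}
for name, w2 in cases.items():
    # Expansion: ||w2 - w1||^2 = ||w2||^2 + ||w1||^2 - 2 <w1, w2>
    expanded = inner(w2, w2) + inner(w1, w1) - 2 * inner(w1, w2)
    assert abs(expanded - sq_dist(w2, w1)) < 1e-12
    gaps[name] = sq_dist(w2, w1)
```

The snippet only compares model gaps; how the gap translates into forgetting is governed by the closed form in Equation 11.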

We next consider the case with $T=4$, where we control the task similarity by changing $Y_2$ while fixing $Y_1$, $Y_3$ and $Y_4$. Here we let $(Y_1 \cup Y_2) \cap (Y_3 \cup Y_4) = \emptyset$ as in Section 4.2. As shown in Figure 2(b), forgetting surprisingly increases when task 1 and task 2 have more overlapping classes, which is also consistent with our analysis for the linear models. Indeed, this also supports our observation, made when studying the impact of task order, that forgetting can decrease when adjacent tasks are more dissimilar.

5.2 Diversify the tasks in the early stage and order dissimilar tasks adjacently

We also evaluate the impact of task ordering on forgetting in CL with DNN, by constructing the tasks using a similar strategy as in Section 5.1. More specifically, we consider two different scenarios: (1) $T=6$, where the task sequence includes one special task and five identical tasks; (2) $T=4$, where the task sequence includes two categories of tasks and each category has two identical tasks.

Figure 2(c) demonstrates forgetting in the first scenario w.r.t. the learning order of the special task, and the three plots correspond to three different cases, respectively. It is clear that for all three cases, the optimal order of the special task to minimize forgetting is always in the first half of the sequence. For the second scenario, we evaluate forgetting in Figure 2(d) for all six possible task orders, where task order indices 0 and 1 refer to the perfectly alternating orders. We can see that the smallest forgetting is also achieved by a perfectly alternating order. These results indicate that our findings in Section 4.3 for the overparameterized linear models can also be carried over to CL with DNN, i.e., the optimal task order should diversify the tasks in the early stage and learn more different tasks adjacently. Such an implication is indeed consistent with the empirical observations in recent studies [31, 10]. Note that in both Figure 2(c) and Figure 2(d), we normalize forgetting w.r.t. the worst forgetting in each case.
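To make the count of six task orders in the second scenario concrete, the following sketch enumerates all distinct orderings of two categories with two tasks each and picks out the perfectly alternating ones. The category labels `C1`, `C2` and the helper names are illustrative assumptions.

```python
from itertools import permutations


def distinct_orders(categories):
    """All distinct orderings of a multiset of category labels, sorted for determinism."""
    return sorted(set(permutations(categories)))


def is_perfectly_alternating(order):
    """True when every pair of adjacent tasks belongs to different categories."""
    return all(a != b for a, b in zip(order, order[1:]))


orders = distinct_orders(["C1", "C1", "C2", "C2"])
alternating = [o for o in orders if is_perfectly_alternating(o)]
```

With two equally sized categories, requiring adjacent tasks to differ leaves exactly the two alternating sequences, which are the orders indexed 0 and 1 in the experiment.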

5.3 Weight the fresher old tasks more in forward knowledge transfer

Recently, there has been increasing interest in CL in leveraging task correlation to facilitate knowledge transfer [26, 33, 32]; these methods first select the old tasks most correlated with the current task and then design algorithms to directly increase the knowledge transfer between correlated tasks. By investigating knowledge transfer in the linear models, we show that improved algorithms can be motivated to achieve better knowledge transfer.

Given a task $t$ in CL, the forward knowledge transfer [47] in the linear model can be defined as

$$\mathbb{E}[\|\bm{w}_t - \bm{w}_t^*\|^2] - \mathbb{E}[\|\bm{w}_t^r - \bm{w}_t^*\|^2], \tag{13}$$

where $\bm{w}_t^r$ is the model learnt for task $t$ when starting from a random model. Intuitively, Equation 13 characterizes the gap in testing performance between $\bm{w}_t$ learnt in CL and $\bm{w}_t^r$ learnt from scratch, for which a negative value (i.e., a smaller model error in CL than when learning from scratch) means that the accumulated knowledge in CL benefits the learning of the current task. As the second term in Equation 13 is independent of CL, it suffices to analyze $\mathbb{E}[\|\bm{w}_t - \bm{w}_t^*\|^2]$ to understand the forward knowledge transfer. Based on Lemma B.2 (Appendix B), we can obtain

$$\mathbb{E}[\|\bm{w}_t - \bm{w}_t^*\|^2] = r^t \|\bm{w}_t^*\|^2 + \sum_{i=1}^{t} \frac{n r^{t-i}}{p} \|\bm{w}_i^* - \bm{w}_t^*\|^2 + \frac{p\sigma^2}{p-n-1}.$$

While it is intuitive that better forward knowledge transfer can be achieved when $\|\bm{w}_i^* - \bm{w}_t^*\|^2$ is smaller for the current task $t$ and the old task $i$, the impact of different old tasks on the current task is non-uniform, in the sense that a more recent old task $i$ (i.e., $t-i$ is smaller) has a larger effect on the forward knowledge transfer to task $t$. This result implies that fresher old tasks should contribute more when designing algorithms to leverage correlated old tasks to facilitate better forward knowledge transfer.
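The non-uniform influence of old tasks can be read off from the coefficient $\frac{n r^{t-i}}{p}$ that multiplies $\|\bm{w}_i^* - \bm{w}_t^*\|^2$. The following sketch evaluates these coefficients for hypothetical values of $t$, $n$ and $p$. It assumes $r = 1 - \frac{n}{p}$, which is defined earlier in the paper rather than in this section, and the function name is illustrative.

```python
def transfer_weights(t, n, p):
    """Coefficient (n/p) * r**(t - i) on ||w_i^* - w_t^*||^2 for old tasks i = 1..t-1.

    Assumption: r = 1 - n/p (overparameterized regime, p > n).
    The i = t term is skipped because its model gap is zero.
    """
    r = 1.0 - n / p
    return [(n / p) * r ** (t - i) for i in range(1, t)]


# Hypothetical setting: current task t = 5, n = 20 samples, p = 100 parameters.
weights = transfer_weights(t=5, n=20, p=100)
```

Because $0 < r < 1$ in this regime, the coefficients grow with $i$, so the most recent old task carries the largest weight.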

To verify this insight, we consider the TRGP algorithm proposed in [33]. Specifically, TRGP first selects the old tasks most correlated with the current task and reuses their knowledge through a scaled weight projection to facilitate forward knowledge transfer, where all the selected old tasks are treated equivalently. We slightly modify TRGP by assigning a larger weight to the selected old task that is more recent relative to the current task, and name the resulting algorithm TRGP+. We evaluate its performance on standard CL benchmarks (PMNIST [35] and Split CIFAR-100 [28]) and DNN architectures. As shown in Table 1, TRGP+ outperforms TRGP in both accuracy and forgetting. Assigning a larger weight to the more recent correlated old task not only improves the forward knowledge transfer, but also increases the backward knowledge transfer by forcing the learnt model of the current task to be closer to the model of those highly correlated old tasks.

Table 1: The averaged final testing accuracy (ACC) and backward transfer (BWT: negative value of forgetting, larger is better) over all the tasks on different datasets.
Method   PMNIST ACC (%)   PMNIST BWT (%)   Split CIFAR-100 ACC (%)   Split CIFAR-100 BWT (%)
TRGP     96.34            -0.8             74.46                     -0.9
TRGP+    96.75            -0.46            75.31                     0.13

6 Conclusions

In this work, we studied CL in the overparameterized linear models where each task is a linear regression problem and solved by using SGD. Under the assumption that each task has a sparse linear model with i.i.d. Gaussian features and noise, we derived the exact forms of both forgetting and generalization error, which built the key foundations of understanding the performance of CL. In particular, we investigated the impact of overparameterization, task similarity and task ordering on both forgetting and generalization error. Experimental results on real datasets with DNNs indicated that our findings in linear models can even be carried over to CL in practice and leveraged to develop better algorithms.

References

  • [1] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pages 139–154, 2018.
  • [2] Joshua Andle and Salimeh Yasaei Sekeh. Theoretical understanding of the information flow on continual learning performance. In European Conference on Computer Vision, pages 86–101. Springer, 2022.
  • [3] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322–332, 2019.
  • [4] Haruka Asanuma, Shiro Takagi, Yoshihiro Nagano, Yuki Yoshida, Yasuhiko Igarashi, and Masato Okada. Statistical mechanical analysis of catastrophic forgetting in continual learning with teacher and student networks. Journal of the Physical Society of Japan, 90(10):104001, 2021.
  • [5] Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.
  • [6] Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 2020.
  • [7] Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. arXiv preprint arXiv:1903.07571, 2019.
  • [8] Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. SIAM Journal on Mathematics of Data Science, 2(4):1167–1180, 2020.
  • [9] Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pages 541–549. PMLR, 2018.
  • [10] Samuel J Bell and Neil D Lawrence. The effect of task ordering in continual learning. arXiv preprint arXiv:2205.13323, 2022.
  • [11] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48, 2009.
  • [12] Mehdi Abbana Bennani, Thang Doan, and Masashi Sugiyama. Generalisation guarantees for continual learning with orthogonal gradient descent. arXiv preprint arXiv:2006.11942, 2020.
  • [13] Xinyuan Cao, Weiyang Liu, and Santosh Vempala. Provable lifelong learning of representations. In International Conference on Artificial Intelligence and Statistics, pages 6334–6356. PMLR, 2022.
  • [14] Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a-gem. arXiv preprint arXiv:1812.00420, 2018.
  • [15] Xi Chen, Christos Papadimitriou, and Binghui Peng. Memory bounds for continual learning. arXiv preprint arXiv:2204.10830, 2022.
  • [16] Thang Doan, Mehdi Abbana Bennani, Bogdan Mazoure, Guillaume Rabusseau, and Pierre Alquier. A theoretical analysis of catastrophic forgetting through the ntk overlap matrix. In International Conference on Artificial Intelligence and Statistics, pages 1072–1080. PMLR, 2021.
  • [17] Itay Evron, Edward Moroshko, Rachel Ward, Nathan Srebro, and Daniel Soudry. How catastrophic can catastrophic forgetting be in linear regression? In Conference on Learning Theory, pages 4028–4079. PMLR, 2022.
  • [18] Mehrdad Farajtabar, Navid Azizan, Alex Mott, and Ang Li. Orthogonal gradient descent for continual learning. In International Conference on Artificial Intelligence and Statistics, pages 3762–3773. PMLR, 2020.
  • [19] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. In International Conference on Machine Learning, pages 1832–1841. PMLR, 2018.
  • [20] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.
  • [21] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2):949–986, 2022.
  • [22] Xisen Jin, Arka Sadhu, Junyi Du, and Xiang Ren. Gradient-based editing of memory examples for online task-free continual learning. Advances in Neural Information Processing Systems, 34:29193–29205, 2021.
  • [23] Peizhong Ju, Xiaojun Lin, and Jia Liu. Overfitting can be harmless for basis pursuit, but only to a degree. Advances in Neural Information Processing Systems, 33:7956–7967, 2020.
  • [24] Peizhong Ju, Xiaojun Lin, and Ness B Shroff. On the generalization power of overfitted two-layer neural tangent kernel models. arXiv preprint arXiv:2103.05243, 2021.
  • [25] Peizhong Ju, Xiaojun Lin, and Ness B Shroff. On the generalization power of the overfitted three-layer neural tangent kernel model. arXiv preprint arXiv:2206.02047, 2022.
  • [26] Zixuan Ke, Bing Liu, and Xingchang Huang. Continual learning of a mixed sequence of similar and dissimilar tasks. Advances in Neural Information Processing Systems, 33:18493–18504, 2020.
  • [27] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
  • [28] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [29] Yann LeCun, Bernhard Boser, John Denker, Donnie Henderson, Richard Howard, Wayne Hubbard, and Lawrence Jackel. Handwritten digit recognition with a back-propagation network. Advances in neural information processing systems, 2, 1989.
  • [30] Sebastian Lee, Sebastian Goldt, and Andrew Saxe. Continual learning in the teacher-student setup: Impact of task similarity. In International Conference on Machine Learning, pages 6109–6119. PMLR, 2021.
  • [31] Yingcong Li, Mingchen Li, M Salman Asif, and Samet Oymak. Provable and efficient continual representation learning. arXiv preprint arXiv:2203.02026, 2022.
  • [32] Sen Lin, Li Yang, Deliang Fan, and Junshan Zhang. Beyond not-forgetting: Continual learning with backward knowledge transfer. In Thirty-Sixth Conference on Neural Information Processing Systems, 2022.
  • [33] Sen Lin, Li Yang, Deliang Fan, and Junshan Zhang. Trgp: Trust region gradient projection for continual learning. Tenth International Conference on Learning Representations, ICLR 2022, 2022.
  • [34] Hao Liu and Huaping Liu. Continual learning with recursive gradient optimization. arXiv preprint arXiv:2201.12522, 2022.
  • [35] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. Advances in neural information processing systems, 30:6467–6476, 2017.
  • [36] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989.
  • [37] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.
  • [38] Partha P Mitra. Understanding overfitting peaks in generalization error: Analytical risk curves for $l_2$ and $l_1$ penalized interpolation. arXiv preprint arXiv:1906.03667, 2019.
  • [39] Vidya Muthukumar, Kailas Vodrahalli, and Anant Sahai. Harmless interpolation of noisy data in regression. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 2299–2303. IEEE, 2019.
  • [40] Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian, and Anant Sahai. Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory, 1(1):67–83, 2020.
  • [41] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 113:54–71, 2019.
  • [42] Vinay V Ramasesh, Ethan Dyer, and Maithra Raghu. Anatomy of catastrophic forgetting: Hidden representations and task semantics. arXiv preprint arXiv:2007.07400, 2020.
  • [43] Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, and Gerald Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. arXiv preprint arXiv:1810.11910, 2018.
  • [44] Gobinda Saha, Isha Garg, and Kaushik Roy. Gradient projection memory for continual learning. In International Conference on Learning Representations, 2021.
  • [45] Siddhartha Satpathi and R Srikant. The dynamics of gradient descent for overparametrized neural networks. In Learning for Dynamics and Control, pages 373–384. PMLR, 2021.
  • [46] Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning, pages 4548–4557. PMLR, 2018.
  • [47] Tom Veniat, Ludovic Denoyer, and Marc’Aurelio Ranzato. Efficient continual learning with modular networks and task-driven priors. arXiv preprint arXiv:2012.12631, 2020.
  • [48] Li Yang, Sen Lin, Junshan Zhang, and Deliang Fan. Grown: Grow only when necessary for continual learning. arXiv preprint arXiv:2110.00908, 2021.
  • [49] Dong Yin, Mehrdad Farajtabar, Ang Li, Nir Levine, and Alex Mott. Optimization and generalization of regularization-based continual learning: a loss approximation viewpoint. arXiv preprint arXiv:2006.10974, 2020.
  • [50] Jaehong Yoon, Saehoon Kim, Eunho Yang, and Sung Ju Hwang. Scalable and order-robust continual learning with additive parameter decomposition. In Eighth International Conference on Learning Representations, ICLR 2020. ICLR, 2020.

Appendix A Experimental Details

A.1 Experimental details for Section 5.1 and Section 5.2

Datasets. We consider the MNIST dataset. For each task, we randomly select 200 samples for training and 1000 samples for testing. Different tasks have different subsets of classes.

DNN architecture and training details. We use a five-layer neural network with two convolutional layers and three fully-connected layers. ReLU is used for the first four layers and Sigmoid is used for the last layer. The first convolutional layer is followed by a 2D max-pooling operation with a stride of 2. We learn each task by using SGD with a learning rate of 0.1 for 600 epochs. The forgetting and overall generalization error are evaluated as in Equation 6 and Equation 7, respectively, while here $\mathcal{L}_t(\bm{w})$ is defined as the mean-squared test error instead of Equation 5.

Task setups. For Figure 2(a), we consider the following setup:

  • task 1: (0,1,2,3,4).

  • task 2: (5,6,7,8,9), (4,5,6,7,8), (3,4,5,6,7), (2,3,4,5,6), (1,2,3,4,5), (0,1,2,3,4), which correspond to different numbers of overlapping classes with task 1.

For Figure 2(b), we randomly select three different setups:

  • ‘forgetting_0’:

    • task 1: (0,1,2).

    • task 2: (3,4,5), (2,3,4), (1,2,3), (0,1,2), which correspond to different numbers of overlapping classes with task 1.

    • task 3: (7,8,9).

    • task 4: (7,8,9).

  • ‘forgetting_1’:

    • task 1: (3,4,5).

    • task 2: (0,1,2), (1,2,3), (2,3,4), (3,4,5), which correspond to different numbers of overlapping classes with task 1.

    • task 3: (6,7,8).

    • task 4: (7,8,9).

  • ‘forgetting_2’:

    • task 1: (0,1,2).

    • task 2: (7,8,9), (2,7,8), (1,2,7), (0,1,2), which correspond to different numbers of overlapping classes with task 1.

    • task 3: (4,5,6).

    • task 4: (4,5,6).

For Figure 2(c), we randomly select three different setups:

  • ‘forgetting_0’: the special task is (4,5,6,7) and the other tasks are (0,1,2,3).

  • ‘forgetting_1’: the special task is (0,1,2,3) and the other tasks are (5,6,7,8).

  • ‘forgetting_2’: the special task is (3,4,5,6) and the other tasks are (1,2,7,8).

For Figure 2(d), we randomly select three different setups:

  • ‘forgetting_0’: the two task categories are (4,5,6,7) and (1,2,4,5), and the task order indices are:

    • ‘0’: (4,5,6,7), (1,2,4,5), (4,5,6,7), (1,2,4,5).

    • ‘1’: (1,2,4,5), (4,5,6,7), (1,2,4,5), (4,5,6,7).

    • ‘2’: (4,5,6,7), (4,5,6,7), (1,2,4,5), (1,2,4,5).

    • ‘3’: (1,2,4,5), (1,2,4,5), (4,5,6,7), (4,5,6,7).

    • ‘4’: (4,5,6,7), (1,2,4,5), (1,2,4,5), (4,5,6,7).

    • ‘5’: (1,2,4,5), (4,5,6,7), (4,5,6,7), (1,2,4,5).

  • ‘forgetting_1’: the two task categories are (4,5,6,7) and (2,3,4,5), and the task order indices are:

    • ‘0’: (4,5,6,7), (2,3,4,5), (4,5,6,7), (2,3,4,5).

    • ‘1’: (2,3,4,5), (4,5,6,7), (2,3,4,5), (4,5,6,7).

    • ‘2’: (4,5,6,7), (4,5,6,7), (2,3,4,5), (2,3,4,5).

    • ‘3’: (2,3,4,5), (2,3,4,5), (4,5,6,7), (4,5,6,7).

    • ‘4’: (4,5,6,7), (2,3,4,5), (2,3,4,5), (4,5,6,7).

    • ‘5’: (2,3,4,5), (4,5,6,7), (4,5,6,7), (2,3,4,5).

  • ‘forgetting_2’: the two task categories are (6,7,8,9) and (3,4,5,6), and the task order indices are:

    • ‘0’: (6,7,8,9), (3,4,5,6), (6,7,8,9), (3,4,5,6).

    • ‘1’: (3,4,5,6), (6,7,8,9), (3,4,5,6), (6,7,8,9).

    • ‘2’: (6,7,8,9), (6,7,8,9), (3,4,5,6), (3,4,5,6).

    • ‘3’: (3,4,5,6), (3,4,5,6), (6,7,8,9), (6,7,8,9).

    • ‘4’: (6,7,8,9), (3,4,5,6), (3,4,5,6), (6,7,8,9).

    • ‘5’: (3,4,5,6), (6,7,8,9), (6,7,8,9), (3,4,5,6).

A.2 Experimental details for Section 5.3

A.2.1 TRGP vs TRGP+

TRGP [33] seeks to solve the following optimization problem for the current task $t$:

$$\begin{aligned}
\min_{\{\bm{w}^l\}_l,\ \{\bm{Q}_{j,t}^l\}_{l, j \in \mathcal{TR}_t^l}} \quad & \mathcal{L}(\{\bm{w}^l_{eff}\}_l, \mathcal{D}_t), \\
\text{s.t.} \quad & \bm{w}^l_{eff} = \bm{w}^l + \sum\nolimits_{j \in \mathcal{TR}_t^l} [\mathrm{Proj}^Q_{S_j^l}(\bm{w}^l) - \mathrm{Proj}_{S_j^l}(\bm{w}^l)],
\end{aligned} \tag{14}$$

where $\bm{w}^l$ is the DNN weight for layer $l$, and $S_j^l$ denotes the input subspace of layer $l$ for the old task $j < t$, which can be constructed by using SVD on the representation matrix for that layer. Two important designs are introduced in Equation 14:

  • The trust region $\mathcal{TR}_t^l$: $\mathcal{TR}_t^l$ denotes the set of the most correlated old tasks selected for task $t$ based on some correlation evaluation metric in a layer-wise manner. The purpose here is to select the most correlated old tasks and facilitate the forward knowledge transfer by reusing the learnt knowledge of the old tasks in $\mathcal{TR}_t^l$.

  • The scaled weight projection $\mathrm{Proj}^Q_{S_j^l}(\bm{w}^l)$: $\mathrm{Proj}^Q_{S_j^l}(\bm{w}^l)$ is developed to reuse the learnt model of the selected old tasks in $\mathcal{TR}_t^l$. Specifically, for any $j \in \mathcal{TR}_t^l$,

    $$\mathrm{Proj}^Q_{S_j^l}(\bm{w}^l) = \bm{w}_{t-1}^l \bm{B}_j^l \bm{Q}_{j,t}^l (\bm{B}_j^l)^{\prime}$$

    where $\bm{B}_j^l$ is the basis matrix for the subspace $S_j^l$, and $\bm{Q}_{j,t}^l$ is the scaling matrix to scale the weight projection onto $S_j^l$. In contrast, $\mathrm{Proj}_{S_j^l}(\bm{w}^l) = \bm{w}_{t-1}^l \bm{B}_j^l (\bm{B}_j^l)^{\prime}$ is the standard weight projection onto $S_j^l$. Since the learnt knowledge for the old task $j$ is indeed $\mathrm{Proj}_{S_j^l}(\bm{w}^l)$, scaling the projection provides a way to reuse this knowledge directly for learning task $t$. Intuitively, $\mathrm{Proj}^Q_{S_j^l}(\bm{w}^l) - \mathrm{Proj}_{S_j^l}(\bm{w}^l)$ characterizes the boosted forward knowledge transfer from old task $j \in \mathcal{TR}_t^l$ to task $t$.

However, as shown in Equation 14, all the selected old tasks in $\mathcal{TR}_t^l$ are treated equivalently in the effective weight $\bm{w}^l_{eff}$, which could be suboptimal. As suggested by our theoretical results, we propose a slightly modified version of TRGP, i.e., TRGP+, by assigning non-uniform weights to the most correlated old tasks selected in $\mathcal{TR}_t^l$:

$$\begin{aligned}
\min_{\{\bm{w}^l\}_l,\ \{\bm{Q}_{j,t}^l\}_{l, j \in \mathcal{TR}_t^l}} \quad & \mathcal{L}(\{\bm{w}^l_{eff}\}_l, \mathcal{D}_t), \\
\text{s.t.} \quad & \bm{w}^l_{eff} = \bm{w}^l + \sum\nolimits_{j \in \mathcal{TR}_t^l} \lambda_j [\mathrm{Proj}^Q_{S_j^l}(\bm{w}^l) - \mathrm{Proj}_{S_j^l}(\bm{w}^l)],
\end{aligned} \tag{15}$$

where $\lambda_j > \lambda_{j^{\prime}}$ if $t - j < t - j^{\prime}$ for $j, j^{\prime} \in \mathcal{TR}_t^l$, i.e., a more recent selected old task receives a larger weight.

A.2.2 Experimental setup

Datasets. We consider two standard benchmarks in CL: (1) PMNIST: 10 sequential tasks are created using different permutations, where each task has 10 classes; (2) Split CIFAR-100: the entire CIFAR-100 dataset is split into 10 groups, where each task is a 10-way multi-class classification problem for one group.

DNN architecture and training details. Following [33], we use a 3-layer fully-connected network with 2 hidden layers of 100 units each for PMNIST, and train the network for 5 epochs with a batch size of 10 for each task. For Split CIFAR-100, we use a version of 5-layer AlexNet, and train the network for a maximum of 200 epochs with early stopping for each task. For each layer, the two most correlated old tasks are selected for the current task, and we assign a larger weight of 1.2 to the more recent old task and 0.8 to the other one.

Evaluation metrics. The performance is evaluated based on ACC, the average final accuracy over all tasks, and Backward Transfer (BWT), which measures the forgetting of old tasks when learning new tasks. Specifically, ACC and BWT are defined as:

$$ACC=\frac{1}{T}\sum\nolimits_{i=1}^{T}A_{T,i},\qquad BWT=\frac{1}{T-1}\sum\nolimits_{i=1}^{T-1}\left(A_{T,i}-A_{i,i}\right), \tag{16}$$

where $A_{T,i}$ is the accuracy of the model on the $i$-th task after learning the $T$-th task sequentially, and $A_{i,i}$ is defined analogously right after learning the $i$-th task.
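As a small illustration of Equation 16, the sketch below computes ACC and BWT from an accuracy matrix. The function name `acc_and_bwt` and the numbers in `A` are made up for this example and are not results from the paper; `A[j][i]` is assumed to hold the accuracy on task `i` after learning task `j` (0-indexed).

```python
def acc_and_bwt(acc_matrix):
    """Return (ACC, BWT) as defined in Equation 16.

    acc_matrix[j][i] is the accuracy on task i after sequentially
    learning tasks 0..j (0-indexed); only entries with i <= j are used.
    """
    T = len(acc_matrix)
    final = acc_matrix[T - 1]
    acc = sum(final[i] for i in range(T)) / T
    bwt = sum(final[i] - acc_matrix[i][i] for i in range(T - 1)) / (T - 1)
    return acc, bwt


# Hypothetical accuracy matrix for T = 3 tasks (illustrative numbers only).
A = [
    [0.90, 0.00, 0.00],
    [0.85, 0.92, 0.00],
    [0.80, 0.90, 0.95],
]
acc_value, bwt_value = acc_and_bwt(A)
print(acc_value, bwt_value)
```

A negative BWT indicates forgetting on average, while a positive value indicates backward knowledge transfer.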

Appendix B Useful Lemmas

The following lemma characterizes the solution to the optimization problem Equation 4 for task $t$:

Lemma B.1.

The solution to the optimization problem Equation 4, i.e., the learnt model for task $t$, is given by

$$\bm{w}_{t}=\bm{w}_{t-1}+\bm{X}_{t}(\bm{X}_{t}^{\top}\bm{X}_{t})^{-1}\left(\bm{y}_{t}-\bm{X}_{t}^{\top}\bm{w}_{t-1}\right). \tag{17}$$

In the overparameterized case, multiple $\bm{w}$ exist that perfectly fit $\bm{X}_{t}^{\top}\bm{w}=\bm{y}_{t}$, and solving Equation 4 picks the one that has the minimum $\ell_{2}$ distance to $\bm{w}_{t-1}$. Therefore, the solution in Equation 17 not only incorporates the information of the current task $t$ through $\mathcal{D}_{t}$ but also depends on the previous model evolution trajectory in CL.
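To make Equation 17 concrete, here is a minimal sketch of the update specialised to a single training sample per task ($n=1$), so that $\bm{X}_{t}^{\top}\bm{X}_{t}$ is a scalar and no matrix inversion is needed. The helper name `min_norm_update` and the toy data are assumptions made for illustration only.

```python
def min_norm_update(w_prev, x, y):
    """Equation 17 for n = 1: X_t is the single column x and y_t is the scalar y.

    Returns the interpolating solution closest (in l2 distance) to w_prev.
    """
    residual = y - sum(xi * wi for xi, wi in zip(x, w_prev))
    scale = residual / sum(xi * xi for xi in x)
    return [wi + scale * xi for wi, xi in zip(w_prev, x)]


w0 = [0.0, 0.0, 0.0]                      # w_0 = 0
w1 = min_norm_update(w0, [1.0, 2.0, 2.0], 9.0)
w2 = min_norm_update(w1, [1.0, 0.0, 0.0], 3.0)
print(w1, w2)
```

The second update moves `w1` only along the new sample direction, which is why `w2` still depends on what was learnt for the first task.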

By leveraging the recent advance in [8], we have the following lemma about the evolution of $\mathbb{E}[\|\bm{w}_{t}-\bm{w}_{i}^{*}\|^{2}]$:

Lemma B.2.

Suppose $p\geq n+2$. For any task $t\in[1,T-1]$ and any old task $i\in[1,t]$, the following equation holds:

$$\mathbb{E}[\|\bm{w}_{t+1}-\bm{w}_{i}^{*}\|^{2}]=\left(1-\frac{n}{p}\right)\mathbb{E}[\|\bm{w}_{t}-\bm{w}_{i}^{*}\|^{2}]+\frac{n}{p}\|\bm{w}_{t+1}^{*}-\bm{w}_{i}^{*}\|^{2}+\frac{n\sigma^{2}}{p-n-1}.$$
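The recursion in Lemma B.2 can be iterated numerically. The sketch below does so starting from $\bm{w}_{0}=\bm{0}$, as the proof of Theorem 4.1 does; the function name `expected_sq_dist` and its argument layout are hypothetical choices for this illustration.

```python
def expected_sq_dist(dists_to_i, w_i_norm_sq, n, p, sigma2):
    """Iterate Lemma B.2 from w_0 = 0.

    dists_to_i[k-1] holds ||w_k^* - w_i^*||^2 for k = 1..t, and
    w_i_norm_sq is ||w_i^*||^2. Returns the list of
    E||w_t - w_i^*||^2 for t = 0..len(dists_to_i).
    """
    if p < n + 2:
        raise ValueError("Lemma B.2 requires p >= n + 2")
    r = 1.0 - n / p
    noise = n * sigma2 / (p - n - 1)
    values = [w_i_norm_sq]                 # t = 0: ||w_0 - w_i^*||^2
    for d in dists_to_i:
        values.append(r * values[-1] + (n / p) * d + noise)
    return values


trajectory = expected_sq_dist([0.0, 1.0], 1.0, n=2, p=10, sigma2=0.0)
print(trajectory)
```

Each step shrinks the previous error by the factor $1-\frac{n}{p}$ and adds a contribution from the distance between the new task and task $i$, plus a noise term.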

Appendix C Additional Results

C.1 Characterization of negative forgetting

As shown in Figure 2, the forgetting can even be negative when the two tasks are positively correlated. Intuitively, because the common features play a similar role in these two tasks, task 2 updates the model in a favorable direction for task 1, which could even result in better performance of task 1 due to the backward knowledge transfer herein. A formal quantification of the condition for better performance of task 1 can be found in the following proposition:

Proposition C.1.

Suppose $\sigma^{2}<\frac{p-n-1}{p}\|\bm{w}_{1}^{*}\|^{2}$ and $p\geq n+2$. The learning of task 2 would lead to a better model for task 1, i.e., $\mathbb{E}[F_{2}]\leq 0$, if

$$2\langle\bm{w}_{1,\mathcal{S}_{1}}^{*},\bm{w}_{2,\mathcal{S}_{2}}^{*}\rangle\geq\frac{n}{p}\|\bm{w}_{1}^{*}\|^{2}+\|\bm{w}_{2}^{*}\|^{2}+\frac{(p-n)\sigma^{2}}{p-n-1}.$$
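The condition in Proposition C.1 can be compared numerically with the sign of $\mathbb{E}[F_{2}]$, using the two-task expression for $\mathbb{E}[F_{2}]$ from the proof in Section D.4. In this sketch the inner product is taken over the full vectors, which assumes that entries outside the true feature sets are zero; the function names and toy vectors are illustrative assumptions.

```python
def expected_f2(w1, w2, n, p, sigma2):
    """E[F_2] = (r^2 - r)||w1||^2 + (n/p)||w1 - w2||^2 + n r sigma^2 / (p - n - 1)."""
    r = 1.0 - n / p
    norm1 = sum(a * a for a in w1)
    dist = sum((a - b) ** 2 for a, b in zip(w1, w2))
    return (r * r - r) * norm1 + (n / p) * dist + n * r * sigma2 / (p - n - 1)


def condition_c1(w1, w2, n, p, sigma2):
    """Sufficient condition of Proposition C.1 (full-vector inner product assumed)."""
    inner = sum(a * b for a, b in zip(w1, w2))
    norm1 = sum(a * a for a in w1)
    norm2 = sum(b * b for b in w2)
    return 2 * inner >= (n / p) * norm1 + norm2 + (p - n) * sigma2 / (p - n - 1)


similar = ([1.0, 0.0], [1.0, 0.0])       # positively correlated tasks
opposite = ([1.0, 0.0], [-1.0, 0.0])     # negatively correlated tasks
for w1, w2 in (similar, opposite):
    print(condition_c1(w1, w2, 2, 10, 0.0), expected_f2(w1, w2, 2, 10, 0.0))
```

For the positively correlated pair the condition holds and the expected forgetting is negative, whereas for the opposite pair neither is the case.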

C.2 Evolution of forgetting

We can also characterize the evolution of forgetting after learning new tasks. Based on the definition of forgetting, we have

$$
\begin{aligned}
\mathbb{E}[F_{t}] &=\frac{1}{t-1}\sum_{i=1}^{t-1}\mathbb{E}\left[\|\bm{w}_{t}-\bm{w}_{i}^{*}\|^{2}-\|\bm{w}_{i}-\bm{w}_{i}^{*}\|^{2}\right],\\
\mathbb{E}[F_{t+1}] &=\frac{1}{t}\sum_{i=1}^{t}\mathbb{E}\left[\|\bm{w}_{t+1}-\bm{w}_{i}^{*}\|^{2}-\|\bm{w}_{i}-\bm{w}_{i}^{*}\|^{2}\right].
\end{aligned}
$$

Rearranging the above equations gives

$$
\begin{aligned}
\sum_{i=1}^{t-1}\mathbb{E}[\|\bm{w}_{t}-\bm{w}_{i}^{*}\|^{2}] &=(t-1)\mathbb{E}[F_{t}]+\sum_{i=1}^{t-1}\mathbb{E}[\|\bm{w}_{i}-\bm{w}_{i}^{*}\|^{2}],\\
\sum_{i=1}^{t-1}\mathbb{E}[\|\bm{w}_{t+1}-\bm{w}_{i}^{*}\|^{2}] &=t\mathbb{E}[F_{t+1}]+\sum_{i=1}^{t}\mathbb{E}[\|\bm{w}_{i}-\bm{w}_{i}^{*}\|^{2}]-\mathbb{E}[\|\bm{w}_{t+1}-\bm{w}_{t}^{*}\|^{2}].
\end{aligned}
$$

Based on the relationship between $\mathbb{E}[\|\bm{w}_{t}-\bm{w}_{i}^{*}\|^{2}]$ and $\mathbb{E}[\|\bm{w}_{t+1}-\bm{w}_{i}^{*}\|^{2}]$ characterized in Lemma B.2, it can be seen that

$$
\begin{aligned}
&\sum_{i=1}^{t-1}\mathbb{E}[\|\bm{w}_{t+1}-\bm{w}_{i}^{*}\|^{2}]\\
={}& t\mathbb{E}[F_{t+1}]+\sum_{i=1}^{t}\mathbb{E}[\|\bm{w}_{i}-\bm{w}_{i}^{*}\|^{2}]-\mathbb{E}[\|\bm{w}_{t+1}-\bm{w}_{t}^{*}\|^{2}]\\
={}& \sum_{i=1}^{t-1}\left\{\left(1-\frac{n}{p}\right)\mathbb{E}[\|\bm{w}_{t}-\bm{w}_{i}^{*}\|^{2}]+\frac{n}{p}\|\bm{w}_{t+1}^{*}-\bm{w}_{i}^{*}\|^{2}+\frac{n\sigma^{2}}{p-n-1}\right\}\\
={}& \left(1-\frac{n}{p}\right)\sum_{i=1}^{t-1}\mathbb{E}[\|\bm{w}_{t}-\bm{w}_{i}^{*}\|^{2}]+\frac{n}{p}\sum_{i=1}^{t-1}\|\bm{w}_{t+1}^{*}-\bm{w}_{i}^{*}\|^{2}+\frac{n\sigma^{2}(t-1)}{p-n-1}\\
={}& \left(1-\frac{n}{p}\right)\left\{(t-1)\mathbb{E}[F_{t}]+\sum_{i=1}^{t-1}\mathbb{E}[\|\bm{w}_{i}-\bm{w}_{i}^{*}\|^{2}]\right\}+\frac{n}{p}\sum_{i=1}^{t-1}\|\bm{w}_{t+1}^{*}-\bm{w}_{i}^{*}\|^{2}+\frac{n\sigma^{2}(t-1)}{p-n-1},
\end{aligned}
$$

such that

$$
\begin{aligned}
t\mathbb{E}[F_{t+1}]={}& (t-1)\left(1-\frac{n}{p}\right)\mathbb{E}[F_{t}]+\left(1-\frac{n}{p}\right)\sum_{i=1}^{t-1}\mathbb{E}[\|\bm{w}_{i}-\bm{w}_{i}^{*}\|^{2}]+\frac{n}{p}\sum_{i=1}^{t-1}\|\bm{w}_{t+1}^{*}-\bm{w}_{i}^{*}\|^{2}\\
&+\frac{n\sigma^{2}(t-1)}{p-n-1}-\sum_{i=1}^{t}\mathbb{E}[\|\bm{w}_{i}-\bm{w}_{i}^{*}\|^{2}]+\mathbb{E}[\|\bm{w}_{t+1}-\bm{w}_{t}^{*}\|^{2}]\\
={}& (t-1)\left(1-\frac{n}{p}\right)\mathbb{E}[F_{t}]-\frac{n}{p}\sum_{i=1}^{t-1}\mathbb{E}[\|\bm{w}_{i}-\bm{w}_{i}^{*}\|^{2}]-\mathbb{E}[\|\bm{w}_{t}-\bm{w}_{t}^{*}\|^{2}]\\
&+\frac{n}{p}\sum_{i=1}^{t-1}\|\bm{w}_{t+1}^{*}-\bm{w}_{i}^{*}\|^{2}+\mathbb{E}[\|\bm{w}_{t+1}-\bm{w}_{t}^{*}\|^{2}]+\frac{n\sigma^{2}(t-1)}{p-n-1}.
\end{aligned}
\tag{18}
$$

Let $i=t$ in Lemma B.2. We can show that

$$
\begin{aligned}
&\mathbb{E}[\|\bm{w}_{t+1}-\bm{w}_{t}^{*}\|^{2}-\|\bm{w}_{t}-\bm{w}_{t}^{*}\|^{2}]\\
={}& \frac{n}{p}\|\bm{w}_{t+1}^{*}-\bm{w}_{t}^{*}\|^{2}-\frac{n}{p}\mathbb{E}[\|\bm{w}_{t}-\bm{w}_{t}^{*}\|^{2}]+\frac{n\sigma^{2}}{p-n-1}.
\end{aligned}
\tag{19}
$$

By substituting Equation 19 back into Equation 18 and dividing both sides by $t$, we have

$$
\begin{aligned}
\mathbb{E}[F_{t+1}]={}& \frac{t-1}{t}\left(1-\frac{n}{p}\right)\mathbb{E}[F_{t}]+\frac{n}{tp}\sum_{i=1}^{t-1}\left\{\|\bm{w}_{t+1}^{*}-\bm{w}_{i}^{*}\|^{2}-\mathbb{E}[\|\bm{w}_{i}-\bm{w}_{i}^{*}\|^{2}]\right\}\\
&+\frac{n}{tp}\left\{\|\bm{w}_{t+1}^{*}-\bm{w}_{t}^{*}\|^{2}-\mathbb{E}[\|\bm{w}_{t}-\bm{w}_{t}^{*}\|^{2}]\right\}+\frac{n\sigma^{2}}{p-n-1}\\
={}& \frac{t-1}{t}\left(1-\frac{n}{p}\right)\mathbb{E}[F_{t}]+\frac{n}{tp}\sum_{i=1}^{t}\left\{\|\bm{w}_{t+1}^{*}-\bm{w}_{i}^{*}\|^{2}-\mathbb{E}[\|\bm{w}_{i}-\bm{w}_{i}^{*}\|^{2}]\right\}+\frac{n\sigma^{2}}{p-n-1}.
\end{aligned}
\tag{20}
$$

C.3 Impact of overparameterization

1) Forgetting approaches zero with more parameters. In Equation 9, when $p\to\infty$, we have $r\to 1$, which implies that $(r^{T}-r^{i})\to 0$ and $c_{i,j}\to 0$. Therefore, we can conclude that $\mathbb{E}[F_{T}]\to 0$ when $p\to\infty$. An intuitive explanation is that with more parameters, the model has a larger “memory” such that it can remember all knowledge of previous tasks, i.e., zero forgetting.

2) More parameters can alleviate the negative impact of task dissimilarity on generalization error. Term G2 in Equation 10 describes the effect of task dissimilarity on $G_{T}$. When $p\to\infty$, Term G2 approaches zero, which indicates that the negative impact of task dissimilarity on generalization error diminishes. In some special cases, we can further show that Term G2 is monotonically decreasing with respect to $p$, e.g., $T=2$ shown in Equation 12. A more general case is when $\sum_{k=1}^{T}\|\bm{w}_{k}^{*}-\bm{w}_{i}^{*}\|^{2}=C$ for all tasks $i$, in which case Term G2 $=\frac{1-r^{T}}{T}C$, which is also monotonically decreasing w.r.t. $p$. (Footnote 4: For general $T$, this requirement holds if the ground truths of all tasks have the same power and are orthogonal to each other, i.e., $\|\bm{w}_{i}^{*}\|^{2}=\|\bm{w}_{j}^{*}\|^{2}$ and $(\bm{w}_{i}^{*})^{T}\bm{w}_{j}^{*}=0$ for all $i\neq j$.)
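The monotonic decrease of Term G2 in the constant-$C$ case can be checked with a few lines of code. The sketch below evaluates the simplified expression $\frac{1-r^{T}}{T}C$ with $r=1-\frac{n}{p}$ and compares it against the direct sum $\frac{1}{T}\sum_{i=1}^{T}\frac{n r^{T-i}}{p}C$; the function names and the parameter values are illustrative assumptions.

```python
def term_g2_simplified(C, T, n, p):
    """Term G2 when sum_k ||w_k^* - w_i^*||^2 = C for every task i."""
    r = 1.0 - n / p
    return (1.0 - r ** T) / T * C


def term_g2_direct(C, T, n, p):
    """Same quantity written as the sum over i of (n r^(T-i) / p) * C, divided by T."""
    r = 1.0 - n / p
    return sum(n * r ** (T - i) / p * C for i in range(1, T + 1)) / T


p_values = (20, 50, 100, 1000)
g2_values = [term_g2_simplified(1.0, 5, 10, p) for p in p_values]
print(g2_values)
```

With $n=10$ and $T=5$ held fixed, the printed values shrink as $p$ grows, in line with the statement above.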

C.4 Impact of task order

(1) [Special case III]  There are three categories ($C_{1}$, $C_{2}$ and $C_{3}$) of tasks: each category contains the same number of tasks; tasks in the same category are identical, while tasks in different categories differ. Without loss of generality, we assume that for any tasks $i$ and $j$

$$\|\bm{w}_{i}^{*}-\bm{w}_{j}^{*}\|^{2}=\begin{cases}0,&\text{if $i,j\in C_{m}$ for $m\in\{1,2,3\}$;}\\ 1,&\text{else.}\end{cases}$$

Based on Theorem 4.1, we can show that the optimal task order for Special case III follows a structure similar to that for Special case II, as characterized in the following proposition:

Proposition C.2.

Suppose $p\geq n+2$. For $T=6$, the optimal task order to minimize forgetting is the perfectly alternating order, i.e., $(C_{i},C_{j},C_{k},C_{i},C_{j},C_{k})$, where $i,j,k\in\{1,2,3\}$, $i\neq j$, $i\neq k$ and $j\neq k$.

(2) [The optimal task order can be different for minimizing forgetting and generalization error]

[Special case I]  As shown in Proposition 4.4, the optimal task order to minimize forgetting is to learn the special task between the 2nd place and the $\frac{T}{2}$-th place. In stark contrast, this special task, which has the largest value of $\sum_{k=1}^{T}\|\bm{w}_{k}^{*}-\bm{w}_{i}^{*}\|^{2}$, should always be learnt in the very first place in order to minimize the generalization error, i.e., $i=1$, because the weight $r^{T-i}$ multiplying this sum in the generalization error is smallest for $i=1$. The underlying rationale is that the generalization error characterizes the average testing performance of the final model on all tasks, which is maximized when the final model works the best for the majority. Therefore, in this case the optimal order for minimizing forgetting is different from that for minimizing generalization error.

[Special case II]  As shown in Proposition 4.5, the optimal task order to minimize forgetting is the perfectly alternating order. Here, however, the task order does not affect the generalization performance, because $\sum_{k=1}^{T}\|\bm{w}_{k}^{*}-\bm{w}_{i}^{*}\|^{2}$ is the same for every task $i\in\mathbb{T}$. In this case, the optimal task order for minimizing forgetting is also ‘optimal’ for minimizing generalization error. That is to say, we can find an optimal task order to minimize forgetting and generalization error simultaneously.
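Proposition C.2 above can be checked by brute force for a concrete value of $r$. The sketch below enumerates all distinct orders of six tasks in three categories and evaluates the order-dependent part of the forgetting, $\sum_{i<j}c_{i,j}\|\bm{w}_{i}^{*}-\bm{w}_{j}^{*}\|^{2}$ with $c_{i,j}=(1-r)(r^{T-i}-r^{j-i}+r^{T-j})$ and 0/1 distances. It assumes that all tasks have the same norm, so that no other term depends on the order; the function names and the choice $r=0.8$ are illustrative.

```python
from itertools import permutations


def order_cost(order, r):
    """Sum of c_{i,j} over pairs i < j that belong to different categories.

    `order` lists the category label of the task at each position;
    positions are 1-indexed to match the paper's notation.
    """
    T = len(order)
    total = 0.0
    for i in range(1, T):
        for j in range(i + 1, T + 1):
            if order[i - 1] != order[j - 1]:
                total += (1 - r) * (r ** (T - i) - r ** (j - i) + r ** (T - j))
    return total


def best_orders(r, tol=1e-12):
    """Return all category orders of 6 tasks (2 per category) with minimal cost."""
    orders = set(permutations([1, 1, 2, 2, 3, 3]))
    costs = {o: order_cost(o, r) for o in orders}
    best = min(costs.values())
    return sorted(o for o, c in costs.items() if c <= best + tol)


winners = best_orders(0.8)
print(winners)
```

Every minimizer found this way repeats the same three categories in the same sequence, differing only in how the categories are labelled.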

Appendix D Proofs

D.1 Proof of Lemma B.1

Let $\hat{\bm{w}}=\bm{w}-\bm{w}_{t-1}$. It is clear that Equation 4 can be reformulated as

$$
\begin{aligned}
\min_{\hat{\bm{w}}}\quad & \|\hat{\bm{w}}\|_{2},\\
\text{s.t.}\quad & \bm{X}_{t}^{\top}\hat{\bm{w}}=\bm{y}_{t}-\bm{X}_{t}^{\top}\bm{w}_{t-1}.
\end{aligned}
\tag{21}
$$

For the overparameterized case, $\bm{X}_{t}^{\top}\bm{X}_{t}$ is invertible. Using Lagrange multipliers, we get

$$\min_{\hat{\bm{w}},\lambda}\quad \frac{\hat{\bm{w}}^{\top}\hat{\bm{w}}}{2}+\lambda^{\top}\left[\bm{X}_{t}^{\top}\hat{\bm{w}}-(\bm{y}_{t}-\bm{X}_{t}^{\top}\bm{w}_{t-1})\right].$$

By setting the derivative w.r.t. $\hat{\bm{w}}$ to 0, it follows that

$$\hat{\bm{w}}^{*}=-\bm{X}_{t}\lambda \tag{22}$$

such that

$$\bm{X}_{t}^{\top}\hat{\bm{w}}^{*}=-\bm{X}_{t}^{\top}\bm{X}_{t}\lambda=\bm{y}_{t}-\bm{X}_{t}^{\top}\bm{w}_{t-1}.$$

Therefore,

$$\lambda=-(\bm{X}_{t}^{\top}\bm{X}_{t})^{-1}(\bm{y}_{t}-\bm{X}_{t}^{\top}\bm{w}_{t-1}). \tag{23}$$

By substituting Equation 23 into Equation 22, we have

$$\hat{\bm{w}}^{*}=\bm{X}_{t}(\bm{X}_{t}^{\top}\bm{X}_{t})^{-1}(\bm{y}_{t}-\bm{X}_{t}^{\top}\bm{w}_{t-1})$$

such that

$$\bm{w}_{t}=\bm{w}_{t-1}+\bm{X}_{t}(\bm{X}_{t}^{\top}\bm{X}_{t})^{-1}(\bm{y}_{t}-\bm{X}_{t}^{\top}\bm{w}_{t-1}).$$

D.2 Proof of Lemma B.2

Let $\bm{P}_{t}\coloneqq\bm{X}_{t}(\bm{X}_{t}^{\top}\bm{X}_{t})^{-1}\bm{X}_{t}^{\top}$ and $\bm{X}_{t}^{\dagger}\coloneqq\bm{X}_{t}(\bm{X}_{t}^{\top}\bm{X}_{t})^{-1}$ for any $t\in\mathbb{T}$, where $\bm{P}_{t}$ characterizes the projection onto the row space of $\bm{X}_{t}^{\top}$. Based on Lemma B.1, we have

$$\bm{w}_{t+1}=(\bm{I}-\bm{P}_{t+1})\bm{w}_{t}+\bm{P}_{t+1}\bm{w}_{t+1}^{*}+\bm{X}_{t+1}^{\dagger}\bm{z}_{t+1}. \tag{24}$$

Intuitively, the learnt model $\bm{w}_{t+1}$ for task $t+1$ is an ‘interpolation’ between the learnt model $\bm{w}_{t}$ for task $t$ and the optimal task model $\bm{w}_{t+1}^{*}$ for task $t+1$, while being perturbed by the random noise $\bm{z}_{t+1}$.

Let $H=(\bm{I}-\bm{P}_{t+1})(\bm{w}_{t}-\bm{w}_{i}^{*})+\bm{P}_{t+1}(\bm{w}_{t+1}^{*}-\bm{w}_{i}^{*})$. Based on Equation 24, we know that

$$
\begin{aligned}
&\mathbb{E}[\|\bm{w}_{t+1}-\bm{w}_{i}^{*}\|^{2}]\\
={}& \mathbb{E}[\|(\bm{I}-\bm{P}_{t+1})\bm{w}_{t}+\bm{P}_{t+1}\bm{w}_{t+1}^{*}+\bm{X}_{t+1}^{\dagger}\bm{z}_{t+1}-\bm{w}_{i}^{*}\|^{2}]\\
={}& \mathbb{E}[\|(\bm{I}-\bm{P}_{t+1})(\bm{w}_{t}-\bm{w}_{i}^{*})+\bm{P}_{t+1}(\bm{w}_{t+1}^{*}-\bm{w}_{i}^{*})+\bm{X}_{t+1}^{\dagger}\bm{z}_{t+1}\|^{2}]\\
={}& \mathbb{E}[\|H+\bm{X}_{t+1}^{\dagger}\bm{z}_{t+1}\|^{2}]\\
={}& \underbrace{\mathbb{E}[\|H\|^{2}]}_{(a)}+\underbrace{2\mathbb{E}[\langle H,\bm{X}_{t+1}^{\dagger}\bm{z}_{t+1}\rangle]}_{(b)}+\underbrace{\mathbb{E}[\|\bm{X}_{t+1}^{\dagger}\bm{z}_{t+1}\|^{2}]}_{(c)}.
\end{aligned}
\tag{25}
$$

(1) For the term (a), we have

$$
\begin{aligned}
\mathbb{E}[\|H\|^{2}]={}& \mathbb{E}[\|(\bm{I}-\bm{P}_{t+1})(\bm{w}_{t}-\bm{w}_{i}^{*})+\bm{P}_{t+1}(\bm{w}_{t+1}^{*}-\bm{w}_{i}^{*})\|^{2}]\\
={}& \mathbb{E}[\|(\bm{I}-\bm{P}_{t+1})(\bm{w}_{t}-\bm{w}_{i}^{*})\|^{2}]+\mathbb{E}[\|\bm{P}_{t+1}(\bm{w}_{t+1}^{*}-\bm{w}_{i}^{*})\|^{2}]\\
&+2\mathbb{E}[\langle(\bm{I}-\bm{P}_{t+1})(\bm{w}_{t}-\bm{w}_{i}^{*}),\bm{P}_{t+1}(\bm{w}_{t+1}^{*}-\bm{w}_{i}^{*})\rangle]\\
\overset{(a)}{=}{}& \mathbb{E}[\|(\bm{I}-\bm{P}_{t+1})(\bm{w}_{t}-\bm{w}_{i}^{*})\|^{2}]+\mathbb{E}[\|\bm{P}_{t+1}(\bm{w}_{t+1}^{*}-\bm{w}_{i}^{*})\|^{2}]\\
\overset{(b)}{=}{}& \mathbb{E}[\|\bm{w}_{t}-\bm{w}_{i}^{*}\|^{2}]-\mathbb{E}[\|\bm{P}_{t+1}(\bm{w}_{t}-\bm{w}_{i}^{*})\|^{2}]+\mathbb{E}[\|\bm{P}_{t+1}(\bm{w}_{t+1}^{*}-\bm{w}_{i}^{*})\|^{2}],
\end{aligned}
\tag{26}
$$

where equality (a) is because of the orthogonality between $\bm{I}-\bm{P}_{t+1}$ and $\bm{P}_{t+1}$, and equality (b) is due to the Pythagorean theorem.

Because $\bm{P}_{t+1}$ is the orthogonal projection matrix onto the row space of $\bm{X}_{t+1}^{\top}$ (equivalently, the column space of $\bm{X}_{t+1}$), based on the rotational symmetry of the standard normal distribution, it follows that

$$\mathbb{E}[\|\bm{P}_{t+1}(\bm{w}_{t+1}^{*}-\bm{w}_{i}^{*})\|^{2}]=\frac{n}{p}\|\bm{w}_{t+1}^{*}-\bm{w}_{i}^{*}\|^{2}, \tag{27}$$

and

$$\mathbb{E}[\|\bm{P}_{t+1}(\bm{w}_{t}-\bm{w}_{i}^{*})\|^{2}]=\frac{n}{p}\mathbb{E}[\|\bm{w}_{t}-\bm{w}_{i}^{*}\|^{2}], \tag{28}$$

since $\bm{P}_{t+1}$ is independent of $\bm{w}_{t}$.

By substituting Equation 27 and Equation 28 back into Equation 26, we obtain

$$\mathbb{E}[\|H\|^{2}]=\left(1-\frac{n}{p}\right)\mathbb{E}[\|\bm{w}_{t}-\bm{w}_{i}^{*}\|^{2}]+\frac{n}{p}\|\bm{w}_{t+1}^{*}-\bm{w}_{i}^{*}\|^{2}. \tag{29}$$

(2) For the term (b), we have

$$
\begin{aligned}
\mathbb{E}[\langle H,\bm{X}_{t+1}^{\dagger}\bm{z}_{t+1}\rangle]={}& \mathbb{E}[\langle(\bm{I}-\bm{P}_{t+1})(\bm{w}_{t}-\bm{w}_{i}^{*})+\bm{P}_{t+1}(\bm{w}_{t+1}^{*}-\bm{w}_{i}^{*}),\bm{X}_{t+1}^{\dagger}\bm{z}_{t+1}\rangle]\\
={}& \mathbb{E}[\langle(\bm{I}-\bm{P}_{t+1})(\bm{w}_{t}-\bm{w}_{i}^{*}),\bm{X}_{t+1}^{\dagger}\bm{z}_{t+1}\rangle]+\mathbb{E}[\langle\bm{P}_{t+1}(\bm{w}_{t+1}^{*}-\bm{w}_{i}^{*}),\bm{X}_{t+1}^{\dagger}\bm{z}_{t+1}\rangle].
\end{aligned}
$$

Because $(\bm{I}-\bm{P}_{t+1})$ is the projection onto the null space of $\bm{X}_{t+1}^{\top}$ and $\bm{X}_{t+1}^{\dagger}\bm{z}_{t+1}$ is a vector in the row space of $\bm{X}_{t+1}^{\top}$, it follows that

$$\mathbb{E}[\langle(\bm{I}-\bm{P}_{t+1})(\bm{w}_{t}-\bm{w}_{i}^{*}),\bm{X}_{t+1}^{\dagger}\bm{z}_{t+1}\rangle]=0. \tag{30}$$

Moreover, since the noise $\bm{z}_{t+1}$ has zero mean and is independent of $\bm{X}_{t+1}$,

$$\mathbb{E}[\langle\bm{P}_{t+1}(\bm{w}_{t+1}^{*}-\bm{w}_{i}^{*}),\bm{X}_{t+1}^{\dagger}\bm{z}_{t+1}\rangle]=\mathbb{E}[\langle(\bm{X}_{t+1}^{\dagger})^{\top}\bm{P}_{t+1}(\bm{w}_{t+1}^{*}-\bm{w}_{i}^{*}),\bm{z}_{t+1}\rangle]=0,$$

and we know that

$$\mathbb{E}[\langle H,\bm{X}_{t+1}^{\dagger}\bm{z}_{t+1}\rangle]=0. \tag{31}$$

(3) For the term (c), we apply the “trace trick” by following [8]. Specifically, it can be first seen that

$$
\begin{aligned}
\|\bm{X}_{t+1}^{\dagger}\bm{z}_{t+1}\|^{2}={}& \|\bm{X}_{t+1}(\bm{X}_{t+1}^{\top}\bm{X}_{t+1})^{-1}\bm{z}_{t+1}\|^{2}\\
={}& \mathrm{tr}\left((\bm{X}_{t+1}^{\top}\bm{X}_{t+1})^{-1}(\bm{X}_{t+1}^{\top}\bm{X}_{t+1})(\bm{X}_{t+1}^{\top}\bm{X}_{t+1})^{-1}\bm{z}_{t+1}\bm{z}_{t+1}^{\top}\right)\\
={}& \mathrm{tr}\left((\bm{X}_{t+1}^{\top}\bm{X}_{t+1})^{-1}\bm{z}_{t+1}\bm{z}_{t+1}^{\top}\right).
\end{aligned}
$$

Due to the independence between $\bm{X}_{t+1}$ and the random noise $\bm{z}_{t+1}$, we have

$$
\begin{aligned}
\mathbb{E}[\|\bm{X}_{t+1}^{\dagger}\bm{z}_{t+1}\|^{2}]={}& \mathbb{E}\left[\mathrm{tr}\left((\bm{X}_{t+1}^{\top}\bm{X}_{t+1})^{-1}\bm{z}_{t+1}\bm{z}_{t+1}^{\top}\right)\right]\\
={}& \mathrm{tr}\left(\mathbb{E}[(\bm{X}_{t+1}^{\top}\bm{X}_{t+1})^{-1}\bm{z}_{t+1}\bm{z}_{t+1}^{\top}]\right)\\
={}& \mathrm{tr}\left(\mathbb{E}[(\bm{X}_{t+1}^{\top}\bm{X}_{t+1})^{-1}]\,\mathbb{E}[\bm{z}_{t+1}\bm{z}_{t+1}^{\top}]\right)\\
={}& \sigma^{2}\,\mathrm{tr}\left(\mathbb{E}[(\bm{X}_{t+1}^{\top}\bm{X}_{t+1})^{-1}]\right).
\end{aligned}
$$

Note that $(\bm{X}_{t+1}^{\top}\bm{X}_{t+1})^{-1}$ follows the inverse-Wishart distribution with identity scale matrix $\bm{I}\in\mathbb{R}^{n\times n}$ and $p$ degrees of freedom, and the reciprocal of each of its diagonal entries follows the $\chi^{2}$ distribution with $p-n+1$ degrees of freedom, so that each diagonal entry has mean $\frac{1}{p-n-1}$. Therefore, for $p\geq n+2$,

$$\mathrm{tr}\left(\mathbb{E}[(\bm{X}_{t+1}^{\top}\bm{X}_{t+1})^{-1}]\right)=\frac{n}{p-n-1},$$

such that

$$\mathbb{E}[\|\bm{X}_{t+1}^{\dagger}\bm{z}_{t+1}\|^{2}]=\frac{n\sigma^{2}}{p-n-1}. \tag{32}$$

Lemma B.2 can be proved by substituting Equation 29, Equation 31 and Equation 32 into Equation 25.
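Two of the expectations used in this proof can be sanity-checked by simulation in the simplest setting $n=1$, where $\bm{X}_{t+1}$ is a single standard Gaussian vector $\bm{x}\in\mathbb{R}^{p}$. In that case Equation 27 predicts $\mathbb{E}[\|\bm{P}\bm{v}\|^{2}]=\frac{1}{p}\|\bm{v}\|^{2}$, and the noise term of Lemma B.2 corresponds to $\mathbb{E}[1/\|\bm{x}\|^{2}]=\frac{1}{p-2}$. The following Monte Carlo sketch is only an illustration under these assumptions; the function names, sample sizes and seeds are arbitrary choices.

```python
import random


def mc_projection_ratio(p, trials=20000, seed=0):
    """Estimate E||P v||^2 for the unit vector v = e_1 and n = 1.

    With a single Gaussian direction x, ||P v||^2 = (x . v)^2 / ||x||^2.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        total += x[0] ** 2 / sum(a * a for a in x)
    return total / trials


def mc_inverse_gram(p, trials=20000, seed=1):
    """Estimate E[(x^T x)^(-1)] for x ~ N(0, I_p), i.e. the n = 1 Gram inverse."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += 1.0 / sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(p))
    return total / trials


ratio = mc_projection_ratio(10)       # theory: n / p = 0.1
inv_gram = mc_inverse_gram(10)        # theory: n / (p - n - 1) = 0.125
print(ratio, inv_gram)
```

Both estimates should land close to their theoretical values, up to Monte Carlo error.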

D.3 Proof of Theorem 4.1

Based on Lemma B.2, we have

$$
\begin{aligned}
\mathbb{E}[\|\bm{w}_{t}-\bm{w}_{i}^{*}\|^{2}]={}& \left(1-\frac{n}{p}\right)^{t}\|\bm{w}_{0}-\bm{w}_{i}^{*}\|^{2}+\sum_{k=1}^{t}\left(1-\frac{n}{p}\right)^{t-k}\frac{n}{p}\|\bm{w}_{k}^{*}-\bm{w}_{i}^{*}\|^{2}\\
&+\frac{n\sigma^{2}}{p-n-1}\sum_{k=1}^{t}\left(1-\frac{n}{p}\right)^{t-k}\\
={}& \left(1-\frac{n}{p}\right)^{t}\|\bm{w}_{i}^{*}\|^{2}+\sum_{k=1}^{t}\left(1-\frac{n}{p}\right)^{t-k}\frac{n}{p}\|\bm{w}_{k}^{*}-\bm{w}_{i}^{*}\|^{2}\\
&+\frac{n\sigma^{2}}{p-n-1}\sum_{k=1}^{t}\left(1-\frac{n}{p}\right)^{t-k}\quad\text{(since $\bm{w}_{0}=\bm{0}$)}.
\end{aligned}
\tag{33}
$$

Let $t=i$ in Equation 33. We have

$$
\begin{aligned}
\mathbb{E}[\|\bm{w}_{i}-\bm{w}_{i}^{*}\|^{2}]={}& \left(1-\frac{n}{p}\right)^{i}\|\bm{w}_{i}^{*}\|^{2}+\sum_{k=1}^{i}\left(1-\frac{n}{p}\right)^{i-k}\frac{n}{p}\|\bm{w}_{k}^{*}-\bm{w}_{i}^{*}\|^{2}\\
&+\frac{n\sigma^{2}}{p-n-1}\sum_{k=1}^{i}\left(1-\frac{n}{p}\right)^{i-k}.
\end{aligned}
\tag{34}
$$

Based on Equation 33 and Equation 34, we can obtain the closed form of $\mathbb{E}[F_{T}]$:

$$
\begin{aligned}
&\mathbb{E}[F_{T}]\\
={}& \frac{1}{T-1}\sum_{i=1}^{T-1}\mathbb{E}\left[\|\bm{w}_{T}-\bm{w}_{i}^{*}\|^{2}-\|\bm{w}_{i}-\bm{w}_{i}^{*}\|^{2}\right]\\
={}& \frac{1}{T-1}\sum_{i=1}^{T-1}\Bigg\{\left(1-\frac{n}{p}\right)^{T}\|\bm{w}_{i}^{*}\|^{2}+\sum_{k=1}^{T}\left(1-\frac{n}{p}\right)^{T-k}\frac{n}{p}\|\bm{w}_{k}^{*}-\bm{w}_{i}^{*}\|^{2}+\frac{n\sigma^{2}}{p-n-1}\sum_{k=1}^{T}\left(1-\frac{n}{p}\right)^{T-k}\\
&-\left(1-\frac{n}{p}\right)^{i}\|\bm{w}_{i}^{*}\|^{2}-\sum_{k=1}^{i}\left(1-\frac{n}{p}\right)^{i-k}\frac{n}{p}\|\bm{w}_{k}^{*}-\bm{w}_{i}^{*}\|^{2}-\frac{n\sigma^{2}}{p-n-1}\sum_{k=1}^{i}\left(1-\frac{n}{p}\right)^{i-k}\Bigg\}\\
={}& \frac{1}{T-1}\sum_{i=1}^{T-1}\Bigg\{\left[\left(1-\frac{n}{p}\right)^{T}-\left(1-\frac{n}{p}\right)^{i}\right]\|\bm{w}_{i}^{*}\|^{2}+\sum_{k=1}^{i}\frac{n}{p}\left[\left(1-\frac{n}{p}\right)^{T-k}-\left(1-\frac{n}{p}\right)^{i-k}\right]\|\bm{w}_{k}^{*}-\bm{w}_{i}^{*}\|^{2}\\
&+\sum_{k=i+1}^{T}\frac{n}{p}\left(1-\frac{n}{p}\right)^{T-k}\|\bm{w}_{k}^{*}-\bm{w}_{i}^{*}\|^{2}+\frac{n\sigma^{2}}{p-n-1}\sum_{k=1}^{i}\left[\left(1-\frac{n}{p}\right)^{T-k}-\left(1-\frac{n}{p}\right)^{i-k}\right]\\
&+\frac{n\sigma^{2}}{p-n-1}\sum_{k=i+1}^{T}\left(1-\frac{n}{p}\right)^{T-k}\Bigg\}\\
={}& \frac{1}{T-1}\sum_{i=1}^{T-1}\Bigg\{\left[\left(1-\frac{n}{p}\right)^{T}-\left(1-\frac{n}{p}\right)^{i}\right]\|\bm{w}_{i}^{*}\|^{2}+\sum_{j>i}^{T}c_{i,j}\|\bm{w}_{i}^{*}-\bm{w}_{j}^{*}\|^{2}\\
&+\frac{n\sigma^{2}}{p-n-1}\sum_{k=1}^{i}\left[\left(1-\frac{n}{p}\right)^{T-k}-\left(1-\frac{n}{p}\right)^{i-k}\right]+\frac{n\sigma^{2}}{p-n-1}\sum_{k=i+1}^{T}\left(1-\frac{n}{p}\right)^{T-k}\Bigg\}\\
={}& \frac{1}{T-1}\sum_{i=1}^{T-1}\Bigg\{\left[\left(1-\frac{n}{p}\right)^{T}-\left(1-\frac{n}{p}\right)^{i}\right]\|\bm{w}_{i}^{*}\|^{2}+\sum_{j>i}^{T}c_{i,j}\|\bm{w}_{i}^{*}-\bm{w}_{j}^{*}\|^{2}\\
&+\frac{n\sigma^{2}}{p-n-1}\left[\sum_{k=1}^{T}\left(1-\frac{n}{p}\right)^{T-k}-\sum_{k=1}^{i}\left(1-\frac{n}{p}\right)^{i-k}\right]\Bigg\}\\
={}& \frac{1}{T-1}\sum_{i=1}^{T-1}\Bigg\{\left[\left(1-\frac{n}{p}\right)^{T}-\left(1-\frac{n}{p}\right)^{i}\right]\|\bm{w}_{i}^{*}\|^{2}+\sum_{j>i}^{T}c_{i,j}\|\bm{w}_{i}^{*}-\bm{w}_{j}^{*}\|^{2}\\
&+\frac{n\sigma^{2}}{p-n-1}\left[\frac{1-\left(1-\frac{n}{p}\right)^{T}}{1-\left(1-\frac{n}{p}\right)}-\frac{1-\left(1-\frac{n}{p}\right)^{i}}{1-\left(1-\frac{n}{p}\right)}\right]\Bigg\}\\
={}& \frac{1}{T-1}\sum_{i=1}^{T-1}\Bigg\{\left[\left(1-\frac{n}{p}\right)^{T}-\left(1-\frac{n}{p}\right)^{i}\right]\|\bm{w}_{i}^{*}\|^{2}+\sum_{j>i}^{T}c_{i,j}\|\bm{w}_{i}^{*}-\bm{w}_{j}^{*}\|^{2}\\
&+\frac{n\sigma^{2}}{p-n-1}\frac{p}{n}\left[\left(1-\left(1-\frac{n}{p}\right)^{T}\right)-\left(1-\left(1-\frac{n}{p}\right)^{i}\right)\right]\Bigg\}\\
={}& \frac{1}{T-1}\sum_{i=1}^{T-1}\Bigg\{\left[\left(1-\frac{n}{p}\right)^{T}-\left(1-\frac{n}{p}\right)^{i}\right]\|\bm{w}_{i}^{*}\|^{2}+\sum_{j>i}^{T}c_{i,j}\|\bm{w}_{i}^{*}-\bm{w}_{j}^{*}\|^{2}\\
&+\frac{p\sigma^{2}}{p-n-1}\left[\left(1-\frac{n}{p}\right)^{i}-\left(1-\frac{n}{p}\right)^{T}\right]\Bigg\}\\
={}& \frac{1}{T-1}\sum_{i=1}^{T-1}\Bigg\{(r^{T}-r^{i})\|\bm{w}_{i}^{*}\|^{2}+\sum_{j>i}^{T}c_{i,j}\|\bm{w}_{i}^{*}-\bm{w}_{j}^{*}\|^{2}+\frac{p\sigma^{2}}{p-n-1}\left(r^{i}-r^{T}\right)\Bigg\}.
\end{aligned}
$$
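The last line of the derivation above can be cross-checked against a direct iteration of Lemma B.2. The sketch below implements both, taking $r=1-\frac{n}{p}$ and $c_{i,j}=(1-r)(r^{T-i}-r^{j-i}+r^{T-j})$ as used in the proof of Proposition 4.5; the function names and the toy ground-truth vectors are assumptions for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def forgetting_closed_form(w_star, n, p, sigma2):
    """Closed form of E[F_T]; w_star[i-1] plays the role of w_i^*."""
    T = len(w_star)
    r = 1.0 - n / p
    zero = [0.0] * len(w_star[0])
    total = 0.0
    for i in range(1, T):
        total += (r ** T - r ** i) * sq_dist(w_star[i - 1], zero)
        for j in range(i + 1, T + 1):
            c = (1 - r) * (r ** (T - i) - r ** (j - i) + r ** (T - j))
            total += c * sq_dist(w_star[i - 1], w_star[j - 1])
        total += p * sigma2 / (p - n - 1) * (r ** i - r ** T)
    return total / (T - 1)


def forgetting_by_recursion(w_star, n, p, sigma2):
    """Same quantity obtained by iterating Lemma B.2 from w_0 = 0."""
    T = len(w_star)
    r = 1.0 - n / p
    noise = n * sigma2 / (p - n - 1)
    zero = [0.0] * len(w_star[0])
    total = 0.0
    for i in range(1, T):
        e = sq_dist(zero, w_star[i - 1])       # E||w_0 - w_i^*||^2
        e_at_i = None
        for t in range(1, T + 1):
            e = r * e + (n / p) * sq_dist(w_star[t - 1], w_star[i - 1]) + noise
            if t == i:
                e_at_i = e                     # E||w_i - w_i^*||^2
        total += e - e_at_i
    return total / (T - 1)


tasks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical ground truths
closed = forgetting_closed_form(tasks, 2, 10, 0.1)
iterated = forgetting_by_recursion(tasks, 2, 10, 0.1)
print(closed, iterated)
```

The two values agree up to floating-point error, which is a quick way to catch sign or index mistakes when re-deriving the closed form.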

Based on Equation 33, we can also obtain the exact form of the generalization error. Specifically,

$$
\begin{aligned}
&\mathbb{E}[\|\bm{w}_{T}-\bm{w}_{i}^{*}\|^{2}]\\
={}& \left(1-\frac{n}{p}\right)^{T}\|\bm{w}_{i}^{*}\|^{2}+\sum_{k=1}^{T}\frac{n}{p}\left(1-\frac{n}{p}\right)^{T-k}\|\bm{w}_{k}^{*}-\bm{w}_{i}^{*}\|^{2}+\frac{n\sigma^{2}}{p-n-1}\sum_{k=1}^{T}\left(1-\frac{n}{p}\right)^{T-k},
\end{aligned}
$$

such that

$$
\begin{aligned}
\mathbb{E}[G_{T}]={}& \frac{1}{T}\sum_{i=1}^{T}\mathbb{E}[\|\bm{w}_{T}-\bm{w}_{i}^{*}\|^{2}]\\
={}& \frac{1}{T}\left(1-\frac{n}{p}\right)^{T}\sum_{i=1}^{T}\|\bm{w}_{i}^{*}\|^{2}+\frac{1}{T}\sum_{k=1}^{T}\frac{n}{p}\left(1-\frac{n}{p}\right)^{T-k}\sum_{i=1}^{T}\|\bm{w}_{k}^{*}-\bm{w}_{i}^{*}\|^{2}\\
&+\frac{n\sigma^{2}}{p-n-1}\sum_{k=1}^{T}\left(1-\frac{n}{p}\right)^{T-k}\\
={}& \frac{1}{T}\left(1-\frac{n}{p}\right)^{T}\sum_{i=1}^{T}\|\bm{w}_{i}^{*}\|^{2}+\frac{1}{T}\sum_{k=1}^{T}\frac{n}{p}\left(1-\frac{n}{p}\right)^{T-k}\sum_{i=1}^{T}\|\bm{w}_{k}^{*}-\bm{w}_{i}^{*}\|^{2}\\
&+\frac{n\sigma^{2}}{p-n-1}\frac{1-\left(1-\frac{n}{p}\right)^{T}}{1-\left(1-\frac{n}{p}\right)}\\
={}& \frac{1}{T}\left(1-\frac{n}{p}\right)^{T}\sum_{i=1}^{T}\|\bm{w}_{i}^{*}\|^{2}+\frac{1}{T}\sum_{k=1}^{T}\frac{n}{p}\left(1-\frac{n}{p}\right)^{T-k}\sum_{i=1}^{T}\|\bm{w}_{k}^{*}-\bm{w}_{i}^{*}\|^{2}\\
&+\frac{p\sigma^{2}}{p-n-1}\left[1-\left(1-\frac{n}{p}\right)^{T}\right]\\
={}& \frac{r^{T}}{T}\sum_{i=1}^{T}\|\bm{w}_{i}^{*}\|^{2}+\frac{1}{T}\sum_{i=1}^{T}\frac{nr^{T-i}}{p}\sum_{k=1}^{T}\|\bm{w}_{k}^{*}-\bm{w}_{i}^{*}\|^{2}+\frac{p\sigma^{2}}{p-n-1}\left(1-r^{T}\right).
\end{aligned}
$$
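The closed form of $\mathbb{E}[G_{T}]$ can be evaluated and compared with a direct iteration of Lemma B.2 in the same way. In this sketch the three summands are simply called `term1`, `term2` and `term3` (the second one is the task-dissimilarity term discussed in Section C.3); all names and the toy ground-truth vectors are illustrative assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def generalization_closed_form(w_star, n, p, sigma2):
    """Closed form of E[G_T] with r = 1 - n/p; w_star[i-1] plays the role of w_i^*."""
    T = len(w_star)
    r = 1.0 - n / p
    zero = [0.0] * len(w_star[0])
    term1 = r ** T / T * sum(sq_dist(w, zero) for w in w_star)
    term2 = sum(
        n * r ** (T - i) / p * sum(sq_dist(w_k, w_star[i - 1]) for w_k in w_star)
        for i in range(1, T + 1)
    ) / T
    term3 = p * sigma2 / (p - n - 1) * (1.0 - r ** T)
    return term1 + term2 + term3


def generalization_by_recursion(w_star, n, p, sigma2):
    """Average over i of E||w_T - w_i^*||^2, iterating Lemma B.2 from w_0 = 0."""
    r = 1.0 - n / p
    noise = n * sigma2 / (p - n - 1)
    zero = [0.0] * len(w_star[0])
    total = 0.0
    for w_i in w_star:
        e = sq_dist(zero, w_i)
        for w_k in w_star:                  # tasks are learnt in list order
            e = r * e + (n / p) * sq_dist(w_k, w_i) + noise
        total += e
    return total / len(w_star)


tasks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical ground truths
g_closed = generalization_closed_form(tasks, 2, 10, 0.1)
g_iterated = generalization_by_recursion(tasks, 2, 10, 0.1)
print(g_closed, g_iterated)
```

Reordering `tasks` changes the value in general, which is the task-order effect on generalization discussed in Section C.4.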

D.4 Proof of Proposition C.1

Based on Theorem 4.1, it follows that

$$
\begin{aligned}
\mathbb{E}[F_{2}]={}& (r^{2}-r)\|\bm{w}_{1}^{*}\|^{2}+\frac{n}{p}\|\bm{w}_{1}^{*}-\bm{w}_{2}^{*}\|^{2}+\frac{nr\sigma^{2}}{p-n-1}\\
={}& -\left(1-\frac{n}{p}\right)\frac{n}{p}\|\bm{w}_{1,s}^{*}\|^{2}+\frac{n}{p}\|\bm{w}_{1,s}^{*}\|^{2}+\frac{n}{p}\|\bm{w}_{2,s}^{*}\|^{2}-2\frac{n}{p}\langle\bm{w}_{1,s}^{*},\bm{w}_{2,s}^{*}\rangle+\frac{nr\sigma^{2}}{p-n-1}\\
={}& \left(\frac{n}{p}\right)^{2}\|\bm{w}_{1,s}^{*}\|^{2}+\frac{n}{p}\|\bm{w}_{2,s}^{*}\|^{2}-2\frac{n}{p}\langle\bm{w}_{1,s}^{*},\bm{w}_{2,s}^{*}\rangle+\frac{nr\sigma^{2}}{p-n-1}.
\end{aligned}
$$

When $\sigma^{2}<\frac{p-n-1}{p}\|\bm{w}_{1}^{*}\|^{2}$,

$$\frac{n}{p}\|\bm{w}_{1}^{*}\|^{2}+\|\bm{w}_{2}^{*}\|^{2}+\frac{(p-n)\sigma^{2}}{p-n-1}\leq\|\bm{w}_{1}^{*}\|^{2}+\|\bm{w}_{2}^{*}\|^{2},$$

so the condition below can be satisfied. Dividing the last expression of $\mathbb{E}[F_{2}]$ by $\frac{n}{p}$ and using $pr=p-n$, we have $\mathbb{E}[F_{2}]\leq 0$ if

$$2\langle\bm{w}_{1,\mathcal{S}_{1}}^{*},\bm{w}_{2,\mathcal{S}_{2}}^{*}\rangle\geq\frac{n}{p}\|\bm{w}_{1}^{*}\|^{2}+\|\bm{w}_{2}^{*}\|^{2}+\frac{(p-n)\sigma^{2}}{p-n-1}.$$

D.5 Proof of Proposition 4.4

Without loss of generality, we assume that $\|\bm{w}_{i}^{*}-\bm{w}_{j}^{*}\|=1$ for task $i$ in Category I and task $j$ in Category II. It follows that

$$
\begin{aligned}
\tilde{F}_{T}(\bm{w}_{T})={}& \sum_{i<i^{*}}c_{i,i^{*}}+\sum_{j>i^{*}}c_{i^{*},j}\\
={}& (1-r)\left(\sum_{i=1}^{i^{*}-1}(r^{T-i}-r^{i^{*}-i}+r^{T-i^{*}})+\sum_{j=i^{*}+1}^{T}(r^{T-i^{*}}-r^{j-i^{*}}+r^{T-j})\right)\\
={}& (1-r)\left((T-1)\cdot r^{T-i^{*}}+r^{T-i^{*}+1}\frac{r^{i^{*}-1}-1}{r-1}-r\frac{r^{i^{*}-1}-1}{r-1}+1-r^{T-i^{*}}\right)\\
={}& (1-r)(T-2)r^{T-i^{*}}+(r^{T-i^{*}}-1)(1-r^{i^{*}-1})r+(1-r).
\end{aligned}
$$

Let $\alpha\coloneqq r^{T-i^{*}}$. Then minimizing $\tilde{F}_{T}(\bm{w}_{T})$ is equivalent to minimizing

$$
\begin{aligned}
&(1-r)(T-2)\alpha+(\alpha-1)\left(1-\frac{r^{T-1}}{\alpha}\right)r\\
={}& ((1-r)(T-2)+r)\alpha+\frac{r^{T}}{\alpha}-r^{T}-r.
\end{aligned}
$$

By setting the derivative w.r.t. $\alpha$ to 0, i.e., $(1-r)(T-2)+r-\frac{r^{T}}{\alpha^{2}}=0$, we find that the optimal value of $\alpha$ is

$$\alpha=\sqrt{\frac{r^{T}}{(1-r)(T-2)+r}}=\sqrt{\frac{r^{T}}{T-2-(T-3)r}}, \tag{35}$$

which is clearly increasing with $r$. Therefore, the optimal position of the special task $i^{*}$ is non-increasing with $r$, i.e., non-decreasing with $\frac{n}{p}$.
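The simplification of $\tilde{F}_{T}(\bm{w}_{T})$ in this proof can be verified numerically by comparing the direct sum of the coefficients $c_{i,j}=(1-r)(r^{T-i}-r^{j-i}+r^{T-j})$ with the final expression, and the best position of the special task can be found by enumeration. The function names and the values $T=10$, $r=0.9$ below are arbitrary choices for illustration.

```python
def f_tilde_direct(T, i_star, r):
    """Sum of c_{i,i*} over i < i* plus c_{i*,j} over j > i* (positions 1-indexed)."""
    def c(i, j):
        return (1 - r) * (r ** (T - i) - r ** (j - i) + r ** (T - j))

    before = sum(c(i, i_star) for i in range(1, i_star))
    after = sum(c(i_star, j) for j in range(i_star + 1, T + 1))
    return before + after


def f_tilde_simplified(T, i_star, r):
    """Last line of the simplification, with alpha = r^(T - i*)."""
    alpha = r ** (T - i_star)
    return (1 - r) * (T - 2) * alpha + (alpha - 1) * (1 - r ** (i_star - 1)) * r + (1 - r)


def best_position(T, r):
    """Position of the special task that minimizes the direct sum."""
    return min(range(1, T + 1), key=lambda k: f_tilde_direct(T, k, r))


gaps = [abs(f_tilde_direct(10, k, 0.9) - f_tilde_simplified(10, k, 0.9)) for k in range(1, 11)]
position = best_position(10, 0.9)
print(max(gaps), position)
```

For this choice of parameters the minimizing position falls in the first half of the sequence but not at the very first place, consistent with the discussion of Special case I in Section C.4.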

D.6 Proof of Proposition 4.5

Without loss of generality, we assume that for any tasks $i$ and $j$

$$\|{\bm{w}}_{i}^{*}-{\bm{w}}_{j}^{*}\|^{2}=\begin{cases}0,&\text{if tasks $i$ and $j$ are in the same category;}\\ 1,&\text{if tasks $i$ and $j$ are in different categories.}\end{cases}$$

Based on the closed form of forgetting, it suffices to minimize $\sum_{i=1}^{T-1}\sum_{j>i}^{T}c_{i,j}\|{\bm{w}}_{i}^{*}-{\bm{w}}_{j}^{*}\|^{2}$ in order to minimize the forgetting $F_{T}({\bm{w}}_{T})$, where $c_{i,j}=(1-r)(r^{T-i}-r^{j-i}+r^{T-j})$. Moreover, swapping the positions of the $i$-th task and the $j$-th task does not change the value of $r^{T-i}+r^{T-j}$. In other words, only the term $r^{j-i}$ affects the optimal task order, which should therefore minimize $\sum_{i=1}^{T-1}\sum_{j>i}^{T}(-r^{j-i})\|{\bm{w}}^{*}_{i}-{\bm{w}}_{j}^{*}\|^{2}$.

(1) For the case $T=4$, there are three effective task orders: (a) task $1\in C_{1}$, task $2\in C_{1}$, task $3\in C_{2}$, task $4\in C_{2}$ (written as $(C_{1},C_{1},C_{2},C_{2})$ for simplicity); (b) $(C_{1},C_{2},C_{1},C_{2})$; (c) $(C_{1},C_{2},C_{2},C_{1})$. Swapping all tasks in $C_{1}$ with all tasks in $C_{2}$ does not change the value of forgetting, e.g., $(C_{1},C_{1},C_{2},C_{2})$ has the same forgetting as $(C_{2},C_{2},C_{1},C_{1})$. In what follows, we compare $\sum_{i=1}^{T-1}\sum_{j>i}^{T}(-r^{j-i})\|{\bm{w}}^{*}_{i}-{\bm{w}}_{j}^{*}\|^{2}$ among these three orders.

(a) For $(C_{1},C_{1},C_{2},C_{2})$,

$$\sum_{i=1}^{T-1}\sum_{j>i}^{T}(-r^{j-i})\|{\bm{w}}^{*}_{i}-{\bm{w}}_{j}^{*}\|^{2}=-(r^{2}+r^{3}+r+r^{2}).$$

(b) For $(C_{1},C_{2},C_{1},C_{2})$,

$$\sum_{i=1}^{T-1}\sum_{j>i}^{T}(-r^{j-i})\|{\bm{w}}^{*}_{i}-{\bm{w}}_{j}^{*}\|^{2}=-(r+r^{3}+r+r).$$

(c) For $(C_{1},C_{2},C_{2},C_{1})$,

$$\sum_{i=1}^{T-1}\sum_{j>i}^{T}(-r^{j-i})\|{\bm{w}}^{*}_{i}-{\bm{w}}_{j}^{*}\|^{2}=-(r+r^{2}+r+r^{2}).$$

It is clear that the alternating task order, i.e., $(C_{1},C_{2},C_{1},C_{2})$ and $(C_{2},C_{1},C_{2},C_{1})$, is the optimal order for this special case.
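The comparison for $T=4$ can be reproduced by brute force. The sketch below is illustrative only: the function names are hypothetical, categories are encoded as the characters `"1"` and `"2"`, and the squared distance is taken as 1 across categories and 0 within, as assumed in this proof. It evaluates both the full weighted sum with $c_{i,j}=(1-r)(r^{T-i}-r^{j-i}+r^{T-j})$ and the order-dependent part $\sum(-r^{j-i})$.

```python
from itertools import permutations


def forgetting_sum(order, r):
    """Sum of c_{i,j} over task pairs i < j that belong to different categories."""
    T = len(order)
    total = 0.0
    for i in range(1, T + 1):
        for j in range(i + 1, T + 1):
            if order[i - 1] != order[j - 1]:
                total += (1 - r) * (r ** (T - i) - r ** (j - i) + r ** (T - j))
    return total


def order_term(order, r):
    """Order-dependent part only: sum of -r**(j-i) over cross-category pairs."""
    T = len(order)
    return sum(-(r ** (j - i))
               for i in range(T) for j in range(i + 1, T)
               if order[i] != order[j])


def effective_orders(labels):
    """Distinct arrangements, one representative per swap of the two category names."""
    return sorted({p for p in permutations(labels) if p[0] == labels[0]})


if __name__ == "__main__":
    r = 0.5  # illustrative value of r in (0, 1)
    for order in effective_orders("1122"):
        print("".join(order), order_term(order, r), forgetting_sum(order, r))
```

Fixing the first task to category `"1"` removes the symmetry under swapping $C_{1}$ and $C_{2}$, which leaves exactly the three orders (a), (b), and (c).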

(2) For the case $T=6$, based on the closed form of forgetting in Theorem 4.1, we can enumerate all effective task orders by computer. Including the perfectly alternating task order, i.e., $(C_{1},C_{2},C_{1},C_{2},C_{1},C_{2})$ and $(C_{2},C_{1},C_{2},C_{1},C_{2},C_{1})$, there are 10 effective task orders, as listed in Table 2. We further evaluate the difference of forgetting between each task order in Table 2 and the perfectly alternating task order, where a positive difference means that the corresponding task order leads to larger forgetting than the perfectly alternating task order. It can be verified that the difference of forgetting is positive for every task order in Table 2 other than the perfectly alternating one, which indicates that the optimal task order is the perfectly alternating task order.

| Index | Order | Difference of forgetting |
|---|---|---|
| 1 | $(C_{1},C_{2},C_{1},C_{2},C_{1},C_{2})$ | $0$ |
| 2 | $(C_{1},C_{1},C_{2},C_{1},C_{2},C_{2})$ | $r\left(2-2r+2r^{2}-2r^{3}\right)$ |
| 3 | $(C_{1},C_{1},C_{2},C_{2},C_{1},C_{2})$ | $r\left(2-3r+2r^{2}-r^{3}\right)$ |
| 4 | $(C_{1},C_{1},C_{2},C_{2},C_{2},C_{1})$ | $r\left(3-3r-r^{3}+r^{4}\right)$ |
| 5 | $(C_{1},C_{2},C_{2},C_{1},C_{1},C_{2})$ | $r\left(2-4r+2r^{2}\right)$ |
| 6 | $(C_{1},C_{2},C_{2},C_{1},C_{2},C_{1})$ | $r\left(1-2r+2r^{2}-2r^{3}+r^{4}\right)$ |
| 7 | $(C_{1},C_{1},C_{1},C_{2},C_{2},C_{2})$ | $r\left(4-2r-2r^{3}\right)$ |
| 8 | $(C_{1},C_{2},C_{1},C_{2},C_{2},C_{1})$ | $r\left(1-2r+2r^{2}-2r^{3}+r^{4}\right)$ |
| 9 | $(C_{1},C_{2},C_{1},C_{1},C_{2},C_{2})$ | $r\left(2-3r+2r^{2}-r^{3}\right)$ |
| 10 | $(C_{1},C_{2},C_{2},C_{2},C_{1},C_{1})$ | $r\left(3-3r-r^{3}+r^{4}\right)$ |

Table 2: Evaluation of the difference of forgetting between each effective task order and the perfectly alternating task order $(C_{1},C_{2},C_{1},C_{2},C_{1},C_{2})$, where a positive difference means that the corresponding task order leads to larger forgetting than the perfectly alternating task order.
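The entries of Table 2 can be cross-checked against a direct enumeration. The sketch below makes two assumptions that are not stated explicitly in the text: the tabulated differences are read as differences of the order-dependent sum $\sum(-r^{j-i})$ over cross-category pairs (so positive constant factors such as $1-r$ are omitted), and the names `TABLE_2`, `order_term`, and `difference_from_alternating` are hypothetical. Orders are encoded as strings with `"1"` for $C_{1}$ and `"2"` for $C_{2}$.

```python
from itertools import permutations

# Difference polynomials transcribed from Table 2, keyed by task order.
TABLE_2 = {
    "121212": lambda r: 0.0,
    "112122": lambda r: r * (2 - 2 * r + 2 * r**2 - 2 * r**3),
    "112212": lambda r: r * (2 - 3 * r + 2 * r**2 - r**3),
    "112221": lambda r: r * (3 - 3 * r - r**3 + r**4),
    "122112": lambda r: r * (2 - 4 * r + 2 * r**2),
    "122121": lambda r: r * (1 - 2 * r + 2 * r**2 - 2 * r**3 + r**4),
    "111222": lambda r: r * (4 - 2 * r - 2 * r**3),
    "121221": lambda r: r * (1 - 2 * r + 2 * r**2 - 2 * r**3 + r**4),
    "121122": lambda r: r * (2 - 3 * r + 2 * r**2 - r**3),
    "122211": lambda r: r * (3 - 3 * r - r**3 + r**4),
}


def order_term(order, r):
    """Sum of -r**(j-i) over task pairs i < j in different categories."""
    T = len(order)
    return sum(-(r ** (j - i))
               for i in range(T) for j in range(i + 1, T)
               if order[i] != order[j])


def difference_from_alternating(order, r, alternating="121212"):
    """Gap of the order-dependent sum relative to the perfectly alternating order."""
    return order_term(order, r) - order_term(alternating, r)


def enumerate_effective_orders():
    """All arrangements of three C1 and three C2 tasks that start with C1."""
    return {"".join(p) for p in permutations("111222") if p[0] == "1"}


if __name__ == "__main__":
    r = 0.5  # illustrative value of r in (0, 1)
    for order in sorted(enumerate_effective_orders()):
        print(order, difference_from_alternating(order, r), TABLE_2[order](r))
```

Under this reading, the enumerated differences and the tabulated polynomials can be compared value by value for any $r\in(0,1)$.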

D.7 Proof of Proposition C.2

Following the same strategy as in Special case II, Table 3 lists all effective task orders together with the difference between their forgetting and that of the perfectly alternating task order, i.e., $(C_{1},C_{2},C_{3},C_{1},C_{2},C_{3})$ and its ‘equivalent’ task orders (e.g., $(C_{1},C_{3},C_{2},C_{1},C_{3},C_{2})$). It can also be verified that the perfectly alternating task order is the optimal task order in this case.

| Index | Order | Difference of forgetting |
|---|---|---|
| 1 | $(C_{1},C_{2},C_{3},C_{1},C_{2},C_{3})$ | $0$ |
| 2 | $(C_{1},C_{2},C_{1},C_{2},C_{3},C_{3})$ | $r\left(1+2r-3r^{2}\right)$ |
| 3 | $(C_{1},C_{2},C_{2},C_{3},C_{3},C_{1})$ | $r\left(2-3r^{2}+r^{4}\right)$ |
| 4 | $(C_{1},C_{2},C_{1},C_{3},C_{2},C_{3})$ | $r^{2}\left(2-2r\right)$ |
| 5 | $(C_{1},C_{2},C_{3},C_{2},C_{1},C_{3})$ | $r^{2}\left(1-2r+r^{2}\right)$ |
| 6 | $(C_{1},C_{2},C_{3},C_{1},C_{3},C_{2})$ | $r^{2}\left(1-2r+r^{2}\right)$ |
| 7 | $(C_{1},C_{1},C_{2},C_{3},C_{2},C_{3})$ | $r\left(1+2r-3r^{2}\right)$ |
| 8 | $(C_{1},C_{2},C_{2},C_{1},C_{3},C_{3})$ | $r\left(2-2r^{2}\right)$ |
| 9 | $(C_{1},C_{1},C_{2},C_{2},C_{3},C_{3})$ | $r\left(3-3r^{2}\right)$ |
| 10 | $(C_{1},C_{2},C_{1},C_{3},C_{3},C_{2})$ | $r\left(1+r-3r^{2}+r^{3}\right)$ |
| 11 | $(C_{1},C_{2},C_{3},C_{3},C_{1},C_{2})$ | $r\left(1-3r^{2}+2r^{3}\right)$ |
| 12 | $(C_{1},C_{2},C_{3},C_{3},C_{2},C_{1})$ | $r\left(1-2r^{2}+r^{4}\right)$ |
| 13 | $(C_{1},C_{2},C_{2},C_{3},C_{1},C_{3})$ | $r\left(1+r-3r^{2}+r^{3}\right)$ |
| 14 | $(C_{1},C_{1},C_{2},C_{3},C_{3},C_{2})$ | $r\left(2-2r^{2}\right)$ |
| 15 | $(C_{1},C_{2},C_{3},C_{2},C_{3},C_{1})$ | $r^{2}\left(2-3r+r^{3}\right)$ |

Table 3: Evaluation of the difference of forgetting between each effective task order and the perfectly alternating task order $(C_{1},C_{2},C_{3},C_{1},C_{2},C_{3})$, where a positive difference means that the corresponding task order leads to larger forgetting than the perfectly alternating task order.
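For three categories, the set of effective orders can be generated by identifying orders that differ only by renaming the categories. The following sketch is an illustration under stated assumptions rather than the authors' program: `canonical`, `effective_orders`, and `forgetting_sum` are hypothetical names, and the squared distance between tasks is taken as 1 across categories and 0 within, as in the two-category case.

```python
from itertools import permutations


def canonical(order):
    """Relabel categories by first appearance so that equivalent orders coincide."""
    mapping = {}
    for c in order:
        mapping.setdefault(c, str(len(mapping) + 1))
    return "".join(mapping[c] for c in order)


def effective_orders(labels="112233"):
    """One representative per class of orders that differ only by category names."""
    return sorted({canonical(p) for p in permutations(labels)})


def forgetting_sum(order, r):
    """Sum of c_{i,j} = (1-r)(r^(T-i) - r^(j-i) + r^(T-j)) over cross-category pairs."""
    T = len(order)
    total = 0.0
    for i in range(1, T + 1):
        for j in range(i + 1, T + 1):
            if order[i - 1] != order[j - 1]:
                total += (1 - r) * (r ** (T - i) - r ** (j - i) + r ** (T - j))
    return total


if __name__ == "__main__":
    r = 0.5  # illustrative value of r in (0, 1)
    orders = effective_orders()
    best = min(orders, key=lambda o: forgetting_sum(o, r))
    print(len(orders), "effective orders; minimizer:", best)
```

Because this sketch keeps the factor $1-r$ from $c_{i,j}$, its gaps equal the Table 3 entries multiplied by $1-r$; for example, the gap of $(C_{1},C_{1},C_{2},C_{2},C_{3},C_{3})$ is $(1-r)\,r\left(3-3r^{2}\right)$.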

D.8 Proof of Theorem 4.3

Intuitive explanation of Theorem 4.3: In the underparameterized regime, minimizing the loss in Equation 3 for the current task $t$ leads to a unique solution for this task, which depends on neither the learning process nor the learned models of previous tasks. That is to say, learning is independent across tasks, such that (i) the learning order of the first $T-1$ tasks does not matter, and (ii) both forgetting and generalization performance depend only on the model distance between the last task and the other tasks, i.e., $\sum_{i=1}^{T-1}\|{\bm{w}}_{T}^{*}-{\bm{w}}_{i}^{*}\|^{2}$.

Now we formally prove Theorem 4.3.

For the underparameterized regime, the solution of minimizing the training loss is

$$\begin{aligned}
{\bm{w}}_{t}={}&({\bm{X}}_{t}{\bm{X}}_{t}^{\top})^{-1}{\bm{X}}_{t}{\bm{y}}_{t}\\
={}&({\bm{X}}_{t}{\bm{X}}_{t}^{\top})^{-1}{\bm{X}}_{t}\left({\bm{X}}_{t}^{\top}{\bm{w}}_{t}^{*}+{\bm{z}}_{t}\right)\\
={}&{\bm{w}}_{t}^{*}+({\bm{X}}_{t}{\bm{X}}_{t}^{\top})^{-1}{\bm{X}}_{t}{\bm{z}}_{t}.
\end{aligned}$$

It follows that

$${\bm{w}}_{T}-{\bm{w}}_{i}^{*}={\bm{w}}_{T}^{*}-{\bm{w}}_{i}^{*}+({\bm{X}}_{T}{\bm{X}}_{T}^{\top})^{-1}{\bm{X}}_{T}{\bm{z}}_{T},$$

such that the model error for the $i$-th task can be represented as:

$$\|{\bm{w}}_{T}-{\bm{w}}_{i}^{*}\|^{2}=\|{\bm{w}}_{T}^{*}-{\bm{w}}_{i}^{*}\|^{2}+2\langle{\bm{w}}_{T}^{*}-{\bm{w}}_{i}^{*},({\bm{X}}_{T}{\bm{X}}_{T}^{\top})^{-1}{\bm{X}}_{T}{\bm{z}}_{T}\rangle+\|({\bm{X}}_{T}{\bm{X}}_{T}^{\top})^{-1}{\bm{X}}_{T}{\bm{z}}_{T}\|^{2}.$$

Taking expectation on both sides, the cross term vanishes because the noise ${\bm{z}}_{T}$ is zero-mean and independent of ${\bm{X}}_{T}$, and we have

$$\mathbb{E}\|{\bm{w}}_{T}-{\bm{w}}_{i}^{*}\|^{2}=\|{\bm{w}}_{T}^{*}-{\bm{w}}_{i}^{*}\|^{2}+\frac{p\sigma^{2}}{n-p-1}.$$

Therefore, it can be shown that

$$\mathbb{E}[G_{T}]=\mathbb{E}\frac{1}{T}\sum_{i=1}^{T}\|{\bm{w}}_{T}-{\bm{w}}_{i}^{*}\|^{2}=\left(\frac{1}{T}\sum_{i=1}^{T}\|{\bm{w}}_{T}^{*}-{\bm{w}}_{i}^{*}\|^{2}\right)+\frac{p\sigma^{2}}{n-p-1}$$

and, since $\mathbb{E}\|{\bm{w}}_{i}-{\bm{w}}_{i}^{*}\|^{2}=\mathbb{E}\|({\bm{X}}_{i}{\bm{X}}_{i}^{\top})^{-1}{\bm{X}}_{i}{\bm{z}}_{i}\|^{2}=\frac{p\sigma^{2}}{n-p-1}$ cancels the noise term in $\mathbb{E}\|{\bm{w}}_{T}-{\bm{w}}_{i}^{*}\|^{2}$,

$$\begin{aligned}
\mathbb{E}[F_{T}]={}&\frac{1}{T-1}\sum_{i=1}^{T-1}\mathbb{E}\left[\|{\bm{w}}_{T}-{\bm{w}}_{i}^{*}\|^{2}-\|{\bm{w}}_{i}-{\bm{w}}_{i}^{*}\|^{2}\right]\\
={}&\frac{1}{T-1}\sum_{i=1}^{T-1}\|{\bm{w}}_{T}^{*}-{\bm{w}}_{i}^{*}\|^{2}.
\end{aligned}$$
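The expected model error in the underparameterized regime can be sanity-checked by simulation. The sketch below is a simplified illustration and not the authors' experiment: it assumes the scalar case $p=1$, i.i.d. standard Gaussian features, and Gaussian noise with standard deviation $\sigma$; all function names and parameter values are hypothetical. It compares a Monte Carlo estimate of $\mathbb{E}\|{\bm{w}}_{T}-{\bm{w}}_{i}^{*}\|^{2}$ with $\|{\bm{w}}_{T}^{*}-{\bm{w}}_{i}^{*}\|^{2}+\frac{p\sigma^{2}}{n-p-1}$.

```python
import random


def simulated_model_error(w_star_T, w_star_i, n, sigma, trials, seed=0):
    """Monte Carlo estimate of E[(w_T - w_i*)^2] for scalar least squares (p = 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]    # features of task T
        z = [rng.gauss(0.0, sigma) for _ in range(n)]  # label noise of task T
        y = [xk * w_star_T + zk for xk, zk in zip(x, z)]
        # Scalar version of w_T = (X X^T)^{-1} X y.
        w_T = sum(xk * yk for xk, yk in zip(x, y)) / sum(xk * xk for xk in x)
        total += (w_T - w_star_i) ** 2
    return total / trials


def predicted_model_error(w_star_T, w_star_i, n, p, sigma):
    """Closed form: squared distance of the true models plus p*sigma^2/(n-p-1)."""
    return (w_star_T - w_star_i) ** 2 + p * sigma**2 / (n - p - 1)


if __name__ == "__main__":
    # Hypothetical ground-truth models and sample size, chosen for illustration.
    estimate = simulated_model_error(1.0, 0.5, n=20, sigma=1.0, trials=20000)
    closed_form = predicted_model_error(1.0, 0.5, n=20, p=1, sigma=1.0)
    print(f"simulated={estimate:.4f}, closed form={closed_form:.4f}")
```

The estimate does not depend on any earlier task, which reflects the observation above that learning in this regime is independent across tasks.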