HiDe-PET: Continual Learning via Hierarchical Decomposition of Parameter-Efficient Tuning

Liyuan Wang, Jingyi Xie, Xingxing Zhang, Hang Su, Member, IEEE, Jun Zhu, Fellow, IEEE

Liyuan Wang, Jingyi Xie, Xingxing Zhang, Hang Su, and Jun Zhu are with Dept. of Comp. Sci. & Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint Center for ML, Tsinghua University, Beijing, China (email: wly19@tsinghua.org.cn; jingyi_xie96@163.com; xxzhang1993@gmail.com; {suhangss, dcszj}@tsinghua.edu.cn). Corresponding authors: Jun Zhu and Hang Su.
Abstract

The deployment of pre-trained models (PTMs) has greatly advanced the field of continual learning (CL), enabling positive knowledge transfer and resilience to catastrophic forgetting. To sustain these advantages for sequentially arriving tasks, a promising direction involves keeping the pre-trained backbone frozen while employing parameter-efficient tuning (PET) techniques to instruct representation learning. Despite the popularity of Prompt-based PET for CL, its empirical design often leads to sub-optimal performance in our evaluation of different PTMs and target tasks. To this end, we propose a unified framework for CL with PTMs and PET that provides both theoretical and empirical advancements. We first perform an in-depth theoretical analysis of the CL objective in a pre-training context, decomposing it into hierarchical components namely within-task prediction, task-identity inference and task-adaptive prediction. We then present Hierarchical Decomposition PET (HiDe-PET), an innovative approach that explicitly optimizes the decomposed objective through incorporating task-specific and task-shared knowledge via mainstream PET techniques along with efficient recovery of pre-trained representations. Leveraging this framework, we delve into the distinct impacts of implementation strategy, PET technique and PET architecture, as well as adaptive knowledge accumulation amidst pronounced distribution changes. Finally, across various CL scenarios, our approach demonstrates remarkably superior performance over a broad spectrum of recent strong baselines. Our code is available at https://github.com/thu-ml/HiDe-PET.

Index Terms:
Continual Learning, Incremental Learning, Pre-trained Models, Parameter-Efficient Tuning, Catastrophic Forgetting.

1 Introduction

The proficiency of artificial intelligence (AI) in accommodating real-world dynamics is largely contingent on its capability of continual learning (CL), which has benefited significantly in recent years from the deployment of pre-trained models (PTMs). In essence, PTMs provide CL with not only positive knowledge transfer but also resilience to catastrophic forgetting [1, 2, 3, 4], both of which are critical to improving the performance of CL methods. Given that adapting PTMs to sequentially arriving tasks may compromise these advantages via progressive overwriting of pre-trained knowledge, numerous efforts have been devoted to implementing parameter-efficient tuning (PET) techniques for CL, i.e., keeping the pre-trained backbone frozen and introducing a few lightweight parameters to instruct representation learning. However, current advances predominantly center around empirical designs of Prompt-based PET [5, 6, 7, 8], which struggle to adequately achieve the CL objective and therefore often exhibit sub-optimal performance across different PTMs and target tasks (see Sec. 6.2). In response to this critical challenge, there is an urgent need for a more profound understanding of CL with PTMs and PET, coupled with endeavors to enhance its effectiveness and generality.

In this work, we propose a unified framework for CL with PTMs and PET, seeking to advance this direction with both theoretical and empirical insights. We initiate our explorations with an in-depth theoretical analysis of the CL objective in a pre-training context. Considering the significant impact of pre-trained knowledge on CL, this objective can be decomposed into three hierarchical components in response to sequentially arriving tasks, namely within-task prediction (WTP), task-identity inference (TII) and task-adaptive prediction (TAP). They prove to be sufficient and necessary to ensure low errors in common CL settings. Based on the theoretical analysis, we devise an innovative approach named Hierarchical Decomposition PET (HiDe-PET) to explicitly optimize WTP, TII and TAP.

The principal concept behind HiDe-PET leverages the unique strengths of PTMs for CL, with a particular focus on the effective instruction and efficient recovery of pre-trained representations. As a generic approach applicable to mainstream PET techniques (e.g., Prompt [9, 10], Adapter [11], LoRA [12], etc.), we construct an ensemble of task-specific parameters that incorporates the knowledge of each task to optimize WTP, and a set of task-shared parameters that accumulates knowledge in a task-agnostic manner to improve TII. We further recover the distribution of uninstructed and instructed representations through preserving their statistical information, so as to optimize TII and TAP, respectively. In this way, our HiDe-PET is able to adeptly predict the identity of task-specific parameters from uninstructed representations and collect appropriate instructed representations for final predictions.

The proposed framework facilitates a thorough assessment of important factors that emerge in CL with PTMs and PET, including the implementation strategy, PET technique and PET architecture. For example, we dissect representative strategies for stabilizing task-shared parameters and for preserving pre-trained representations, empirically analyzing their effectiveness. Moreover, we evaluate the behaviors of different PET techniques in achieving the CL objective, where the Prompt-based PET tends to be less effective in WTP, consequently displaying lower sensitivity to TII and clearly lagging behind the LoRA/Adapter-based PET. Motivated by the inherent connections between TII and out-of-distribution (OOD) detection, we further devise a PET hierarchy tailored for adaptive knowledge accumulation, and unravel the relationship between task-specific and task-shared PET architectures in representation learning.

Figure 1: Summary of CL performance with different PET techniques. We compare our HiDe-PET, our preliminary version [13] and LAE [14], and report the final average accuracy (FAA) over three pre-trained checkpoints and four CL benchmarks.

Upon extensive experiments, our HiDe-PET clearly outperforms a wide range of recent strong baselines, and demonstrates remarkable generality across a variety of PET techniques, pre-trained checkpoints and CL benchmarks (summarized in Fig. 1). We further provide empirical analysis to elucidate the respective contributions of the three hierarchical components. Note that some results have been presented in our preliminary version [13], which mainly considered a specific implementation of task-specific parameters via Prompt-based PET. In contrast, the current version presents a unified framework for CL with PTMs and PET. It incorporates substantial extensions to the implementation strategy, PET technique and PET architecture, culminating in a comprehensive analysis and improved performance.

Overall, our main contributions are as follows:

  • We present an in-depth theoretical analysis of the CL objective in a pre-training context, decomposing it into three hierarchical components for model design;

  • We propose an innovative approach that exploits mainstream PET techniques and pre-trained representations to explicitly optimize the decomposed objective;

  • We conduct extensive experiments to demonstrate the effectiveness and generality of our approach, coupled with a thorough assessment of important factors that emerge in CL with PTMs and PET.

2 Related Work

Continual Learning (CL) is characterized by learning sequentially arriving tasks and performing well on them. The primary challenge of CL stems from the dynamic nature of data distribution, which leads to catastrophic forgetting of old tasks while acquiring new ones [3, 15]. As summarized in a recent survey [3], representative methods include selective stabilization of important parameters for old tasks [16, 17, 18], approximation and recovery of old data distributions [19, 20, 21], exploitation of robust and well-distributed representations [22, 23, 24], manipulation of optimization programs via gradient projection [25, 26, 27], construction of (relatively) dedicated parameters for each task [28, 29, 30], etc.

In the realm of CL, current efforts have predominantly centered around computer vision (CV), specifically for visual classification tasks. These efforts have progressively expanded their scope to include more complex visual tasks, as well as natural language processing (NLP), reinforcement learning (RL) and other related applications. Across various representative CL settings, especially those lacking the oracle of task identity during the testing phase, robust CL models often necessitate the storage and rehearsal of old training samples [31, 32, 21], which creates additional resource overheads and potential privacy concerns. These issues have been largely avoided through effective use of pre-trained knowledge in recent work, as discussed later.

Pre-training and Fine-tuning: Transfer learning with pre-trained models (PTMs) can significantly improve the performance of target tasks and has therefore become a prevalent paradigm in many areas of artificial intelligence (AI). Since the pre-trained knowledge is usually general and cannot cover all specific domains, PTMs necessitate further fine-tuning for better adaptation. The most straightforward way is to update all model parameters, but this involves keeping a separate copy of the fine-tuned model parameters for each task, leading to considerable resource overheads, especially for advanced PTMs of increasing size.

In order to improve the efficiency of fine-tuning, some lightweight alternatives have been proposed that update only a few extra parameters with most of the model parameters frozen, collectively referred to as the parameter-efficient tuning (PET) techniques. These PET techniques were originally proposed for NLP and are now widely used for CV as well. Representative practices include Prompt Tuning (ProT) [9] and Prefix Tuning (PreT) [10] that prepend short prompt tokens into the original inputs or intermediate inputs, Adapter [11] that inserts small neural modules between backbone layers, Low-Rank Adaptation (LoRA) [12] that approximates the updates of model parameters with low rank matrices, etc. A recent work [33] unified these PET techniques in a similar form, corresponding to modifying specific hidden states of the PTMs.

CL with PTMs and PET: While much of the past work in CL has focused on training from scratch, a growing body of efforts has delved into the benefits of PTMs, which provide not only positive knowledge transfer but also resilience to catastrophic forgetting [1, 2]. Meanwhile, PTMs also need the ability of CL to accommodate sequentially arriving tasks and to refine pre-trained knowledge from them. However, fine-tuning PTMs becomes much more difficult when considering CL, since repetitive adaptation of the same PTM may lead to progressive overwriting of pre-trained knowledge, whereas saving a separate copy of the fine-tuned PTM for each task incurs a resource overhead that grows linearly over time [4].

Therefore, applying PET techniques for CL has become an emerging direction in recent years, with Prompt-based PET being predominant. Many state-of-the-art methods have focused on designing appropriate prompt architectures for CL, such as task-shared prompts [6, 34, 14], task-specific prompts [6, 7, 13, 8] and their combinations [5, 6]. Since the frozen backbone with pre-trained knowledge can provide robust and well-distributed representations, these methods have almost achieved the performance pinnacle of rehearsal-free CL under adequate supervised pre-training and general tasks. However, their sub-optimality in achieving the objective of CL has been clearly exposed under the more realistic self-supervised pre-training and fine-grained tasks [4, 13], in part due to the limited fitting ability of Prompt-based PET [35, 36]. LAE [14] is a recent work that assembles a pair of task-shared parameters to be implemented with mainstream PET techniques, but exhibits only moderate improvements in CL performance.

3 Preliminaries

In this section, we first introduce the problem formulation of continual learning (CL) in a pre-training context. Then we describe representative parameter-efficient tuning (PET) techniques and PET-based CL methods.

3.1 Problem Formulation

The CL problem can be generally defined as learning sequentially arriving tasks from their respective training sets 𝒟1,,𝒟t\mathcal{D}_{1},...,\mathcal{D}_{t} in order to perform well on their corresponding test sets. The training set and test set of each task are assumed to follow the same distribution. For supervised CL, each training set 𝒟t={(𝒙t,n,yt,n)}n=1Nt\mathcal{D}_{t}=\{(\boldsymbol{x}_{t,n},y_{t,n})\}_{n=1}^{N_{t}} comprises multiple sample-label pairs, where 𝒙t,n𝒳t\boldsymbol{x}_{t,n}\in\mathcal{X}_{t} and yt,n𝒴ty_{t,n}\in\mathcal{Y}_{t} represent the sample and label elements, respectively, and NtN_{t} denotes the size of 𝒟t\mathcal{D}_{t}. Let us consider a neural network model composed of a backbone fθf_{\theta} with parameters θ\theta, and an output layer hψh_{\psi} with parameters ψ\psi. This model aims to learn a projection from 𝒳=i=1t𝒳i\mathcal{X}=\bigcup_{i=1}^{t}\mathcal{X}_{i} to 𝒴=i=1t𝒴i\mathcal{Y}=\bigcup_{i=1}^{t}\mathcal{Y}_{i}, so that it can correctly predict the label y^=hψ(fθ(𝒙))𝒴\hat{y}=h_{\psi}(f_{\theta}(\boldsymbol{x}))\in\mathcal{Y} of an unseen test sample 𝒙\boldsymbol{x} from observed tasks. Since 𝒟1,,𝒟t\mathcal{D}_{1},...,\mathcal{D}_{t} are usually limited in size and distinct in distribution, learning fθf_{\theta} from scratch can easily converge to an undesirable local minimum. In contrast, initialization of fθf_{\theta} with a substantial quantity of training samples external to 𝒟1,,𝒟t\mathcal{D}_{1},...,\mathcal{D}_{t}, i.e., applying adequate pre-training, helps θ\theta converge to a wide low-error region and thus can greatly improve the CL performance [1, 2].

Depending on the split of label space and the availability of task identity, there are three representative setups for CL [37], including task-incremental learning (TIL), domain-incremental learning (DIL), and class-incremental learning (CIL). Specifically, the label space 𝒴i\mathcal{Y}_{i} with i[t]i\in[t] is the same for DIL whereas disjoint for TIL and CIL. The task identity i[t]i\in[t] is provided at test time for TIL whereas not available for DIL and CIL. As a result, CIL is often considered more challenging and realistic. Of note, although CIL is named from classification tasks, its definition can be generalized to other task types. To avoid additional resource overhead and potential privacy issues, the CL process is further restricted to be rehearsal-free [8], i.e., the sample elements of each 𝒟i\mathcal{D}_{i} are available only when learning task ii, which particularly increases the challenge of CIL [3].

3.2 PET Techniques

The backbone fθf_{\theta} of advanced pre-trained models (PTMs) often employs a transformer architecture [38] based on multi-head attention mechanisms. For example, a pre-trained vision transformer (ViT) [39] consists of multiple consecutive multi-head self-attention (MSA) layers that transform an input sample into a sequence-like output representation 𝒓d𝒓×d\boldsymbol{r}\in\mathbb{R}^{d_{\boldsymbol{r}}\times d} of sequence length d𝒓d_{\boldsymbol{r}} and embedding dimension dd. Let us consider the ll-th layer with input 𝒉(l)\boldsymbol{h}_{(l)} and output 𝒉(l)\boldsymbol{h}_{(l)}^{\prime}, where 𝒉(L)\boldsymbol{h}_{(L)}^{\prime} is equivalent to 𝒓\boldsymbol{r} for total LL layers. Since the output 𝒉(l)\boldsymbol{h}_{(l)}^{\prime} then becomes the input 𝒉(l+1)d𝒉(l+1)×d\boldsymbol{h}_{(l+1)}\in\mathbb{R}^{d_{\boldsymbol{h}_{(l+1)}}\times d} of the next layer, we omit the layer identity ll for the sake of clarity. Then, the input 𝒉d𝒉×d\boldsymbol{h}\in\mathbb{R}^{d_{\boldsymbol{h}}\times d} is further specified as the query 𝒉Q\boldsymbol{h}_{Q}, key 𝒉K\boldsymbol{h}_{K} and value 𝒉V\boldsymbol{h}_{V}, and the output 𝒉d𝒉×d\boldsymbol{h}^{\prime}\in\mathbb{R}^{d_{\boldsymbol{h}}\times d} of the current layer is

𝒉=MSA(𝒉Q,𝒉K,𝒉V)=Concat(h1,,hm)𝑾O,\begin{split}\boldsymbol{h}^{\prime}={\rm{MSA}}(\boldsymbol{h}_{Q},\boldsymbol{h}_{K},\boldsymbol{h}_{V})={\rm{Concat}}(h_{1},...,h_{m})\boldsymbol{W}_{O},\\ \end{split} (1)
hi=Attn(𝒉Q𝑾Q,i,𝒉K𝑾K,i,𝒉V𝑾V,i),i[m],\begin{split}h_{i}={\rm{Attn}}(\boldsymbol{h}_{Q}\boldsymbol{W}_{Q,i},\boldsymbol{h}_{K}\boldsymbol{W}_{K,i},\boldsymbol{h}_{V}\boldsymbol{W}_{V,i}),i\in[m],\end{split} (2)

where 𝑾O\boldsymbol{W}_{O}, 𝑾Q,i\boldsymbol{W}_{Q,i}, 𝑾K,i,𝑾V,i\boldsymbol{W}_{K,i},\boldsymbol{W}_{V,i} are projection matrices, mm is the number of heads, and 𝒉Q=𝒉K=𝒉V=𝒉\boldsymbol{h}_{Q}=\boldsymbol{h}_{K}=\boldsymbol{h}_{V}=\boldsymbol{h} in ViT. The concatenation (Concat) and attention (Attn) functions are specified in their original papers [38, 39].

To facilitate effective transfer of pre-trained knowledge while preventing its catastrophic forgetting, the backbone parameters θ\theta are usually frozen and additional lightweight parameters are introduced to instruct representation learning, referred to as the PET techniques [33]. Here we describe some representative ones (see Fig. 2):

Figure 2: Implementation of PET techniques for representation learning. These PET techniques all amount to modulating the (intermediate) representations of the backbone and ensure lightweight implementations.

Prompt Tuning (ProT) [9] and Prefix Tuning (PreT) [10] both prepend a few learnable parameters 𝒑d𝒑×d\boldsymbol{p}\in\mathbb{R}^{d_{\boldsymbol{p}}\times d} of sequence length d𝒑d_{\boldsymbol{p}} and embedding dimension dd to 𝒉\boldsymbol{h}, collectively known as the Prompt-based PET. For ProT in ViT, an identical 𝒑\boldsymbol{p} is prepended to 𝒉Q\boldsymbol{h}_{Q}, 𝒉K\boldsymbol{h}_{K} and 𝒉V\boldsymbol{h}_{V}:

𝒉=MSA([𝒑;𝒉Q],[𝒑;𝒉K],[𝒑;𝒉V]),\boldsymbol{h}^{\prime}={\rm{MSA}}([\boldsymbol{p};\boldsymbol{h}_{Q}],[\boldsymbol{p};\boldsymbol{h}_{K}],[\boldsymbol{p};\boldsymbol{h}_{V}]), (3)

where [;][\cdot\,;\cdot] represents the concatenation operation along the dimension of sequence length. Since the output in (d𝒉+d𝒑)×d\mathbb{R}^{(d_{\boldsymbol{h}}+d_{\boldsymbol{p}})\times d} has increased dimensions, ProT is often used for only the last layer in CL [5, 7]. In contrast, PreT splits 𝒑\boldsymbol{p} into 𝒑Kd𝒑/2×d\boldsymbol{p}_{K}\in\mathbb{R}^{d_{\boldsymbol{p}}/2\times d} for 𝒉K\boldsymbol{h}_{K} and 𝒑Vd𝒑/2×d\boldsymbol{p}_{V}\in\mathbb{R}^{d_{\boldsymbol{p}}/2\times d} for 𝒉V\boldsymbol{h}_{V}:

𝒉=MSA(𝒉Q,[𝒑K;𝒉K],[𝒑V;𝒉V]),\boldsymbol{h}^{\prime}={\rm{MSA}}(\boldsymbol{h}_{Q},[\boldsymbol{p}_{K};\boldsymbol{h}_{K}],[\boldsymbol{p}_{V};\boldsymbol{h}_{V}]), (4)

where the output has the same dimension as the input 𝒉d𝒉×d\boldsymbol{h}\in\mathbb{R}^{d_{\boldsymbol{h}}\times d}, allowing PreT to be implemented in multiple layers. In particular, the output of PreT can be reframed as

𝒉(1λ(𝒉))𝒉+λ(𝒉)fNonL(𝒉𝑾Q𝒑K)𝒑V,\boldsymbol{h}^{\prime}\leftarrow(1-\lambda(\boldsymbol{h}))\boldsymbol{h}^{\prime}+\lambda(\boldsymbol{h})\,f_{{\rm{NonL}}}(\boldsymbol{h}\boldsymbol{W}_{Q}\boldsymbol{p}_{K}^{\top})\boldsymbol{p}_{V}, (5)

where fNonLf_{{\rm{NonL}}} is the nonlinear (NonL) softmax function, and λ(𝒉)\lambda(\boldsymbol{h}) is a scalar that depends on the input [33].

Adapter [11] inserts lightweight neural modules between backbone layers, each usually composed of a down-projection matrix 𝑾downd×r\boldsymbol{W}_{{\rm{down}}}\in\mathbb{R}^{d\times r} that reduces the dimension of 𝒉\boldsymbol{h} with bottleneck rr, a nonlinear (NonL) activation function fNonLf_{{\rm{NonL}}} and an up-projection matrix 𝑾upr×d\boldsymbol{W}_{{\rm{up}}}\in\mathbb{R}^{r\times d}. These modules are implemented with residual connections, acting on the output 𝒉\boldsymbol{h}^{\prime} in a sequential (Seq) manner, i.e.,

𝒉𝒉+fNonL(𝒉𝑾down)𝑾up,\boldsymbol{h}^{\prime}\leftarrow\boldsymbol{h}^{\prime}+f_{{\rm{NonL}}}(\boldsymbol{h}^{\prime}\boldsymbol{W}_{{\rm{down}}})\boldsymbol{W}_{{\rm{up}}}, (6)

as well as on the input 𝒉\boldsymbol{h} in a parallel (Par) manner, i.e.,

𝒉𝒉+fNonL(𝒉𝑾down)𝑾up.\boldsymbol{h}^{\prime}\leftarrow\boldsymbol{h}^{\prime}+f_{{\rm{NonL}}}(\boldsymbol{h}\boldsymbol{W}_{{\rm{down}}})\boldsymbol{W}_{{\rm{up}}}. (7)

LoRA [12] approximates the updates of pre-trained parameter matrix 𝑾d×k\boldsymbol{W}\in\mathbb{R}^{d\times k} with a low-rank decomposition 𝑾+𝑾=𝑾+𝑾down𝑾up\boldsymbol{W}+\bigtriangleup\boldsymbol{W}=\boldsymbol{W}+\boldsymbol{W}_{{\rm{down}}}\boldsymbol{W}_{{\rm{up}}}, where 𝑾downd×r\boldsymbol{W}_{{\rm{down}}}\in\mathbb{R}^{d\times r}, 𝑾upr×k\boldsymbol{W}_{{\rm{up}}}\in\mathbb{R}^{r\times k} and rr is the low-rank bottleneck. For ViT, LoRA is typically used to update the 𝑾Q\boldsymbol{W}_{Q} and 𝑾V\boldsymbol{W}_{V} of a backbone layer. As a special case, when 𝑾\boldsymbol{W} performs linear projection of the input 𝒉\boldsymbol{h}, the output is modified as

𝒉𝒉+s𝒉𝑾down𝑾up,\boldsymbol{h}^{\prime}\leftarrow\boldsymbol{h}^{\prime}+s\cdot\boldsymbol{h}\boldsymbol{W}_{{\rm{down}}}\boldsymbol{W}_{{\rm{up}}}, (8)

where s1s\geq 1 is a scalar hyperparameter [33].

As we can see, these PET techniques all amount to modulating the (intermediate) representations of fθf_{\theta}, though differing in their specific implementations.
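To make this shared view concrete, the following minimal PyTorch sketch shows how a frozen linear projection can be modulated by a parallel Adapter (Eq. (7)) and by LoRA (Eq. (8)). The module names, dimensions and initialization choices are illustrative assumptions rather than the exact implementation used in this paper.

```python
import torch
import torch.nn as nn

class ParallelAdapter(nn.Module):
    """Parallel Adapter (Eq. (7)): h' <- h' + f_NonL(h W_down) W_up."""
    def __init__(self, d, r):
        super().__init__()
        self.down = nn.Linear(d, r, bias=False)  # W_down
        self.up = nn.Linear(r, d, bias=False)    # W_up
        self.act = nn.ReLU()                     # nonlinearity f_NonL
        nn.init.zeros_(self.up.weight)           # start as an identity modification

    def forward(self, h, h_out):
        # h: layer input, h_out: layer output produced by the frozen backbone
        return h_out + self.up(self.act(self.down(h)))

class LoRALinear(nn.Module):
    """LoRA (Eq. (8)): frozen projection W plus a low-rank update s * h W_down W_up."""
    def __init__(self, d, k, r, s=1.0):
        super().__init__()
        self.base = nn.Linear(d, k, bias=False)
        self.base.weight.requires_grad_(False)   # the pre-trained W stays frozen
        self.down = nn.Linear(d, r, bias=False)  # W_down
        self.up = nn.Linear(r, k, bias=False)    # W_up
        self.s = s
        nn.init.zeros_(self.up.weight)           # no modification before training

    def forward(self, h):
        return self.base(h) + self.s * self.up(self.down(h))
```

In both cases, only the down- and up-projection matrices are trained, while the pre-trained backbone weights are left untouched, which is exactly the property exploited by PET-based CL methods.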

3.3 PET-Based CL Methods

With the widespread use of PTMs in CL, there have been a variety of methods that incorporate PET techniques on a continual basis. Most of these methods have focused on designing appropriate PET architectures to formulate lightweight parameters tailored for CL, which can be conceptually summarized into task-specific parameters, task-shared parameters, and their explicit or implicit combinations. In particular, task-specific parameters require their identity to be predicted during the testing phase, while task-shared parameters need to mitigate catastrophic forgetting when learning each task. We briefly describe these methods here, with a comprehensive summary in Appendix Table VII for further comparison.

L2P [5] constructs a prompt pool 𝒑1,,𝒑M\boldsymbol{p}_{1},...,\boldsymbol{p}_{M} of size MM and instructs pre-trained representations in a ProT manner. Each prompt is associated with a learnable key, optimized by the cosine distance of the top-NN keys to a query function q(𝒙)=fθ(𝒙)[0]q(\boldsymbol{x})=f_{\theta}(\boldsymbol{x})[0]. The most relevant prompt(s) can therefore be selected via uninstructed representations.

DualPrompt [6] employs the task-specific prompt 𝒑tl\boldsymbol{p}^{l}_{t} and task-shared prompt 𝒑l\boldsymbol{p}^{l} to instruct respective layers in a PreT manner. The layer-wise 𝒑tl\boldsymbol{p}^{l}_{t} is associated with a task-specific key, optimized by its cosine distance to q(𝒙)q(\boldsymbol{x}), and the best-matched key is selected via uninstructed representations.

S-Prompt [7] employs only the task-specific prompt 𝒑t\boldsymbol{p}_{t} and instructs pre-trained representations in a ProT manner. The task identity is inferred from preserved task centroids with kk-Nearest Neighbors (kkNNs). S-Prompt also employs a multi-head output layer associated with the task identity.

CODA-Prompt [8] performs a weighted summation of the prompt pool, i.e., 𝒑=i=1Mαi𝒑i\boldsymbol{p}=\sum_{i=1}^{M}\alpha_{i}\boldsymbol{p}_{i}, and instructs multiple layers in a PreT manner. Each weighting factor αi\alpha_{i} for i[M]i\in[M] is associated with a learnable key, optimized by its cosine distance to q(𝒙)q(\boldsymbol{x}). The inference of αi\alpha_{i} can therefore construct an appropriate prompt for each input sample.

LAE [14] employs two kinds of task-shared parameters to incorporate knowledge from more recent tasks and more remote tasks, respectively, applicable to PreT, Adapter and LoRA for multiple layers. Their update speeds are regulated via combinatorial strategies such as temporary freezing and exponential moving average (EMA).

Besides, there are several relevant methods with different focuses. For example, SLCA [4] updates the entire backbone with a reduced learning rate, and further preserves pre-trained representations via dedicated covariance matrices. RanPAC [34] projects pre-trained representations to a high-dimensional space and preserves them via a shared covariance matrix. These methods are often parameter-inefficient, i.e., the parameter cost is comparable to θ\theta due to a complexity much larger than O(d2)O(d^{2}), and are therefore not prioritized in this work.¹

¹Our preliminary version [13] also employed dedicated covariance matrices as the main implementation to acquire better performance.

Taken together, three notable limitations have surfaced in current progress of harnessing PET techniques for CL. First, the above methods often rely on empirical designs, making it difficult to ascertain their effectiveness in achieving the objective of CL in a pre-training context. In particular, their performance exhibits significant variations across different PTMs and target tasks, as demonstrated in Sec. 6.2. Second, these methods predominantly center around Prompt-based PET, which has been shown to be less effective under self-supervised pre-training [36] and fine-grained tasks [40], leaving underexplored the particular behaviors and potential benefits of other alternatives. Third, these methods share some analogous strategies, such as stabilizing task-shared parameters and recovering pre-trained representations, without a comprehensive analysis of different implementations and assimilation of their respective strengths. Therefore, there is an urgent need to establish a unified framework that incorporates both theoretical and empirical insights for CL with PTMs and PET, which constitutes our main motivation.

4 Theoretical Analysis

In this section, we present an in-depth theoretical analysis of the CL objective in a pre-training context, so as to inspire the better design of PET-based CL methods. We first decompose this objective into three hierarchical components, which demonstrate the impact of pre-trained knowledge on CL and are both sufficient and necessary to optimize the CL performance (Sec. 4.1). This analysis motivates us to develop an innovative CL method to explicitly optimize the objective (Sec. 5.1). We then illustrate the connection of the decomposed objective to OOD detection, which is shown to play a similar role as inferring the task identity (Sec. 4.2). This analysis motivates us to adaptively accumulate knowledge with the proposed CL method to facilitate the learning of subsequent tasks (Sec. 5.2).

4.1 Three Hierarchical Components

Let us consider a CL problem for sequentially arriving tasks i[t]i\in[t] with training set 𝒟i\mathcal{D}_{i} of domain 𝒳i=j𝒳i,j\mathcal{X}_{i}=\bigcup_{j}\mathcal{X}_{i,j} and label 𝒴i=j𝒴i,j\mathcal{Y}_{i}=\bigcup_{j}\mathcal{Y}_{i,j}, where j[|𝒴i|]j\in[|\mathcal{Y}_{i}|] denotes the jj-th class of task ii. The goal is to learn a projection from 𝒳=i=1t𝒳i\mathcal{X}=\bigcup_{i=1}^{t}\mathcal{X}_{i} to 𝒴=i=1t𝒴i\mathcal{Y}=\bigcup_{i=1}^{t}\mathcal{Y}_{i} in order to predict the label of an unseen test sample 𝒙\boldsymbol{x} from all observed tasks, referred to as task-adaptive prediction (TAP). As summarized in Sec. 3.1, there are many representative setups of CL, such as CIL, DIL and TIL. Here we take CIL as a typical scenario for theoretical analysis, and leave the results of DIL and TIL to Appendix A.

CL from Scratch: When training from scratch, the TAP performance P(𝒙𝒳y|𝒟)P(\boldsymbol{x}\in\mathcal{X}^{y}|\mathcal{D}) is to predict across all classes without distinguishing tasks, where 𝒟={𝒟1,,𝒟t}\mathcal{D}=\{\mathcal{D}_{1},...,\mathcal{D}_{t}\}, y[|i=1t𝒴i|]y\in[|\bigcup_{i=1}^{t}\mathcal{Y}_{i}|] denotes the ground truth label of 𝒙\boldsymbol{x}, and 𝒳y\mathcal{X}^{y} denotes the domain of class yy. The restricted definition of CIL further posits two assumptions [41]: the domains of tasks are disjoint (i.e., 𝒳i𝒳i=\mathcal{X}_{i}\cap\mathcal{X}_{i^{\prime}}=\emptyset, ii\forall i\neq i^{\prime}), and the domains of classes of the same task are disjoint (i.e., 𝒳i,j𝒳i,j=\mathcal{X}_{i,j}\cap\mathcal{X}_{i,j^{\prime}}=\emptyset, jj\forall j\neq j^{\prime}). Through predicting which task to perform and then performing that task (i.e., there is an execution order), the CIL probability can be expressed as a hierarchical process of task-identity inference (TII) and within-task prediction (WTP):

P(𝒙𝒳i,j|𝒟)CIL=P(𝒙𝒳i|𝒟)TIIP(𝒙𝒳i,j|𝒙𝒳i,𝒟)WTP.\displaystyle\underbrace{P(\boldsymbol{x}\in\mathcal{X}_{i,j}|\mathcal{D})}_{\text{CIL}}=\underbrace{P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D})}_{\text{TII}}\underbrace{P(\boldsymbol{x}\in\mathcal{X}_{i,j}|\boldsymbol{x}\in\mathcal{X}_{i},\mathcal{D})}_{\text{WTP}}. (9)

Eq. (9) is exactly the main conclusion of a previous theoretical study of CL from scratch [41]. Given the assumptions of disjoint domains and the omitted impact of randomly initialized θ\theta, the TAP performance P(𝒙𝒳y|𝒟)P(\boldsymbol{x}\in\mathcal{X}^{y}|\mathcal{D}) is essentially equivalent to the decomposed CIL performance P(𝒙𝒳i¯,j¯|𝒟){P}(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\mathcal{D}), where i¯[t]\bar{i}\in[t] and j¯[|𝒴i¯|]\bar{j}\in[|\mathcal{Y}_{\bar{i}}|] denote the ground truth of an 𝒙\boldsymbol{x} w.r.t. the task identity and within-task index. Here we provide a more intuitive explanation with the definition of class labels and the implementation of output heads. The decomposed CIL performance can be naturally computed in a multi-head manner with two steps: inferring the task identity i¯[t]\bar{i}\in[t], i.e., the TII performance; and predicting the within-task index j¯[|𝒴i¯|]\bar{j}\in[|\mathcal{Y}_{\bar{i}}|], i.e., the WTP performance. In comparison, the TAP performance P(𝒙𝒳y|𝒟)P(\boldsymbol{x}\in\mathcal{X}^{y}|\mathcal{D}) is computed by performing single-head prediction for global class y[|i=1t𝒴i|]y\in[|\bigcup_{i=1}^{t}\mathcal{Y}_{i}|]. For CL from scratch, the decomposed CIL performance of multi-head inference & prediction is equivalent to the TAP performance of single-head prediction. In this case, the decomposed CIL performance can also be computed in a single-head manner, as many traditional continual learning methods do.

CL with Pre-training: When considering the impact of pre-trained knowledge carried by parameters θ\theta, the TAP is redefined as P(𝒙𝒳y|𝒟,θ)P(\boldsymbol{x}\in\mathcal{X}^{y}|\mathcal{D},\theta), while the CIL probability of TII and WTP in Eq. (9) is re-written as

P(𝒙𝒳i,j|𝒟,θ)CILP(𝒙𝒳i|𝒟,θ)TIIP(𝒙𝒳i,j|𝒙𝒳i,𝒟,θ)WTP.\displaystyle\underbrace{P(\boldsymbol{x}\in\mathcal{X}_{i,j}|\mathcal{D},\theta)}_{\text{CIL}}\triangleq\underbrace{P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)}_{\text{TII}}\underbrace{P(\boldsymbol{x}\in\mathcal{X}_{i,j}|\boldsymbol{x}\in\mathcal{X}_{i},\mathcal{D},\theta)}_{\text{WTP}}.\vspace{-0.2cm} (10)

It can be seen that both the TAP performance P(𝒙𝒳y|𝒟,θ)P(\boldsymbol{x}\in\mathcal{X}^{y}|\mathcal{D},\theta) and the CIL performance P(𝒙𝒳i¯,j¯|𝒟,θ){P}(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\mathcal{D},\theta) are now affected by θ\theta, but in different ways. The pre-trained parameters θ\theta in TAP affect all observed classes simultaneously (i.e., a large output space [|i=1t𝒴i|][|\bigcup_{i=1}^{t}\mathcal{Y}_{i}|]), while in CIL they affect TII and WTP separately (i.e., two small output spaces [t][t] and [|𝒴i¯|][|\mathcal{Y}_{\bar{i}}|]). Accordingly, the CIL performance is upper bounded by either the TII performance P(𝒙𝒳i¯|𝒟,θ)P(\boldsymbol{x}\in\mathcal{X}_{\bar{i}}|\mathcal{D},\theta) or the WTP performance P(𝒙𝒳i¯,j¯|𝒙𝒳i¯,𝒟,θ)P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\boldsymbol{x}\in\mathcal{X}_{\bar{i}},\mathcal{D},\theta), whereas the TAP performance is not, as the CL tasks and pre-trained knowledge are not conditionally independent from a statistical perspective. For example, an incorrectly predicted task identity results in full errors in predicting the within-task index, leaving a performance gap that can only be closed by rectifying the predictions of inter-task classes to make them properly balanced. This also underscores the notable difference between the multi-head inference & prediction and the single-head prediction in the context of pre-training. Such a difference tends to be more pronounced if the task structure is not clear (e.g., randomly split classes of the same dataset), and has been empirically validated in our preliminary experiments [13], where the multi-head inference & prediction significantly underperforms the single-head prediction in CL with PTMs.

Therefore, we propose to further optimize TAP along with the improved TII and WTP, formulating the ultimate goal of CL as a multi-objective optimization problem, i.e.,

max[P(𝒙𝒳i¯,j¯|𝒟,θ)CIL,P(𝒙𝒳y|𝒟,θ)TAP],\displaystyle\max[\,\underbrace{P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\mathcal{D},\theta)}_{\text{CIL}},\underbrace{P(\boldsymbol{x}\in\mathcal{X}^{y}|\mathcal{D},\theta)}_{\text{TAP}}\,], (11)

where P(𝒙𝒳i¯,j¯|𝒟,θ)P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\mathcal{D},\theta) follows a similar decomposition as in Eq. (10), with TII and WTP having an execution order.

To resolve the above WTP, TII and TAP, we derive the sufficient and necessary conditions with the widely-used cross-entropy loss. Specifically, we define

HWTP(𝒙)\displaystyle{H}_{\rm{WTP}}(\boldsymbol{x}) =(𝟏j¯,{P(𝒙𝒳i¯,j|𝒙𝒳i¯,𝒟,θ)}j),\displaystyle=\mathcal{H}(\boldsymbol{1}_{\bar{j}},\{P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},j}|\boldsymbol{x}\in\mathcal{X}_{\bar{i}},\mathcal{D},\theta)\}_{j}), (12)
HTII(𝒙)\displaystyle{H}_{\rm{TII}}(\boldsymbol{x}) =(𝟏i¯,{P(𝒙𝒳i|𝒟,θ)}i),\displaystyle=\mathcal{H}(\boldsymbol{1}_{\bar{i}},\{P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)\}_{i}), (13)
HTAP(𝒙)\displaystyle{H}_{\rm{TAP}}(\boldsymbol{x}) =(𝟏y,{P(𝒙𝒳c|𝒟,θ)}c),\displaystyle=\mathcal{H}(\boldsymbol{1}_{y},\{P(\boldsymbol{x}\in\mathcal{X}^{c}|\mathcal{D},\theta)\}_{c}), (14)

where HWTP{H}_{\rm{WTP}}, HTII{H}_{\rm{TII}} and HTAP{H}_{\rm{TAP}} are the cross-entropy values of WTP, TII and TAP, respectively. c[|i=1t𝒴i|]c\in[|\bigcup_{i=1}^{t}\mathcal{Y}_{i}|] denotes the index of all observed classes. (p,q)𝔼p[logq]=kpklogqk\mathcal{H}(p,q)\triangleq-\mathbb{E}_{p}[\log q]=-\sum_{k}p_{k}\log q_{k}. 𝟏\boldsymbol{1}_{\cdot} is a one-hot encoding function.

We now present the first theorem under the CIL setting (see Appendix A for the complete proof and the corresponding extensions to DIL and TIL settings):

Theorem 1

For continual learning (CL) in a pre-training context, if 𝔼𝐱[HWTP(𝐱)]δ\mathbb{E}_{\boldsymbol{x}}[{H}_{\rm{WTP}}(\boldsymbol{x})]\leq\delta, 𝔼𝐱[HTII(𝐱)]ϵ\mathbb{E}_{\boldsymbol{x}}[{H}_{\rm{TII}}(\boldsymbol{x})]\leq\epsilon, and 𝔼𝐱[HTAP(𝐱)]η\mathbb{E}_{\boldsymbol{x}}[{H}_{\rm{TAP}}(\boldsymbol{x})]\leq\eta, we have the loss error [0,max{δ+ϵ,η}]\mathcal{L}\in[0,\max\{{\delta+\epsilon},\eta\}], regardless whether the WTP predictor, TII predictor and TAP predictor are trained together or separately.

With the use of cross-entropy, the CL performance tends to be better as the bounds are tightened. In Theorem 1 we have shown that good performance of WTP, TII and TAP is sufficient to ensure good performance of CL. For completeness, we now study the necessary conditions of a well-performed CL model in Theorem 2.

Theorem 2

For CL in a pre-training context, if the loss error ξ\mathcal{L}\leq\xi, then there always exist (1) a WTP predictor, s.t. HWTPξ{H}_{\rm{WTP}}\leq\xi; (2) a TII predictor, s.t. HTIIξ{H}_{\rm{TII}}\leq\xi; and (3) a TAP predictor, s.t. HTAPξ{H}_{\rm{TAP}}\leq\xi.

Theorem 2 suggests that if a CL model is well trained (i.e., with low loss error), then the WTP error, TII error and TAP error for sequentially arriving tasks are always implied to be small. Similar to the connection between TAP and CIL under different assumptions, Theorem 1 and Theorem 2 would degenerate into the main conclusion of the previous theoretical study [41] if the pre-trained knowledge carried by θ\theta is not considered. This suggests that the presented theorems are particularly directed to the impact of pre-training on CL.

4.2 Connection of TII to OOD Detection

In essence, the TII probability specifies the CL problem with task-wise input samples. Although the definition of “task” in literature can generalize to an incoming training batch of distinct distribution [3], it may not be pertinent in describing realistic data streams with apparent similarity and dissimilarity. Indeed, the CL problem is often associated with the out-of-distribution (OOD) detection [42], i.e., the ability of a model to detect an unseen input sample, which has been shown to behave similarly as task prediction when training from scratch [41]. Inspired by this, we further explore the necessary conditions to optimize TII/OOD for CL in a pre-training context, in order to facilitate adaptive knowledge accumulation from more pronounced distribution changes.

We again use cross-entropy to measure the performance of TII and OOD detection, so as to establish their connection in each task. We first define the OOD detector of the ii-th task as Pi(𝒙𝒳i|𝒟,θ)P_{i}(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta). Different from the TII probability, the OOD detection probability here is a Bernoulli distribution:

HOOD,i(𝒙)={(1,Pi(𝒙𝒳i|𝒟,θ)),𝒙𝒳i(0,Pi(𝒙𝒳i|𝒟,θ)),𝒙𝒳i,H_{{\rm{OOD}},i}(\boldsymbol{x})=\begin{cases}\mathcal{H}(1,P_{i}(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)),&\boldsymbol{x}\in\mathcal{X}_{i}\\ \mathcal{H}(0,P_{i}(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)),&\boldsymbol{x}\notin\mathcal{X}_{i}\\ \end{cases},\\ (15)

where HOOD,iH_{{\rm{OOD}},i} is the cross-entropy value of an OOD detector of task ii, and Pi(𝒙𝒳i|𝒟,θ)P_{i}(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta) can be predicted with an appropriate distance function. The TII probability can then be defined with the OOD detectors, i.e., P(𝒙𝒳i|𝒟,θ)=Pi(𝒙𝒳i|𝒟,θ)jPj(𝒙𝒳j|𝒟,θ)P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)=\frac{P_{i}(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)}{\sum_{j}P_{j}(\boldsymbol{x}\in\mathcal{X}_{j}|\mathcal{D},\theta)}.

Now we have the following theorem to explore the connection between TII and OOD detection in a pre-training context (see Appendix B for the complete proof):

Theorem 3

For CL in a pre-training context, if HOOD,i(𝐱)ϵiH_{{\rm{OOD}},i}(\boldsymbol{x})\leq\epsilon_{i} for i[t]i\in[t], then we have HTII(𝐱)(i𝟏𝐱𝒳ieϵi)(i𝟏𝐱𝒳i(1eϵi))H_{\rm{TII}}(\boldsymbol{x})\leq(\sum_{i}\boldsymbol{1}_{\boldsymbol{x}\in\mathcal{X}_{i}}e^{\epsilon_{i}})(\sum_{i}\boldsymbol{1}_{\boldsymbol{x}\notin\mathcal{X}_{i}}(1-e^{-\epsilon_{i}})). Likewise, if HTII(𝐱)ϵH_{\rm{TII}}(\boldsymbol{x})\leq\epsilon, then HOOD,i(𝐱)ϵ{H}_{{\rm{OOD}},i}(\boldsymbol{x})\leq\epsilon for i[t]i\in[t].

Theorem 3 shows that the TII performance improves if the OOD detection performance improves, and vice versa. In particular, HTII(𝒙)H_{\rm{TII}}(\boldsymbol{x}) converges to 0 as ϵi\epsilon_{i} converges to 0. Note that the connection between TII and OOD detection in Theorem 3 for CL with pre-training is similar in form to that for CL from scratch [41]. We further derive the sufficient and necessary conditions of improving CL with WTP, OOD detection and TAP, as detailed in Appendix B.

5 Hierarchical Decomposition PET

Based on the above analysis, we now present an innovative approach named Hierarchical Decomposition PET (HiDe-PET) to explicitly optimize the three hierarchical components tailored for CL in a pre-training context (see Fig. 3). Our HiDe-PET is applicable to mainstream PET techniques for learning sequentially arriving tasks, from which the pre-trained knowledge can also be evolved.

Algorithm 1 Training Algorithm of HiDe-PET

Input: Pre-trained transformer backbone fθf_{\theta}, training sets 𝒟1,,𝒟t\mathcal{D}_{1},...,\mathcal{D}_{t}, number of tasks tt, number of epochs EE and EE^{\prime}.
Output: Parameters 𝒆1,,𝒆t\boldsymbol{e}_{1},...,\boldsymbol{e}_{t}, 𝒈\boldsymbol{g}, ω\omega and ψ\psi

1:Initialize 𝒈\boldsymbol{g}, ω\omega and ψ\psi
2:for i=1,,ti=1,...,t do
3:     Initialize ψ^\hat{\psi} with ψ\psi
4:     Construct 𝒆i\boldsymbol{e}_{i} with 𝒆1,,𝒆i1\boldsymbol{e}_{1},...,\boldsymbol{e}_{i-1}
5:     for epoch=1,,Eepoch=1,...,E do
6:         Update 𝒈\boldsymbol{g} and ψ^\hat{\psi} with CE(ψ^,𝒈)\mathcal{L}_{{\rm{CE}}}(\hat{\psi},\boldsymbol{g})
7:         Update 𝒆i\boldsymbol{e}_{i} and ψ\psi with WTP(ψ,𝒆i)\mathcal{L}_{{\rm{WTP}}}(\psi,\boldsymbol{e}_{i}) in Eq. (16)      
8:     for c𝒴ic\in\mathcal{Y}_{i} do
9:         Calculate 𝒢^i,c\hat{\mathcal{G}}_{i,c} from fθf_{\theta}, 𝒈\boldsymbol{g} and 𝒟i\mathcal{D}_{i}
10:         Calculate 𝒢i,c\mathcal{G}_{i,c} from fθf_{\theta}, 𝒆i\boldsymbol{e}_{i} and 𝒟i\mathcal{D}_{i}      
11:     for epoch=1,,Eepoch=1,...,E^{\prime} do
12:         Optimize ω\omega with TII(ω)\mathcal{L}_{{\rm{TII}}}(\omega) in Eq. (17)
13:         Optimize ψ\psi with TAP(ψ)\mathcal{L}_{{\rm{TAP}}}(\psi) in Eq. (18)      
14:return (𝒆1,,𝒆t(\boldsymbol{e}_{1},...,\boldsymbol{e}_{t}, 𝒈\boldsymbol{g}, ω\omega, ψ)\psi)
Figure 3: Illustration of HiDe-PET. See Fig. 2 for detailed implementations of PET techniques and the frozen transformer backbone. Here we have an example of 2 tasks with 2 classes each. WTP aims to classify the 2 classes within each task well, optimized by the PET ensemble of task-specific parameters. TII aims to appropriately select one of the 2 tasks, optimized by the task-shared parameters and the recovery of uninstructed representations. TAP aims to classify the total of 4 classes well, optimized by the recovery of instructed representations upon WTP and TII.

5.1 Optimization of Decomposed Objective

The principal concept behind HiDe-PET stems from two unique strengths of PTMs in CL: (1) the pre-trained representations can be effectively adapted to the distribution of target tasks through implementing PET techniques, and (2) the distribution of target tasks can be efficiently recovered through preserving statistical information of their pre-trained representations. The optimization of WTP, TII and TAP is described as below.

First, we improve WTP through incorporating task-specific knowledge from 𝒟i\mathcal{D}_{i} to capture the distribution of any task i[t]i\in[t]. Specifically, we employ multiple sets of task-specific parameters 𝒆1,,𝒆t\boldsymbol{e}_{1},...,\boldsymbol{e}_{t} to instruct pre-trained representations, implemented via mainstream PET techniques. Their concrete forms can be defined as the prompt parameters {𝒑}\{\boldsymbol{p}\} for ProT and PreT, the projection matrices {𝑾down,𝑾up}\{\boldsymbol{W}_{{\rm{down}}},\boldsymbol{W}_{{\rm{up}}}\} for Adapter and LoRA, etc. When learning task tt, we keep the previous parameters 𝒆1,,𝒆t1\boldsymbol{e}_{1},...,\boldsymbol{e}_{t-1} frozen to avoid catastrophic forgetting. In order to inherit knowledge obtained from CL, we employ a PET ensemble strategy that initializes 𝒆t\boldsymbol{e}_{t} with 𝒆t1\boldsymbol{e}_{t-1} and optimizes 𝒆t\boldsymbol{e}_{t} with a weighted combination of all previous parameters 𝒆tαi[t1]𝒆i+(1α)𝒆t\boldsymbol{e}_{t}\leftarrow\alpha\sum_{i\in[t-1]}\boldsymbol{e}_{i}+(1-\alpha)\boldsymbol{e}_{t}, where α[0,1]\alpha\in[0,1] is a hyperparameter that controls the strength of obtained knowledge that facilitates 𝒆t\boldsymbol{e}_{t} in learning task tt. For HWTP{H}_{\rm{WTP}}, we then optimize 𝒆t\boldsymbol{e}_{t} and ψ\psi via the cross-entropy (CE) loss:

\begin{split}&\mathcal{L}_{{\rm{WTP}}}(\psi,\boldsymbol{e}_{t})=\mathcal{L}_{{\rm{CE}}}(\psi,\boldsymbol{e}_{t})\\ &=\frac{1}{|\mathcal{D}_{t}|}\sum_{(\boldsymbol{x},y)\in\mathcal{D}_{t}}-\log\frac{\exp(h_{\psi}(f_{\theta,\boldsymbol{e}_{t}}(\boldsymbol{x}))[y])}{\sum_{y^{\prime}\in\mathcal{Y}_{t}}\exp(h_{\psi}(f_{\theta,\boldsymbol{e}_{t}}(\boldsymbol{x}))[y^{\prime}])}.\end{split} (16)

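As an illustration, a minimal PyTorch-style sketch of the PET ensemble and the WTP objective in Eq. (16) is given below. The task-specific parameters are treated as flat lists of tensors, and backbone(x, pet=e_t) is a placeholder standing in for f_{θ,e_t}(x); these names, as well as the within-task label remapping, are assumptions for illustration.

```python
import copy
import torch
import torch.nn.functional as F

def init_task_specific(prev_params, fresh_init):
    """Initialize e_t from e_{t-1} if it exists, otherwise from scratch."""
    return copy.deepcopy(prev_params[-1]) if prev_params else fresh_init()

def pet_ensemble(prev_params, e_t, alpha):
    """e_t <- alpha * sum_{i<t} e_i + (1 - alpha) * e_t, applied tensor-wise,
    with the previous parameters e_1, ..., e_{t-1} kept frozen."""
    return [alpha * sum(e_i[k] for e_i in prev_params) + (1.0 - alpha) * w
            for k, w in enumerate(e_t)]

def wtp_loss(backbone, head, e_t, x, y_local, task_classes):
    """Eq. (16): cross-entropy restricted to the label space Y_t of task t.
    y_local holds within-task indices into task_classes."""
    logits = head(backbone(x, pet=e_t))  # h_psi(f_{theta,e_t}(x))
    return F.cross_entropy(logits[:, task_classes], y_local)
```

In practice, the ensemble combination would be applied inside the forward pass so that gradients still flow to 𝒆t; the sketch above only shows the arithmetic.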
Second, we improve TII and TAP through recovering the distributions of pre-trained representations to achieve the optimum over all tasks. Specifically, after learning each task i[t]i\in[t], we collect the uninstructed and instructed representations, i.e., the backbone projection of 𝒟i\mathcal{D}_{i} with fθ()f_{\theta}(\cdot) and fθ,𝒆i()f_{\theta,\boldsymbol{e}_{i}}(\cdot), respectively. We then approximate the distributions of these representations with their statistical information for subsequent recovery. Taking classification tasks as an example, for each class c𝒴ic\in\mathcal{Y}_{i} and i[t]i\in[t], we denote the approximated distributions of uninstructed and instructed representations as 𝒢^i,c\hat{\mathcal{G}}_{i,c} and 𝒢i,c\mathcal{G}_{i,c}, respectively.

For HTII{H}_{\rm{TII}}, we build an auxiliary output layer h^ω():dt\hat{h}_{\omega}(\cdot):\mathbb{R}^{d}\rightarrow\mathbb{R}^{t} with parameters ω\omega, learning the projection from uninstructed representations to the task identity:

TII(ω)=i[t]c𝒴i𝒓^𝒮^i,clogexp(h^ω(𝒓^)[i])j[t]exp(h^ω(𝒓^)[j]),\mathcal{L}_{{\rm{TII}}}(\omega)=\sum_{i\in[t]}\sum_{c\in\mathcal{Y}_{i}}\sum_{\hat{\boldsymbol{r}}\in\hat{\mathcal{S}}_{i,c}}-\log\frac{\exp(\hat{h}_{\omega}(\hat{\boldsymbol{r}})[i])}{\sum_{j\in[t]}\exp(\hat{h}_{\omega}(\hat{\boldsymbol{r}})[j])}, (17)

where 𝒮^i,c\hat{\mathcal{S}}_{i,c} is a collection of uninstructed representations sampled in a class-balanced manner from 𝒢^i,c\hat{\mathcal{G}}_{i,c} for c𝒴ic\in\mathcal{Y}_{i} and i[t]i\in[t]. Therefore, we can determine the task identity via uninstructed representations and then obtain the corresponding instructed representations.

For HTAP{H}_{\rm{TAP}}, we use a similar strategy to optimize the final output layer hψ():d|𝒴|h_{\psi}(\cdot):\mathbb{R}^{d}\rightarrow\mathbb{R}^{|\mathcal{Y}|} with parameters ψ\psi, learning the projection from instructed representations to all observed classes:

TAP(ψ)=i[t]c𝒴i𝒓𝒮i,clogexp(hψ(𝒓)[c])j[t]c𝒴jexp(hψ(𝒓)[c]),\mathcal{L}_{{\rm{TAP}}}(\psi)=\sum_{i\in[t]}\sum_{c\in\mathcal{Y}_{i}}\sum_{\boldsymbol{r}\in\mathcal{S}_{i,c}}-\log\frac{\exp(h_{\psi}(\boldsymbol{r})[c])}{\sum_{j\in[t]}\sum_{c^{\prime}\in\mathcal{Y}_{j}}\exp(h_{\psi}(\boldsymbol{r})[c^{\prime}])}, (18)

where 𝒮i,c\mathcal{S}_{i,c} is a collection of instructed representations sampled in a class-balanced manner from 𝒢i,c\mathcal{G}_{i,c} for c𝒴ic\in\mathcal{Y}_{i} and i[t]i\in[t].
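A possible sketch of the second training stage (lines 11-13 of Algorithm 1) follows: the auxiliary head h_ω and the final head h_ψ are fitted on representations sampled in a class-balanced manner from the preserved distributions. The sampler interface and the joint optimization of both heads are assumptions; Eq. (17) and Eq. (18) can equally be optimized separately.

```python
import torch
import torch.nn.functional as F

def fit_tii_tap_heads(tii_head, tap_head, stats, n_per_class, epochs, lr=1e-3):
    """stats[(i, c)] = (sample_uninstructed, sample_instructed): callables that draw
    n representations from G_hat_{i,c} and G_{i,c}, respectively (c is a global class id)."""
    opt = torch.optim.Adam(list(tii_head.parameters()) + list(tap_head.parameters()), lr=lr)
    for _ in range(epochs):
        for (i, c), (sample_uninstructed, sample_instructed) in stats.items():
            r_hat = sample_uninstructed(n_per_class)   # r_hat ~ G_hat_{i,c}
            r = sample_instructed(n_per_class)         # r ~ G_{i,c}
            task_target = torch.full((n_per_class,), i, dtype=torch.long)
            class_target = torch.full((n_per_class,), c, dtype=torch.long)
            loss = F.cross_entropy(tii_head(r_hat), task_target)      # Eq. (17)
            loss = loss + F.cross_entropy(tap_head(r), class_target)  # Eq. (18)
            opt.zero_grad()
            loss.backward()
            opt.step()
```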

Improvement of Uninstructed Representations: Since the TII process depends heavily on the uninstructed representations of 𝒟i\mathcal{D}_{i} collected from fθ()f_{\theta}(\cdot), its effectiveness tends to be severely affected by the pre-trained checkpoints and target tasks. This issue becomes particularly pronounced when implementing more advanced PET techniques, which better incorporate task-specific knowledge for WTP and are thus more sensitive to TII. To address this issue, we further deploy a set of task-shared parameters 𝒈\boldsymbol{g} to improve TII in a task-agnostic manner. 𝒈\boldsymbol{g} is implemented via mainstream PET techniques analogous to 𝒆i\boldsymbol{e}_{i}, and optimized with the cross-entropy CE(ψ^,𝒈)\mathcal{L}_{{\rm{CE}}}(\hat{\psi},\boldsymbol{g}) to accumulate task-shared knowledge from sequentially arriving 𝒟i\mathcal{D}_{i} for i[t]i\in[t], where ψ^\hat{\psi} is an interim copy of ψ\psi to avoid overwriting. We then use fθ,𝒈()f_{\theta,\boldsymbol{g}}(\cdot) instead of fθ()f_{\theta}(\cdot) to collect uninstructed representations (for naming consistency, we still use “uninstructed representations” to denote the projection of fθ,𝒈()f_{\theta,\boldsymbol{g}}(\cdot)), approximate each 𝒢^i,c\hat{\mathcal{G}}_{i,c} and optimize Eq. (17).

Notably, 𝒈\boldsymbol{g} needs to overcome its own catastrophic forgetting, which leads to not only the loss of information in representation learning but also a representation shift in subsequent recovery. There are many CL methods attempting to address this challenge [14, 34, 4], but their strategies remain sub-optimal in balancing sequentially arriving tasks (see Table IV). Here we propose a simple yet effective strategy by taking advantage of first-session adaptation (FSA) [34, 43] and slow learner (SL) [4, 14]. Specifically, we learn 𝒈\boldsymbol{g} in the first task with a larger learning rate that is adequately strong for representation learning, and then in subsequent tasks with a smaller learning rate for further fine-tuning. In this way, task-shared knowledge is effectively incorporated into 𝒈\boldsymbol{g} and accumulates over time.
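A minimal sketch of this FSA-plus-SL schedule is shown below; the concrete learning-rate values and the optimizer choice are illustrative assumptions.

```python
import torch

def optimizer_for_shared_pet(g_params, interim_head_params, task_id,
                             fsa_lr=1e-2, sl_lr=1e-4, head_lr=1e-2):
    """First task: a larger learning rate on g for strong representation learning (FSA);
    subsequent tasks: a much smaller one for gentle fine-tuning (SL).
    The interim output head psi_hat always uses a regular learning rate."""
    g_lr = fsa_lr if task_id == 0 else sl_lr
    return torch.optim.Adam([
        {"params": g_params, "lr": g_lr},
        {"params": interim_head_params, "lr": head_lr},
    ])
```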

Figure 4: Inference of HiDe-PET during the testing phase. HiDe-PET first employs the task-shared parameters and the auxiliary output layer to infer task identity, and then employs the corresponding task-specific parameters and the output layer to obtain final prediction.

Recovery of Task Distributions: Since the pre-trained representations are well-distributed in general, there are many feasible strategies to approximate and preserve their distributions 𝒢^i,c\hat{\mathcal{G}}_{i,c} and 𝒢i,c\mathcal{G}_{i,c}. The most straightforward option is to save randomly selected prototypes [44], yet this does not adequately exploit the relationships between them. For classification tasks, the class-wise distribution is typically single-peaked and thus can be naturally approximated as a Gaussian with its mean and covariance [4, 13]. In order to reduce storage complexity, dedicated covariance matrices need to be further simplified for practical use, suffering from information loss to varying degrees [4, 34, 13]. Considering both storage efficiency and task-type generality, our default implementation is to obtain multiple representation centroids with kk-Nearest Neighbors (kkNNs) and add Gaussian noise to them. We also provide an empirical comparison of different implementations in Table V.
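The default recovery strategy can be sketched as follows: a few centroids per class are stored and later perturbed with Gaussian noise when representations are re-sampled for Eq. (17) and Eq. (18). The chunk-and-average routine used to form centroids here, as well as the noise scale, are simplifying assumptions for illustration.

```python
import torch

class CentroidGaussianMemory:
    """Approximate each class-wise distribution of representations by a few
    centroids plus isotropic Gaussian noise (the default recovery in Sec. 5.1)."""
    def __init__(self, sigma=0.1):
        self.centroids = {}   # (task_id, class_id) -> tensor of shape [K, d]
        self.sigma = sigma

    def add(self, task_id, class_id, feats, num_centroids=5):
        # Any clustering routine could produce the centroids; for brevity we
        # average evenly split chunks of the (shuffled) features.
        feats = feats[torch.randperm(feats.size(0))]
        chunks = feats.chunk(num_centroids, dim=0)
        self.centroids[(task_id, class_id)] = torch.stack([c.mean(dim=0) for c in chunks])

    def sample(self, task_id, class_id, n):
        cents = self.centroids[(task_id, class_id)]
        idx = torch.randint(len(cents), (n,))
        return cents[idx] + self.sigma * torch.randn(n, cents.size(1))
```

Two instances of such a container can back the uninstructed distributions 𝒢^i,c and the instructed distributions 𝒢i,c, respectively.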

Overall, the entire HiDe-PET consists of two training stages (see Algorithm 1 and Fig. 3), corresponding to the pre-trained transformer backbone and the (auxiliary) output layer. At test time, HiDe-PET first predicts the task identity i^=h^ω(fθ,𝒈(𝒙))\hat{i}=\hat{h}_{\omega}(f_{\theta,\boldsymbol{g}}(\boldsymbol{x})) and then the overall class label y^=hψ(fθ,𝒆i^(𝒙))\hat{y}=h_{\psi}(f_{\theta,\boldsymbol{e}_{\hat{i}}}(\boldsymbol{x})) (see Fig. 4). Compared to the backbone parameters θ\theta, the trainable parameters 𝒆t\boldsymbol{e}_{t}, 𝒈\boldsymbol{g}, ω\omega and ψ\psi, as well as the representation recovery are all lightweight, thus ensuring resource efficiency.
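The two-stage inference of Fig. 4 can be summarized by the following sketch, where backbone(x, pet=...) again stands in for f_{θ,·} and the per-sample routing to the predicted task-specific parameters is an assumed (simple but not batch-efficient) implementation.

```python
import torch

@torch.no_grad()
def hide_pet_predict(x, backbone, g, task_params, tii_head, tap_head):
    """First infer the task identity from the uninstructed representation under g,
    then predict the class from the representation instructed by e_{i_hat}."""
    r_hat = backbone(x, pet=g)                    # f_{theta,g}(x)
    task_id = tii_head(r_hat).argmax(dim=-1)      # i_hat = argmax h_omega(.)
    preds = []
    for b in range(x.size(0)):                    # route each sample to its e_{i_hat}
        r = backbone(x[b:b + 1], pet=task_params[int(task_id[b])])
        preds.append(tap_head(r).argmax(dim=-1))  # y_hat = argmax h_psi(.)
    return torch.cat(preds)
```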

5.2 Adaptive Knowledge Accumulation

Within HiDe-PET, the parallel organization of 𝒆1,,𝒆t\boldsymbol{e}_{1},...,\boldsymbol{e}_{t} and 𝒈\boldsymbol{g} facilitates the incorporation of task-specific and task-shared knowledge for many representative CL scenarios. In fact, the functions of 𝒆1,,𝒆t\boldsymbol{e}_{1},...,\boldsymbol{e}_{t} and 𝒈\boldsymbol{g} are usually not exclusive, depending on the similarity and dissimilarity between task distributions. Motivated by the intrinsic connection between TII and OOD detection in our theoretical analysis, we unify the task-specific and task-shared PET architectures with a hierarchy of expandable parameter sets (see Fig. 5), which may degenerate into either case of Sec. 5.1. We further explore a particular implementation of this hierarchy in order to accumulate knowledge adaptively from more pronounced distribution changes.

Let us assume the existence of multiple parameter sets that are implemented via mainstream PET techniques and are expanded or retrieved upon the input samples. For example, the sample elements of sequentially learning tasks i[t1]i\in[t-1] have derived kk parameter sets 𝒈1,,𝒈k\boldsymbol{g}_{1},...,\boldsymbol{g}_{k}. If the incoming 𝒙𝒳t\boldsymbol{x}\in\mathcal{X}_{t} is identified as OOD from the previously observed distributions of 𝒳i\mathcal{X}_{i}, it learns an expanded set of parameters 𝒈k+1\boldsymbol{g}_{k+1} through the task-specific loss CE(ψ^,𝒈k+1)\mathcal{L}_{{\rm{CE}}}(\hat{\psi},\boldsymbol{g}_{k+1}), otherwise it retrieves and updates the most relevant one 𝒈j\boldsymbol{g}_{j} with j[k]j\in[k] through CE(ψ^,𝒈j)\mathcal{L}_{{\rm{CE}}}(\hat{\psi},\boldsymbol{g}_{j}), where ψ^\hat{\psi} is an interim copy of the output layer parameters ψ\psi to avoid overwriting.

Figure 5: Adaptive knowledge accumulation. HiDe-PET employs OOD detection to decide whether to expand a new set of parameters or to retrieve a previously learned set of parameters. Such parameters are further specified into LoRA-based PET to update the backbone.

OOD Detection Strategy: Given that the previous 𝒳i\mathcal{X}_{i} are not available in CL to describe their distributions, we take inspiration from recent metric-based OOD detectors [45] and formulate an effective criterion with their uninstructed representations:

Pi(𝒙𝒳i|𝒟,θ)=𝟏(Dis(𝒙,𝒢^i)>λOOD),𝒙𝒳t,P_{i}(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)=\boldsymbol{1}({\rm{Dis}}(\boldsymbol{x},\hat{\mathcal{G}}_{i})>\lambda_{{\rm{OOD}}}),\boldsymbol{x}\in\mathcal{X}_{t}, (19)

where 𝒢^i\hat{\mathcal{G}}_{i} for i[t1]i\in[t-1] denotes the approximated distribution of uninstructed representations. It can be further specified as 𝒢^i=c𝒴i𝒢^i,c\hat{\mathcal{G}}_{i}=\bigcup_{c\in\mathcal{Y}_{i}}\hat{\mathcal{G}}_{i,c} for classification tasks. λOOD\lambda_{{\rm{OOD}}} denotes the OOD detection threshold. 𝟏()\boldsymbol{1}(\cdot) is the indicator function. Dis(𝒙,𝒢^i){\rm{Dis}}(\boldsymbol{x},\hat{\mathcal{G}}_{i}) measures the distance of task-wise distributions, which can be implemented via the average Euclidean distance between fθ,𝒈(𝒙)f_{\theta,\boldsymbol{g}}(\boldsymbol{x}) and 𝒓^𝒢^i\hat{\boldsymbol{r}}\sim\hat{\mathcal{G}}_{i}. Consequently, if 𝒙\boldsymbol{x} is identified as OOD for all tasks i[t1]i\in[t-1], then it will be associated with 𝒈k+1\boldsymbol{g}_{k+1}. Otherwise, it will retrieve the associated 𝒈j\boldsymbol{g}_{j} for j[k]j\in[k] corresponding to the majority of the most relevant task i^=argmini[t1]Dis(𝒙,𝒢^i)\hat{i}=\arg\min_{i\in[t-1]}{\rm{Dis}}(\boldsymbol{x},\hat{\mathcal{G}}_{i}). To overcome catastrophic forgetting when updating 𝒈j\boldsymbol{g}_{j}, we employ the same strategy as for learning the task-shared parameters 𝒈\boldsymbol{g}, e.g., a combination of FSA [43] and SL [4]. As a special case, we have k=1k=1 if all input samples are identified as in-distribution, for which only 𝒈1\boldsymbol{g}_{1} exists and is equivalent to 𝒈\boldsymbol{g}.
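A sketch of this decision rule is given below: the distance Dis(𝒙, 𝒢^i) is computed as the average Euclidean distance between the current uninstructed representation and samples preserved from each previous task. The memory format and the threshold value are assumptions for illustration.

```python
import torch

@torch.no_grad()
def route_or_expand(r_hat, task_memories, threshold):
    """Eq. (19): expand a new parameter set if the sample is OOD for every previous
    task, otherwise retrieve the set of the closest task. r_hat has shape [d] and
    task_memories[i] holds samples from G_hat_i with shape [M, d]."""
    dists = torch.stack([torch.cdist(r_hat[None], mem).mean() for mem in task_memories])
    is_ood = bool((dists > threshold).all())
    best_task = int(dists.argmin())
    return is_ood, best_task
```

In practice, the decision would be taken over an incoming batch, assigning it to 𝒈k+1 if the majority of samples are identified as OOD, and otherwise to the 𝒈j associated with the most frequently retrieved task.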

Connection of PET Architectures: We now extend the above discussion with the criterion of OOD detection in Eq. (19). Given a task sequence 1,,t1,...,t, using a larger λOOD\lambda_{{\rm{OOD}}} would make k1k\rightarrow 1 and {𝒈1,,𝒈k}{𝒈}\{\boldsymbol{g}_{1},...,\boldsymbol{g}_{k}\}\rightarrow\{\boldsymbol{g}\}, while using a smaller λOOD\lambda_{{\rm{OOD}}} would make ktk\rightarrow t and {𝒈1,,𝒈k}{𝒆1,,𝒆t}\{\boldsymbol{g}_{1},...,\boldsymbol{g}_{k}\}\rightarrow\{\boldsymbol{e}_{1},...,\boldsymbol{e}_{t}\}. Therefore, the representation learning of HiDe-PET in Sec. 5.1 is equivalent to a parallel combination of these two extreme conditions for TII and WTP, respectively. This is a reasonable choice as most CL benchmarks employ randomly split classes of the same dataset as the task sequence, i.e., there is no actual task structure.

Instead, Eq. (19) applies to the apparent similarity and dissimilarity between task distributions, which is more realistic in applications and enables adaptive knowledge accumulation from CL for enhanced utilization. Here we leverage the unique property of LoRA-based PET to construct a specialized implementation, serving as a plug-in module for Algorithm 1. Unlike the commonly-used Prompt-based PET that updates only attached tokens, the LoRA-based PET specifies 𝒈1,,𝒈k\boldsymbol{g}_{1},...,\boldsymbol{g}_{k} as the approximated updates of θ\theta, where the most relevant 𝒈j\boldsymbol{g}_{j} is selected and temporarily added to θ\theta in CL. Therefore, the learning of subsequent tasks can be significantly improved from the accumulated knowledge and further contribute to it (see Fig. 5). Moreover, this allows for the flexible evolution of pre-trained knowledge with target tasks in a lifelong manner, deviating from the conventional practice of fixing it at the initial checkpoint.
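For the LoRA-based hierarchy, retrieving 𝒈j amounts to temporarily merging its low-rank update into the frozen weights, which can be undone afterwards. A sketch under the assumption that 𝒈j stores one (W_down, W_up) pair per adapted projection matrix:

```python
import torch

def merge_lora(frozen_weights, g_j, s=1.0):
    """Temporarily apply the retrieved update: W <- W + s * W_down @ W_up (in place)."""
    for name, (w_down, w_up) in g_j.items():
        frozen_weights[name] += s * (w_down @ w_up)

def unmerge_lora(frozen_weights, g_j, s=1.0):
    """Remove the temporary update so that theta returns to its previous state."""
    for name, (w_down, w_up) in g_j.items():
        frozen_weights[name] -= s * (w_down @ w_up)
```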

In brief, Secs. 4 and 5 together constitute a unified framework for CL with PTMs and PET, which we use to explore the distinct impacts of implementation strategy, PET technique and PET architecture, as well as adaptive knowledge accumulation for enhanced utilization.

6 Experiment

In this section, we perform extensive experiments to demonstrate the effectiveness and generality of our HiDe-PET. We first describe the experimental setups, and then present the experimental results with a comprehensive analysis.

6.1 Experimental Setup

To ensure the breadth and adequacy of the experiments, we consider a variety of CL benchmarks, pre-trained checkpoints, recent strong baselines, PET techniques and evaluation metrics. For fair comparison, we reproduce all baselines following their official implementations.

TABLE I: Overall performance of continual learning. PTM: pre-trained model. FAA (%): final average accuracy. CAA (%): cumulative average accuracy.
PTM              Method Split CIFAR-100 Split ImageNet-R Split CUB-200 Split Cars-196
FAA (\uparrow) CAA (\uparrow) FAA (\uparrow) CAA (\uparrow) FAA (\uparrow) CAA (\uparrow) FAA (\uparrow) CAA (\uparrow)
Sup-21/1K L2P [5] 84.25 88.84 71.34 76.87 70.90 76.70 41.06 46.47
DualPrompt [6] 83.75 89.11 71.65 77.51 68.21 75.15 42.68 51.60
S-Prompt++ [7] 82.41 87.68 71.15 77.16 68.01 75.04 39.62 47.85
CODA-Prompt [8] 86.65 90.78 75.11 81.45 71.43 78.61 45.67 53.28
LAE-PreT [14] 87.36 91.63 74.95 81.23 78.46 83.65 42.80 52.12
LAE-LoRA [14] 88.38 92.45 76.27 82.99 80.02 84.47 50.90 58.38
LAE-Adapter [14] 88.37 92.50 75.69 82.80 80.52 84.75 55.20 61.63
HiDe-PreT 91.11 94.11 78.93 83.44 87.95 88.48 68.73 69.19
HiDe-LoRA 91.21 93.99 79.32 83.97 88.76 89.32 69.65 69.36
HiDe-Adapter 91.23 94.26 78.65 83.55 88.49 89.17 70.98 71.31
iBOT-21K L2P [5] 79.32 85.13 61.31 70.05 45.93 56.02 45.25 45.75
DualPrompt [6] 78.17 85.15 61.42 70.06 41.46 54.57 34.61 42.28
S-Prompt++ [7] 79.85 85.89 60.84 69.01 39.88 53.71 36.46 43.34
CODA-Prompt [8] 81.58 87.36 67.15 76.54 47.79 59.24 39.50 43.32
LAE-PreT [14] 82.22 88.05 65.85 75.34 45.83 60.31 49.14 52.59
LAE-LoRA [14] 84.63 90.24 70.49 79.06 56.16 68.38 58.66 62.59
LAE-Adapter [14] 84.68 90.31 69.93 79.14 58.04 70.01 61.76 65.61
HiDe-PreT 88.13 92.17 70.57 77.89 70.72 74.09 63.98 64.18
HiDe-LoRA 89.72 93.34 74.46 80.89 76.10 79.99 67.73 68.64
HiDe-Adapter 89.46 93.12 74.25 80.48 75.17 79.42 69.62 70.11
Sup-Weak L2P [5] 67.73 78.84 47.95 56.51 43.99 58.85 33.25 38.97
DualPrompt [6] 69.09 79.56 51.21 59.67 46.05 58.51 35.08 42.99
S-Prompt++ [7] 71.05 81.34 47.87 56.62 42.91 57.70 36.20 43.35
CODA-Prompt [8] 65.45 76.43 53.21 63.61 44.91 57.73 35.59 41.90
LAE-PreT [14] 67.25 77.34 55.55 64.78 48.56 61.73 36.63 41.56
LAE-LoRA [14] 68.43 78.57 57.40 66.84 48.99 61.50 35.35 39.93
LAE-Adapter [14] 68.55 78.59 57.92 67.79 49.79 62.25 37.17 41.72
HiDe-PreT 77.65 85.14 57.98 65.79 65.03 71.63 52.89 55.09
HiDe-LoRA 77.46 84.89 59.40 67.05 66.84 71.91 52.61 54.78
HiDe-Adapter 76.71 84.55 58.94 67.53 66.26 71.24 54.38 56.23
Figure 6: Comparison of our HiDe-PET, our preliminary version [13] and LAE [14] implemented with different PET techniques. FAA (%): final average accuracy. FFM (%): final forgetting measure of old tasks. ALA (%): average learning accuracy of new tasks. Note the different range and scale of the y-axis. We present in the first row the average FAA lead of our HiDe-PET over LAE (dark purple) and our preliminary version (dark blue).

Benchmark: We focus on four datasets that are widely used in previous work to evaluate CL [5, 6, 8, 14, 4], and split each into 10 target tasks of disjoint classes for CIL experiments. The first two are general datasets: CIFAR-100 [46], which contains 100 classes of small-scale images of common real-world objects, and ImageNet-R [47], which contains 200 classes of large-scale images that are either hard examples from ImageNet-21K [48] or newly collected examples of different styles. The latter two are fine-grained datasets: CUB-200 [49] of 200 bird classes and Cars-196 [50] of 196 car types. We mainly consider three representative pre-trained checkpoints that differ in paradigm and dataset: Sup-21/1K, iBOT-21K and Sup-Weak. Specifically, Sup-21/1K [8] is essentially a supervised checkpoint that performs self-supervised learning on ImageNet-21K followed by supervised fine-tuning on ImageNet-1K. iBOT-21K [51] is a self-supervised checkpoint that achieves leading classification performance among self-supervised methods on ImageNet-21K. Sup-Weak [52] is a supervised checkpoint trained on a subset of ImageNet-1K, from which 389 classes similar to those of the subsequent CL tasks are intentionally removed.
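For reference, the class-disjoint split used in these CIL experiments can be produced as in the following sketch, where the class order and grouping are illustrative rather than the exact split files used in our experiments.

import numpy as np

def split_into_tasks(labels, num_tasks=10, seed=0):
    # Randomly partition the class set into disjoint groups and return, for each task,
    # the indices of the samples whose label falls in that group.
    rng = np.random.default_rng(seed)
    classes = rng.permutation(np.unique(labels))
    groups = np.array_split(classes, num_tasks)         # e.g., 100 CIFAR-100 classes -> 10 x 10
    return [np.where(np.isin(labels, g))[0] for g in groups]

# labels = np.array(train_set.targets)                  # assuming a torchvision-style dataset
# task_indices = split_into_tasks(labels)               # task_indices[i] indexes task i's samples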

Baseline: We compare our HiDe-PET with a range of recent strong baselines as described in Sec. 3.3, including L2P [5], DualPrompt [6], S-Prompt [7], CODA-Prompt [8] and LAE [14]. In brief, these baselines cover different PET architectures, but mainly target the Prompt-based PET. LAE [14] is the most recent baseline among them, and is also the most relevant one to ours as it applies to a variety of mainstream PET techniques. Similar to the previous work [13], we modify the implementation of S-Prompt [7] by inserting task-specific prompts into multiple MSA layers in a PreT manner and using a single-head output layer, in order to evaluate the impact of PET architecture and better adapt to the CIL experiments. The modified S-Prompt is referred to as S-Prompt++. Following LAE [14], we consider three kinds of mainstream PET techniques in our HiDe-PET and our preliminary version [13], including PreT [10], LoRA [12] and Adapter [11].

Evaluation: We use A_{i,i^{\prime}} to denote the prediction accuracy on task i after learning task i^{\prime} (with single-head evaluation for CIL), and define the average accuracy over all seen tasks as AA_{i^{\prime}}=\frac{1}{i^{\prime}}\sum_{i=1}^{i^{\prime}}A_{i,i^{\prime}}. After learning all tasks, we report both the final average accuracy {\rm{FAA}}=AA_{t}, which serves as the primary metric of CL performance, and the cumulative average accuracy {\rm{CAA}}=\frac{1}{t}\sum_{i=1}^{t}AA_{i}, which further reflects the historical performance. Moreover, we evaluate the behaviors of different PET techniques with the average learning accuracy {\rm{ALA}}=\frac{1}{t-1}\sum_{i=2}^{t}A_{i,i-1} for learning plasticity and the final forgetting measure {\rm{FFM}}=\frac{1}{t-1}\sum_{i=1}^{t-1}\max_{i^{\prime}\in[t-1]}(A_{i,i^{\prime}}-A_{i,t}) for memory stability, and also report the TII performance of our HiDe-PET.
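These metrics can be computed directly from the accuracy matrix, as in the small sketch below (0-based indices, illustrative variable names), which simply follows the definitions above.

import numpy as np

def cl_metrics(A):
    # A[i, j]: accuracy on task i after learning task j (valid for i <= j), shape (t, t).
    t = A.shape[0]
    AA = np.array([A[:j + 1, j].mean() for j in range(t)])  # average accuracy after each task
    FAA = AA[-1]                                            # final average accuracy
    CAA = AA.mean()                                         # cumulative average accuracy
    ALA = np.mean([A[i, i - 1] for i in range(1, t)])       # average learning accuracy, per the definition above
    FFM = np.mean([A[i, :t - 1].max() - A[i, t - 1]         # final forgetting measure
                   for i in range(t - 1)])
    return FAA, CAA, ALA, FFM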

Implementation: We follow similar implementations as previous work. (The training regime and supervised checkpoints are identical to those in CODA-Prompt [8], which differ slightly from those in our preliminary version [13] and lead to some performance differences.) Specifically, we employ a pre-trained ViT-B/16 backbone with an Adam optimizer (\beta_{1}=0.9, \beta_{2}=0.999), a batch size of 128, and a cosine-decaying learning rate of 0.001. We train Split CIFAR-100 for 20 epochs and the other benchmarks for 50 epochs to ensure convergence on each task. The image inputs are resized to 224\times 224 and normalized to [0,1]. The PET architecture of each baseline follows its original paper, which has been shown to yield strong performance. Specifically, L2P [5] is implemented with a prompt pool size of M=30, a prompt length of d_{\boldsymbol{p}}=5 and the Top-5 keys. DualPrompt [6] is implemented with task-shared prompts of length d_{\boldsymbol{p}}=5 inserted into layers 1-2 and task-specific prompts of length d_{\boldsymbol{p}}=20 inserted into layers 3-5. S-Prompt++ [7] is implemented similarly to DualPrompt but replaces all task-shared prompts with task-specific prompts of length d_{\boldsymbol{p}}=20, inserted into layers 1-5. CODA-Prompt [8] is implemented with a prompt pool size of M=100 and a prompt length of d_{\boldsymbol{p}}=8, inserted into layers 1-5. LAE [14] and our HiDe-PET are implemented with a prompt length of d_{\boldsymbol{p}}=20 for PreT, and a low-dimension bottleneck of r=10 for Adapter and LoRA, inserted into layers 1-5. We insert the Adapter modules in both sequential and parallel manners, and employ LoRA to update both \boldsymbol{W}_{K} and \boldsymbol{W}_{V}. Therefore, the extra parameter costs of PreT, Adapter and LoRA are identical [14]. Note that our HiDe-PET and our preliminary version [13] adopt a similar PET architecture as S-Prompt++, but replace the task-specific keys with an auxiliary output layer \hat{h}_{\omega} that predicts the task identity, and further preserve statistical information of pre-trained representations. (We consider a lightweight implementation of the auxiliary output layer and representation recovery, which slightly compromises the performance but largely improves resource efficiency.)
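The training recipe above can be set up as in the following PyTorch sketch, where trainable_params stands for the lightweight PET parameters and output layers (the ViT backbone remains frozen); this is illustrative rather than our exact training script.

import torch

def build_optimizer(trainable_params, epochs, steps_per_epoch, lr=1e-3):
    # Adam with the betas listed above and a cosine-decaying learning rate over all training steps.
    optimizer = torch.optim.Adam(trainable_params, lr=lr, betas=(0.9, 0.999))
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs * steps_per_epoch)
    return optimizer, scheduler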

6.2 Experimental Result

Now we present the results of our empirical investigation, including the overall performance of all methods, an ablation study of the three hierarchical components, the distinct impacts of implementation strategy, PET technique and PET architecture, as well as the adaptive knowledge accumulation over similar and dissimilar tasks.

Overall Performance: Table I summarizes the results of all methods across various pre-trained checkpoints and CL benchmarks. Our HiDe-PET implemented with the three mainstream PET techniques consistently achieves the highest performance, and its lead tends to be more pronounced under the more challenging scenarios. Specifically, the performance of all methods is affected to varying degrees by fine-grained tasks (i.e., CUB-200 and Cars-196) and by weakened pre-training in terms of the self-supervised paradigm (i.e., iBOT-21K) and reduced pre-training samples (i.e., Sup-Weak). Among the competitors, the sub-optimality of Prompt-based PET in CL is clearly exposed, as it underperforms LoRA/Adapter-based PET both within and across methods. The LoRA/Adapter version of LAE [14] is the strongest competitor but is still severely affected by the double challenges of pre-trained checkpoints and CL benchmarks. In contrast, our HiDe-PET adapts to them effectively with strong generality.

TABLE II: Overall performance of continual learning under DINOv2 [53]. FAA (%): final average accuracy. CAA (%): cumulative average accuracy.
Setup      Method Split ImageNet-R Split Cars-196
FAA (\uparrow) CAA (\uparrow) FAA (\uparrow) CAA (\uparrow)
DINOv2 LVD-142M LAE-PreT [14] 78.98 85.26 45.53 58.19
LAE-LoRA [14] 79.13 86.04 52.48 61.92
LAE-Adapter [14] 77.43 83.98 57.30 67.08
HiDe-PreT 85.68 87.70 85.65 82.20
HiDe-LoRA 86.26 89.14 85.53 84.95
HiDe-Adapter 86.03 90.05 83.48 84.11

It is noteworthy that self-supervised pre-training is often considered more practical than supervised pre-training, owing to the expense of annotating massive pre-training samples [13, 4]. Meanwhile, Sup-Weak avoids potential overlap between PTMs and target tasks, providing a more restrictive scenario for CL [52]. Sup-Weak is also more analogous to the setting widely used in the CIL literature, i.e., the model first learns half of the classes and then learns the remaining classes in multiple incremental phases, where the baselines of Prompt-based PET have been shown to perform poorly [54]. These considerations underscore the more profound advantages of our HiDe-PET in CL. We further evaluate CL under DINOv2 with a ViT-B/14 backbone [53], a state-of-the-art self-supervised checkpoint that largely improves over iBOT-21K with additional training tricks and more pre-training data; HiDe-PET consistently outperforms LAE [14] by a wide margin (see Table II).

TABLE III: Ablation study of the three hierarchical components in HiDe-PET. Naive: a naive baseline of only task-specific parameters.
Setup           Method Split ImageNet-R Split Cars-196
FAA (\uparrow) CAA (\uparrow) FAA (\uparrow) CAA (\uparrow)
Sup-21/1K PreT Naive 69.77 76.36 46.08 53.59
WTP 73.01 78.20 48.23 54.11
WTP+TII 75.68 80.95 52.89 54.18
WTP+TAP 78.06 82.69 61.38 64.46
WTP+TII+TAP 78.93 83.44 68.73 69.19
Sup-21/1K LoRA Naive 74.54 80.66 45.39 54.28
WTP 75.59 81.31 48.01 53.85
WTP+TII 76.03 81.69 49.62 57.29
WTP+TAP 78.33 83.68 63.28 65.87
WTP+TII+TAP 79.32 83.97 69.65 69.36
Sup-21/1K Adapter Naive 75.17 81.46 47.20 55.69
WTP 76.10 82.23 53.12 59.35
WTP+TII 76.80 82.60 55.93 60.89
WTP+TAP 76.98 82.73 64.65 67.38
WTP+TII+TAP 78.65 83.55 70.98 71.31
iBOT-21K PreT Naive 63.78 73.47 41.54 47.96
WTP 64.98 73.54 52.99 56.31
WTP+TII 66.33 74.98 53.89 57.01
WTP+TAP 69.84 77.02 59.75 61.28
WTP+TII+TAP 70.57 77.89 63.98 64.18
iBOT-21K LoRA Naive 67.07 77.07 53.13 59.07
WTP 68.06 77.39 56.03 61.18
WTP+TII 68.54 77.60 59.48 63.36
WTP+TAP 72.60 79.95 61.50 64.88
WTP+TII+TAP 74.46 80.89 67.73 68.64
iBOT-21K Adapter Naive 68.17 77.57 53.58 59.69
WTP 69.11 77.29 57.71 62.21
WTP+TII 69.65 77.90 62.12 65.66
WTP+TAP 71.32 79.17 62.78 65.59
WTP+TII+TAP 74.25 80.48 69.62 70.11

Ablation Study: Table III presents an ablation study to validate the effectiveness of optimizing the three hierarchical components in HiDe-PET. Specifically, we progressively incorporate the designs of within-task prediction (WTP), task-identity inference (TII) and task-adaptive prediction (TAP) on top of a naive architecture, which consists of only the task-specific parameters \boldsymbol{e}_{1},...,\boldsymbol{e}_{t} implemented via mainstream PET techniques. In general, optimizing each component contributes to the strong performance of HiDe-PET. Although their contributions are relatively comparable under supervised pre-training and general tasks, the improvement from TAP becomes more significant under self-supervised pre-training and fine-grained tasks, demonstrating the necessity of TAP within the CL objective. Besides, the improvement from TII often becomes more apparent with WTP+TAP rather than with WTP alone, suggesting that WTP, TII and TAP operate in concert rather than in isolation.

Implementation Strategy: We now evaluate the implementation strategies for task-shared parameters and representation recovery, which are both important for the optimization of our HiDe-PET and are potentially shared by many recent methods. In contrast to the task-specific parameters discussed above, the task-shared parameters \boldsymbol{g} aim to improve pre-trained representations in a task-agnostic manner, demanding effective strategies to mitigate catastrophic forgetting. Various strategies have been employed in previous work, including (1) fix-and-tuning (F&T) [14], which updates the output layer with frozen \boldsymbol{g} in earlier epochs and then updates \boldsymbol{g} for representation learning in later epochs; (2) first-session adaptation (FSA) [43], which updates \boldsymbol{g} for representation learning exclusively on the first task and then fixes \boldsymbol{g} in subsequent tasks; (3) slow learner (SL) [4], which reduces the learning rate of \boldsymbol{g} in all tasks; and (4) exponential moving average (EMA) [14], which employs an interim copy of \boldsymbol{g} to learn each task and then updates \boldsymbol{g} with a small momentum.

TABLE IV: Comparison of different strategies for learning task-shared parameters. TII (%): performance of task identity inference. FAA-U (%): final average accuracy of learning all classes from uninstructed representations. F&T: fix-and-tuning. FSA: first-session adaptation. SL: slow learner. EMA: exponential moving average.
Setup   Method Split ImageNet-R Split Cars-196
TII (\uparrow) FAA-U (\uparrow) TII (\uparrow) FAA-U (\uparrow)
Sup-21/1K PreT F&T [14] 76.45 74.50 59.46 52.94
FSA [43] 75.85 73.76 68.42 58.85
SL [4] 77.06 74.68 66.13 56.67
EMA [14] 76.17 73.93 68.35 58.99
FSA+SL 77.15 75.02 69.43 59.32
Sup-21/1K LoRA F&T [14] 71.90 69.85 64.65 57.68
FSA [43] 77.74 75.72 70.90 61.37
SL [4] 77.26 75.41 68.68 59.33
EMA [14] 78.33 76.51 71.20 61.87
FSA+SL 78.43 76.35 71.92 62.89
Sup-21/1K Adapter F&T [14] 75.45 73.88 57.16 52.11
FSA [43] 78.15 76.30 72.75 63.51
SL [4] 77.52 75.98 63.20 56.16
EMA [14] 78.29 76.30 73.59 64.46
FSA+SL 80.09 78.52 73.71 64.93

However, most of these strategies have their own limitations. Both F&T and SL restrict the extent of updates, sacrificing the effectiveness of representation learning and suffering from potential representation shifts in subsequent recovery. FSA adeptly integrates knowledge from the first task and completely avoids representation shifts, but cannot perform any subsequent representation learning. Considering their complementary properties, we propose a simple but effective strategy that employs FSA for learning the first task and SL for learning subsequent tasks, which clearly outperforms the other strategies (see Table IV). Notably, EMA can be seen as a coarse implementation of FSA+SL and indeed achieves the second-highest performance. Therefore, the task-shared parameters in HiDe-PET may also be implemented with EMA by updating \boldsymbol{g} from each \boldsymbol{e}_{i} for i\in[t], slightly compromising performance but halving the training cost.
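In effect, FSA+SL reduces to a per-task learning-rate rule for the task-shared parameters \boldsymbol{g}, as in the minimal sketch below; the base rate and slow-learner factor are illustrative assumptions rather than tuned values.

def shared_param_lr(task_id, base_lr=1e-3, slow_factor=0.1):
    # FSA+SL: full-rate adaptation of g on the first task (first-session adaptation),
    # then a reduced, slow-learner rate on subsequent tasks to limit forgetting.
    return base_lr if task_id == 0 else base_lr * slow_factor

# for task_id, loader in enumerate(task_sequence):
#     lr_g = shared_param_lr(task_id)   # learning rate for the task-shared parameters g
#     lr_e = 1e-3                       # task-specific parameters e_t keep the full rate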

On the other hand, we evaluate effective strategies for representation recovery. For classification tasks, the pre-trained representations of each class tend to be single-peaked and can therefore be modeled as a Gaussian with a dedicated mean and covariance. Although the covariance achieves considerable performance as shown in Table V, it requires a storage complexity of O(d^{2}) for embedding dimension d, which is comparable to an MSA layer of the backbone. There are three alternatives that reduce the storage complexity to O(d): simplifying the covariance to a variance, preserving randomly selected prototypes, and obtaining multiple representation centroids with kNNs. Among them, the multi-centroid strategy demonstrates superior performance and is applicable to different task types, and therefore becomes our default implementation. Interestingly, the variance achieves performance comparable to the covariance and the multi-centroid under general tasks (i.e., Split ImageNet-R) and supervised pre-training (i.e., Sup-21/1K) while requiring negligible parameter costs. This further strengthens the advantages of our HiDe-PET in the scenarios targeted by previous work.
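One simple way to obtain such per-class centroids is to cluster the uninstructed representations of each class; the sketch below uses scikit-learn's KMeans as a stand-in and may differ from the exact procedure in our implementation.

import numpy as np
from sklearn.cluster import KMeans

def class_centroids(features, k=10):
    # Compress the uninstructed representations of one class (shape (n, d)) into k centroids,
    # reducing storage from O(n*d) to O(k*d) with k << n.
    k = min(k, len(features))
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(features).cluster_centers_

# stored = {c: class_centroids(feats) for c, feats in per_class_features.items()}
# During recovery, the stored centroids substitute for old-task representations
# when rectifying the output layer.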

Besides, we note that the proposed PET ensemble of task-specific parameters ensures efficiency and scalability due to the lightweight nature of mainstream PET techniques. As described in Sec. 3.2, ProT and PreT employ \boldsymbol{p}\in\mathbb{R}^{d_{\boldsymbol{p}}\times d} with d_{\boldsymbol{p}}\ll d, while Adapter and LoRA employ \boldsymbol{W}_{{\rm{down}}}\in\mathbb{R}^{d\times r} and \boldsymbol{W}_{{\rm{up}}}\in\mathbb{R}^{r\times d} with r\ll d. In our default implementation, the additional parameter costs of PreT, Adapter and LoRA for learning each task are kept the same, i.e., d_{\boldsymbol{p}}=20 for PreT and r=10 for Adapter and LoRA, inserted into layers 1-5. Therefore, the additional parameter cost is 0.073M with embedding dimension d=768. Even if the model needs to learn a long task sequence, e.g., 100 tasks, the total parameter cost is only 7.3M (around 8.5% of the ViT-B/16 backbone).

PET Technique: While mainstream PET techniques universally amount to modulating specific hidden states of the PTMs [33], their potential differences in CL are noteworthy. As mentioned above, Prompt-based PET usually lags behind LoRA/Adapter-based PET for both LAE and HiDe-PET (see Table I and Fig. 6). The performance gap tends to be more pronounced under the more challenging scenarios of pre-trained checkpoints and CL benchmarks. A major cause is the limited capacity of Prompt-based PET (i.e., updating attached tokens) in representation learning compared to LoRA/Adapter-based PET (i.e., updating network parameters), especially when handling self-supervised pre-training [36] and fine-grained tasks [40], as validated in our results on both task-specific parameters (see WTP in Table III) and task-shared parameters (see FSA+SL in Table IV).

Beyond the overall performance, the choice of PET technique also exerts distinct influences on the three hierarchical components. Compared to Prompt-based PET, LoRA/Adapter-based PET excels in WTP performance by more effectively incorporating task-specific knowledge, but reveals a heightened sensitivity to TII performance, manifested in the errors of predicting an inappropriate set of task-specific parameters (i.e., mismatched representations for each task lead to more errors in the final prediction). This effect is further compensated by learning a robust TAP function. As shown in Table III, the effectiveness of TII is remarkably pronounced when coupled with WTP+TAP for LoRA/Adapter-based PET, whereas it diminishes for either Prompt-based PET or WTP alone. Moreover, our HiDe-PET outperforms its preliminary version [13], especially with LoRA/Adapter-based PET and under the more challenging scenarios (see Fig. 6), thanks to the improved TII performance from the task-shared parameters.

TABLE V: Comparison of different strategies for representation recovery. We set k=10 for kNNs to obtain Multi-Centroid. #Param: average parameter costs per class, where d=768 in this case.
Setup          Method Split ImageNet-R Split Cars-196
FAA (\uparrow) #Param (\downarrow) FAA (\uparrow) #Param (\downarrow)
Sup-21/1K PreT No Recovery 75.68 0 52.89 0
Prototype 76.88 10d 62.65 10d
Variance 77.54 1d 57.04 1d
Covariance 77.58 d^{2} 73.14 d^{2}
Multi-Centroid 78.93 9.5d 68.73 8.0d
iBOT-21K PreT No Recovery 58.88 0 41.89 0
Prototype 67.08 10d 46.34 10d
Variance 70.55 1d 48.27 1d
Covariance 68.85 d^{2} 66.42 d^{2}
Multi-Centroid 70.57 9.5d 63.98 8.7d
Sup-Weak PreT No Recovery 54.63 0 44.35 0
Prototype 55.06 10d 46.91 10d
Variance 55.49 1d 47.77 1d
Covariance 57.46 d^{2} 56.06 d^{2}
Multi-Centroid 57.98 9.1d 52.89 8.7d
TABLE VI: Evaluation of adaptive knowledge accumulation (AKA) with LoRA-based PET. Full (%): average accuracy of learning subsequent tasks with all training samples. Few (%): average accuracy of learning subsequent tasks with a few training samples (5 per class).
PTM Method Split Dogs-120 Split CUB-200 Split Cars-196 Split Aircraft-102 CL of Mixture
Full (\uparrow) Few (\uparrow) Full (\uparrow) Few (\uparrow) Full (\uparrow) Few (\uparrow) Full (\uparrow) Few (\uparrow) FAA (\uparrow) CAA (\uparrow)
Sup-21/1K w/o AKA 92.32 88.36 89.51 81.26 83.77 57.04 77.98 55.56 77.32 81.14
w/   AKA 92.50 88.92 91.39 84.91 89.46 64.30 83.38 62.37 83.27 86.78
iBOT-21K w/o AKA 81.70 54.08 74.66 45.79 79.27 38.15 79.24 53.97 68.38 74.57
w/   AKA 84.41 65.10 82.95 62.30 88.32 56.12 85.06 62.76 74.99 81.55
Sup-Weak w/o AKA 88.14 80.88 71.10 47.46 59.03 36.51 62.68 35.78 53.23 58.54
w/   AKA 88.17 81.38 77.50 56.92 77.42 50.37 72.64 46.32 65.48 70.41

PET Architecture: The generality of our HiDe-PET is also reflected in its PET architecture, which strategically exploits both task-specific and task-shared parameters for representation learning. These two kinds of parameters acquire knowledge with different levels of differentiation and need to overcome their respective challenges as described in Sec. 3.3. Within HiDe-PET, they both contribute to the outstanding performance in Table I and complement each other (see WTP in Table III and FSA+SL in Table IV). In contrast, our preliminary version [13] and LAE [14] exclusively engage either task-specific or task-shared parameters, missing out on fully harnessing the benefits of PTMs and PET. We further present an extensive comparison of our HiDe-PET, our preliminary version and LAE in terms of overall performance, memory stability and learning plasticity, so as to better demonstrate the respective contributions of different PET architectures (see Fig. 6). The use of task-shared parameters in HiDe-PET largely improves the ALA of new tasks and the FAA of all tasks, but may result in a slight increase in the FFM of old tasks due to the ongoing updates in our default implementation (i.e., FSA+SL). This trend is comparably more pronounced under Sup-Weak, which demands more updating of the pre-trained representations, with FAA, ALA and FFM increasing on average by 6.83%, 9.64% and 2.21%, respectively. We further evaluate alternative implementations of task-shared parameters in Table IV, where FSA incurs no extra forgetting but performs less well than FSA+SL. Users may select the appropriate implementation according to their requirements on these evaluation metrics.

It is noteworthy that previous work such as L2P [5] and DualPrompt [6] also explicitly or implicitly exploits both task-specific and task-shared prompts, but in a sequential manner to instruct the representation learning of each task (corresponding to WTP in our framework). In contrast, our HiDe-PET optimizes these two kinds of parameters in a parallel manner to improve the three hierarchical components, allowing for a more adequate differentiation of the acquired knowledge. Interestingly, using only the task-shared parameters coupled with representation recovery within HiDe-PET (i.e., FSA+SL in Table IV) already achieves better performance than these methods (see Sec. 7 for a conceptual explanation), serving as a strong baseline to evaluate current progress. The inherent connections between task-specific and task-shared parameters are further explored below with a PET hierarchy inspired by OOD detection.

Adaptive Knowledge Accumulation: As analyzed in Sec. 5.2, the use of \boldsymbol{e}_{1},...,\boldsymbol{e}_{t} and \boldsymbol{g} can be seen as a special case tailored for target tasks randomly split from the same dataset, i.e., without an actual task structure. When considering more realistic CL scenarios with apparent similarity and dissimilarity between task distributions, we devise a hierarchy of expandable parameter sets \boldsymbol{g}_{1},...,\boldsymbol{g}_{k} upon OOD detection to achieve adaptive knowledge accumulation (AKA), and focus on LoRA-based PET to examine whether the pre-trained knowledge can evolve flexibly with target tasks in CL. Here we construct such a scenario with the two fine-grained datasets above (i.e., CUB-200 [49] and Cars-196 [50]) and another two (i.e., Dogs-120 [55] and Aircraft-102 [56]), which together cover both natural and artificial objects. Each dataset is randomly split into 10 tasks. We collect 5 tasks per dataset and mix them as a task sequence (20 tasks in total) for CL, while leaving the remaining 5 tasks per dataset for validation.

In CL, the OOD detection threshold \lambda_{{\rm{OOD}}} determines the expansion of parameter sets. As shown in Fig. 7, using a larger \lambda_{{\rm{OOD}}} tends to expand fewer parameter sets, and vice versa. In particular, the result is relatively insensitive to the choice of \lambda_{{\rm{OOD}}}, which consistently constructs one parameter set for each dataset (with \lambda_{{\rm{OOD}}}=0.7 or 0.8) under different pre-trained checkpoints. We then validate the effectiveness of AKA in Table VI. Inspired by a recent work [57], we evaluate the improvement of pre-trained knowledge through the average accuracy of learning the validation tasks under full-shot or few-shot settings. By selecting the most relevant \boldsymbol{g}_{j} and adding it to \theta, the pre-trained backbone f_{\theta} is able to learn each task more effectively. With the improved f_{\theta}, the overall performance of CL (i.e., FAA and CAA) on the mixed task sequence is also significantly enhanced.

Interestingly, the idea of updating the pre-trained backbone with a mixture of LoRA experts [58] has been shown to be effective for accumulating knowledge in multi-task learning, which is consistent with our results. In contrast, the design of OOD detection, FSA+SL and representation recovery enables our HiDe-PET to achieve this purpose in a lifelong manner. Besides, our HiDe-PET can also adapt to task-agnostic CL [3] by expanding \boldsymbol{e}_{1},...,\boldsymbol{e}_{t} upon OOD detection. We leave this as future work.

Figure 7: OOD detection threshold \lambda_{{\rm{OOD}}} and the number of expanded parameter sets \boldsymbol{g}_{1},...,\boldsymbol{g}_{k} under different pre-trained checkpoints.

7 Discussion and Conclusion

In this work, we present a unified framework for CL with PTMs and PET, in order to advance this direction with improved effectiveness and generality. Our framework features a profound integration of theoretical and empirical insights, a broad coverage of relevant techniques, as well as a robust adaptation to different scenarios. Considering the particular impact of pre-trained knowledge on CL, we decompose the CL objective into three hierarchical components, i.e., WTP, TII and TAP, and devise an innovative approach to explicitly optimize them with mainstream PET techniques. During the optimization process, pre-trained representations are effectively instructed via task-specific and task-shared PET architectures, and are efficiently recovered through preserving their statistical information.

Our framework allows for a comprehensive evaluation of the various technical elements inherent in CL with PTMs and PET. Through an extensive empirical investigation, we demonstrate the superior performance of LoRA/Adapter-based PET over Prompt-based PET within both task-specific and task-shared PET architectures, which tends to be more evident under the more challenging scenarios in terms of pre-trained checkpoints and CL benchmarks. We also unravel the distinct behaviors of different PET techniques in response to the three hierarchical components, as well as the respective challenges and complementary effects of different PET architectures. These technical elements are potentially shared by many recent methods, making it possible to dissect their specific implementations and incorporate the most appropriate ones. Owing to the above extensive explorations, our approach achieves remarkably superior performance across various CL scenarios over a wide range of recent strong baselines.

Intriguingly, the correspondence of our approach to the three hierarchical components suggests a more profound connection between existing methods. As discussed in Sec. 3.3, the use of task-specific parameters [5, 6, 8, 7, 13] necessitates learning to predict their identities, equivalent to optimizing the decomposed WTP performance P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\boldsymbol{x}\in\mathcal{X}_{\bar{i}},\mathcal{D},\theta) and TII performance P(\boldsymbol{x}\in\mathcal{X}_{\bar{i}}|\mathcal{D},\theta). In contrast, the use of task-shared parameters [5, 6, 34, 14, 4] needs to overcome catastrophic forgetting, equivalent to optimizing the pre-decomposed performance P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\mathcal{D},\theta). On top of representation learning, the use of representation recovery [34, 4, 44] to rectify the output layer further improves the TAP performance P(\boldsymbol{x}\in\mathcal{X}^{y}|\mathcal{D},\theta). This connection is summarized by the multi-objective optimization problem in Eq. (32), and also explains why our approach clearly outperforms other baselines and why using only task-shared parameters and representation recovery (i.e., FSA+SL in Table IV) is already powerful. Subsequent efforts in CL with PTMs and PET could employ this as a theoretical reference to develop more advanced methods.

Moreover, the hierarchical decomposition and the design of our approach showcase a close relationship with the mechanisms of robust biological CL. In the mammalian brain, the memory of an experience is consolidated through the interplay of the hippocampus and neocortex, known as the complementary learning systems theory [59, 60] that has been widely used to inspire CL in AI. Hippocampus-dependent and neocortex-dependent memories tend to be more specific and more generalized, respectively [61, 62], and the retrieval of these two memory paths is adaptively switched depending on the concrete scenario [63]. Within the hippocampus, the activation of distinct populations of memory cells also undergoes adaptive switching [64], and the neural representations of previous experiences are frequently recovered [65]. The entire process is consistent with the parallel organization of task-specific and task-shared parameters, the exclusive selection of the former, and the representation recovery of task distributions.

In the era of large-scale PTMs, we would emphasize the pressing need for these adaptive algorithms that are designed with machine learning fundamentals and real-world considerations. By leveraging the power of PTMs and the adaptability of CL, we can customize solutions to address the unique challenges posed by specific domains, and envision extending our approach to numerous areas such as healthcare, robotics and industrial manufacturing. Such an elevated goal requires extending the target of CL from homogeneous to heterogeneous tasks, which also provides novel opportunities to explore generalizable knowledge behind them. Taken together, we expect this work to not only facilitate direct applications but also set the stage for the robustness, adaptability and reliability of future AI systems, as a general purpose of CL research.

Acknowledgments

This work was supported by the NSFC Projects (Nos. 62406160, 62350080, 62106123, 62106120, 92370124, 92248303), Beijing Natural Science Foundation L247011, Tsinghua Institute for Guo Qiang, and the High Performance Computing Center, Tsinghua University. L.W. is also supported by the Postdoctoral Fellowship Program of CPSF under Grant Number GZB20230350 and the Shuimu Tsinghua Scholar. J.Z. is also supported by the XPlorer Prize.

References

  • [1] V. V. Ramasesh, A. Lewkowycz, and E. Dyer, “Effect of scale on catastrophic forgetting in neural networks,” in ICLR, 2021.
  • [2] S. V. Mehta  et al., “An empirical investigation of the role of pre-training in lifelong learning,” arXiv preprint arXiv:2112.09153, 2021.
  • [3] L. Wang  et al., “A comprehensive survey of continual learning: Theory, method and application,” IEEE TPAMI, 2024.
  • [4] G. Zhang  et al., “Slca: Slow learner with classifier alignment for continual learning on a pre-trained model,” in ICCV, 2023.
  • [5] Z. Wang  et al., “Learning to prompt for continual learning,” in CVPR, 2022.
  • [6] Z. Wang  et al., “Dualprompt: Complementary prompting for rehearsal-free continual learning,” in ECCV, 2022.
  • [7] Y. Wang, Z. Huang, and X. Hong, “S-prompts learning with pre-trained transformers: An occam’s razor for domain incremental learning,” NeurIPS, 2022.
  • [8] J. S. Smith  et al., “Coda-prompt: Continual decomposed attention-based prompting for rehearsal-free continual learning,” in CVPR, 2023.
  • [9] B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for parameter-efficient prompt tuning,” in EMNLP, 2021.
  • [10] X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” in ACL-IJCNLP, 2021.
  • [11] S.-A. Rebuffi, H. Bilen, and A. Vedaldi, “Learning multiple visual domains with residual adapters,” NeurIPS, 2017.
  • [12] E. J. Hu  et al., “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.
  • [13] L. Wang  et al., “Hierarchical decomposition of prompt-based continual learning: Rethinking obscured sub-optimality,” NeurIPS, 2023.
  • [14] Q. Gao  et al., “A unified continual learning framework with general parameter-efficient tuning,” in ICCV, 2023.
  • [15] G. I. Parisi  et al., “Continual lifelong learning with neural networks: A review,” Neur. Netw., 2019.
  • [16] J. Kirkpatrick  et al., “Overcoming catastrophic forgetting in neural networks,” PNAS, 2017.
  • [17] L. Wang  et al., “Afec: Active forgetting of negative transfer in continual learning,” NeurIPS, 2021.
  • [18] R. Aljundi  et al., “Memory aware synapses: Learning what (not) to forget,” in ECCV, 2018.
  • [19] S.-A. Rebuffi  et al., “icarl: Incremental classifier and representation learning,” in CVPR, 2017.
  • [20] H. Shin  et al., “Continual learning with deep generative replay,” NeurIPS, 2017.
  • [21] L. Wang  et al., “Memory replay with data compression for continual learning,” in ICLR, 2021.
  • [22] Q. Pham, C. Liu, and S. Hoi, “Dualnet: Continual learning, fast and slow,” NeurIPS, 2021.
  • [23] H. Cha, J. Lee, and J. Shin, “Co2l: Contrastive continual learning,” in ICCV, 2021.
  • [24] O. Ostapenko  et al., “Foundational models for continual learning: An empirical study of latent replay,” in CoLLAs, 2022.
  • [25] D. Lopez-Paz and M. Ranzato, “Gradient episodic memory for continual learning,” NeurIPS, 2017.
  • [26] S. Wang  et al., “Training networks in null space of feature covariance for continual learning,” in CVPR, 2021.
  • [27] G. Saha, I. Garg, and K. Roy, “Gradient projection memory for continual learning,” in ICLR, 2020.
  • [28] J. Serra  et al., “Overcoming catastrophic forgetting with hard attention to the task,” in ICML, 2018.
  • [29] L. Wang  et al., “Coscl: Cooperation of small continual learners is stronger than a big one,” in ECCV, 2022.
  • [30] L. Wang  et al., “Incorporating neuro-inspired adaptability for continual learning in artificial intelligence,” Nat. Mach. Intell., 2023.
  • [31] Y. Wu  et al., “Large scale incremental learning,” in CVPR, 2019.
  • [32] J. Knoblauch, H. Husain, and T. Diethe, “Optimal continual learning has perfect memory and is np-hard,” in ICML, 2020.
  • [33] J. He  et al., “Towards a unified view of parameter-efficient transfer learning,” in ICLR, 2021.
  • [34] M. D. McDonnell  et al., “Ranpac: Random projections and pre-trained models for continual learning,” NeurIPS, 2023.
  • [35] M. Jia  et al., “Visual prompt tuning,” in ECCV, 2022.
  • [36] S. Yoo  et al., “Improving visual prompt tuning for self-supervised vision transformers,” in ICML, 2023.
  • [37] G. M. Van de Ven and A. S. Tolias, “Three scenarios for continual learning,” arXiv preprint arXiv:1904.07734, 2019.
  • [38] A. Vaswani  et al., “Attention is all you need,” NeurIPS, 2017.
  • [39] A. Dosovitskiy  et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in ICLR, 2020.
  • [40] X. Ma  et al., “When visual prompt tuning meets source-free domain adaptive semantic segmentation,” NeurIPS, 2023.
  • [41] G. Kim  et al., “A theoretical study on solving continual learning,” NeurIPS, 2022.
  • [42] J. Yang  et al., “Generalized out-of-distribution detection: A survey,” arXiv preprint arXiv:2110.11334, 2021.
  • [43] A. Panos  et al., “First session adaptation: A strong replay-free baseline for class-incremental learning,” arXiv preprint arXiv:2303.13199, 2023.
  • [44] Q. Tran  et al., “Koppa: Improving prompt-based continual learning with key-query orthogonal projection and prototype-based one-versus-all,” arXiv preprint arXiv:2311.15414, 2023.
  • [45] Y. Sun  et al., “Out-of-distribution detection with deep nearest neighbors,” in ICML, 2022.
  • [46] A. Krizhevsky, G. Hinton  et al., “Learning multiple layers of features from tiny images,” Tech. Rep., 2009.
  • [47] D. Hendrycks, S. Basart  et al., “The many faces of robustness: A critical analysis of out-of-distribution generalization,” in ICCV, 2021.
  • [48] T. Ridnik  et al., “Imagenet-21k pretraining for the masses,” arXiv preprint arXiv:2104.10972, 2021.
  • [49] C. Wah  et al., “The caltech-ucsd birds-200-2011 dataset,” 2011.
  • [50] J. Krause  et al., “3d object representations for fine-grained categorization,” in ICCVW, 2013.
  • [51] J. Zhou  et al., “Image bert pre-training with online tokenizer,” in ICLR, 2021.
  • [52] G. Kim, B. Liu, and Z. Ke, “A multi-head model for continual learning via out-of-distribution replay,” in CoLLAs, 2022.
  • [53] M. Oquab  et al., “Dinov2: Learning robust visual features without supervision,” TMLR.
  • [54] Y.-M. Tang, Y.-X. Peng, and W.-S. Zheng, “When prompt-based incremental learning does not meet strong pretraining,” in ICCV, 2023.
  • [55] A. Khosla  et al., “Novel dataset for fine-grained image categorization: Stanford dogs,” in CVPRW, 2011.
  • [56] S. Maji  et al., “Fine-grained visual classification of aircraft,” arXiv preprint arXiv:1306.5151, 2013.
  • [57] W. Liao  et al., “Does continual learning meet compositionality? new benchmarks and an evaluation framework,” NeurIPS, 2023.
  • [58] X. Wu, S. Huang, and F. Wei, “Mole: Mixture of lora experts,” in ICLR, 2023.
  • [59] D. Kumaran, D. Hassabis, and J. L. McClelland, “What learning systems do intelligent agents need? complementary learning systems theory updated,” Trends Cogn. Sci., 2016.
  • [60] J. L. McClelland, B. L. McNaughton, and R. C. O’Reilly, “Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory.” Psychol. Rev., 1995.
  • [61] P. W. Frankland and B. Bontempi, “The organization of recent and remote memories,” Nature Rev. Neurosci., 2005.
  • [62] L. Wang  et al., “Triple-memory networks: A brain-inspired method for continual learning,” IEEE TNNLS, 2021.
  • [63] I. Goshen  et al., “Dynamics of retrieval strategies for remote memories,” Cell, 2011.
  • [64] B. Lei  et al., “Social experiences switch states of memory engrams through regulating hippocampal rac1 activity,” PNAS, 2022.
  • [65] D. Kudithipudi  et al., “Biological underpinnings for lifelong learning machines,” Nat. Mach. Intell., 2022.
  • [66] Y. Zuo  et al., “Hierarchical prompts for rehearsal-free continual learning,” arXiv preprint arXiv:2401.11544, 2024.
Liyuan Wang is currently a postdoctoral researcher at Tsinghua University, working with Prof. Jun Zhu at the Department of Computer Science and Technology. Before that, he received the B.S. and Ph.D. degrees from Tsinghua University. His research interests include continual learning, incremental learning, lifelong learning and brain-inspired AI. His work in continual learning has been published in major conferences and journals in related fields, such as Nature Machine Intelligence, IEEE TPAMI, IEEE TNNLS, NeurIPS, ICLR, CVPR, ICCV, ECCV, etc.
Jingyi Xie is currently a research engineer in Prof. Jun Zhu's group. She received the B.Sc. and M.Sc. degrees from the Department of Mathematics and Statistics, Wuhan University, Wuhan, China. Her current research interests include representation learning, continual learning, and deep learning.
Xingxing Zhang received the B.E. degree in 2015 and the Ph.D. degree in 2020 from the Institute of Information Science, Beijing Jiaotong University. She was a visiting student with the Department of Computer Science, University of Rochester, USA, from 2018 to 2019, and a postdoc in the Department of Computer Science, Tsinghua University, from 2020 to 2022. Her research interests include continual learning and zero/few-shot learning. She received the excellent Ph.D. thesis award from the Chinese Institute of Electronics in 2020.
Hang Su, IEEE member, is an associate professor in the Department of Computer Science and Technology at Tsinghua University. His research interests lie in adversarial machine learning and robust computer vision, on which he has published more than 50 papers, including at CVPR, ECCV, TMI, etc. He has served as an area chair for NeurIPS and a workshop co-chair for AAAI 2022. He received the "Young Investigator Award" from MICCAI 2012, the "Best Paper Award" at AVSS 2012, and the "Platinum Best Paper Award" at ICME 2018.
Jun Zhu received his B.S. and Ph.D. degrees from the Department of Computer Science and Technology at Tsinghua University, where he is currently a Bosch AI professor. He was an adjunct faculty member and postdoctoral fellow in the Machine Learning Department, Carnegie Mellon University. His research interest is primarily in developing machine learning methods to understand scientific and engineering data arising from various fields. He regularly serves as a Senior Area Chair and Area Chair at prestigious conferences, including ICML, NeurIPS, ICLR, IJCAI and AAAI. He was selected as one of "AI's 10 to Watch" by IEEE Intelligent Systems. He is a Fellow of the IEEE and an associate editor-in-chief of IEEE TPAMI.

Appendix A Theoretical Foundation I

In this section, we present the complete proof of our hierarchical decomposition under different CL scenarios.

A.1 Class-Incremental Learning (CIL)

Proof of Theorem 1
For CIL with pre-training, assume \mathbb{E}_{\boldsymbol{x}}[{H}_{\rm{WTP}}(\boldsymbol{x})]\leq\delta, \mathbb{E}_{\boldsymbol{x}}[{H}_{\rm{TII}}(\boldsymbol{x})]\leq\epsilon, and \mathbb{E}_{\boldsymbol{x}}[{H}_{\rm{TAP}}(\boldsymbol{x})]\leq\eta. Let y\in\mathcal{Y}_{\bar{i},\bar{j}} be the ground truth of an \boldsymbol{x}, where \bar{i}\in[t] and \bar{j}\in[|\mathcal{Y}_{\bar{i}}|] denote the task identity and within-task index, respectively.

As we defined,

\displaystyle\begin{split}{H}_{\rm{WTP}}(\boldsymbol{x})&=\mathcal{H}(\boldsymbol{1}_{\bar{j}},\{P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},j}|\boldsymbol{x}\in\mathcal{X}_{\bar{i}},\mathcal{D},\theta)\}_{j})\\ &=-\log P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\boldsymbol{x}\in\mathcal{X}_{\bar{i}},\mathcal{D},\theta),\end{split} (20)
\displaystyle\begin{split}{H}_{\rm{TII}}(\boldsymbol{x})&=\mathcal{H}(\boldsymbol{1}_{\bar{i}},\{P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)\}_{i})\\ &=-\log P(\boldsymbol{x}\in\mathcal{X}_{\bar{i}}|\mathcal{D},\theta),\end{split} (21)
\displaystyle\begin{split}{H}_{\rm{TAP}}(\boldsymbol{x})&=\mathcal{H}(\boldsymbol{1}_{y},\{P(\boldsymbol{x}\in\mathcal{X}^{c}|\mathcal{D},\theta)\}_{c})\\ &=-\log P(\boldsymbol{x}\in\mathcal{X}^{y}|\mathcal{D},\theta).\end{split} (22)

Then, we have

\displaystyle\begin{split}&\mathcal{H}(\boldsymbol{1}_{\bar{i},\bar{j}},\{P(\boldsymbol{x}\in\mathcal{X}_{{i},{j}}|\mathcal{D},\theta)\}_{{i},{j}})\\ &=-\log P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\mathcal{D},\theta)\\ &=-\log(P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\boldsymbol{x}\in\mathcal{X}_{\bar{i}},\mathcal{D},\theta)P(\boldsymbol{x}\in\mathcal{X}_{\bar{i}}|\mathcal{D},\theta))\\ &=-\log P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\boldsymbol{x}\in\mathcal{X}_{\bar{i}},\mathcal{D},\theta)-\log P(\boldsymbol{x}\in\mathcal{X}_{\bar{i}}|\mathcal{D},\theta)\\ &={H}_{\rm{WTP}}(\boldsymbol{x})+{H}_{\rm{TII}}(\boldsymbol{x}).\end{split} (23)

Taking expectations on Eq. (22), we have

\displaystyle\begin{split}\mathcal{L}_{1}=\mathbb{E}_{\boldsymbol{x}}[{H}_{\rm{TAP}}(\boldsymbol{x})]\leq\eta.\end{split} (24)

Taking expectations on both sides of Eq. (23), we have

\displaystyle\begin{split}\mathcal{L}_{2}&=\mathbb{E}_{\boldsymbol{x}}[\mathcal{H}(\boldsymbol{1}_{\bar{i},\bar{j}},\{P(\boldsymbol{x}\in\mathcal{X}_{{i},{j}}|\mathcal{D},\theta)\}_{{i},{j}})]\\ &=\mathbb{E}_{\boldsymbol{x}}[{H}_{\rm{WTP}}(\boldsymbol{x})]+\mathbb{E}_{\boldsymbol{x}}[{H}_{\rm{TII}}(\boldsymbol{x})]\\ &\leq\delta+\epsilon.\end{split} (25)

Considering the multi-objective optimization problem \max[P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\mathcal{D},\theta),P(\boldsymbol{x}\in\mathcal{X}^{y}|\mathcal{D},\theta)] in Eq. (32), we have the loss error

\displaystyle\begin{split}\mathcal{L}&=\max\{\mathbb{E}_{\boldsymbol{x}}[\mathcal{H}(\boldsymbol{1}_{\bar{i},\bar{j}},\{P(\boldsymbol{x}\in\mathcal{X}_{{i},{j}}|\mathcal{D},\theta)\}_{{i},{j}})],\mathbb{E}_{\boldsymbol{x}}[{H}_{\rm{TAP}}(\boldsymbol{x})]\}\\ &=\max\{\mathcal{L}_{2},\mathcal{L}_{1}\}\\ &\leq\max\{\delta+\epsilon,\eta\}.\end{split} (26)

This finishes the proof.
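The additive decomposition in Eq. (23) is easy to verify numerically on a toy joint distribution; the probabilities below are arbitrary illustrative values.

import numpy as np

# Toy setting: 2 tasks x 2 within-task classes; P[i, j] = P(x in X_{i,j} | D, theta).
P = np.array([[0.5, 0.2],
              [0.2, 0.1]])
i_bar, j_bar = 0, 0                        # ground-truth task identity and within-task index

P_task = P.sum(axis=1)                     # P(x in X_i | D, theta)
H_TII = -np.log(P_task[i_bar])
H_WTP = -np.log(P[i_bar, j_bar] / P_task[i_bar])
H_joint = -np.log(P[i_bar, j_bar])         # cross-entropy of the CIL prediction

assert np.isclose(H_joint, H_WTP + H_TII)  # Eq. (23): CIL cross-entropy = WTP + TII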

Proof of Theorem 2
For CIL with pre-training, its loss error \mathcal{L}\leq\xi. Assume \boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}\subseteq\mathcal{X}_{\bar{i}}. According to the proof of Theorem 1, we have

\displaystyle\begin{split}{H}_{\rm{WTP}}(\boldsymbol{x})&=-\log P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\boldsymbol{x}\in\mathcal{X}_{\bar{i}},\mathcal{D},\theta)\\ &=-\log\frac{P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\mathcal{D},\theta)}{P(\boldsymbol{x}\in\mathcal{X}_{\bar{i}}|\mathcal{D},\theta)}\\ &\leq-\log P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\mathcal{D},\theta)\\ &=\mathcal{H}(\boldsymbol{1}_{\bar{i},\bar{j}},\{P(\boldsymbol{x}\in\mathcal{X}_{{i},{j}}|\mathcal{D},\theta)\}_{{i},{j}})\\ &=\mathcal{L}_{2}\leq\xi.\end{split} (27)

Likewise, we have

\displaystyle\begin{split}{H}_{\rm{TII}}(\boldsymbol{x})&=-\log P(\boldsymbol{x}\in\mathcal{X}_{\bar{i}}|\mathcal{D},\theta)\\ &=-\log\frac{P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\mathcal{D},\theta)}{P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\boldsymbol{x}\in\mathcal{X}_{\bar{i}},\mathcal{D},\theta)}\\ &\leq-\log P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\mathcal{D},\theta)\\ &=\mathcal{H}(\boldsymbol{1}_{\bar{i},\bar{j}},\{P(\boldsymbol{x}\in\mathcal{X}_{{i},{j}}|\mathcal{D},\theta)\}_{{i},{j}})\\ &=\mathcal{L}_{2}\leq\xi.\end{split} (28)

Considering the multi-objective optimization problem \max[P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\mathcal{D},\theta),P(\boldsymbol{x}\in\mathcal{X}^{y}|\mathcal{D},\theta)] in Eq. (32), each component must guarantee a loss error less than \xi, i.e.,

\displaystyle\begin{split}{H}_{\rm{TAP}}(\boldsymbol{x})&=-\log P(\boldsymbol{x}\in\mathcal{X}^{y}|\mathcal{D},\theta)\\ &=\mathcal{L}_{1}\leq\xi.\end{split} (29)

This finishes the proof.

A.2 Domain-Incremental Learning (DIL)

For DIL, let each "task" (for naming consistency, we still use "task" to denote the sequentially arriving "domain" in DIL) \mathcal{D}_{i} consist of domain \mathcal{X}_{i}=\bigcup_{j}\mathcal{X}_{i,j} and label \mathcal{Y}_{i}=\bigcup_{j}\mathcal{Y}_{i,j}, where j\in[|\mathcal{Y}_{i}|] denotes the j-th class in task i\in[t], and \mathcal{Y}_{i}=\mathcal{Y}_{i^{\prime}} for \forall i\neq i^{\prime}. Similar to the analysis of CIL, the goal is to learn a projection from \mathcal{X}=\bigcup_{i=1}^{t}\mathcal{X}_{i} to \mathcal{Y}=\bigcup_{i=1}^{t}\mathcal{Y}_{i} so as to achieve TAP. When training from scratch, the TAP performance P(\boldsymbol{x}\in\mathcal{X}^{y}|\mathcal{D}) is to predict across all classes without distinguishing tasks, where \mathcal{D}=\{\mathcal{D}_{1},...,\mathcal{D}_{t}\}, y\in[|\bigcup_{i=1}^{t}\mathcal{Y}_{i}|] denotes the ground-truth label of \boldsymbol{x}, and \mathcal{X}^{y} denotes the domain of class y. Given the assumption of disjoint domains, the DIL probability can be expressed as a hierarchical process of TII and WTP:

\underbrace{P(\boldsymbol{x}\in\mathcal{X}_{*,j}|\mathcal{D})}_{\text{DIL}}=\sum_{i}\underbrace{P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D})}_{\text{TII}}\underbrace{P(\boldsymbol{x}\in\mathcal{X}_{i,j}|\boldsymbol{x}\in\mathcal{X}_{i},\mathcal{D})}_{\text{WTP}}. (30)

In this case, the TAP performance P(\boldsymbol{x}\in\mathcal{X}^{y}|\mathcal{D}) is essentially equivalent to the DIL performance P(\boldsymbol{x}\in\mathcal{X}_{*,\bar{j}}|\mathcal{D},\theta), where \bar{j}\in[|\mathcal{Y}_{t}|] denotes the ground truth of an \boldsymbol{x} w.r.t. the within-task index.

When considering the pre-trained knowledge carried by parameters \theta, the TAP is redefined as P(\boldsymbol{x}\in\mathcal{X}^{y}|\mathcal{D},\theta), while the DIL probability of TII and WTP is re-written as

\underbrace{P(\boldsymbol{x}\in\mathcal{X}_{*,j}|\mathcal{D},\theta)}_{\text{DIL}}=\sum_{i}\underbrace{P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)}_{\text{TII}}\underbrace{P(\boldsymbol{x}\in\mathcal{X}_{i,j}|\boldsymbol{x}\in\mathcal{X}_{i},\mathcal{D},\theta)}_{\text{WTP}}. (31)

It can be seen that both the TAP performance P(\boldsymbol{x}\in\mathcal{X}^{y}|\mathcal{D},\theta) and the DIL performance P(\boldsymbol{x}\in\mathcal{X}_{*,\bar{j}}|\mathcal{D},\theta) are now affected by \theta, but in different ways. Therefore, we propose to further optimize TAP along with the improved TII and WTP, formulating the ultimate goal of DIL as a multi-objective optimization problem, i.e.,

\max[\,\underbrace{P(\boldsymbol{x}\in\mathcal{X}_{*,\bar{j}}|\mathcal{D},\theta)}_{\text{DIL}},\underbrace{P(\boldsymbol{x}\in\mathcal{X}^{y}|\mathcal{D},\theta)}_{\text{TAP}}\,]. (32)

We further derive the following theorems in terms of the sufficient and necessary conditions for improving CL.

Theorem 4

For DIL with pre-training, if \mathbb{E}_{\boldsymbol{x}}[{H}_{\rm{WTP}}(\boldsymbol{x})]\leq\delta, \mathbb{E}_{\boldsymbol{x}}[{H}_{\rm{TII}}(\boldsymbol{x})]\leq\epsilon, and \mathbb{E}_{\boldsymbol{x}}[{H}_{\rm{TAP}}(\boldsymbol{x})]\leq\eta, we have the loss error \mathcal{L}\in[0,\max\{\delta+\epsilon+\log t,\eta\}], regardless of whether the WTP predictor, TII predictor and TAP predictor are trained together or separately.

Proof of Theorem 4
As similarly defined in CIL, here

\displaystyle\begin{split}{H}_{\rm{WTP}}(\boldsymbol{x})&=\mathcal{H}(\boldsymbol{1}_{\bar{j}},\{P(\boldsymbol{x}\in\mathcal{X}_{i,j}|\boldsymbol{x}\in\mathcal{X}_{i},\mathcal{D},\theta)\}_{j})\\ &=-\log P(\boldsymbol{x}\in\mathcal{X}_{i,\bar{j}}|\boldsymbol{x}\in\mathcal{X}_{i},\mathcal{D},\theta),\end{split} (33)
\displaystyle\begin{split}{H}_{\rm{TII}}(\boldsymbol{x})&=\mathcal{H}(\gamma,\{P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)\}_{i})\\ &=-\sum_{i}\gamma_{i}\log P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta),\end{split} (34)
\displaystyle\begin{split}{H}_{\rm{TAP}}(\boldsymbol{x})&=\mathcal{H}(\boldsymbol{1}_{y},\{P(\boldsymbol{x}\in\mathcal{X}^{c}|\mathcal{D},\theta)\}_{c})\\ &=-\log P(\boldsymbol{x}\in\mathcal{X}^{y}|\mathcal{D},\theta),\end{split} (35)

where \gamma=\{\gamma_{i}\}_{i=1}^{t} represents the probability of \boldsymbol{x} belonging to each observed domain, with \gamma_{i}\in[0,1] and \sum_{i}\gamma_{i}=1.

Then, for any simplex \gamma, we have

\displaystyle\begin{split}&\mathcal{H}(\boldsymbol{1}_{\bar{j}},\{P(\boldsymbol{x}\in\mathcal{X}_{*,{j}}|\mathcal{D},\theta)\}_{{j}})\\ &=-\log P(\boldsymbol{x}\in\mathcal{X}_{*,\bar{j}}|\mathcal{D},\theta)\\ &=-\log(\sum_{i}P(\boldsymbol{x}\in\mathcal{X}_{i,\bar{j}}|\boldsymbol{x}\in\mathcal{X}_{i},\mathcal{D},\theta)P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta))\\ &\leq-\sum_{i}\gamma_{i}\log(\frac{P(\boldsymbol{x}\in\mathcal{X}_{i,\bar{j}}|\boldsymbol{x}\in\mathcal{X}_{i},\mathcal{D},\theta)P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)}{\gamma_{i}})\\ &=-\sum_{i}\gamma_{i}\log P(\boldsymbol{x}\in\mathcal{X}_{i,\bar{j}}|\boldsymbol{x}\in\mathcal{X}_{i},\mathcal{D},\theta)\\ &-\sum_{i}\gamma_{i}\log P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)+\sum_{i}\gamma_{i}\log(\gamma_{i})\\ &\leq\sum_{i}\gamma_{i}{H}_{\rm{WTP}}+{H}_{\rm{TII}}+\mathcal{H}(\gamma).\end{split} (36)

Taking expectations on Eq. (35), we have

\displaystyle\mathcal{L}_{1}=\mathbb{E}_{\boldsymbol{x}}[{H}_{\rm{TAP}}(\boldsymbol{x})]\leq\eta. (37)

Taking expectations on both sides of Eq. (36), we have

\displaystyle\begin{split}\mathcal{L}_{2}&=\mathbb{E}_{\boldsymbol{x}}[\mathcal{H}(\boldsymbol{1}_{\bar{j}},\{P(\boldsymbol{x}\in\mathcal{X}_{*,{j}}|\mathcal{D},\theta)\}_{j})]\\ &\leq\sum_{i}\gamma_{i}\mathbb{E}_{\boldsymbol{x}}[{H}_{\rm{WTP}}(\boldsymbol{x})]+\mathbb{E}_{\boldsymbol{x}}[{H}_{\rm{TII}}(\boldsymbol{x})]+\mathcal{H}(\gamma)\\ &\leq\delta+\epsilon+\log t.\end{split} (38)

Considering the multi-objective optimization problem $\max[P(\boldsymbol{x}\in\mathcal{X}_{*,\bar{j}}|\mathcal{D},\theta),P(\boldsymbol{x}\in\mathcal{X}^{y}|\mathcal{D},\theta)]$, we have the loss error

\displaystyle\begin{split}\mathcal{L}&=\max\{\mathbb{E}_{\boldsymbol{x}}[\mathcal{H}(\boldsymbol{1}_{\bar{j}},\{P(\boldsymbol{x}\in\mathcal{X}_{*,{j}}|\mathcal{D},\theta)\}_{{j}})],\mathbb{E}_{\boldsymbol{x}}[{H}_{\rm{TAP}}(\boldsymbol{x})]\}\\ &=\max\{\mathcal{L}_{2},\mathcal{L}_{1}\}\\ &\leq\max\{\delta+\epsilon+\log t,\,\eta\}.\end{split} (39)

This finishes the proof.
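As a quick numerical sanity check of the decomposition used in Eqs. (36)-(38), the snippet below draws random WTP and TII distributions and verifies that the cross-entropy of the marginalized DIL prediction never exceeds the right-hand side of Eq. (36) under a uniform choice of $\gamma$. The synthetic setup (random Dirichlet distributions, uniform $\gamma$) is purely illustrative.

# Sanity check (assumptions: synthetic random distributions; gamma chosen uniform,
# so H(gamma) = log t and the bound matches the delta + epsilon + log t form).
import numpy as np

rng = np.random.default_rng(0)
t, J = 5, 8                                  # number of domains / within-task classes
for _ in range(1000):
    p_tii = rng.dirichlet(np.ones(t))        # P(x in X_i | D, theta)
    p_wtp = rng.dirichlet(np.ones(J), t)     # P(x in X_{i,j} | x in X_i, D, theta), per domain
    j_bar = rng.integers(J)                  # ground-truth within-task label
    lhs = -np.log(np.sum(p_wtp[:, j_bar] * p_tii))   # cross-entropy of the marginal prediction
    gamma = np.full(t, 1.0 / t)                       # uniform simplex weights
    h_wtp = -np.log(p_wtp[:, j_bar])                  # per-domain WTP cross-entropy
    h_tii = -np.sum(gamma * np.log(p_tii))            # TII cross-entropy under gamma
    rhs = np.sum(gamma * h_wtp) + h_tii + np.log(t)   # bound of Eq. (36) with H(gamma) = log t
    assert lhs <= rhs + 1e-9
print("Eq. (36) bound holds on all sampled cases.")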

Theorem 5

For DIL with pre-training, if the loss error $\mathcal{L}\leq\xi$, then there always exist (1) a WTP predictor, s.t. ${H}_{\rm{WTP}}\leq\xi$; (2) a TII predictor, s.t. ${H}_{\rm{TII}}\leq\xi$; and (3) a TAP predictor, s.t. ${H}_{\rm{TAP}}\leq\xi$.

Proof of Theorem 5
For DIL with pre-training, its loss error $\mathcal{L}=\max[\mathcal{L}_{1},\mathcal{L}_{2}]\leq\xi$. Assume $\boldsymbol{x}\in\mathcal{X}_{*,\bar{j}}\subseteq\mathcal{X}^{y}$. According to the proof of Theorem 4, if we define $P(\boldsymbol{x}\in\mathcal{X}_{i,\bar{j}}|\mathcal{D},\theta)=P(\boldsymbol{x}\in\mathcal{X}_{*,\bar{j}}|\mathcal{D},\theta)$, we have

\displaystyle\begin{split}{H}_{\rm{WTP}}(\boldsymbol{x})&=-\log P(\boldsymbol{x}\in\mathcal{X}_{i,\bar{j}}|\boldsymbol{x}\in\mathcal{X}_{i},\mathcal{D},\theta)\\ &=-\log\frac{P(\boldsymbol{x}\in\mathcal{X}_{i,\bar{j}}|\mathcal{D},\theta)}{P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)}\\ &\leq-\log P(\boldsymbol{x}\in\mathcal{X}_{i,\bar{j}}|\mathcal{D},\theta)\\ &=-\log P(\boldsymbol{x}\in\mathcal{X}_{*,\bar{j}}|\mathcal{D},\theta)\\ &=\mathcal{H}(\boldsymbol{1}_{\bar{j}},\{P(\boldsymbol{x}\in\mathcal{X}_{*,{j}}|\mathcal{D},\theta)\}_{{j}})\\ &=\mathcal{L}_{2}\leq\xi.\end{split} (40)

Likewise, if we define $\gamma_{i}=1$ and $\gamma_{i^{\prime}}=0$ for all $i^{\prime}\neq i$, $i^{\prime}\in[t]$, we have

\displaystyle\begin{split}{H}_{\rm{TII}}(\boldsymbol{x})&=-\sum_{i}\gamma_{i}\log P(\boldsymbol{x}\in\mathcal{X}_{{i}}|\mathcal{D},\theta)\\ &=-\log P(\boldsymbol{x}\in\mathcal{X}_{{i}}|\mathcal{D},\theta)\\ &=-\log\frac{P(\boldsymbol{x}\in\mathcal{X}_{i,\bar{j}}|\mathcal{D},\theta)}{P(\boldsymbol{x}\in\mathcal{X}_{i,\bar{j}}|\boldsymbol{x}\in\mathcal{X}_{i},\mathcal{D},\theta)}\\ &\leq-\log P(\boldsymbol{x}\in\mathcal{X}_{i,\bar{j}}|\mathcal{D},\theta)\\ &=-\log P(\boldsymbol{x}\in\mathcal{X}_{*,\bar{j}}|\mathcal{D},\theta)\\ &=\mathcal{H}(\boldsymbol{1}_{\bar{j}},\{P(\boldsymbol{x}\in\mathcal{X}_{*,{j}}|\mathcal{D},\theta)\}_{{j}})\\ &=\mathcal{L}_{2}\leq\xi.\end{split} (41)

Considering the multi-objective optimization problem $\max[P(\boldsymbol{x}\in\mathcal{X}_{*,\bar{j}}|\mathcal{D},\theta),P(\boldsymbol{x}\in\mathcal{X}^{y}|\mathcal{D},\theta)]$, each component must keep its loss error no greater than $\xi$, i.e.,

\displaystyle\begin{split}{H}_{\rm{TAP}}(\boldsymbol{x})&=-\log P(\boldsymbol{x}\in\mathcal{X}^{y}|\mathcal{D},\theta)\\ &=\mathcal{L}_{1}\leq\xi.\end{split} (42)

This finishes the proof.

A.3 Task-Incremental Learning (TIL)

For task-incremental learning (TIL), let each task $\mathcal{D}_{i}$ consist of domain $\mathcal{X}_{i}=\bigcup_{j}\mathcal{X}_{i,j}$ and label space $\mathcal{Y}_{i}=\bigcup_{j}\mathcal{Y}_{i,j}$, where $j\in[|\mathcal{Y}_{i}|]$ denotes the $j$-th class in task $i\in[t]$. Unlike CIL and DIL, TIL has the task identity provided during the testing phase. Whether or not the impact of the pre-trained parameters $\theta$ is taken into account, the TAP objective is to learn $P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},j}|\boldsymbol{x}\in\mathcal{X}_{\bar{i}},\mathcal{D})$ or $P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},j}|\boldsymbol{x}\in\mathcal{X}_{\bar{i}},\mathcal{D},\theta)$, where $\mathcal{D}=\{\mathcal{D}_{1},...,\mathcal{D}_{t}\}$, $\bar{i}\in[t]$, and $j\in[|\mathcal{Y}_{\bar{i}}|]$. In fact, this is equivalent to WTP alone. For completeness, we further derive the following theorems in terms of the sufficient and necessary conditions for improving CL.

Theorem 6

For TIL with pre-training, $\mathbb{E}_{\boldsymbol{x}}[{H}_{\rm{TII}}(\boldsymbol{x})]=0$, and TAP degenerates into WTP. If $\mathbb{E}_{\boldsymbol{x}}[{H}_{\rm{WTP}}(\boldsymbol{x})]\leq\delta$, we have the loss error $\mathcal{L}\in[0,\delta]$.

Proof of Theorem 6
For TIL with pre-training, assume $\mathbb{E}_{\boldsymbol{x}}[{H}_{\rm{WTP}}(\boldsymbol{x})]\leq\delta$. Given an $\boldsymbol{x}$ with the task identity $\bar{i}\in[t]$, let $\bar{j}\in[|\mathcal{Y}_{\bar{i}}|]$ be the ground truth of $\boldsymbol{x}$ w.r.t. the within-task index, and $y\in[|\bigcup_{i=1}^{t}\mathcal{Y}_{i}|]$ be the ground-truth label of $\boldsymbol{x}$.

As similarly defined in CIL, here

\displaystyle\begin{split}{H}_{\rm{WTP}}(\boldsymbol{x})&=\mathcal{H}(\boldsymbol{1}_{\bar{j}},\{P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},j}|\boldsymbol{x}\in\mathcal{X}_{\bar{i}},\mathcal{D},\theta)\}_{j})\\ &=-\log P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\boldsymbol{x}\in\mathcal{X}_{\bar{i}},\mathcal{D},\theta), \end{split} (43)
\displaystyle\begin{split}{H}_{\rm{TII}}(\boldsymbol{x})&=\mathcal{H}(\boldsymbol{1}_{\bar{i}},\{P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)\}_{i})\\ &=-\log P(\boldsymbol{x}\in\mathcal{X}_{\bar{i}}|\mathcal{D},\theta)=-\log 1=0, \end{split} (44)
\displaystyle\begin{split}{H}_{\rm{TAP}}(\boldsymbol{x})&=\mathcal{H}(\boldsymbol{1}_{y},\{P(\boldsymbol{x}\in\mathcal{X}^{c}|\boldsymbol{x}\in\mathcal{X}_{\bar{i}},\mathcal{D},\theta)\}_{c})\\ &=-\log P(\boldsymbol{x}\in\mathcal{X}^{y}|\boldsymbol{x}\in\mathcal{X}_{\bar{i}},\mathcal{D},\theta)\\ &={H}_{\rm{WTP}}(\boldsymbol{x}). \end{split} (45)

Then, we have

\displaystyle\begin{split}&\mathcal{H}(\boldsymbol{1}_{\bar{i},\bar{j}},\{P(\boldsymbol{x}\in\mathcal{X}_{{i},{j}}|\mathcal{D},\theta)\}_{{i},{j}})\\ &=-\log P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\mathcal{D},\theta)\\ &=-\log(P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\boldsymbol{x}\in\mathcal{X}_{\bar{i}},\mathcal{D},\theta)P(\boldsymbol{x}\in\mathcal{X}_{\bar{i}}|\mathcal{D},\theta))\\ &=-\log P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\boldsymbol{x}\in\mathcal{X}_{\bar{i}},\mathcal{D},\theta)-\log P(\boldsymbol{x}\in\mathcal{X}_{\bar{i}}|\mathcal{D},\theta)\\ &={H}_{\rm{WTP}}(\boldsymbol{x})+{H}_{\rm{TII}}(\boldsymbol{x})\\ &={H}_{\rm{WTP}}(\boldsymbol{x}).\end{split} (46)

Taking expectations on both sides of Eq. (46), we have

\displaystyle\begin{split}\mathcal{L}&=\mathbb{E}_{\boldsymbol{x}}[\mathcal{H}(\boldsymbol{1}_{\bar{i},\bar{j}},\{P(\boldsymbol{x}\in\mathcal{X}_{{i},{j}}|\mathcal{D},\theta)\}_{{i},{j}})]\\ &=\mathbb{E}_{\boldsymbol{x}}[{H}_{\rm{WTP}}(\boldsymbol{x})]\leq\delta.\end{split} (47)

Considering the TIL objective $P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\boldsymbol{x}\in\mathcal{X}_{\bar{i}},\mathcal{D},\theta)$, we have the loss error $\mathcal{L}\leq\delta$. This finishes the proof.

Theorem 7

For TIL with pre-training, if the loss error $\mathcal{L}\leq\xi$, then there always exists a WTP predictor, s.t. ${H}_{\rm{WTP}}\leq\xi$.

Proof of Theorem 7
For TIL with pre-training, its loss error $\mathcal{L}\leq\xi$. Assume $\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}\subseteq\mathcal{X}_{\bar{i}}$. According to the proof of Theorem 6, we have

\displaystyle\begin{split}{H}_{\rm{WTP}}(\boldsymbol{x})&=-\log P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\boldsymbol{x}\in\mathcal{X}_{\bar{i}},\mathcal{D},\theta)\\ &=-\log\frac{P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\mathcal{D},\theta)}{P(\boldsymbol{x}\in\mathcal{X}_{\bar{i}}|\mathcal{D},\theta)}\\ &\leq-\log P(\boldsymbol{x}\in\mathcal{X}_{\bar{i},\bar{j}}|\mathcal{D},\theta)\\ &=\mathcal{H}(\boldsymbol{1}_{\bar{i},\bar{j}},\{P(\boldsymbol{x}\in\mathcal{X}_{{i},{j}}|\mathcal{D},\theta)\}_{{i},{j}})\\ &\leq\xi.\end{split} (48)

This finishes the proof.

Appendix B Theoretical Foundation II

In this section, we first present the complete proof connecting TII to OOD detection, and then derive the sufficient and necessary conditions for improving CL with WTP, OOD detection, and TAP.

B.1 TII to OOD Detection

Proof of Theorem 3
For CL in a pre-training context, define the TII probability as $P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)=\frac{P_{i}(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)}{\sum_{j}P_{j}(\boldsymbol{x}\in\mathcal{X}_{j}|\mathcal{D},\theta)}$. If $H_{{\rm{OOD}},i}(\boldsymbol{x})\leq\epsilon_{i}$ for $i\in[t]$, then we have

\begin{split}&H_{{\rm{OOD}},i}(\boldsymbol{x})=\\ &\begin{cases}\mathcal{H}(1,P_{i}(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta))=-\log P_{i}(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)\leq\epsilon_{i},&\boldsymbol{x}\in\mathcal{X}_{i}\\ \mathcal{H}(0,P_{i}(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta))=-\log P_{i}(\boldsymbol{x}\notin\mathcal{X}_{i}|\mathcal{D},\theta)\leq\epsilon_{i},&\boldsymbol{x}\notin\mathcal{X}_{i}\end{cases}.\end{split} (49)

This means $P_{i}(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)\geq e^{-\epsilon_{i}}$ for $\boldsymbol{x}\in\mathcal{X}_{i}$, and $P_{i}(\boldsymbol{x}\notin\mathcal{X}_{i}|\mathcal{D},\theta)\geq e^{-\epsilon_{i}}$ (i.e., $P_{i}(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)\leq 1-e^{-\epsilon_{i}}$) for $\boldsymbol{x}\notin\mathcal{X}_{i}$.

Let $\bar{i}\in[t]$ denote the task identity of $\boldsymbol{x}$; then we have

\displaystyle\begin{split}{H}_{\rm{TII}}(\boldsymbol{x})&=\mathcal{H}(\boldsymbol{1}_{\bar{i}},\{P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)\}_{i})\\ &=-\log P(\boldsymbol{x}\in\mathcal{X}_{\bar{i}}|\mathcal{D},\theta)\\ &=-\log\frac{P_{\bar{i}}(\boldsymbol{x}\in\mathcal{X}_{\bar{i}}|\mathcal{D},\theta)}{\sum_{j}P_{j}(\boldsymbol{x}\in\mathcal{X}_{j}|\mathcal{D},\theta)}\\ &=\log[1+\frac{\sum_{j\neq{\bar{i}}}P_{j}(\boldsymbol{x}\in\mathcal{X}_{j}|\mathcal{D},\theta)}{P_{\bar{i}}(\boldsymbol{x}\in\mathcal{X}_{\bar{i}}|\mathcal{D},\theta)}]\\ &\leq\log[1+\frac{\sum_{j\neq{\bar{i}}}(1-e^{-\epsilon_{j}})}{e^{-\epsilon_{\bar{i}}}}]\\ &=\log[1+{e^{\epsilon_{\bar{i}}}}\sum_{j\neq{\bar{i}}}(1-e^{-\epsilon_{j}})]\\ &\leq{e^{\epsilon_{\bar{i}}}}\sum_{j\neq{\bar{i}}}(1-e^{-\epsilon_{j}})\\ &=(\sum_{i}\boldsymbol{1}_{\boldsymbol{x}\in\mathcal{X}_{i}}e^{\epsilon_{i}})(\sum_{i}\boldsymbol{1}_{\boldsymbol{x}\notin\mathcal{X}_{i}}(1-e^{-\epsilon_{i}})).\end{split} (50)

The last inequality holds since $\log(1+z)\leq z$ for $z\geq 0$.

Now, let us move on to the other direction, from TII to OOD detection. If $H_{\rm{TII}}(\boldsymbol{x})\leq\epsilon$, then we have

\begin{split}{H}_{\rm{TII}}(\boldsymbol{x})&=\mathcal{H}(\boldsymbol{1}_{\bar{i}},\{P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)\}_{i})\\ &=-\log P(\boldsymbol{x}\in\mathcal{X}_{\bar{i}}|\mathcal{D},\theta)\leq\epsilon.\end{split} (51)

Further, for $P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)=\frac{P_{i}(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)}{\sum_{j}P_{j}(\boldsymbol{x}\in\mathcal{X}_{j}|\mathcal{D},\theta)}$, we have

\begin{split}H_{{\rm{OOD}},i}(\boldsymbol{x})&=\mathcal{H}(1,P_{i}(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta))\\ &=-\log P_{i}(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)\\ &=-\log(P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta){\sum_{j}P_{j}(\boldsymbol{x}\in\mathcal{X}_{j}|\mathcal{D},\theta)})\\ &=-\log P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)-\log{\sum_{j}P_{j}(\boldsymbol{x}\in\mathcal{X}_{j}|\mathcal{D},\theta)}\\ &={H}_{\rm{TII}}(\boldsymbol{x})-\log{\sum_{j}P_{j}(\boldsymbol{x}\in\mathcal{X}_{j}|\mathcal{D},\theta)}\\ &\leq\epsilon.\end{split} (52)

The last inequality holds since $\sum_{j}P_{j}(\boldsymbol{x}\in\mathcal{X}_{j}|\mathcal{D},\theta)\geq 1$.

Likewise, for $\boldsymbol{x}\notin\mathcal{X}_{i}$, we have

\begin{split}H_{{\rm{OOD}},i}(\boldsymbol{x})&=\mathcal{H}(0,P_{i}(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta))\\ &=-\log P_{i}(\boldsymbol{x}\notin\mathcal{X}_{i}|\mathcal{D},\theta)\\ &=-\log(1-P_{i}(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta))\\ &=-\log(1-P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta){\sum_{j}P_{j}(\boldsymbol{x}\in\mathcal{X}_{j}|\mathcal{D},\theta)})\\ &\leq-\log P(\boldsymbol{x}\in\mathcal{X}_{\bar{i}}|\mathcal{D},\theta)\\ &\leq\epsilon.\end{split} (53)

This finishes the proof.
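For intuition, a small numerical check of the direction established in Eq. (50) is sketched below: per-task OOD probabilities whose errors are bounded by $\epsilon_i$ yield a normalized TII probability whose cross-entropy respects the stated bound. The synthetic scores and the sampled values of $\epsilon_i$ are illustrative assumptions.

# Sanity check of Eq. (50) (assumptions: synthetic per-task OOD probabilities
# that satisfy the epsilon_i constraints by construction).
import numpy as np

rng = np.random.default_rng(1)
t = 5
eps = rng.uniform(0.05, 0.3, size=t)              # per-task OOD error budgets epsilon_i
for _ in range(1000):
    i_bar = rng.integers(t)                       # true task identity of x
    p_ood = np.empty(t)                           # P_i(x in X_i | D, theta)
    for i in range(t):
        if i == i_bar:
            p_ood[i] = rng.uniform(np.exp(-eps[i]), 1.0)          # in-task: >= e^{-eps_i}
        else:
            p_ood[i] = rng.uniform(0.0, 1.0 - np.exp(-eps[i]))    # out-of-task: <= 1 - e^{-eps_i}
    p_tii = p_ood / p_ood.sum()                   # normalized TII probability
    h_tii = -np.log(p_tii[i_bar])
    bound = np.exp(eps[i_bar]) * np.sum(1.0 - np.exp(-eps[np.arange(t) != i_bar]))
    assert h_tii <= bound + 1e-9
print("Eq. (50) bound holds on all sampled cases.")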

B.2 Sufficient and Necessary Conditions

Now we discuss the upper bound of CIL in relation to WTP, OOD detection and TAP.

Theorem 8

For CIL with pre-training (i.e., $\theta$), define the TII probability as $P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)=\frac{P_{i}(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)}{\sum_{j}P_{j}(\boldsymbol{x}\in\mathcal{X}_{j}|\mathcal{D},\theta)}$. If ${H}_{\rm{WTP}}(\boldsymbol{x})\leq\delta$, ${H}_{\rm{TAP}}(\boldsymbol{x})\leq\eta$, and $H_{{\rm{OOD}},i}(\boldsymbol{x})\leq\epsilon_{i}$ for $i\in[t]$, then we have the loss error

\mathcal{L}\in[0,\max\{\delta+(\sum_{i}\boldsymbol{1}_{\boldsymbol{x}\in\mathcal{X}_{i}}e^{\epsilon_{i}})(\sum_{i}\boldsymbol{1}_{\boldsymbol{x}\notin\mathcal{X}_{i}}(1-e^{-\epsilon_{i}})),\,\eta\}].

As shown in Theorem 8, good performance of WTP, TAP, and OOD detection is sufficient to guarantee good CIL performance. Now we further study the necessary conditions of a well-performing CIL model.

Theorem 9

For CIL with pre-training (i.e., $\theta$), define the TII probability as $P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)=\frac{P_{i}(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)}{\sum_{j}P_{j}(\boldsymbol{x}\in\mathcal{X}_{j}|\mathcal{D},\theta)}$. If the loss error $\mathcal{L}\leq\xi$, then there always exist (1) a WTP predictor, s.t. ${H}_{\rm{WTP}}\leq\xi$; (2) a TII predictor, s.t. ${H}_{\rm{TII}}\leq\xi$; (3) a TAP predictor, s.t. ${H}_{\rm{TAP}}\leq\xi$; and (4) an OOD detector for each task, s.t. ${H}_{{\rm{OOD}},i}\leq\xi$ for $i\in[t]$.

This theorem shows that if a good CIL model is trained, then a good WTP, a good TII, a good TAP, and a good OOD detector for each task are always implied.

Proof of Theorem 8
For CIL with pre-training (i.e., $\theta$), define the TII probability as $P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)=\frac{P_{i}(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)}{\sum_{j}P_{j}(\boldsymbol{x}\in\mathcal{X}_{j}|\mathcal{D},\theta)}$. If ${H}_{\rm{WTP}}(\boldsymbol{x})\leq\delta$, ${H}_{\rm{TAP}}(\boldsymbol{x})\leq\eta$, and $H_{{\rm{OOD}},i}(\boldsymbol{x})\leq\epsilon_{i}$ for $i\in[t]$, then using Theorem 3 we have $H_{\rm{TII}}(\boldsymbol{x})\leq(\sum_{i}\boldsymbol{1}_{\boldsymbol{x}\in\mathcal{X}_{i}}e^{\epsilon_{i}})(\sum_{i}\boldsymbol{1}_{\boldsymbol{x}\notin\mathcal{X}_{i}}(1-e^{-\epsilon_{i}}))$. Further, using Theorem 1 we have the loss error

\begin{split}\mathcal{L}&=\max\{\mathbb{E}_{\boldsymbol{x}}[\mathcal{H}(\boldsymbol{1}_{\bar{i},\bar{j}},\{P(\boldsymbol{x}\in\mathcal{X}_{{i},{j}}|\mathcal{D},\theta)\}_{{i},{j}})],\mathbb{E}_{\boldsymbol{x}}[{H}_{\rm{TAP}}(\boldsymbol{x})]\}\\ &=\max\{\mathbb{E}_{\boldsymbol{x}}[{H}_{\rm{WTP}}(\boldsymbol{x})]+\mathbb{E}_{\boldsymbol{x}}[{H}_{\rm{TII}}(\boldsymbol{x})],\mathcal{L}_{1}\}\\ &\leq\max\{\delta+(\sum_{i}\boldsymbol{1}_{\boldsymbol{x}\in\mathcal{X}_{i}}e^{\epsilon_{i}})(\sum_{i}\boldsymbol{1}_{\boldsymbol{x}\notin\mathcal{X}_{i}}(1-e^{-\epsilon_{i}})),\,\eta\}.\end{split} (54)

This finishes the proof.

Proof of Theorem 9
For CIL with pre-training (i.e., $\theta$), define the TII probability as $P(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)=\frac{P_{i}(\boldsymbol{x}\in\mathcal{X}_{i}|\mathcal{D},\theta)}{\sum_{j}P_{j}(\boldsymbol{x}\in\mathcal{X}_{j}|\mathcal{D},\theta)}$. If the loss error $\mathcal{L}\leq\xi$, then using Theorem 2 there always exist (1) a WTP predictor, s.t. ${H}_{\rm{WTP}}\leq\xi$; (2) a TII predictor, s.t. ${H}_{\rm{TII}}\leq\xi$; and (3) a TAP predictor, s.t. ${H}_{\rm{TAP}}\leq\xi$. Furthermore, if ${H}_{\rm{TII}}\leq\xi$, then using Theorem 3 there always exists an OOD detector for each task, s.t. ${H}_{{\rm{OOD}},i}\leq\xi$ for $i\in[t]$. This finishes the proof.
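For intuition on how these conditions translate into inference, a minimal two-stage prediction sketch is given below: per-task OOD probabilities are normalized into the TII probability defined above, and the resulting task identity selects a within-task predictor. The function ood_prob_fn, the per-task heads wtp_heads, and the hard argmax decision are illustrative assumptions rather than the exact inference procedure of HiDe-PET.

# Minimal two-stage inference sketch (assumptions: ood_prob_fn returns per-task
# probabilities P_i(x in X_i | D, theta); wtp_heads[i] is the within-task head of task i).
import torch

@torch.no_grad()
def predict_with_ood_probs(feature, ood_prob_fn, wtp_heads):
    p_ood = ood_prob_fn(feature)                 # shape (t,), one probability per task
    p_tii = p_ood / p_ood.sum()                  # the normalized TII probability used in Theorems 8-9
    task_id = int(torch.argmax(p_tii))           # hard TII decision (an illustrative choice)
    wtp_logits = wtp_heads[task_id](feature)     # within-task prediction under the inferred task
    return task_id, int(torch.argmax(wtp_logits))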

TABLE VII: Comparison of recent CL methods relevant to PTMs and PET. $t$ is the total number of tasks. $d$ is the embedding dimension. $s$ is the expansion rate of the embedding dimension. Full: fine-tuning of full backbone parameters. General: applicable to mainstream PET techniques, such as ProT, PreT, LoRA, Adapter, etc. Note that $t\ll d$ in general, e.g., $t=10$ and $d=768$ for all cases in this work. $s=100$ in [34].
Method Year Avenue PET Technique Task-Specific Parameters Task-Shared Parameters Representation Recovery
L2P [5] 2022 CVPR ProT N/A
DualPrompt [6] 2022 ECCV PreT N/A
S-Prompt [7] 2022 NeurIPS ProT N/A N/A
CODA-Prompt [8] 2023 CVPR PreT N/A N/A
SLCA [4] 2023 ICCV Full N/A $O(td^{2})$
FSA [43] 2023 ICCV Full N/A $O(d^{2})$
LAE [14] 2023 ICCV General N/A N/A
RanPAC [34] 2023 NeurIPS PreT N/A $O(s^{2}d^{2})$
KOPPA [44] 2023 arXiv PreT N/A $O(td)$
H-Prompt [66] 2024 arXiv PreT $O(td^{2})$
HiDe-Prompt [13] 2023 NeurIPS PreT N/A $O(td^{2})$
HiDe-PET 2024 Current General $O(td)$

Appendix C Implementation Details

Here we describe the supplementary implementation details of the empirical investigation.

Comparison with Preliminary Version: The major technical difference between our HiDe-PET and our preliminary version [13] lies in the use of task-shared parameters $\boldsymbol{g}$ to improve TII, which is critical for LoRA/Adapter-based PET that is sensitive to TII errors. To mitigate catastrophic forgetting in $\boldsymbol{g}$, we set a cosine-decaying learning rate of 0.01 for FSA, a cosine-decaying learning rate of 0.001 for SL, and a momentum of 0.1 for EMA. The PET ensemble strategy sets $\alpha=0.1$ in all cases. To ensure generality and resource efficiency, the specific implementations are slightly modified in three aspects. First, our preliminary version [13] followed the implementation of L2P [5] and DualPrompt [6], which employed a constant learning rate of 0.005 and a supervised checkpoint of ImageNet-21K (i.e., Sup-21K). We notice that many recent methods followed the implementation of CODA-Prompt [8], which employed a cosine-decaying learning rate of 0.001, a self-supervised/supervised checkpoint on ImageNet-21/1K (i.e., Sup-21/1K), and a different split of ImageNet-R. Considering that a smaller learning rate with cosine decay has been more commonly used for fine-tuning large-scale PTMs, we reproduce all baselines with the implementation of CODA-Prompt [8] in the current manuscript. This further ensures the generality of our HiDe-PET in adapting to different experimental settings. Second, our preliminary version [13] devised a contrastive regularization (CR) term to balance the instructed representations for WTP and TAP, which brings some benefit to the performance of Prompt-based PET. In subsequent explorations, we observe that the CR term cannot improve the performance of LoRA/Adapter-based PET, and we therefore remove it in the current manuscript to ensure generality across different PET techniques. Third, our preliminary version [13] employed dedicated covariance matrices (an additional $d^{2}$ parameters for each class) in representation recovery and a two-layer MLP (an additional $d^{2}$ parameters for the first layer) in $\hat{h}_{\omega}$, in order to acquire better performance. In contrast, the current manuscript employs multiple centroids (fewer than $10d$ additional parameters for each class) in representation recovery and a one-layer MLP in $\hat{h}_{\omega}$, which slightly compromises performance but largely improves resource efficiency.
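For concreteness, a minimal sketch of the optimization settings stated above is given below. The optimizer choice (SGD), the parameter grouping, the direction of the EMA update, and which output the ensemble weight $\alpha$ is applied to are illustrative assumptions; only the learning rates, the EMA momentum, and $\alpha=0.1$ follow the values stated in this paragraph.

# Minimal sketch of the stated hyper-parameters (assumptions: SGD as the optimizer,
# cosine annealing over the training epochs, and a simple convex ensemble of PET outputs).
import torch

def build_schedules(fsa_params, sl_params, num_epochs):
    # Cosine-decaying learning rates: 0.01 for FSA and 0.001 for SL, as stated above.
    opt_fsa = torch.optim.SGD(fsa_params, lr=0.01, momentum=0.9)
    opt_sl = torch.optim.SGD(sl_params, lr=0.001, momentum=0.9)
    sched_fsa = torch.optim.lr_scheduler.CosineAnnealingLR(opt_fsa, T_max=num_epochs)
    sched_sl = torch.optim.lr_scheduler.CosineAnnealingLR(opt_sl, T_max=num_epochs)
    return (opt_fsa, sched_fsa), (opt_sl, sched_sl)

@torch.no_grad()
def ema_update(shared_pet, new_pet, momentum=0.1):
    # EMA with momentum 0.1 (assumed convention: the task-shared parameters g move
    # slowly toward the newly learned parameters, mitigating forgetting in g).
    for p_shared, p_new in zip(shared_pet.parameters(), new_pet.parameters()):
        p_shared.mul_(1.0 - momentum).add_(p_new, alpha=momentum)

def pet_ensemble(output_shared, output_specific, alpha=0.1):
    # PET ensemble with alpha = 0.1 (assumed convention: alpha weights the shared branch).
    return alpha * output_shared + (1.0 - alpha) * output_specific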

Adaptive Knowledge Accumulation: In Sec. 5.2, we devise a PET hierarchy $\boldsymbol{g}_{1},...,\boldsymbol{g}_{k}$ inspired by OOD detection to demonstrate the connections between task-specific and task-shared PET architectures. We further consider a specialized implementation of $\boldsymbol{g}_{1},...,\boldsymbol{g}_{k}$ for HiDe-PET in Algorithm 1, serving as a plug-in module to achieve adaptive knowledge accumulation under pronounced distribution changes. As described in Sec. 5.2, $\boldsymbol{g}_{1},...,\boldsymbol{g}_{k}$ are adaptively expanded or retrieved based on the distance ${\rm{Dis}}(\boldsymbol{x},\hat{\mathcal{G}}_{i})$ to each previous task $i\in[t]$. The learning of each $\boldsymbol{g}_{j}$ for $j\in[k]$ is identical to the learning of $\boldsymbol{g}$, i.e., a combination of FSA and SL. As for the exploitation of $\boldsymbol{g}_{1},...,\boldsymbol{g}_{k}$, before learning each task $i$, the most relevant $\boldsymbol{g}_{j}$ is first retrieved based on the current training samples $\mathcal{D}_{i}$ and then temporarily added to the backbone parameters $\theta$ to better incorporate task-specific knowledge (the improved parameters are denoted as $\theta^{\prime}$). The improved backbone $f_{\theta^{\prime}}$ is used to obtain $\mathcal{G}_{i,c}$, i.e., Step 10 in Algorithm 1, whereas the original backbone $f_{\theta}$ is still used to obtain $\hat{\mathcal{G}}_{i,c}$, i.e., Step 9 in Algorithm 1. At the testing phase, the most relevant $\boldsymbol{g}_{j}$ is retrieved based on the current testing samples and temporarily added to $\theta$.
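A minimal sketch of this expansion-or-retrieval logic is given below. Realizing ${\rm{Dis}}(\cdot,\cdot)$ as the minimum Euclidean distance to stored centroids, the expansion threshold tau, and the bookkeeping structures (task_centroids, pet_index_of_task) are illustrative assumptions rather than the exact procedure of Algorithm 1.

# Sketch of adaptive knowledge accumulation (assumptions: Dis(., .) is realized as the
# mean of per-sample minimum distances to stored centroids; tau is a tunable threshold).
import torch

def retrieve_or_expand(batch_features, task_centroids, pet_index_of_task,
                       shared_pets, tau, make_new_pet):
    # task_centroids[i]: centroids of previous task i under the frozen backbone f_theta.
    # pet_index_of_task[i]: which g_j in shared_pets was associated with previous task i.
    if len(task_centroids) > 0:
        dists = torch.stack([
            torch.cdist(batch_features, c).min(dim=1).values.mean()  # sketched Dis(x, G_hat_i)
            for c in task_centroids
        ])
        if dists.min() <= tau:
            # Retrieve the task-shared module of the closest previous task.
            return pet_index_of_task[int(torch.argmin(dists))]
    # Otherwise (first task or pronounced distribution change): expand the hierarchy.
    shared_pets.append(make_new_pet())
    return len(shared_pets) - 1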