¹¹institutetext: Microsoft
¹¹email: {peyu, yiche, yinjin, zliu}@microsoft.com

Improving Vision Transformers for
Incremental Learning

Pei Yu Yinpeng Chen Ying Jin Zicheng Liu

Abstract

This paper proposes a working recipe of using Vision Transformer (ViT) in class incremental learning. Although this recipe only combines existing techniques, developing the combination is not trivial. Firstly, naive application of ViT to replace convolutional neural networks (CNNs) in incremental learning results in serious performance degradation. Secondly, we nail down three issues of naively using ViT: (a) ViT has very slow convergence when the number of classes is small, (b) more bias towards new classes is observed in ViT than CNN-based architectures, and (c) the conventional learning rate of ViT is too low to learn a good classifier layer. Finally, our solution, named ViTIL (ViT for Incremental Learning) achieves new state-of-the-art on both CIFAR and ImageNet datasets for all three class incremental learning setups by a clear margin. We believe this advances the knowledge of transformer in the incremental learning community. Code will be publicly released.

Keywords:

Incremental Learning, Vision Transformer

1 Introduction

Refer to caption — Figure 1: Comparison with state-of-the-art methods on three CIL setups on ImageNet-1000. (a) shows average incremental accuracy. (b) illustrates CIL setups. B500 denotes initial step contains half of total 1000 classes. B0 denotes model starts from scratch, and each step adds same number of new classes. C100 denotes each incremental step adds 100 new classes. R20 denotes each old class keeps same number of 20 exemplars. T20K denotes each incremental step keeps total 20000 exemplars for all classes. (Best viewed in color)

Recent progress in Vision Transformer [13] demonstrates superior performance over convolutional neural networks (CNNs) on various computer vision tasks such as image recognition [33, 26] and object detection [6]. The success of ViT motivates us to investigate if it is also suitable for class incremental learning (CIL), which is another great interest to the community since its setting is close to real-world situation, where the model needs to handle sequentially incoming data of new classes and avoid performance degradation on previous classes. The key question to answer is whether ViT provides a better feature extractor than CNNs for class incremental learning?

However, applying ViT in class incremental learning is not trivial. Naively replacing CNN feature extractor with ViT results in significant performance degradation across different CIL settings, as shown in Fig. 2. After careful analysis, we found three factors contribute to the degradation: (a) ViT models have slow convergence, especially at the beginning of incremental learning where the number of classes is small, (b) more bias towards new classes is observed in ViT models than CNN models, mainly because the margin ranking loss (an effective bias removal technique for CNN) conflicts with data augmentation such as MixUp or CutMix which are important for ViT, and (c) using the same learning rate for both ViT feature extractor and classifier causes underfitting in classifier, which is reflected in the magnitude of learnable temperature in softmax.

Base upon our analysis, we further address these issues by simply using existing techniques in either network architecture or incremental learning. Firstly, we address the slow convergence by using convolutional stem [36] to replace the patchify stem in ViT model, which achieves better performance with significantly shorter training. Secondly, we find that finetuning with balanced dataset [35, 17] is effective to correct bias towards new classes for vision transformer, but not conflicting with data augmentation (MixUp, CutMix, etc). Finally, we show that using larger learning rate for classifier results in larger value of softmax temperature, further boosting the performance. As shown in Fig. 2, these techniques effectively boost performance across three CIL settings. Note that none of these techniques is originally proposed in this paper, our major contribution is to locate key issues in applying vision transformer in incremental learning and connects existing techniques to address these issues effectively.

With these techniques, as shown in Fig. 1 (a), our method ViTIL(ViT for Incremental Learning) achieves new state-of-the-art consistently for all three class incremental learning setups by a clear margin. In contrast, existing methods only perform well in one CIL setup. For instance, on ImageNet-1000, our ViTIL achieves 69.20% top-1 accuracy for the CIL setup of 500 initial classes, and each incremental step adds 100 new classes, outperforming LUCIR+DDE [18] by $1.69\%$ . For CIL setup of 10 incremental steps (each step adds 100 new classes), our method outperforms PODNet [14] by $7.27\%$ ( $65.13\%$ vs $57.86\%$ ), when each old class keeps 20 exemplars. When each incremental step keeps total 20000 exemplars, our method outperforms BiC [35] by $6.28\%$ ( $68.75\%$ vs $62.47\%$ ).

In summary, our contributions are three-fold.

1.

We nail down three key issues (slow convergence, bias towards new classes, underfitted classifier) that contributes to the degradation of applying vision transformer in class incremental learning.
2.

We showcase a simple solution with existing techniques that effectively addresses all three issues, achieving significant performance boost.
3.

Our method (ViTIL) achieves the new state-of-the-art across three class incremental setups by a clear margin. This is challenging as these setups previously have different lead methods.

The rest of this paper is organized as follows. Sec. 2 introduces related work. In Sec. 3, we investigate applying ViT naively to CIL, and its performance issue. In Sec. 4, our method is proposed to address these issues. Sec. 5 evaluates the proposed method in different CIL setups, and studies the impact of different components of the proposed method. Sec. 6 concludes this paper.

2 Related Work

Incremental learning methods can be categorized by how they tackle catastrophic forgetting. Regularization-based methods penalize the changes of model parameters [20, 39, 27, 38, 2, 8, 1, 23, 9]. Distillation-based methods tackle catastrophic forgetting by distilling knowledge from previous models [24, 31, 3, 30, 16, 7, 17, 35, 12, 22, 14, 25, 18, 40]. Some methods finetune model with data exemplars of previous classes, without weight constraints or distillation [4, 5]. Some methods seek synthetic data [32, 19]. A more comprehensive survey can be found in [28].

Distillation-based methods: Hinton et al. propose knowledge distillation to improve the performance of single model by distilling the knowledge from an ensemble of models [15]. In Learning without Forgetting (LwF) [24], knowledge distillation is applied to address catastrophic forgetting under multi-task incremental learning scenario. Following LwF, Aljundi et al. propose gating autoencoders to automatically select the task classifier [3]. In [30], distillation loss is added on the features from autoencoder. In [16], knowledge is distilled from an expert CNN trained dedicated to new task. In [12], attention map is distilled.

CIL with data replay: In iCaRL [31], Rebuffi et al. show that, in addition to knowledge distillation, store a set exemplars from previous classes, and replay them in the following training process can significantly alleviate forgetting issue. Nevertheless, bias towards new classes is a major issue in CIL with data replay. Castro et al. propose to use balanced finetuning to tackle it [7]. In [17], Hou et al. use normalized feature, normalized classifier, and margin ranking loss. In [35], bias correction is done by learning a bias correction layer. In [40], without an extra validation set, bias is mitigated by aligning the weight norm of new classes to old classes. In [14], more sophisticated distillation is proposed by penalizing changes of spatially pooled features. Instead of using herding to select exemplars, Liu et al. propose an automatic exemplar extraction framework mnemonics [25]. [18] alleviates dependency on exemplars by distilling the causal data effect.

Transformers: Transformers were originally proposed for NLP tasks [34], and show superior performance for language model pre-training [29, 11]. Increasing number of works seek to apply transformer to vision tasks, e.g., image recognition [33, 26], object detection [6]. Vision transformer was proposed in [13], where input image is divided into non-overlapping patches simply with patchify operation. Although it achieves superior performance, it replies on large corps of pre-training data. In [33], Touvron et al. propose a data-efficient method for training Vision Transformer. In [36], Xiao et al. seek to improve optimizability of Vision Transformer by using convolutional stem.

3 Degradation of Applying ViT in CIL

In this section, we show that a severe degradation is observed when naively appling Vision Transformers (ViT) in multi-class class incremental learning (CIL) with exemplars, and analyze the key degradation-causing factors.

3.1 Review of Class Incremental Learning

Class incremental learning (CIL) [31, 17, 35, 14, 18] is performed progressively, consisting of incremental learning steps in which only the data of new classes is available. Particularly, at each incremental step $t$ , a new model $\mathcal{M}_{t}$ is learned from an old model $\mathcal{M}_{t-1}$ and data of new classes, to perform classification on all classes it has seen. Let us denote the old classes as $\mathcal{C}_{o}$ and the corresponding data as $\mathcal{X}_{o}=\{(x_{i},y_{i})|y_{i}\in\mathcal{C}_{o}\}$ . Similarly, the data of new classes $\mathcal{C}_{n}$ is denoted as $\mathcal{X}_{n}=\{(x_{i},y_{i})|y_{i}\in\mathcal{C}_{n}\}$ . We follow the setup of [31, 17, 35] to use the replay of old class exemplars, denoted by $\mathcal{X}^{\prime}_{o}\subset\mathcal{X}_{o}$ .

In this paper, we follow LUCIR [17] by replacing its CNN feature extractor with ViT model, but keeping its classifier that applies softmax on the cosine similarity as:

p_{i}(x)=\frac{\exp(\eta\left\langle\bar{\theta}_{i},\bar{f}(x)\right\rangle)}{\sum_{j}\exp(\eta\left\langle\bar{\theta}_{j},\bar{f}(x)\right\rangle)},

(1)

where $\bar{\theta},\bar{f}(x)$ denote $l_{2}$ -normalized weight vector of each class, and the feature vector of input $x$ , respectively. $\eta$ is a learnable softmax temperature parameter.

3.2 Degradation of Replacing CNN with ViT

As shown in Fig. 2, a severe performance degradation is observed when using ViT-tiny (ViT-ti) [13] to replace ResNet-18 as feature extractor, where the two models have similar computational cost (1.1G vs. 1.8G FLOPs). The average incremental accuracy drops by 10.45%. The experiments are conducted on ImageNet-100, in which 50 classes are used to learn the first model and 5 incremental steps (10 classes for each) are followed. Interestingly, ViT model is behind ResNet-18 from the beginning (the first 50 classes) even though more training epochs are consumed (400 vs. 90 epochs). This seems not consistent with recent findings that ViT models outperform ResNet with similar FLOPs on ImageNet dataset. Next, we will discuss the causing factors.

3.3 Analysis of Causing Factors

Our analysis locates three key issues related to the performance degradation of naive application of ViT, which will be discussed in this section.

Slow convergence when the number of classes is small: As ViT is significantly behind ResNet-18 for the first 50 classes, a reasonable guess is that ViT model is not fully trained even though it consumes longer training. As shown in Fig. 4, ViT-ti takes more than 2000 training epochs to achieve saturated performance. In contrast, ResNet usually is trained for only 90 epochs. When we increase training length from 400 to 800 epochs, it results in significant performance boosts from $74.72\%$ to $82.28\%$ (shown in Fig. 4). However, it is still lower than accuracy of ResNet-18. Although previous works [13, 33] report that ViT models require longer training schedule (300-400 epochs) to achieve saturated performance, this issue becomes even more severe when the number of classes is small, which is typical in the beginning of incremental leaning. As shown in Fig. 4, ViT model requires more training epochs to achieve saturated performance on 10 classes than on 50 classes. In addition, the initial model is crucial for the performance of the following incremental steps. As shown in Fig. 2(a), when the initial model is improved by training for 800 epochs, the average accuracy of all incremental steps is also significantly improved from $60.02\%$ to $67.11\%$ .

More bias towards new classes: We also compare ViT and ResNet on confusion matrices in Fig. 3, where clearly more bias towards new classes is observed in ViT (highlighted in a red box). Quantitatively, compared to ResNet-18, ViT-ti has 9.3% more testing samples from old classes are falsely classified as new classes (11.2% vs. 20.5%). We believe this is caused by the conflict between the margin ranking loss and the data augmentation like Mixup or CutMix. Margin ranking loss [17] is an important technique to prevent bias, but requires hard class label per sample. Such hard class label is not available in ViT as it heavily relies on augmentation like Mixup or CutMix that generates soft (or mixed) label. In contrast, ResNet is not affected as strong augmentation is not necessary. Note that to avoid bias caused by long training schedule of ViT, we have already reduced training length for incremental steps $t>1$ . More details about training setting are described in Sec. 4.2.

Underfitted classifier when using the same learning rate to ViT feature extractor: Another clear difference between ResNet and ViT is observed in the magnitude of learned softmax temperature $\eta$ in Equ. 1: $\eta=34$ for ResNet vs. $\eta=10$ for ViT in the final model after the last incremental step. We hypothesize that (a) the magnitude of $\eta$ is correlated to the learning rate, and (b) the small learning rate $2.5\times 10^{-4}$ is good for ViT as feature extractor but too low for the classifier. To validate this, we conduct experiments by applying different learning rates ( $2.5\times 10^{-4}$ , $5\times 10^{-4}$ , $2.5\times 10^{-3}$ ) for the classifier alone while keeping the learning rate of ViT feature extractor unchanged ( $2.5\times 10^{-4}$ ). Experimental results validate our hypothesis. As shown in Fig. 6, $\eta$ is highly correlated to the learning rate. Fig. 6 shows increasing learning rate from $2.5\times 10^{-4}$ to $2.5\times 10^{-3}$ for classifier boosts both the final and average incremental accuracy significantly.

4 Our Method: ViTIL

In this section, we introduce our method ViTIL to address the aforementioned issues by using existing techniques, which significantly boosts the class incremental learning performance. We note that our contribution is not in these techniques developed in previous works, but in nailing down the causes of ViT’s degradation and effective treatments by leveraging existing techniques.

4.1 Bag of Treatments in ViTIL

Convolutional stem for faster convergence:

Inspired by the findings in [36] that convolutional stem introduces quick converge, we follow it replace patchify operation and first transformer block with a small convolutional network with four convolution layers. Each convolution layer has kernel size of 3, and stride size of 2. This change results in much faster convergence to saturated performance (see Fig. 4). For 50 classes from ImageNet-100, ViT with convolutional stem (denoted as ViT_C-ti) not only speeds up the convergence to around 800 epochs, but also achieves higher accuracy (+ $11.68$ % and + $4.96$ %) at both 400 and 800 training epochs. It outperforms ResNet-18 by $4.24$ %, providing a strong start for the following incremental steps.

Bias correction via balanced finetuning: We correct the biases in classifier by finetuning it with balanced dataset, which is denoted as $\mathcal{X}^{\prime}_{o}\cup\mathcal{X}^{\prime}_{n}$ , where $\mathcal{X}^{\prime}_{o}$ and $\mathcal{X}^{\prime}_{n}$ are the exemplar set from old and new classes respectively. The idea of using balanced dataset to correct biases in the classifier was studied in [35], where a linear layer with two parameters is learned. However, it does not apply to our case, since our classifier has normalized weights per class. Instead, we finetune all parameters in the classifier. As shown in Fig. 3(c), this finetuning strategy effectively mitigates bias towards new classes for using ViT model. This results in a two-stage incremental learning. In the first stage, parameters in both feature extractor and classifier (denoted as $\Theta_{t}$ for incremental step $t$ ) are updated. The second stage deals with balance finetuning of classifier where only the parameters in classifier (denoted as $\Theta^{g}_{t}$ ) are finetuned, with feature extractor frozen.

Large learning rate for classifier: It is straightforward to address the underfitting of classifier, by using a larger learning rate for it. In this paper, we use learning rate $2.5\times 10^{-3}$ on ImageNet dataset, which is ten times higher than the learning rate for the backbone feature extractor.

4.2 Implementation Details

ViT architecture: We follow the design of ViT models in [13], and adopt ViT-ti as the base model for ImageNet dataset. Existing CIL works apply ResNet-18 as base model for CIL on ImageNet dataset. ViT-ti has 5M parameters and 1.1G FLOPs, less than ResNet-18 with 12M parameters and 1.8G FLOPs. CLS token of final transformer block is treated as the feature output of backbone ViT model, and fed into classifier head, which is cosine linear layer in our case. For CIL on CIFAR-100 dataset, existing works apply a modified 32-layer ResNet with 0.47M parameters, 69.8M FLOPs. For a fair comparison, we modify our ViT model with embedding dimension of 120, depth of 4, mlp ratio of 1, and patch size of 4. This lightweight model contains 0.47M parameters, and 41.0M FLOPs, with input image resolution 32 $\times$ 32 same as existing works.

Convolutional stem architecture: Follow [36], we replace the patchify stem and the first transformer block in ViT-ti model with a small convolutional network. It consists of 4 convolution layers with kernel size $3$ and stride size $2$ . The number of output channels is [24, 48, 96, 192]. Each convolution layer is followed by a batch norm layer and a ReLU layer. As pointed out in [36], convolutional stem has negligible impact on the model’s FLOPs.

Loss: Our loss function, $L=\sum_{x\in\mathcal{X}^{\prime}_{o}\cup\mathcal{X}_{n}}(L_{ce}(x)+\lambda L_{dis}(x))$ , includes a standard cross entropy loss $L_{ce}$ , and distillation loss $L_{dist}$ , where $\lambda$ is a balancing factor. Same as in [17], $L_{dis}$ penalizes the change of feature. $L_{dis}(x)=1-\left\langle\bar{f}^{\ast}(x),\bar{f}(x)\right\rangle$ , where $f^{\ast}(x)$ represents feature vector extracted by old model. In the finetuning stage, only cross entropy loss $L_{ce}$ is calculated on new class exemplar set $\mathcal{X}_{n}^{\prime}$ and old class exemplar set $\mathcal{X}_{o}^{\prime}$ . With feature extractor frozen, only cosine linear classifier parameters are updated. $\lambda$ is set to 3. Same as [17], we use adaptive $\lambda$ .

Augmentation: We adopt the same augmentation recipe in [36] for ImageNet dataset, including Mixup, CutMix, soft label, and AutoAugmentation. All images are re-scaled to 224 $\times$ 224. For CIFAR-100 dataset, we use same data augmentation with images re-scaled to 32 $\times$ 32.

Optimizer: Note that in all experiments, all ViT-based models are trained from scratch, without using any pre-training or extra dataset. We adopt commonly used AdamW optimizer. Batch size is set to 1024. Weight decay is set to 0.24. Learning rate is set to $2.5\times 10^{-4}$ for feature extractor, and $2.5\times 10^{-3}$ for classifier. Learning rate is scaled based on batch size as $lr\times$ BatchSize/512. Cosine learning rate scheduler is applied with 5 warmup epochs. For the initial step $t=1$ , the number of training epochs is set to $800$ . For the following incremental steps $t>1$ , number of training epochs is set to $50$ for B50/B500 settings, and $200$ for B0 setting. This is because when the incremental learning starts with half of total classes, learned feature representation is more robust and can better generalize to the other half classes. For CIFAR-100 dataset, since it contains less training data, number of training epochs is set to $2000$ for initial step $t=1$ , and $500$ for following incremental steps $t>1$ . Batch size is set to 512. Learning rate is set to $2.5\times 10^{-3}$ for feature extractor, and $5\times 10^{-3}$ for classifier.

5 Expriments

In this section, we evaluate the proposed method with public benchmarks and commonly used protocols. Ablation study is provided to analyze the impact of different components of the proposed method.

5.1 Settings

Datasets: Our experiments are conducted on three widely used CIL evaluation datasets, ImageNet-1000 [10], ImageNet-100 and CIFAR-100 [21]. ImageNet-1000 is a large-scale dataset with 1000 classes. ImageNet-100 is a subset of ImageNet-1000 with 100 classes. Same as in [17], we use random seed 1993 to shuffle 1000 classes and choose the first 100 classes as ImageNet-100. CIFAR-100 is a dataset with small image resolution $32\times 32$ .

Protocols: CIL protocols in existing works are different on three main factors: number of classes in the initial step, number of new classes each incremental step adds, and the number of exemplars. For simplicity, we use abbreviated notation of CIL protocol as in Fig. 1 (b). Existing works mainly adopt three CIL protocols. For example, (1) B500 R20, (2) B0 R20, (3) B0 T20K on ImageNet-1000 dataset. B500 denotes initial step contains half of total 1000 classes. B0 denotes model starts from scratch, and each step adds same number of new classes. R20 denotes each old class keeps same number of 20 exemplars. T20K denotes each incremental step keeps total 20000 exemplars. These protocols are evaluated with different numbers of new classes each incremental step adds, e.g., C100 denotes each incremental step adds 100 new classes. For ImageNet-100 and CIFAR-100, B500 is changed to B50, and T20K is changed to T2K. We conduct experiments on C2, C5 and C10 settings for CIFAR-100, C10 and C5 settings for ImageNet-100, C100 setting for ImageNet-1000.

Baselines: In our experiments, we compare with 5 baselines, including iCaRL [31], LUCIR [17], BiC [35], PODNet [14], and DDE [18]. Note that these baselines may only provide results on part of the three protocols in their original papers. Here, we report their reported results for the protocols they have evaluated. For the protocols they do not evaluate, we use either their official code release, or our own implementation. BiC [35] is evaluated with our PyTorch implementation. Note that in the released code of PODNet [14] for ImageNet, stride size of first convolution layer in ResNet-18 is changed from 2 to 1, which results in larger spatial size of every convolution feature map and much larger FLOPs (6.9G FLOPs vs 1.8G FLOPs). To investigate this issue, we evaluated it with two settings. In the first one, we adopt official ResNet-18 setting for feature extractor. In the second one, we use the modified ResNet same as in [14]. Our method is also evaluated with these two settings.

Table 1: Average incremental accuracy (

\%

) on ImageNet. All baselines apply regular ResNet-18 with 1.8G FLOPs. Our method has 1.1G FLOPs. C10 denotes each incremental step adds 10 new classes. B50 denotes initial step contains 50 classes. R20 denotes each old class keeps 20 exemplars. T2K denotes each incremental step keeps total 2K exemplars. Results with

\ast

are reported directly from original paper. BiC^† represents our implementation of BiC. PODNet(1.8G) represents changing feature extractor in PODNet to regular ResNet-18 with 1.8G FLOPs. Original PODNet in [14] has 6.9G FLOPs.

Methods	ImageNet-100						ImageNet-1000
	C10			C5			C100
	B50	B0	B0	B50	B0	B0	B500	B0	B0
	R20	R20	T2K	R20	R20	T2K	R20	R20	T20K
iCaRL [31]	65.06	61.60	66.29	58.76	54.02	62.76	53.03	55.95	59.63
LUCIR [17]	70.47^∗	55.33	62.02	68.09^∗	45.72	56.80	64.34^∗	57.14	60.65
BiC^† [35]	68.58	64.35	68.79	62.83	54.38	62.61	56.78	57.60	62.47
PODNet(1.8G) [14]	74.48	62.59	67.28	70.97	54.81	60.85	64.22	57.86	59.49
LUCIR+DDE [18]	72.34^∗	58.49	65.90	70.20^∗	50.17	61.30	67.51^∗	57.16	59.13
Ours	79.43	69.68	73.09	76.92	61.57	67.17	69.20	65.13	68.75

Table 2: Average incremental accuracy (

\%

) on CIFAR-100. C10 denotes each incremental step adds 10 new classes. B50 denotes initial step contains 50 classes. R20 denotes each old class keeps 20 exemplars. T2K denotes each incremental step keeps total 2K exemplars.

Methods	C10			C5			C2
	B50	B0	B0	B50	B0	B0	B50	B0	B0
	R20	R20	T2K	R20	R20	T2K	R20	R20	T2K
iCaRL [31]	57.25	60.02	64.74	53.47	55.72	63.94	48.02	47.67	60.91
LUCIR [17]	63.42^∗	55.84	61.66	60.18^∗	50.23	59.54	57.03	41.46	57.48
BiC^† [35]	59.24	60.68	65.58	53.64	56.60	63.66	47.77	45.80	60.10
PODNet [14]	66.02	56.61	56.55	63.70	49.74	49.43	60.39	40.98	40.84
LUCIR+DDE [18]	65.27^∗	57.09	60.40	62.36^∗	50.50	55.45	54.07	36.37	53.08
Ours	71.92	67.26	70.11	67.89	61.12	67.70	61.98	55.63	65.35

5.2 Results

ImageNet-100: The quantitative results are shown in Tab. 1. The incremental accuracy of each incremental learning step is shown in Fig. 7. Compared with CNN counter part, LUCIR [17], the proposed method outperforms by $8.96\%$ , $14.35\%$ , $11.07\%$ , $8.83\%$ , $15.85\%$ , $10.37\%$ for 6 CIL settings on ImageNet-100, respectively. Note that LUCIR, PODNet, BiC only perform well on partial CIL protocols. In contrast, our method consistently outperforms existing methods on all CIL protocols. Moreover, our accuracy degradation slope is flatter than CNN counterpart, LUCIR, indicating a better ability to address forgetting. For example, with B50 C10 R20 in Fig. 7(a), accuracy improvement between ours and LUCIR is $4.94\%$ for the initial step, and $12.68\%$ for the final incremental step.

ImageNet-1000: As shown in Tab. 1, the results on ImageNet-1000 is consistent with those on ImageNet-100. With more classes, the catastrophic forgetting of ViT models is more severe compared with ImageNet-100. The proposed method outperforms state-of-the-art methods by at least $1.69\%$ with B500 setting, by at least $7.27\%$ on B0 R20 setting, and by at least $6.28\%$ with B0 T20K setting. Compared with CNN counterpart, baseline LUCIR, our improvements on three CIL protocols are $4.86\%$ , $7.99\%$ , and $8.10\%$ , respectively.

CIFAR-100: The results are shown in Tab. 2. Our method outperforms baselines with clear margin, using same input resolution 32 $\times$ 32. Moreover, our method has 0.47M parameters and 41.0M FLOPs, lower than 0.47M parameters 69.8M FLOPs of commonly adopted ResNet-32 model in existing works. Note that our model is trained from scratch, without using any pre-training or extra dataset.

Table 3: Comparing ViTIL with PODNet at higher FLOPs (6.9G). Average incremental accuracy (

\%

) on ImageNet-100 and ImageNet-1000. “

\ast

” indicates the results reported in original paper [14].

				PODNet	Ours
100class	C10	B50	R20	$75.54^{\ast}$	81.07
		B0	R20	67.01	70.41
		B0	T2K	71.66	73.22
	C5	B50	R20	$74.33^{\ast}$	78.99
		B0	R20	58.77	62.85
		B0	T2K	65.10	68.03
1Kclass	C100	B500	R20	$66.95^{\ast}$	72.36
		B0	R20	61.60	67.40
		B0	T20K	62.68	70.69

Table 4: Ablative results on ImageNet-100 dataset C10 setting. Average incremental accuracy of different variants. CNN denotes CNN stem. Bias denotes bias correction. Large lr denotes large classifier learning rate.

CNN	Bias	Large lr	B50	B0	B0
CNN	Bias	Large lr	R20	R20	T2K
			67.11	45.17	50.02
	$\checkmark$		70.07	54.67	57.19
	$\checkmark$	$\checkmark$	74.13	59.45	61.70
$\checkmark$			71.52	56.60	63.06
$\checkmark$	$\checkmark$		77.83	67.88	70.09
$\checkmark$	$\checkmark$	$\checkmark$	79.43	69.68	73.09

Model with larger FLOPs: Here we change the first convolution layer of our method to stride size 1, and obtain a model with 6.9G FLOPs. In the released code of PODNet [14], the first convolution layer of ResNet-18 is modified with kernel size of 3 and stride size of 1. With this modification, PODNet has 6.9G FLOPs. For a fair comparison, we compare with our model with 6.9G FLOPs. The results are shown in Tab. 4. Our method still outperforms PODNet consistently on ImageNet-100, ImageNet-1000 with all CIL protocols.

Table 5: Average incremental forgetting (

\%

) on ImageNet-100. Lower is better.

Methods	C10			C5
	B50	B0	B0	B50	B0	B0
	R20	R20	T2K	R20	R20	T2K
iCaRL [31]	21.88	28.95	22.60	24.64	34.80	26.11
LUCIR [17]	13.92	38.62	28.36	15.53	46.34	32.03
BiC^† [35]	10.50	18.92	12.45	10.79	24.90	16.08
PODNet(1.8G FLOPs) [14]	7.60	19.09	16.25	21.22	32.31	30.16
LUCIR+DDE [18]	5.44	32.11	19.65	5.03	35.94	19.56
Ours	2.37	18.01	13.92	2.83	17.34	12.25

5.3 Ablation Study

In this section, we investigate the effectiveness of each component in the proposed method. All variants apply same training hyper-parameters. The results on ImageNet-100 are summarized in Tab. 4. Moreover, we investigate the methods’ ability to mitigate forgetting with evaluation metric of average incremental forgetting, same as in [18].

Convolutional stem: As shown in Tab. 4, without convolutional stem, the naive ViT model’s performance is significantly lower than its ResNet counterpart, LUCIR. Moreover, with the same setting of classifier learning rate and bias correction, the models without CNN stem perform significantly worse than the models with CNN stem. This is mainly due to the inferior optimizability of ViT models, which requires much longer training schedule.

Bias correction: As shown in Tab. 4, without bias correction, the average incremental accuracy is much lower, especially for the protocol with more incremental steps with B0 protocols. This is because bias towards new classes is more severe when there are more incremental steps.

Large classifier learning rate: As shown in Tab. 4, when the model has bias correction, larger classifier learning rate further improves average incremental accuracy, which is caused by underfitted classifier parameters. Moreover, it also improves ViT models without CNN stem. In particular, without CNN stem, the initial step accuracy of ViT-ti is slightly below LUCIR on B50 C10 R20 setting. However, with bias correction and large classifier lr, it still outperforms LUCIR on average incremental accuracy by $3.66\%$ .

Average incremental forgetting: As shown in Tab. 5, forgetting of our method is significantly lower than CNN counterpart, LUCIR. Please note that lower forgetting does NOT necessarily lead to higher average accuracy, when initial accuracy is the same. For example, in Tab. 5 B50 R20, forgetting of BiC is lower than LUCIR. However, as in Fig. 7(a), average incremental accuracy of BiC is lower than LUCIR, and their initial step accuracy is very close. This is because average forgetting only measures model stability, while CIL pursues a trade-off between stability and plasticity. A trivial solution, with model frozen after initial step, can achieve 0 forgetting, while not handling new classes.

5.4 Comparison with Methods of Dynamic Structure

In this section, we discuss the difference between our approach and methods with dynamic structure, and compare their performance and model parameter. In our experiments, all baselines and our method use a fixed model structure in each incremental step. In contrast, a branch of CIL methods apply dynamic model structure, where the model structure changes in each incremental step. For example, in DER [37], individual feature extractor is trained for each step and then concatenated together to avoid forgetting issue. Its model size continuously grows as more classes come in. Dynamic model structure for anti-forgetting is not our main focus. However, our method does not conflict with dynamic model structure, and has the potential to utilize it.

We compare with DER with the reported results in [37]. For ImageNet-1000, with setting B0 C100 T20K, DER obtains average accuracy of $66.73\%$ with 14.52M average parameters. In contrast, our method achieves much higher accuracy $70.69\%$ , with much lower parameter size 5.31M. This demonstrates the advantage of our method on large-scale dataset. For ImageNet-100, with settings B50 C5 R20, and B0 C10 T2K, DER achieves average accuracy of $77.73\%$ and $76.12\%$ , with average parameter 8.87M and 7.67M, respectively. Our method achieves average accuracy $78.99\%$ and $73.22\%$ , while maintains constant parameter size 5.31M. More detailed comparison is provided in appendix.

6 Conclusion

In this paper, we investigate applying ViT models to class incremental learning scenario. We find that naively replacing CNN feature extractor of current CIL method with ViT model results in severe performance degradation. We nail down the causes of this performance degradation, and further address these issues with simple and effective existing techniques. Our proposed method consistently outperforms state-of-the-art class incremental learning methods on two benchmarks across different CIL settings, by a clear margin. Ablation study also demonstrates the effectiveness of our method. The proposed method provides a strong baseline for future class incremental learning research. One drawback of current method is the training schedule still longer than standard ResNet training recipe, which is currently an unsolved issue of ViT models in the community. More works will be done to address it in the future.

References

[1] Ahn, H., Cha, S., Lee, D., Moon, T.: Uncertainty-based continual learning with adaptive regularization. In: NeurIPS (2019)
[2] Aljundi, R., Babiloni, F., Elhoseiny, M., Rohrbach, M., Tuytelaars, T.: Memory aware synapses: Learning what (not) to forget. In: ECCV. pp. 139–154 (2018)
[3] Aljundi, R., Chakravarty, P., Tuytelaars, T.: Expert gate: Lifelong learning with a network of experts. In: CVPR. pp. 3366–3375 (2017)
[4] Belouadah, E., Popescu, A.: Il2m: Class incremental learning with dual memory. In: ICCV. pp. 583–592 (2019)
[5] Belouadah, E., Popescu, A.: Scail: Classifier weights scaling for class incremental learning. In: WACV. pp. 1266–1275 (2020)
[6] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: ECCV. pp. 213–229 (2020)
[7] Castro, F.M., Marín-Jiménez, M.J., Guil, N., Schmid, C., Alahari, K.: End-to-end incremental learning. In: ECCV. pp. 233–248 (2018)
[8] Chaudhry, A., Dokania, P.K., Ajanthan, T., Torr, P.H.: Riemannian walk for incremental learning: Understanding forgetting and intransigence. In: ECCV. pp. 532–547 (2018)
[9] Chaudhry, A., Ranzato, M., Rohrbach, M., Elhoseiny, M.: Efficient lifelong learning with a-gem. In: ICLR (2018)
[10] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: CVPR. pp. 248–255 (2009)
[11] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. In: NAACL (2018)
[12] Dhar, P., Singh, R.V., Peng, K.C., Wu, Z., Chellappa, R.: Learning without memorizing. In: CVPR. pp. 5138–5146 (2019)
[13] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
[14] Douillard, A., Cord, M., Ollion, C., Robert, T., Valle, E.: Podnet: Pooled outputs distillation for small-tasks incremental learning. In: ECCV. pp. 86–102 (2020)
[15] Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. In: NeurIPS (2014)
[16] Hou, S., Pan, X., Loy, C.C., Wang, Z., Lin, D.: Lifelong learning via progressive distillation and retrospection. In: ECCV. pp. 437–452 (2018)
[17] Hou, S., Pan, X., Loy, C.C., Wang, Z., Lin, D.: Learning a unified classifier incrementally via rebalancing. In: CVPR. pp. 831–839 (2019)
[18] Hu, X., Tang, K., Miao, C., Hua, X.S., Zhang, H.: Distilling causal effect of data in class-incremental learning. In: CVPR. pp. 3957–3966 (2021)
[19] Kemker, R., Kanan, C.: Fearnet: Brain-inspired model for incremental learning. In: ICLR (2018)
[20] Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A.A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al.: Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114(13), 3521–3526 (2017)
[21] Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)
[22] Lee, K., Lee, K., Shin, J., Lee, H.: Overcoming catastrophic forgetting with unlabeled data in the wild. In: ICCV. pp. 312–321 (2019)
[23] Li, X., Zhou, Y., Wu, T., Socher, R., Xiong, C.: Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. In: ICML. pp. 3925–3934 (2019)
[24] Li, Z., Hoiem, D.: Learning without forgetting. In: ECCV. pp. 614–629 (2016)
[25] Liu, Y., Su, Y., Liu, A.A., Schiele, B., Sun, Q.: Mnemonics training: Multi-class incremental learning without forgetting. In: CVPR. pp. 12245–12254 (2020)
[26] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV (2021)
[27] Lopez-Paz, D., Ranzato, M.: Gradient episodic memory for continual learning. In: NeurIPS. pp. 6467–6476 (2017)
[28] Parisi, G.I., Kemker, R., Part, J.L., Kanan, C., Wermter, S.: Continual lifelong learning with neural networks: A review. Neural Networks 113, 54–71 (2019)
[29] Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018)
[30] Rannen, A., Aljundi, R., Blaschko, M.B., Tuytelaars, T.: Encoder based lifelong learning. In: ICCV. pp. 1320–1328 (2017)
[31] Rebuffi, S.A., Kolesnikov, A., Sperl, G., Lampert, C.H.: icarl: Incremental classifier and representation learning. In: CVPR. pp. 2001–2010 (2017)
[32] Shin, H., Lee, J.K., Kim, J., Kim, J.: Continual learning with deep generative replay. In: NeurIPS (2017)
[33] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML. pp. 10347–10357 (2021)
[34] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: NeurIPS. pp. 5998–6008 (2017)
[35] Wu, Y., Chen, Y., Wang, L., Ye, Y., Liu, Z., Guo, Y., Fu, Y.: Large scale incremental learning. In: CVPR. pp. 374–382 (2019)
[36] Xiao, T., Singh, M., Mintun, E., Darrell, T., Dollár, P., Girshick, R.: Early convolutions help transformers see better. In: NIPS (2021)
[37] Yan, S., Xie, J., He, X.: Der: Dynamically expandable representation for class incremental learning. In: CVPR. pp. 3014–3023 (2021)
[38] Yoon, J., Yang, E., Lee, J., Hwang, S.J.: Lifelong learning with dynamically expandable networks. In: ICLR (2017)
[39] Zenke, F., Poole, B., Ganguli, S.: Continual learning through synaptic intelligence. In: ICML. pp. 3987–3995 (2017)
[40] Zhao, B., Xiao, X., Gan, G., Zhang, B., Xia, S.T.: Maintaining discrimination and fairness in class incremental learning. In: CVPR. pp. 13208–13217 (2020)

Appendix 0.A Additional Experimental Results

In this section, we show additional experimental results on CIFAR-100. In addition, we compare with methods using dynamic model structure, and CNN methods with more data augmentation.

0.A.1 Results on CIFAR-100

In this section, we provide experimental results on CIFAR-100. Fig. 9 shows the incremental accuracy of each incremental step with 9 CIL settings. Our method outperforms all baselines by a clear margin on all CIL protocols. Moreover, between our method and baselines, the accuracy improvement of each incremental step increases as more classes come in. For example, for B0 C10 T2K setting, our improvement over LUCIR on initial step is 4.50% (94.50% vs 90.00%), while the improvement on final incremental step is 10.04% (56.28% vs 46.24%), and improvement of average incremental accuracy is 8.45% (70.11% vs 61.66%). This demonstrates that our performance boost of average incremental accuracy is larger than the boost of initial step.

0.A.2 Comparison with Methods of Dynamic Structure

In this section, we compare our method VITIL with methods using dynamic model structure. A recent state-of-the-art method, DER [37], learns a separate feature extractor for new classes of each incremental step, and concatenate them together for classification. As a result, its model structure changes in each incremental step, and its model size continuously grows as more classes come in. In contrast, our method is based on a fixed model structure, which is simpler than DER. Moreover, the model size is fixed in our method, while DER model size grows as more classes come in. In addition, since DER applies model pruning, the model has irregular structure and random memory access. To the contrary, our method is based on regular model structure.

We compare with the performance of DER reported in [37]. Tab. 6 shows comparison with DER on both average incremental accuracy and number of model parameters on ImageNet-100 and ImageNet-1000. Tab. 7 shows comparison on CIFAR-100. Out of total 7 CIL setups, our method outperforms DER for 5 setups. On the most challenging dataset, ImageNet-1000, our method achieves 3.96% (70.69% vs 66.73%) improvement on average incremental accuracy with setting B0 C100 T20K. Moreover, our model parameter size is 63.4% smaller than DER (5.31M vs 14.52M). This demonstrates the advantage of our method on large-scale dataset. For ImageNet-100, with B50 C5 R20 setting, our performance improvement is 1.26% (78.99% vs 77.73%), with model size 40.1% smaller than DER (5.31M vs 8.87M). For B0 C10 T2K setting, our model size is 30.1% smaller. For the 4 CIL protocols on CIFAR-100, our method outperforms DER on 3 protocols with smaller model size.

Table 6: Comparison between VITIL and DER on ImageNet-1000 and ImageNet-100 over average incremental accuracy (

\%

) and number of model parameters (M). Acc denotes average incremental accuracy. #Paras denotes number of model parameters. C5 denotes each incremental step adds 5 new classes. B50 denotes initial step contains 50 classes. R20 denotes each old class keeps 20 exemplars. T2K denotes each incremental step keeps total 2K exemplars.

Methods	ImageNet-1K		ImageNet-100
	C100		C5		C10
	B0		B50		B0
	T20K		R20		T2K
	Acc	#Paras	Acc	#Paras	Acc	#Paras
DER [37]	66.73	14.52	77.73	8.87	76.12	7.67
Ours	70.69	5.31	78.99	5.31	73.22	5.31

Table 7: Comparison between VITIL and DER on CIFAR-100 over average incremental accuracy (

\%

) and number of model parameters (M). Acc denotes average incremental accuracy. #Paras denotes number of model parameters. C10 denotes each incremental step adds 10 new classes. B50 denotes initial step contains 50 classes. R20 denotes each old class keeps 20 exemplars. T2K denotes each incremental step keeps total 2K exemplars.

Methods	C10				C5
	B50		B0		B50		B0
	R20		T2K		R20		T2K
	Acc	#Paras	Acc	#Paras	Acc	#Paras	Acc	#Paras
DER [37]	67.60	0.59	69.41	0.52	66.36	0.61	68.82	0.45
Ours	71.92	0.47	70.11	0.47	67.89	0.47	67.70	0.47

0.A.3 CNN with More Data Augmentation

In this section, we investigate if CNN-based method with same data augmentation in ViT model can achieve similar incremental learning performance. We add same data augmentation of ViT to LUCIR with ResNet-18, including Mixup, CutMix, soft label, and AutoAugmentation. Similarly, margin ranking loss does not apply anymore due to soft label, and is replaced by class balance finetuning. Results on ImageNet-100 and ImageNet-1000 are shown in Tab. 8. It does not show improvements over original LUCIR.

Table 8: Comparison between LUCIR w/ and w/o data augmentation in ViT models. Average incremental accuracy (

\%

) on ImageNet-100 ImageNet-1000. LUCIR+Aug denotes adding same data augmentation from ViT to LUCIR with ResNet-18. Results with

\ast

are reported directly from original paper.

Methods	ImageNet-100						ImageNet-1000
	C10			C5			C100
	B50	B0	B0	B50	B0	B0	B500	B0	B0
	R20	R20	T2K	R20	R20	T2K	R20	R20	T20K
LUCIR [17]	70.47^∗	55.33	62.02	68.09^∗	45.72	56.80	64.34^∗	57.14	60.65
LUCIR + Aug	69.26	50.65	55.30	66.25	43.73	53.80	62.91	52.55	53.69

Improving Vision Transformers for Incremental Learning