
Open-World Continual Learning: Unifying Novelty Detection and Continual Learning

Gyuhak Kim (equal contribution) gkim87@uic.edu, Changnan Xiao (equal contribution) changnanxiao@gmail.com, Tatsuya Konishi (work done while visiting Bing Liu's group at the University of Illinois Chicago; original affiliation: KDDI Research, Inc., 2-1-15 Ohara, Fujimino-shi, Saitama, 356-8502, Japan) tt-konishi@kddi.com, Zixuan Ke zke4@uic.edu, Bing Liu liub@uic.edu
Abstract

As AI agents are increasingly used in the real open world with unknowns or novelties, they need the ability to (1) recognize objects that they have learned before and detect items that they have never seen or learned, and (2) learn the new items incrementally to become more and more knowledgeable and powerful. (1) is called novelty detection or out-of-distribution (OOD) detection and (2) is called class incremental learning (CIL), which is a setting of continual learning (CL). In existing research, OOD detection and CIL are regarded as two completely different problems. This paper first provides a theoretical proof that good OOD detection for each task within the set of learned tasks (called closed-world OOD detection) is necessary for successful CIL. We show this by decomposing CIL into two sub-problems: within-task prediction (WP) and task-id prediction (TP), and proving that TP is correlated with closed-world OOD detection. The key theoretical result is that regardless of whether WP and OOD detection (or TP) are defined explicitly or implicitly by a CIL algorithm, good WP and good closed-world OOD detection are necessary and sufficient conditions for good CIL, which unifies novelty or OOD detection and continual learning (CIL, in particular). We call this traditional CIL the closed-world CIL as it does not detect future OOD data in the open world. The paper then proves that the theory can be generalized or extended to open-world CIL, which is the proposed open-world continual learning, that can perform CIL in the open world and detect future or open-world OOD data. Based on the theoretical results, new CIL methods are also designed, which outperform strong baselines in CIL accuracy and in continual OOD detection by a large margin.

keywords:
Open world learning, continual learning, OOD detection
journal: Artificial Intelligence
Affiliations:

University of Illinois Chicago, 851 S Morgan St, Chicago, Illinois 60607, United States

ByteDance, Building 24, Zone B, 1999 Yishan Road, Shanghai 201100, China

1 Introduction

The current dominant machine learning (ML) paradigm makes the closed-world assumption, which means that the classes of objects seen by the system in testing or deployment must have been seen during training [60, 59, 5, 14, 39], i.e., there is nothing novel occurring during testing or deployment. This assumption is invalid in practice as the real environment is an open world that is full of unknowns or novel objects. To make an AI agent thrive in the open world, it has to detect novelties and learn them incrementally to make the system more knowledgeable and adaptable over time. This process involves multiple activities, such as novelty/OOD detection, novelty characterization, adaptation, risk assessment, and continual learning of the detected novel items or objects [40, 41]. Novelty detection, also called out-of-distribution (OOD) detection, aims to detect unseen objects that the agent has not learned. On detecting novel objects or situations, the agent has to respond or adapt its actions. But in order to adapt, it must first characterize the novel object; without such a characterization, the agent would not know how to respond or adapt. For example, it may characterize a detected novel object as looking like a dog. Then, the agent may react like it would react to a dog. In the process, the agent also constantly assesses the risk of its actions. Finally, it also learns to recognize the new object incrementally so that it will not be surprised when it sees the same kind of object in the future. This incremental learning is called continual learning (CL) or lifelong learning [13, 27]. Note that before learning, the agent must obtain labeled training data, which can be collected by the agent through interaction with the environment or human users. This aspect is out of the scope of this paper. See [40, 41] for details.

This paper focuses only on the key learning aspects of the open world scenario: (1) OOD/novelty detection and (2) continual learning, more specifically class incremental learning (CIL) (see the definition below). In the research community, (1) and (2) are regarded as two completely different problems, but this paper theoretically unifies them by proving that good OOD detection for each task within the set of learned tasks, which we call closed-world OOD detection, is in fact necessary for CIL. Below, we define the concepts of OOD detection and continual learning.

Out-of-distribution (OOD) detection: Given the training data $\mathcal{D}=\{(x^{i},y^{i})\}_{i=1}^{n}$, where $n$ is the number of data samples, $x^{i}\in\mathbf{X}$ is an input sample and $y^{i}\in\mathbf{Y}$ (the set of all class labels in $\mathcal{D}$) is its class label, the goal is to build a classifier $f:\mathbf{X}\rightarrow\mathbf{Y}\cup\{O\}$ that can detect test instances that do not belong to any class in $\mathbf{Y}$ (called OOD detection); such instances are assigned to the class $O$. The classes in $\mathbf{Y}$ are often called the in-distribution (IND) classes.

We also call this open-world OOD detection. As we can see from the definition, an OOD detection algorithm can also classify test instances belonging to $\mathbf{Y}$ to their respective classes, which is called IND classification, although most OOD detection papers do not report the IND classification results.
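As a simple illustration of this definition (our own sketch, not a method from this paper), such a classifier $f$ can be realized by thresholding a confidence score such as the maximum softmax probability; the threshold value and the function names below are assumptions made only for the example.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def predict_with_ood(logits, threshold=0.5):
    """Return an IND class index, or the string 'O' if the maximum
    softmax probability falls below the (assumed) threshold."""
    probs = softmax(logits)
    return int(np.argmax(probs)) if probs.max() >= threshold else "O"

# Example: a confident IND prediction vs. an OOD rejection.
print(predict_with_ood(np.array([4.0, 0.1, -1.0])))   # -> 0
print(predict_with_ood(np.array([0.2, 0.1, 0.15])))   # -> 'O'
```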

Continual learning (CL) aims to incrementally learn a sequence of tasks. Each task consists of a set of classes to be learned together (the set may contain only a single class). Once a task is learned, its training data (at least a majority of it) is no longer accessible. Thus, unlike multitask learning, in learning a new task, CL will not be able to use the data of the previous tasks. A major challenge of CL is catastrophic forgetting (CF), which refers to the phenomenon that in learning a new task, the neural network model parameters need to be modified, which may corrupt the knowledge learned for previous tasks in the network and cause performance degradation for the previous tasks [47]. Although many CL techniques have been proposed, they are mainly empirical. Limited theoretical work has been done on how to solve CL. This paper performs such a theoretical study about the necessary and sufficient conditions for effective CL. Two main CL settings have been extensively studied: class incremental learning (CIL) and task incremental learning (TIL) [68]. In CIL, the learning process builds a single classifier for all tasks/classes learned so far. In testing, a test instance from any class may be presented for the model to classify. No prior task information (e.g., task-id) of the test instance is provided. Formally, CIL is defined as follows.

Class incremental learning (CIL). CIL learns a sequence of tasks, $1,2,\cdots$. Let $T$ be the number of tasks that have been learned so far. Each task $k$ ($1\leq k\leq T$) has a training dataset $\mathcal{D}_{k}=\{(x_{k}^{i},y_{k}^{i})\}_{i=1}^{n_{k}}$, where $n_{k}$ is the number of data samples in task $k$, $x_{k}^{i}\in\mathbf{X}$ is an input sample and $y_{k}^{i}\in\mathbf{Y}_{k}$ (the set of all classes of task $k$) is its class label. All $\mathbf{Y}_{k}$'s are disjoint ($\mathbf{Y}_{k}\cap\mathbf{Y}_{k^{\prime}}=\emptyset,\,\forall k\neq k^{\prime}$) and $\bigcup_{k=1}^{T}\mathbf{Y}_{k}=\mathbf{Y}$. The goal of CIL is to construct a single predictive function or classifier $f:\mathbf{X}\rightarrow\mathbf{Y}$ that can identify the class label $y$ of each given test instance $x$ from the $T$ tasks.

Based on CIL, we can also define the term closed-world OOD detection.

Closed-world OOD detection: Closed-world OOD detection for a given task $k$ among the $T$ tasks that have been learned so far is OOD detection regarding the classes of task $k$ as the IND classes and those of the other $T-1$ tasks as the OOD classes.

From now on when we refer to OOD detection on its own (which is open-world OOD detection), we mean it is not limited to the $T$ learned tasks, as opposed to the closed-world OOD detection. Clearly, (open-world) OOD detection encompasses closed-world OOD detection, but not vice versa.

Unlike CIL, each task in TIL is a separate or independent classification problem. For example, one task could be to classify different breeds of dogs and another task could be to classify different types of animals (the tasks may not be disjoint). One model is built for each task in a shared network. In testing, the task-id of each test instance is provided and the system uses only the specific model for the task (dog or animal classification) to classify the test instance. Formally, TIL is defined as follows.

Task incremental learning (TIL). TIL learns a sequence of tasks, $1,2,\cdots$. Let $T$ be the number of tasks that have been learned so far. Each task $k$ ($1\leq k\leq T$) has a training dataset $\mathcal{D}_{k}=\{((x_{k}^{i},k),y_{k}^{i})\}_{i=1}^{n_{k}}$, where $n_{k}$ is the number of data samples in task $k\in\mathbf{T}=\{1,2,...,T\}$, $x_{k}^{i}\in\mathbf{X}$ is an input sample and $y_{k}^{i}\in\mathbf{Y}_{k}\subset\mathbf{Y}$ is its class label. The goal of TIL is to construct a predictor $f:\mathbf{X}\times\mathbf{T}\rightarrow\mathbf{Y}$ to identify the class label $y\in\mathbf{Y}_{k}$ for $(x,k)$ (the given test instance $x$ from task $k$).

This paper focuses on CIL, which involves incrementally learning new or novel object classes—a key aspect of open-world learning. While the proposed methods are also applicable to TIL, we do not address it in this paper. TIL is generally simpler, and several existing techniques can achieve it without CF [61, 69]. In contrast, CIL remains highly challenging due to the difficulty of Inter-task Class Separation (ICS), i.e., establishing decision boundaries between classes from the new task and those from previous tasks in learning the new task without accessing the training data of previous tasks.

Problem Statement (open-world continual learning): Open-world continual learning (OWCL) is defined as CIL with the OOD detection capability. We also call it open-world CIL or CIL+. At any time, the resulting open-world CIL model can classify test instances belonging to the classes in the $T$ tasks that have been learned so far to their respective classes and also detect OOD instances that do not belong to any of the learned classes so far.

Note that OOD detection in CIL+ is different from traditional OOD detection (which sees the full IND data together) because, in CIL+, the model does not see all the IND data together. Instead, the IND data comes in a sequence of tasks incrementally, and in learning each task, the model does not see any data (or only a very small sample) of the old or previous tasks.

Main contributions: This paper makes three main contributions. First, it theoretically proves the necessary and sufficient conditions for solving the CIL problem. A good closed-world OOD detection performance is one of the necessary conditions, which connects or unifies OOD detection and CIL. Since in this traditional CIL, the test instances are assumed to be from one of the $T$ tasks that have been learned, we call the existing CIL the closed-world CIL. Second, we prove that the theory can naturally be generalized or extended to the open-world CIL, which is the proposed open-world continual learning. Open-world CIL can perform CIL in the open world and detect OOD test data that do not belong to any of the $T$ tasks learned so far. Third, based on the theory, several new CIL algorithms are designed, which are also able to detect novel (or OOD) instances for the open-world continual learning (OWCL) setting. Note that from here onward, when we do not explicitly say open-world CIL, CIL means the traditional CIL.

Theory. We conduct a theoretical study of CIL, which is applicable to any CIL classification model. Instead of focusing on the traditional PAC generalization bound [53] or neural tangent kernel (NTK) [26], we focus on how to solve the CIL problem. We first decompose the CIL problem into two sub-problems in a probabilistic framework: Within-task Prediction (WP) and Task-id Prediction (TP). WP means that the prediction for a test instance is only made within the classes of the task to which the test instance belongs, which is basically the TIL problem. TP predicts the task-id. TP is needed because, in CIL, task-id is not provided at test time. This paper then proves based on cross-entropy loss that (i) the CIL performance is bounded by WP and TP performances, and (ii) TP and closed-world OOD detection performance bound each other. This paper further generalizes the result to open-world CIL (or CIL+). These results unify CIL and OOD detection.

Key theoretical results: Regardless of whether WP and TP or OOD detection are defined explicitly or implicitly by a closed-world or open-world CIL algorithm, (1) good WP and good TP or closed-world OOD detection are necessary and sufficient conditions for good closed-world CIL performance and (2) good WP and good TP or open-world OOD detection are necessary and sufficient conditions for good open-world CIL performance. (This result applies to both batch/offline and online/stream CIL, and to CIL problems with blurry task boundaries, where some training data of a task may come later together with a future task.)

The intuition behind the theory is simple: if a closed-world or open-world CIL model is perfect at detecting OOD samples for each task, which solves the ICS problem, then closed-world or open-world CIL reduces to WP, which is the traditional single-task supervised learning for each task. Note that many OOD detection algorithms can also perform IND classification, which is WP.

New CIL Algorithms for OWCL. The theoretical result provides principled guidance for solving the (closed-world or open-world) CIL problem. Based on the theory, several new CIL methods are designed. (1) The first few methods integrate a TIL method and an OOD detection method, which outperform strong baselines in both the CIL and TIL settings by a large margin. This combination is attractive because TIL has achieved no forgetting, and we only need a strong OOD detection technique that can perform both IND prediction and OOD detection to learn each task to achieve strong CIL results. We do not propose a new OOD detection method as there are numerous such methods in the literature. We use two existing ones. (2) Another method is based on a pre-trained model and an OOD replay technique, which performs even better, outperforming existing baselines markedly in both CIL and OOD detection in the OWCL setting.

2 Related Work

Although a large number of algorithms have been proposed to solve the CIL problem, they are mainly empirical. Two papers have studied the traditional PAC generalization bound [53] or NTK [26] for CL, but they do not address how to solve the CL problem. This paper focuses on how to solve the CIL problem. To the best of our knowledge, no prior work has proposed a theory on how to solve CIL, and none of the existing work has connected CIL and OOD detection. Our work shows that a good CIL algorithm can naturally perform OOD detection in the open world. Below, we first survey four popular families of CL approaches, which are mainly for overcoming catastrophic forgetting (CF). We then discuss related works on open-world learning.

Regularization-based methods prevent forgetting by restricting the learned parameters for previous tasks from being modified significantly by using a regularization term to penalize such changes [31, 76] or to regularize the learned representations or outputs so that they are not far from those of the previously learned network [37, 77].

Replay-based methods [56, 10, 7, 6, 72] mitigate forgetting by saving a small amount of training data from previous tasks in a memory buffer and jointly training the network using the current data and the previous task data saved in the memory. Some methods in this family also study which samples in the memory should be used in replaying [2] or which samples in the training data should be saved for later replaying [56, 43].

Generative methods construct a generative network to generate raw training data [62, 48, 3]. The generated data are used with the current task training data to jointly train the classification network. [77] generates features instead of raw data. The generated samples in these methods are used to prevent forgetting in both the generative network and the classification network.

Parameter-isolation methods [61, 69] train a set of task-specific parameters to effectively protect the important parameters of each task from being updated, and thus have almost no forgetting. A limitation of the approach is that the correct task-id of each test instance must be known to the system to select the corresponding task-specific parameters at inference. These methods are thus mainly used for task incremental learning (TIL). Some CIL methods [1, 49, 54, 22] also use these techniques and have separate mechanisms to predict the task-id (more on this below). However, their CIL performances are far below those of recent replay-based counterparts (see Sections 4.2 and 5.2 for details). Two of our proposed CIL methods also use two parameter-isolation methods (HAT [61] and SupSup [69]) for TIL as one of the components.

Using a TIL method for CIL means that CIL is decomposed into WP and TP. Task-id prediction (TP) is the key challenge. For example, CCG [1] constructs an additional network to predict the task-id. iTAML [54] identifies the task-id of the test data in a batch. A serious limitation of this is that it requires the test data in a batch to belong to the same task. Our methods are different as they can predict for a single test instance at a time. HyperNet [49] and PR [22] propose an entropy-based task-id prediction method. SupSup [69] predicts task-id by finding the optimal superpositions at inference. However, these methods perform poorly because they either do not know that OOD detection is the key for accurate task-id prediction or their task models are not built for OOD detection. It is also important to note that our theory does not explicitly predict task-id. Instead, it uses the TP probability and WP probability for test prediction.

Several papers have explicitly or implicitly indicated the use of OOD detection for task-id prediction in continual learning. For example, the CIL method in [24] is based on one-class classification, which is OOD detection with only a single class as the in-distribution (IND) class. In [22], the authors proposed an uncertainty-based OOD detection framework for task-id prediction. Two specific methods were presented. One uses entropy to quantify the uncertainty (which has also been used in some other systems discussed above) and the other, called agree, selects the task that leads to the highest agreement in predictions across task models. There are also related works that did not explicitly make a connection between CIL and OOD detection, but their methods implicitly imply it. For example, [72] uses a regularization similar to OOD detection, which employs the replay data from previous tasks as OOD samples. [67] proposed to train a VAE model for each class to be learned. It then estimates the likelihood $p(x|y)$ and uses the Bayes rule to predict the class ($y$) of each test instance ($x$). Our work makes a theoretical contribution by formally connecting CIL and OOD detection and proving that for a good CIL performance, a good OOD detection capability for each task is necessary.

Open world learning has been studied by many researchers [60, 59, 5, 14, 39, 41]. However, the existing research mainly focuses on novelty detection, also called open set recognition or out-of-distribution (OOD) detection. Some researchers have also studied learning the novel objects after they are detected and manually labeled [5, 14, 71, 25]. However, none of them perform continual learning, which has the additional challenges of catastrophic forgetting (CF) and inter-task class separation (ICS). Several researchers have also studied other related tasks in addition to novelty detection, e.g., characterization of novelties and adaptation to novelties to maximize task performance [46, 65]. Again, these works are not about continual learning. Excellent surveys of novelty/OOD detection and open-world learning can be found in [73, 51, 52, 25]. [16] performed novelty detection and also continual learning, but its continual learning uses a regularization-based method, which is weak because it suffers from serious forgetting. A position paper [33] recently presented some nice blue-sky ideas about open-world learning, but it does not propose or implement any algorithm.

Our proposed algorithms are quite different. In training, based on our theory, we use two existing OOD detection methods to verify that our theory can guide us to design new and much more effective CIL algorithms. In testing, our OOD detection is in the open-world continual learning (OWCL) setting, which has been described in the introduction section.

Several researchers have studied novel class discovery [17], which is defined as discovering the hidden classes in the detected novel or OOD instances. Our work does not perform this function. We assume that the training data for each new task is given. Performing automatic class label discovery is still very challenging as in many cases, the class assignments can be subjective and are determined by human users. For example, for a dog, whether it should just be labeled as a dog or a specific breed of dog is a subjective decision and depends on applications.

Some existing works have combined OOD detection and continual learning [16, 18, 57]. These papers use OOD score thresholds to determine OOD instances and also do continual learning afterward. However, their continual learning still assumes that the training data are given as it is hard to do real-time detection and learning. This is because, without human involvement, it is impossible to obtain novel class labels in general and verify the correctness of OOD detection results. Any error in OOD detection will propagate to the continual learning phase. [40, 41] reported a continual learning chatbot that can detect novel user utterances that the system does not understand and chat with the user through its novelty characterization mechanism to get the ground truth. However, this system is based on saving new/novel utterances and performing matching and retrieval in subsequent chatting. [46] proposed an integrated online architecture that combines and extends probabilistic programming and planning to (1) detect novelty, (2) incrementally characterize the novelty, and (3) continually adapt its task-based reasoning to the evolving understanding of the novelty to maximize task performance. However, this work is not about continual learning. [50] also reported a system for continuous emotion novelty detection.

3 A Theoretical Study on Solving CIL

This section presents our theory for solving CIL, which also covers novelty or OOD detection. It first shows that the CIL performance improves if the within-task prediction (WP) performance and/or the task-id prediction (TP) performance improve, and then shows that TP and OOD detection bound each other, which indicates that CIL performance is controlled by WP and OOD detection. This connects CIL and OOD detection. After that, we study the necessary conditions for a good CIL model, which include a good WP and a good TP or OOD detection. In the first four sub-sections, we focus on the traditional CIL that is limited to the $T$ tasks that have been learned so far, which we also call closed-world CIL. OOD detection in this context is also within the $T$ learned tasks and is called closed-world OOD detection (see Section 1). For simplicity in presentation, we will not add closed-world before CIL or OOD detection below. In Section 3.5, we generalize/extend the theory to open-world CIL or open-world continual learning, which will also detect OOD data that do not belong to any of the $T$ tasks learned so far. Table 1 gives the list of acronyms used in the paper.

Table 1: Acronyms used in the paper
CL Continual learning
CIL Class incremental learning
TIL Task incremental learning
OOD Out-of-distribution
IND In-distribution
WP Within-task prediction
TP Task-id prediction
CF Catastrophic forgetting
OWCL Open-world continual learning
NTK Neural tangent kernel
AUC Area Under the ROC Curve

3.1 CIL Problem Decomposition

This sub-section first presents the assumptions made by CIL based on its definition and then proposes a decomposition of the CIL problem into two sub-problems. Assume that a CIL system has learned a sequence of $T$ tasks $\{(\mathbf{X}_{k},\mathbf{Y}_{k})\}_{k=1,\dots,T}$ so far, where $\mathbf{X}_{k}$ is the domain of task $k$ and $\mathbf{Y}_{k}=\{\mathbf{Y}_{k,j}\}$ is the set of classes of task $k$, with $j$ indicating the $j$th class of task $k$. Let $\mathbf{X}_{k,j}$ be the domain of the $j$th class of task $k$, where $\mathbf{X}_{k}=\bigcup_{j}\mathbf{X}_{k,j}$. For precision, we will use $x\in\mathbf{X}_{k,j}$ instead of $\mathbf{Y}_{k,j}$ in the probabilistic analysis. Based on the definition of class incremental learning (CIL) (Section 1), the following assumptions are implied:

Assumption 1.

The domains of classes of the same task are disjoint, i.e., $\mathbf{X}_{k,j}\cap\mathbf{X}_{k,j^{\prime}}=\emptyset,\,\forall j\neq j^{\prime}$.

Assumption 2.

The domains of tasks are disjoint, i.e., $\mathbf{X}_{k}\cap\mathbf{X}_{k^{\prime}}=\emptyset,\,\forall k\neq k^{\prime}$.

For any ground event $D$, the goal of a CIL problem is to learn $\mathbf{P}(x\in\mathbf{X}_{k,j}|D)$. This can be decomposed into two probabilities, the within-task IND prediction (WP) probability and the task-id prediction (TP) probability. The WP probability is $\mathbf{P}(x\in\mathbf{X}_{k,j}|x\in\mathbf{X}_{k},D)$ and the TP probability is $\mathbf{P}(x\in\mathbf{X}_{k}|D)$. We can rewrite the CIL problem using WP and TP based on the two assumptions,

$$\begin{aligned}
\mathbf{P}(x\in\mathbf{X}_{k_{0},j_{0}}|D) &= \sum_{k=1,\dots,T}\mathbf{P}(x\in\mathbf{X}_{k,j_{0}}|x\in\mathbf{X}_{k},D)\,\mathbf{P}(x\in\mathbf{X}_{k}|D) && (1)\\
&= \mathbf{P}(x\in\mathbf{X}_{k_{0},j_{0}}|x\in\mathbf{X}_{k_{0}},D)\,\mathbf{P}(x\in\mathbf{X}_{k_{0}}|D) && (2)
\end{aligned}$$

where $k_{0}$ denotes a particular task and $j_{0}$ a particular class in that task.
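As a concrete illustration of Eq. 2 (a toy sketch with assumed numbers, not an algorithm from this paper), the CIL probabilities can be obtained by multiplying each task's WP probabilities by its TP probability and concatenating the results over tasks:

```python
import numpy as np

# Assumed toy setting: T = 2 tasks with 2 classes each.
# wp[k][j] = P(x in X_{k,j} | x in X_k, D), the WP probability for task k.
wp = [np.array([0.9, 0.1]), np.array([0.3, 0.7])]
# tp[k] = P(x in X_k | D), the TP probability.
tp = np.array([0.8, 0.2])

# Eq. 2: P(x in X_{k,j} | D) = WP * TP, concatenated over tasks.
cil = np.concatenate([wp[k] * tp[k] for k in range(len(wp))])
print(cil)            # [0.72 0.08 0.06 0.14]
print(cil.sum())      # sums to 1 when WP and TP are proper distributions
k, j = divmod(int(np.argmax(cil)), 2)
print(k, j)           # predicted task-id and within-task class
```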

Some remarks are in order about Eq. 2 and our subsequent analysis to set the stage.

Remark 1.

Eq. 2 shows that if we can improve either the WP or TP performance, or both, we can improve the CIL performance.

Remark 2.

It is important to note that our theory is not concerned with the learning algorithm or training process. But we will propose some concrete CIL algorithms based on the theoretical result in the experiment section.

Remark 3.

We note that the CIL definition and the subsequent analysis apply to tasks with any number of classes (including only one class per task) and to online CIL, where the training data for each task or class comes gradually in a data stream and may also cross task boundaries (blurry tasks [4]), because our analysis is based on an already-built CIL model after training. Regarding blurry task boundaries, suppose dataset 1 has classes {dog, cat, tiger} and dataset 2 has classes {dog, computer, car}. We can define task 1 as {dog, cat, tiger} and task 2 as {computer, car}. The shared class dog in dataset 2 is simply additional training data for dog that appears after task 1.

Remark 4.

CIL = WP * TP in Eq. 2 means that when we have WP and TP (defined either explicitly or implicitly by implementation), we can find a corresponding CIL model defined by WP * TP. Similarly, when we have a CIL model, we can find the corresponding underlying WP and TP defined by their probabilistic definitions.

In the following sub-sections, we develop this further concretely to derive the sufficient and necessary conditions for solving the CIL problem in the context of cross-entropy loss as it is used in almost all supervised CIL systems.

3.2 CIL Improves as WP and/or TP Improve

As stated in Remark 2 above, the study here is based on a trained CIL model and not concerned with the algorithm used in training the model. We use cross-entropy as the performance measure of a trained model as it is the most popular loss function used in supervised CL. For experimental evaluation, we use accuracy following CL papers. Denote the cross-entropy of two probability distributions $p$ and $q$ as

$$H(p,q)\overset{def}{=}-\mathbb{E}_{p}[\log q]=-\sum_{i}p_{i}\log q_{i}. \qquad (3)$$

For any $x\in\mathbf{X}$, let $y$ be the CIL ground truth label of $x$, where $y_{k_{0},j_{0}}=1$ if $x\in\mathbf{X}_{k_{0},j_{0}}$ and $y_{k,j}=0$ otherwise, $\forall(k,j)\neq(k_{0},j_{0})$. Let $\tilde{y}$ be the WP ground truth label of $x$, where $\tilde{y}_{k_{0},j_{0}}=1$ if $x\in\mathbf{X}_{k_{0},j_{0}}$ and $\tilde{y}_{k_{0},j}=0$ otherwise, $\forall j\neq j_{0}$. Let $\bar{y}$ be the TP ground truth label of $x$, where $\bar{y}_{k_{0}}=1$ if $x\in\mathbf{X}_{k_{0}}$ and $\bar{y}_{k}=0$ otherwise, $\forall k\neq k_{0}$. Denote

$$\begin{aligned}
H_{WP}(x) &= H(\tilde{y},\{\mathbf{P}(x\in\mathbf{X}_{k_{0},j}|x\in\mathbf{X}_{k_{0}},D)\}_{j}), && (4)\\
H_{CIL}(x) &= H(y,\{\mathbf{P}(x\in\mathbf{X}_{k,j}|D)\}_{k,j}), && (5)\\
H_{TP}(x) &= H(\bar{y},\{\mathbf{P}(x\in\mathbf{X}_{k}|D)\}_{k}) && (6)
\end{aligned}$$

where $H_{WP}$, $H_{CIL}$, and $H_{TP}$ are the cross-entropy values of WP, CIL, and TP, respectively. We now present our first theorem. The theorem connects CIL to WP and TP and suggests that by having a good WP or TP, the CIL performance improves as the upper bound for the CIL loss decreases.

Theorem 1.

If $H_{TP}(x)\leq\delta$ and $H_{WP}(x)\leq\epsilon$, we have $H_{CIL}(x)\leq\epsilon+\delta$.

The detailed proof is given in A.1. This theorem holds regardless of whether WP and TP are trained together or separately. When they are trained separately, if WP is fixed and we let $\epsilon=H_{WP}(x)$, then $H_{CIL}(x)\leq H_{WP}(x)+\delta$, which means that if TP is better, CIL is better. Similarly, if TP is fixed, we have $H_{CIL}(x)\leq\epsilon+H_{TP}(x)$. When they are trained concurrently, there exists a functional relationship between $\epsilon$ and $\delta$ depending on implementation. But no matter what it is, when $\epsilon+\delta$ decreases, CIL gets better.
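The following toy sketch (with assumed numbers; only an illustration, not part of the proof) computes $H_{WP}$, $H_{TP}$, and $H_{CIL}$ for a CIL model defined by Eq. 2 and checks the bound of Theorem 1:

```python
import numpy as np

def cross_entropy(true_idx, probs):
    # H(one-hot(true_idx), probs) = -log probs[true_idx]
    return -np.log(probs[true_idx])

# Assumed toy example: 2 tasks, 2 classes each; x belongs to task 0, class 0.
wp = [np.array([0.9, 0.1]), np.array([0.3, 0.7])]
tp = np.array([0.8, 0.2])
cil = np.concatenate([wp[k] * tp[k] for k in range(2)])   # Eq. 2

h_wp  = cross_entropy(0, wp[0])   # WP loss within the true task
h_tp  = cross_entropy(0, tp)      # TP loss over tasks
h_cil = cross_entropy(0, cil)     # CIL loss over all classes

print(h_wp, h_tp, h_cil)
assert h_cil <= h_wp + h_tp + 1e-12   # Theorem 1: H_CIL <= epsilon + delta
```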

Theorem 1 holds for any $x\in\mathbf{X}=\bigcup_{k}\mathbf{X}_{k}$ that satisfies $H_{TP}(x)\leq\delta$ or $H_{WP}(x)\leq\epsilon$. To measure the overall performance under expectation, we present the following corollary.

Corollary 1.

Let $U(\mathbf{X})$ represent the uniform distribution on $\mathbf{X}$. i) If $\mathbb{E}_{x\sim U(\mathbf{X})}[H_{TP}(x)]\leq\delta$, then $\mathbb{E}_{x\sim U(\mathbf{X})}[H_{CIL}(x)]\leq\mathbb{E}_{x\sim U(\mathbf{X})}[H_{WP}(x)]+\delta$. Similarly, ii) if $\mathbb{E}_{x\sim U(\mathbf{X})}[H_{WP}(x)]\leq\epsilon$, then $\mathbb{E}_{x\sim U(\mathbf{X})}[H_{CIL}(x)]\leq\epsilon+\mathbb{E}_{x\sim U(\mathbf{X})}[H_{TP}(x)]$.

The proof is given in A.2. The corollary is a direct extension of Theorem 1 in expectation. The implication is that given TP performance, CIL is positively related to WP. The better the WP is, the better the CIL is as the upper bound of the CIL loss decreases. Similarly, given WP performance, a better TP performance results in a better CIL performance. Due to the positive relation, we can improve CIL by improving either WP or TP using their respective methods developed in each area.

3.3 Task Prediction (TP) to OOD Detection

Building on Eq. 2, we studied the relationship among CIL, WP, and TP in Theorem 1. We now connect TP and OOD detection and show that they bound each other up to a constant factor.

We again use the cross-entropy $H$ to measure the performance of TP and OOD detection of a trained network as in Section 3.2. To build the connection between $H_{TP}(x)$ and the OOD detection of each task, we first define the notation for OOD detection. We use $\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D)$ to represent the probability predicted by the $k$th task's OOD detector. Notice that the task prediction (TP) probability distribution $\mathbf{P}(x\in\mathbf{X}_{k}|D)$ is a categorical distribution over $T$ tasks, while the OOD detection probability distribution $\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D)$ is a Bernoulli distribution. For any $x\in\mathbf{X}$, define

$$H_{OOD,k}(x)=\begin{cases}
H(1,\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D)) = -\log\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D), & x\in\mathbf{X}_{k},\\
H(0,\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D)) = -\log\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}\backslash\mathbf{X}_{k}|D), & x\in\mathbf{X}\backslash\mathbf{X}_{k}.
\end{cases} \qquad (7)$$

Note that the OOD detection here is closed-world OOD detection, but for presentation simplicity, we still use the term OOD detection below. In CIL, the OOD detection probability for a task can be defined using the output values corresponding to the classes of the task. Examples of such functions are the sigmoid of the maximum logit value and the maximum softmax probability re-scaled to [0, 1]. It is also possible to define the OOD detector directly as a function of tasks instead of a function of the output values of all classes of the tasks, e.g., via the Mahalanobis distance. The following theorem shows that TP and OOD detection bound each other.

Theorem 2.

i) If $H_{TP}(x)\leq\delta$, let $\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D)=\mathbf{P}(x\in\mathbf{X}_{k}|D)$; then $H_{OOD,k}(x)\leq\delta,\ \forall\,k=1,\dots,T$. ii) If $H_{OOD,k}(x)\leq\delta_{k},\ k=1,\dots,T$, let $\mathbf{P}(x\in\mathbf{X}_{k}|D)=\frac{\mathbf{P}_{k}^{\prime}(x\in\mathbf{X}_{k}|D)}{\sum_{k}\mathbf{P}_{k}^{\prime}(x\in\mathbf{X}_{k}|D)}$; then $H_{TP}(x)\leq(\sum_{k}\mathbf{1}_{x\in\mathbf{X}_{k}}e^{\delta_{k}})(\sum_{k}1-e^{-\delta_{k}})$, where $\mathbf{1}_{x\in\mathbf{X}_{k}}$ is an indicator function.

See A.3 for the proof. As we use cross-entropy, the lower the bound, the better the performance. The first statement (i) says that the OOD detection performance improves if the TP performance gets better (i.e., lower $\delta$). Similarly, the second statement (ii) says that the TP performance improves if the OOD detection performance on each task improves (i.e., lower $\delta_{k}$). Besides, since $(\sum_{k}\mathbf{1}_{x\in\mathbf{X}_{k}}e^{\delta_{k}})(\sum_{k}1-e^{-\delta_{k}})$ converges to 0 as the $\delta_{k}$'s converge to 0 in the order of $O(|\sum_{k}\delta_{k}|)$, we further know that $H_{TP}$ and $\sum_{k}H_{OOD,k}$ are equivalent in quantity up to a constant factor.
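To make the construction in Theorem 2(ii) concrete, the sketch below (illustrative names and toy numbers, not the paper's implementation) first converts each task's class logits into a Bernoulli-style OOD probability $\mathbf{P}^{\prime}_{k}$ using the sigmoid of the maximum logit, one of the example definitions given before Theorem 2, and then normalizes these probabilities into the TP distribution:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def task_ood_probability(logits_k):
    """P'_k(x in X_k | D): here the sigmoid of the task's maximum logit,
    one of the example definitions mentioned in the text."""
    return sigmoid(np.max(logits_k))

def tp_from_ood(ood_probs):
    """Theorem 2(ii): P(x in X_k | D) = P'_k / sum_k P'_k."""
    p = np.asarray(ood_probs, dtype=float)
    return p / p.sum()

# Assumed per-task logits for one test instance over T = 3 learned tasks.
task_logits = [np.array([3.2, -1.0]), np.array([-0.5, 0.1]), np.array([-2.0, -1.5])]
ood_probs = [task_ood_probability(l) for l in task_logits]
tp = tp_from_ood(ood_probs)
print(np.round(tp, 3))          # task-id prediction distribution
print(int(np.argmax(tp)))       # predicted task-id
```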

For the traditional CIL, Theorem 1 studied how CIL is related to WP and TP. Theorem 2 showed that TP and OOD detection bound each other. Now we explicitly give the upper bound of CIL in relation to WP and OOD detection of each task. The detailed proof can be found in A.5.

Theorem 3.

If $H_{OOD,k}(x)\leq\delta_{k},\,k=1,\dots,T$ and $H_{WP}(x)\leq\epsilon$, we have

$$H_{CIL}(x)\leq\epsilon+(\sum_{k}\mathbf{1}_{x\in\mathbf{X}_{k}}e^{\delta_{k}})(\sum_{k}1-e^{-\delta_{k}}),$$

where $\mathbf{1}_{x\in\mathbf{X}_{k}}$ is an indicator function.

3.4 Necessary Conditions for Improving CIL

In Theorem 1, we showed that good performances of WP and TP are sufficient to guarantee a good performance of CIL. In Theorem 3, we showed that good performances of WP and OOD detection are sufficient to guarantee a good performance of CIL. Again, for simplicity, OOD detection here refers to the closed-world OOD detection. For completeness, we study the necessary conditions of a well-performed CIL in this sub-section.

Theorem 4.

If $H_{CIL}(x)\leq\eta$, then there exist i) a WP, s.t. $H_{WP}(x)\leq\eta$, ii) a TP, s.t. $H_{TP}(x)\leq\eta$, and iii) an OOD detector for each task, s.t. $H_{OOD,k}(x)\leq\eta,\,k=1,\dots,T$.

The detailed proof is given in A.6. This theorem tells us that if a good CIL model is trained, then a good WP, a good TP, and a good OOD detector for each task are always implied. More importantly, by transforming Theorem 4 into its contrapositive, we have the following statements: If for any WP, $H_{WP}(x)>\eta$, then $H_{CIL}(x)>\eta$. If for any TP, $H_{TP}(x)>\eta$, then $H_{CIL}(x)>\eta$. If for any OOD detector, $H_{OOD,k}(x)>\eta,\,k=1,\dots,T$, then $H_{CIL}(x)>\eta$. Regardless of whether WP and TP (or OOD detection) are defined explicitly or implicitly by a CIL algorithm, the existence of a good WP and the existence of a good TP or OOD detection are necessary conditions for good CIL performance. Note that the OOD detection here is closed-world OOD detection.

Remark 5.

It is important to note again that our study in this section is based on a CIL model that has already been built. In other words, our study tells the CIL designers what should be achieved in the final model. Clearly, one would also like to know how to design a strong CIL model based on the theoretical results, which also considers catastrophic forgetting (CF). One effective method is to make use of a strong existing TIL algorithm, which can already achieve no or little forgetting (CF), and combine it with a strong OOD detection algorithm (as mentioned earlier, most OOD detection methods can also perform WP). Thus, any improved method from the OOD detection community can be applied to CIL to produce improved CIL systems (see Sections 4.2.3 and 4.2.4).

3.5 Generalization to Open-World Continual Learning

As mentioned at the beginning of this section, the first four subsections focused on the traditional closed-world CIL. This subsection generalizes or extends the theory to the open-world CIL, denoted by CIL+. CIL+ is CIL with an additional pseudo-task, on top of the $T$ learned tasks, representing OOD detection beyond the $T$ tasks. We call this the OOD task; it has a single pseudo-class (called the OOD class), as we cannot predict the unseen class of an OOD sample because it is unknown. In this context, OOD detection is referred to as open-world OOD detection. For simplicity, we will continue using the term OOD detection.

We first note that Eq. 2 still applies because CIL+ only adds a new OOD task with one OOD class. Theorem 1 for the closed-world CIL can be extended to the open-world CIL (CIL+) by replacing $H_{TP}$ with $H_{TP^{+}}$ and $H_{CIL}$ with $H_{CIL^{+}}$. $H_{WP}$ stays the same as the WP definition has no change in CIL+. The proof is trivially identical to the proof of Theorem 1. The key extension is to Theorem 2 of the traditional closed-world CIL so that test samples that do not belong to any of the $T$ already-learned tasks (i.e., OOD with respect to the $T$ tasks) can also be detected.

Theorem 2 can be generalized to CIL+ by changing the closed-world TP to open-world TP, denoted by TP+, which must now predict the additional OOD task.

We denote by $\mathbf{X}^{+}$ the open-world OOD domain beyond $\mathbf{X}$. For any $x\in\mathbf{X}\cup\mathbf{X}^{+}$, define

$$\begin{aligned}
H_{CIL^{+}}(x) &= H(y,\{\mathbf{P}(x\in\mathbf{X}_{k,j}|D)\}_{k,j}\cup\{\mathbf{P}(x\in\mathbf{X}^{+}|D)\}), && (8)\\
H_{TP^{+}}(x) &= H(\bar{y},\{\mathbf{P}(x\in\mathbf{X}_{k}|D)\}_{k}\cup\{\mathbf{P}(x\in\mathbf{X}^{+}|D)\}). && (9)
\end{aligned}$$

For any $x\in\mathbf{X}\cup\mathbf{X}^{+}$, define

$$H_{OOD^{+},k}(x)=\begin{cases}
H(1,\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D)) = -\log\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D), & x\in\mathbf{X}_{k},\\
H(0,\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D)) = -\log\mathbf{P}^{\prime}_{k}(x\in(\mathbf{X}\cup\mathbf{X}^{+})\backslash\mathbf{X}_{k}|D), & x\in(\mathbf{X}\cup\mathbf{X}^{+})\backslash\mathbf{X}_{k},
\end{cases} \qquad (10)$$

where OOD+ denotes the open-world OOD detection.

It is clear that open-world OOD detection implies closed-world OOD detection, but the reverse is not true. Since the classification in the closed-world CIL is limited to the $T$ tasks learned so far, it cannot derive open-world OOD detection but only closed-world OOD detection. Thus, only closed-world OOD detection is necessary for the traditional closed-world CIL.

We now generalize Theorem 2 to the open-world CIL (CIL+) setting with the following Corollary. The proof is given in A.4.

Corollary 2.

i) If $H_{TP^{+}}(x)\leq\delta$, let $\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D)=\mathbf{P}(x\in\mathbf{X}_{k}|D)$; then $H_{OOD^{+},k}(x)\leq\delta,\ \forall\,k=1,\dots,T$. ii) If $H_{OOD^{+},k}(x)\leq\delta_{k},\ k=1,\dots,T$, let $\mathbf{P}(x\in\mathbf{X}_{k}|D)=\frac{\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D)}{\sum_{k}\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D)+\prod_{k}(1-\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D))}$ and $\mathbf{P}(x\in\mathbf{X}^{+}|D)=\frac{\prod_{k}(1-\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D))}{\sum_{k}\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D)+\prod_{k}(1-\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D))}$; then $H_{TP^{+}}(x)\leq\max\big((\sum_{k}\mathbf{1}_{x\in\mathbf{X}_{k}}e^{\delta_{k}})(\sum_{k}(1+\mathbf{1}_{x\in\mathbf{X}_{k}})(1-e^{-\delta_{k}})),\ \prod_{k}e^{\delta_{k}}\sum_{k}(1-e^{-\delta_{k}})\big)$, where $\mathbf{1}_{x\in\mathbf{X}_{k}}$ is an indicator function.

For Corollary 2, we have that $(\sum_{k}\mathbf{1}_{x\in\mathbf{X}_{k}}e^{\delta_{k}})(\sum_{k}(1+\mathbf{1}_{x\in\mathbf{X}_{k}})(1-e^{-\delta_{k}}))$ converges to 0 as the $\delta_{k}$'s converge to 0 in the order of $O(|\sum_{k}\delta_{k}|+\max_{k}|\delta_{k}|)=O(|\sum_{k}\delta_{k}|)$, and $\prod_{k}e^{\delta_{k}}\sum_{k}(1-e^{-\delta_{k}})$ converges to 0 in the order of $O(|\sum_{k}\delta_{k}|)$. Therefore, we know that $H_{TP^{+}}$ and $\sum_{k}H_{OOD^{+},k}$ are equivalent in quantity up to a constant factor.
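A minimal sketch of the open-world construction in Corollary 2(ii) (toy numbers assumed): the probability mass of the extra OOD pseudo-task is the product $\prod_{k}(1-\mathbf{P}^{\prime}_{k})$, normalized together with the per-task probabilities.

```python
import numpy as np

def open_world_tp(ood_probs):
    """Corollary 2(ii): return (P(x in X_1|D), ..., P(x in X_T|D), P(x in X+|D))."""
    p = np.asarray(ood_probs, dtype=float)
    p_unknown = np.prod(1.0 - p)          # mass assigned to the OOD pseudo-task
    z = p.sum() + p_unknown               # normalizer
    return np.append(p / z, p_unknown / z)

# If every task's detector rejects x, most mass goes to the OOD class.
print(open_world_tp([0.05, 0.10, 0.02]))
# If one task accepts x confidently, that task dominates.
print(open_world_tp([0.95, 0.10, 0.02]))
```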

We can extend Theorem 3 to the open-world CIL+ using Corollary 2. By substituting $H_{OOD,k}$ with $H_{OOD^{+},k}$ and $H_{CIL}$ with $H_{CIL^{+}}$, we obtain a new upper bound for the open-world CIL, $\epsilon+\max\big((\sum_{k}\mathbf{1}_{x\in\mathbf{X}_{k}}e^{\delta_{k}})(\sum_{k}(1+\mathbf{1}_{x\in\mathbf{X}_{k}})(1-e^{-\delta_{k}})),\ \prod_{k}e^{\delta_{k}}\sum_{k}(1-e^{-\delta_{k}})\big)$. The proof is trivially identical to the original proof of Theorem 3.

We can establish the same theorem as Theorem 4 for CIL+ by replacing $H_{OOD,k}$ with $H_{OOD^{+},k}$ and $H_{CIL}$ with $H_{CIL^{+}}$. Again, the proof is trivially identical to the original proof of Theorem 4.

The new theorems establish that a good TP+ or OOD+ (open-world OOD detection) and a good WP are necessary and sufficient for a good CIL+.

4 Proposed Approach 1: Combining TIL and OOD Detection

Based on the above theoretical result, we have designed two approaches to solving CIL that employ OOD detection methods, more precisely open-world OOD detection methods. Although theoretically speaking, open-world OOD detection implies closed-world OOD detection, in practical applications, we often do not need to distinguish whether an OOD detection method is a closed-world or an open-world method as they usually can be used for either closed-world or open-world CIL. We just want them to be as accurate as possible for the applications.

This section presents the first approach, which combines a task incremental learning (TIL) method and an OOD detection method. The approach does not save any training data from previous tasks. The OOD detection method here is an open-world method as it does not use any information from the other tasks learned in the CIL process. The next section presents the second approach, which is based on replay and needs to save some training data from previous tasks. The OOD detection method used there is a closed-world OOD detection method as it treats the replay data from previous tasks as the OOD data in model building, but this method can also be used for open-world CIL. (Note that this paper focuses on establishing a theoretical connection between novelty (or OOD) detection and class incremental learning (CIL), and our experiments show the validity of the theory. We also report the OOD detection results using AUC, but this paper does not focus on real-time decision-making and learning, i.e., using OOD detection to detect each novel instance, acquire its class label, and incrementally learn it. That would involve setting an OOD score threshold to decide on each OOD instance and interacting with human users to acquire the class labels to learn. Such user interactions would not give the system a large number of labeled training data, and highly challenging few-shot continual learning would then be required. We leave this to future work.)

Figure 1: Overview of the prediction and training framework of HAT+CSI and Sup+CSI. (a) HAT+CSI: The CIL prediction is made by taking the argmax over the concatenated outputs from each task. The training of each task uses CSI. That is, the training batch is augmented to give different views of the samples for contrastive training. The training consists of two steps following CSI. The first step learns the feature extractor using the hard attention algorithm [61], which applies task embeddings to find hard masks at each layer. Then, given the learned feature representations, step 2 fine-tunes the classifier. (b) Sup+CSI: The CIL prediction is also made by taking the argmax over the concatenated output values from each task, as in HAT+CSI. The model training for each task is similar to HAT+CSI except that it uses the Edge Popup algorithm of SupSup [55] to find a sparse network for each task. The sparse networks are indicated by edges of different colors in the diagram. The second step fine-tunes only the classifier with the fixed feature extractor.

4.1 Combining a TIL Method and an OOD Detection Method

As mentioned earlier, several existing TIL methods can overcome CF. This proposed approach basically leverages the CF prevention ability in two TIL methods (HAT [61] and SupSup (Sup) [69]) and replaces their task learning methods with an OOD detection technique, called CSI [64], which can perform both within-task or IND prediction (WP) and OOD instance detection. Below, we first introduce the two TIL methods, HAT and SupSup, and the OOD detection method, CSI. The combinations give two new CIL methods, HAT+CSI and Sup+CSI. None of these methods needs to save any data from previous tasks.

Figure 1 shows the overall training frameworks of HAT+CSI and Sup+CSI. Note that both HAT and Sup are multi-head methods (one head for each task) designed for task incremental learning (TIL).

4.1.1 HAT: Hard Attention Masks

To prevent forgetting of the trained OOD detection model $f^{k}\circ h^{k}$ for each task $k$ in subsequent task learning, the hard attention mask (HAT) method [61] for TIL is employed (which prevents forgetting in the feature extractor). Specifically, in learning a task, a set of embeddings is trained to protect the important neurons so that the corresponding parameters are not interfered with by subsequent tasks. The importance of a neuron is measured by a 0-1 pseudo-step function, where 0 indicates not important and 1 indicates important (and thus protected).

The hard attention mask is the output of a sigmoid function $u$ with a hyper-parameter $s$,

$$a_{l}^{k}=u(se_{l}^{k}), \qquad (11)$$

where $e_{l}^{k}$ is a learnable embedding at layer $l$ of task $k$. Since the step function is not differentiable, a sigmoid function with a large $s$ is used to approximate it; with a large $s$, the sigmoid is approximately a 0-1 step function. The attention is multiplied with the output $h_{l}=\text{ReLU}(W_{l}h_{l-1}+b_{l})$ of layer $l$,

$$h^{\prime}_{l}=a_{l}^{k}\otimes h_{l} \qquad (12)$$

The $j$th element $a_{j,l}^{k}$ of the attention mask blocks (or unblocks) the information flow from neuron $j$ at layer $l$ if its value is 0 (or 1). When $a_{j,l}^{k}$ is 0, the corresponding parameters in $W_{l}$ and $b_{l}$ can be freely changed as the output values $h^{\prime}_{l}$ are not affected. The neurons with non-zero mask values are necessary to perform the task and thus need protection from catastrophic forgetting.

We modify the gradients of the parameters that are important for the previous tasks $(1,\cdots,k-1)$ during the training of task $k$ so that they are not interfered with. Denote the accumulated mask by

$$a_{l}^{<k}=\max(a_{l}^{<k-1},a_{l}^{k-1}) \qquad (13)$$

where $\max$ is the element-wise maximum and the initial mask $a_{l}^{0}$ is a zero vector. It is a collection of mask values at layer $l$ where a neuron has value 1 if it has ever been activated previously. The gradient of parameter $w_{ij,l}$ is modified as

$$\nabla w_{ij,l}^{\prime}=\left(1-\min\left(a_{i,l}^{<k},a_{j,l-1}^{<k}\right)\right)\nabla w_{ij,l} \qquad (14)$$

where $a_{i,l}^{<k}$ is the $i$th unit of $a_{l}^{<k}$. The gradient flow is blocked if both neuron $i$ in the current layer and neuron $j$ in the previous layer have been activated. We apply the mask to all layers except the last layer. The parameters in the last layer do not need to be protected as they are task-specific parameters.
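The following PyTorch sketch (a single fully connected layer with assumed sizes; our simplification, not the original HAT implementation) illustrates the two mechanisms above: applying the hard attention mask of Eqs. 11-12 in the forward pass, and blocking the gradients of parameters protected by the accumulated mask as in Eq. 14 (here simplified to the output-side mask only).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
in_dim, out_dim, s = 8, 4, 400.0           # s: large scale to approximate the 0-1 step

W = torch.randn(out_dim, in_dim, requires_grad=True)
b = torch.zeros(out_dim, requires_grad=True)
e_k = torch.randn(out_dim, requires_grad=True)     # task-k embedding e_l^k
a_prev = torch.tensor([1.0, 1.0, 0.0, 0.0])        # accumulated mask a_l^{<k} (assumed)

x = torch.randn(2, in_dim)
a_k = torch.sigmoid(s * e_k)                       # Eq. 11: hard attention mask
h = F.relu(x @ W.t() + b)                          # layer output h_l
h_masked = a_k * h                                 # Eq. 12: mask the output

loss = h_masked.sum()                              # stand-in for the training loss
loss.backward()

# Eq. 14 (simplified): zero the gradients of rows whose output neurons were
# used by previous tasks (a_prev == 1), so those parameters stay unchanged.
with torch.no_grad():
    W.grad *= (1.0 - a_prev).unsqueeze(1)
    b.grad *= (1.0 - a_prev)
print(W.grad.abs().sum(dim=1))   # first two rows are fully protected (zero gradient)
```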

A regularization term is introduced to encourage sparsity in $a_{l}^{k}$ and parameter sharing with $a_{l}^{<k}$. The capacity of the network depletes when $a_{l}^{<k}$ becomes a 1-vector in all layers. Although a set of new neurons can be added to the network at any point in training for more capacity, we utilize resources more efficiently by minimizing the loss

$$\mathcal{L}_{r}=\lambda\frac{\sum_{l}\sum_{i}a_{i,l}^{k}\left(1-a_{i,l}^{<k}\right)}{\sum_{l}\sum_{i}\left(1-a_{i,l}^{<k}\right)} \qquad (15)$$

where $\lambda$ is a hyper-parameter. The final objective for training a comprehensive task network without forgetting is

$$\mathcal{L}=\mathcal{L}_{ce}+\mathcal{L}_{r} \qquad (16)$$

where $\mathcal{L}_{ce}$ is the cross-entropy loss. The overall framework of the algorithm is shown in Figure 1(a).

Note that for TIL, HAT needs the task-id of each test instance in order to choose the right task model for prediction or classification. However, by replacing the original model building method for each task in HAT with the OOD detection method of CSI (more specifically, $\mathcal{L}_{c}$) during training, HAT+CSI does not need to know the task-id of each test instance at inference, which makes HAT+CSI suitable for CIL (class incremental learning). We will see the detailed prediction/classification method in Section 4.2.

4.1.2 SupSup: Supermasks in Superposition

SupSup (Sup) [69] is also a highly effective method that can overcome forgetting in the TIL setting. Sup trains supermasks with the Edge Popup algorithm in [55]. Specifically, given the initial weights of a base network $\mathbf{W}$, it finds binary masks $\mathbf{M}_{k}$ for task $k$ to minimize the cross-entropy loss

$$\mathcal{L}=-\frac{1}{|\mathbf{X}_{k}|}\sum\log p(y|x,k), \qquad (17)$$

where $\mathbf{X}_{k}$ is the training data for task $k$, and

$$p(y|x,k)=f(h(x;\mathbf{W}\otimes\mathbf{M}_{k})), \qquad (18)$$

where $\otimes$ indicates the element-wise product. The masks are obtained by selecting the top $p\%$ of entries in the score matrices $\mathbf{V}$. The value of $p$ determines the sparsity of the mask $\mathbf{M}_{k}$. The subnetwork found by the Edge Popup algorithm is indicated by different colors in Figure 1(b).
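A simplified one-layer sketch of this supermask selection (assumed layer sizes and $p$; the actual Edge Popup algorithm also trains the scores $\mathbf{V}$ with a straight-through estimator, which is omitted here):

```python
import torch

torch.manual_seed(0)
W = torch.randn(16, 8)           # fixed (never updated) base weights
V = torch.randn(16, 8)           # learned score matrix for task k
p = 0.5                          # keep the top 50% of entries (assumed value)

k_keep = int(p * V.numel())
threshold = V.flatten().topk(k_keep).values.min()
M_k = (V >= threshold).float()   # binary supermask for task k

x = torch.randn(4, 8)
out = x @ (W * M_k).t()          # Eq. 18: forward pass with W ⊗ M_k
print(M_k.mean().item())         # ≈ p, the fraction of retained weights
```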

Like HAT, Sup is also designed for TIL and needs the task-id $k$ of each test instance at inference. With $k$, the system (referred to as Sup G in the original Sup paper) uses the task-specific mask $\mathbf{M}_{k}$ to obtain the classification output. Like HAT+CSI, by replacing the cross-entropy loss in mask finding with the OOD detection loss of CSI, Sup+CSI also does not require the task-id of each test instance, which makes Sup+CSI applicable to CIL (class incremental learning). We will discuss the detailed prediction/classification method in Section 4.2.

4.1.3 CSI: Contrasting Shifted Instances for OOD Detection

The OOD detection method CSI is based on contrastive learning [12, 29], data and class augmentations, and results ensembling [64]. The OOD training process is similar to that of contrastive learning. It consists of two steps: Step 1 learns the feature representation via the composite $g\circ h$, where $h$ is the feature extractor and $g$ is the projection to the contrastive representation, and Step 2 learns/fine-tunes the linear classifier $f$, mapping the feature representation of $h$ to the label space (the classifier is the OOD model). This two-step training process is outlined in Figure 1(b). In the following, we first describe the two-step training process and then explain how to make a prediction based on an ensemble method to further improve the prediction.

Step 1 (Contrastive Loss for Feature Learning). Supervised contrastive learning is used to try to repel data of different classes and align data of the same class more closely to make it easier to classify them. A key operation is data augmentation via transformations.

Given a batch of $N$ samples, each sample $x$ is first duplicated. Each version then goes through three initial augmentations (horizontal flip, color changes, and Inception crop [63]) to generate two different views $x^{1}$ and $x^{2}$ (they keep the same class label as $x$). Denote the augmented batch by $\mathcal{B}$, which now has $2N$ samples. In [21] and [64], it was shown that using image rotations is effective in learning OOD detection models because such rotations can effectively serve as out-of-distribution (OOD) training data. For each augmented sample $x\in\mathcal{B}$ with class $y$ of a task, we rotate $x$ by $90^{\circ},180^{\circ},270^{\circ}$ to create three images, which are assigned three new classes $y_{1},y_{2}$, and $y_{3}$, respectively. This results in a larger augmented batch $\tilde{\mathcal{B}}$. Since we generate three new images from each $x$, the size of $\tilde{\mathcal{B}}$ is $8N$. For each original class, we now have 4 classes. For a sample $x\in\tilde{\mathcal{B}}$, let $\tilde{\mathcal{B}}(x)=\tilde{\mathcal{B}}\backslash\{x\}$ and let $P(x)\subset\tilde{\mathcal{B}}\backslash\{x\}$ be the set consisting of the data of the same class as $x$ distinct from $x$. The contrastive representation of a sample $x$ is $z_{x}=g(h(x,k))/\|g(h(x,k))\|$, where $k$ is the current task. In learning, we minimize the supervised contrastive loss

c\displaystyle\mathcal{L}_{c} =18Nx~1|P(x)|pP(x)logexp(zxzp/τ)x~(x)exp(zxzx/τ),\displaystyle=\frac{1}{8N}\sum_{{x}\in\tilde{\mathcal{B}}}\frac{-1}{|P({x})|}\sum_{{p}\in P({x})}\log{\frac{\text{exp}({z}_{{x}}\cdot{z}_{{p}}/\tau)}{\sum_{{x}^{\prime}\in\tilde{\mathcal{B}}({x})}\text{exp}({z}_{{x}}\cdot{z}_{{x}^{\prime}}/\tau)}}, (19)

where τ\tau is a scalar temperature and \cdot denotes the dot product. The loss is reduced by repelling z{z} of different classes and aligning z{z} of the same class more closely. Minimizing c\mathcal{L}_{c} thus trains a feature extractor whose representations are well suited for learning an OOD classifier.
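The following is a minimal sketch of the two ingredients just described: the rotation-based class augmentation and the supervised contrastive loss of Eq. 19. It is not the CSI implementation; the initial horizontal-flip/color/crop augmentations, the projection head gg, and the HAT/Sup forgetting-protection mechanisms are omitted, and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def rotation_augment(x: torch.Tensor, y: torch.Tensor, num_classes: int):
    """Rotate each image (B, C, H, W) by 0/90/180/270 degrees.

    The copy rotated by r*90 degrees gets the shifted label y + r * num_classes,
    so each original class yields 4 classes, as described in the text.
    """
    xs, ys = [], []
    for r in range(4):                            # r = 0 keeps the original image
        xs.append(torch.rot90(x, r, dims=(2, 3)))
        ys.append(y + r * num_classes)
    return torch.cat(xs), torch.cat(ys)

def sup_con_loss(z: torch.Tensor, y: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Supervised contrastive loss (Eq. 19) over embeddings z with labels y."""
    z = F.normalize(z, dim=1)                              # L2-normalize as in z_x
    sim = z @ z.t() / tau                                  # pairwise similarities
    not_self = ~torch.eye(len(z), dtype=torch.bool)        # exclude self-pairs
    pos_mask = (y.unsqueeze(0) == y.unsqueeze(1)) & not_self
    # denominator runs over all samples except the anchor itself
    log_prob = sim - torch.logsumexp(sim.masked_fill(~not_self, float('-inf')),
                                     dim=1, keepdim=True)
    # average log-probability over positives per anchor, then over the batch
    return -(log_prob * pos_mask).sum(1).div(pos_mask.sum(1).clamp(min=1)).mean()
```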

Since the feature extractor is shared across tasks in continual learning, protection is needed to prevent catastrophic forgetting. HAT and Sup use their respective techniques to protect their feature extractor from forgetting. Therefore, the losses \mathcal{L} of Eq. 17 and ce\mathcal{L}_{ce} of Eq. 16 are replaced by Eq. 19 while the forgetting prevention mechanisms still hold.

Step 2 (Fine-tuning the Classifier). Given the feature extractor hh trained with the loss in Eq. 19, we freeze hh and only fine-tune the linear classifier ff, which is trained to predict the classes of task kk and the augmented rotation classes. ff maps the feature representation to the label space in 4|𝒞k|\mathcal{R}^{4|\mathcal{C}^{k}|}, where 44 is the number of rotation classes including the original data with 00^{\circ} rotation and |𝒞k||\mathcal{C}^{k}| is the number of original classes in task kk. We minimize the cross-entropy loss,

ft=1|~|(x,y)~logp~(y|x,k),\displaystyle\mathcal{L}_{\text{ft}}=-\frac{1}{|\tilde{\mathcal{B}}|}\sum_{({x},y)\in\tilde{\mathcal{B}}}\log\tilde{p}(y|{x},k), (20)

where ft indicates fine-tune, and

p~(y|x,k)=softmax(f(h(x,k)))\displaystyle\tilde{p}(y|{x},k)=\text{softmax}\left(f(h({x},k))\right) (21)

where f(h(x,k))4|𝒞k|f(h({x,k}))\in\mathcal{R}^{4|\mathcal{C}^{k}|}. The output f(h(x,k))f(h({x,k})) includes the rotation classes. The linear classifier is trained to predict the original and the rotation classes. Since an individual classifier is trained for each task and the feature extractor is frozen, no protection is necessary.

Ensemble Class Prediction. We now discuss the prediction of class label yy for a test sample x{x}. Note that the network fhf\circ h in Eq. 21 returns logits for rotation classes (including the original task classes). Note also for each original class label jk𝒞kj_{k}\in\mathcal{C}^{k} (original classes) of a task kk, we created three additional rotation classes. For class jkj_{k}, the classifier ff will produce four output values from its four rotation class logits, i.e., fjk,0(h(x0,k))f_{j_{k},0}(h({x_{0},k})), fjk,90(h(x90,k))f_{j_{k},90}(h({x_{90},k})), fjk,180(h(x180,k))f_{j_{k},180}(h({x_{180},k})), and fjk,270(h(x270,k))f_{j_{k},270}(h({x_{270},k})), where 0, 90, 180, and 270 represent 0,90,1800^{\circ},90^{\circ},180^{\circ}, and 270270^{\circ} rotations respectively and x0{x}_{0} is the original x{x}. We compute an ensemble output fjk(h(x))f_{j_{k}}(h({x})) for each class jk𝒞kj_{k}\in\mathcal{C}^{k} of task kk,

f(h(x,k))jk=14degf(h(xdeg,k))jk,deg.\displaystyle f(h({x,k}))_{j_{k}}=\frac{1}{4}\sum_{\text{deg}}f(h({x}_{\text{deg}},k))_{j_{k},\text{deg}}. (22)
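As a minimal sketch of this ensemble (Eq. 22), the snippet below averages, for each original class, the four logits obtained from the 0-, 90-, 180-, and 270-degree rotated views; the layout of the 4|𝒞k| output head (four contiguous blocks of the original classes, one per rotation) is an illustrative assumption.

```python
import torch

@torch.no_grad()
def ensemble_logits(f, h, x, task_id: int, num_classes: int) -> torch.Tensor:
    """Average the logits of the four rotated views of x, following Eq. 22.

    Assumes f(h(x, task_id)) returns logits of shape (B, 4 * num_classes) laid
    out as [deg0 | deg90 | deg180 | deg270] blocks of num_classes each.
    """
    scores = torch.zeros(x.size(0), num_classes, device=x.device)
    for r in range(4):
        x_rot = torch.rot90(x, r, dims=(2, 3))
        logits = f(h(x_rot, task_id))                       # (B, 4 * num_classes)
        scores += logits[:, r * num_classes:(r + 1) * num_classes]
    return scores / 4.0
```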

4.2 Experiments

We now present the experimental results of the combination techniques HAT+CSI and Sup+CSI for class incremental learning (CIL). We will also use another OOD detection method ODIN [38] to show that a better OOD detection method leads to better CIL results. We do not conduct extensive experiments on ODIN as it is much weaker than CSI in terms of OOD detection. Note that we will not report the open-world OOD detection results for HAT+CSI and Sup+CSI in the continual learning process as the proposed method MORE in the next section performs better.

4.2.1 Experimental Datasets and Baselines

Datasets and CIL tasks: We use three standard image classification benchmark datasets and construct five different CIL experiments.

1. CIFAR-10 [32]: This dataset consists of 32x32 color images of 10 classes with 50,000 training and 10,000 testing samples. We construct an experiment (C10-5T) of 5 tasks with 2 classes per task.

2. CIFAR-100 [32]: This dataset consists of 32x32 color images of 100 classes with 50,000 training and 10,000 testing samples. We construct two experiments of 10 tasks (C100-10T) and 20 tasks (C100-20T), where each task has 10 classes and 5 classes, respectively.

3. Tiny-ImageNet [34]: This is an image classification dataset with 64x64 color images of 200 classes with 100,000 training and 10,000 validation samples. Since the dataset does not provide labels for testing data, we use the validation data for testing. We construct two experiments of 5 tasks (T-5T) and 10 tasks (T-10T) with 40 classes per task and 20 classes per task, respectively.

Baselines: We use 18 diverse continual learning baselines:

1. One projection method (OWM [75]).

2. Two exemplar-free (no replay data is saved) regularization methods (MUC  [42] and PASS  [77]).

3. Nine replay-based methods (LwF [37], iCaRL [56], A-GEM [10], EEIL [7], GD [36], Mnemonics [43], BiC [70], DER++ [6], and HAL [9]).

4. Three parameter-isolation methods (HAT [61], HyperNet [49], and SupSup [69]).

5. Additionally, we report the accuracies of replay-based method Co2L [8] and parameter isolation methods CCG [1] and PR-Ent [22] from their original papers as CCG has not released the code and we are unable to run Co2L and PR-Ent on our machines.

4.2.2 Training Details and Evaluation Metrics

Training Details. For the backbone structure, we follow [69, 77, 6] and use ResNet-18 [19]. For CIFAR-100 and Tiny-ImageNet, the number of channels is doubled to fit more classes. For all baselines, the same ResNet-18 backbone architecture is employed except for OWM and HyperNet, for which we use their original architectures. OWM uses AlexNet. It is not obvious how to apply its orthogonal projection technique to the ResNet structure. HyperNet uses ResNet-32 and we are unable to replace it due to model initialization arguments unexplained in the original paper. For the replay methods, we use the memory buffer of 200 for CIFAR-10 and 2000 for CIFAR-100 and Tiny-ImageNet as in [56, 6]. We use the hyper-parameters suggested by the authors. If we cannot reproduce any result, we use 10% of the training data as a validation set to grid-search for good hyper-parameters. For our proposed methods, we report the hyper-parameters in F. All the results are averages over 5 runs with random seeds.

Evaluation Metrics.

1. Average classification accuracy over all classes after learning the last task. The final class prediction depends on prediction methods (see below). We also report forgetting rate in G.

2. Average AUC (Area Under the ROC Curve) over all task models for the evaluation of OOD detection. AUC is the main measure used in OOD detection papers. Using this measure, we show that a better OOD detection method will result in a better CIL performance. Let AUCk\textit{AUC}_{k} be the AUC score of task kk. It is computed by using only the model (or classes) of task kk to score the test data of task kk as the in-distribution (IND) data and the test data from other tasks as the out-of-distribution (OOD) data. The average AUC score is: AUC=kAUCk/nAUC=\sum_{k}\textit{AUC}_{k}/n, where nn is the number of tasks.
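For concreteness, the per-task AUC computation can be sketched as follows (using scikit-learn); the data structure ood_scores_per_task and the convention that higher scores mean "more in-distribution" are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def average_auc(ood_scores_per_task):
    """Average AUC over tasks.

    ood_scores_per_task[k] = (scores_ind, scores_ood): the task-k model's scores
    on task-k test data (in-distribution) and on the test data of all other
    tasks (out-of-distribution).
    """
    aucs = []
    for scores_ind, scores_ood in ood_scores_per_task:
        labels = np.concatenate([np.ones_like(scores_ind), np.zeros_like(scores_ood)])
        scores = np.concatenate([scores_ind, scores_ood])
        aucs.append(roc_auc_score(labels, scores))
    return float(np.mean(aucs))
```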

It is not straightforward to change existing CL algorithms to include a new OOD detection method that needs training, e.g., CSI, except for TIL (task incremental learning) methods like HAT and Sup. For HAT and Sup, we can simply switch their methods for learning each task with CSI (see Section 4.1.1 and Section 4.1.2).

Prediction Methods. The theoretical result in Section 3 states that we use Eq. 2 to perform the final prediction. The first probability (WP) in Eq. 2 is easy to get as we can simply use the softmax values of the classes in each task. However, the second probability (TP) in Eq. 2 is tricky as each task is learned without the data of other tasks. There can be many options. We take the following approaches for prediction (which are a special case of Eq. 2, see below):

1. For those approaches that use a single classification head to include all classes learned so far, we predict as follows (which is also the approach taken by the existing papers.)

y^=argmaxf(x)\displaystyle\hat{y}=\operatorname*{arg\,max}f(x) (23)

where f(x)f(x) is the logit output of the network.

2. For multi-head methods (e.g., HAT, HyperNet, and Sup), which use one head for each task, we use the concatenated output as

y^=argmaxkf(x)k\displaystyle\hat{y}=\operatorname*{arg\,max}\bigoplus_{k}f(x)_{k} (24)

where \bigoplus indicates concatenation and f(x)kf(x)_{k} is the output of task kk. (Footnote 6: The Sup paper proposed a one-shot task-id prediction assuming that the test instances come in a batch and all belong to the same task, like iTAML. We assume a single test instance per batch. Its task-id prediction results in an accuracy of 50.2 on C10-5T, which is much lower than 62.6 by using Eq. 24. The task-id prediction of HyperNet also works poorly: its accuracy is 49.34 on C10-5T while it is 53.4 using Eq. 24. PR uses entropy to find the task-id. Among the many variations of PR, we use the variation that performs best for each dataset with exemplar-free and a single sample per batch at testing, i.e., no PR-BW.)

These methods (in fact, they are the same method used in two different settings) are a special case of Eq. 2 if we define OODkOOD_{k} as σ(maxf(x)k)\sigma(\max f(x)_{k}), where σ\sigma is the sigmoid. Hence, the theoretical results in Section 3 are still applicable. We present a detailed explanation of this prediction method and some other options in C. These two approaches work quite well.
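A minimal sketch of this prediction rule (Eq. 24) is given below; the list of per-task logit tensors is an illustrative assumption.

```python
import torch

@torch.no_grad()
def cil_predict(task_logits):
    """CIL prediction from per-task outputs (Eq. 24).

    task_logits is a list of tensors, one per task, each of shape (B, |C^k|).
    The class with the largest logit across the concatenation of all task heads
    is returned as a global class index.
    """
    return torch.cat(task_logits, dim=1).argmax(dim=1)

# Hypothetical usage with three tasks of 2 classes each:
logits = [torch.randn(4, 2) for _ in range(3)]
print(cil_predict(logits))   # global class indices in {0, ..., 5}
```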

4.2.3 Better OOD Detection Produces Better CIL Performance

The key theoretical result in Section 3 is that better OOD detection will produce better CIL performance. We compare a weaker OOD method ODIN with the strong CSI. ODIN is a post-processing method for OOD detection [38]. Note that ODIN does not always improve OOD detection performance over the original output without post-processing (see below).

Table 2: Performance Comparison between the Original Output and ODIN. Note that ODIN does not apply to iCaRL and Mnemonics as they are not based on softmax but some distance functions. As mentioned earlier, for Co2L, CCG, and PR-Ent, they either have no code, or their codes do not run on our machine. The results for other datasets are in B.
Method AUC (Original) AUC (ODIN) CIL (Original) CIL (ODIN)
OWM 71.31 70.06 28.91 28.88
MUC 72.69 72.53 30.42 29.79
PASS 69.89 69.60 33.00 31.00
LwF 88.30 87.11 45.26 51.82
A-GEM 78.01 79.00 9.29 13.48
EEIL 83.37 79.73 48.99 41.74
GD 85.37 82.98 49.67 47.28
BiC 87.89 86.73 52.92 48.65
DER++ 85.99 88.21 53.71 55.29
HAL 64.21 64.83 15.59 21.01
HAT 77.72 77.80 41.06 41.21
HyperNet 71.82 72.32 30.23 30.83
Sup 79.16 80.58 44.58 46.74

Applying ODIN. We first train the baseline models using their original algorithms, and then apply temperature scaling and input noise of ODIN at testing for each task (no training data needed). More precisely, the output of class jj in task kk changes by temperature scaling factor τk\tau_{k} of task kk as

s(x;τk)j=ef(x)kj/τk/jef(x)kj/τk\displaystyle s(x;\tau_{k})_{j}=e^{f(x)_{kj}/\tau_{k}}/\sum_{j}e^{f(x)_{kj}/\tau_{k}} (25)

and the input changes by the noise factor ϵk\epsilon_{k} as

x~=xϵksign(xlogs(x;τk)y^)\displaystyle\tilde{x}=x-\epsilon_{k}\text{sign}(-\nabla_{x}\log s(x;\tau_{k})_{\hat{y}}) (26)

where y^\hat{y} is the class with the maximum output value in task kk. This is a positive adversarial example inspired by [15]. The values τk\tau_{k} and ϵk\epsilon_{k} are hyper-parameters and we use the same values for all tasks except for PASS, for which we use a validation set to tune τk\tau_{k} (see B).
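The following sketch shows how the ODIN temperature scaling and input perturbation of Eqs. 25 and 26 can be applied to one task head at test time; the `model(x, task_id)` signature and the temperature and noise values are illustrative assumptions rather than the exact settings used in our experiments.

```python
import torch
import torch.nn.functional as F

def odin_score(model, x, task_id, temperature=1000.0, epsilon=0.0014):
    """ODIN-style confidence for one task head (a sketch of Eqs. 25-26)."""
    x = x.clone().requires_grad_(True)
    logits = model(x, task_id) / temperature          # temperature scaling (Eq. 25)
    y_hat = logits.argmax(dim=1)                      # class with the maximum output
    loss = F.cross_entropy(logits, y_hat)             # -log s(x; tau)_{y_hat}
    loss.backward()
    # Eq. 26: move the input so as to increase the predicted class's score.
    x_tilde = x - epsilon * torch.sign(x.grad)
    with torch.no_grad():
        probs = F.softmax(model(x_tilde, task_id) / temperature, dim=1)
    return probs.max(dim=1).values                    # per-sample ODIN confidence
```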

Table 2 gives the results for C100-10T. The CIL results clearly show that the CIL performance increases if the AUC increases with ODIN. For instance, the CIL of DER++ and Sup improves from 53.71 to 55.29 and 44.58 to 46.74, respectively, as the AUC increases from 85.99 to 88.21 and 79.16 to 80.58. It shows that when ODIN is incorporated into each task model of an existing trained CIL network, the CIL performance of the original method improves. We note that ODIN does not always improve the average AUC. For those methods experiencing a decrease in AUC, the CIL performance also decreases, except for LwF. The inconsistency of LwF is due to its severe classification bias towards later tasks as discussed in BiC [70]. The temperature scaling in ODIN has a similar effect as the bias correction in BiC, and the CIL of LwF becomes close to that of BiC after the correction. Regardless of whether ODIN improves AUC or not, the positive correlation between AUC and CIL (except LwF) verifies the efficacy of Theorem 3, indicating that better OOD detection results in better CIL performance.

Table 3: Average CIL and AUC of HAT and Sup after applying OOD detection methods ODIN and CSI. ODIN is a traditional OOD detection method while CSI is a recent OOD detection method known to be better than ODIN. As the CL methods achieve better OOD detection performance with CSI, their CIL performance is also better than with ODIN.
CL OOD C10-5T C100-10T C100-20T T-5T T-10T
AUC CIL AUC CIL AUC CIL AUC CIL AUC CIL
HAT ODIN 82.5 62.6 77.8 41.2 75.4 25.8 72.3 38.6 71.8 30.0
CSI 91.2 87.8 84.5 63.3 86.5 54.6 76.5 45.7 78.5 47.1
Sup ODIN 82.4 62.6 80.6 46.7 81.6 36.4 74.0 41.1 74.6 36.5
CSI 91.6 86.0 86.8 65.1 88.3 60.2 77.1 48.9 79.4 45.7

Applying CSI. We now apply the OOD detection method CSI. Due to its sophisticated data augmentation, supervised contrastive learning, and results ensemble, it is hard to apply CSI to other baselines without fundamentally changing them except for HAT and Sup (SupSup) as these methods are parameter isolation-based TIL methods. We can simply replace their model for training each task with CSI wholesale. As mentioned earlier, both HAT and Sup as TIL methods have almost no forgetting.

Table 3 reports the results of using CSI and ODIN. ODIN is a weaker OOD method than CSI. Both HAT and Sup improve greatly as the systems are equipped with a better OOD detection method CSI. These experiment results empirically demonstrate the efficacy of Theorem 3, i.e., the CIL performance can be improved if a better OOD detection method is used.

4.2.4 Full Comparison of HAT+CSI and Sup+CSI with Baselines

Table 4: Average accuracy (CIL) of all methods after all tasks are learned. The baselines are grouped into (a), (b), (c), and (d) for projection, regularization, replay, and parameter-isolation methods, respectively. Our proposed methods are grouped into (e). Exemplar-free methods are italicized. \dagger indicates that in their original papers, PASS and Mnemonics are pre-trained with the first half of the classes. Their results with pre-train are 50.1 and 53.5 on C100-10T, respectively, which are still much lower than the proposed HAT+CSI and Sup+CSI without pre-training. We do not use pre-training in our experiment for fairness. * indicates that iCaRL and Mnemonics report average incremental accuracy in their original papers. We report average accuracy over all classes after all tasks are learned. The last column Avg. shows the average CIL accuracy of each method over all datasets.
Method C10-5T C100-10T C100-20T T-5T T-10T Avg.
(a) OWM 51.8±0.05 28.9±0.60 24.1±0.26 10.0±0.55 8.6±0.42 24.7
(b) MUC 52.9±1.03 30.4±1.18 14.2±0.30 33.6±0.19 17.4±0.17 29.7
PASS 47.3±0.98 33.0±0.58 25.0±0.69 28.4±0.51 19.1±0.46 30.6
(c) LwF 54.7±1.18 45.3±0.75 44.3±0.46 32.2±0.50 24.3±0.26 40.2
iCaRL 63.4±1.11 51.4±0.99 47.8±0.48 37.0±0.41 28.3±0.18 45.6
A-GEM 20.0±0.37 9.3±0.17 4.1±0.89 13.5±0.08 7.7±0.07 10.9
EEIL 57.1±0.28 49.0±1.27 33.5±0.08 14.7±0.40 9.8±0.19 32.8
GD 58.7±0.31 49.7±0.33 38.9±0.02 16.4±1.40 11.7±0.25 35.1
Mnemonics†∗ 64.1±1.47 51.0±0.34 47.6±0.74 37.1±0.46 28.5±0.72 45.7
BiC 61.4±1.74 52.9±0.64 48.9±0.54 41.7±0.74 33.8±0.40 47.7
DER++ 66.0±1.20 53.7±1.30 46.6±1.44 35.8±0.77 30.5±0.47 46.5
HAL 32.8±2.17 15.6±0.31 13.5±1.53 3.4±0.35 3.4±0.38 13.7
Co2L 65.6 – – – – –
(d) CCG 70.1 – – – – –
HAT 62.7±1.45 41.1±0.93 25.6±0.51 38.5±1.85 29.8±0.65 39.5
HyperNet 53.4±2.19 30.2±1.54 18.7±1.10 7.9±0.69 5.3±0.50 23.1
Sup 62.4±1.45 44.6±0.44 34.7±0.30 41.8±1.50 36.5±0.36 44.0
PR-Ent 61.9 45.2 – – – –
(e) HAT+CSI 87.8±0.71 63.3±1.00 54.6±0.92 45.7±0.26 47.1±0.18 59.7
Sup+CSI 86.0±0.41 65.1±0.39 60.2±0.51 48.9±0.25 45.7±0.76 61.2
HAT+CSI+c 88.0±0.48 65.2±0.71 58.0±0.45 51.7±0.37 47.6±0.32 62.1
Sup+CSI+c 87.3±0.37 65.2±0.37 60.5±0.64 49.2±0.28 46.2±0.53 61.7
Table 5: TIL (WP) accuracy results of 3 best-performing baselines and our methods. The full results are given in E. The calibrated versions (+c) of our methods are omitted as calibration does not affect TIL performances.
Method C10-5T C100-10T C100-20T T-5T T-10T Avg.
BiC 95.4±0.35 84.6±0.48 88.7±0.19 61.5±0.60 62.2±0.45 78.5
HAT 96.7±0.18 84.0±0.23 85.0±0.98 61.2±0.72 63.8±0.41 78.1
Sup 96.6±0.21 87.9±0.27 91.6±0.15 64.3±0.24 68.4±0.22 81.8
HAT+CSI 98.7±0.06 92.0±0.37 94.3±0.06 68.4±0.16 72.4±0.21 85.2
Sup+CSI 98.7±0.07 93.0±0.13 95.3±0.20 65.9±0.25 74.1±0.28 85.4

We now make a full comparison of the two proposed systems HAT+CSI and Sup+CSI designed based on the theoretical results with baselines. Since HAT and Sup are exemplar-free CL methods, HAT+CSI and Sup+CSI do not need to save any previous task data for replaying. Table 4 shows that HAT and Sup equipped with CSI outperform the baselines by large margins. DER++, the best replay method, achieves 66.0 and 53.7 on C10-5T and C100-10T, respectively, while HAT+CSI achieves 87.8 and 63.3 and Sup+CSI achieves 86.0 and 65.1. The large performance gap remains consistent in more challenging problems, T-5T and T-10T.

Due to the definition of OOD in the prediction method and the fact that each task is trained separately in HAT and Sup, the outputs f(x)kf(x)_{k} from different tasks can be on different scales, which may result in incorrect predictions. To deal with the problem, we can calibrate the output as αkf(x)k+βk\alpha_{k}f(x)_{k}+\beta_{k} and use OODk=σ(αkf(x)k+βk)OOD_{k}=\sigma(\alpha_{k}f(x)_{k}+\beta_{k}). The optimal αk\alpha_{k}^{*} and βk\beta_{k}^{*} for each task kk can be found by optimization using a memory buffer that saves a very small number of training examples from previous tasks, as in the replay-based methods. We refer to the calibrated methods as HAT+CSI+c and Sup+CSI+c. They are trained by using a memory buffer of the same size as the replay methods (see Section 4.2.2). Table 4 shows that calibration improves over the memory-free versions, i.e., those without calibration. We provide the details about how to train the calibration parameters αk\alpha_{k} and βk\beta_{k} in D.
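The details of training αk and βk are in D; as a rough sketch of one natural instantiation, the calibration parameters can be fitted on the small memory buffer by minimizing the cross-entropy of the concatenated, calibrated task outputs, as below (the function names and hyper-parameters are illustrative assumptions).

```python
import torch
import torch.nn.functional as F

def fit_calibration(task_logits_fn, buffer_loader, num_tasks, epochs=5, lr=0.01):
    """Fit per-task scale/shift (alpha_k, beta_k) on a small replay buffer (a sketch).

    task_logits_fn(x, k) is assumed to return the task-k head's logits for x;
    buffer_loader yields (x, y) where y is the global class label matching the
    ordering of the concatenated task heads.
    """
    alpha = torch.nn.Parameter(torch.ones(num_tasks))
    beta = torch.nn.Parameter(torch.zeros(num_tasks))
    opt = torch.optim.SGD([alpha, beta], lr=lr)
    for _ in range(epochs):
        for x, y in buffer_loader:
            with torch.no_grad():                       # task models stay frozen
                logits = [task_logits_fn(x, k) for k in range(num_tasks)]
            outs = [alpha[k] * logits[k] + beta[k] for k in range(num_tasks)]
            loss = F.cross_entropy(torch.cat(outs, dim=1), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return alpha.detach(), beta.detach()
```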

We note that CSI uses extensive data augmentations in its OOD detection. However, the baseline systems do not. To be fair, we added the same data augmentations to the three top-performing baselines, Mnemonics, BiC, and DER++. The average accuracy values over the five CIL experiments are 36.66, 35.75, and 18.43 for Mnemonics, BiC, and DER++, respectively. Our methods, HAT+CSI, Sup+CSI, HAT+CSI+c, and Sup+CSI+c, achieve accuracy values of 59.7, 61.2, 62.1, and 61.7, respectively, which are significantly better. In fact, with the augmentations, the three baselines perform worse than their original versions. We believe the reason is that while augmentations improve the performance of the current task as they help learn finer-grained and more task-specific features, they also cause more model updates in learning a task due to the additional augmented data, which leads to significantly more forgetting of prior tasks. However, our technique incorporating robust TIL mechanisms prevents forgetting while also concurrently benefiting from the strong OOD detection performance for each task model of CSI, which exploits data augmentations.

Finally, as shown in Theorem 1, the CIL performance also depends on the TIL (WP) performance. We compare the TIL accuracies of the baselines and our methods in Table 5. Our systems again outperform the baselines by large margins on more challenging datasets (e.g., CIFAR100 and Tiny-ImageNet).

5 Proposed Approach 2: Out-of-Distribution Replay

The approach presented above does not save any training data from previous tasks except for the optional step of calibration. The method presented in this section is based on the replay approach to solving CIL, which saves a small number of training data from each previous task. The proposed method is called Multi-head model for continual learning via OOD REplay (MORE). As mentioned in Section 4, the OOD detection method used in this section is a closed-world method as it uses the saved samples from previous tasks as OOD samples in learning each new task.

5.1 The Proposed MORE Technique

Recall that a replay-based method for continual learning works by memorizing or saving a small subset of the training samples from each previous task in a memory buffer. The saved data is called the replay data. In learning a new task, the new task data and the replay data are used jointly to update the model. Clearly, using the replay data can partially deal with the inter-class separation (ICS) problem because the model sees some data from all classes learned so far. However, it cannot solve the ICS problem completely because the amount of replay data is often very small.

Unlike existing replay-based CIL methods, which simply use the replay data to update the decision boundaries between the old class and the new classes (in the new task), the proposed method uses the replay data to build an OOD detection model for each task in continual learning, which gives the name of the proposed method, i.e., out-of-distribution replay. Further, unlike existing OOD detection methods, which usually do not use any OOD data in training, the proposed method uses the replay data from previous tasks as the OOD data for the current new task in building its OOD detection model.

Unlike HAT+CSI and Sup+CSI, which do not use a pre-trained network, MORE trains a multi-head network as an adapter [23] to a pre-trained network (see Figure 2(b)). Note that using a pre-trained transformer network and adapter modules is a common practice in existing continual learning methods in the natural language processing community [28, 27]. Here we also leverage this approach for image classification tasks. In continual learning, the pre-trained network is frozen, only the adapters and the norm layers are trainable. Similar to HAT+CSI, a hard attention mask (HAT) is again employed to protect each task model or classifier to avoid forgetting. Each head is also an OOD detection model for a task, but, as mentioned above, MORE uses the replay data as the OOD data to build an OOD detection model. Since HAT has been described in Section 4.1.1, we will not discuss it further except to state that we need to use ood\mathcal{L}_{ood} in Eq. 27 to replace ce\mathcal{L}_{ce} in Eq. 16 after incorporating the trainable embedding 𝒆k{\bm{e}}^{k}. We describe the whole training and prediction process in H.

5.1.1 Training an OOD Detection Model

At task kk, the system receives the training data 𝒟k={(xki,yki)i=1nk}\mathcal{D}_{k}=\{(x_{k}^{i},y_{k}^{i})_{i=1}^{n_{k}}\}, where nkn_{k} is the number of samples, and xki𝐗kx_{k}^{i}\in\mathbf{X}_{k} is an input sample and yki𝐘ky_{k}^{i}\in\mathbf{Y}_{k} (the set of all classes of task kk) is its class label. We train the feature extractor z=h(x,k;θ)z=h(x,k;\theta) and task-specific classifier f(z;ϕk)f(z;\phi_{k}) using 𝒟k\mathcal{D}_{k} and the samples in the memory buffer \mathcal{M}. We treat the buffer data as OOD data to encourage the network to learn the current task and also detect OOD samples (the models or classifiers of the previous tasks are not touched). We achieve this by maximizing p(y|x,k)=softmaxf(h(x,k;θ);ϕk)p(y|x,k)=\text{softmax}f(h(x,k;\theta);\phi_{k}) for an IND sample x𝐗kx\in\mathbf{X}_{k} and maximizing p(ood|x,k)p(ood|x,k) for an OOD sample xx\in\mathcal{M}. The additional label oodood is reserved for previous and possible future unseen classes. Figure 2(a) shows the overall idea of the proposed approach. We formulate the problem as follows.

Figure 2: (a) We train the feature extractor and the task classifier kk at task kk. The output values of the classifier correspond to |𝐘k|+1|\mathbf{Y}_{k}|+1 classes, in which the last class is for OOD (i.e., representing previous and unseen future classes). At inference/testing, the probability values of each task model without the OOD class are concatenated and the system chooses the class with the maximum score. (b) Transformer and adapter module. The masked adapter network consists of 2 fully connected layers and task-specific masks. During training, only the masked adapters and norm layers are updated and the other parts in the transformer layers remain unchanged.

Given the training data 𝐗k\mathbf{X}_{k} of size nkn_{k} at task kk and the memory buffer \mathcal{M} of size MM, we minimize the loss

ood(θ,ϕk)=1M+N((x,y)logp(ood|x,k)+(x,y)𝒟klogp(y|x,k))\displaystyle\begin{split}\mathcal{L}_{ood}(\theta,\phi_{k})=-\frac{1}{M+N}\left(\sum_{(x,y)\in\mathcal{M}}\log p(ood|x,k)+\sum_{(x,y)\in\mathcal{D}^{k}}\log p(y|x,k)\right)\end{split} (27)

It is the sum of two cross-entropy losses. The first loss is for learning OOD samples while the second loss is for learning the classes from the current task. We optimize the shared parameter θ\theta in the feature extractor. The task-specific classification parameters ϕk\phi_{k} are independent of other tasks. The learned representation of the current data should be robust to OOD data. The classifier can thus both classify the IND classes of the current task and detect OOD samples.
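A minimal sketch of the loss in Eq. 27 is given below; `model(x, task_id)` returning the |𝐘k|+1 logits of the task-kk head (the last one being the ood class) is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def more_ood_loss(model, x_task, y_task, x_buf, task_id, num_task_classes):
    """OOD-replay loss of Eq. 27 (a sketch).

    The task-k head outputs num_task_classes + 1 logits; the last index is the
    reserved `ood` class. Buffer samples from previous tasks are labeled `ood`.
    """
    ood_label = num_task_classes                       # index of the extra OOD class
    x = torch.cat([x_task, x_buf], dim=0)
    y = torch.cat([y_task,
                   torch.full((x_buf.size(0),), ood_label,
                              dtype=torch.long, device=y_task.device)], dim=0)
    logits = model(x, task_id)                         # shape (N + M, num_task_classes + 1)
    return F.cross_entropy(logits, y)                  # both CE terms of Eq. 27, averaged
```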

In testing, we perform prediction by comparing the softmax probability output values using all the task classifiers from task 1 to kk without the OOD class as

y^=argmax1jkp(𝐘j|x,j)\displaystyle\hat{y}=\operatorname*{arg\,max}\bigoplus_{1\leq j\leq k}p(\mathbf{Y}_{j}|x,j) (28)

where \bigoplus is the concatenation over the output space. Figure 2(a) shows the prediction rule. We are basically choosing the class with the highest softmax probability over all classes from all learned tasks.

5.1.2 Back-Updating the Previous OOD Models

Each task model works better if more diverse OOD data is provided during training. As in a replay-based approach, MORE saves an equal number of samples per class after each task [11]. The saved samples in the memory are used as OOD samples for each new task. Thus, in the beginning of continual learning when the system is trained on only a small number of tasks, the classes of samples in the memory are less diverse than after more tasks are learned. This makes the performance of OOD detection stronger for later tasks, but weaker in earlier tasks. To prevent this asymmetry, we update the model of each previous task so that it can also identify the samples from subsequent classes (which were unseen during the training of the previous task) as OOD samples.

At task kk, we update the previous task models (j=1,,k1)(j=1,\cdots,k-1) as follows. Denote the samples of task jj in memory \mathcal{M} (i.e., the IND samples of task jj) by 𝒟~j\tilde{\mathcal{D}}_{j}. We construct a new dataset using the current task dataset and the samples in the memory buffer. We randomly select |||\mathcal{M}| samples from the training data 𝒟k\mathcal{D}_{k} and pool them with the remaining samples in \mathcal{M} after removing the IND samples 𝒟~j\tilde{\mathcal{D}}_{j} of task jj from \mathcal{M}. We do not use the entire training data 𝒟k\mathcal{D}_{k} as we do not want a large sample imbalance between IND and OOD. Denote the new dataset by ~\tilde{\mathcal{M}}. Using this data, we update only the parameters ϕj\phi_{j} of the classifier for task jj, with the feature representations frozen, by minimizing the loss

(ϕj)=12M((x,y)~logp(ood|x,j)+(x,y)𝒟~jlogp(y|x,j))\displaystyle\mathcal{L}(\phi_{j})=-\frac{1}{2M}\left(\sum_{(x,y)\in\tilde{\mathcal{M}}}\log p(ood|x,j)+\sum_{(x,y)\in\tilde{\mathcal{D}}_{j}}\log p(y|x,j)\right) (29)

We reduce the loss by updating the parameters of classifier jj to maximize the probability of the class if the sample belongs to task jj and maximize the OOD probability otherwise.
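The back-updating step can be sketched as follows; `model.features` (the frozen feature extractor) and the linear `classifier_j` are illustrative assumptions, and the construction of the OOD pool follows the description above.

```python
import torch
import torch.nn.functional as F

def back_update(model, classifier_j, x_ind_j, y_ind_j, x_ood, task_j,
                epochs=10, lr=0.01):
    """Back-update the task-j classifier with newer OOD data (Eq. 29, a sketch).

    x_ind_j, y_ind_j: the saved task-j samples in the memory buffer (IND for task j).
    x_ood: a mix of current-task samples and the other tasks' buffer samples.
    Only the task-j classifier is updated; the feature extractor stays frozen.
    """
    ood_label = classifier_j.out_features - 1          # last output index is `ood`
    opt = torch.optim.SGD(classifier_j.parameters(), lr=lr, momentum=0.9)
    with torch.no_grad():                              # features are frozen
        z_ind = model.features(x_ind_j, task_j)
        z_ood = model.features(x_ood, task_j)
    z = torch.cat([z_ind, z_ood])
    y = torch.cat([y_ind_j,
                   torch.full((len(x_ood),), ood_label, dtype=torch.long)])
    for _ in range(epochs):
        loss = F.cross_entropy(classifier_j(z), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return classifier_j
```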

5.1.3 Improving Prediction Performance by a Distance Based Technique

We further improve the prediction in Eq. 28 by introducing a distance-based factor used as a coefficient to the softmax probabilities in Eq. 28. It is quite intuitive that if a test instance is close to a class, it is more likely to belong to the class. We thus propose to combine this new distance factor and the softmax probability output of the task jj model to make the final prediction decision. In some sense, this can be considered as an ensemble of the two methods.

We define the distance-based coefficient sj(x)s_{j}(x) of task jj for the test instance xx by the maximum of inverse Mahalanobis distance [35] between the feature of xx and the Gaussian distributions of the classes in task jj parameterized by the mean μji\mu_{j}^{i} of the class ii in task jj and the sample covariance SjS_{j}. They are estimated by the features of class ii’s training data for each class ii in task jj. If a test instance is from the task, its feature should be close to the distribution that the instance belongs to. Conversely, if the instance is OOD to the task, its feature should not be close to any of the distributions of the classes in the task. More precisely, for task jj with classes y1,,y|𝐘j|y_{1},\cdots,y_{|\mathbf{Y}_{j}|} (where |𝐘j|{|\mathbf{Y}_{j}|} represents the number of classes in task jj), we define the coefficient sj(x)s_{j}(x) as

sj(x)=max[1/MD(x;μjy1,Sj),,1/MD(x;μjy|𝐘j|,Sj)]\displaystyle s_{j}(x)=\max\left[1/\text{MD}(x;\mu_{j}^{y_{1}},S_{j}),\cdots,1/\text{MD}(x;\mu_{j}^{y_{|\mathbf{Y}_{j}|}},S_{j})\right] (30)

MD(x;μji,Sj)\text{MD}(x;\mu_{j}^{i},S_{j}) is the Mahalanobis distance. The coefficient is large if at least one of the Mahalanobis distances is small but the coefficient is small if all the distances are large (i.e. the feature is far from all the distributions of the task). The parameters μji\mu_{j}^{i} and SjS_{j} can be computed and saved when each task is learned. The mean μji\mu_{j}^{i} is computed using the training samples 𝒟ji\mathcal{D}^{i}_{j} of class ii as follows,

μji=x𝒟jih(x,i)/|𝒟ji|\displaystyle\mu_{j}^{i}=\sum_{x\in\mathcal{D}^{i}_{j}}h(x,i)/|\mathcal{D}^{i}_{j}| (31)

and the covariance SjS_{j} of task jj is the mean of covariances of the classes in task jj,

Sj=i𝐘jSji/|𝐘j|\displaystyle S_{j}=\sum_{i\in\mathbf{Y}_{j}}S^{i}_{j}/|\mathbf{Y}_{j}| (32)

where Sji=x𝒟ji(xμji)T(xμji)/|𝒟ji|S^{i}_{j}=\sum_{x\in\mathcal{D}^{i}_{j}}(x-\mu_{j}^{i})^{T}(x-\mu_{j}^{i})/|\mathcal{D}^{i}_{j}| is the sample covariance of class ii. By multiplying the coefficient sj(x)s_{j}(x) to the original softmax probabilities p(𝐘j|x,j)p(\mathbf{Y}_{j}|x,j), the task output p(𝐘j|x,j)sj(x)p(\mathbf{Y}_{j}|x,j)s_{j}(x) increases if xx is from task jj and decreases otherwise. The final prediction is made by (which replaces Eq. 28)

y=argmax1jkp(𝐘j|x,j)sj(x),\displaystyle y=\operatorname*{arg\,max}\bigoplus_{1\leq j\leq k}p(\mathbf{Y}_{j}|x,j)s_{j}(x), (33)

where kk is the last task that we have learned.
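A minimal sketch of the coefficient of Eq. 30 on a single test feature is given below; the per-class means and the shared covariance are assumed to have been estimated as in Eqs. 31 and 32 on the feature space, and a small ridge term may be needed in practice to keep the covariance invertible.

```python
import torch

def task_coefficient(z, class_means, cov):
    """Distance-based coefficient s_j(x) of Eq. 30 (a sketch).

    z: feature of a test sample, shape (d,).
    class_means: list of per-class mean features of task j, each of shape (d,).
    cov: shared covariance S_j of task j, shape (d, d); in practice a small
         ridge term (e.g. cov + 1e-5 * I) may be added for numerical stability.
    """
    cov_inv = torch.linalg.inv(cov)
    dists = []
    for mu in class_means:
        diff = (z - mu).unsqueeze(0)                        # (1, d)
        md = torch.sqrt(diff @ cov_inv @ diff.t()).item()   # Mahalanobis distance
        dists.append(md)
    return 1.0 / min(dists)                                 # max of inverse distances

# Final prediction (Eq. 33): each task's softmax output is multiplied by its
# coefficient and the arg max is taken over the concatenation of all tasks.
```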

5.2 Experiments

We now report the experiment results of the proposed method MORE. For experimental datasets, we use the same three image classification benchmark datasets as in Section 4.2.1. For baselines, the same systems are used as well (see Section 4.2.1) except Mnemonics, HyperNet, CCG, Co2L, and PR-Ent. Mnemonics requires optimization of training instances and it is not clear how to implement it for images after interpolation to the input size of a pre-trained model (see below). For HyperNet, the reason is explained in Training Details in Section 4.2.2. CCG has not released its code, and the codes of Co2L and PR-Ent do not run in our environment, so we could not convert them to use a pre-trained model. Finally, we are left with 13 baselines. Note that HAT+CSI and Sup+CSI are not included as they are much weaker (up to 15% lower than MORE in accuracy) because CSI’s approach of using contrastive learning and data augmentations does not work well with a pre-trained model.

Evaluation Metrics. We still use the same evaluation measures as we used in Section 4.2.2. (1). Average classification accuracy over all classes after learning the last task. (2). Average AUC (Area Under the ROC Curve) for evaluating OOD detection performance of continual learning in the open world. See Section 5.2.4 for more details.

5.2.1 Pre-trained Network

We pre-train a vision transformer [66] using a subset of the ImageNet data [58] and apply the pre-trained network/model to all baselines and our method. To ensure that there is no overlapping of data between ImageNet and our experimental datasets, we manually removed 389 classes from the original 1000 classes in ImageNet that are similar/identical to the classes in CIFAR-10, CIFAR-100, or Tiny-ImageNet. We pre-train the network with the remaining subset of 611 classes of ImageNet.

Using the pre-trained network, both our system and the baselines improve dramatically compared to their versions without using the pre-trained network. For instance, the two best baselines (DER++ and PASS) in our experiments achieved average classification accuracies of 66.89 and 68.25 (after the final task) with the pre-trained network over 5 experiments, while they achieved only 46.88 and 32.42 without using the pre-trained network.

We insert an adapter module at each transformer layer to exploit the pre-trained transformer network in continual learning. During training, the adapter module and the layer norm are trained while the transformer parameters are unchanged to prevent forgetting in the pre-trained network.

5.2.2 Training Details

For all experiments, we use the same backbone architecture DeiT-S/16 [66] with a 2-layer adapter [23] at each transformer layer, and the same class order for both baselines and our method. The first fully connected layer in the adapter maps from dimension 384 to the bottleneck. The second fully connected layer following ReLU activation function maps from bottleneck to 384. The bottleneck dimension is the same for all adapters in a model. For our method, we use SGD with a momentum value 0.9. The back-updating method in Section 5.1.2 is also a hyper-parameter choice. If we apply it, we train each classifier for 10 epochs by SGD with a learning rate 0.01, batch size 16, and momentum value 0.9. We choose 500 for ss in Eq. 11 and 0.75 for λ\lambda in Eq. 15 as recommended in [61]. We find a good set of learning rates and number of epochs on the validation set made of 10% of the training data. We follow [11] and save an equal number of random samples per class in the replay memory. Following the experiment settings in [56, 77], we fix the size of the memory buffer and reduce the saved samples to accommodate a new set of samples after a new task is learned. We use the class order protocol in [56, 6] by generating random class orders for the experiments. The baselines and our method use the same class ordering. We also report the size of memory required for each experiment in I.
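As a minimal sketch (under the dimensions stated above), the adapter module can be written as follows; the residual connection is a common adapter design and an assumption here, and the HAT masks applied to the adapter are omitted.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """2-layer bottleneck adapter inserted at each transformer layer (a sketch)."""
    def __init__(self, dim: int = 384, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # 384 -> bottleneck
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, dim)     # bottleneck -> 384
    def forward(self, x):
        # Residual insertion so the frozen transformer path is preserved.
        return x + self.up(self.act(self.down(x)))
```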

For CIFAR-10, we split 10 classes into 5 tasks (2 classes per task). The bottleneck size in each adapter is 64. Following [6], we use the memory size 200, and train for 20 epochs with a learning rate 0.005, and apply the back-updating method in Section 5.1.2.

For CIFAR-100, we conducted 10 tasks and 20 tasks experiments, where each task has 10 classes and 5 classes, respectively. We double the bottleneck size of the adapter to learn more classes. We use the memory size 2000 following [56] and train for 40 epochs with learning rates 0.001 and 0.005 for 10 tasks and 20 tasks, respectively, and apply the back-updating method in Section 5.1.2.

For Tiny-ImageNet, two experiments are conducted. We split 200 classes into 5 and 10 tasks, where each task has 40 classes and 20 classes per task, respectively. We use the bottleneck size 128, and save 2000 samples in memory. We train with the learning rate 0.005 for 15 and 10 epochs for 5 tasks and 10 tasks, respectively. There is no need to use the back-updating method as the earlier tasks already have diverse OOD classes.

5.2.3 Accuracy and Forgetting Rate Results and Analysis

Table 6: Average accuracy after the final task. ‘-XT’ means X number of tasks. Our system MORE and all baselines use the pre-trained network. The baselines are grouped into (a), (b), (c), and (d) for projection, regularization, replay, and parameter-isolation methods, respectively. The last column shows the average accuracy of each method over all datasets and experiments. We highlight the best results in each column in bold.
Method C10-5T C100-10T C100-20T T-5T T-10T Avg.
(a) OWM 41.69±6.34 21.39±3.18 16.98±4.44 24.55±2.48 17.52±3.45 24.43
(b) MUC 73.95±7.24 57.87±1.11 43.98±2.68 62.47±0.34 55.79±0.49 58.81
PASS 86.21±1.10 68.90±0.94 66.77±1.18 61.03±0.38 58.34±0.42 68.25
(c) LwF 67.59±4.27 66.50±1.93 67.54±0.97 33.51±4.36 36.85±4.46 54.40
iCaRL 87.55±0.99 68.90±0.47 69.15±0.99 53.13±1.04 51.88±2.36 66.12
A-GEM 56.33±7.77 25.21±4.00 21.99±4.01 30.53±3.99 21.90±5.52 31.20
EEIL 82.34±3.13 68.08±0.51 63.79±0.66 53.34±0.54 50.38±0.97 63.59
GD 89.16±0.53 64.36±0.57 60.10±0.74 53.01±0.97 42.48±2.53 61.82
BiC 67.44±3.93 64.47±1.30 67.69±1.97 38.78±1.26 40.98±2.39 55.87
DER++ 84.63±2.91 69.73±0.99 70.03±1.46 55.84±2.21 54.20±3.28 66.89
HAL 84.38±2.70 67.17±1.50 67.37±1.45 52.80±2.37 55.25±3.60 65.39
(d) HAT 83.30±1.54 62.34±0.93 56.72±0.44 57.91±0.72 53.12±0.94 62.68
Sup 80.91±2.99 62.49±0.49 57.32±1.11 58.43±0.67 54.52±0.45 62.74
MORE 89.16±0.96 70.23±2.27 70.53±1.09 64.97±1.28 63.06±1.26 71.59

Average Accuracy. Table 6 shows that our method MORE consistently outperforms the baselines. All the reported results are the averages of 5 runs. The last column gives the average of each row. We compare with the replay-based methods first. The best replay-based method on average over all the datasets is DER++. Our method MORE achieves an accuracy of 71.59, much better than the 66.89 of DER++. This demonstrates that the existing replay-based methods, which use the replay samples to update all learned classes, are inferior to our MORE method, which uses the samples for OOD learning. The best baseline is the generative method PASS. Its average accuracy over all the datasets is 68.25, which is still poorer than our method’s 71.59. The performance of the multi-head method HAT [61] using task-id prediction is only 62.68, which is lower than many other baselines. Its performance is particularly low in experiments where the number of classes per task is small. For instance, its accuracy on C100-20T is 56.72, much lower than the 70.53 of our method, which is trained based on OOD detection.

Accuracy with Smaller Memory Sizes. For all the datasets, we run additional experiments with half of the original memory size and show that our method is even stronger with a smaller memory. The new memory sizes are 100, 1000, and 1000 for CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively. Table 7 shows that MORE experiences almost no performance drop with the reduced memory size while the memory-based baselines suffer major performance reductions. The accuracy of the best memory-based baseline (DER++) decreases from 66.89 to 62.16, while MORE only decreases from 71.59 to 71.44, which shows that a small number of OOD samples is enough to enable the system to produce a robust OOD detection model.

Table 7: Average accuracy of the baselines and our method MORE with smaller memory sizes. We reduce the size of the memory buffer by half. The new sizes are 100, 1000, and 1000 for CIFAR10, CIFAR100, and Tiny-ImageNet. Numbers in bold are the best results in each column.
Method C10-5T C100-10T C100-20T T-5T T-10T Avg.
(a) OWM 41.69±6.34 21.39±3.18 16.98±4.44 24.55±2.48 17.52±3.45 24.43
(b) MUC 73.95±7.24 57.87±1.11 43.98±2.68 62.47±0.34 55.79±0.49 58.81
PASS 86.21±1.10 68.90±0.94 66.77±1.18 61.03±0.38 58.34±0.42 68.25
(c) LwF 63.01±4.19 56.76±3.72 63.53±2.86 26.79±2.36 28.08±4.88 47.63
iCaRL 86.08±1.19 66.96±2.08 68.16±0.71 47.27±3.22 49.51±1.87 63.60
A-GEM 56.64±4.29 23.18±2.54 20.76±2.88 31.44±3.84 23.73±6.27 31.15
EEIL 77.44±3.04 62.95±0.68 57.86±0.74 48.36±1.38 44.59±1.72 58.24
GD 85.96±1.64 57.17±1.06 50.30±0.58 46.09±1.77 32.41±2.75 54.39
BiC 56.28±3.31 58.42±2.48 62.19±1.20 33.29±2.65 28.44±2.41 47.72
DER++ 80.09±3.00 64.89±2.48 65.84±1.46 50.74±2.41 49.24±5.01 62.16
HAL 79.16±4.56 62.65±0.83 63.96±1.49 48.17±2.94 47.11±6.00 60.21
(d) HAT 83.30±1.54 62.34±0.93 56.72±0.44 57.91±0.72 53.12±0.94 62.68
Sup 80.91±2.99 62.49±0.49 57.32±1.11 58.43±0.67 54.52±0.45 62.74
MORE 88.13±1.16 71.69±0.11 71.29±0.55 64.17±0.77 61.90±0.90 71.44
Figure 3: Average forgetting rate (%). The lower the rate, the better the method is.

Average Forgetting Rate. The average forgetting rate is defined as follows [43]: t=k=1t1(𝒜kinit𝒜kt)/(t1)\mathcal{F}^{t}=\sum_{k=1}^{t-1}(\mathcal{A}^{\text{init}}_{k}-\mathcal{A}^{t}_{k})/(t-1), where 𝒜kinit\mathcal{A}_{k}^{\text{init}} is the classification accuracy on the samples of task kk right after learning task kk, and 𝒜kt\mathcal{A}^{t}_{k} is the accuracy on task kk after learning the last task tt. We do not consider task tt itself as it is the last task and no further learning can cause it to be forgotten.
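A minimal sketch of this computation is given below, assuming an accuracy matrix whose entry (k, t) is the accuracy on task kk after learning task tt.

```python
import numpy as np

def forgetting_rate(acc_matrix):
    """Average forgetting rate after the last task (a sketch).

    acc_matrix[k][t] is the accuracy on task k after learning task t (0-indexed);
    the diagonal acc_matrix[k][k] is the accuracy right after learning task k.
    """
    acc = np.asarray(acc_matrix)
    t = acc.shape[0] - 1                       # index of the last task
    drops = [acc[k, k] - acc[k, t] for k in range(t)]
    return float(np.mean(drops))
```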

We compare the forgetting rate of our method MORE against the baselines using C10-5T, C100-10T, and C100-20T. Figure 3 shows that the performance drop of our method as more tasks are learned is relatively lower than that of many baselines. MUC, A-GEM, HAT, and Sup achieve lower drops than our method. However, they are not able to adapt to new tasks well, as the accuracy values of the four methods on C100-10T are 57.87, 25.21, 62.34, and 62.49, respectively, while our method MORE achieves 70.23. The performance gaps remain consistent on the other datasets. PASS experiences a smaller drop in performance on C100-10T and C100-20T than our method, but its accuracy values of 68.90 and 66.77 are significantly lower than the 70.23 and 70.53 of MORE.

5.2.4 Out-of-Distribution Detection Results

As we explained in the introduction section, since our method MORE is based on OOD detection in building each task model, our method can naturally be used to detect test samples that are out-of-distribution for all the classes or tasks learned thus far. We are not aware of any existing continual learning system that has done such an evaluation. We evaluate the performance of the baselines and our system in this out-of-distribution scenario, which is also called the open set setting.

This OOD detection ability is highly desirable for a continual learning system because in a real-life environment in the open world, the system can be exposed to not only seen classes but also unseen classes. When the test sample is from one of the seen classes, the system should be able to predict its class. If the sample does not belong to any of the training classes seen so far (i.e., the sample is out-of-distribution), the system should detect it.

We formulate the performance of OOD detection of a continual learning system as follows. A continual learning system accepts and classifies a test sample after training task kk if the test sample is from one of the classes in tasks 1,,k1,\cdots,k. If it is from one of the classes of the future tasks k+1,,tk+1,\cdots,t, it should be rejected as OOD (where tt is the last task in each evaluation).

Table 8: AI-AUC Results. Numbers in bold are the best results in each column.
Method C10-5T C100-10T C100-20T T-5T T-10T Avg.
(a) OWM 70.02±3.59 63.17±1.06 59.42±1.26 67.24±0.92 62.17±0.35 64.41
(b) MUC 85.47±3.97 79.28±1.15 74.82±1.91 83.91±0.54 81.42±0.47 80.98
PASS 84.57±1.54 77.74±1.40 77.42±1.44 77.07±2.14 74.79±2.36 78.32
(c) LwF 72.18±4.15 74.95±0.39 75.40±0.64 66.44±1.14 65.52±0.64 70.90
iCaRL 82.12±5.38 77.42±0.45 76.91±1.30 71.86±1.57 74.24±1.66 76.06
A-GEM 74.92±5.62 64.19±0.86 60.23±0.95 67.88±1.28 63.08±1.12 66.06
EEIL 87.19±2.31 78.89±1.32 77.69±1.40 74.82±0.79 73.45±1.33 78.39
GD 89.71±1.85 77.31±1.03 75.19±0.87 75.36±0.78 70.90±1.75 77.69
BiC 71.29±3.57 74.49±0.72 75.71±0.60 67.45±0.89 66.63±0.77 71.11
DER++ 84.61±2.64 78.42±0.64 78.37±0.42 74.80±1.72 74.86±1.93 78.09
HAL 84.09±3.30 77.37±0.55 77.66±0.31 74.52±1.93 75.47±2.35 77.82
(d) HAT 87.83±2.44 79.57±0.29 77.20±0.74 79.78±1.59 78.25±1.68 80.53
Sup 87.06±3.68 80.54±0.12 77.81±0.66 80.01±0.71 78.96±0.64 80.87
MORE 88.06±1.84 81.67±1.27 80.97±0.80 80.72±3.38 79.73±2.97 82.23

We use the maximum softmax probability (MSP) [20] as the OOD score of a test sample for the baselines and the maximum output with the coefficient in Eq. 33 for our method MORE. We employ the Area Under the ROC Curve (AUC) to measure OOD detection performance as AUC is the standard metric used in OOD detection papers [73]. We report the average incremental AUC (AI-AUC), which is the average AUC over all tasks except the last one, as there is no OOD data left after the last task.
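The sketch below illustrates the scoring and the AI-AUC computation; the concatenation of task heads for the MSP score and the data layout are illustrative assumptions.

```python
import torch
import numpy as np
from sklearn.metrics import roc_auc_score

@torch.no_grad()
def msp_score(task_logits):
    """Maximum softmax probability over all learned task heads (a sketch).

    task_logits: list of per-task logit tensors for a batch, each (B, |C^k|).
    The score decides whether a test sample belongs to any seen class.
    """
    probs = torch.softmax(torch.cat(task_logits, dim=1), dim=1)
    return probs.max(dim=1).values.cpu().numpy()

def ai_auc(scores_per_checkpoint):
    """Average incremental AUC over checkpoints k = 1, ..., t-1 (a sketch).

    scores_per_checkpoint[k] = (seen_scores, future_scores): after training task k,
    the scores on test data of tasks 1..k versus test data of tasks k+1..t.
    """
    aucs = []
    for seen, future in scores_per_checkpoint:
        labels = np.concatenate([np.ones_like(seen), np.zeros_like(future)])
        scores = np.concatenate([seen, future])
        aucs.append(roc_auc_score(labels, scores))
    return float(np.mean(aucs))
```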

Table 8 shows that our method MORE outperforms all baselines consistently except MUC and GD. For MUC, it performs better than MORE on Tiny-ImageNet, but on average, it is poorer (80.98) than MORE (82.23). For GD, it outperforms MORE on C10-5T, but its overall performance is much lower as it achieves an average of only 77.69 over the 5 experiments. The best baselines based on accuracy are DER++ and PASS. Their average AI-AUC scores over the experiments shown in the last column are 78.09 for DER++ and 78.32 for PASS, but our method MORE achieves 82.23. In J, we report the OOD detection performances of the models on novel classes drawn from datasets that are completely different from the datasets used in the continual learning process.

5.2.5 Ablation Study

We conduct an ablation study using three experiments to measure the performance gain from each proposed technique: back-updating of the previous models (Section 5.1.2) and the distance-based coefficient (Section 5.1.3). The back-updating by Eq. 29 improves the earlier task models, as they are trained with less diverse OOD data than later models. The modified output with the coefficient in Eq. 30 improves the classification accuracy by combining the softmax probabilities of the task networks and the inverse Mahalanobis distances.

Table 9 compares the accuracy obtained after applying each technique. Both the distance-based coefficient and back-updating show large improvements over the original method without either technique. Although the performance is already competitive with either technique alone, it improves further when the two are applied together.

Table 9: Ablation Study. Performance gains with the proposed techniques. The row Original indicates the method without the coefficient and back-updating, and the row Back means the back-updating method.
Method C10-5T C100-10T C100-20T
Original 91.01±2.48 76.93±1.58 75.76±2.35
Coefficient (C) 93.86±1.12 80.31±1.02 80.77±1.36
Back (B) 93.36±0.79 80.35±1.08 80.32±0.82
C + B 94.23±0.82 81.24±1.24 81.59±0.98

6 Conclusion

This paper studied open-world continual learning (OWCL), which is necessary for learning in the open world. An open-world learning algorithm first needs to detect novel or out-of-distribution (OOD) items and then learn them continually or incrementally. This incremental learning of new items or classes is referred to as class incremental learning (CIL), which is a challenging setting of continual learning. Since the traditional CIL does not do OOD detection, we call it the closed-world CIL. In existing research, novelty/OOD detection and CIL are regarded as two completely different problems. This paper theoretically showed that the two problems can be unified. In particular, we first decomposed the closed-world CIL problem into within-task prediction (WP) and task-id prediction (TP), and then proved that TP is correlated with closed-world OOD detection. The paper then showed that good performance on both is necessary and sufficient for good CIL performance. We then generalized the traditional closed-world CIL to the open-world CIL (or CIL+). CIL+ does both (open-world) OOD detection and CIL. Our theory thus connects and unifies open-world OOD/novelty detection and CIL in continual learning. This combination gives us the paradigm of open-world continual learning or CIL+. The theoretical result also provides principled guidance for designing better continual learning algorithms. Based on the result, several new CIL methods have been designed. They outperform strong CIL baselines by a large margin and also perform novelty (or OOD) detection well in the open-world continual learning setting.

Several interesting future directions are worth pursuing. The central goal of these directions is to achieve learning autonomy. It means that the agent should discover new tasks to learn by itself and also take the initiative to acquire the ground-truth training data for learning. In the current research, the training data for each task is assumed to be provided by human engineers who have collected a large amount of labeled training data for each task. To achieve autonomy, the agent has to interact with human users, other agents, and/or the application environment. This requires an interactive module. To interact with humans, a natural language dialogue system is needed, which must also improve itself continually. Since new tasks are the discovered unknowns, an effective algorithm is also needed to decide whether and how to interact with human users to acquire the ground truth labels and training data when it is uncertain about an unfamiliar object that it sees. Another interesting direction is few-shot continual learning as human users usually cannot provide a large number of training examples. Few-shot continual learning is highly challenging as it has to deal with the difficulties of few-shot learning, catastrophic forgetting, and inter-task class separation. To address these challenges, a powerful foundation model with strong reasoning capabilities is likely to be necessary.

Author Contributions

Gyuhak Kim developed the ideas and algorithms, conducted the experiments, and led the writing of the paper. Changnan Xiao created the mathematical framework and derived the proofs. Tatsuya Konishi and Zixuan Ke contributed to discussions and assisted with editing. Bing Liu supervised the project, contributed to the development of the ideas and algorithms, and helped with writing and revising the paper.

Acknowledgements

We would like to express our sincere gratitude to the reviewers for their valuable and constructive feedback, which has significantly improved the rigor and precision of the theory presented in the paper. The work of Gyuhak Kim, Zixuan Ke, and Bing Liu was supported in part by four NSF grants (IIS-2229876, IIS-1910424, IIS-1838770, and CNS-2225427), a research contract from DARPA (HR001120C0023), and a research contract from KDDI.

Appendix A Proof of Theorems and Corollaries

A.1 Proof of Theorem 1

Proof.

Since

HCIL(x)\displaystyle H_{CIL}(x) =H(y,{𝐏(x𝐗k,j|D)}k,j)\displaystyle=H(y,\{\mathbf{P}(x\in\mathbf{X}_{k,j}|D)\}_{k,j})
=k,jyk,jlog𝐏(x𝐗k,j|D)\displaystyle=-\sum_{k,j}y_{k,j}\log\mathbf{P}(x\in\mathbf{X}_{k,j}|D)
=log𝐏(x𝐗k0,j0|D),\displaystyle=-\log\mathbf{P}(x\in\mathbf{X}_{k_{0},j_{0}}|D),
HWP(x)\displaystyle H_{WP}(x) =H(y~,{𝐏(x𝐗k0,j|x𝐗k0,D)}j)\displaystyle=H(\tilde{y},\{\mathbf{P}(x\in\mathbf{X}_{k_{0},j}|x\in\mathbf{X}_{k_{0}},D)\}_{j})
=jyk0,jlog𝐏(x𝐗k0,j|x𝐗k0,D)\displaystyle=-\sum_{j}y_{k_{0},j}\log\mathbf{P}(x\in\mathbf{X}_{k_{0},j}|x\in\mathbf{X}_{k_{0}},D)
=log𝐏(x𝐗k0,j0|x𝐗k0,D),\displaystyle=-\log\mathbf{P}(x\in\mathbf{X}_{k_{0},j_{0}}|x\in\mathbf{X}_{k_{0}},D),

and

HTP(x)\displaystyle H_{TP}(x) =H(y¯,{𝐏(x𝐗k|D)}k)\displaystyle=H(\bar{y},\{\mathbf{P}(x\in\mathbf{X}_{k}|D)\}_{k})
=ky¯klog𝐏(x𝐗k|D)\displaystyle=-\sum_{k}\bar{y}_{k}\log\mathbf{P}(x\in\mathbf{X}_{k}|D)
=log𝐏(x𝐗k0|D),\displaystyle=-\log\mathbf{P}(x\in\mathbf{X}_{k_{0}}|D),

we have

HCIL(x)\displaystyle H_{CIL}(x) =log𝐏(x𝐗k0,j0|D)\displaystyle=-\log\mathbf{P}(x\in\mathbf{X}_{k_{0},j_{0}}|D)
=log𝐏(x𝐗k0,j0|x𝐗k0,D)log𝐏(x𝐗k0|D)\displaystyle=-\log\mathbf{P}(x\in\mathbf{X}_{k_{0},j_{0}}|x\in\mathbf{X}_{k_{0}},D)-\log\mathbf{P}(x\in\mathbf{X}_{k_{0}}|D)
=HWP(x)+HTP(x)\displaystyle=H_{WP}(x)+H_{TP}(x)
ϵ+δ.\displaystyle\leq\epsilon+\delta.

A.2 Proof of Corollary 1.

Proof.

By proof of Theorem 1, we have

HCIL(x)=HWP(x)+HTP(x).H_{CIL}(x)=H_{WP}(x)+H_{TP}(x).

Taking expectations on both sides, we have i)

𝔼xU(𝐗)[HCIL(x)]\displaystyle\mathbb{E}_{x\sim U(\mathbf{X})}[H_{CIL}(x)] =𝔼xU(𝐗)[HWP(x)]+𝔼xU(𝐗)[HTP(x)]\displaystyle=\mathbb{E}_{x\sim U(\mathbf{X})}[H_{WP}(x)]+\mathbb{E}_{x\sim U(\mathbf{X})}[H_{TP}(x)]
𝔼xU(𝐗)[HWP(x)]+δ.\displaystyle\leq\mathbb{E}_{x\sim U(\mathbf{X})}[H_{WP}(x)]+\delta.

and ii)

𝔼xU(𝐗)[HCIL(x)]\displaystyle\mathbb{E}_{x\sim U(\mathbf{X})}[H_{CIL}(x)] =𝔼xU(𝐗)[HWP(x)]+𝔼xU(𝐗)[HTP(x)]\displaystyle=\mathbb{E}_{x\sim U(\mathbf{X})}[H_{WP}(x)]+\mathbb{E}_{x\sim U(\mathbf{X})}[H_{TP}(x)]
ϵ+𝔼xU(𝐗)[HTP(x)].\displaystyle\leq\epsilon+\mathbb{E}_{x\sim U(\mathbf{X})}[H_{TP}(x)].

A.3 Proof of Theorem 2.

Proof.

i) Assume x𝐗k0x\in\mathbf{X}_{k_{0}}. For k=k0k=k_{0}, we have

HOOD,k0(x)\displaystyle H_{OOD,k_{0}}(x) =log𝐏k0(x𝐗k0|D)\displaystyle=-\log\mathbf{P}^{\prime}_{k_{0}}(x\in\mathbf{X}_{k_{0}}|D)
=log𝐏(x𝐗k0|D)\displaystyle=-\log\mathbf{P}(x\in\mathbf{X}_{k_{0}}|D)
=HTP(x)δ.\displaystyle=H_{TP}(x)\leq\delta.

For kk0k\neq k_{0}, we have

HOOD,k(x)\displaystyle H_{OOD,k}(x) =log𝐏k(x𝐗\𝐗k|D)\displaystyle=-\log\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}\backslash\mathbf{X}_{k}|D)
=log(1𝐏k(x𝐗k|D))\displaystyle=-\log(1-\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D))
=log(1𝐏(x𝐗k|D))\displaystyle=-\log(1-\mathbf{P}(x\in\mathbf{X}_{k}|D))
=log𝐏(xkk𝐗k|D)\displaystyle=-\log\mathbf{P}(x\in\cup_{k^{\prime}\neq k}\mathbf{X}_{k^{\prime}}|D)
log𝐏(x𝐗k0|D)\displaystyle\leq-\log\mathbf{P}(x\in\mathbf{X}_{k_{0}}|D)
=HTP(x)δ.\displaystyle=H_{TP}(x)\leq\delta.

ii) Assume $x\in\mathbf{X}_{k_0}$. For $k=k_0$, by $H_{OOD,k_0}(x)\leq\delta_{k_0}$, we have

-\log\mathbf{P}^{\prime}_{k_0}(x\in\mathbf{X}_{k_0}|D) \leq \delta_{k_0},

which means

\mathbf{P}^{\prime}_{k_0}(x\in\mathbf{X}_{k_0}|D) \geq e^{-\delta_{k_0}}.

For $k\neq k_0$, by $H_{OOD,k}(x)\leq\delta_{k}$, we have

-\log\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}\backslash\mathbf{X}_{k}|D) \leq \delta_{k},

which means

\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D) \leq 1-e^{-\delta_{k}}.

Therefore, we have

\mathbf{P}(x\in\mathbf{X}_{k_0}|D) = \frac{\mathbf{P}^{\prime}_{k_0}(x\in\mathbf{X}_{k_0}|D)}{\sum_{k^{\prime}}\mathbf{P}^{\prime}_{k^{\prime}}(x\in\mathbf{X}_{k^{\prime}}|D)}
                                   \geq \frac{e^{-\delta_{k_0}}}{1+\sum_{k\neq k_0}(1-e^{-\delta_{k}})}
                                   = \frac{e^{-\delta_{k_0}}}{e^{-\delta_{k_0}}+\sum_{k}(1-e^{-\delta_{k}})}
                                   = \frac{1}{1+e^{\delta_{k_0}}\sum_{k}(1-e^{-\delta_{k}})}.

Hence,

H_{TP}(x) = -\log\mathbf{P}(x\in\mathbf{X}_{k_0}|D)
          \leq -\log\frac{1}{1+e^{\delta_{k_0}}\sum_{k}(1-e^{-\delta_{k}})}
          = \log\Big(1+e^{\delta_{k_0}}\sum_{k}(1-e^{-\delta_{k}})\Big)
          \leq e^{\delta_{k_0}}\sum_{k}(1-e^{-\delta_{k}})
          = \Big(\sum_{k}\mathbf{1}_{x\in\mathbf{X}_{k}}e^{\delta_{k}}\Big)\Big(\sum_{k}(1-e^{-\delta_{k}})\Big).
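As an illustration of part ii), the following sketch (our own; the sampling scheme is only for illustration) draws per-task acceptance probabilities $\mathbf{P}^{\prime}_{k}$ consistent with $H_{OOD,k}(x)\leq\delta_k$, forms the normalized TP distribution used above, and checks that $H_{TP}(x)$ stays below the derived bound.

import numpy as np

rng = np.random.default_rng(0)
T, k0 = 4, 1                                  # number of tasks, true task index
delta = rng.uniform(0.05, 0.3, size=T)        # per-task OOD cross-entropy budgets

# Sample P'_k(x in X_k | D) consistent with H_{OOD,k}(x) <= delta_k:
# the true task k0 gets a high acceptance probability, the other tasks low ones.
P_acc = np.empty(T)
P_acc[k0] = rng.uniform(np.exp(-delta[k0]), 1.0)            # >= e^{-delta_{k0}}
for k in range(T):
    if k != k0:
        P_acc[k] = rng.uniform(0.0, 1 - np.exp(-delta[k]))  # <= 1 - e^{-delta_k}

P_tp = P_acc / P_acc.sum()                    # definition of TP used in Theorem 2 ii)
H_tp = -np.log(P_tp[k0])

bound = np.exp(delta[k0]) * np.sum(1 - np.exp(-delta))
assert H_tp <= bound + 1e-12
print(H_tp, bound)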

A.4 Proof of Corollary 2.

Proof.

i) Assume $x\in\mathbf{X}_{k_0}$. For $k=k_0$, we have

H_{OOD^{+},k_0}(x) = -\log\mathbf{P}^{\prime}_{k_0}(x\in\mathbf{X}_{k_0}|D)
                   = -\log\mathbf{P}(x\in\mathbf{X}_{k_0}|D)
                   = H_{TP^{+}}(x) \leq \delta.

For $k\neq k_0$, we have

H_{OOD^{+},k}(x) = -\log\mathbf{P}^{\prime}_{k}(x\in(\mathbf{X}\cup\mathbf{X}^{+})\backslash\mathbf{X}_{k}|D)
                 = -\log(1-\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D))
                 = -\log(1-\mathbf{P}(x\in\mathbf{X}_{k}|D))
                 = -\log\mathbf{P}(x\in(\cup_{k^{\prime}\neq k}\mathbf{X}_{k^{\prime}})\cup\mathbf{X}^{+}|D)
                 \leq -\log\mathbf{P}(x\in\mathbf{X}_{k_0}|D)
                 = H_{TP^{+}}(x) \leq \delta.

ii.a) Assume $x\in\mathbf{X}_{k_0}$. For $k=k_0$, by $H_{OOD^{+},k_0}(x)\leq\delta_{k_0}$, we have

-\log\mathbf{P}^{\prime}_{k_0}(x\in\mathbf{X}_{k_0}|D) \leq \delta_{k_0},

which means

\mathbf{P}^{\prime}_{k_0}(x\in\mathbf{X}_{k_0}|D) \geq e^{-\delta_{k_0}}.

For $k\neq k_0$, by $H_{OOD^{+},k}(x)\leq\delta_{k}$, we have

-\log\mathbf{P}^{\prime}_{k}(x\in(\mathbf{X}\cup\mathbf{X}^{+})\backslash\mathbf{X}_{k}|D) \leq \delta_{k},

which means

\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D) \leq 1-e^{-\delta_{k}}.

Therefore, we have

\mathbf{P}(x\in\mathbf{X}_{k_0}|D) = \frac{\mathbf{P}^{\prime}_{k_0}(x\in\mathbf{X}_{k_0}|D)}{\sum_{k^{\prime}}\mathbf{P}^{\prime}_{k^{\prime}}(x\in\mathbf{X}_{k^{\prime}}|D)+\prod_{k^{\prime}}(1-\mathbf{P}^{\prime}_{k^{\prime}}(x\in\mathbf{X}_{k^{\prime}}|D))}
                                   \geq \frac{e^{-\delta_{k_0}}}{1+\sum_{k\neq k_0}(1-e^{-\delta_{k}})+(1-e^{-\delta_{k_0}})\prod_{k^{\prime}\neq k_0}1}
                                   = \frac{e^{-\delta_{k_0}}}{e^{-\delta_{k_0}}+\sum_{k}(1-e^{-\delta_{k}})+1-e^{-\delta_{k_0}}}
                                   = \frac{1}{1+e^{\delta_{k_0}}\big(1-e^{-\delta_{k_0}}+\sum_{k}(1-e^{-\delta_{k}})\big)}.

Hence,

H_{TP^{+}}(x) = -\log\mathbf{P}(x\in\mathbf{X}_{k_0}|D)
              \leq -\log\frac{1}{1+e^{\delta_{k_0}}\big(1-e^{-\delta_{k_0}}+\sum_{k}(1-e^{-\delta_{k}})\big)}
              = \log\Big(1+e^{\delta_{k_0}}\big(1-e^{-\delta_{k_0}}+\sum_{k}(1-e^{-\delta_{k}})\big)\Big)
              \leq e^{\delta_{k_0}}\Big(1-e^{-\delta_{k_0}}+\sum_{k}(1-e^{-\delta_{k}})\Big)
              = \Big(\sum_{k}\mathbf{1}_{x\in\mathbf{X}_{k}}e^{\delta_{k}}\Big)\Big(\sum_{k}(1+\mathbf{1}_{x\in\mathbf{X}_{k}})(1-e^{-\delta_{k}})\Big).

ii.b) Assume $x\in\mathbf{X}^{+}$. For $k=1,\dots,T$, by $H_{OOD^{+},k}(x)\leq\delta_{k}$, we have

-\log\mathbf{P}^{\prime}_{k}(x\in(\mathbf{X}\cup\mathbf{X}^{+})\backslash\mathbf{X}_{k}|D) \leq \delta_{k},

which means

\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D) \leq 1-e^{-\delta_{k}}.

By definition, we have

\mathbf{P}(x\in\mathbf{X}^{+}|D) = \frac{\prod_{k^{\prime}}(1-\mathbf{P}^{\prime}_{k^{\prime}}(x\in\mathbf{X}_{k^{\prime}}|D))}{\sum_{k^{\prime}}\mathbf{P}^{\prime}_{k^{\prime}}(x\in\mathbf{X}_{k^{\prime}}|D)+\prod_{k^{\prime}}(1-\mathbf{P}^{\prime}_{k^{\prime}}(x\in\mathbf{X}_{k^{\prime}}|D))}.

Hence,

H_{TP^{+}}(x) = -\log\mathbf{P}(x\in\mathbf{X}^{+}|D)
              = \log\left(1+\frac{\sum_{k^{\prime}}\mathbf{P}^{\prime}_{k^{\prime}}(x\in\mathbf{X}_{k^{\prime}}|D)}{\prod_{k^{\prime}}(1-\mathbf{P}^{\prime}_{k^{\prime}}(x\in\mathbf{X}_{k^{\prime}}|D))}\right)
              \leq \frac{\sum_{k^{\prime}}\mathbf{P}^{\prime}_{k^{\prime}}(x\in\mathbf{X}_{k^{\prime}}|D)}{\prod_{k^{\prime}}(1-\mathbf{P}^{\prime}_{k^{\prime}}(x\in\mathbf{X}_{k^{\prime}}|D))}
              \leq \frac{\sum_{k^{\prime}}(1-e^{-\delta_{k^{\prime}}})}{\prod_{k^{\prime}}e^{-\delta_{k^{\prime}}}}
              = \prod_{k}e^{\delta_{k}}\sum_{k}(1-e^{-\delta_{k}}).

A.5 Proof of Theorem 3.

Proof.

Using Theorems 1 and 2,

H_{CIL}(x) = -\log\mathbf{P}(x\in\mathbf{X}_{k_0,j_0}|D)
           = -\log\mathbf{P}(x\in\mathbf{X}_{k_0,j_0}|x\in\mathbf{X}_{k_0},D) - \log\mathbf{P}(x\in\mathbf{X}_{k_0}|D)
           = H_{WP}(x) + H_{TP}(x)
           \leq \epsilon + H_{TP}(x)
           \leq \epsilon + \Big(\sum_{k}\mathbf{1}_{x\in\mathbf{X}_{k}}e^{\delta_{k}}\Big)\Big(\sum_{k}(1-e^{-\delta_{k}})\Big).

A.6 Proof of Theorem 4.

Proof.

i) Assume $x\in\mathbf{X}_{k_0,j_0}\subset\mathbf{X}_{k_0}$. Define $\mathbf{P}(x\in\mathbf{X}_{k,j}|x\in\mathbf{X}_{k},D)=\mathbf{P}(x\in\mathbf{X}_{k,j}|D)$. According to the proof of Theorem 1,

H_{WP}(x) = -\log\mathbf{P}(x\in\mathbf{X}_{k_0,j_0}|x\in\mathbf{X}_{k_0},D),
H_{CIL}(x) = -\log\mathbf{P}(x\in\mathbf{X}_{k_0,j_0}|D).

Hence, we have

H_{WP}(x) = -\log\mathbf{P}(x\in\mathbf{X}_{k_0,j_0}|x\in\mathbf{X}_{k_0},D)
          = -\log\mathbf{P}(x\in\mathbf{X}_{k_0,j_0}|D)
          = H_{CIL}(x) \leq \eta.

ii) Assume $x\in\mathbf{X}_{k_0,j_0}\subset\mathbf{X}_{k_0}$. Define $\mathbf{P}(x\in\mathbf{X}_{k}|D)=\sum_{j}\mathbf{P}(x\in\mathbf{X}_{k,j}|D)$. According to the proof of Theorem 1,

H_{TP}(x) = -\log\mathbf{P}(x\in\mathbf{X}_{k_0}|D),
H_{CIL}(x) = -\log\mathbf{P}(x\in\mathbf{X}_{k_0,j_0}|D).

Hence, we have

H_{TP}(x) = -\log\mathbf{P}(x\in\mathbf{X}_{k_0}|D)
          = -\log\sum_{j}\mathbf{P}(x\in\mathbf{X}_{k_0,j}|D)
          \leq -\log\mathbf{P}(x\in\mathbf{X}_{k_0,j_0}|D)
          = H_{CIL}(x) \leq \eta.

iii) Assume $x\in\mathbf{X}_{k_0,j_0}\subset\mathbf{X}_{k_0}$. For each task $i$, define $\mathbf{P}^{\prime}_{i}(x\in\mathbf{X}_{i}|D)=\mathbf{P}(x\in\mathbf{X}_{i}|D)=\sum_{j}\mathbf{P}(x\in\mathbf{X}_{i,j}|D)$. According to the proof of Theorem 4 ii), we have

H_{TP}(x) \leq \eta.

According to the proof of Theorem 2 i), we have

H_{OOD,i}(x) \leq H_{TP}(x).

Therefore,

H_{OOD,i}(x) \leq H_{TP}(x) \leq \eta.

Appendix B Additional Results and Explanation Regarding Table 1 in the Main Paper

In Section 4.2.3, we showed that better OOD detection improves CIL performance. For the post-processing method ODIN, we reported only the results on C100-10T. Table 10 shows the results on the other datasets.

Table 10: Performance comparison between the original output and the output post-processed with the OOD detection technique ODIN. Note that ODIN does not apply to iCaRL and Mnemonics as they are based not on softmax but on distance functions. The results for C100-10T are reported in the main paper.
C10-5T C100-20T T-5T T-10T
Method AUC CIL AUC CIL AUC CIL AUC CIL
OWM Original 81.33 51.79 71.90 24.15 58.49 10.00 59.48 8.57
ODIN 71.72 40.65 68.52 23.05 58.46 10.77 59.38 9.52
MUC Original 79.49 52.85 66.20 14.19 68.42 33.57 62.63 17.39
ODIN 79.54 53.22 65.72 14.11 68.32 33.45 62.17 17.27
PASS Original 66.51 47.34 70.26 24.99 65.18 28.40 63.27 19.07
ODIN 63.08 35.20 69.81 21.83 65.93 29.03 62.73 17.78
LwF Original 89.39 54.67 89.84 44.33 78.20 32.17 79.43 24.28
ODIN 88.94 63.04 88.68 47.56 76.83 36.20 77.02 28.29
A-GEM Original 85.93 20.03 74.48 4.14 72.33 13.52 76.42 7.66
ODIN 86.43 34.03 75.12 6.99 72.46 14.69 76.75 8.50
EEIL Original 89.72 57.09 85.96 33.46 64.82 14.67 64.87 9.79
ODIN 89.20 59.47 85.46 35.16 57.01 11.92 55.42 6.88
GD Original 91.23 58.69 86.76 38.83 68.63 16.36 69.61 11.73
ODIN 90.39 60.53 86.64 42.33 60.75 13.43 63.92 11.83
BiC Original 90.89 61.41 89.46 48.92 80.17 41.75 80.37 33.77
ODIN 91.86 64.29 87.89 47.40 74.54 37.40 76.27 29.06
DER++ Original 90.16 66.04 85.44 46.59 71.80 35.80 72.41 30.49
ODIN 87.08 63.07 87.72 49.26 73.92 37.87 72.91 32.52
HAL Original 86.16 32.82 65.59 13.51 53.00 3.42 57.87 3.36
ODIN 76.27 44.75 64.46 17.40 53.26 4.80 58.13 4.74
HAT Original 82.47 62.67 75.35 25.64 72.28 38.46 71.82 29.78
ODIN 82.45 62.60 75.36 25.84 72.31 38.61 71.83 30.01
HyperNet Original 78.54 53.40 72.04 18.67 54.58 7.91 55.37 5.32
ODIN 79.39 56.72 73.89 23.8 54.60 8.64 55.53 6.91
Sup Original 79.16 62.37 81.14 34.70 74.13 41.82 74.59 36.46
ODIN 82.38 62.63 81.48 36.35 73.96 41.10 74.61 36.46

A continual learning method with a higher AUC generally shows better CIL performance than methods with a lower AUC. For instance, the original HAT achieves an AUC of 82.47 while HyperNet achieves 78.54 on C10-5T; the CIL accuracy for HAT is 62.67 while it is 53.40 for HyperNet. However, there are exceptions where this comparison does not hold. An example is LwF: its AUC and CIL on C10-5T are 89.39 and 54.67. Although its AUC is better than that of HAT, its CIL is lower. This is because CIL improves with both WP and TP according to Theorem 1. The contrapositive of Theorem 4 also says that if the cross-entropy of TIL is large, that of CIL is also large. Indeed, the average within-task prediction (WP) accuracy for LwF on C10-5T is 95.2 while that of HAT is 96.7. Improving WP is therefore also important for achieving good CIL performance.

For PASS, we had to tune $\tau_k$ using a validation set. This is because the softmax in Eq. 25 improves AUC by making the IND (in-distribution) and OOD scores more separable within a task, but it deteriorates the final scores across tasks. Specifically, after the softmax, test instances tend to be predicted as one of the classes of the first task because in PASS the relative differences between class outputs in task 1 are larger than those in the other tasks. Therefore, a larger $\tau_1$ and smaller $\tau_k$ for $k>1$ are chosen to compensate for these relative values.

Appendix C Definitions of TP

As noted in the main paper, the class prediction in Eq. 2 varies with the definition of WP and TP. The precise definition of WP and TP depends on the implementation. Due to this flexibility, we follow the prediction method used in existing continual learning approaches, which is the $\operatorname*{arg\,max}$ over the output. In this section, we show that the $\operatorname*{arg\,max}$ over the output is a special case of Eq. 2. We also provide CIL results using different definitions of TP.

We first establish another theorem. This is an extension of Theorem 2 and connects the standard prediction method to our analysis.

Theorem 5 (Extension of Theorem 2).

i) If $H_{TP}(x)\leq\delta$, let $\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D)=\mathbf{P}(x\in\mathbf{X}_{k}|D)^{1/\tau_{k}}$ for any $\tau_{k}>0$; then $H_{OOD,k}(x)\leq\max\big(\delta/\tau_{k},\,-\log(1-(1-e^{-\delta})^{1/\tau_{k}})\big)$ for all $k=1,\dots,T$.

ii) If $H_{OOD,k}(x)\leq\delta_{k}$, $k=1,\dots,T$, let $\mathbf{P}(x\in\mathbf{X}_{k}|D)=\frac{\mathbf{P}_{k}^{\prime}(x\in\mathbf{X}_{k}|D)^{1/\tau_{k}}}{\sum_{j}\mathbf{P}_{j}^{\prime}(x\in\mathbf{X}_{j}|D)^{1/\tau_{j}}}$ for any $\tau_{k}>0$; then $H_{TP}(x)\leq\sum_{k}\frac{\mathbf{1}_{x\in\mathbf{X}_{k}}\delta_{k}}{\tau_{k}}+\frac{\sum_{k}(1-e^{-\delta_{k}})^{1/\tau_{k}}}{\sum_{k}\mathbf{1}_{x\in\mathbf{X}_{k}}(1-(1-e^{-\delta_{k}})^{1/\tau_{k}})}$, where $\mathbf{1}_{x\in\mathbf{X}_{k}}$ is an indicator function.

In Theorem 5 (proof appears later), we can observe that $\delta/\tau_{k}$ decreases as $\tau_{k}$ increases, while $-\log(1-(1-e^{-\delta})^{1/\tau_{k}})$ increases. Hence, when TP is given, letting $\delta=H_{TP}(x)$, we can find the optimal $\tau_{k}$ to define OOD by solving $\delta/\tau_{k}=-\log(1-(1-e^{-\delta})^{1/\tau_{k}})$. Similarly, given OOD, letting $\delta_{k}=H_{OOD,k}(x)$, we can find the optimal $\tau_{1},\dots,\tau_{T}$ to define TP by finding the global minimum of $\sum_{k}\frac{\mathbf{1}_{x\in\mathbf{X}_{k}}\delta_{k}}{\tau_{k}}+\frac{\sum_{k}(1-e^{-\delta_{k}})^{1/\tau_{k}}}{\sum_{k}\mathbf{1}_{x\in\mathbf{X}_{k}}(1-(1-e^{-\delta_{k}})^{1/\tau_{k}})}$. The optimal $\tau_{k}$ can be found using a memory buffer that saves a small amount of data from previous tasks, as in replay-based continual learning methods.
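For concreteness, the optimal $\tau_{k}$ in the first case can be obtained with a one-dimensional root finder. The snippet below is our own sketch (not the paper's implementation) and assumes SciPy is available; `optimal_tau` is a hypothetical helper name.

import math
from scipy.optimize import brentq

def optimal_tau(delta: float) -> float:
    """Solve delta/tau = -log(1 - (1 - e^{-delta})^{1/tau}) for tau > 0."""
    def gap(tau):
        # Positive for small tau, negative for large tau, so bisection applies.
        return delta / tau + math.log(1.0 - (1.0 - math.exp(-delta)) ** (1.0 / tau))
    return brentq(gap, 1e-3, 1e3)

delta = 0.2                # an assumed TP cross-entropy budget H_TP(x) <= delta
tau = optimal_tau(delta)
print(tau, delta / tau)    # at the root, the two terms inside the max coincide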

In Theorem 5 (ii), let $\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D)=\sigma(\max f(x)_{k})$, where $\sigma$ is the sigmoid and $f(x)_{k}$ is the output of task $k$, and choose $\tau_{k}\approx 0$ for each $k$. Then $\mathbf{P}(x\in\mathbf{X}_{k}|D)$ becomes approximately 1 for the task $k$ in which the maximum logit value appears and 0 for the remaining tasks. Therefore, Eq. 2 in the paper

\mathbf{P}(x\in\mathbf{X}_{k,j}|D) = \mathbf{P}(x\in\mathbf{X}_{k,j}|x\in\mathbf{X}_{k},D)\,\mathbf{P}(x\in\mathbf{X}_{k}|D)

is zero for all classes in tasks $k^{\prime}\neq k$. Since only the probabilities of the classes in task $k$ are non-zero, taking the $\operatorname*{arg\,max}$ over all class probabilities gives the same class as the $\operatorname*{arg\,max}$ over the output logits.
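This reduction can be verified with a few lines of code. The sketch below is our own illustration (the function and variable names are ours): it uses the sigmoid of the maximum logit as $\mathbf{P}^{\prime}_{k}$, a small temperature $\tau_k$, and checks that the combined prediction coincides with the $\operatorname*{arg\,max}$ over the concatenated logits.

import numpy as np

def combined_argmax(task_logits, tau=0.05):
    # P'_k = sigmoid(max logit of task k), sharpened by the exponent 1/tau and normalized.
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    tp_raw = np.array([sig(f.max()) ** (1.0 / tau) for f in task_logits])
    tp = tp_raw / tp_raw.sum()                               # nearly one-hot on the max-logit task
    wp = [np.exp(f) / np.exp(f).sum() for f in task_logits]  # per-task softmax (WP)
    joint = np.concatenate([t * w for t, w in zip(tp, wp)])  # Eq. 2: P(x in X_{k,j} | D)
    return int(joint.argmax())

logits = [np.array([0.2, 1.1]), np.array([2.3, -0.4]), np.array([0.9, 0.5])]
assert combined_argmax(logits) == int(np.concatenate(logits).argmax())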

We have also tried another definition of WP and TP. The considered WP is

\mathbf{P}(x\in\mathbf{X}_{k,j}|x\in\mathbf{X}_{k},D) = \frac{e^{f(x)_{kj}/\nu_{k}}}{\sum_{j^{\prime}}e^{f(x)_{kj^{\prime}}/\nu_{k}}}, (34)

where $\nu_{k}$ is a temperature scaling parameter for task $k$, and the TP is

\mathbf{P}(x\in\mathbf{X}_{k}|D) = \frac{\mathbf{P}_{k}^{\prime}(x\in\mathbf{X}_{k}|D)}{\sum_{k^{\prime}}\mathbf{P}_{k^{\prime}}^{\prime}(x\in\mathbf{X}_{k^{\prime}}|D)}, (35)

where $\mathbf{P}_{k}^{\prime}(x\in\mathbf{X}_{k}|D)=\max_{j}e^{f(x)_{kj}/\tau_{k}}/\sum_{j^{\prime}}e^{f(x)_{kj^{\prime}}/\tau_{k}}$ and $\tau_{k}$ is a temperature scaling parameter. This is the maximum softmax of task $k$. We choose $\nu_{k}=0.1$ and $\tau_{k}=5$ for all $k$. Good values of $\tau$ and $\nu$ can be found by grid search on a validation set, or by optimization using some past data saved in the memory buffer. The CIL results for this new prediction method are in Table 11.
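A minimal sketch of this prediction rule is given below (our own illustration; `predict` and `task_logits` are hypothetical names). It assumes the per-task logit vectors $f(x)_k$ are available and uses $\nu_k=0.1$ and $\tau_k=5$ as above to combine WP (Eq. 34) and TP (Eq. 35) into the class probabilities of Eq. 2.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predict(task_logits, nu=0.1, tau=5.0):
    """task_logits: list of 1-D arrays, one per learned task (f(x)_k)."""
    wp = [softmax(f / nu) for f in task_logits]                        # Eq. 34, per-task WP
    tp_raw = np.array([softmax(f / tau).max() for f in task_logits])   # maximum softmax of task k
    tp = tp_raw / tp_raw.sum()                                         # Eq. 35, normalized TP
    joint = np.concatenate([t * w for t, w in zip(tp, wp)])            # Eq. 2
    return int(joint.argmax()), joint

# Toy example with 3 tasks of 2 classes each (the logits are made up).
logits = [np.array([1.2, -0.3]), np.array([3.1, 0.4]), np.array([-0.5, 0.2])]
pred_class, probs = predict(logits)
print(pred_class, probs.round(3))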

Table 11: Average Accuracy with a Different Prediction Method
Method C10-5T C100-10T C100-20T T-5T T-10T
OWM 40.6 ±0.47 28.6 ±0.82 22.9 ±0.32 10.4 ±0.54 9.2 ±0.35
MUC 53.2 ±1.32 30.6 ±1.21 14.0 ±0.12 33.1 ±0.18 17.2 ±0.13
PASS 33.6 ±0.71 18.5 ±1.85 20.8 ±0.85 21.4 ±0.44 13.0 ±0.55
LwF 63.0 ±0.34 51.9 ±0.88 47.5 ±0.62 35.9 ±0.32 27.8 ±0.29
iCaRL 65.3 ±0.83 52.9 ±0.39 48.2 ±0.70 34.8 ±0.34 27.3 ±0.17
A-GEM 34.0 ±1.86 14.5 ±0.55 7.3 ±1.78 15.4 ±0.24 9.0 ±0.30
EEIL 59.5 ±0.41 41.8 ±0.78 37.9 ±6.11 15.1 ±0.00 7.5 ±0.19
GD 68.0 ±0.75 47.2 ±0.33 41.8 ±0.25 15.7 ±2.08 12.2 ±0.14
Mnemonics†∗ 65.6 ±1.55 50.7 ±0.72 47.9 ±0.71 36.3 ±0.30 27.7 ±0.78
BiC 65.5 ±0.81 50.8 ±0.69 47.2 ±0.71 37.0 ±0.58 29.1 ±0.34
DER++ 63.1 ±1.12 54.6 ±1.21 48.9 ±1.18 37.4 ±0.72 32.1 ±0.44
HAL 43.0 ±3.10 20.0 ±1.15 17.0 ±0.83 4.6 ±0.58 4.8 ±0.50
HAT 62.6 ±1.31 41.5 ±0.80 25.9 ±0.56 38.9 ±1.62 30.1 ±0.52
HyperNet 56.7 ±1.23 32.4 ±1.07 24.5 ±1.12 8.9 ±0.58 7.0 ±0.52
Sup 62.6 ±1.11 46.8 ±0.34 36.0 ±0.32 41.5 ±1.17 35.7 ±0.40
HAT+CSI 85.2 ±0.92 62.9 ±1.07 53.6 ±0.84 47.0 ±0.38 46.2 ±0.30
Sup+CSI 87.4 ±0.40 66.6 ±0.23 60.5 ±0.89 47.7 ±0.30 46.3 ±0.30
HAT+CSI+c 85.2 ±0.94 63.6 ±0.69 55.4 ±0.79 51.4 ±0.38 46.5 ±0.26
Sup+CSI+c 86.2 ±0.79 67.0 ±0.14 60.4 ±1.04 48.2 ±0.35 46.1 ±0.32

Average classification accuracy. The results are based on the class prediction method defined with WP and TP in Eq. 34 and Eq. 35, respectively. The results can be improved by finding optimal temperature scaling parameters.
Proof of Theorem 5.

i) Assume $x\in\mathbf{X}_{k_0}$.

For $k=k_0$, we have

H_{OOD,k_0}(x) = -\log\mathbf{P}^{\prime}_{k_0}(x\in\mathbf{X}_{k_0}|D)
              = -\frac{1}{\tau_{k_0}}\log\mathbf{P}(x\in\mathbf{X}_{k_0}|D)
              = \frac{1}{\tau_{k_0}}H_{TP}(x) \leq \frac{\delta}{\tau_{k_0}}.

For $k\neq k_0$, we have

H_{OOD,k}(x) = -\log\mathbf{P}^{\prime}_{k}(x\notin\mathbf{X}_{k}|D)
             = -\log(1-\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D))
             = -\log(1-\mathbf{P}(x\in\mathbf{X}_{k}|D)^{1/\tau_{k}})
             = -\log(1-(1-\mathbf{P}(x\in\cup_{k^{\prime}\neq k}\mathbf{X}_{k^{\prime}}|D))^{1/\tau_{k}})
             \leq -\log(1-(1-\mathbf{P}(x\in\mathbf{X}_{k_0}|D))^{1/\tau_{k}})
             = -\log(1-(1-e^{-H_{TP}(x)})^{1/\tau_{k}})
             \leq -\log(1-(1-e^{-\delta})^{1/\tau_{k}}).

ii) Assume $x\in\mathbf{X}_{k_0}$.

For $k=k_0$, by $H_{OOD,k_0}(x)\leq\delta_{k_0}$, we have

-\log\mathbf{P}^{\prime}_{k_0}(x\in\mathbf{X}_{k_0}|D) \leq \delta_{k_0},

which means

\mathbf{P}^{\prime}_{k_0}(x\in\mathbf{X}_{k_0}|D) \geq e^{-\delta_{k_0}}.

For $k\neq k_0$, by $H_{OOD,k}(x)\leq\delta_{k}$, we have

-\log\mathbf{P}^{\prime}_{k}(x\notin\mathbf{X}_{k}|D) \leq \delta_{k},

which means

\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D) \leq 1-e^{-\delta_{k}}.

Therefore, we have

\mathbf{P}(x\in\mathbf{X}_{k_0}|D) = \frac{\mathbf{P}^{\prime}_{k_0}(x\in\mathbf{X}_{k_0}|D)^{1/\tau_{k_0}}}{\sum_{k}\mathbf{P}^{\prime}_{k}(x\in\mathbf{X}_{k}|D)^{1/\tau_{k}}}
                                   \geq \frac{e^{-\delta_{k_0}/\tau_{k_0}}}{1+\sum_{k\neq k_0}(1-e^{-\delta_{k}})^{1/\tau_{k}}}
                                   = \frac{e^{-\delta_{k_0}/\tau_{k_0}}}{1-(1-e^{-\delta_{k_0}})^{1/\tau_{k_0}}+\sum_{k}(1-e^{-\delta_{k}})^{1/\tau_{k}}}
                                   = \frac{e^{-\delta_{k_0}/\tau_{k_0}}}{1-(1-e^{-\delta_{k_0}})^{1/\tau_{k_0}}}\cdot\frac{1}{1+\frac{\sum_{k}(1-e^{-\delta_{k}})^{1/\tau_{k}}}{1-(1-e^{-\delta_{k_0}})^{1/\tau_{k_0}}}}.

Hence,

H_{TP}(x) = -\log\mathbf{P}(x\in\mathbf{X}_{k_0}|D)
          \leq -\log\left(\frac{e^{-\delta_{k_0}/\tau_{k_0}}}{1-(1-e^{-\delta_{k_0}})^{1/\tau_{k_0}}}\cdot\frac{1}{1+\frac{\sum_{k}(1-e^{-\delta_{k}})^{1/\tau_{k}}}{1-(1-e^{-\delta_{k_0}})^{1/\tau_{k_0}}}}\right)
          = \frac{\delta_{k_0}}{\tau_{k_0}}+\log\big[1-(1-e^{-\delta_{k_0}})^{1/\tau_{k_0}}\big]+\log\left[1+\frac{\sum_{k}(1-e^{-\delta_{k}})^{1/\tau_{k}}}{1-(1-e^{-\delta_{k_0}})^{1/\tau_{k_0}}}\right]
          \leq \frac{\delta_{k_0}}{\tau_{k_0}}+\frac{\sum_{k}(1-e^{-\delta_{k}})^{1/\tau_{k}}}{1-(1-e^{-\delta_{k_0}})^{1/\tau_{k_0}}}
          = \sum_{k}\frac{\mathbf{1}_{x\in\mathbf{X}_{k}}\delta_{k}}{\tau_{k}}+\frac{\sum_{k}(1-e^{-\delta_{k}})^{1/\tau_{k}}}{\sum_{k}\mathbf{1}_{x\in\mathbf{X}_{k}}(1-(1-e^{-\delta_{k}})^{1/\tau_{k}})}.

Appendix D Output Calibration

In this section, we discuss the output calibration technique used in Section 4.2.4 to improve the final prediction accuracy. Even if the OOD detection of each task were perfect (i.e., the model accepts IND samples and rejects OOD samples perfectly), the system could still make an incorrect class prediction if the magnitudes of the outputs across different tasks differ. To ensure that the output values are comparable, we calibrate the outputs with a scaling parameter $\alpha_{k}$ and a shifting parameter $\beta_{k}$ for each task. The optimal parameters $(\alpha_{k},\beta_{k})\in\mathbb{R}\times\mathbb{R}$ can be found by solving an optimization problem using the samples in the memory buffer. More precisely, denote the memory buffer by $\mathcal{M}$ and the calibration parameters by $(\alpha,\beta)\in\mathbb{R}^{T}\times\mathbb{R}^{T}$, where $T$ is the number of learned tasks. After training the $T$th task, we find the optimal calibration parameters by minimizing the cross-entropy loss

\mathcal{L} = -\frac{1}{|\mathcal{M}|}\sum_{(x,y)\in\mathcal{M}}\log p(y|x), (36)

where $p(y|x)$ is computed using the softmax

\text{softmax}\bigoplus_{k}[\alpha_{k}f(x)_{k}+\beta_{k}], (37)

where $\bigoplus$ indicates concatenation and $f(x)_{k}$ is the output of task $k$ as in Eq. 24. Given the optimal parameters $(\alpha^{*},\beta^{*})$, we make the final prediction as

\hat{y} = \operatorname*{arg\,max}\bigoplus_{k}[\alpha_{k}^{*}f(x)_{k}+\beta_{k}^{*}]. (38)

If we use $OOD_{k}=\sigma(\alpha_{k}^{*}f(x)_{k}+\beta_{k}^{*})$, where $\sigma$ is the sigmoid, and $TP_{k}=OOD_{k}/\sum_{k^{\prime}}OOD_{k^{\prime}}$, the theoretical results in Section 3 hold.
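A minimal PyTorch sketch of this calibration step is shown below; it is our own illustration, and names such as `task_logits` (the concatenated per-task outputs of the buffer samples) and `labels` are assumptions rather than the released implementation. It learns one $(\alpha_k,\beta_k)$ pair per task by minimizing Eq. 36 over the memory buffer and predicts with Eq. 38.

import torch
import torch.nn.functional as F

def calibrate(task_logits, labels, num_tasks, classes_per_task, lr=0.01, iters=160):
    """task_logits: (N, num_tasks * classes_per_task) logits of the buffer samples,
    already concatenated over the task heads; labels: (N,) global class ids (long)."""
    alpha = torch.ones(num_tasks, requires_grad=True)    # per-task scale
    beta = torch.zeros(num_tasks, requires_grad=True)    # per-task shift
    opt = torch.optim.SGD([alpha, beta], lr=lr)
    expand = lambda v: v.repeat_interleave(classes_per_task)  # repeat alpha_k/beta_k over task k's classes
    for _ in range(iters):
        opt.zero_grad()
        logits = task_logits * expand(alpha) + expand(beta)   # argument of the softmax in Eq. 37
        loss = F.cross_entropy(logits, labels)                # Eq. 36
        loss.backward()
        opt.step()
    return alpha.detach(), beta.detach()

def predict(task_logits, alpha, beta, classes_per_task):
    expand = lambda v: v.repeat_interleave(classes_per_task)
    return (task_logits * expand(alpha) + expand(beta)).argmax(dim=-1)   # Eq. 38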

Appendix E TIL (WP) Results

The TIL (WP) results of all the systems are reported in Table 12. HAT and Sup show strong performance compared to the other baselines as they leverage task-specific parameters. However, as shown in Theorem 1, CIL also depends on TP (or OOD detection). Without an OOD detection mechanism, HAT and Sup perform poorly in CIL, as shown in the main paper. The contrastive learning in CSI also improves the IND prediction (i.e., WP), and this, along with OOD detection, results in strong CIL performance.

Table 12: The TIL Results of All the Systems.
Method C10-5T C100-10T C100-20T T-5T T-10T Avg.
OWM 85.0 ±0.07 59.6 ±0.83 65.4 ±0.48 22.4 ±0.87 28.1 ±0.55 52.1
MUC 95.1 ±0.10 77.3 ±0.83 73.4 ±9.16 55.9 ±0.26 47.2 ±0.22 69.8
PASS 83.8 ±0.68 72.1 ±0.70 76.8 ±0.32 49.9 ±0.56 46.5 ±0.39 65.8
LwF 95.2 ±0.30 86.2 ±1.00 89.0 ±0.45 56.4 ±0.48 55.3 ±0.35 76.4
iCaRL 94.9 ±0.34 84.2 ±1.04 85.7 ±0.68 54.5 ±0.29 52.7 ±0.37 74.4
A-GEM 82.5 ±4.19 58.9 ±2.14 56.4 ±7.03 32.1 ±0.90 30.1 ±0.29 52.0
EEIL 93.4 ±0.02 83.1 ±3.13 88.4 ±2.07 30.3 ±0.89 25.9 ±0.04 64.2
GD 94.4 ±0.09 82.2 ±0.18 85.7 ±0.20 30.7 ±1.79 32.2 ±0.37 65.0
Mnemonics†∗ 94.5 ±0.46 82.3 ±0.30 86.2 ±0.46 54.8 ±0.16 52.9 ±0.66 74.1
BiC 95.4 ±0.35 84.6 ±0.48 88.7 ±0.19 61.5 ±0.60 62.2 ±0.45 78.5
DER++ 92.0 ±0.54 84.0 ±9.43 86.6 ±9.44 57.4 ±1.31 60.0 ±0.74 76.0
HAL 82.8 ±1.94 49.5 ±1.51 61.1 ±1.43 13.2 ±0.77 21.2 ±0.41 26.2
HAT 96.7 ±0.18 84.0 ±0.23 85.0 ±0.98 61.2 ±0.72 63.8 ±0.41 78.1
HyperNet 94.6 ±0.37 76.8 ±1.22 83.5 ±0.98 23.9 ±0.60 28.0 ±0.69 61.4
Sup 96.6 ±0.21 87.9 ±0.27 91.6 ±0.15 64.3 ±0.24 68.4 ±0.22 81.8
HAT+CSI 98.7 ±0.06 92.0 ±0.37 94.3 ±0.06 68.4 ±0.16 72.4 ±0.21 85.2
Sup+CSI 98.7 ±0.07 93.0 ±0.13 95.3 ±0.20 65.9 ±0.25 74.1 ±0.28 85.4
The calibrated versions (+c) of our methods are omitted as calibration does not affect TIL performance. Exemplar-free methods are italicized. The last column Avg. shows the average TIL accuracy of each method over all datasets.

Appendix F Hyper-parameters

Here we report the hyper-parameters that we could not report in the main paper due to space limitations. We mainly report the hyper-parameters of the proposed methods, HAT+CSI, Sup+CSI, and their calibrated versions. For all the experiments with the proposed methods, we use the values chosen by the original CSI [64]. We use the LARS [74] optimizer with a learning rate of 0.1 for training the feature extractor. We increase the learning rate linearly by 0.1 per epoch for the first 10 epochs. After that, we use a cosine scheduler [45] without restarts, as in [64, 12]. After training the feature extractor, we train the linear classifier for 100 epochs with SGD with a learning rate of 0.1, multiplying the rate by 0.1 at epochs 60, 75, and 90. For all the experiments except MNIST, we train the feature extractor for 700 epochs with a batch size of 128.

For the following hyper-parameters, we use 10% of the training data as a validation set to find a good set of values. For the number of epochs and the batch size on MNIST, Sup+CSI trains for 1000 epochs with a batch size of 32 while HAT+CSI trains for 700 epochs with a batch size of 256. The hard attention regularization penalty $\lambda_i$ in HAT differs across experiments and tasks $i$. For MNIST, we use $\lambda_1=0.25$ and $\lambda_2=\cdots=\lambda_5=0.1$. For C10-5T, we use $\lambda_1=1.0$ and $\lambda_2=\cdots=\lambda_5=0.75$. For C100-10T, $\lambda_1=1.5$ and $\lambda_2=\cdots=\lambda_{10}=1.0$ are used. For C100-20T, $\lambda_1=3.5$ and $\lambda_2=\cdots=\lambda_{20}=2.5$ are used. For T-5T, $\lambda_i=0.75$ for all tasks, and lastly, for T-10T, $\lambda_1=1.0$ and $\lambda_2=\cdots=\lambda_{10}=0.75$ are used. We use a larger $\lambda_1$ for the first task than for the later tasks because we found that a larger regularization on the first task results in better accuracy. This follows from the definition of the regularization in HAT, which gives earlier tasks a lower penalty than later tasks, so we manually assign a larger penalty to the first task. We did not search the hyper-parameter $\lambda_t$ for tasks $t\geq 2$. For the sparsity in Sup+CSI, we simply choose the smallest sparsity value of 32 used in the original Sup paper, without a parameter search.
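For reference, the per-task penalties above can be collected into a simple configuration. The sketch below is ours, and the dictionary and function names (`HAT_LAMBDAS`, `lambda_for`) are hypothetical.

# Hard-attention regularization penalties for HAT+CSI:
# lambda_1 for the first task, lambda_t for all later tasks t >= 2.
HAT_LAMBDAS = {
    "MNIST":    {"first": 0.25, "rest": 0.1},
    "C10-5T":   {"first": 1.0,  "rest": 0.75},
    "C100-10T": {"first": 1.5,  "rest": 1.0},
    "C100-20T": {"first": 3.5,  "rest": 2.5},
    "T-5T":     {"first": 0.75, "rest": 0.75},
    "T-10T":    {"first": 1.0,  "rest": 0.75},
}

def lambda_for(dataset: str, task_id: int) -> float:
    """Return the HAT penalty for the given experiment and (1-indexed) task id."""
    cfg = HAT_LAMBDAS[dataset]
    return cfg["first"] if task_id == 1 else cfg["rest"]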

The calibration methods (HAT+CSI+c and Sup+CSI+c) are based on their memory-free versions (i.e., HAT+CSI and Sup+CSI). Therefore, the model training part uses the same hyper-parameters as their calibration-free counterparts. For calibration training, we use SGD with a learning rate of 0.01, 160 training iterations, and a batch size of 15 for HAT+CSI+c in all experiments. For Sup+CSI+c, we use the same values for all the experiments except MNIST, for which we use a learning rate of 0.05, a batch size of 8, and 280 iterations.

For the baselines, we use the hyper-parameters reported in the original papers or in their code. If the hyper-parameters are unknown or the code does not reproduce the reported results (e.g., the baseline did not implement a particular dataset or the code had undergone a significant version change), we search for the hyper-parameters as we did for HAT+CSI and Sup+CSI.

Appendix G Forgetting Rate

We discuss the forgetting rate (i.e., backward transfer) [44], which is defined for task $t$ as

\mathcal{F}^{t} = \frac{1}{t-1}\sum_{k=1}^{t-1}\left(\mathcal{A}_{k}^{\text{init}}-\mathcal{A}_{k}^{t}\right), (39)

where $\mathcal{A}_{k}^{\text{init}}$ is the classification accuracy on task $k$'s data after learning it for the first time and $\mathcal{A}_{k}^{t}$ is the accuracy on task $k$'s data after learning task $t$. We report the forgetting rate after learning the last task.
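A small sketch of how the forgetting rate can be computed from an accuracy matrix is given below (our own illustration): `acc[i, k]` holds the accuracy on task $k$'s test data after learning task $i$.

import numpy as np

def forgetting_rate(acc: np.ndarray, t: int) -> float:
    """acc[i, k]: accuracy on task k after learning task i (tasks 1..t map to rows/cols 0..t-1).
    Returns F^t of Eq. 39 for task t (t >= 2)."""
    init = np.array([acc[k, k] for k in range(t - 1)])   # A_k^init: accuracy right after learning task k
    final = acc[t - 1, : t - 1]                          # A_k^t: accuracy after learning task t
    return float(np.mean(init - final))

# Toy example with 3 tasks.
acc = np.array([[90.0,  0.0,  0.0],
                [85.0, 88.0,  0.0],
                [82.0, 84.0, 87.0]])
print(forgetting_rate(acc, t=3))   # ((90-82) + (88-84)) / 2 = 6.0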

Figure 4: Average forgetting rate (%). The lower the value, the less the method forgets.

Figure 4 shows the forgetting rate of each method. Some methods (e.g., OWM, iCaRL) experience less forgetting than the proposed methods HAT+CSI and Sup+CSI on M-5T. On this dataset, all the systems perform well. For instance, OWM and iCaRL achieve 95.8% and 96.0% accuracy while HAT+CSI and HAT+CSI+c achieve 94.4% and 96.9% accuracy. As we have noted in the main paper, Sup+CSI and Sup+CSI+c achieve only 80.7% and 81.0% on M-5T, although this is a drastic improvement over the 70.1% of the base method Sup.

OWM and HyperNet show lower forgetting rates than HAT+CSI+c and Sup+CSI+c on T-5T and T-10T. However, they are not able to adapt to new classes: OWM and HyperNet achieve a classification accuracy of only 10.0% and 7.9%, respectively, on T-5T and 8.6% and 5.3% on T-10T, whereas HAT+CSI+c and Sup+CSI+c achieve 51.7% and 49.2%, respectively, on T-5T and 47.6% and 46.2% on T-10T.

In fact, the performance reduction (i.e., forgetting) in our proposed methods occurs not because the systems forget the previous task knowledge, but because the systems learn more classes and the classification naturally becomes harder. The continual learning mechanisms (HAT and Sup) used in the proposed methods experience little or no forgetting because they find an independent subset of parameters for each task, and the learned parameters are not interfered with during training. For the forgetting rate results in the TIL setting, refer to our earlier workshop paper [30].

Appendix H Pseudo-Code

For task $k$, let $p(y|{\bm{x}},k)=\text{softmax}\,f(h({\bm{x}},k;\theta,{\bm{e}}^{k});\phi_{k})$, where $\theta$ denotes the parameters of the adapters, ${\bm{e}}^{k}$ is the trainable embedding for the hard attentions of task $k$, and $\phi_{k}$ is the set of parameters of the classification head of task $k$. Algorithm 1 and Algorithm 2 describe the training and testing processes, respectively. We add comments with the symbol “//”.

Algorithm 1 Training MORE
1: Input: memory buffer $\mathcal{M}$, learning rate $\lambda$, a sequence of tasks $\mathcal{D}=\{\mathcal{D}^{k}\}_{k=1}$, and parameters $\{\theta,{\bm{e}},\phi\}$, where ${\bm{e}}$ and $\phi$ are the collections of task embeddings ${\bm{e}}^{k}$ and task heads $\phi_{k}$
2: // CL starts
3: for each task data $\mathcal{D}^{k}\in\mathcal{D}$ do
4:     // Model training
5:     for each batch $({\bm{X}}_{i}^{k},{\bm{y}})$ in $\mathcal{D}^{k}$, until convergence do
6:         ${\bm{X}}_{s} = sample(\mathcal{M})$
7:         Compute the loss (Eq. 27 $+$ Eq. 16) and the gradients of the parameters
8:         Modify the gradient of the shared parameters, $\nabla\theta\leftarrow\nabla\theta^{\prime}$, using Eq. 14
9:         Update the parameters: $\theta\leftarrow\theta-\lambda\nabla\theta$, ${\bm{e}}^{k}\leftarrow{\bm{e}}^{k}-\lambda\,\partial\mathcal{L}/\partial{\bm{e}}^{k}$, $\phi_{k}\leftarrow\phi_{k}-\lambda\,\partial\mathcal{L}/\partial\phi_{k}$
10:     end for
11:     // Back-updating in Section 5.1.2
12:     Randomly select $\tilde{\mathcal{D}}\subset\mathcal{D}^{k}$, where $|\tilde{\mathcal{D}}|=|\mathcal{M}|$
13:     for each task $j$, until convergence do
14:         Minimize $\mathcal{L}(\phi_{j})$ of Eq. 29
15:     end for
16:     // Obtain statistics in Section 5.1.3
17:     Compute ${\bm{\mu}}^{k}_{j}$ using Eq. 31 and ${\bm{S}}^{k}$ using Eq. 32
18: end for
Algorithm 2 MORE Prediction
1: Input: test instance ${\bm{x}}$ and parameters $\{\theta,{\bm{e}},\phi\}$
2: for each task $k$ do
3:     Obtain $p(\mathcal{Y}^{k}|{\bm{x}},k)$
4:     Obtain $s^{k}({\bm{x}})$ using Eq. 30
5: end for
6: // Concatenate the outputs for the final prediction $y$ and the OOD score $s$
7: $y=\operatorname*{arg\,max}\bigoplus_{1\leq k\leq t}p(\mathcal{Y}^{k}|{\bm{x}},k)\,s^{k}({\bm{x}})$ (i.e., Eq. 33)
8: $s=\max\bigoplus_{1\leq k\leq t}p(\mathcal{Y}^{k}|{\bm{x}},k)\,s^{k}({\bm{x}})$
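For concreteness, a minimal Python sketch of the prediction step in Algorithm 2 follows. It is our own illustration and assumes hypothetical helpers `task_probs(x, k)` for $p(\mathcal{Y}^{k}|{\bm{x}},k)$ and `ood_score(x, k)` for $s^{k}({\bm{x}})$ of Eq. 30.

import numpy as np

def more_predict(x, num_tasks, task_probs, ood_score):
    """Combine per-task class probabilities with per-task OOD scores (Eq. 33)."""
    scored = []
    for k in range(num_tasks):
        p_k = task_probs(x, k)        # p(Y^k | x, k), a 1-D array over task k's classes
        s_k = ood_score(x, k)         # s^k(x), scalar OOD score from Eq. 30
        scored.append(p_k * s_k)
    joint = np.concatenate(scored)    # concatenation over the learned tasks
    y_hat = int(joint.argmax())       # final class prediction
    ood = float(joint.max())          # OOD score of the instance
    return y_hat, ood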

Appendix I Size of Memory Required

In this section, we report the memory size required by each method in Section 5.2. The sizes include network size, replay buffer, and all other parameters or examples kept in memory simultaneously for a model to be functional.

To calculate the total memory required for training and testing, we use an ‘entry’ to refer to a single parameter or a single element of a vector or matrix. The pre-trained backbone uses 21.6 million (M) entries (parameters). The adapter modules use 1.2M entries for CIFAR10 and 2.4M for the other datasets. Thus, the baselines and our method use 22.9M entries for the model on CIFAR10 and 24.1M on the other datasets. The technique specific to each method may add further entries for training and testing/inference.

The total memory required for each method without considering the replay memory buffer is reported in Table 13. Our method is competitive in memory consumption. Baselines such as OWM and A-GEM take considerably more memory than our system. iCaRL and DER++ take the least amount of memory, but the differences between our method and theirs are only 0.8M, 1.8M, 3.6M, 1.0M, and 1.8M for C10-5T, C100-10T, C100-20T, T-5T, and T-10T.

Table 13: Required Memory
Method C10-5T C100-10T C100-20T T-5T T-10T
OWM 26.6M 28.1M 28.1M 28.2M 28.2M
MUC 22.9M 24.1M 24.1M 24.1M 24.1M
PASS 22.9M 24.2M 24.2M 24.3M 24.4M
LwF 22.9M 24.1M 24.1M 24.1M 24.1M
iCaRL 22.9M 24.1M 24.1M 24.1M 24.1M
A-GEM 26.5M 31.4M 31.4M 31.5M 31.5M
EEIL 22.9M 24.1M 24.1M 24.1M 24.1M
GD 22.9M 24.1M 24.1M 24.1M 24.1M
BiC 22.9M 24.1M 24.1M 24.1M 24.1M
DER++ 22.9M 24.1M 24.1M 24.1M 24.1M
HAL 22.9M 24.1M 24.1M 24.1M 24.1M
HAT 23.0M 24.7M 25.4M 24.6M 25.1M
Sup 24.7M 33.7M 45.7M 27.7M 33.7M
MORE 23.7M 25.9M 27.7M 25.1M 25.9M
Total memory (in entries) required for each method without the replay memory buffer.

Many replay-based methods (e.g., iCaRL, HAL) need to save the previous network for distillation during training. This requires an additional 1.2M or 2.4M entries for CIFAR10 or other datasets. Our method does not save the previous model as we do not use distillation.

Note that a large memory consumption usually comes from the memory buffer, as the raw data is of size 32×32×3 for CIFAR and 64×64×3 for T-ImageNet. For a memory buffer of size 2000, a system needs 6.1M entries for CIFAR or 24.6M entries for T-ImageNet. Therefore, saving a smaller number of samples is important for reducing memory consumption. As we demonstrated in Table 6 and Table 7 in the main paper, our method performs better than the baselines even with a smaller memory buffer. In Table 6, we use large memory sizes (e.g., 200 for CIFAR10 and 2000 for the other datasets). In Table 7, we reduce the memory size by half. When we compare the accuracy of our method in Table 7 to that of the baselines in Table 6, our method still outperforms them on all datasets but C10-5T. Our method with a smaller memory buffer achieves average classification accuracies of 88.13, 71.69, 71.29, 64.17, and 61.90 on C10-5T, C100-10T, C100-20T, T-5T, and T-10T, whereas the best baselines achieve 88.98, 69.73, 70.03, 61.03, and 58.34 on the same experiments with a larger memory buffer.
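The buffer sizes quoted above follow directly from the raw image dimensions; a short sketch of the arithmetic (ours) is:

def buffer_entries(num_samples: int, height: int, width: int, channels: int = 3) -> float:
    """Memory-buffer size in millions of entries when raw images are stored."""
    return num_samples * height * width * channels / 1e6

print(buffer_entries(2000, 32, 32))   # ~6.1M entries for CIFAR
print(buffer_entries(2000, 64, 64))   # ~24.6M entries for Tiny-ImageNet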

Appendix J OOD Detection on Different Datasets

Figure 5: AUC of the continually trained models after the final task: (a) we use all 10 classes learned from the 5 tasks of CIFAR-10 as IND and consider LSUN, CIFAR-100, and Tiny-ImageNet as OOD; (b) we use all 200 classes learned from the 10 tasks of Tiny-ImageNet as IND and consider LSUN, CIFAR-10, and CIFAR-100 as OOD.
(a) CIFAR-10 as IND
OOD dataset MUC Sup HAT MORE
LSUN 92.60 93.60 92.88 95.61
CIFAR-100 84.23 86.64 87.88 89.56
Tiny-ImageNet 93.43 94.78 91.90 95.45
(b) Tiny-ImageNet as IND
OOD dataset MUC Sup HAT MORE
LSUN 82.33 80.26 76.82 82.96
CIFAR-10 75.52 73.86 78.98 79.76
CIFAR-100 78.58 75.98 77.62 80.21

In Section 5.2.4 of the main text, we used the classes from the unseen tasks of the same dataset used in continual learning as OOD. Here, we also use classes drawn from datasets that are completely different from those used in the continual learning process. We use the continual learning models after the final task of C10-5T (with the smallest number of classes, 10) and T-ImageNet-10T (with the largest number of classes, 200) to detect novel samples from completely different datasets. We compare the novelty/OOD detection performance with the three best-performing baseline methods, MUC, HAT, and Sup (SupSup). Figure 5 shows that our method, MORE, outperforms the baselines on these OOD classes drawn from completely different datasets.

References

  • Abati et al. [2020] Abati, D., Tomczak, J., Blankevoort, T., Calderara, S., Cucchiara, R., Bejnordi, E., 2020. Conditional channel gated networks for task-aware continual learning, in: CVPR, pp. 3931–3940.
  • Aljundi et al. [2019] Aljundi, R., Belilovsky, E., Tuytelaars, T., Charlin, L., Caccia, M., Lin, M., Caccia, L., 2019. Online continual learning with maximal interfered retrieval, in: NeurIPS.
  • Ayub and Wagner [2021] Ayub, A., Wagner, A., 2021. EEC: Learning to encode and regenerate images for continual learning, in: International Conference on Learning Representations. URL: https://openreview.net/forum?id=lWaz5a9lcFU.
  • Bang et al. [2021] Bang, J., Kim, H., Yoo, Y., Ha, J.W., Choi, J., 2021. Rainbow memory: Continual learning with a memory of diverse samples, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8218–8227.
  • Bendale and Boult [2015] Bendale, A., Boult, T., 2015. Towards open world recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1893–1902.
  • Buzzega et al. [2020] Buzzega, P., Boschini, M., Porrello, A., Abati, D., CALDERARA, S., 2020. Dark experience for general continual learning: a strong, simple baseline, in: NeurIPS.
  • Castro et al. [2018] Castro, F.M., Marín-Jiménez, M.J., Guil, N., Schmid, C., Alahari, K., 2018. End-to-end incremental learning, in: Proceedings of the European conference on computer vision (ECCV).
  • Cha et al. [2021] Cha, H., Lee, J., Shin, J., 2021. Co2l: Contrastive continual learning, in: ICCV.
  • Chaudhry et al. [2021] Chaudhry, A., Gordo, A., Dokania, P., Torr, P., Lopez-Paz, D., 2021. Using hindsight to anchor past knowledge in continual learning. Proceedings of the AAAI Conference on Artificial Intelligence 35, 6993–7001. URL: https://ojs.aaai.org/index.php/AAAI/article/view/16861.
  • Chaudhry et al. [2018] Chaudhry, A., Ranzato, M., Rohrbach, M., Elhoseiny, M., 2018. Efficient lifelong learning with a-gem. arXiv:1812.00420 .
  • Chaudhry et al. [2019] Chaudhry, A., Rohrbach, M., Elhoseiny, M., Ajanthan, T., Dokania, P., Torr, P., Ranzato, M., 2019. Continual learning with tiny episodic memories, in: Workshop on Multi-Task and Lifelong Reinforcement Learning.
  • Chen et al. [2020] Chen, T., Kornblith, S., Norouzi, M., Hinton, G., 2020. A simple framework for contrastive learning of visual representations, in: ICML.
  • Chen and Liu [2018] Chen, Z., Liu, B., 2018. Lifelong machine learning. Morgan & Claypool Publishers.
  • Fei et al. [2016] Fei, G., Wang, S., Liu, B., 2016. Learning cumulatively to become more knowledgeable, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1565–1574.
  • Goodfellow et al. [2015] Goodfellow, I.J., Shlens, J., Szegedy, C., 2015. Explaining and harnessing adversarial examples. ICLR .
  • Gummadi et al. [2022] Gummadi, M., Kent, D., Mendez, J.A., Eaton, E., 2022. Shels: Exclusive feature sets for novelty detection and continual learning without class boundaries, in: Conference on Lifelong Learning Agents, PMLR. pp. 1065–1085.
  • Han et al. [2019] Han, K., Vedaldi, A., Zisserman, A., 2019. Learning to discover novel visual categories via deep transfer clustering, in: International Conference on Computer Vision (ICCV).
  • He and Zhu [2022] He, J., Zhu, F., 2022. Out-of-distribution detection in unsupervised continual learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3850–3855.
  • He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: CVPR.
  • Hendrycks and Gimpel [2016] Hendrycks, D., Gimpel, K., 2016. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136 .
  • Hendrycks et al. [2019] Hendrycks, D., Mazeika, M., Kadavath, S., Song, D., 2019. Using self-supervised learning can improve model robustness and uncertainty, in: NeurIPS, pp. 15663–15674.
  • Henning et al. [2021] Henning, C., Cervera, M., D’Angelo, F., Von Oswald, J., Traber, R., Ehret, B., Kobayashi, S., Grewe, B.F., Sacramento, J., 2021. Posterior meta-replay for continual learning. Advances in Neural Information Processing Systems 34.
  • Houlsby et al. [2019] Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., Gelly, S., 2019. Parameter-efficient transfer learning for nlp, in: International Conference on Machine Learning, PMLR. pp. 2790–2799.
  • Hu et al. [2021] Hu, W., Qin, Q., Wang, M., Ma, J., Liu, B., 2021. Continual learning by using information of each class holistically, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 7797–7805.
  • Jafarzadeh et al. [2021] Jafarzadeh, M., Dhamija, A.R., Cruz, S., Li, C., Ahmad, T., Boult, T.E., 2021. A review of open-world learning and steps toward open-world learning without labels. arXiv e-prints , arXiv–2011.
  • Karakida and Akaho [2022] Karakida, R., Akaho, S., 2022. Learning curves for continual learning in neural networks: Self-knowledge transfer and forgetting, in: International Conference on Learning Representations.
  • Ke and Liu [2022] Ke, Z., Liu, B., 2022. Continual learning of natural language processing tasks: A survey. arXiv preprint arXiv:2211.12701 .
  • Ke et al. [2021] Ke, Z., Liu, B., Ma, N., Xu, H., Shu, L., 2021. Achieving forgetting prevention and knowledge transfer in continual learning. Advances in Neural Information Processing Systems 34.
  • Khosla et al. [2020] Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., Krishnan, D., 2020. Supervised contrastive learning. arXiv preprint arXiv:2004.11362 .
  • Kim et al. [2022] Kim, G., Esmaeilpour, S., Xiao, C., Liu, B., 2022. Continual learning based on ood detection and task masking, in: CVPR 2022 Workshop on Continual Learning.
  • Kirkpatrick et al. [2017] Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A.A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Others, 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114, 3521–3526.
  • Krizhevsky and Hinton [2009] Krizhevsky, A., Hinton, G., 2009. Learning multiple layers of features from tiny images. Technical Report TR-2009, University of Toronto, Toronto.
  • Langley [2020] Langley, P., 2020. Open-world learning for radically autonomous agents, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 13539–13543.
  • Le and Yang [2015] Le, Y., Yang, X., 2015. Tiny imagenet visual recognition challenge.
  • Lee et al. [2018] Lee, K., Lee, K., Lee, H., Shin, J., 2018. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in neural information processing systems 31.
  • Lee et al. [2019] Lee, K., Lee, K., Shin, J., Lee, H., 2019. Overcoming catastrophic forgetting with unlabeled data in the wild, in: CVPR.
  • Li and Hoiem [2016] Li, Z., Hoiem, D., 2016. Learning Without Forgetting, in: ECCV, Springer. pp. 614–629.
  • Liang et al. [2018] Liang, S., Li, Y., Srikant, R., 2018. Enhancing the reliability of out-of-distribution image detection in neural networks, in: ICLR. URL: https://openreview.net/forum?id=H1VGkIxRZ.
  • Liu [2020] Liu, B., 2020. Learning on the job: Online lifelong and continual learning., in: AAAI.
  • Liu et al. [2023] Liu, B., Mazumder, S., Robertson, E., Grigsby, S., 2023. Ai autonomy: Self-initiated open-world continual learning and adaptation. AI Magazine 44, 185–199.
  • Liu et al. [2022] Liu, B., Robertson, E., Grigsby, S., Mazumder, S., 2022. Self-initiated open world learning for autonomous ai agents, in: Proceedings of AAAI Symposium on ‘Designing Artificial Intelligence for Open Worlds’.
  • Liu et al. [2020a] Liu, Y., Parisot, S., Slabaugh, G., Jia, X., Leonardis, A., Tuytelaars, T., 2020a. More classifiers, less forgetting: A generic multi-classifier paradigm for incremental learning, in: ECCV. Springer International Publishing, pp. 699–716.
  • Liu et al. [2020b] Liu, Y., Su, Y., Liu, A.A., Schiele, B., Sun, Q., 2020b. Mnemonics training: Multi-class incremental learning without forgetting, in: CVPR.
  • Lopez-Paz and Ranzato [2017] Lopez-Paz, D., Ranzato, M., 2017. Gradient Episodic Memory for Continual Learning, in: NeurIPS, pp. 6470–6479.
  • Loshchilov and Hutter [2016] Loshchilov, I., Hutter, F., 2016. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 .
  • Loyall et al. [2022] Loyall, B., Pfeffer, A., Niehaus, J., Mayer, T., Rizzo, P., Gee, A., Cvijic, S., Manning, W., Skitka, M.K., Becker, M., et al., 2022. An integrated architecture for online adaptation to novelty in open worlds using probabilistic programming and novelty-aware planning, in: Proceedings of the AAAI Spring Symposium on Designing AI for Open-World Novelty.
  • McCloskey and Cohen [1989] McCloskey, M., Cohen, N.J., 1989. Catastrophic interference in connectionist networks: The sequential learning problem, in: Psychology of learning and motivation. Elsevier. volume 24, pp. 109–165.
  • Ostapenko et al. [2019] Ostapenko, O., Puscas, M., Klein, T., Jahnichen, P., Nabi, M., 2019. Learning to remember: A synaptic plasticity driven framework for continual learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11321–11329.
  • von Oswald et al. [2020] von Oswald, J., Henning, C., Sacramento, J., Grewe, B.F., 2020. Continual learning with hypernetworks. ICLR .
  • Palash and Bhargava [2022] Palash, M., Bhargava, B., 2022. Continuous learning based novelty aware emotion recognition system. AAAI 2022 Spring Symposium on Designing Artificial Intelligence for Open Worlds .
  • Pang et al. [2021] Pang, G., Shen, C., Cao, L., Hengel, A.V.D., 2021. Deep learning for anomaly detection: A review. ACM Computing Surveys (CSUR) 54, 1–38.
  • Parmar et al. [2021] Parmar, J., Chouhan, S.S., Raychoudhury, V., Rathore, S.S., 2021. Open-world machine learning: Applications, challenges, and opportunities. arXiv preprint arXiv:2105.13448 .
  • Pentina and Lampert [2014] Pentina, A., Lampert, C., 2014. A pac-bayesian bound for lifelong learning, in: International Conference on Machine Learning, PMLR. pp. 991–999.
  • Rajasegaran et al. [2020] Rajasegaran, J., Khan, S., Hayat, M., Khan, F.S., Shah, M., 2020. itaml: An incremental task-agnostic meta-learning approach, in: CVPR.
  • Ramanujan et al. [2020] Ramanujan, V., Wortsman, M., Kembhavi, A., Farhadi, A., Rastegari, M., 2020. What’s hidden in a randomly weighted neural network?, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11893–11902.
  • Rebuffi et al. [2017] Rebuffi, S.A., Kolesnikov, A., Lampert, C.H., 2017. iCaRL: Incremental classifier and representation learning, in: CVPR, pp. 5533–5542.
  • Rios et al. [2022] Rios, A., Ahuja, N., Ndiour, I., Genc, U., Itti, L., Tickoo, O., 2022. incdfm: Incremental deep feature modeling for continual novelty detection, in: European Conference on Computer Vision, Springer. pp. 588–604.
  • Russakovsky et al. [2015] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al., 2015. Imagenet large scale visual recognition challenge. International journal of computer vision 115, 211–252.
  • Scheirer et al. [2014] Scheirer, W.J., Jain, L.P., Boult, T.E., 2014. Probability models for open set recognition. IEEE transactions on pattern analysis and machine intelligence 36, 2317–2324.
  • Schölkopf et al. [1999] Schölkopf, B., Williamson, R.C., Smola, A.J., Shawe-Taylor, J., Platt, J.C., et al., 1999. Support vector method for novelty detection., in: NIPS, Citeseer. pp. 582–588.
  • Serra et al. [2018] Serra, J., Suris, D., Miron, M., Karatzoglou, A., 2018. Overcoming catastrophic forgetting with hard attention to the task, in: International Conference on Machine Learning, PMLR. pp. 4548–4557.
  • Shin et al. [2017] Shin, H., Lee, J.K., Kim, J., Kim, J., 2017. Continual learning with deep generative replay, in: NIPS, pp. 2994–3003.
  • Szegedy et al. [2015] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper with convolutions, in: CVPR. URL: http://arxiv.org/abs/1409.4842.
  • Tack et al. [2020] Tack, J., Mo, S., Jeong, J., Shin, J., 2020. Csi: Novelty detection via contrastive learning on distributionally shifted instances, in: NeurIPS.
  • Thai et al. [2022] Thai, T., Shen, M., Varshney, N., Gopalakrishnan, S., Soni, U., Baral, C., Scheutz, M., Sinapov, J., 2022. An architecture for novelty handling in a multi-agent stochastic environment: Case study in open-world monopoly, in: Designing Artificial Intelligence for Open Worlds: Papers from the 2022 Spring Symposium, Virtual. AAAI Press.
  • Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H., 2021. Training data-efficient image transformers & distillation through attention, in: International Conference on Machine Learning, PMLR. pp. 10347–10357.
  • Van De Ven et al. [2021] Van De Ven, G.M., Li, Z., Tolias, A.S., 2021. Class-incremental learning with generative classifiers, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3611–3620.
  • van de Ven and Tolias [2019] van de Ven, G.M., Tolias, A.S., 2019. Three scenarios for continual learning. arXiv preprint arXiv:1904.07734 .
  • Wortsman et al. [2020] Wortsman, M., R., V., Liu, R., Kembhavi, A., Rastegari, M., Yosinski, J., Farhadi, A., 2020. Supermasks in superposition, in: NeurIPS.
  • Wu et al. [2019] Wu, Y., Chen, Y., Wang, L., Ye, Y., Liu, Z., Guo, Y., Fu, Y., 2019. Large scale incremental learning, in: CVPR.
  • Xu et al. [2019] Xu, H., Liu, B., Shu, L., Yu, P., 2019. Open-world learning and application to product classification, in: The World Wide Web Conference, pp. 3413–3419.
  • Yan et al. [2021] Yan, S., Xie, J., He, X., 2021. Der: Dynamically expandable representation for class incremental learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3014–3023.
  • Yang et al. [2021] Yang, J., Zhou, K., Li, Y., Liu, Z., 2021. Generalized out-of-distribution detection: A survey. arXiv preprint arXiv:2110.11334 .
  • You et al. [2017] You, Y., Gitman, I., Ginsburg, B., 2017. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888 .
  • Zeng et al. [2019] Zeng, G., Chen, Y., Cui, B., Yu, S., 2019. Continuous learning of context-dependent processing in neural networks. Nature Machine Intelligence .
  • Zenke et al. [2017] Zenke, F., Poole, B., Ganguli, S., 2017. Continual learning through synaptic intelligence, in: ICML, pp. 3987–3995.
  • Zhu et al. [2021] Zhu, F., Zhang, X.Y., Wang, C., Yin, F., Liu, C.L., 2021. Prototype augmentation and self-supervision for incremental learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5871–5880.