
Continual Learning via Inter-Task Synaptic Mapping

Mao Fubing∗∗, Weng Weiwei∗∗, Mahardhika Pratama*∗∗, Edward Yapp Kien Yee
National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; School of Computer Science and Engineering, Nanyang Technological University, Singapore; Singapore Institute of Manufacturing Technology, A*Star, Singapore
Abstract

Learning from streaming tasks leads a model to catastrophically erase the unique experiences it absorbed from previous episodes. While regularization techniques such as LWF, SI and EWC have proven to be effective avenues to overcome this issue by constraining important parameters of old tasks from changing when accepting new concepts, these approaches do not exploit the common information of each task that can be shared with existing neurons. As a result, they do not scale well to large-scale problems since the parameter importance variables quickly explode. An Inter-Task Synaptic Mapping (ISYANA) is proposed here to underpin knowledge retention for continual learning. ISYANA combines the task-to-neuron relationship as well as the concept-to-concept relationship such that a neuron is prevented from embracing distinct concepts while it accepts only relevant concepts. A numerical study on benchmark continual learning problems has been carried out, followed by comparison against prominent continual learning algorithms. ISYANA exhibits competitive performance compared to the state of the art. The code of ISYANA is made available at https://github.com/ContinualAL/ISYANAKBS.

keywords:
continual learning, lifelong learning, catastrophic forgetting
* Corresponding Author. ∗∗ Equal Contribution.

1 Introduction

Continual learning aims to emulate the underlying trait of natural learning to learn various tasks on the fly without losing competence for what has been achieved in the past [1, 2, 3]. Ideally, already acquired skills should ease the learning process of different-but-related tasks in the future. This problem appears as an extension of data stream learning [4] where a learner must not only demonstrate adaptive and evolving aptitudes to handle non-stationary environments but also possess the knowledge retention property such that it gains more intelligence as the number of learned tasks increases [5].

The underlying challenge of continual learning lies in the catastrophic forgetting problem where learning a new task catastrophically replaces old knowledge with new knowledge, thereby losing relevance for old tasks. The regularization approach is among several well-known methods to resolve the catastrophic forgetting problem of continual learning [5, 6]. The key idea is to prevent important parameters from being perturbed by learning a new task. Elastic Weight Consolidation (EWC) [6] is a pioneering work in this area whose unique trait is seen in the application of the Fisher information matrix to construct the parameter importance matrix while integrating an L2-norm-like regularization approach [5, 6]. The synaptic intelligence (SI) method is proposed to address the expensive computation of the parameter importance matrix via the Fisher information matrix [7]. The accumulated loss of every training sample is used to inform the significance of each synapse. This work has been extended in [8], called onlineEWC, using the Laplace approximation. The learning without forgetting (LWF) approach is slightly different in that it uses a joint optimization procedure between the cross-entropy loss of the current task and the knowledge distillation loss [9]. It was later found that the knowledge distillation loss can be replaced by the cross-entropy loss [9]. Memory Aware Synapses (MAS) is a regularization-based approach where the importance of network parameters is calculated in an unsupervised and online manner. Another prominent work is proposed in [10] where the regularization mechanism is performed at the neuron level instead of the synaptic level by measuring neuron importance and governing learning rates.

The main limitation of these approaches lies in the absence of an inter-task relationship, whereby a new task might share some commonalities with previous tasks. That is, a neuron might still accept new information or even contribute to understanding the current task. Since a new task is embraced by those neurons which do not contribute much to the previous tasks, one should guarantee enough network capacity to handle all tasks; otherwise new synapses have to be grown. Another challenge exists in scaling the parameter importance matrix, which can quickly explode in large-scale problems, thus causing the unlearning effect since it has to be accumulated across all tasks [11], unless an independent parameter importance matrix is created from scratch for every task.

An Inter-Task Synaptic Mapping (ISYANA) is proposed here to address the catastrophic forgetting problem of continual learning. ISYANA is built upon the task-to-synapses and task-to-task mappers. The task-to-synapses module provides a relational mapping between each hidden node and all tasks, determining the acceptable type of information it can absorb. The task-to-task component exhibits the inter-task relationship, thus revealing the complete problem structure and the commonalities of each task. This module is capable of providing a sort of neighborhood degree of a sample to all tasks which can be linked to the task-to-synapses module. The relationship of a sample to previous tasks is assessed by checking its proximity to previous tasks. A node is frozen by reducing its learning rate provided that it has low relevance to the current task and the current task has low mutual information with the old tasks. In a nutshell, the parameter importance matrix is crafted from the combination of the task-to-task and task-to-synapses relationships. Catastrophic forgetting is handled at the neuron level due to the fact that a connective weight plays little role in a learning problem unless it is combined with other weights to construct a neuron, i.e., the hierarchical structure of the network.

The concept-to-concept or task-to-task module is devised by a cluster-guided mechanism in the deep latent space. It puts forward a meta-network crafted from an extension of the deep clustering network [12, 13]. That is, a class-specific cluster is designed in the deep latent space. Note that the clustering approach is carried out fully in a per-class manner such that learning a new task does not disturb the representation of old tasks. The concept-to-concept module applies growing and consolidation phases. The growing phase constructs a set of clusters describing the current task based on the class label of each task. The consolidation phase is undertaken after the growing phase by calculating the mean values of the clusters to bound the network complexity, where the mean values of the clusters exhibit a complete summary of already seen data points. In other words, there is no requirement to store the representation layer of each task, which would not be sustainable for large-scale problems. The inter-task similarity is calculated via the KL divergence approach among all clusters between tasks.

The task-to-synapses module complements the task-to-task module where it functions as a one-to-one mapping between a neuron and a target class. It measures the mutual information of a node with each task, making it possible for a neuron to still accept the current concept notwithstanding that it is highly relevant to previously seen tasks. In short, a node can be shared by different but related tasks. All of this can be done in an online fashion without being prone to the issue of an exploding parameter importance matrix, thereby improving its feasibility for a high number of tasks. The advantage of ISYANA has been numerically validated in benchmark continual learning problems. It is compared with prominent continual learning algorithms where it demonstrates competitive performance compared to recently published algorithms. The code of ISYANA is made available at https://github.com/ContinualAL/ISYANAKBS.

2 Related Works

ISYANA is derived from recent works on the regularization principle for continual learning where the important parameters of old tasks are precluded from accepting the new concept to overcome the catastrophic forgetting problem. In a nutshell, the regularization-based approach works at the synaptic level, formulated as a joint optimization problem as follows:

$L=L(Y,\hat{Y})+\frac{\alpha}{2}Z(\theta-\theta^{*})^{2}$ (1)

where $\alpha$ is a regularization constant while $\theta, \theta^{*}$ are the existing network parameters and the previous optimal parameters trained at the $(T-1)$-th task. That is, all network parameters are embedded in $\theta$. The key component lies in the parameter importance matrix $Z$ measuring the contribution of each parameter to the previous task. $Z$ is computed from the Fisher information matrix in EWC, while synaptic intelligence (SI) utilizes the combination of network gradient and parameter movement to deduce the parameter significance. Nonetheless, the main bottleneck of this approach is its application to large-scale problems since the parameter importance matrix $Z_{t}$ has to be calculated for each task and combined with the previous task as $Z_{t-1}+Z_{t}$ when being applied to the $(t+1)$-th task [6, 7]. This issue causes the explosion of $Z$ in the long run, thereby leading to the unlearning effect of old parameters and hindering the positive forward and backward transfer mechanism. It is almost impossible to resolve without the introduction of new parameters to handle a new task. A practical solution is the normalization technique [5, 6]. Another solution is to purposely insert a forgetting factor, $\gamma_{t}Z_{t-1}+Z_{t}$ [5, 6].
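For concreteness, the sketch below expresses the generic penalty of (1) in NumPy; the function name regularized_loss and the flattened-parameter representation are illustrative assumptions rather than the implementation of any specific method. EWC, onlineEWC and SI differ only in how $Z$ is obtained, not in the form of this penalty.

```python
import numpy as np

def regularized_loss(task_loss, theta, theta_star, Z, alpha=1.0):
    """Quadratic penalty of Eq. (1): the loss on the current task is augmented
    with a term that anchors each parameter to its previous optimum theta_star,
    weighted by the parameter importance matrix Z (all arrays of equal shape)."""
    penalty = 0.5 * alpha * np.sum(Z * (theta - theta_star) ** 2)
    return task_loss + penalty
```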

Our approach, ISYANA, offers an alternative solution where the inter-task relationship is taken into account while considering the relevance between a synapse and a task. The goal is to allow a node to learn multiple relevant tasks, thereby underpinning the positive forward and backward transfer. That is, the two mappings reveal the common features of a task which can be shared across similar tasks while retaining the private information of a particular task. Furthermore, the inter-task mapping also eliminates the accumulation of the parameter importance mapping, thus coping with the unlearning effect in the absence of a controlled forgetting mechanism [5, 6].

3 ISYANA: Inter-Task Synaptic Mapping

3.1 Problem Formulation

Formal Definition: Continual learning aims to build a predictive model which can handle the never-ending arrival of tasks. Consider $T_{t}=(X_{t},Y_{t})\in\Re^{N}$ as the $t$-th task consisting of pairs of data samples $X_{n}\in\Re^{u}, Y_{n}\in\Re^{m}$ as input and target data points, where $u, N, m$ are the dimension of the input space, the size of the data, and the number of classes seen thus far, respectively. The number of tasks, $T$, is unknown in practice. The typical characteristic of continual learning calls for a learner with light computational and space complexities which must not be a factor of the number of tasks and data samples. That is, a task cannot be revisited again in the future once learned.

The continual learning problem suffers from non-stationary environments where a task is not drawn from a static concept. That is, there exists the issue of drift where the concept of the $t$-th task drifts toward a new distribution in the next task, $P(Y|X)_{t}\neq P(Y|X)_{t+1}$ [14]. Another common issue lies in the incremental class problem where a new task $T_{t}$ is presented with a set of new classes $m^{\prime}$ while being completely isolated from the old classes $m$ previously seen in the old tasks. In other words, $m^{\prime}$ does not appear together with the previous classes. This case leads to the catastrophic forgetting problem in which a model loses its aptitude in dealing with previously seen tasks.

The typical characteristic of the continual learning problem distinguishing it from other learning problems is seen in the knowledge retention requirement. That is, a continual learner must be prepared to be queried by any samples following the concepts or classes of old tasks, meaning that learning a new task must not catastrophically erase its past knowledge base [5, 6]. The key challenge is to find tradeoff points addressing all tasks without putting them all in the same basket. A regularization approach is chosen here to cope with the catastrophic forgetting problem since it incurs low complexity, being independent of the problem size - no old samples or representation layers of each task have to be stored.

Definition of Network Structure: ISYANA is integrated in the context of a multi-layer perceptron (MLP) network where $n_{l}, L$ stand for the number of hidden nodes of the $l$-th layer and the total number of layers, respectively. The $l$-th layer is defined as $h_{l}=s(W_{in}^{l}s_{i,(l-1)}+b_{l})$ where each node is assigned the sigmoid function. $W_{in}^{l}\in\Re^{n_{l}\times n_{l-1}}, b_{l}\in\Re^{n_{l}}$ are the connective weights and bias. The softmax function $\xi_{j}=\frac{\exp(o_{j})}{\sum_{j=1}^{m}\exp(o_{j})}$, $\xi\in\Re^{m}$, is applied at the last layer $L$ where $o=W_{out}h_{L}+c$. $W_{out}\in\Re^{m\times n_{L}}, c\in\Re^{m}$ denote the output weight and the output bias, respectively.
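A minimal NumPy sketch of this forward pass is given below; it mirrors the sigmoid hidden layers and the softmax output described above, and the function and argument names are ours for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(o):
    e = np.exp(o - o.max())      # subtract the max for numerical stability
    return e / e.sum()

def forward(x, hidden_weights, hidden_biases, W_out, c):
    """Forward pass of the MLP described above: L sigmoid hidden layers
    h_l = s(W_l h_{l-1} + b_l) followed by a softmax output layer."""
    h = x
    for W_l, b_l in zip(hidden_weights, hidden_biases):
        h = sigmoid(W_l @ h + b_l)
    o = W_out @ h + c
    return softmax(o), h          # class probabilities and last hidden layer
```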

3.2 Algorithm

ISYANA offers an efficient alternative for the construction of the parameter importance matrix $Z$ where the importance of a neuron is determined from its mutual information with the current task as well as with the old tasks, thereby supporting the positive forward and backward transfer provided it has high mutual information. ISYANA is underpinned by the task-to-task mapping and the synaptic-to-task mapping in deriving the parameter importance matrix. The importance of the $j$-th node of the $l$-th layer with respect to a current task is formalized as the linear combination of its task-to-task and synaptic-to-task relationships as follows:

$\phi_{j,l}^{t}=TT_{t,1}\cdot ST_{j,l}^{1}+TT_{t,2}\cdot ST_{j,l}^{2}+\dots+TT_{t,m}\cdot ST_{j,l}^{m}$ (2)

where $TT_{t,o}\in[0,1]$ represents the task-to-task mapping of the current $t$-th task to the $o$-th class while $ST_{j,l}^{o}\in[0,1]$ defines the synaptic-to-task mapping of the $j$-th node of the $l$-th layer to the $o$-th class. $TT_{t,m_{t}}=1$ for any classes $m_{t}$ associated with the current $t$-th task. That is, the relationship of the current task to itself is maximum. In our implementation, the $t$-th task is formulated by its target classes following the normal distribution. (2) is bounded in $[0,m]$, making scaling convenient to undertake. The parameter importance matrix $Z_{t}$ for the $t$-th task is defined as follows:

$Z_{t}=\begin{bmatrix}\exp(-\phi_{1,1}^{t}) & \dots & \exp(-\phi_{n_{1},1}^{t})\\ \exp(-\phi_{1,2}^{t}) & \dots & \exp(-\phi_{n_{2},2}^{t})\\ \vdots & & \vdots\\ \exp(-\phi_{1,L}^{t}) & \dots & \exp(-\phi_{n_{L},L}^{t})\end{bmatrix}$ (3)

The parameter importance matrix $Z$ can be conveniently integrated into the learning rate with respect to the loss gradient. In other words, catastrophic forgetting is controlled at the neuron level rather than at the synaptic level as per (1) because of the hierarchical nature of the deep neural network [10]. Instead of a fixed learning rate, a node-varying learning rate is introduced as follows:

$\eta=a\cdot\exp(-b\cdot Z+c)$ (4)

where $a$, $b$ and $c$ are constant parameters that adjust the learning rate. The node-varying learning rate is determined by the parameter importance matrix $Z$ accordingly. The exponential term is incorporated into (3) since the regularization magnitude should be inversely proportional to the importance of a hidden node. The higher the importance of a node, the lower the loss gradient that is induced. A node accepts the current concept and enables positive forward and backward transfer if it receives relevant information. That is, its importance to the current task increases if the current task shares common information with classes to which the node is relevant. On the other hand, its importance diminishes when the node is associated with a unique task. This strategy reflects the random initialization property of deep neural networks, making it possible for a node to converge to particular classes. This concept follows the neuron-level plasticity control in [10], but our approach takes into account not only the parameter importance but also the inter-task relationship.
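The following sketch combines (2)-(4) for a single node. The names are illustrative, and the defaults a=1, b=1, c=-0.1 correspond to the setting $\eta=1\cdot\exp(-Z-0.1)$ reported in Tables 6 and 7.

```python
import numpy as np

def node_importance(TT_t, ST_jl):
    """Eq. (2): importance of the (j,l)-th node with respect to the current
    task t, i.e. the inner product of the task-to-task row TT_t (length m)
    with the node's synaptic-to-task vector ST_jl (length m)."""
    return float(np.dot(TT_t, ST_jl))

def node_learning_rate(phi, a=1.0, b=1.0, c=-0.1):
    """Eqs. (3)-(4): Z = exp(-phi) and eta = a * exp(-b * Z + c).
    A large phi (node relevant to the current task) drives Z towards 0 and
    yields a larger learning rate, while a small phi drives Z towards 1 and
    yields a smaller rate, effectively freezing the node."""
    Z = np.exp(-phi)
    return a * np.exp(-b * Z + c)
```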

Fig. 1 shows the overall learning policy of our approach, ISYANA, consisting of a training phase and a testing phase. There is a set of tasks $T=\{T_{1},T_{2},\dots,T_{k}\}$. Each task contains several classes, and each task trains the network in sequence. When the training process of all the tasks is finished, we feed the testing data of each task to test the final model and evaluate it according to three criteria: the average accuracy, the backward transfer and the forward transfer.

The training phase of Fig. 1 comprises the following steps. Firstly, each task is divided into mini batches used to train the MLP network. Secondly, we calculate the importance of each node to each class (current output class). This part belongs to the synaptic-to-task mapping described in Section 3. Thirdly, we adopt a stacked autoencoder to map each batch to the latent space. We calculate the center of each target class under the current batch in the latent space. Then, we obtain the relation between the classes belonging to the current batch and the classes belonging to other tasks. This component is attributed to the task-to-task mapping introduced in Section 3. Fourthly, we acquire the parameter importance matrix $Z_{t}$ mentioned above, which is incorporated via (4) into the learning rate of the stochastic gradient descent (SGD) method.
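As an illustration of how $Z_{t}$ enters the SGD update, the sketch below applies one node-varying learning rate per hidden node; the helper name sgd_step_per_node is hypothetical and the released code may organise this step differently.

```python
def sgd_step_per_node(W_l, b_l, grad_W, grad_b, eta_l):
    """One SGD step for layer l in which every hidden node j owns its own
    learning rate eta_l[j] from Eq. (4).  All arguments are NumPy arrays;
    rows of W_l (and entries of b_l) belonging to nodes that are important
    only for unrelated old tasks receive small rates and are barely perturbed."""
    W_l = W_l - eta_l[:, None] * grad_W   # scale each row (one node) by its rate
    b_l = b_l - eta_l * grad_b
    return W_l, b_l
```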

3.3 Synaptic-to-Task Mapping

This component examines the relevance of a node to the target classes, thus enabling the positive forward and backward transfer mechanism [3]. That is, a node can still learn the current task despite being important to an old task if it exhibits strong mutual information. It is inspired by the existence of common information across different tasks, where learning an old task might expedite the learning process of a new task. The relevance of the $j$-th node of the $l$-th layer to the $o$-th target class is formulated by applying the symmetrical uncertainty measure [15] as follows:

$ST_{j,l}^{o}=\frac{2I(h_{j,l},o)}{H(h_{j,l})+H(o)}$ (5)

where $I(h_{j,l},o)$ denotes the information gain or the mutual information of the hidden node $h_{j,l}$ and the $o$-th target class (network output), while $H(h_{j,l}), H(o)$ respectively stand for the entropy of the $j$-th node of the $l$-th layer and the $o$-th target class (network output). The symmetrical uncertainty measure is chosen here because it is simple and has low bias for multi-valued features [15]. Calculation of the information gain and entropy is typically expensive. By assuming a normal distribution, the notion of differential entropy is adopted here, where the information gain and entropy are respectively defined as $I(x,y)=-\frac{1}{2}\log(1-\rho(x,y)^{2})$ and $H(x)=\frac{1}{2}(1+\log(2\pi\sigma_{x}^{2}))$ [15]. $\rho(x,y), \sigma_{x}^{2}$ refer to the Pearson's correlation measure between two variables and the variance of a random variable $x$. Note that both of them can be calculated recursively in an incremental fashion.
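A batch-wise sketch of (5) under this Gaussian assumption is shown below. In ISYANA the correlation and variance statistics are maintained recursively, whereas this illustrative version simply computes them over one batch; the function name is ours.

```python
import numpy as np

def symmetrical_uncertainty(h, y, eps=1e-12):
    """Eq. (5) with differential forms of I and H:
    I = -0.5*log(1 - rho^2) and H = 0.5*(1 + log(2*pi*var)),
    where rho is the Pearson correlation between the node activations h and
    the class indicator y (both 1-D arrays over the samples of a batch)."""
    rho = np.corrcoef(h, y)[0, 1]
    I = -0.5 * np.log(1.0 - rho ** 2 + eps)
    H_h = 0.5 * (1.0 + np.log(2.0 * np.pi * h.var() + eps))
    H_y = 0.5 * (1.0 + np.log(2.0 * np.pi * y.var() + eps))
    return 2.0 * I / (H_h + H_y)
```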

The synaptic-to-task mapping is not only applicable in the incremental class setting but is also effective in the case of concept drift, $P(Y|X)_{t}\neq P(Y|X)_{t+1}$, albeit with no introduction of new classes. In particular, a node supporting the $o$-th class is not triggered, due to its low learning rate, if the new concept has little relationship to the previous class - the distributional change leads to a change of the classification boundary. The concept change diminishes the relevance of a node if its relevance to another class is not substantiated. This mechanism only needs to store the synapses and biases of the deep neural network, $[W_{in},b],[W_{out},c]$.

3.4 Task-to-Task Mapping

This module is meant to learn the commonalities across tasks to explore the possibilities of positive forward transfer if a current task shares similarities with the previously seen tasks. The compatibility of the current task with the old tasks is examined by measuring the divergence of their class distributions. A sample of the current task having a strong relationship to the old tasks can be shared with the existing nodes playing important roles for the old tasks. In other words, a node does not stop its learning process although it is found to be pivotal for the previous tasks. Notwithstanding that the synaptic-to-task mapping already analyzes the relevance of each node to all tasks, the task-to-task mapping is still required to confirm the learning process of a new task. Furthermore, the task-to-task mapping is relatively stable compared to the synaptic-to-task mapping because its calculation is not affected by other tasks. Once a representation of a class is created, it is frozen. It is only utilized to execute the task-to-task mapping.

It is built upon a flexible deep clustering method summarizing a task into a number of clusters in the latent space. It is inspired by the idea of the deep clustering network in [12, 13] where the latent representation is constructed from a stacked autoencoder [12, 13] rather than a simple linear mapping, which is often trapped in a trivial solution. Our underlying contribution here is to perform clustering according to the labels of the tasks, and the total number of clusters is equal to the number of classes among all the tasks. That is, every class of a task is formulated as a cluster following a normal distribution, thereby enabling the evaluation of their commonalities.

Construction of a Cluster-Friendly Latent Space: the latent space is learned under the framework of a stacked autoencoder with $L_{ae}$ hidden layers. We adopt the same notion as the deep clustering network [12, 13] where the overall loss function of the clustering network encompasses the reconstruction error as follows:

$L=L(\hat{x},x)$ (6)

(6) compresses the input space in a nonlinear fashion via the use of a deep autoencoder, thereby inducing a cluster-friendly latent space. Furthermore, the application of the stacked autoencoder here prevents the trivial solution often encountered by a linear transformation [12, 13]. The end-to-end training strategy is carried out with respect to (6) for each layer, where $[W_{en}^{l_{ae}},b_{en}^{l_{ae}}]$ stand for the weight and bias of the $l_{ae}$-th encoder while $[W_{de}^{l_{ae}},b_{de}^{l_{ae}}]$ denote the weight and bias of the $l_{ae}$-th decoder. Note that the weight of the decoder is the inverse mapping of the encoder, $W_{de}^{l_{ae}}=(W_{en}^{l_{ae}})^{T}$. Note also that the AE here is trained continuously across all tasks to establish the representation of each class. Furthermore, this mechanism only needs to store the weights and biases of the autoencoder (AE), $[W_{de},b_{de}],[W_{en},b_{en}]$.
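A sketch of (6) for a single tied-weight layer is given below; the sigmoid activation and the squared-error form of the reconstruction loss are assumptions, as the text does not fix either choice.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruction_loss(x, W_en, b_en, b_de):
    """Eq. (6) for one tied-weight autoencoder layer: encode with (W_en, b_en),
    decode with W_de = W_en^T as stated above, and return the squared
    reconstruction error between x and its reconstruction x_hat."""
    z = sigmoid(W_en @ x + b_en)        # latent representation
    x_hat = sigmoid(W_en.T @ z + b_de)  # tied-weight decoder
    return float(np.sum((x - x_hat) ** 2))
```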

Calculation of the Inter-Task Mapping: once a class is represented as $N(\mu,\sigma)$, the relationship between tasks and the target class in the latent space $s_{L_{ae}}$ is formally defined using the Kullback-Leibler divergence measure as follows:

$TT_{t,o}=1\Big/\Big(\big(KL(\mu_{i,1},\mu_{t,o})+KL(\mu_{i,2},\mu_{t,o})+\dots+KL(\mu_{i,CO},\mu_{t,o})\big)/CO\Big)$ (7)

where KL is the Kullback-Leibler divergence [16] evaluating the difference between two probability distributions over the same variable $x$ [16]. In our work, we assume that each class is represented by a normal distribution. $CO$ denotes the number of clusters belonging to a specific task $i$ evaluated against the $o$-th target class belonging to a specific task $t$. A higher value of (7) indicates a stronger relationship of the current task to the old tasks. That is, a sample possibly offers the positive forward transfer mechanism and is thus able to be accepted for the current model update - a weak regularization effect is returned. No forgetting occurs here since the representation of each class is fixed once created. Furthermore, (7) does not depend on the AE representation updated continuously across all tasks. This mechanism only needs to store the mean of each class, $\mu_{t,o}$.
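The sketch below illustrates (7) under the stated normality assumption, using the closed-form KL divergence between Gaussians; the helper names and the diagonal-covariance choice are assumptions made for illustration.

```python
import numpy as np

def kl_diag_gaussians(mu1, sig1, mu2, sig2):
    """KL(N(mu1, diag(sig1^2)) || N(mu2, diag(sig2^2))) between two diagonal
    Gaussians in the latent space; each class is assumed normally distributed."""
    return float(np.sum(np.log(sig2 / sig1)
                        + (sig1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sig2 ** 2)
                        - 0.5))

def task_to_task(clusters_i, target_cluster):
    """Eq. (7): reciprocal of the average KL divergence between the CO clusters
    of task i (a list of (mu, sigma) pairs) and the cluster of the o-th target
    class of task t (a single (mu, sigma) pair)."""
    mu_t, sig_t = target_cluster
    avg_kl = np.mean([kl_diag_gaussians(mu, sig, mu_t, sig_t)
                      for mu, sig in clusters_i])
    return 1.0 / avg_kl
```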

4 Proof of Concept

This section discusses the experimental procedure of our algorithm in four problems: splitMNIST [3, 17], permutedMNIST [3], rotatedMNIST [18] and omniglot [8]. ISYANA is compared against eight recently published algorithms: Context-dependent Gating (XDG) [19], Elastic Weight Consolidation (EWC) [6], onlineEWC [3, 8], SI [7], LWF [9], Deep Generative Replay (DGR) [20], Deep Generative Replay with distillation (DGR+distill) [1, 20] and Averaged Gradient Episodic Memory (A-GEM) [21].
Simulation Protocol: the standard evaluation protocol of continual learning [5, 6] is followed here, where three evaluation metrics, namely average accuracy, positive forward transfer and positive backward transfer, are applied. It not only evaluates ISYANA's aptitude in dealing with catastrophic forgetting, but also assesses whether the past knowledge improves the network performance in learning new tasks or the newly observed knowledge in the current task undermines its capability in coping with the previous tasks. Suppose that $S\in\Re^{T\times T}$ stands for an evaluation matrix where $S_{i,j}$ denotes the test accuracy on task $T_{j}$ after completely observing task $T_{i}$, while $\hat{b}$ is the accuracy vector for each task. The three evaluation metrics [1, 2, 3] are mathematically defined as follows:

$ACC=\frac{1}{T}\sum_{i=1}^{T}S_{T,i}$ (8)
$BWT=\frac{1}{T-1}\sum_{i=1}^{T-1}(S_{T,i}-S_{i,i})$ (9)
$FWT=\frac{1}{T-1}\sum_{i=2}^{T}(S_{i-1,i}-\hat{b}_{i})$ (10)

where $ACC$, $BWT$ and $FWT$ stand for the average accuracy, the backward transfer and the forward transfer, respectively. $BWT$ and $FWT$ complement $ACC$ when different methods return similar accuracies.
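These three metrics can be computed directly from the evaluation matrix $S$, as in the following sketch (the function name is ours):

```python
import numpy as np

def continual_metrics(S, b_hat):
    """Eqs. (8)-(10): average accuracy, backward transfer and forward transfer
    from the T x T evaluation matrix S (S[i, j] = test accuracy on task j after
    training on task i) and the per-task accuracy vector b_hat."""
    T = S.shape[0]
    acc = S[T - 1, :].mean()
    bwt = np.mean([S[T - 1, i] - S[i, i] for i in range(T - 1)])
    fwt = np.mean([S[i - 1, i] - b_hat[i] for i in range(1, T)])
    return acc, bwt, fwt
```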
Baseline: the characteristics of eight baseline algorithms are outlined as follows:
XDG [19] assigns a randomly selected subset of hidden nodes to each new task while leaving the remaining nodes unchanged.
EWC [6] is a regularization approach for handling the catastrophic forgetting problem where the parameter importance matrix ZZ is derived from the Fisher information matrix.
OnlineEWC [8] is an online variant of EWC where it utilizes the Laplace approximation rather than the posterior approximation. It retains only the latest running sum of the Fisher information matrix.
SI [7] is akin to EWC but offers an alternative avenue for computing the parameter importance matrix $Z$. That is, it is calculated using the accumulated gradient of network parameters to reduce the computational burden.
LWF [9] prevents the catastrophic forgetting by formulating the joint optimization problem between the cross entropy loss of current task and the knowledge distillation loss.
DGR [20] utilizes the information replay mechanism to cope with the catastrophic forgetting problem. It makes use of the generative adversarial network (GAN) principle where there exists a generator and solver for each task.
DGR+distill [1, 20] presents an extension of DGR with the knowledge distillation concept. It reduces the computational issue on DGR with the integration of generative model into the main model through the backward connection.
A-GEM [21] is an extension of Gradient Episodic Memory (GEM) [3] where it applies some modifications to the loss function of GEM.
All algorithms are configured under the same network structure and utilize the implementation of [1, 2]. In a nutshell, the hyperparameters of the consolidated algorithms are exhibited in Tables 6 and 7. Additional details of the experiments can be found in the Appendix.
Datasets: the characteristics of the four baseline problems are detailed in the following.
Rotated MNIST is an extension of the original MNIST problem [3, 17, 18] having non-stationary traits. The changing characteristics are induced by dynamic rotations between $[0,180]$ degrees.
Permuted MNIST presents a non-stationary version of the MNIST problem by applying a fixed pixel permutation to each task. The permutations of permutedMNIST and the rotations of rotatedMNIST are unrelated across tasks and represent concept drift while keeping the target classes fixed across tasks.
Split MNIST features the incremental class problem where the full MNIST problem is divided into 5 subsets or tasks of disjoint digits: $T_{1}\in[0,1], T_{2}\in[2,3], T_{3}\in[4,5], T_{4}\in[6,7], T_{5}\in[8,9]$ (see the sketch after this list).
Omniglot is a popular benchmark problem for few-shot learning, modified here for continual learning. It presents the continual learning of handwritten characters of 50 alphabets [8]. Each alphabet is considered as a separate task. Because of its size, with 500 classes in total, this problem is capable of testing the scalability of a continual learner.
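To make the splitMNIST protocol concrete, the sketch below (the helper split_mnist_tasks is illustrative, not from the released code) groups MNIST labels into the five disjoint two-class tasks:

```python
import numpy as np

def split_mnist_tasks(labels):
    """Group sample indices into the five disjoint splitMNIST tasks
    T1=[0,1], T2=[2,3], T3=[4,5], T4=[6,7], T5=[8,9];
    `labels` is the array of MNIST digit labels."""
    digit_pairs = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]
    return [np.where(np.isin(labels, pair))[0] for pair in digit_pairs]
```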

We implement our flow based on the work of [1, 2] in the Python programming language and perform our experiments on a Windows platform. We adopt the original 28x28 pixel grey-scale images without pre-processing [1, 2]. For splitMNIST, the dataset is divided into five tasks where each task is a two-way classification. For permutedMNIST and rotatedMNIST, the dataset is split into ten tasks where each task is a ten-way classification. The Omniglot dataset consists of 50 tasks where each task encompasses 10 classes [8]. It is akin to the splitMNIST problem in that each task is devoted to the incremental class problem. We run each simulation 10 times with different random seeds and adopt the average value as the criterion to evaluate the performance of each method. Tables 1, 2, 3 and 4, corresponding to Fig. 2, Fig. 3, Fig. 4 and Fig. 5, show the numerical results, i.e., the ACC, BWT, and FWT of the different approaches for splitMNIST, permutedMNIST, rotatedMNIST and omniglot, respectively.

5 Experimental Results

Table 2, corresponding to Fig. 3, demonstrates that ISYANA offers significant performance improvement over the other algorithms in the permutedMNIST problem while achieving the highest FWT. This observation demonstrates the advantage of neuron-level plasticity control in coping with changing data distributions. The use of the synaptic-to-task mapping allows flexible activation and deactivation of a node with respect to its relevance to a task. On the other hand, the task-to-task mapping measuring the inter-task relationship is capable of supporting positive forward and backward transfer. That is, an important node of previous tasks still accepts the incoming task provided there are high commonalities between these tasks. On the other hand, ISYANA attains comparable numerical results to SI and A-GEM in the rotatedMNIST problem. Notwithstanding that A-GEM delivers the highest performance here, A-GEM exploits past samples for its sample replay mechanism. The advantage of ISYANA for large-scale continual learning problems is demonstrated in the Omniglot problem, where it outperforms the other algorithms except A-GEM with a noticeable margin. This should be interpreted carefully since A-GEM utilizes external memory for sample replay whereas ISYANA is memory-free. ISYANA also has a better BWT than A-GEM in this context. This finding is supported by the fact that ISYANA supports positive forward and backward transfer across tasks. Moreover, it is perceived that the explosion of $Z$ is apparent here for other methods, hindering their deployment in the large-scale setting. Note that the Omniglot problem comprises 50 tasks and 500 classes in total.

ISYANA is, however, inferior to A-GEM, EWC, o-EWC and SI in the splitMNIST problem. We argue that this is a result of the class-dependent nature of ISYANA, given that incremental classes are presented here, which leads to inaccurate estimation of the neuron importance and the inter-task relationship.

Table 1: Classification performance on splitMNIST dataset
Model Acc. (%) FWT BWT
XdG 60.22 0.000842 -0.000777
EWC 92.49 0.003571 -0.005320
o-EWC 92.73 0.002913 -0.002684
SI 92.32 0.003936 -0.014379
LwF 83.54 -0.001945 -0.001065
DGR 50.75 -0.003138 -0.314305
DGR+distill 83.71 -0.001902 -0.000926
A-GEM 92.37 -0.001215346 -0.002667118
ISYANA (Ours) 89.48 -0.027 -0.11323
Table 2: Classification performance on permutedMNIST dataset
Model Acc. (%) FWT BWT
XdG 82.70 -0.001100 -0.072413
EWC 71.36 0.000491 -0.018104
o-EWC 72.63 0.001222 -0.008469
SI 86.97 0.001414 -0.053851
LwF 85.78 -0.001990 -0.013095
DGR 31.12 0.000147 -0.614638
DGR+distill 84.12 -0.000105 -0.030367
A-GEM 89.61 0.002559 -0.027380
ISYANA (Ours) 91.3175 0.0028 -0.0436
Table 3: Classification performance on rotatedMNIST dataset
Model Acc. (%) FWT BWT
XdG 81.88 0.004178 -0.089339
EWC 75.69 -0.003319 -0.018699
o-EWC 77.89 -0.003201 -0.015153
SI 91.35 -0.000592 -0.027260
LwF 90.05 0.005582 0.001019
DGR 40.70 -0.000957 -0.546055
DGR+distill 86.85 0.001869 -0.035341
A-GEM 93.99 -0.000430 0.000082
ISYANA (Ours) 90.6706 -0.0015375 -0.065291667
Table 4: Classification performance on omniglot dataset
Model Acc. (%) FWT BWT
XdG 13.774 0.000272 -0.00102
EWC 10 0.001292 -0.01
o-EWC 10 0.001089 -0.00218
SI 30.208 0.002381 -0.0351
LwF 14.7 0.002925 -0.01755
DGR 10 0.001565 -0.00837
DGR+distill 14.74 0 -0.01327
A-GEM 36.36 0.00381 0.009524
ISYANA (Ours) 35.85333 0.002109 0.010136
Table 5: Ablation of the task-to-task (TT) module: performance comparison on the splitMNIST, permutedMNIST, rotatedMNIST and omniglot datasets
splitMNIST
Model Acc. (%) FWT BWT
ISYANA(No TT) 89.062 0.012268 -0.1177148
ISYANA (Ours) 89.48 -0.027 -0.11323
permutedMNIST
Model Acc. (%) FWT BWT
ISYANA(No TT) 90.372 0.0002288 -0.0520135
ISYANA (Ours) 91.3175 0.0028 -0.0436
rotatedMNIST
Model Acc. (%) FWT BWT
ISYANA(No TT) 90.136 -0.0040442 -0.0674123
ISYANA (Ours) 90.6706 -0.0015375 -0.065291667
omniglot
Model Acc. (%) FWT BWT
ISYANA(No TT) 9.99 -0.0001021 -0.0005102
ISYANA (Ours) 35.85333 0.002109 0.010136
Figure 1: The overall flow framework of ISYANA. The training phase contains several steps.
Figure 2: Classification performance of different algorithms on splitMNIST: (a) Average Accuracy, (b) Forward Transfer, (c) Backward Transfer.
Figure 3: Classification performance of different algorithms on permutedMNIST: (a) Average Accuracy, (b) Forward Transfer, (c) Backward Transfer.
Figure 4: Classification performance of different algorithms on rotatedMNIST: (a) Average Accuracy, (b) Forward Transfer, (c) Backward Transfer.
Figure 5: Classification performance of different algorithms on omniglot: (a) Average Accuracy, (b) Forward Transfer, (c) Backward Transfer.

6 Ablation Study

This section aims to study the effect of the learning components on the overall learning performance of ISYANA. The underlying interest is to assess the inter-task nature of ISYANA where shareable information of each task is exploited to enable positive forward and backward transfer. That is, the task-to-task mapping is deactivated such that the learning process is solely guided by the synaptic-to-task mapping. This study is carried out using all four datasets: splitMNIST, rotatedMNIST, permutedMNIST and omniglot. Table 5 reports our results.

From Table 5, the advantage of the task-to-task module is obvious in that its absence results in performance deterioration. It reduces the classification rate of ISYANA by around 0.4%-0.5% in the splitMNIST and rotatedMNIST problems and by around 1% in the permutedMNIST problem. Dramatic performance degeneration exists in the omniglot problem where ISYANA's accuracy decreases by around 25%. This finding demonstrates the influence of the task-to-task component for a large-scale continual learning problem. It aligns with the fact that the task-to-task component is capable of exploiting the inter-task relationship which can be shared across important neurons.

7 Other Approaches

In the continual learning domain, there exist other approaches to combat catastrophic forgetting, encompassing the memory-based approach and the structure-based approach. They are outlined as follows:
Memory-based Approach: the catastrophic forgetting problem is addressed here by utilizing an external memory of past samples to be replayed. iCaRL is a representative approach utilizing external memory to address the catastrophic forgetting problem [22]. It creates exemplar sets representing each class, and the classification decision is performed by checking the similarity of a sample to the most similar exemplar set. Gradient Episodic Memory (GEM) [3] and its extension A-GEM [21] are also categorized as memory-based approaches. GEM utilizes an episodic memory [3] storing a subset of the observed samples, where the forgetting case is indicated by measuring the angle between the gradient vector and the proposed update. A-GEM modifies the loss function of GEM, expediting the model update. Deep Generative Replay (DGR) [20] does not store previous samples to alleviate catastrophic forgetting but rather utilizes a generative adversarial network (GAN) to generate pseudo-data, preventing the catastrophic forgetting problem. This work is extended in [1] with knowledge distillation and a feedback connection. Since the memory-based approach utilizes the experience-replay mechanism to address the catastrophic forgetting problem, it is capable of handling the single-head continual learning scenario where other approaches cannot deliver. Nevertheless, compared to the regularization-based approach, the memory-based approach is computationally prohibitive. Furthermore, choosing which samples to retain remains an open question because of the dynamic learning environments.
Structure-based Approach: another approach in continual learning uses the structure-based approach, where some components of a model are frozen while a new component is introduced to embrace a new task. Progressive Neural Network (PNN) [23] is a pioneering work in this domain where a new column is inserted for a new task while freezing the old components to avoid the catastrophic forgetting problem. Dynamically Expandable Networks (DEN) [24] is another technique in this domain which addresses some flaws of PNN. It adopts a selective retraining approach and a splitting/duplicating strategy. Recently, the so-called learn-to-grow framework was put forward in [25]. It adopts neural architecture search to find a network structure that best handles each task. The underlying limitation of the structure-based approach for continual learning is seen in its high computational and memory burdens.

8 Conclusion

An inter-task synaptic mapping (ISYANA) is proposed here to perform continual learning of data streams having dynamic and evolving characteristics. ISYANA is underpinned by two components, the task-to-task and synaptic-to-task mapping components, where the task-to-task mapping is developed to study the mutual information across tasks while the synaptic-to-task mapping focuses on the mapping between a node and a task. This trait enables the exploitation of shareable information across tasks while retaining private information, thereby enhancing the forward and backward transfer. The advantage of ISYANA has been numerically validated on benchmark problems and against recently published algorithms, clearly showing improved performance. Our future work targets an extension of ISYANA for purely streaming environments where there exist uncertainties about task boundaries.

Acknowledgment

This research is financially supported by National Research Foundation, Republic of Singapore under IAFPP in the AME domain (contract no.: A19C1A0018). This work was mainly done when the first author was a research fellow in NTU.

References

  • [1] G. M. van de Ven, A. S. Tolias, Three scenarios for continual learning, arXiv preprint arXiv:1904.07734 (2019).
  • [2] G. M. van de Ven, A. S. Tolias, Generative replay with feedback connections as a general strategy for continual learning, arXiv preprint arXiv:1809.10635 (2018).
  • [3] D. Lopez-Paz, M. Ranzato, Gradient episodic memory for continual learning, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Curran Associates Inc., Red Hook, NY, USA, 2017, p. 6470–6479.
  • [4] P. Li, X. Wu, X. Hu, Learning from concept drifting data streams with unlabeled data, in: Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI’10, AAAI Press, 2010, p. 1945–1946.
  • [5] S.-W. Lee, J.-H. Kim, J. Jun, J.-W. Ha, B.-T. Zhang, Overcoming catastrophic forgetting by incremental moment matching, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Curran Associates Inc., Red Hook, NY, USA, 2017, p. 4655–4665.
  • [6] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, R. Hadsell, Overcoming catastrophic forgetting in neural networks (2016). arXiv:1612.00796.
  • [7] F. Zenke, B. Poole, S. Ganguli, Continual learning through synaptic intelligence (2017). arXiv:1703.04200.
  • [8] J. Schwarz, J. Luketina, W. M. Czarnecki, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, R. Hadsell, Progress & compress: A scalable framework for continual learning (2018). arXiv:1805.06370.
  • [9] Z. Li, D. Hoiem, Learning without forgetting (2016). arXiv:1606.09282.
  • [10] I. Paik, S. Oh, T. Kwak, I. Kim, Overcoming catastrophic forgetting by neuron-level plasticity control, in: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, AAAI Press, 2020, pp. 5339–5346.
  • [11] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, S. Wermter, Continual lifelong learning with neural networks: A review (2018). arXiv:1802.07569.
  • [12] B. Yang, X. Fu, N. D. Sidiropoulos, M. Hong, Towards k-means-friendly spaces: Simultaneous deep learning and clustering (2016). arXiv:1610.04794.
  • [13] X. Guo, X. Liu, E. Zhu, J. Yin, Deep clustering with convolutional autoencoders, in: ICONIP, 2017.
  • [14] M. Pratama, C. Za’in, A. Ashfahani, Y. S. Ong, W. Ding, Automatic construction of multi-layer perceptron network from streaming examples (2019). arXiv:1910.03437.
  • [15] R. J. Oentaryo, M. Pasquier, C. Quek, Rfcmac: A novel reduced localized neuro-fuzzy system approach to knowledge extraction, Expert Systems with Applications 38 (10) (2011) 12066 – 12084.
  • [16] P. G. Sankaran, S. M. Sunoj, N. U. Nair, Kullback-leibler divergence: A quantile approach, Statistics & Probability Letters 111 (2016) 72–79.
  • [17] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, Y. Bengio, An empirical investigation of catastrophic forgetting in gradient-based neural networks (2013). arXiv:1312.6211.
  • [18] M. Jaderberg, K. Simonyan, A. Zisserman, K. Kavukcuoglu, Spatial transformer networks (2015). arXiv:1506.02025.
  • [19] N. Y. Masse, G. D. Grant, D. J. Freedman, Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization, Proceedings of the National Academy of Sciences 115 (44) (2018) E10467–E10475.
  • [20] H. Shin, J. K. Lee, J. Kim, J. Kim, Continual learning with deep generative replay (2017). arXiv:1705.08690.
  • [21] A. Chaudhry, M. Ranzato, M. Rohrbach, M. Elhoseiny, Efficient lifelong learning with A-GEM, in: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, OpenReview.net, 2019.
    URL https://openreview.net/forum?id=Hkf2_sC5FX
  • [22] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, C. H. Lampert, iCaRL: Incremental classifier and representation learning, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 5533–5542.
  • [23] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, R. Hadsell, Progressive neural networks, ArXiv abs/1606.04671 (2016).
  • [24] J. Lee, J. Yoon, E. Yang, S. J. Hwang, Lifelong learning with dynamically expandable networks, ArXiv abs/1708.01547 (2018).
  • [25] X. Li, Y. Zhou, T. Wu, R. Socher, C. Xiong, Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting, Vol. 97 of Proceedings of Machine Learning Research, PMLR, Long Beach, California, USA, 2019, pp. 3925–3934.
    URL http://proceedings.mlr.press/v97/li19m.html

Appendix A Experimental Details

The details of each experiment are shown in Table 6 and Table 7. For EWC, Online-EWC and SI, the loss function includes a regularization term whose contribution is controlled by a strength parameter: $L_{total}=L_{current}+\lambda\times L_{reg}$. For the methods using replayed data, LwF, DGR and DGR+distill, the loss on the replayed data is added to the loss function. In the Omniglot experiment, there are 50 alphabets in the Omniglot dataset, and the tasks are divided based on the alphabets. The dataset is augmented by 20 kinds of transformations, 10 rotations and 10 shifts. In our experiments, 10 classes were chosen for each task. The Fisher matrix calculation for EWC and Online EWC increases dramatically with the training steps, which causes their poor results.

Table 6: Hyper-parameters for the experiments on split MNIST, rotated MNIST and permuted MNIST.
Experiment Split MNIST Rotate MNIST Permuted MNIST
Network(MLP) [784,256,256] [784,500,500] [784,500,500]
Optimizer SGD learning rate:0.1
Batch size 128
Training epochs 10
XdG 80% of neurons per layer to gate
EWC lambda:5000
o-EWC lambda:5000; forgetting coefficient:1.0
SI regularisation strength:1 Dampening ratio:0.1
DGR and DGR+distill VAE:MLP[784,500,500]
A-GEM replay samples from previous task:2000
ISYANA(ours) $\eta=1\cdot\exp(-Z-0.1)$
Table 7: Hyper-parameters for Omniglot Experiment
Parameter
Network(MLP) [784,500,500]
Optimizer SGD learning rate:0.1 (except ISYANA)
Batch size 32
Training epochs 5
XdG 80% of neurons per layer to gate
EWC lambda:5000
o-EWC lambda:5000
SI regularisation strength:0.1 Dampening ratio:0.1
DGR and DGR+distill VAE:MLP[784,500,500]
A-GEM replay samples from previous task:2000
ISYANA(ours) $\eta=1\cdot\exp(-Z-0.1)$