
Unbiased and Efficient Self-Supervised Incremental Contrastive Learning

Cheng Ji (Beihang University, Beijing, China; jicheng@act.buaa.edu.cn), Jianxin Li (Beihang University, Beijing, China; lijx@act.buaa.edu.cn), Hao Peng (Beihang University, Beijing, China; penghao@act.buaa.edu.cn), Jia Wu (Macquarie University, Sydney, Australia; jia.wu@mq.edu.au), Xingcheng Fu (Beihang University, Beijing, China; fuxc@act.buaa.edu.cn), Qingyun Sun (Beihang University, Beijing, China; sunqy@act.buaa.edu.cn), and Philip S. Yu (University of Illinois at Chicago, Chicago, USA; psyu@uic.edu)
Abstract.

Contrastive Learning (CL) has proven to be a powerful self-supervised approach for a wide range of domains, including computer vision and graph representation learning. However, the incremental learning issue of CL has rarely been studied, which limits its application to real-world scenarios. Contrastive learning discriminates samples against negative ones drawn from a noise distribution, and this distribution changes in incremental scenarios. Therefore, fitting only the change of data while ignoring the change of the noise distribution causes bias, and directly retraining results in low efficiency. To bridge this research gap, we propose a self-supervised Incremental Contrastive Learning (ICL) framework consisting of (i) a novel Incremental InfoNCE (NCE-II) loss function that estimates the change of the noise distribution for old data to guarantee no bias with respect to retraining, and (ii) a meta-optimization with a deep reinforced Learning Rate Learning (LRL) mechanism, which adaptively learns the learning rate according to the status of the training process and achieves the fast convergence that is critical for incremental learning. Theoretically, the proposed ICL is equivalent to retraining, based on solid mathematical derivation. In practice, extensive experiments in different domains demonstrate that, without retraining a new model, ICL achieves up to 16.7× training speedup and 16.8× faster convergence with competitive results.

Contrastive learning, incremental learning, self-supervised learning.
journalyear: 2023; copyright: acmcopyright; conference: Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining; February 27–March 3, 2023; Singapore, Singapore; booktitle: Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining (WSDM '23), February 27–March 3, 2023, Singapore, Singapore; price: 15.00; isbn: 978-1-4503-9407-9/23/02; doi: 10.1145/3539597.3570458; ccs: Computing methodologies – Online learning settings; ccs: Information systems – Data streams; ccs: Information systems – Data stream mining

1. Introduction

Contrastive Learning (CL) is a widely used self-supervised learning approach across a wide range of domains, such as computer vision (CV) (He et al., 2020; Chen et al., 2020), natural language processing (NLP) (Logeswaran and Lee, 2018; Van den Oord et al., 2018; Li et al., 2022a), and graph representation learning (GRL) (Qiu et al., 2020; You et al., 2020; Sun et al., 2021). The main idea of contrastive learning is to pull the representations of similar samples close and push distinct samples far apart through a noise contrastive estimation (NCE) (Gutmann and Hyvärinen, 2010, 2012) loss function with a given noise distribution (Van den Oord et al., 2018). However, in real-world application scenarios, online systems constantly face new data and are expected to learn incrementally, which limits the applicability of contrastive learning: the data distribution and the noise distribution are observed incrementally, so the estimation by NCE becomes biased. Nevertheless, there is little work studying the incremental learning issue of contrastive learning (ICL, Incremental Contrastive Learning).

Incremental learning aims to learn the new data without forgetting the old data (see the formal definition in Definition 3). It requires the learning model to face the stability-plasticity dilemma (Mermillod et al., 2013) with the following three characteristics: (a) stability for the old data, meaning that the model should retain and update the knowledge of the old data; (b) plasticity for the new data, requiring that the model be able to adapt to the new data; and (c) efficiency of the training process, which helps the model update quickly in real-world applications, especially in a streaming environment.

Table 1. Comparison of different strategies.
Strategy Stability Plasticity Efficiency
Inference × × ✓
Fine-tuning × × ✓
Retraining ✓ ✓ ×
ICL (Ours) ✓ ✓ ✓

✓: “unbiased” or “high”; ×: “biased” or “low”.

The major challenge of contrastive learning in the incremental setting is how to estimate the change of the noise distribution. The noise distribution in contrastive learning is used to sample the negative ones and is related and close to the data distribution (Gutmann and Hyvärinen, 2012; Van den Oord et al., 2018). Therefore, as the data change, the noise distribution also changes. Based on the trained model, the old data have been learned through NCE by contrasting against negative samples from the original noise distribution. However, after the new data arrive, the estimation becomes biased because the noise distribution used for sampling the negative samples has changed. The existing NCE methods cannot be applied to fit this change (we provide the bias analysis in Section 3). Ignoring the trained model, retraining seems to be the best way since it introduces no bias, but it is seriously time-consuming. Therefore, in this paper, we seek to answer the question: can contrastive learning estimate the old data through an unbiased NCE w.r.t. retraining, while learning the new data efficiently?

The naive strategies (e.g., inference, fine-tuning, and retraining) fail to answer this question because they cannot satisfy all three requirements of incremental learning, as shown in Table 1:

  • Poor Stability. Without a carefully designed NCE strategy, both inference and fine-tuning lead to a biased estimate due to the change in the noise distribution. In addition, directly fine-tuning the model on new data causes catastrophic forgetting (McCloskey and Cohen, 1989).

  • Weak Plasticity. Inference is not capable of learning the new data and fine-tuning is over-transferring, ignoring the contribution of old data to the noise distribution. Moreover, bias will continue to accumulate in practical online applications.

  • Low Efficiency. While retraining a new model with all data does not seem to have the above issues, the time cost is unacceptable in real-world applications.

However, there is little work on incremental learning in CL, except for applying replay and distillation techniques (Cha et al., 2021; Lin et al., 2021). Moreover, existing incremental learning approaches are hard to use for self-supervised CL because they focus mainly on class-incremental learning (Mai et al., 2021) and task-incremental learning (Serra et al., 2018). Without any information about labels or tasks, self-supervised incremental contrastive learning remains a research gap.

Contributions. To this end, we propose an unbiased and efficient self-supervised Incremental Contrastive Learning (ICL) framework. First, we design an Incremental InfoNCE (NCE-II) loss function to fit the change of noise distribution. Furthermore, we accelerate the convergence by a meta-optimization algorithm with a reinforced Learning Rate Learning (LRL) mechanism. Finally, we conduct thorough experiments on four datasets of CV and GRL to show the efficiency and effectiveness of ICL. The main contributions of this paper are summarized as follows:

  • Leveraging a new metric for measuring the change of the noise distribution, we design a novel NCE-II loss that theoretically achieves an unbiased ICL with respect to retraining. To the best of our knowledge, this is the first attempt to study the issue of unbiased incremental learning in self-supervised CL.

  • We propose a meta-optimization algorithm with LRL to adaptively learn the learning rates according to the status of the training process for fast convergence.

  • Extensive experiments demonstrate that ICL maintains high efficiency in terms of training time and convergence epoch. In addition, as an unbiased and efficient approach, ICL still achieves competitive results.

2. Background and problem formulation

We first provide the background and problem formulation of contrastive learning and incremental contrastive learning.

Definition 1 (Contrastive Learning).

Given the set of input data $X \coloneqq \{x_i \in \mathcal{X}\}_{i=1}^{N}$ of size $N$ and an encoder $\phi(\cdot): \mathcal{X} \to \mathcal{Z}$ mapping the inputs from the data space $\mathcal{X}$ into the latent space $\mathcal{Z}$, contrastive learning aims to train the encoder with a contrastive loss function $\mathcal{L}$ designed to identify the positive sample $x_i^{+}$ for a given $x_i$.

In this paper, we consider the most commonly used contrastive learning framework (Chen et al., 2020; He et al., 2020; Logeswaran and Lee, 2018; You et al., 2020) with the InfoNCE (denoted as NCE-I in this paper) as the loss function $\mathcal{L}$.

Definition 2 (InfoNCE (NCE-I)).

Given the set of input data $X \coloneqq \{x_i \in \mathcal{X}\}_{i=1}^{N}$ and an encoder $\phi(\cdot)$, the InfoNCE (NCE-I) loss is defined as:

(1) $\mathcal{L}_{i}^{I,X} = -\log\frac{f(x_{i},x_{i}^{+})}{f(x_{i},x_{i}^{+}) + K\,\mathbb{E}_{x_{i}^{-}\sim p_{n}^{X}} f(x_{i},x_{i}^{-})},$

where $f(\cdot,\cdot) \coloneqq \exp(sim(\phi(\cdot),\phi(\cdot))/\tau)$ with $sim(\cdot,\cdot)$ as the similarity measurement and $\tau$ as the temperature parameter, $x_i^{+}$ represents the positive sample generated by a data augmentation module, $x_i^{-}$ is one of the negative samples from a noise distribution, and $K$ is a hyperparameter representing the ratio of negative samples to positive samples.

Noise Distribution. The InfoNCE loss follows the noise contrastive estimation (NCE) principle (Gutmann and Hyvärinen, 2010, 2012), which learns to deduce the properties of data $X$ by comparing them against reference (noise) data $Y$. The noise data $Y$ is an i.i.d. sample $\{y_1, y_2, \dots, y_K\}$ from a random variable with noise distribution $p_n$. In this paper, the negative ones are drawn from the data distribution by uniform sampling, which can be approximated by Monte Carlo sampling (Shapiro, 2003). We randomly sample a mini-batch of $K+1$ inputs and, for each input, treat the others as its negative samples.
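To make the in-batch formulation concrete, the following is a minimal PyTorch-style sketch of NCE-I under the uniform-sampling setting described above; the function name, normalization, and batch construction are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(z, z_pos, tau=0.1):
    """Minimal in-batch InfoNCE (NCE-I) sketch.

    z, z_pos: (B, d) embeddings phi(x_i) and phi(x_i^+).
    For each x_i, the other B-1 positives in the batch act as
    negatives (K = B - 1), a Monte Carlo approximation of uniform
    sampling from the noise distribution p_n^X.
    """
    z = F.normalize(z, dim=1)          # cosine similarity via dot product
    z_pos = F.normalize(z_pos, dim=1)
    logits = z @ z_pos.t() / tau       # (B, B); f(.,.) = exp(logits)
    labels = torch.arange(z.size(0), device=z.device)  # positives on the diagonal
    # Cross-entropy over the softmax form of Eq. (4)
    return F.cross_entropy(logits, labels)
```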

Definition 3 (Incremental Contrastive Learning).

Given an encoder $\phi(\cdot)$ trained on old data $X \coloneqq \{x_i \in \mathcal{X}\}_{i=1}^{N}$ with a contrastive learning approach and new data $\Delta X \coloneqq \{x_{N+i} \in \mathcal{X}\}_{i=1}^{\Delta N}$ that has not been observed, incremental contrastive learning aims to refine the encoder to adapt to the new data without forgetting the knowledge of the old data, i.e., stability and plasticity.

Noise Distribution Change. As mentioned above, the negative ones in contrastive learning are randomly sampled from the dataset which changes in the incremental setting. That is, the noise distribution changes.

Finally, we denote $X' \coloneqq \{x_i \in X \cup \Delta X\}$ as all data. For generality, we focus on one incremental step and note that the old data $X$ represents the data used for training the encoder before, preserved by the online system through a memory queue or replay technique. Therefore, the study in this paper can be conveniently applied to a number of contrastive learning and incremental learning methods.

3. Methodology

In this section, we propose an unbiased and efficient self-supervised Incremental Contrastive Learning (ICL) framework.

3.1. Overall Framework

The proposed ICL framework consists of the following components:

  (1) Augmentation. Given each input $x_i$ (e.g., an image or a graph), two augmentations $q_1(\cdot|x_i)$ and $q_2(\cdot|x_i)$ are applied to $x_i$ to obtain a positive pair $(x_i, x_i^{+})$. For different domains of datasets (i.e., CV and GRL), different augmentation strategies are applied (Section 4.1.3).

  (2) Encoder. An encoder $\phi(\cdot): \mathcal{X} \to \mathcal{Z}$ is used to learn the latent representations for each positive pair $(x_i, x_i^{+})$. Specifically, a ResNet-18 (He et al., 2016) and a GCN (Kipf and Welling, 2017) are used for CV and GRL, respectively.

  (3) Incremental InfoNCE (NCE-II). To resolve the problems mentioned in Section 1, we design a novel loss function, named NCE-II, to eliminate the bias caused by the change of the noise distribution. Furthermore, the proposed NCE-II maintains equivalence with retraining, and the error bound of the final empirical risk tends to zero (Section 3.2).

  (4) Meta-optimization with Learning Rate Learning (LRL). To further improve the efficiency of the training process, which is crucial for incremental learning, we propose a new meta-learning optimization algorithm with a reinforced learning rate learning mechanism for fast convergence (Section 3.3).

In the following sections, we introduce the proposed objective function (NCE-II) and optimization algorithm (meta-optimization with LRL).

3.2. Objective Function: Unbiased Estimation

We split the final objective function into two parts: one for the old data and the other for the new data. The new data $\Delta X$ can be learned by NCE-I with the noise distribution $p_n^{X'}$:

(2) $\mathcal{L}_{i}^{I,X'} = -\log\frac{f(x_{i},x_{i}^{+})}{f(x_{i},x_{i}^{+}) + K\,\mathbb{E}_{x_{i}^{-}\sim p_{n}^{X'}} f(x_{i},x_{i}^{-})}.$

However, for stability as discussed in Section 1, the InfoNCE loss cannot serve as the objective function for the old data. Given the encoder trained on the old data $X$, it is necessary to find an optimization method for $X$ such that the entire training process (including the incremental learning phase) with $X$ is unbiased.

3.2.1. Motivation: Change of Noise Distribution

In order to eliminate the deviation caused by the change of the noise distribution in the learning of the old data, we first propose a metric for measuring the change ratio of the noise distribution.

To explore how noise distribution works in contrastive learning, we first rewrite the InfoNCE loss into the form of softmax-based categorical cross-entropy:

(3) $\mathcal{L}_{i}^{I,X} = \sum_{x_{j}\in X} \mathds{1}_{i=j} \cdot \left(-\log\frac{f(x_{i},x_{j}^{+})}{\sum_{x_{k}\in X} f(x_{i},x_{k}^{+})}\right)$
(4) $\;= \sum_{x_{j}\in X} \mathds{1}_{i=j} \cdot \left(-\log\,\text{softmax}_{X}(f(x_{i},x_{j}^{+}))\right).$

The softmax term in Eq. (4) represents the final prediction of the model, in which the sum of predicted probabilities over all classes changes with the noise distribution, that is, $\sum_{x_{k}\in X} f(x_{i},x_{k}^{+}) \to \sum_{x_{k}\in X\cup\Delta X} f(x_{i},x_{k}^{+})$. Therefore, we propose to use the ratio of the sum of predicted probabilities for the new classes to that for the old classes.

Definition 1 (Change Ratio of Noise Distribution).

Given the noise distribution $p_n^{X}$ of old data $X$ and $p_n^{\Delta X}$ of new data $\Delta X$, the change ratio is represented by

(5) $r_{i}^{X\to\Delta X} = \frac{\sum_{x_{j}\in\Delta X\cup\{x_{i}\}} f(x_{i},x_{j}^{+})}{\sum_{x_{k}\in X} f(x_{i},x_{k}^{+})}$
(6) $\;= \frac{f(x_{i},x_{i}^{+}) + K\,\mathbb{E}_{x_{i}^{-}\sim p_{n}^{\Delta X}} f(x_{i},x_{i}^{-})}{f(x_{i},x_{i}^{+}) + K\,\mathbb{E}_{x_{i}^{-}\sim p_{n}^{X}} f(x_{i},x_{i}^{-})}.$

With the proposed change ratio in Eq. (6), we can measure how much the noise distribution changes. If the new data maintain the same noise distribution as the old data (i.e., $p_n^{\Delta X} = p_n^{X}$), the change ratio $r_i^{X\to\Delta X}$ equals $1$. The deviation of $r_i^{X\to\Delta X}$ from $1$ reflects the degree of change of the noise distribution.
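As a concrete illustration, the change ratio of Eq. (6) can be estimated from mini-batch similarity scores by replacing the two expectations with sample means; the tensor shapes and interface below are assumptions made for the sketch.

```python
import torch

def change_ratio(f_pos, f_neg_old, f_neg_new, K):
    """Per-sample change ratio r_i^{X -> dX} of Eq. (6), sketched.

    f_pos:     (B,)   f(x_i, x_i^+)
    f_neg_old: (B, K) f(x_i, x_i^-) with negatives drawn from p_n^X
    f_neg_new: (B, K) f(x_i, x_i^-) with negatives drawn from p_n^{dX}
    The expectations are approximated by sample means over K negatives.
    """
    num = f_pos + K * f_neg_new.mean(dim=1)
    den = f_pos + K * f_neg_old.mean(dim=1)
    return num / den  # equals 1 when p_n^{dX} = p_n^X
```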

3.2.2. Objective Function: NCE-II Loss

Next, we design a novel contrastive loss function, leveraging the change ratio of the noise distribution.

Definition 2 (Incremental InfoNCE (NCE-II)).

Given the old data $X$ and new data $\Delta X$, the loss function of incremental contrastive learning for the old data is defined as

(7) $\mathcal{L}^{II}_{i} = \log\left(\alpha\cdot r_{i}^{X\to\Delta X} + (1-\alpha)\cdot 1\right)$
(8) $\;= \log\left(\alpha\cdot\frac{f(x_{i},x_{i}^{+}) + K\,\mathbb{E}_{x_{i}^{-}\sim p_{n}^{\Delta X}} f(x_{i},x_{i}^{-})}{f(x_{i},x_{i}^{+}) + K\,\mathbb{E}_{x_{i}^{-}\sim p_{n}^{X}} f(x_{i},x_{i}^{-})} + (1-\alpha)\cdot 1\right),$

where the constant $\alpha = \frac{\Delta N}{N+\Delta N} \in [0,1)$ represents the growth ratio of the data.

In the NCE-II loss, the change ratio $r_i^{X\to\Delta X}$ for each sample is weighted by the coefficient $\alpha$, which reflects the increment of the data. Specifically, when there is no new data (i.e., $\alpha = 0$) or the noise distribution remains the same (i.e., $r_i^{X\to\Delta X} = 1$), NCE-II equals $0$, which is in line with intuition.
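Building on the change_ratio() sketch above, a hedged implementation of the NCE-II loss of Eq. (8) takes only a few lines; again, the interface is an assumption for illustration.

```python
import torch

def nce_ii(f_pos, f_neg_old, f_neg_new, K, alpha):
    """NCE-II loss of Eq. (8), averaged over the old-data batch.

    Inputs follow the change_ratio() sketch above; alpha = dN/(N+dN)
    is the growth ratio. The loss is 0 when alpha = 0 (no new data)
    or when the change ratio equals 1 (unchanged noise distribution).
    """
    r = (f_pos + K * f_neg_new.mean(dim=1)) / (f_pos + K * f_neg_old.mean(dim=1))
    return torch.log(alpha * r + (1.0 - alpha)).mean()
```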

3.2.3. No Bias: Equivalence with Retraining

Given NCE-II, the encoder is trained on the old data $X$ with two contrastive loss functions (Eq. (1) and Eq. (8)), the sum of which is equivalent to the loss function of retraining on all data $X'$.

Theorem 3.

For old data $x_i \in X$, NCE-II with new data $\Delta X$ plus InfoNCE with only $X$ is equivalent to InfoNCE with all data $X'$, i.e., $\mathcal{L}_{i}^{I,X'} = \mathcal{L}_{i}^{I,X} + \mathcal{L}_{i}^{II}$.

Proof.

The proof is given in the online repositories (proofs are at https://github.com/RingBDStack/ICL-Incremental-InfoNCE). ∎
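The equivalence in Theorem 3 can also be checked numerically under the mixture assumption $\mathbb{E}_{p_n^{X'}}[f] = \alpha\,\mathbb{E}_{p_n^{\Delta X}}[f] + (1-\alpha)\,\mathbb{E}_{p_n^{X}}[f]$, which follows from $\alpha = \frac{\Delta N}{N+\Delta N}$ under uniform sampling. The scalar values in the sketch below are illustrative stand-ins, not real embeddings.

```python
import torch

def check_theorem3(seed=0, K=31, alpha=0.3):
    """Numeric sanity check of Theorem 3: L^{I,X'} = L^{I,X} + L^{II}."""
    torch.manual_seed(seed)
    f_pos = torch.rand(()) + 0.5   # f(x_i, x_i^+)
    e_old = torch.rand(()) + 0.5   # E_{p_n^X} f(x_i, x_i^-)
    e_new = torch.rand(()) + 0.5   # E_{p_n^{dX}} f(x_i, x_i^-)
    e_all = alpha * e_new + (1 - alpha) * e_old  # E_{p_n^{X'}} as a mixture

    l_old = -torch.log(f_pos / (f_pos + K * e_old))  # L_i^{I,X},  Eq. (1)
    l_all = -torch.log(f_pos / (f_pos + K * e_all))  # L_i^{I,X'}, Eq. (2)
    r = (f_pos + K * e_new) / (f_pos + K * e_old)    # Eq. (6)
    l_ii = torch.log(alpha * r + (1 - alpha))        # L_i^{II},   Eq. (8)

    assert torch.allclose(l_all, l_old + l_ii)

check_theorem3()
```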

Finally, the objective function of ICL is defined as

(9) $\mathcal{L} = \sum_{x_{i}\in X}\mathcal{L}_{i}^{II} + \sum_{x_{i}\in\Delta X}\mathcal{L}_{i}^{I,X'},$

which is an unbiased estimation of both old data and new data w.r.t. retraining. In Table 2, we provide the comparison of bias between different strategies to show the superiority of the proposed method.

Table 2. Bias of different strategies.
Strategy Old Data New Data
Inference $\log r_i^{X\to X'}$ $\mathcal{L}_i^{I,X'}$
Fine-tuning $\log r_i^{X\to X'}$ $\log r_i^{\Delta X\to X'}$
Retraining 0 0
ICL (Ours) 0 0

3.2.4. Bound Analysis.

Although the proposed NCE-II loss equals the difference between InfoNCE on all data and InfoNCE on the old data, the final empirical risk $\mathcal{R}$ over the entire training process nevertheless differs due to the change of the data distribution. Therefore, we provide a bound analysis to guarantee the correctness of the proposed NCE-II.

Theorem 4.

The difference between the empirical risk of the method with the proposed NCE-II and that of retraining with InfoNCE throughout the training process is bounded by $\alpha\mathcal{R}^{old}$, where $\mathcal{R}^{old} = \frac{1}{N}\sum_{i=1}^{N}\mathcal{L}_{i}^{I,old}$ approaches zero as $\mathcal{L}_{i}^{I,old}$ is minimized after the previous training process on old data $X$, and the growth ratio $\alpha \in [0,1)$. Then we have $\alpha\mathcal{R}^{old} \to 0$.

Proof.

The proof is given in the online repositories. ∎

3.2.5. Complexity Analysis.

Given the number of epochs $\Delta T$ in the incremental learning process, the time complexity of the proposed method is $\mathcal{O}(\Delta T((2K+1)N + (K+1)\Delta N)C)$, where $C$ represents the time cost of the encoder and the function $f(\cdot,\cdot)$. As the parameters of the encoder have already been optimized, the number of epochs is much smaller than that of the retraining process. We further guarantee this in the next section.

Figure 1. An overview of ICL. (a) The proposed Incremental InfoNCE (NCE-II) to fit the change of the noise distribution $p_n$. (b) The proposed meta-optimization, where the old data $X$ is treated as the support set in the meta-training stage and the new data $\Delta X$ is treated as the query set in the meta-testing stage. (c) The proposed Learning Rate Learning (LRL) mechanism, which uses the current loss as the state and the loss decrement as the reward to generate the learning rates ($lr_s$ and $lr_q$) as actions for meta-optimization.

3.3. Optimization: Efficient Adaptation

To further improve the efficiency of the proposed method and ensure a small number of epochs in the incremental learning process, we next propose a meta-optimization algorithm and a Learning Rate Learning (LRL) mechanism based on reinforcement learning (RL) to quickly adapt to the new data.

Formally, the optimization of the encoder is defined as

(10) $\theta_{t+1} \leftarrow \theta_{t} - lr \cdot \nabla_{\theta}\mathcal{L}(\phi(X;\theta_{t})),$

where $lr$ is the learning rate and $\nabla_{\theta}\mathcal{L}(\phi(X;\theta_{t}))$ is the gradient at time step $t$.

3.3.1. Meta-optimization

For plasticity, we treat the optimization of ICL for new data as transfer learning (the new data may contain new classes or come from different domains) and propose a meta-optimization strategy for fast adaptation, formulated in the style of model-agnostic meta-learning (MAML) (Finn et al., 2017).

We treat the old data $X$ as the support set and the new data $\Delta X$ as the query set. In the meta-training stage, we calculate the loss on the support set by Eq. (8) and obtain the new parameters $\theta'$ in a few gradient descent steps:

(11) $\theta' = \theta - lr_{s}\cdot\frac{\partial\sum_{x_{i}\in X}\mathcal{L}_{\theta}^{II}(\phi(x_{i};\theta))}{\partial\theta},$

where $lr_{s}$ is the learning rate of the meta-training process on the support set $X$. For fairness, the number of steps is set to $\max\{\lceil(1-\alpha)/\alpha\rceil, 1\}$ so that the total number of samples trained in one epoch is as close as possible to the size of all data, under the premise that there is at least one support sample for each query.

In the meta-testing stage, after obtaining the adapted parameters $\theta'$, we update the parameters $\theta$ on the query set with

(12) $\theta \leftarrow \theta - lr_{q}\cdot\frac{\partial\sum_{x_{i}\in\Delta X}\mathcal{L}_{\theta}^{I,X'}(\phi(x_{i};\theta'))}{\partial\theta},$

where $lr_{q}$ is the learning rate of the meta-testing process on the query set $\Delta X$.
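A first-order sketch of one meta-optimization step (Eqs. (11)-(12)) is given below; the first-order MAML approximation, the deep copy of the encoder, and the callable loss interfaces are simplifying assumptions, not the authors' exact procedure.

```python
import copy
import math
import torch

def meta_step(encoder, old_batch, new_batch, lr_s, lr_q, alpha,
              nce_ii_loss, nce_i_loss):
    """One meta-optimization step, first-order sketch.

    Meta-training (Eq. (11)): adapt a clone of the encoder on the
    support set (old data) with NCE-II. Meta-testing (Eq. (12)):
    compute the InfoNCE loss of the adapted clone on the query set
    (new data) and apply its gradients to the original parameters
    (a first-order MAML approximation for brevity).
    """
    n_steps = max(math.ceil((1 - alpha) / alpha), 1)

    fast = copy.deepcopy(encoder)  # theta -> theta' without touching theta
    for _ in range(n_steps):
        loss_s = nce_ii_loss(fast, old_batch)
        grads = torch.autograd.grad(loss_s, list(fast.parameters()))
        with torch.no_grad():
            for p, g in zip(fast.parameters(), grads):
                p -= lr_s * g

    loss_q = nce_i_loss(fast, new_batch)
    grads = torch.autograd.grad(loss_q, list(fast.parameters()))
    with torch.no_grad():  # first-order: copy query gradients back to theta
        for p, g in zip(encoder.parameters(), grads):
            p -= lr_q * g
    return loss_q.item()
```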

3.3.2. Learning Rate Learning (LRL)

The learning rate for gradient descent is commonly set as a hyperparameter and manually searched for the optimal value. However, determining the values of the two learning rates ($lr_{q}$ and $lr_{s}$) in our proposed meta-optimization is challenging: (a) it is unreasonable to simply set the two values to be the same or to reuse the previous one, because the two learning rates serve different training processes and data; (b) it is hard to search for the optimal values due to the large search space; and (c) the change rules of the learning rates should be learned automatically according to the status of the training environment instead of being hand-designed. Therefore, we aim to find a solution that can adaptively control the learning rates while speeding up the convergence process.

We propose a Learning Rate Learning (LRL) mechanism to learn the two learning rates ($lr_{s}$ and $lr_{q}$) adaptively according to the current state of the encoder in a reinforcement learning approach (Kaelbling et al., 1996; Xu et al., 2017; Li et al., 2021b). In our reinforcement learning setting, we propose to use a function $\mathcal{F}$ to generate the state $s_{t}$, which encodes the observation of the training process (the inputs $X$ and the parameters $\theta_{t}$) at time step $t$:

(13) $s_{t} = \mathcal{F}(X, \theta_{t}).$

The action $a_{t}$ is defined as the learned learning rate based on the state $s_{t}$, and $a_{t} \in \mathbb{R}$ is a continuous value.

To resolve the continuous action space issue, (Xu et al., 2017) uses an actor-critic algorithm (Sutton et al., 1999; Silver et al., 2014) to generate the learning rate, employing two neural networks: the actor network for learning rate generation and the critic network for criticizing the actions. However, because the inputs of the actor network are ordered and related (i.e., not i.i.d.), the reinforcement learning process becomes unstable and problematic (Mnih et al., 2013). We therefore adopt the Deep Deterministic Policy Gradient (DDPG) to learn the two learning rates separately, which contains an experience replay mechanism to resolve the non-i.i.d. issue and a separate target network mechanism for stable training (Lillicrap et al., 2016).

In general, the DDPG networks $\mathcal{D} = \{\mu, Q, \mu', Q'\}$ consist of the (online) actor network $\mu(s|\theta^{\mu})$ and (online) critic network $Q(s,a|\theta^{Q})$, together with the target actor network $\mu'$ and the target critic network $Q'$ initialized with the same weights as the online ones. We first select an action $a_{t}$ at step $t$ as

(14) $a_{t} = \mu(s_{t}|\theta^{\mu}) + \mathcal{N}_{t},$

where $\mathcal{N}$ is the Ornstein-Uhlenbeck (OU) process (Uhlenbeck and Ornstein, 1930) that generates noise for action exploration. Next, the action $a_{t}$ is executed (updating the encoder's parameters with $a_{t}$ as the learning rate) and the new state $s_{t+1}$ and reward $r_{t}$ are observed. As the aim of LRL is to accelerate convergence, we define the reward $r_{t}$ as the loss decrement:

(15) $r_{t} = \mathcal{L}_{t} - \mathcal{L}_{t+1}.$

Finally, we optimize the critic network using Temporal-Difference (TD) learning with a replay buffer, and the actor network with the chain rule (Lillicrap et al., 2016). The target networks are updated by a momentum mechanism. Please refer to the online repositories for more details.
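As a rough sketch of the action-selection side of LRL, the snippet below maps the current loss (the state) to a learning rate with OU exploration noise. A small MLP stands in for the paper's two-layer LSTM actor, the action scaling is an assumption, and the replay buffer and TD updates of DDPG are omitted for brevity.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """mu(s | theta^mu): maps the training state to a learning rate."""
    def __init__(self, state_dim=1, hidden=20):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, s):
        return 1e-2 * self.net(s)  # scale actions into a plausible lr range

class Critic(nn.Module):
    """Q(s, a | theta^Q): scores a (state, learning-rate) pair."""
    def __init__(self, state_dim=1, hidden=10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + 1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def ou_step(n_prev, theta=0.15, mu=0.0, sigma=0.2):
    """One step of the Ornstein-Uhlenbeck process for exploration noise."""
    return n_prev + theta * (mu - n_prev) + sigma * torch.randn_like(n_prev)

def select_lr(actor, loss_t, n_prev):
    """Eq. (14): a_t = mu(s_t) + N_t, with the state s_t taken as the
    current average loss. After the encoder update, the reward
    r_t = L_t - L_{t+1} (Eq. (15)) would be stored in the replay buffer."""
    s_t = torch.tensor([[loss_t]])
    n_t = ou_step(n_prev)
    lr = (actor(s_t) + n_t).clamp_min(1e-6)  # keep the learning rate positive
    return lr.item(), n_t

# Usage: start with n_prev = torch.zeros(1, 1) and call select_lr()
# once per training step for each of lr_s and lr_q.
```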

4. Experiment

Extensive experiments across two domains, computer vision and graph representation learning, are conducted to demonstrate the high efficiency of the proposed unbiased ICL framework (code is available at https://github.com/RingBDStack/ICL-Incremental-InfoNCE). In addition, ICL also achieves competitive results. We further provide an incremental setting analysis and an ablation study.

Table 3. Speedup of running time and convergence epoch (w.r.t. retraining) for classification on CV datasets. (bold: best; underlined: runner-up)
Dataset ImageNet-2 MNIST-2 Avg. Rank
Growth Ratio α=0.3 α=0.5 α=0.7 α=0.3 α=0.5 α=0.7
Metric Time Epoch Time Epoch Time Epoch Time Epoch Time Epoch Time Epoch Time Epoch
Retraining 1.0x 1.0x 1.0x 1.0x 1.0x 1.0x 1.0x 1.0x 1.0x 1.0x 1.0x 1.0x - -
Fine-tuning 10.7x 3.2x 3.9x 2.0x 2.2x 1.6x 12.3x 3.6x 5.5x 2.6x 3.8x 2.7x 2.3 3.6
Replay 3.4x 2.0x 2.0x 1.6x 3.4x 2.7x 2.8x 2.6x 3.5x 2.5x 3.2x 2.4x 4.3 4.5
Distillation 6.9x 2.9x 2.1x 1.4x 2.2x 1.5x 7.8x 3.0x 4.4x 2.8x 3.1x 2.2x 4.0 4.8
ICL (Ours) 11.2x 16.8x 6.2x 13.2x 5.9x 12.7x 14.2x 10.6x 16.7x 12.7x 6.4x 7.1x 1.0 1.0
ICL (w/o LRL) 9.6x 14.7x 3.4x 5.3x 1.1x 2.2x 1.0x 1.0x 2.4x 2.5x 1.5x 2.3x 5.2 4.0
ICL (w/o M) 9.2x 15.7x 4.8x 8.1x 1.7x 3.8x 1.1x 1.3x 1.2x 1.5x 1.0x 1.2x 5.3 4.5
ICL (w/o M+LRL) 9.8x 16.4x 1.6x 2.7x 1.0x 1.3x 1.7x 1.7x 1.3x 1.3x 1.2x 1.3x 5.7 5.3
Table 4. Speedup of running time and convergence epoch (w.r.t. retraining) for classification on GRL datasets. (bold: best; underlined: runner-up)
Dataset PROTEINS REDDIT Avg. Rank
Growth Ratio α=0.3 α=0.5 α=0.7 α=0.3 α=0.5 α=0.7
Metric Time Epoch Time Epoch Time Epoch Time Epoch Time Epoch Time Epoch Time Epoch
Retraining 1.0x 1.0x 1.0x 1.0x 1.0x 1.0x 1.0x 1.0x 1.0x 1.0x 1.0x 1.0x - -
Fine-tuning 25.7x 8.0x 8.1x 3.9x 2.7x 1.9x 14.4x 1.1x 5.5x 2.9x 1.2x 0.9x 1.7 3.8
Replay 2.9x 2.1x 1.6x 1.0x 1.4x 1.4x 2.9x 2.1x 2.0x 1.4x 1.3x 1.1x 4.8 5.8
Distillation 18.2x 7.3x 5.2x 2.7x 0.9x 0.8x 2.4x 1.5x 2.7x 1.6x 1.1x 0.9x 4.8 5.3
ICL (Ours) 10.1x 10.5x 5.8x 6.1x 2.6x 2.7x 2.7x 2.5x 2.9x 3.1x 2.3x 3.2x 3.0 1.8
ICL (w/o LRL) 3.8x 14.6x 1.6x 1.7x 1.0x 1.1x 3.4x 3.3x 2.7x 2.9x 2.6x 3.6x 3.8 3.0
ICL (w/o M) 2.6x 3.6x 2.1x 2.9x 1.4x 2.1x 3.0x 3.2x 1.4x 1.6x 1.2x 1.3x 5.0 3.7
ICL (w/o M+LRL) 2.5x 2.9x 1.9x 2.0x 1.1x 1.2x 5.9x 5.8x 2.2x 2.4x 2.5x 2.5x 4.1 4.0

4.1. Experimental Settings

4.1.1. Datasets

We use four benchmark datasets across computer vision (CV) and graph representation learning (GRL). For CV, we use the following two datasets: (a) ImageNet is a benchmark dataset consisting of around 1.28 million images (Deng et al., 2009). To reduce the influence of additional factors for an accurate efficiency evaluation, we extract a subset with 2 classes, named ImageNet-2, so that the entire training process can be completed on one GPU within 6 hours. (b) MNIST is a handwritten digit dataset (LeCun et al., 1998). Similarly, we extract a subset with 2 classes, named MNIST-2. For GRL, we use the following two datasets: (a) PROTEINS is a bioinformatics dataset with 1,113 molecular graphs (Morris et al., 2020). (b) REDDIT is a social network dataset consisting of 2,000 graphs in 2 classes (Morris et al., 2020). We use the Local Degree Profile (LDP) algorithm (Cai and Wang, 2018) to generate node features.

4.1.2. Baselines and ICL Variants

We compare against four baselines, including simple strategies (retraining and fine-tuning) and commonly used techniques (replay and distillation).

  • Retraining uses both old data and new data to train a new encoder. Since the aim of this study is to find an approach for unbiased estimation, we first compare ICL with retraining which serves as the baseline of efficiency and the upper bound of effectiveness.

  • Fine-tuning updates the parameters of the pre-trained encoder using the new data.

Furthermore, since ICL is under the self-supervised setting, which lacks related research, we adopt the replay and distillation techniques used in incremental learning (Rebuffi et al., 2017; Castro et al., 2018; Fang et al., 2020; Cha et al., 2021; Lin et al., 2021).

  • Replay uses previously seen data (partially stored old data) and the new data to train the encoder, avoiding catastrophic forgetting.

  • Distillation learns a more compact encoder from the old one to prevent over-drift of representations from previous data when learning new ones.

For fairness, we use the exact same network architecture for all methods.

We further introduce the following three variants of ICL to verify the effectiveness of the components (LRL and meta-optimization):

  • ICL without LRL mechanism (w/o LRL), i.e., with NCE-II and meta-optimization.

  • ICL without meta-optimization (w/o M), i.e., with NCE-II and LRL only generating one learning rate.

  • ICL without meta-optimization and LRL mechanism (w/o M+LRL), i.e., only with NCE-II.

4.1.3. Implementation Details

We adopt a two-stage scheme (You et al., 2020; Li et al., 2021a, 2022b). In the training process, for fairness, all methods use the same single encoder to generate the representations. In the testing process, an extra SVM classifier (You et al., 2020) is trained on the fixed embeddings for evaluation. To evaluate the efficiency of the methods more accurately and reduce the influence of external factors, we conduct each experiment on a single Nvidia V100 32G GPU five times independently. We record the mean value as the final accuracy for each experiment and omit the standard deviations (all deviations are around 0.01). (a) The encoder used for CV is ResNet-18 (He et al., 2016), and a two-layer Graph Convolutional Network (GCN) (Kipf and Welling, 2017) with 32 hidden units is used for GRL. (b) We choose {random resizing, 224×224-pixel cropping, random color jittering, random grayscale conversion, random horizontal flip} (Wu et al., 2018; He et al., 2020) as the data augmentation for images, and randomly sample one from {random node dropping, random node attribute masking, random subgraph selection} (You et al., 2020) for GRL. (c) The similarity measurement is the cosine similarity function $sim(z_i, z_j) \coloneqq z_i \cdot z_j / (\|z_i\| \cdot \|z_j\|)$. (d) The state generated by the function $\mathcal{F}$ in LRL is defined as the average loss of the current training process on the mini-batch data. (e) A two-layer LSTM with 20 hidden units is used as the actor network, and a three-layer neural network (NN) with 10 hidden units is used as the critic network.
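For the image side, the augmentation list above corresponds roughly to the following torchvision pipeline; the jitter strengths and probabilities are assumptions, as the paper only cites (Wu et al., 2018; He et al., 2020) for the exact settings.

```python
import torchvision.transforms as T

# A sketch of the image augmentation described above; the specific
# parameter values are illustrative assumptions.
image_aug = T.Compose([
    T.RandomResizedCrop(224),           # random resizing + 224x224-pixel cropping
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),  # random color jittering
    T.RandomGrayscale(p=0.2),           # random grayscale conversion
    T.RandomHorizontalFlip(),           # random horizontal flip
    T.ToTensor(),
])
```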

4.1.4. Parameter Settings

For common hyperparameters, the number of negative samples is $K = 31$ (i.e., a batch size $bs = 32$), the initial learning rate is searched from $lr \in \{10^{-3}, 10^{-4}, 10^{-5}\}$, the temperature parameter is $\tau = 0.1$, and the patience to wait for convergence is 50 epochs (i.e., the process is terminated when the loss no longer drops for 50 epochs). For LRL, the momentum term is $m = 10^{-3}$.

4.2. Efficiency Analysis: How fast is ICL?

We first perform image- and graph-level classification on the CV and GRL datasets. To simulate different incremental scenarios, we randomly split each dataset into old data and new data according to the given growth ratio $\alpha$. As an unbiased approach, the proposed ICL framework achieves high efficiency and fast convergence. Specifically, we have the following observations.

High Efficiency. The proposed ICL framework has a great superiority in terms of training time, as shown in Table 3 and Table 4, where we use the training-time speedup over retraining, $s_{e,i} = time_{retrain}/time_{i}$ for method $i$, as the metric. (a) For the CV datasets, ICL achieves a speedup of up to 16.7× w.r.t. retraining. Moreover, without the meta-optimization and LRL (i.e., applying only NCE-II), ICL still brings a speedup of up to 9.8×. Overall, ICL is clearly the most efficient approach in all cases on the CV datasets, even compared with fine-tuning and distillation, which do not need to train on the old data. (b) For the GRL datasets, ICL also achieves a speedup of up to 10.1×, and still 5.9× without the meta-optimization and LRL. It is worth mentioning that ICL can beat replay and distillation with only NCE-II.

Fast Convergence. The proposed ICL framework gives the fastest convergence speed. We use the speedup of the convergence epoch over retraining, $s_{c,i} = epoch_{retrain}/epoch_{i}$ for method $i$, as the metric. (a) For the CV datasets, ICL achieves a speedup of up to 16.8× compared with retraining, and the improvements are significant in all cases. Moreover, ICL with only NCE-II (w/o M+LRL) still gains a speedup of up to 16.4×. As with training time, the convergence speedup of ICL exceeds all methods. (b) For the GRL datasets, the conclusions are consistent: ICL and its variants achieve a speedup of up to 14.6×.

Therefore, the proposed ICL maintains significant advantages in reducing training time and accelerating convergence. Furthermore, although the time complexity of ICL is larger, the results still show the superiority of ICL in terms of efficiency, which is attributable to the LRL mechanism.

4.3. Effectiveness Analysis: Does ICL impede the model?

As an unbiased and efficient approach, the proposed ICL also achieves competitive performance, as shown in Figure 2 and Figure 3. It is clearly observed that applying ICL to a contrastive learning approach does not impede the representation ability of the model. Specifically, the difference between ICL and the others falls in $[-0.0137, +0.0126]$ with an average improvement of 0.002.

4.4. Incremental Setting Analysis: When to use ICL?

We study the change of the speedup of training time and convergence epoch with the variation of the growth ratio $\alpha$. For fairness, we compare the unbiased methods (ICL and retraining) and report the results in Figure 4. Specifically, we have the following findings.

Consistent Efficiency. We vary $\alpha$ from 0.1 to 0.9 at 0.1 intervals, i.e., simulating an amount of new data ranging from 1/9 of the old data to 9 times the old data. Overall, ICL achieves a 2.2×-18.3× speedup of training time and a 2.7×-22.9× speedup of convergence epoch.

Adaptation Ability with a Large Ratio of New Data. The superiority of ICL in training time is obvious even when $\alpha$ is large. Moreover, from Table 3 and Table 4, ICL consistently achieves faster convergence while the other baselines nearly equal retraining. Thus, ICL is more practical in scenarios with a large ratio of new data.

Limited Superiority by the Ratio of New Data. As $\alpha$ increases, the speedup of ICL becomes smaller due to the large amount of new data. Thus, overmuch new data weakens incremental learning strategies. However, ICL nevertheless gains a 2.2×-5.3× speedup even when the amount of new data is 9 times that of the old data.

4.5. Ablation Study: How does ICL work?

We further conduct an ablation study of the three important components: the NCE-II loss, the meta-optimization algorithm, and the LRL mechanism. We provide the loss curves compared with NCE-I and Adam in Figure 5.

NCE-II Loss is unbiased and useful. From Table 3 and Table 4, ICL with only NCE-II (i.e., without meta-optimization and LRL) still achieves faster training and convergence, up to 9.8× and 16.4× respectively. From Figure 5, the loss with NCE-II decreases faster than with NCE-I alone, reflecting that NCE-II helps to quickly adapt to new data with an unbiased estimation.

Meta-optimization is essential. As shown in Table 3 and Table 4, without meta-optimization, the efficiency is significantly lowered. Moreover, comparing ICL with only NCE-II against ICL with meta-optimization, the efficiency is improved. Thus, meta-optimization contributes considerably to fast adaptation to new data.

Learning Rate Learning has a vital contribution to the improvement of efficiency. From Table 3 and Table 4, without the LRL mechanism, the degree of improvement in efficiency is reduced. In Figure 5, it is obvious that the loss declines faster with LRL than with the Adam algorithm.

Figure 2. Classification results of old data on four datasets: (a) ImageNet-2, (b) MNIST-2, (c) PROTEINS, (d) REDDIT. (R/T: Retraining; F/T: Fine-tuning; R/P: Replay; Dist.: Distillation)
Figure 3. Classification results of new data on four datasets: (a) ImageNet-2, (b) MNIST-2, (c) PROTEINS, (d) REDDIT. (R/T: Retraining; F/T: Fine-tuning; R/P: Replay; Dist.: Distillation)
Figure 4. Speedup of unbiased methods (ICL and retraining) with the variation of $\alpha$: (a) ImageNet-2 (Time), (b) ImageNet-2 (Epoch), (c) PROTEINS (Time), (d) PROTEINS (Epoch). For fairness, the biased ones (fine-tuning, replay, and distillation) are excluded.
Figure 5. Training process on (a) ImageNet-2 and (b) REDDIT.

5. Related Work

5.1. Contrastive Learning

Contrastive learning (CL) is a self-supervised approach that aims at learning to discriminate samples by contrasting them against negative ones (Liu et al., 2021; Jaiswal et al., 2020; Li et al., 2021a). Contrastive Predictive Coding (CPC) proposed the InfoNCE loss to maximize the mutual information between a sample and its positive one (Van den Oord et al., 2018; Sun et al., 2022). Recently, the InfoNCE loss has been widely used in computer vision (CV). SimCLR and MoCo subsequently use contrastive learning to generate image representations through different data augmentation methods (Chen et al., 2020; He et al., 2020) and achieve results comparable to state-of-the-art supervised approaches. Furthermore, contrastive learning is also well researched in graph representation learning (GRL). Graph Contrastive Coding (GCC) and GraphCL apply InfoNCE to learn node and graph representations (Qiu et al., 2020; You et al., 2020). The above studies follow a framework similar to Definition 1, which is the backbone of ICL.

5.2. Incremental Learning

Incremental learning is a learning process in which new data continuously arrives from the environment (Ade and Deshmukh, 2013; Peng et al., 2017, 2021). Most studies of incremental learning focus on supervised learning. For example, iCaRL proposed an incremental classifier and representation learning approach for supervised incremental learning (Rebuffi et al., 2017). In addition, (Castro et al., 2018) proposed an end-to-end incremental learning method for the class-incremental setting. However, these methods are hard to apply to self-supervised contrastive learning due to the lack of labels. Recently, some works incorporate replay and distillation techniques into CL (Lin et al., 2021; Cha et al., 2021), but they still have the bias issue.

6. Conclusion

In this paper, we studied the unbiased incremental learning issue in self-supervised contrastive learning and proposed the Incremental Contrastive Learning (ICL) framework. Specifically, we designed an Incremental InfoNCE (NCE-II) loss function to give an unbiased estimation under noise distribution change in incremental scenarios. Moreover, we proposed a meta-optimization algorithm with a Learning Rate Learning (LRL) mechanism to achieve fast convergence. The experiments demonstrated the efficiency and effectiveness of the proposed ICL framework.

Acknowledgements.
The corresponding author is Jianxin Li. This work is supported in part by NSFC through grant U20B2053 and NSF under grants III-1763325, III-1909323, III-2106758, and SaTC-1930941.

References

  • Ade and Deshmukh (2013) RR Ade and PR Deshmukh. 2013. Methods for incremental learning: a survey. International Journal of Data Mining & Knowledge Management Process 3, 4 (2013), 119.
  • Cai and Wang (2018) Chen Cai and Yusu Wang. 2018. A simple yet effective baseline for non-attributed graph classification. arXiv preprint arXiv:1811.03508 (2018).
  • Castro et al. (2018) Francisco M Castro, Manuel J Marín-Jiménez, Nicolás Guil, Cordelia Schmid, and Karteek Alahari. 2018. End-to-end incremental learning. In Proceedings of the European conference on computer vision (ECCV). 233–248.
  • Cha et al. (2021) Hyuntak Cha, Jaeho Lee, and Jinwoo Shin. 2021. Co2l: Contrastive continual learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 9516–9525.
  • Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International conference on machine learning (ICML). PMLR, 1597–1607.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, 248–255.
  • Fang et al. (2020) Zhiyuan Fang, Jianfeng Wang, Lijuan Wang, Lei Zhang, Yezhou Yang, and Zicheng Liu. 2020. SEED: Self-supervised Distillation For Visual Representation. In International Conference on Learning Representations (ICLR).
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning (ICML). PMLR, 1126–1135.
  • Gutmann and Hyvärinen (2010) Michael Gutmann and Aapo Hyvärinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the thirteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 297–304.
  • Gutmann and Hyvärinen (2012) Michael U Gutmann and Aapo Hyvärinen. 2012. Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics. Journal of machine learning research 13, 2 (2012).
  • He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR). 9729–9738.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). 770–778.
  • Jaiswal et al. (2020) Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, and Fillia Makedon. 2020. A survey on contrastive self-supervised learning. Technologies 9, 1 (2020), 2.
  • Kaelbling et al. (1996) Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. 1996. Reinforcement learning: A survey. Journal of artificial intelligence research 4 (1996), 237–285.
  • Kipf and Welling (2017) Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR).
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
  • Li et al. (2021a) Jianxin Li, Cheng Ji, Hao Peng, Yu He, Yangqiu Song, Xinmiao Zhang, and Fanzhang Peng. 2021a. RWNE: A Scalable Random-Walk based Network Embedding Framework with Personalized Higher-order Proximity Preserved. Journal of Artificial Intelligence Research 71 (2021), 237–263.
  • Li et al. (2022b) Jianxin Li, Tianchen Zhu, Haoyi Zhou, Qingyun Sun, Chunyang Jiang, Shuai Zhang, and Chunming Hu. 2022b. AIQoSer: Building the efficient Inference-QoS for AI Services. In 2022 IEEE/ACM 30th International Symposium on Quality of Service (IWQoS). IEEE, 1–10.
  • Li et al. (2022a) Qian Li, Jianxin Li, Jiawei Sheng, Shiyao Cui, Jia Wu, Yiming Hei, Hao Peng, Shu Guo, Lihong Wang, Amin Beheshti, et al. 2022a. A Survey on Deep Learning Event Extraction: Approaches and Applications. IEEE Transactions on Neural Networks and Learning Systems (2022).
  • Li et al. (2021b) Qian Li, Hao Peng, Jianxin Li, Jia Wu, Yuanxing Ning, Lihong Wang, S Yu Philip, and Zheng Wang. 2021b. Reinforcement learning-based dialogue guided event extraction to exploit argument relations. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2021), 520–533.
  • Lillicrap et al. (2016) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning.. In International Conference on Learning Representations (ICLR).
  • Lin et al. (2021) Zhiwei Lin, Yongtao Wang, and Hongxiang Lin. 2021. Continual Contrastive Self-supervised Learning for Image Classification. arXiv preprint arXiv:2107.01776 (2021).
  • Liu et al. (2021) Xiao Liu, Fanjin Zhang, Zhenyu Hou, Li Mian, Zhaoyu Wang, Jing Zhang, and Jie Tang. 2021. Self-supervised learning: Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering (TKDE) (2021).
  • Logeswaran and Lee (2018) Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In International Conference on Learning Representations (ICLR).
  • Mai et al. (2021) Zheda Mai, Ruiwen Li, Hyunwoo Kim, and Scott Sanner. 2021. Supervised contrastive replay: Revisiting the nearest class mean classifier in online class-incremental continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 3589–3599.
  • McCloskey and Cohen (1989) Michael McCloskey and Neal J Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation. Vol. 24. Elsevier, 109–165.
  • Mermillod et al. (2013) Martial Mermillod, Aurélia Bugaiska, and Patrick Bonin. 2013. The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects. Frontiers in psychology 4 (2013), 504.
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
  • Morris et al. (2020) Christopher Morris, Nils M Kriege, Franka Bause, Kristian Kersting, Petra Mutzel, and Marion Neumann. 2020. Tudataset: A collection of benchmark datasets for learning with graphs. arXiv preprint arXiv:2007.08663 (2020).
  • Peng et al. (2017) Hao Peng, Jianxin Li, Yangqiu Song, and Yaopeng Liu. 2017. Incrementally learning the hierarchical softmax function for neural language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
  • Peng et al. (2021) Hao Peng, Renyu Yang, Zheng Wang, Jianxin Li, Lifang He, S Yu Philip, Albert Y Zomaya, and Rajiv Ranjan. 2021. Lime: Low-cost and incremental learning for dynamic heterogeneous information networks. IEEE Trans. Comput. 71, 3 (2021), 628–642.
  • Qiu et al. (2020) Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. 2020. Gcc: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD). 1150–1160.
  • Rebuffi et al. (2017) Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. 2017. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR). 2001–2010.
  • Serra et al. (2018) Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. 2018. Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning (ICML). PMLR, 4548–4557.
  • Shapiro (2003) Alexander Shapiro. 2003. Monte Carlo sampling methods. Handbooks in operations research and management science 10 (2003), 353–425.
  • Silver et al. (2014) David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic policy gradient algorithms. In International conference on machine learning (ICML). PMLR, 387–395.
  • Sun et al. (2022) Qingyun Sun, Jianxin Li, Hao Peng, Jia Wu, Xingcheng Fu, Cheng Ji, and S Yu Philip. 2022. Graph structure learning with variational information bottleneck. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 4165–4174.
  • Sun et al. (2021) Qingyun Sun, Jianxin Li, Hao Peng, Jia Wu, Yuanxing Ning, Philip S Yu, and Lifang He. 2021. Sugar: Subgraph neural network with reinforcement pooling and self-supervised mutual information mechanism. In Proceedings of the Web Conference 2021. 2081–2091.
  • Sutton et al. (1999) Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. 1999. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems (NeurIPS) 12 (1999).
  • Uhlenbeck and Ornstein (1930) George E Uhlenbeck and Leonard S Ornstein. 1930. On the theory of the Brownian motion. Physical review 36, 5 (1930), 823.
  • Van den Oord et al. (2018) Aaron Van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
  • Wu et al. (2018) Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. 2018. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). 3733–3742.
  • Xu et al. (2017) Chang Xu, Tao Qin, Gang Wang, and Tie-Yan Liu. 2017. Reinforcement learning for learning rate control. arXiv preprint arXiv:1705.11159 (2017).
  • You et al. (2020) Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. 2020. Graph contrastive learning with augmentations. Advances in Neural Information Processing Systems (NeurIPS) 33 (2020), 5812–5823.