
Less is More: Rethinking State-of-the-art Continual Relation Extraction Models with a Frustratingly Easy but Effective Approach

Peiyi Wang1  Yifan Song1  Tianyu Liu3  Rundong Gao2
Binghuai Lin3  Yunbo Cao3  Zhifang Sui1
1 Key Laboratory of Computational Linguistics, Peking University, MOE, China
2 Center for Data Science, Peking University
3 Tencent Cloud Xiaowei
wangpeiyi9979@gmail.com; gaord20@stu.pku.edu.cn
{rogertyliu, binghuailin, yunbocao}@tencent.com; {yfsong, szf}@pku.edu.cn
Abstract

Continual relation extraction (CRE) requires the model to continually learn new relations from class-incremental data streams. In this paper, we propose a Frustratingly Easy but Effective Approach (Fea) with two learning stages for CRE: 1) Fast Adaption (FA) warms up the model with only new data. 2) Balanced Tuning (BT) finetunes the model on the balanced memory data. Despite its simplicity, Fea achieves comparable (on Tacred) or superior (on Fewrel) performance compared with state-of-the-art baselines. With careful examination, we find that the data imbalance between new and old relations leads to a skewed decision boundary in the head classifiers over the pretrained encoders, thus hurting the overall performance. In Fea, the FA stage unleashes the potential of memory data for the subsequent finetuning, while the BT stage helps establish a more balanced decision boundary. With a unified view, we find that two strong CRE baselines can be subsumed into the proposed training pipeline. The success of Fea also provides actionable insights and suggestions for future model design in CRE.

1 Introduction

Relation extraction (RE) aims to identify the relation between two given entities in a sentence, and is one of the cornerstones of knowledge graph construction and completion Riedel et al. (2013). Traditional RE models Gormley et al. (2015); Xu et al. (2015a); Zhang et al. (2018) are trained on a static dataset with a predefined relation set, which is inadequate for real-world applications where new relations constantly emerge.

To adapt to real-world applications, continual relation extraction (CRE) has been proposed Wang et al. (2019); Han et al. (2020). CRE requires the model to continually learn new relations from class-incremental data streams. In CRE, it is often infeasible to combine the new data with previous data and retrain the model, due to privacy policies, storage limits or computation overhead Biesialska et al. (2020); Wu et al. (2021). Therefore, CRE models usually suffer from catastrophic forgetting, i.e., the performance on previously learned relations drops rapidly while learning new relations.

To alleviate this problem, researchers usually retain a few instances of each learned relation in a memory cache, and propose different ways to replay the stored memory data while learning new relations. For example, Han et al. (2020) propose memory replay, activation and reconsolidation protocols to obtain relation prototype representations. Cui et al. (2021) propose a memory network that refines the representations of input instances to better utilize memory data. Wu et al. (2021) propose a curriculum-meta learning method that reduces the replay frequency of memory data to alleviate over-fitting. Zhao et al. (2022) introduce supervised contrastive learning and knowledge distillation to learn better memory representations.

In this paper, we propose a frustratingly easy but effective approach (Fea) with two learning stages for CRE: 1) in Fast Adaption (FA), we train the model with only new data to rapidly learn new relations. 2) in Balanced Tuning (BT), we finetune the model with the balanced updated memory by downsampling new data and then adding the downsampled instances to the existing memory. Despite its simplicity, Fea achieves comparable (on Tacred) or superior (on Fewrel) performance compared with the state-of-the-art baselines.

To better understand why our Fea works, we conduct a series of analyses and observe that: 1) Catastrophic forgetting happens because CRE models often mistakenly predict instances of existing old relations as newly emerging relations. 2) The intrinsic data imbalance between newly emerging data and the existing memory cache leads to a skewed decision boundary in the head classifier, which is one of the main causes of catastrophic forgetting. 3) The Fast Adaption stage retains the potential of memory data for the subsequent finetuning, while the Balanced Tuning stage circumvents the data imbalance problem and helps build a more balanced decision boundary. 4) Constructing a better classifier with a balanced decision boundary greatly benefits CRE models.

With a deep dive into the best-performing CRE models, we also find that the proposed Fea can be viewed as a simplified but more effective variant of two strong baselines, namely EMAR Han et al. (2020) and RPCRE Cui et al. (2021), obtained by removing or replacing certain modules. We show that EMAR and RPCRE can be enhanced by incorporating similar design principles to Fea, and the success of Fea also provides actionable insights and suggestions for future model design in CRE. Our contributions are listed as follows:

  • We propose a frustratingly easy but effective approach (Fea) with Fast Adaption and Balanced Tuning for CRE.

  • Extensive experiments show the effectiveness of Fea on two benchmarks. We also conduct thorough analysis on why catastrophic forgetting happens and why Fea works.

  • By aligning the underlying designs of two strong CRE baselines with that of Fea, we provide actionable insights for developing effective CRE models in future research.

2 Related Work

2.1 Relation Extraction

Traditional relation extraction mainly includes supervised methods Zelenko et al. (2003); Zhou et al. (2005); Zeng et al. (2014); Gormley et al. (2015); Xu et al. (2015b); Miwa and Bansal (2016) and distantly supervised methods Mintz et al. (2009); Zeng et al. (2015); Lin et al. (2016); Han et al. (2018a); Liu et al. (2019); Baldini Soares et al. (2019). Conventional RE works mainly utilize human-designed lexical and syntactic features, e.g., the shortest dependency path and POS tags, to predict the relation of two entities Zelenko et al. (2003); Zhou et al. (2005); Mintz et al. (2009), which requires substantial human engineering. Recently, deep learning-based methods have been proposed to reduce this human labor and outperform feature-based methods. For example, Zeng et al. (2015) encode sentences with convolutional neural networks. Han et al. (2018a) introduce an attention mechanism to aggregate sentence information for given entities. Alt et al. (2019); Baldini Soares et al. (2019) introduce pretrained language models for relation extraction.

2.2 Continual Learning

Continual Learning (CL) aims to train models to learn from a continuous data stream Biesialska et al. (2020). Researchers usually formulate the data stream as a sequence of tasks arriving at different times. In CL, models usually suffer from catastrophic forgetting, and existing CL methods mainly fall into three categories to alleviate this problem: 1) Regularization-based methods Li and Hoiem (2017); Aljundi et al. (2018); Kirkpatrick et al. (2017) put regularization constraints on the parameter space while learning subsequent tasks to preserve acquired knowledge. 2) Parameter isolation-based methods Mallya and Lazebnik (2018); Mallya et al. (2018); Serra et al. (2018); Liu et al. (2018) allocate subsets of the model parameters or extend new parameters for specific tasks. For example, Mallya and Lazebnik (2018) prune parameters by heuristics and Liu et al. (2018) learn gradient masks to instantiate new sub-networks for each task. 3) Rehearsal-based methods Lopez-Paz and Ranzato (2017); Shin et al. (2017); Seff et al. (2017) store a few instances from previous tasks in a memory and replay the memory data when learning new tasks to remind the model of previously learned knowledge. In this paper, we focus on the rehearsal-based approach for CRE.

3 Task Formulation

Following previous works Cui et al. (2021); Zhao et al. (2022), CRE is formulated as a sequence of tasks $(\mathcal{T}_{1},\mathcal{T}_{2},\dots,\mathcal{T}_{K})$, where the $k$-th task $\mathcal{T}_{k}$ has its own training set $\mathcal{D}_{k}$, testing set $Q_{k}$ and relation set $R_{k}$. Every instance $(x_{i},y_{i})\in\mathcal{D}_{k}\cup Q_{k}$ corresponds to a specific relation $y_{i}\in R_{k}$. More specifically, in the $k$-th task, a CRE model is trained on $\mathcal{D}_{k}$ to learn the new relations $R_{k}$, and it should also be capable of handling all the seen relations $\bar{R}_{k}=\cup_{i=1}^{k}R_{i}$, i.e., the model will be evaluated on the union of all seen testing sets $\bigcup_{j=1}^{k}Q_{j}$. To circumvent catastrophic forgetting in CRE, following previous works Cui et al. (2021); Zhao et al. (2022), we use a memory to store a few instances $m_{r}=\{(x_{1},r),\dots,(x_{B},r)\}$ for each previously learned relation $r$, and then replay them in the subsequent training, where $B$ is the constrained memory size. After the $k$-th task, we have the memory $\mathcal{M}_{k}=\cup_{r\in\bar{R}_{k}}m_{r}$, where $\bar{R}_{k}$ is the set of all observed relations.
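
For concreteness, the data structures above can be sketched as follows. This is a minimal, illustrative Python sketch; the class and variable names are ours and are not taken from any released code.

```python
# Illustrative sketch of the CRE data structures (names are ours, not from any released code).
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Instance:
    text: str       # sentence containing the two entities
    relation: str   # gold relation label y_i

# D_k / Q_k: training / testing instances of the k-th task (tasks are 1-indexed)
train_sets: List[List[Instance]] = []   # train_sets[k-1] corresponds to D_k
test_sets: List[List[Instance]] = []    # test_sets[k-1]  corresponds to Q_k

B = 10  # constrained memory size per relation
memory: Dict[str, List[Instance]] = {}  # memory[r] == m_r with |m_r| <= B

def update_memory(relation: str, selected: List[Instance]) -> None:
    """Store at most B typical instances for a newly learned relation."""
    memory[relation] = selected[:B]
```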

4 Methodology

Algorithm 1 Fea for the $k$-th task
Input: $f(\Theta,\theta_{k-1})$: relation extraction model trained on tasks $(\mathcal{T}_{1},\dots,\mathcal{T}_{k-1})$ with parameters $(\Theta,\theta_{k-1})$; $\mathcal{M}_{k-1}$: memory storing typical instances of the relations in $(\mathcal{T}_{1},\dots,\mathcal{T}_{k-1})$; $\mathcal{D}_{k}$: training dataset of the $k$-th task $\mathcal{T}_{k}$.
Output: $f(\Theta,\theta_{k})$: relation extraction model trained on tasks $(\mathcal{T}_{1},\dots,\mathcal{T}_{k})$ with parameters $(\Theta,\theta_{k})$; $\mathcal{M}_{k}$: memory storing typical instances of the relations in $(\mathcal{T}_{1},\dots,\mathcal{T}_{k})$.
1:  $\theta_{k}\leftarrow\theta_{k-1}\cup\theta_{new}$
2:  for $i\leftarrow 1$ to $epoch_{1}$ do
3:     Update $(\Theta,\theta_{k})$ with $\nabla\mathcal{L}$ on $\mathcal{D}_{k}$
4:  end for
5:  Select memory instances $m_{k}$ for the relations in $\mathcal{T}_{k}$ from $\mathcal{D}_{k}$ through the K-means algorithm
6:  $\mathcal{M}_{k}\leftarrow\mathcal{M}_{k-1}\cup m_{k}$
7:  for $i\leftarrow 1$ to $epoch_{2}$ do
8:     Update $(\Theta,\theta_{k})$ with $\nabla\mathcal{L}$ on $\mathcal{M}_{k}$
9:  end for

Our proposed Fea is model-independent, and we use the relation extraction model $f(\Theta,\theta)$ from Cui et al. (2021) as our backbone model. The extraction model consists of two components: an encoder with parameters $\Theta$ and a classifier with parameters $\theta$.

4.1 Relation Extraction Model

Encoder

Given an input instance $x$ with two entities $e_{1}$ and $e_{2}$, we first insert four special markers $[E_{11}]$/$[E_{12}]$ and $[E_{21}]$/$[E_{22}]$ to denote the start/end positions of the head and tail entities:

$(\dots,[E_{11}],e_{1},[E_{12}],\dots,[E_{21}],e_{2},[E_{22}],\dots).$

Then, we use BERT to encode the input $x$ and obtain the hidden representations of $[E_{11}]$ and $[E_{21}]$, i.e., $\mathbf{h}_{11}$ and $\mathbf{h}_{21}$. Finally, we compute the representation $\mathbf{h}$ of $x$ as follows:

$\mathbf{h}={\rm LayerNorm}(\mathbf{W}_{cat}[\mathbf{h}_{11};\mathbf{h}_{21}]+\mathbf{b}_{cat})$, (1)

where $[;]$ is the concatenation operation, and $\mathbf{W}_{cat}\in\mathbb{R}^{d\times 2d}$ and $\mathbf{b}_{cat}\in\mathbb{R}^{d}$ are trainable parameters.
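
For illustration, a minimal PyTorch sketch of this encoder is given below. It assumes a HuggingFace bert-base-uncased backbone whose vocabulary already contains the four entity markers, and that the token positions of $[E_{11}]$ and $[E_{21}]$ are provided as inputs; the names are ours, and the authors' implementation may differ in detail.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class EntityMarkerEncoder(nn.Module):
    """Sketch of Eq. (1): concatenate the [E11]/[E21] hidden states and project back to d."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.proj = nn.Linear(2 * hidden_dim, hidden_dim)   # W_cat and b_cat of Eq. (1)
        self.layer_norm = nn.LayerNorm(hidden_dim)

    def forward(self, input_ids, attention_mask, e11_pos, e21_pos):
        # hidden: (batch, seq_len, d)
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        idx = torch.arange(hidden.size(0), device=hidden.device)
        h11 = hidden[idx, e11_pos]   # hidden state at the [E11] position
        h21 = hidden[idx, e21_pos]   # hidden state at the [E21] position
        return self.layer_norm(self.proj(torch.cat([h11, h21], dim=-1)))
```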

Classifier

The classifier computes the relation probability of $x$ from the encoder output $\mathbf{h}$ as follows:

$P(y|x)={\rm softmax}(\mathbf{W}\mathbf{h})$, (2)

where $\mathbf{W}\in\mathbb{R}^{c\times d}$ is a trainable parameter and $c$ is the number of seen relations.
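
A hedged sketch of such a classifier, together with the classifier extension performed in Algorithm 1 (line 1) when new relations arrive, could look as follows; this is an illustration under our own naming, not the authors' released code.

```python
import torch
import torch.nn as nn

class ExpandableClassifier(nn.Module):
    """Softmax classifier over all seen relations; rows are appended for new tasks."""
    def __init__(self, hidden_dim: int, num_relations: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_relations, hidden_dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h @ self.weight.t()   # logits; softmax is applied inside the loss

    @torch.no_grad()
    def expand(self, num_new: int) -> None:
        """theta_k <- theta_{k-1} U theta_new (Algorithm 1, line 1)."""
        new_rows = torch.empty(num_new, self.weight.size(1), device=self.weight.device)
        nn.init.xavier_uniform_(new_rows)
        self.weight = nn.Parameter(torch.cat([self.weight.data, new_rows], dim=0))
```

Note that after calling `expand`, the optimizer has to be re-created so the newly added rows become trainable.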

Fewrel
Models T1 T2 T3 T4 T5 T6 T7 T8 T9 T10
EA-EMR 89.0 69.0 59.1 54.2 47.8 46.1 43.1 40.7 38.6 35.2
CML 91.2 74.8 68.2 58.2 53.7 50.4 47.8 44.4 43.1 39.7
RPCRE 97.9 92.7 91.6 89.2 88.4 86.8 85.1 84.1 82.2 81.5
RPCRE† 97.8 94.7 92.7 90.3 89.4 88.0 87.1 85.8 84.4 82.8
EMAR† 98.1 94.3 92.3 90.5 89.7 88.5 87.2 86.1 84.8 83.6
CRL 98.1 94.6 92.5 90.5 89.4 87.9 86.9 85.6 84.5 83.1
Fea (Ours) 98.3 94.8 93.1 91.7 90.8 89.1 87.9 86.8 85.8 84.3
Tacred
Models T1 T2 T3 T4 T5 T6 T7 T8 T9 T10
EA-EMR 47.5 40.1 38.3 29.9 24.0 27.3 26.9 25.8 22.9 19.8
CML 57.2 51.4 41.3 39.3 35.9 28.9 27.3 26.9 24.8 23.4
RPCRE 97.6 90.6 86.1 82.4 79.8 77.2 75.1 73.7 72.4 72.4
RPCRE† 97.5 92.2 89.1 84.2 81.7 81.0 78.1 76.1 75.0 75.3
EMAR† 98.3 92.0 87.4 84.1 82.1 80.6 78.3 76.6 76.8 76.1
CRL 97.7 93.2 89.8 84.7 84.1 81.3 80.2 79.1 79.0 78.0
Fea (Ours) 97.6 92.6 89.5 86.4 84.8 82.8 81.0 78.5 78.5 77.7
Table 1: Accuracy on all observed relations at the stage of learning current tasks. † denotes our reproduced results with the open codebases. Other results are directly taken from Zhao et al. (2022). Best results are in boldface. We report the average accuracy over 5 different runs.

4.2 Two Stage Learning

In CRE, after the $(k{-}1)$-th task of continual relation extraction, we have the model $f(\Theta,\theta_{k-1})$, which has been trained on $(\mathcal{T}_{1},\dots,\mathcal{T}_{k-1})$. When learning new relations in the $k$-th task, we take a frustratingly easy but effective approach (Fea) with fast adaption and balanced tuning to train the model.

Fast Adaption (FA)

aims to rapidly learn the knowledge of new tasks. In FA, as shown in Algorithm 1, we first extend the classifier for new relations (line 1), and then train the model with only the instances of new relations (lines 2-4) to quickly learn new relations.

Memory Selection

After Fast Adaption, for the relations in the current task $\mathcal{T}_{k}$, we select their most informative and diverse instances from the training data $\mathcal{D}_{k}$ to update the memory (Algorithm 1: lines 5-6). Following Han et al. (2020); Cui et al. (2021); Zhao et al. (2022), we apply the K-means algorithm to cluster the instances of each relation $r$ in $\mathcal{T}_{k}$ and select the instance closest to each cluster center as a memorized instance of relation $r$. The number of K-means clusters is set to the memory size for each relation.
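
A minimal sketch of this selection step, assuming each instance has already been encoded into a feature vector (e.g., via Eq. (1)), is shown below; scikit-learn's KMeans is used here purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_memory(features: np.ndarray, memory_size: int) -> list:
    """Return the indices of the instances closest to each of `memory_size` K-means centers."""
    kmeans = KMeans(n_clusters=memory_size, n_init=10).fit(features)
    selected = []
    for center in kmeans.cluster_centers_:
        dists = np.linalg.norm(features - center, axis=1)
        selected.append(int(dists.argmin()))   # instance nearest to this cluster center
    return selected
```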

Balanced Tuning (BT)

aims to train the model to distinguish old and new relations. However, in CRE, the number of instances of old relations in the memory is significantly smaller than that of current relations in $\mathcal{D}_{k}$, which causes a severe data imbalance problem. Therefore, at the Balanced Tuning stage (Algorithm 1: lines 7-9), we train the model only with the updated memory $\mathcal{M}_{k}$, where all seen relations have an equal number of instances.

4.3 Training and Inference

The loss function of both the FA and BT stages is defined as follows:

$\mathcal{L}=\sum^{|\mathcal{D}^{*}|}_{i=1}-\log{P(y_{i}|x_{i})}$, (3)

where $(x_{i},y_{i})$ is an instance from $\mathcal{D}^{*}$, and $\mathcal{D}^{*}$ denotes $\mathcal{D}_{k}$ at the FA stage or $\mathcal{M}_{k}$ at the BT stage. During inference, we select the relation with the maximum probability as the predicted relation.
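
Putting the pieces together, the two-stage procedure of Algorithm 1 can be sketched as follows. This is an illustrative reimplementation under our own naming (reusing the encoder and classifier sketches above), not the released code.

```python
import torch
import torch.nn.functional as F

def train_stage(encoder, classifier, loader, optimizer, epochs):
    """One stage (FA or BT): minimize the cross-entropy loss of Eq. (3) on the given data."""
    encoder.train(); classifier.train()
    for _ in range(epochs):
        for batch in loader:
            h = encoder(batch["input_ids"], batch["attention_mask"],
                        batch["e11_pos"], batch["e21_pos"])
            loss = F.cross_entropy(classifier(h), batch["labels"])
            optimizer.zero_grad(); loss.backward(); optimizer.step()

def fea_task(encoder, classifier, new_loader, build_memory_loader, optimizer,
             epochs_fa=10, epochs_bt=10):
    # 1) Fast Adaption: train only on the new data D_k
    train_stage(encoder, classifier, new_loader, optimizer, epochs_fa)
    # 2) Memory selection (K-means, see above) produces the class-balanced memory M_k
    balanced_loader = build_memory_loader()
    # 3) Balanced Tuning: train only on the balanced memory M_k
    train_stage(encoder, classifier, balanced_loader, optimizer, epochs_bt)
```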

5 Experiments

5.1 Datasets and Evaluation Metric

Datasets

Following Han et al. (2020); Cui et al. (2021); Zhao et al. (2022), we evaluate Fea on two widely used datasets, Fewrel and Tacred. Fewrel Han et al. (2018b) is originally proposed for few-shot relation extraction and consists of 100 relations with 700 instances per relation. Previous works Han et al. (2020); Cui et al. (2021); Zhao et al. (2022) construct the CRE dataset using 80 relations of Fewrel, and divide them into 10 tasks to form a task sequence. The training, validation and test split ratio is 3:1:1. Tacred Zhang et al. (2017) is a relation extraction dataset with 42 relations (including ‘no relation’) and 106,264 instances. Cui et al. (2021); Zhao et al. (2022) remove the instances of ‘no relation’ and divide the remaining relations into 10 tasks to form a task sequence. The number of training and testing samples for each relation is limited to 320 and 40, respectively.

Evaluation Metric

Following Han et al. (2020); Cui et al. (2021), after training on each new task, the model is evaluated by classification accuracy on the test data of all seen relations.
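
Concretely, the metric after the $k$-th task can be computed as sketched below; this is illustrative code, and the loader and field names are our own assumptions.

```python
import torch

@torch.no_grad()
def evaluate(encoder, classifier, seen_test_loaders):
    """Accuracy on the union of the test sets Q_1 ... Q_k of all relations seen so far."""
    encoder.eval(); classifier.eval()
    correct, total = 0, 0
    for loader in seen_test_loaders:          # one loader per Q_1, ..., Q_k
        for batch in loader:
            h = encoder(batch["input_ids"], batch["attention_mask"],
                        batch["e11_pos"], batch["e21_pos"])
            pred = classifier(h).argmax(dim=-1)
            correct += (pred == batch["labels"]).sum().item()
            total += batch["labels"].size(0)
    return correct / total
```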

5.2 Experimental Setup

Parameter Settings

We use a random sampling strategy to construct continual relation task sequences: all relations of a dataset are randomly divided into 10 sets to simulate 10 tasks. For fairness, we use the same random seeds as Cui et al. (2021); Zhao et al. (2022), so the task sequences are exactly the same. We report the average accuracy over 5 different sampled task sequences. Following Cui et al. (2021), we use bert-base-uncased as our encoder and Adam as our optimizer, with a learning rate of 1e-3 for non-BERT modules and 1e-5 for the BERT module. The memory size is 10 in our main experiments (we also explore the influence of memory size in Appendix B). Both the fast adaption and balanced tuning stages contain 10 epochs. We do not tune the hyperparameters of Fea, and simply set them the same as Cui et al. (2021). We run our code on a single NVIDIA A40 GPU with 48GB of memory.

Baselines

We compare Fea with the following baselines in our experiments: EA-EMR Wang et al. (2019), CML Wu et al. (2021), EMAR Han et al. (2020), RP-CRE Cui et al. (2021), CRL Zhao et al. (2022). Please refer to appendix A for details of these baselines.

5.3 Main Results

The performance of Fea and the baselines is shown in Table 1. Fea achieves the best result at all task stages (T1-T10) on Fewrel, and comparable results with the best-performing model CRL on Tacred. We think Fea achieves the new state-of-the-art on Fewrel because Fewrel is a class-balanced dataset, which is consistent with the training data distribution of BT. Although Fea does not achieve the best performance on Tacred, it is only 0.3 accuracy below the more complex CRL (note that Fea outperforms CRL by 1.2 accuracy on Fewrel). These results show the effectiveness of Fea on both balanced and imbalanced datasets. We also explore the influence of memory size (the number of clusters in memory selection) on Fea and find that Fea is more stable than the baselines; please refer to Appendix B for details. Moreover, Fea has better model efficiency than the baselines; please refer to Appendix C for details.

6 Analysis

Methods Fewrel Tacred
Fea 84.3 77.7
remove BT 75.8 71.2
remove FA 81.1 74.9
remove FA and BT 75.6 71.2
Table 2: Performances of models with different training data at two learning stages.

6.1 Ablation Study

To explore the effectiveness of Fea, we conduct an ablation study with three ablation methods: 1) remove BT, which trains the model on the previous memory and all new data (i.e., $\mathcal{M}_{k-1}\cup\mathcal{D}_{k}$) at the second stage; 2) remove FA, which trains the model on $\mathcal{M}_{k-1}\cup\mathcal{D}_{k}$ at the first stage; 3) remove FA and BT, which directly trains the model on $\mathcal{M}_{k-1}\cup\mathcal{D}_{k}$ at each task stage. As shown in Table 2: a) Fea significantly outperforms “remove BT” by 8.5 and 6.5 accuracy on Fewrel and Tacred, respectively, which shows that BT is essential for Fea. b) “remove FA” leads to drops of 3.2 and 2.8 accuracy on Fewrel and Tacred, respectively, while “remove FA and BT” performs comparably to “remove BT”, showing that the effectiveness of FA depends on the presence of BT. In fact, FA works because it preserves the potential of memory data for the following BT, which we explore in Section 6.4. We conclude that both BT and FA are essential for Fea. We also explore more ablation methods in Appendix D.

Methods Error latter former inner
Fea 15.7% 63.5% 28.6% 7.9%
remove BT 24.2% 90.3% 6.1% 3.6%
remove FA 18.9% 80.9% 13.4% 6.3%
CRL 16.9% 65.2% 28.0% 6.8%
EMAR 16.6% 70.0% 23.1% 6.9%
RPCRE 17.2% 65.2% 28.2% 6.6%
Table 3: Error analysis on Fewrel. Fea outperforms other models, mainly since it has fewer latter errors.

6.2 Error Analysis

The main results and ablation study show that Fea with FA and BT is extremely effective. To understand why Fea works, we first perform an error analysis to explore where Fea outperforms other methods. Specifically, given an input instance $x$ that belongs to relation $y$ and the corresponding model prediction $y^{p}$, we say that the model makes an error if $y\neq y^{p}$. Assuming that the relations $y$ and $y^{p}$ appear in the $t$-th and $t^{p}$-th tasks, respectively, we distinguish three kinds of errors according to their order of appearance: latter error (i.e., $t<t^{p}$), former error (i.e., $t>t^{p}$) and inner error (i.e., $t=t^{p}$).
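
The three error types can be computed from model predictions as sketched below; this is an illustration, where `task_of` is an assumed mapping from a relation to the task in which it first appears.

```python
from collections import Counter

def error_breakdown(golds, preds, task_of):
    """Count latter (t < t^p), former (t > t^p) and inner (t = t^p) errors."""
    counts = Counter()
    for y, y_p in zip(golds, preds):
        if y == y_p:
            continue
        counts["error"] += 1
        t, t_p = task_of[y], task_of[y_p]
        if t < t_p:
            counts["latter"] += 1   # an old-relation instance predicted as a newer relation
        elif t > t_p:
            counts["former"] += 1
        else:
            counts["inner"] += 1
    return counts
```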

As shown in Table 3, latter errors account for a large proportion of errors for all methods. However, Fea has a smaller latter error proportion than the other methods and thus outperforms them. In addition, through the error analysis of Fea and the two ablation methods, we find that both FA and BT are essential for reducing latter errors. A more detailed error analysis shows that most latter errors happen among similar relations. For example, on Fewrel, “remove FA” and “remove BT” predict 37.3% and 52.1% of the instances of the former relation “location” as the latter similar relation “headquarters location”, while the ratios are just 21.1% and 7.8% for Fea and a supervised model (i.e., trained on all data together without forgetting), respectively. From the error analysis, we find that catastrophic forgetting happens because CRE models often mistakenly predict instances of existing old relations as newly emerging relations.

6.3 Explorations of Balanced Tuning

Figure 1: t-SNE of instances belonging to P974 (tributary) and P403 (mouth of the watercourse) after learning P403. “R1 → R2” means the model predicts instances of R1 as R2. “Golden Boundary” and “Prediction Boundary” are the decision boundaries computed by logistic regression with the gold labels and the model predictions, respectively.

Compared with “remove BT”, Fea reduces a large number of latter errors and thus alleviates catastrophic forgetting. To explain why BT reduces latter errors, we choose two similar relations with severe latter errors, P974 (tributary, appearing in the 7th task) and P403 (mouth of the watercourse, appearing in the 9th task), for a case study (more cases are shown in Appendix F). We draw the t-SNE of instances belonging to P974 and P403 for “remove BT” and Fea after learning P403. As shown in Figure 1(a), for “remove BT”, the model tends to establish a skewed decision boundary between P974 and P403 due to the data imbalance (the skewed boundary phenomenon between minority and majority classes is also observed in imbalanced learning works Khan et al. (2019); Kim and Kim (2020)). In contrast, Fea with BT establishes a more balanced decision boundary (Figure 1(b)) and thus reduces many latter errors. To further confirm whether BT helps build a better decision boundary, we drop the classifiers of Fea and “remove BT” after training, and retrain an upper-bound classifier (UBC) with all training data to see the performance gap. As shown in the first four rows of Table 4: 1) Fea outperforms “remove BT”, while “Fea with UBC” and “remove BT with UBC” achieve comparable results on the two benchmarks, showing that BT works because it establishes a more balanced decision boundary. 2) Fea outperforms “remove BT” by 7.5 accuracy on average over the two benchmarks, showing that the intrinsic data imbalance between old and new data, which leads to a skewed decision boundary, is one of the main causes of catastrophic forgetting.
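
The UBC probe can be sketched as follows; this is illustrative code under our own naming, which freezes the trained encoder and fits a fresh linear classifier on all training data seen so far.

```python
import torch
import torch.nn.functional as F

def retrain_ubc(encoder, all_data_loader, num_relations, hidden_dim=768,
                epochs=10, lr=1e-3, device="cuda"):
    """Freeze the trained encoder and fit a fresh classifier on all seen training data."""
    for p in encoder.parameters():
        p.requires_grad_(False)       # keep the continually trained encoder fixed
    encoder.eval()
    ubc = torch.nn.Linear(hidden_dim, num_relations).to(device)
    optimizer = torch.optim.Adam(ubc.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in all_data_loader:
            with torch.no_grad():
                h = encoder(batch["input_ids"].to(device), batch["attention_mask"].to(device),
                            batch["e11_pos"].to(device), batch["e21_pos"].to(device))
            loss = F.cross_entropy(ubc(h), batch["labels"].to(device))
            optimizer.zero_grad(); loss.backward(); optimizer.step()
    return ubc
```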

We also explore the performance of a supervised method that trains on all data together without catastrophic forgetting. As shown in Table 4: 1) The two CRE models “with UBC” and “SUP.” significantly outperform “SUP. with frozen encoder”, which shows that the original BERT representation is not good enough to represent relations, and the pretrained encoder learns a lot of knowledge during training. 2) “SUP.” outperforms Fea and “remove BT” by 6.0 and 13.5 average accuracy on the two benchmarks, while it only outperforms “Fea with UBC” and “remove BT with UBC” by 2.0 and 2.1 average accuracy, showing that the forgetting of the BERT encoder may not be a serious problem for the performance of CRE models. 3) “with UBC” significantly improves the performance of Fea and “remove BT” by 4.0 and 11.2 average accuracy on the two benchmarks, showing that constructing a good classifier can greatly benefit CRE models.

Methods Few. Tac. Avg.
Fea 84.3 77.7 81.0
Fea with UBC 88.0 82.0 85.0
remove BT 75.8 71.2 73.5
remove BT with UBC 87.9 82.4 85.2
Supervised (SUP.) 89.5 84.5 87.0
SUP. with frozen encoder 77.3 66.2 71.8
Table 4: Performances of two kinds of supervised methods and Fea/“remove BT” with upper-bound classifier (UBC) on two benchmarks. A better classifier significantly improves the performance of CRE models.

6.4 Explorations of Fast Adaption

Fea outperforms “remove FA” by 3.0 average accuracy on the two benchmarks, showing the importance of training the model ONLY on new data at first. Table 3 shows that FA works because it helps reduce latter errors. Therefore, we begin with a case study on P974 and P403, which suffer from severe latter errors, to understand why FA is helpful (more cases are shown in Appendix G). We draw the t-SNE of instances belonging to P974 and P403 for “remove FA” and Fea after the two learning stages of the task containing P403. For “remove FA”, after the first stage, the model establishes a skewed decision boundary due to the data imbalance problem (Figure 2(a)). However, after BT, the decision boundary is still skewed (Figure 2(c)), showing that BT does not work well in this case. For Fea, in contrast, after FA, the memory instances of P974 and P403 scatter over the embedding space, and the model predicts all instances as P403 (Figure 2(b)), because the model has not yet needed to distinguish these two relations. After BT, the model establishes a relatively fair decision boundary (Figure 2(d)) compared with that of “remove FA”. These results show that removing FA (i.e., mixing memory data into the first stage) may hinder the following BT.

Figure 2: t-SNE of instances belonging to P974 and P403 after two stages of the task that contains P403.
Figure 3: Illustration of EMAR, RPCRE and Fea. Fea can be viewed as a simplified variant of EMAR and RPCRE.

Because BT updates the model on the memory data, we further use the gradient norm of the memory data to reflect its effect on BT. As shown in Table 5, in the BT stage, the memory data has a much smaller gradient norm in “remove FA” than in Fea, which shows that the memory data has a very limited effect. We believe the reasons are as follows: 1) For “remove FA”, the model has already learned parameters that distinguish the memory data between old and new relations at the first stage, and thus tends to keep the learned parameters (i.e., the skewed decision boundary) in the following BT. 2) In contrast, Fea utilizes the memory data only in BT, and can thus help the model establish a more balanced decision boundary from scratch in a balanced setting. These results show that it is important to learn new relations with only new data at first, and FA works because it retains the potential of memory data for the subsequent BT.
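
The gradient norm statistic can be computed as sketched below; this is one illustrative way to measure it, and the exact protocol used for Table 5 may differ.

```python
import torch
import torch.nn.functional as F

def memory_gradient_norm(encoder, classifier, memory_batch):
    """Total L2 norm of the parameter gradients induced by one memory batch."""
    encoder.zero_grad(); classifier.zero_grad()
    h = encoder(memory_batch["input_ids"], memory_batch["attention_mask"],
                memory_batch["e11_pos"], memory_batch["e21_pos"])
    loss = F.cross_entropy(classifier(h), memory_batch["labels"])
    loss.backward()
    params = list(encoder.parameters()) + list(classifier.parameters())
    squared = sum(p.grad.pow(2).sum() for p in params if p.grad is not None)
    return squared.sqrt().item()
```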

6.5 Less is More: Rethinking CRE models

In this section, we find that Fea can be viewed as a simplified variant of two strong CRE models (EMAR and RPCRE), and we rethink them to explain why Fea can outperform these more complex models. We hope our analysis can provide guidance for the design of future CRE models.

Rethinking EMAR

Han et al. (2020) propose an episodic memory activation and reconsolidation (EMAR) method for CRE. As shown in Figure 3(a), EMAR consists of three stages: 1) learning from new data; 2) Memory Replay and Activation (MRA): learning from the memory data and new data; 3) Memory Reconsolidation (MR): learning from the memory data with relation prototypes. Please refer to Han et al. (2020) for more model details.

Among these three stages, the first stage is the same as FA, and the third stage MR can be regarded as BT (compared with BT, MR further incorporates relation prototypes to train the model). However, the extra second stage MRA mixes the memory and new data to train the model, and thus introduces the data imbalance problem. According to our analysis, the data imbalance harms the model performance. Therefore, we only use the balanced memory data to train the model in MRA. As shown in Table 6, after using balanced data in MRA (EMAR: balanced MRA), the accuracy of EMAR on Fewrel and Tacred improves by 0.4 and 1.1, respectively. However, the performance is still slightly lower than that of Fea, which shows that incorporating the relation prototypes may not be helpful in CRE. As the relation prototypes are calculated from the memory data, we think the reason is that optimizing the model based on relation prototypes aggravates over-fitting to the memory data.

Methods Fewrel Tacred
Fea 27.8 45.0
remove FA 1.4 4.1
Table 5: The average gradient norm per step of memory data in BT on two benchmarks.
Methods Fewrel Tacred
Fea 84.3 77.7
EMAR 83.6 76.1
EMAR: balanced MRA 84.0 77.2
RPCRE 82.8 75.3
RPCRE: with normalized 83.0 76.5
Table 6: Performances of two typical baseline models after our modifications based on the principles of Fea.

Rethinking RPCRE

Cui et al. (2021) propose a refining network with relation prototypes for CRE (RPCRE). As shown in Figure 3(b), RPCRE consists of two learning stages: 1) initial training for the new task; 2) Refining input instances with Prototypes (RP): training a memory network (MN) to refine the representations of input instances with the relation prototypes for better classification. Please refer to Cui et al. (2021) for more model details.

The first stage is FA, and the second stage RP can be regarded as BT. Compared with BT, RP further utilizes a memory network (MN) to refine the representations of the input instances with relation prototypes. Specifically, MN treats each input instance as a query and the relation prototypes as keys and values. It uses an attention mechanism to aggregate the relation prototypes, and adds the aggregation result to the original representation to obtain a refined representation for classification.

However, MN is harmful to performance, for two potential reasons: 1) MN introduces implicit data imbalance. We find that there are gaps among the norms of different prototypes, and input instances tend to attend more to the relation prototype with the larger norm. Therefore, after refining, the representations of different relations are not balanced. To alleviate this problem, we normalize the norm of all relation prototypes to 1 (“with normalized”). As shown in Table 6, after normalization, the accuracy of RPCRE on Fewrel and Tacred improves by 0.2 and 1.2, respectively. 2) MN makes it more difficult for the CRE model to distinguish similar relations. According to our error analysis in Section 6.2, CRE models are confused by similar relations and are prone to latter errors. Given an instance, the attention-based MN tends to add the representations of its similar relation prototypes to it, so the representations of similar relations become even more confusable. For example, RPCRE predicts 40.7% of the instances of the former relation “followed by” as the latter relation “follow”, and 5.7% vice versa, while the ratios for Fea are much lower (30.7% and 1.4%, respectively).
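
The normalization fix can be sketched as follows on a simplified attention-based refinement step; this is a stand-in for illustration, not RPCRE's actual memory network.

```python
import torch
import torch.nn.functional as F

def refine_with_prototypes(h, prototypes, normalize=True):
    """h: (batch, d) instance representations; prototypes: (num_relations, d)."""
    if normalize:
        # the "with normalized" fix: rescale every prototype to unit L2 norm
        prototypes = F.normalize(prototypes, p=2, dim=-1)
    attn = torch.softmax(h @ prototypes.t(), dim=-1)   # attention over relation prototypes
    return h + attn @ prototypes                        # refined representation
```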

6.6 Suggestions for Future Work

Through a series of analyses of both Fea and two strong CRE baselines, we believe future CRE models should pay attention to the following issues: 1) Learn distinguishable features for similar relations. In CRE, catastrophic forgetting mainly happens among similar relations from different tasks, since there is not enough data to teach the model to distinguish them. Therefore, it may be helpful to design mechanisms that learn distinguishable features highlighting the differences among similar relations. 2) Establish a better classifier. In CRE, the forgetting that happens in the pretrained encoder is not serious, while the conventional softmax classifier tends to build a skewed decision boundary, leading to severe latter errors. Therefore, it can be helpful to design a better classifier for CRE models.

7 Conclusion

In this paper, we propose a frustratingly easy but effective approach (Fea) for CRE. Fea consists of two stages: 1) Fast Adaption (FA) warms up the model with only new data; 2) Balanced Tuning (BT) circumvents the intrinsic imbalance between new and old relations by finetuning on the balanced memory data. Despite its simplicity, Fea is extremely effective. We therefore conduct a series of analyses to understand why Fea works and why catastrophic forgetting happens. We also dive into two strong CRE baselines and find that Fea can be viewed as a simplified but more effective variant of them, and we show that these two baselines can be further enhanced by the principles of Fea. Based on our analysis, we also provide two actionable suggestions for future model design in CRE.

References

Appendix A Baselines

We compare Fea with the following baselines:

  • EA-EMR Wang et al. (2019), which uses an embedding alignment mechanism to maintain memory to alleviate the catastrophic forgetting problem.

  • CML Wu et al. (2021), which proposes a curriculum-meta learning method to alleviate catastrophic forgetting while remaining order-robust.

  • EMAR Han et al. (2020), which introduces a memory activation and reconsolidation mechanism for continual relation extraction.

  • RP-CRE Cui et al. (2021), which refines sample embeddings with relation prototypes to enhance continual relation extraction.

  • CRL Zhao et al. (2022), which proposes a consistent representation learning method that utilizes contrastive replay and knowledge distillation to alleviate catastrophic forgetting.

Ablation Methods Stage 1 Stage 2 Fewrel Tacred
Fea $\mathcal{D}_{k}$ $\mathcal{M}_{k}$ 84.3 77.7
A1 (remove BT) $\mathcal{D}_{k}$ $\mathcal{M}_{k-1}\cup\mathcal{D}_{k}$ 75.8 71.2
A2 (remove FA) $\mathcal{M}_{k-1}\cup\mathcal{D}_{k}$ $\mathcal{M}_{k}$ 81.1 74.9
A3 (remove FA and BT) $\emptyset$ $\mathcal{M}_{k-1}\cup\mathcal{D}_{k}$ 75.6 71.2
A4 $\emptyset$ $\mathcal{M}_{k-1}\cup\dots\cup\mathcal{M}_{k-1}\cup\mathcal{D}_{k}$ 73.1 70.1
A5 $\mathcal{D}_{k}$ $\mathcal{M}_{k-1}\cup\dots\cup\mathcal{M}_{k-1}\cup\mathcal{D}_{k}$ 78.1 73.8
Table 7: Performances of models with different training data at the two learning stages. $\mathcal{D}_{k}$ is the data of the new relations, $\mathcal{M}_{k-1}$ is the memory data of previously learned relations, and $\mathcal{M}_{k}$ is the memory data of previous and new relations.

Appendix B Influence of Memory Size

Figure 4: The results of different models with different memory sizes on Fewrel. The X-axis is the task stage and the Y-axis is the accuracy on the test set of all observed relations. We report the average over 5 different runs.
Figure 5: The results of different models with different memory sizes on Tacred. We report the average over 5 different runs.

For rehearsal-based CRE models, the memory size (the number of stored instances per relation) is a key factor for model performance. Therefore, in this section, we study the performance of Fea with different memory sizes. As shown in Figure 4 and Figure 5: 1) As the memory size decreases, the performance of all models drops, which shows the importance of memory size for CRE models. 2) On Fewrel, Fea outperforms the three baselines at all memory sizes. In addition, as the memory size decreases, the performance gap between Fea and the baselines tends to become larger. For example, on the final task, Fea outperforms RPCRE by 2.0, 1.5 and 1.1 accuracy with memory sizes 5, 10 and 20, respectively. 3) On Tacred, Fea outperforms RPCRE and EMAR at all three memory sizes, especially when the memory size is 5 or 10. CRL outperforms Fea when the memory size is 20; however, when the memory size is 10 or 5, Fea achieves comparable results with CRL. These results show that Fea has a more obvious advantage when the memory size is small. 4) By comparing the performance gap at T10 between M=5 and M=20, we also find that Fea performs more stably than all baselines.

Appendix C Model Efficiency

Models Fewrel Tacred
Accuracy Time Accuracy Time
Fea 84.3 173.7s 77.7 53.8s
RPCRE 82.8 347.2s 75.3 89.8s
EMAR 83.6 297.3s 76.1 70.8s
CRL 83.1 256.2s 78.0 67.7s
Table 8: Performances and training time of different models on two benchmarks. Compared with baselines, Fea takes less training time to update the model.

In real applications, CRE models need to learn new relations constantly, so we should also consider their learning efficiency. In this section, we compare the learning time (average training time per task) of Fea and the baselines on the two benchmarks. As shown in Table 8, under exactly the same hardware setting, Fea has better learning efficiency than all strong baselines.

Appendix D More Detail of Ablation Study

In Table 7, we explore 5 different ablation methods in total. $\mathcal{M}_{k-1}\cup\dots\cup\mathcal{M}_{k-1}\cup\mathcal{D}_{k}$ denotes up-sampled memory data, where the number of memory instances for each previously learned relation is the same as the number of its original training instances. As shown: 1) A3 and A4 perform poorly on both benchmarks, showing that it is important to design a two-stage learning CRE model. 2) A5 outperforms A4, showing that it is important to learn new relations at stage 1, and Fea outperforming A2 shows that stage 1 should use ONLY new data. 3) Fea and A2 outperform all other methods, showing that it is important to ensure balanced data at stage 2.

Appendix E More Detail of Error Analysis

To better explore the effectiveness of BT and FA, we conduct an error analysis for Fea, “remove BT” and “remove FA” on Fewrel in Table 9 and Table 10. From the results, we find: 1) The models make many latter errors. 2) The models wrongly predict many instances as relations appearing in the last task (i.e., [10]). 3) The models are unable to distinguish similar relations, and easily predict instances of former-occurring relations as later-occurring similar relations, e.g., “followed by” (P156, appearing in task [1]) as “follow” (P155, appearing in task [8]), “mother” (P25, appearing in task [4]) as “father” (P22, appearing in task [10]), and “tributary” (P974, appearing in task [7]) as “mouth of the watercourse” (P403, appearing in task [9]). 4) Removing either BT or FA significantly increases the latter errors, and thus aggravates catastrophic forgetting.

Appendix F Effectiveness of BT

We draw several relation pairs with severe latter error to show the effectiveness of BT in Figure 6.

Appendix G Effectiveness of FA

We draw several relation pairs with severe latter error to show the effectiveness of FA in Figure 7.

Relations Fea remove BT remove FA
Acc. TOP1 TOP2 Acc. TOP1 TOP2 Acc. TOP1 TOP2
P156/[1] 0.52 P155/[8]/0.31 P176/[10]/0.02 0.36 P155/[8]/0.31 P31/[10]/0.15 0.43 P155/[8]/0.27 P31/[10]/0.09
P84/[1] 0.94 P127/[5]/0.03 P6/[3]/0.01 0.91 P176/[10]/0.04 P127/[5]/0.02 0.93 P31/[10]/0.01 P1877/[10]/0.01
P39/[1] 0.94 P106/[4]/0.02 P410/[1]/0.01 0.88 P31/[10]/0.04 P106/[4]/0.04 0.92 P106/[4]/0.04 P31/[10]/0.01
P276/[1] 0.52 P159/[10]/0.21 P551/[7]/0.04 0.34 P159/[10]/0.52 P31/[10]/0.04 0.46 P159/[10]/0.37 P706/[5]/0.04
P410/[1] 0.99 P39/[1]/0.01 P106/[4]/0.01 0.98 P241/[1]/0.01 P31/[10]/0.01 0.99 P106/[4]/0.01 P26/[8]/0.00
P241/[1] 0.91 P137/[7]/0.05 P27/[8]/0.01 0.88 P137/[7]/0.04 P407/[10]/0.04 0.81 P137/[7]/0.11 P27/[8]/0.03
P177/[1] 1.00 P137/[7]/0.00 P937/[7]/0.00 0.98 P4552/[6]/0.01 P206/[6]/0.01 0.99 P206/[6]/0.01 P3450/[7]/0.00
P264/[1] 0.95 P175/[9]/0.03 P463/[5]/0.01 0.92 P175/[9]/0.02 P176/[10]/0.01 0.90 P175/[9]/0.04 P750/[4]/0.03
P412/[2] 0.99 P1303/[8]/0.01 P178/[7]/0.00 0.97 P1303/[8]/0.03 P178/[7]/0.00 0.99 P1303/[8]/0.01 P178/[7]/0.00
P361/[2] 0.27 P31/[10]/0.11 P131/[2]/0.07 0.21 P31/[10]/0.31 P1344/[10]/0.08 0.22 P31/[10]/0.21 P527/[8]/0.06
P1923/[2] 0.95 P1346/[4]/0.03 P710/[2]/0.01 0.87 P1346/[4]/0.07 P159/[10]/0.01 0.88 P1346/[4]/0.07 P710/[2]/0.04
P123/[2] 0.59 P178/[7]/0.34 P750/[4]/0.02 0.45 P178/[7]/0.37 P176/[10]/0.06 0.52 P178/[7]/0.34 P176/[10]/0.07
P118/[2] 0.96 P641/[10]/0.02 P1344/[10]/0.01 0.88 P1344/[10]/0.06 P641/[10]/0.04 0.93 P1344/[10]/0.04 P641/[10]/0.03
P131/[2] 0.66 P159/[10]/0.09 P706/[5]/0.09 0.44 P159/[10]/0.31 P31/[10]/0.08 0.46 P159/[10]/0.21 P17/[6]/0.06
P710/[2] 0.72 P1346/[4]/0.06 P1923/[2]/0.06 0.66 P1346/[4]/0.08 P1877/[10]/0.06 0.79 P1346/[4]/0.05 P991/[4]/0.04
P355/[2] 0.69 P527/[8]/0.11 P127/[5]/0.03 0.56 P31/[10]/0.18 P159/[10]/0.15 0.60 P159/[10]/0.10 P127/[5]/0.07
P6/[3] 0.99 P991/[4]/0.01 P355/[2]/0.01 0.98 P991/[4]/0.01 P22/[10]/0.01 0.99 P1877/[10]/0.01 P991/[4]/0.01
P400/[3] 0.91 P306/[9]/0.09 P31/[10]/0.01 0.84 P306/[9]/0.11 P31/[10]/0.04 0.82 P306/[9]/0.16 P361/[2]/0.01
P101/[3] 0.79 P135/[6]/0.04 P31/[10]/0.03 0.61 P31/[10]/0.15 P106/[4]/0.07 0.68 P106/[4]/0.10 P31/[10]/0.06
P140/[3] 0.95 P407/[10]/0.01 P101/[3]/0.01 0.88 P31/[10]/0.06 P407/[10]/0.03 0.93 P407/[10]/0.02 P31/[10]/0.01
P2094/[3] 1.00 P137/[7]/0.00 P937/[7]/0.00 0.99 P1344/[10]/0.01 P937/[7]/0.00 0.99 P1344/[10]/0.01 P937/[7]/0.00
P364/[3] 0.86 P407/[10]/0.13 P27/[8]/0.01 0.67 P407/[10]/0.33 P3450/[7]/0.00 0.68 P407/[10]/0.31 P495/[4]/0.01
P150/[3] 0.94 P1001/[6]/0.02 P551/[7]/0.01 0.89 P159/[10]/0.08 P31/[10]/0.01 0.86 P159/[10]/0.09 P527/[8]/0.01
P466/[3] 0.92 P127/[5]/0.03 P140/[3]/0.01 0.76 P159/[10]/0.09 P137/[7]/0.04 0.88 P159/[10]/0.04 P127/[5]/0.03
P449/[4] 0.98 P750/[4]/0.01 P1408/[5]/0.01 0.97 P750/[4]/0.02 P31/[10]/0.01 0.94 P750/[4]/0.04 P137/[7]/0.02
P674/[4] 0.81 P1877/[10]/0.06 P527/[8]/0.04 0.69 P1877/[10]/0.25 P527/[8]/0.03 0.78 P1877/[10]/0.14 P527/[8]/0.02
P991/[4] 0.97 P6/[3]/0.02 P102/[9]/0.01 0.99 P102/[9]/0.01 P137/[7]/0.00 0.98 P6/[3]/0.01 P102/[9]/0.01
P495/[4] 0.72 P407/[10]/0.11 P27/[8]/0.06 0.31 P407/[10]/0.46 P27/[8]/0.08 0.56 P407/[10]/0.17 P27/[8]/0.09
P1346/[4] 0.79 P1923/[2]/0.14 P710/[2]/0.05 0.83 P1923/[2]/0.09 P1877/[10]/0.04 0.82 P1923/[2]/0.11 P710/[2]/0.04
P750/[4] 0.89 P449/[4]/0.03 P178/[7]/0.02 0.89 P178/[7]/0.04 P176/[10]/0.04 0.91 P176/[10]/0.02 P178/[7]/0.01
P106/[4] 0.69 P641/[10]/0.13 P101/[3]/0.11 0.69 P641/[10]/0.13 P1303/[8]/0.07 0.79 P641/[10]/0.11 P1303/[8]/0.06
P25/[4] 0.91 P22/[10]/0.05 P26/[8]/0.02 0.39 P22/[10]/0.59 P26/[8]/0.03 0.69 P22/[10]/0.30 P26/[8]/0.01
P706/[5] 0.67 P206/[6]/0.12 P4552/[6]/0.09 0.59 P206/[6]/0.11 P31/[10]/0.11 0.54 P206/[6]/0.13 P4552/[6]/0.10
P127/[5] 0.51 P137/[7]/0.17 P176/[10]/0.09 0.29 P176/[10]/0.30 P137/[7]/0.19 0.36 P137/[7]/0.21 P176/[10]/0.20
P413/[5] 1.00 P137/[7]/0.00 P937/[7]/0.00 1.00 P137/[7]/0.00 P937/[7]/0.00 1.00 P137/[7]/0.00 P937/[7]/0.00
P463/[5] 0.81 P102/[9]/0.01 P175/[9]/0.01 0.64 P31/[10]/0.16 P175/[9]/0.05 0.70 P31/[10]/0.09 P175/[9]/0.04
P1408/[5] 0.99 P740/[5]/0.01 P3450/[7]/0.00 0.91 P159/[10]/0.09 P407/[10]/0.01 0.98 P159/[10]/0.01 P17/[6]/0.01
P59/[5] 0.99 P22/[10]/0.01 P137/[7]/0.00 0.95 P31/[10]/0.04 P22/[10]/0.01 0.99 P460/[9]/0.01 P31/[10]/0.01
P86/[5] 0.74 P58/[8]/0.10 P175/[9]/0.06 0.29 P1877/[10]/0.51 P175/[9]/0.15 0.70 P175/[9]/0.14 P1877/[10]/0.12
P740/[5] 0.77 P159/[10]/0.18 P937/[7]/0.02 0.67 P159/[10]/0.32 P27/[8]/0.01 0.66 P159/[10]/0.29 P937/[7]/0.01
Table 9: Error analysis of relations from task 1 to task 5 on Fewrel (relations from task 6 to task 10 are shown in Table 10 due to space limits). “Acc.” denotes the accuracy. “TOP1” and “TOP2” denote the first and second most confusing relations for each relation, respectively (the model wrongly predicts instances as these relations). “[*]” denotes the task id at which the relation appears. We highlight results with Acc. ≤ 0.7. From the results, we can find that: 1) The model makes many latter errors. 2) The model wrongly predicts many instances as relations appearing in the last task (i.e., [10]). 3) The model is unable to distinguish similar relations, and easily predicts instances of former-occurring relations as later-occurring similar relations, e.g., “followed by” (P156, appearing in task [1]) as “follow” (P155, appearing in task [8]), “mother” (P25, appearing in task [4]) as “father” (P22, appearing in task [10]), “tributary” (P974, appearing in task [7]) as “mouth of the watercourse” (P403, appearing in task [9]), and so on. 4) Removing either BT or FA significantly increases the latter errors, and thus aggravates catastrophic forgetting.
Relations Fea remove BT remove FA
Acc. TOP1 TOP2 Acc. TOP1 TOP2 Acc. TOP1 TOP2
P206/[6] 0.90 P706/[5]/0.03 P403/[9]/0.03 0.89 P150/[3]/0.04 P403/[9]/0.03 0.93 P403/[9]/0.04 P706/[5]/0.01
P17/[6] 0.71 P495/[4]/0.09 P407/[10]/0.07 0.64 P407/[10]/0.16 P159/[10]/0.08 0.72 P407/[10]/0.09 P1001/[6]/0.07
P136/[6] 0.87 P31/[10]/0.05 P641/[10]/0.02 0.76 P31/[10]/0.15 P641/[10]/0.03 0.82 P31/[10]/0.09 P641/[10]/0.03
P800/[6] 0.94 P176/[10]/0.01 P551/[7]/0.01 0.89 P31/[10]/0.06 P176/[10]/0.01 0.93 P86/[5]/0.01 P31/[10]/0.01
P4552/[6] 0.91 P706/[5]/0.06 P31/[10]/0.01 0.91 P31/[10]/0.06 P706/[5]/0.02 0.96 P706/[5]/0.02 P31/[10]/0.01
P1001/[6] 0.81 P495/[4]/0.06 P17/[6]/0.04 0.64 P159/[10]/0.16 P31/[10]/0.09 0.84 P159/[10]/0.07 P31/[10]/0.03
P931/[6] 0.96 P159/[10]/0.03 P706/[5]/0.01 0.94 P159/[10]/0.06 P137/[7]/0.01 0.94 P159/[10]/0.06 P137/[7]/0.00
P135/[6] 0.88 P463/[5]/0.03 P937/[7]/0.01 0.85 P31/[10]/0.06 P1344/[10]/0.01 0.88 P31/[10]/0.02 P101/[3]/0.02
P3373/[7] 0.68 P22/[10]/0.16 P40/[9]/0.11 0.54 P22/[10]/0.38 P40/[9]/0.04 0.77 P22/[10]/0.09 P40/[9]/0.09
P551/[7] 0.69 P937/[7]/0.20 P27/[8]/0.04 0.56 P159/[10]/0.18 P937/[7]/0.14 0.63 P937/[7]/0.21 P159/[10]/0.07
P974/[7] 0.90 P403/[9]/0.10 P3450/[7]/0.00 0.60 P403/[9]/0.39 P131/[2]/0.01 0.54 P403/[9]/0.45 P155/[8]/0.01
P921/[7] 0.75 P31/[10]/0.05 P641/[10]/0.04 0.51 P31/[10]/0.29 P1877/[10]/0.06 0.67 P31/[10]/0.13 P1877/[10]/0.04
P3450/[7] 0.99 P641/[10]/0.01 P937/[7]/0.00 0.93 P31/[10]/0.04 P1344/[10]/0.01 0.93 P31/[10]/0.03 P641/[10]/0.02
P137/[7] 0.74 P176/[10]/0.09 P127/[5]/0.07 0.71 P176/[10]/0.12 P31/[10]/0.04 0.74 P176/[10]/0.13 P159/[10]/0.04
P937/[7] 0.83 P551/[7]/0.06 P159/[10]/0.03 0.74 P159/[10]/0.12 P551/[7]/0.09 0.77 P551/[7]/0.07 P159/[10]/0.06
P178/[7] 0.81 P123/[2]/0.06 P176/[10]/0.04 0.69 P176/[10]/0.20 P31/[10]/0.04 0.74 P176/[10]/0.13 P306/[9]/0.03
P26/[8] 0.69 P40/[9]/0.11 P22/[10]/0.10 0.54 P22/[10]/0.36 P40/[9]/0.05 0.67 P22/[10]/0.16 P40/[9]/0.09
P1303/[8] 1.00 P137/[7]/0.00 P937/[7]/0.00 0.99 P1344/[10]/0.01 P137/[7]/0.00 1.00 P137/[7]/0.00 P937/[7]/0.00
P1435/[8] 1.00 P137/[7]/0.00 P3450/[7]/0.00 1.00 P137/[7]/0.00 P3450/[7]/0.00 1.00 P137/[7]/0.00 P3450/[7]/0.00
P527/[8] 0.68 P31/[10]/0.04 P641/[10]/0.03 0.64 P31/[10]/0.14 P159/[10]/0.06 0.71 P31/[10]/0.07 P159/[10]/0.06
P155/[8] 0.86 P176/[10]/0.01 P31/[10]/0.01 0.79 P31/[10]/0.09 P176/[10]/0.04 0.86 P176/[10]/0.02 P31/[10]/0.01
P27/[8] 0.91 P407/[10]/0.02 P937/[7]/0.01 0.81 P407/[10]/0.15 P17/[6]/0.01 0.94 P407/[10]/0.02 P137/[7]/0.01
P58/[8] 0.63 P1877/[10]/0.18 P57/[8]/0.14 0.24 P1877/[10]/0.69 P57/[8]/0.06 0.35 P1877/[10]/0.51 P57/[8]/0.10
P57/[8] 0.87 P1877/[10]/0.04 P58/[8]/0.04 0.59 P1877/[10]/0.36 P58/[8]/0.03 0.82 P1877/[10]/0.14 P58/[8]/0.01
P403/[9] 0.81 P974/[7]/0.16 P4552/[6]/0.01 0.95 P974/[7]/0.04 P17/[6]/0.01 0.98 P974/[7]/0.01 P159/[10]/0.01
P306/[9] 0.96 P400/[3]/0.02 P176/[10]/0.01 0.96 P31/[10]/0.02 P176/[10]/0.01 0.99 P176/[10]/0.01 P400/[3]/0.01
P175/[9] 0.93 P86/[5]/0.05 P57/[8]/0.01 0.96 P1877/[10]/0.02 P176/[10]/0.01 0.96 P1877/[10]/0.01 P57/[8]/0.01
P102/[9] 0.97 P463/[5]/0.01 P551/[7]/0.01 0.97 P937/[7]/0.01 P1344/[10]/0.01 0.96 P463/[5]/0.01 P937/[7]/0.01
P1411/[9] 1.00 P137/[7]/0.00 P3450/[7]/0.00 0.99 P31/[10]/0.01 P921/[7]/0.00 1.00 P137/[7]/0.00 P3450/[7]/0.00
P40/[9] 0.86 P22/[10]/0.05 P26/[8]/0.04 0.84 P22/[10]/0.15 P26/[8]/0.01 0.93 P22/[10]/0.06 P3373/[7]/0.01
P105/[9] 0.99 P31/[10]/0.01 P137/[7]/0.00 0.98 P31/[10]/0.02 P137/[7]/0.00 0.99 P31/[10]/0.01 P137/[7]/0.00
P460/[9] 0.71 P31/[10]/0.06 P155/[8]/0.03 0.62 P31/[10]/0.24 P22/[10]/0.02 0.76 P31/[10]/0.09 P135/[6]/0.01
P176/[10] 0.95 P178/[7]/0.03 P127/[5]/0.01 0.96 P159/[10]/0.01 P355/[2]/0.01 0.96 P178/[7]/0.01 P159/[10]/0.01
P641/[10] 0.98 P106/[4]/0.01 P101/[3]/0.01 0.99 P106/[4]/0.01 P937/[7]/0.00 0.99 P106/[4]/0.01 P937/[7]/0.00
P22/[10] 0.89 P3373/[7]/0.04 P40/[9]/0.02 0.98 P175/[9]/0.01 P3373/[7]/0.01 0.95 P40/[9]/0.02 P175/[9]/0.01
P31/[10] 0.81 P1411/[9]/0.04 P306/[9]/0.03 0.99 P361/[2]/0.01 P178/[7]/0.00 0.94 P136/[6]/0.01 P463/[5]/0.01
P1344/[10] 0.97 P2094/[3]/0.01 P118/[2]/0.01 0.99 P2094/[3]/0.01 P3450/[7]/0.00 0.98 P2094/[3]/0.01 P40/[9]/0.01
P407/[10] 0.92 P364/[3]/0.04 P27/[8]/0.02 0.99 P31/[10]/0.01 P137/[7]/0.00 0.97 P136/[6]/0.01 P27/[8]/0.01
P1877/[10] 0.86 P58/[8]/0.13 P57/[8]/0.01 0.99 P58/[8]/0.01 P937/[7]/0.00 0.96 P58/[8]/0.02 P57/[8]/0.01
P159/[10] 0.80 P740/[5]/0.07 P551/[7]/0.03 0.98 P150/[3]/0.01 P1001/[6]/0.01 0.94 P1408/[5]/0.01 P740/[5]/0.01
Table 10: Error analysis on Fewrel (relations from task 6 to task 10).
Figure 6: t-SNE of instances belonging to different similar relation pairs with severe latter errors. BT helps build a more balanced decision boundary between old and new relations, and thus reduces latter errors.
Figure 7: t-SNE of instances belonging to different similar relation pairs with severe latter errors. FA preserves the potential of memory instances for BT, and thus reduces latter errors.