Fine-grained Contrastive Learning for Relation Extraction
1 Contrastive Pre-training
1.1 Overview
This section introduces our pre-training method for learning high-quality entity and relation representations. We first construct informative representations for entities and relations, which we then use in a three-part pre-training objective featuring entity discrimination, relation discrimination, and masked language modeling.
1.2 Entity & Relation Representation
We construct entity and relation representations following ERICA [Qin2021ERICAIE]. For a document $d$, we use a pre-trained language model to encode $d$ and obtain the hidden states $\{\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_{|d|}\}$. Then, mean pooling is applied over the consecutive tokens of an entity to obtain its representation. Assuming $n_{\mathrm{start}}$ and $n_{\mathrm{end}}$ are the start and end indices of entity $e_i$ in document $d$, the entity representation $\mathbf{e}_i$ is computed as:
$$\mathbf{e}_i = \mathrm{MeanPool}\big(\mathbf{h}_{n_{\mathrm{start}}}, \ldots, \mathbf{h}_{n_{\mathrm{end}}}\big) \tag{1}$$
To form a relation representation, we concatenate the representations of two entities $e_i$ and $e_j$: $\mathbf{r}_{ij} = [\mathbf{e}_i; \mathbf{e}_j]$.
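To make this concrete, here is a minimal PyTorch sketch of the mean-pooling and concatenation steps, assuming the encoder's hidden states are already available as a tensor; the function names and dimensions are illustrative, not the exact implementation:

```python
import torch

def entity_representation(hidden_states: torch.Tensor, start: int, end: int) -> torch.Tensor:
    # Eq. 1: mean-pool the hidden states of the entity's consecutive tokens.
    # hidden_states: (seq_len, hidden_dim) encoder output for one document.
    # start, end: inclusive token indices of the entity mention.
    return hidden_states[start : end + 1].mean(dim=0)

def relation_representation(head: torch.Tensor, tail: torch.Tensor) -> torch.Tensor:
    # Concatenate head and tail entity representations: r_ij = [e_i; e_j].
    return torch.cat([head, tail], dim=-1)

# Example with stand-in data: a 12-token document encoded into 768-dim hidden states.
h = torch.randn(12, 768)
e1 = entity_representation(h, 2, 4)    # entity spanning tokens 2-4
e2 = entity_representation(h, 7, 8)    # entity spanning tokens 7-8
r12 = relation_representation(e1, e2)  # shape: (1536,)
```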
1.3 Entity Discrimination
For entity discrimination, we use the same method described in ERICA. The goal of entity discrimination is to infer the tail entity in a document given a head entity and a relation [Qin2021ERICAIE]. Given a sampled instance tuple $(d, e_i, r_{ij}, e_j)$, our model is trained to distinguish the ground-truth tail entity $e_j$ from the other entities in the document $d$. Specifically, we concatenate the relation name of $r_{ij}$, the head entity $e_i$, and a special token [SEP] in front of $d$ to get $d^*$. Then, we encode $d^*$ to obtain the entity representations using the method from Section 1.2. The contrastive learning objective for entity discrimination is formulated as:
$$\mathcal{L}_{\mathrm{ED}} = -\log \frac{\exp\big(\cos(\mathbf{e}_i, \mathbf{e}_j)/\tau\big)}{\sum_{e_k \in \mathcal{E}_d,\, k \neq i} \exp\big(\cos(\mathbf{e}_i, \mathbf{e}_k)/\tau\big)} \tag{2}$$
where $\cos(\cdot,\cdot)$ denotes the cosine similarity between two entity representations, $\mathcal{E}_d$ is the set of entities in $d^*$, and $\tau$ is a temperature hyper-parameter.
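A minimal PyTorch sketch of this objective follows, assuming the candidate entity representations from $d^*$ are already computed; the function name, the temperature value, and the convention that the head entity is excluded from the candidates are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def entity_discrimination_loss(head: torch.Tensor,
                               candidates: torch.Tensor,
                               tail_idx: int,
                               tau: float = 0.05) -> torch.Tensor:
    # Eq. 2: pull the ground-truth tail entity toward the head entity and push
    # away the other candidate entities, via a softmax over cosine similarities.
    # head: (hidden_dim,) representation of the head entity e_i.
    # candidates: (num_entities, hidden_dim) entities in d* (head itself excluded).
    # tail_idx: row of `candidates` holding the ground-truth tail entity e_j.
    sims = F.cosine_similarity(head.unsqueeze(0), candidates, dim=-1) / tau
    # Cross-entropy over the similarity logits equals the -log softmax form of Eq. 2.
    return F.cross_entropy(sims.unsqueeze(0), torch.tensor([tail_idx]))
```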
1.4 Relation Discrimination
To effectively learn representations for the downstream task of relation extraction, we conduct a Relation Discrimination (RD) task during pre-training. RD aims to distinguish whether two relations are semantically similar [Qin2021ERICAIE]. Existing methods [Peng2020LearningFC, Qin2021ERICAIE] require large amounts of automatically labeled data from distant supervision, which is noisy because not all sentences adequately express a relationship.
To address this, we introduce the learning order to make the model aware of the noise level of each relation instance. To efficiently incorporate learning order into the training process, we propose a fine-grained, noise-aware relation discrimination task.
In this new method, the noise level of each distantly supervised training instance controls the optimization process by re-weighting the contrastive objective. Intuitively, the model should learn more from high-quality, accurately labeled training instances than from noisy, inaccurately labeled ones. Hence, we assign higher weights to instances learned earlier in the learning-order denoising stage.
In practice, we sample a tuple pair of relation instances $t_A = (d_A, e_{A_1}, r_A, e_{A_2}, o_A)$ and $t_B = (d_B, e_{B_1}, r_B, e_{B_2}, o_B)$ from $\mathcal{T}$ such that $r_A = r_B$, where $d$ is a document; $e$ is an entity in $d$; $r$ is the relationship between the two entities; and $o$ is the first-learned order introduced in Section LABEL:sec:learning_order. Using the method from Section 1.2, we obtain the positive relation representations $\mathbf{r}_{t_A}$ and $\mathbf{r}_{t_B}$. To discriminate positive examples from negative ones, the fine-grained RD objective is defined as follows:
$$\mathcal{L}_{\mathrm{RD}} = -\,f(o_A)\,\log \frac{\exp\big(\cos(\mathbf{r}_{t_A}, \mathbf{r}_{t_B})/\tau\big)}{\mathcal{Z}}, \qquad \mathcal{Z} = \exp\big(\cos(\mathbf{r}_{t_A}, \mathbf{r}_{t_B})/\tau\big) + \sum_{t_C \in \mathcal{T}/\{t_A,\, t_B\}} f(o_C)\,\exp\big(\cos(\mathbf{r}_{t_A}, \mathbf{r}_{t_C})/\tau\big) \tag{3}$$
where $\cos(\cdot,\cdot)$ denotes the cosine similarity; $\tau$ is the temperature; $\beta$ is a hyper-parameter; and $t_C$ is a negative instance ($t_C \neq t_A, t_B$) sampled from $\mathcal{T}$. Relation instances $t_A$ and $t_C$ are re-weighted by the function $f$, which is defined as:
$$f(o) = 1 - \beta\,\frac{o - o_{\min}}{o_{\max} - o_{\min}} \tag{4}$$
where $\beta$ ($0 \leq \beta \leq 1$) is a hyper-parameter that weighs the negative samples; $o_{\max}$ and $o_{\min}$ are the maximum and minimum first-learned order, respectively. We increase the weight of a negative $t_C$ if it is a high-quality training instance (i.e., $o_C$ is small). Because all positives and negatives are discriminated from instance $t_A$, we control the overall weight of the term by the learning order $o_A$.
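For reference, here is a minimal PyTorch sketch of the re-weighted objective as reconstructed in Equations 3 and 4; the functional form of $f$, the default hyper-parameter values, and all names are illustrative assumptions rather than the exact implementation:

```python
import torch
import torch.nn.functional as F

def order_weight(o, o_min: float, o_max: float, beta: float = 0.5):
    # Eq. 4 (as reconstructed above): earlier-learned (low-order, cleaner) instances
    # receive weights near 1, later-learned ones near 1 - beta. beta = 0.5 is illustrative.
    return 1.0 - beta * (o - o_min) / (o_max - o_min)

def relation_discrimination_loss(r_a, r_b, r_negs, o_a, o_negs, o_min, o_max,
                                 tau: float = 0.05, beta: float = 0.5) -> torch.Tensor:
    # Eq. 3: relation-level InfoNCE over cosine similarities, with the anchor term
    # scaled by f(o_A) and each negative scaled by f(o_C).
    # r_a, r_b: (2 * hidden_dim,) positive pair of relation representations.
    # r_negs: (num_negs, 2 * hidden_dim) negative relation representations.
    # o_a, o_negs: first-learned orders of the anchor and of each negative.
    pos = F.cosine_similarity(r_a, r_b, dim=-1) / tau                    # scalar
    negs = F.cosine_similarity(r_a.unsqueeze(0), r_negs, dim=-1) / tau   # (num_negs,)
    neg_w = order_weight(o_negs, o_min, o_max, beta)                     # (num_negs,)
    denom = pos.exp() + (neg_w * negs.exp()).sum()
    return -order_weight(o_a, o_min, o_max, beta) * torch.log(pos.exp() / denom)

# Usage with stand-in tensors: one positive pair and 8 sampled negatives.
r_a, r_b = torch.randn(1536), torch.randn(1536)
r_negs = torch.randn(8, 1536)
loss = relation_discrimination_loss(r_a, r_b, r_negs, o_a=3.0,
                                    o_negs=torch.randint(0, 100, (8,)).float(),
                                    o_min=0.0, o_max=100.0)
```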
1.5 Overall Objective
We include the MLM task [Devlin2019BERTPO] to avoid catastrophic forgetting of language understanding [McCloskey1989CatastrophicII] and construct the following overall objective for FineCL:
$$\mathcal{L} = \mathcal{L}_{\mathrm{ED}} + \mathcal{L}_{\mathrm{RD}} + \mathcal{L}_{\mathrm{MLM}} \tag{5}$$
1.6 Pre-training Details
To ensure a fair comparison and highlight the effectiveness of Fine-grained Contrastive Learning, we align our pre-training data and settings to those used by ERICA. The pre-training dataset provided by [Qin2021ERICAIE] is constructed using distant supervision for RE by pairing documents from Wikipedia (English) with the Wikidata knowledge graph. This distantly labeled dataset creation method mirrors the method used to create the distantly labeled training set in DocRED [docred] but differs in that it is much larger and more diverse. It contains 1M documents, 7.2M relation instances, and 1040 relation types compared to DocRED’s 100k documents, 1.5M relation instances, and 96 relation types (not including no relation). Additional checks are performed to ensure no fact triples overlap between the training data and the test sets of the various downstream RE tasks.
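For intuition about where the label noise comes from, the following toy sketch illustrates distant supervision in general (it is not the authors' data pipeline): any co-occurring entity pair that matches a knowledge-graph triple is labeled with that triple's relation, even when the surrounding text does not express it. The triple, document, and function names below are hypothetical.

```python
# A hypothetical knowledge-graph fact used for distant labeling.
kb_triples = {("Barack Obama", "born_in", "Honolulu")}

def distant_label(doc_text: str, entities: list[str]):
    # Label every co-occurring entity pair that matches a KB triple.
    labels = []
    for head in entities:
        for tail in entities:
            for h, r, t in kb_triples:
                if h == head and t == tail and head in doc_text and tail in doc_text:
                    labels.append((head, r, tail))
    return labels

# The sentence expresses the relation here, but distant labeling would fire even
# for sentences that merely mention both entities, which is the source of noise.
print(distant_label("Barack Obama was born in Honolulu, Hawaii.",
                    ["Barack Obama", "Honolulu"]))
```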
Table 1: F1 and IgF1 (micro) on the DocRED test set across training-data sizes (1%, 10%, and 100%).
Model | 1% F1 | 1% IgF1 | 10% F1 | 10% IgF1 | 100% F1 | 100% IgF1
CNN* | - | - | - | - | 42.3 | 40.3
BiLSTM* | - | - | - | - | 51.1 | 50.3
HINBERT* | - | - | - | - | 55.6 | 53.7
CorefBERT* | 32.8 | 31.2 | 46.0 | 43.7 | 57.0 | 54.5
SpanBERT* | 32.2 | 30.4 | 46.4 | 44.5 | 57.3 | 55.0
ERNIE* | 26.7 | 25.5 | 46.7 | 44.2 | 56.6 | 54.2
MTB* | 29.0 | 27.6 | 46.1 | 44.1 | 56.9 | 54.3
CP* | 30.3 | 28.7 | 44.8 | 42.6 | 55.2 | 52.7
BERT | 19.9 | 18.8 | 45.2 | 43.1 | 56.6 | 54.4
RoBERTa | 29.6 | 27.9 | 47.6 | 45.7 | 58.2 | 55.9
ERICA-BERT | 22.9 | 21.7 | 48.5 | 46.4 | 57.4 | 55.2
ERICA-RoBERTa | 30.0 | 28.2 | 50.1 | 48.1 | 59.1 | 56.9
WCL-RoBERTa | 22.3 | 20.8 | 49.4 | 47.5 | 58.5 | 56.2
FineCL | 33.2 | 31.2 | 50.3 | 48.3 | 59.5 | 57.1
1.7 Supervised Adaptation
The primary focus of our work is to improve relationship representations learned during pre-training and, in doing so, improve performance on downstream RE tasks. To illustrate the effectiveness of our pre-training method, we use cross-entropy loss, as described in equation LABEL:eq:cross_entropy, to fine-tune our pre-trained FineCL model on document-level and sentence-level RE tasks. The baselines are pre-trained and fine-tuned with identical data and settings for a fair comparison.
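As a rough sketch of this supervised adaptation step, a relation representation from Section 1.2 can be passed to a linear classification head trained with cross-entropy; the head architecture, dimensions, and class count below are assumptions for illustration, not the exact configuration used:

```python
import torch
import torch.nn as nn

class RelationClassifier(nn.Module):
    # Hypothetical fine-tuning head: a linear layer over the concatenated
    # [head; tail] relation representation, trained with cross-entropy.
    def __init__(self, hidden_dim: int = 768, num_relations: int = 97):
        super().__init__()
        self.classifier = nn.Linear(2 * hidden_dim, num_relations)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, relation_reprs: torch.Tensor, labels: torch.Tensor):
        logits = self.classifier(relation_reprs)  # (batch, num_relations)
        return self.loss_fn(logits, labels), logits

# Usage with random stand-in data: a batch of 4 relation representations.
model = RelationClassifier()
reprs = torch.randn(4, 2 * 768)
labels = torch.randint(0, 97, (4,))
loss, logits = model(reprs, labels)
loss.backward()
```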
Document-level RE: To assess our framework's ability to extract document-level relations, we report performance on DocRED [docred]. We compare our model to the following baselines: (1) CNN [zeng-etal-2014-relation], (2) BiLSTM [lstm], (3) BERT [Devlin2019BERTPO], (4) RoBERTa [Liu2019RoBERTaAR], (5) MTB [BaldiniSoares2019MatchingTB], (6) CP [Peng2020LearningFC], (7 & 8) ERICA-BERT & ERICA-RoBERTa [Qin2021ERICAIE], and (9) WCL-RoBERTa [Peng2020LearningFC]. We fine-tune the pre-trained models on DocRED's human-annotated train/dev/test splits (see Appendix LABEL:app:docred for detailed experimental settings). We implement WCL with settings identical to our other pre-training experiments and, for a fair comparison, use RoBERTa instead of BERT as its base model, given the superior performance we observe from RoBERTa in all other experiments. Table 1 reports performance across multiple data-reduction settings (1%, 10%, and 100%), using an overall F1-micro score and an F1-micro score (IgF1) computed by ignoring fact triples in the test set that overlap with fact triples in the training and development splits. FineCL outperforms all baselines in all settings, offering evidence that it produces better relationship representations from noisy data.
Given that learning-order denoising weighs earlier-learned instances over later-learned instances, FineCL may be biased towards easier, more common relation classes. The increase in F1-micro performance could result from improved predictions on common relation classes at the expense of predictions on rare classes. To better understand the performance gains, we also report F1-macro and weighted F1-macro in Table 2. The results show that FineCL outperforms the top baselines on both F1-macro metrics, indicating that, on average, our method improves performance across all relation classes. However, the low F1-macro scores of all models highlight an area for improvement: future pre-trained RE models should focus on improving performance on long-tail relation classes.
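For reference, the reported metrics differ only in how per-class scores are averaged; the following scikit-learn sketch on hypothetical gold and predicted relation labels shows the three variants:

```python
from sklearn.metrics import f1_score

# Hypothetical gold and predicted relation labels for six instances.
gold = [0, 0, 0, 1, 2, 2]
pred = [0, 0, 1, 1, 2, 0]

micro = f1_score(gold, pred, average="micro")        # favors common classes
macro = f1_score(gold, pred, average="macro")        # averages per-class F1 equally
weighted = f1_score(gold, pred, average="weighted")  # per-class F1 weighted by support
print(micro, macro, weighted)
```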
Table 2: F1-macro and weighted F1-macro scores on the DocRED test set.
Model | F1-macro | F1-macro-weighted
BERT | 37.3 | 54.9
RoBERTa | 39.6 | 56.9
ERICA-BERT | 37.9 | 55.8
ERICA-RoBERTa | 40.1 | 57.8
WCL-RoBERTa | 39.9 | 57.2
FineCL | 40.7 | 58.2
Table 3: F1 scores on sentence-level RE (TACRED and SemEval-2010 Task 8) across training-data sizes (1%, 10%, and 100%).
Model | TACRED 1% | TACRED 10% | TACRED 100% | SemEval 1% | SemEval 10% | SemEval 100%
MTB* | 35.7 | 58.8 | 68.2 | 44.2 | 79.2 | 88.2
CP* | 37.1 | 60.6 | 68.1 | 40.3 | 80.0 | 88.5
BERT | 22.2 | 53.5 | 63.7 | 41.0 | 76.5 | 87.8
RoBERTa | 27.3 | 61.1 | 69.3 | 43.6 | 77.7 | 87.5
ERICA-BERT | 34.9 | 56.0 | 64.9 | 46.4 | 79.8 | 88.1
ERICA-RoBERTa | 41.1 | 61.7 | 69.5 | 50.3 | 80.9 | 88.4
WCL-RoBERTa | 37.6 | 61.3 | 69.7 | 47.0 | 80.0 | 88.3
FineCL | 43.7 | 62.7 | 70.3 | 51.2 | 81.0 | 88.7
Sentence-level RE: To assess our framework's ability to extract sentence-level relations, we report performance on TACRED [zhang2017tacred] and SemEval-2010 Task 8 [hendrickx-etal-2010-semeval]. We compare our model to MTB, CP, BERT, RoBERTa, ERICA-BERT, ERICA-RoBERTa, and WCL-RoBERTa (see Appendix LABEL:app:semeval_tacred for detailed experimental settings). Table 3 reports F1 scores across multiple data-reduction settings (1%, 10%, and 100%). Again, FineCL outperforms all baselines in all settings.