
Fine-grained Contrastive Learning for Relation Extraction

William Hogan        Jiacheng Li        Jingbo Shang
Department of Computer Science & Engineering
University of California, San Diego
{whogan,j9li,jshang}@ucsd.edu

1 Contrastive Pre-training

1.1 Overview

This section introduces our pre-training method for learning high-quality entity and relation representations. We first construct informative representations for entities and relations, which we then use in a three-part pre-training objective consisting of entity discrimination, relation discrimination, and masked language modeling.

1.2 Entity & Relation Representation

We construct entity and relation representations following ERICA [Qin2021ERICAIE]. For a document $d_{i}$, we use a pre-trained language model to encode $d_{i}$ and obtain the hidden states $\{\mathbf{h}_{1},\mathbf{h}_{2},\dots,\mathbf{h}_{|d_{i}|}\}$. We then apply mean pooling over the consecutive tokens of entity $e_{j}$ to obtain its entity representation. Assuming $n_{\mathrm{start}}$ and $n_{\mathrm{end}}$ are the start and end indices of entity $e_{j}$ in document $d_{i}$, the representation of $e_{j}$ is:

$\mathbf{m}_{e_{j}}=\mathrm{MeanPool}(\mathbf{h}_{n_{\mathrm{start}}},\dots,\mathbf{h}_{n_{\mathrm{end}}})$   (1)

To form a relation representation, we concatenate the representations of the two entities $e_{j1}$ and $e_{j2}$: $\mathbf{r}_{j1j2}=[\mathbf{e}_{j1};\mathbf{e}_{j2}]$.
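For concreteness, the following PyTorch-style sketch illustrates Eq. (1) and the concatenation step. The function names, span format, and tensor shapes are our own illustrative assumptions, not the released implementation.

```python
# A minimal sketch of the entity and relation representations (Eq. 1):
# mean-pool the hidden states of an entity's token span, then concatenate
# the head and tail entity vectors.
import torch

def entity_representation(hidden_states: torch.Tensor, start: int, end: int) -> torch.Tensor:
    """Mean-pool hidden states h_start..h_end (inclusive) for one entity mention."""
    # hidden_states: (seq_len, hidden_dim) for a single encoded document d_i
    return hidden_states[start:end + 1].mean(dim=0)

def relation_representation(hidden_states: torch.Tensor,
                            span_head: tuple, span_tail: tuple) -> torch.Tensor:
    """Concatenate head and tail entity vectors: r = [e_head; e_tail]."""
    e_head = entity_representation(hidden_states, *span_head)
    e_tail = entity_representation(hidden_states, *span_tail)
    return torch.cat([e_head, e_tail], dim=-1)   # (2 * hidden_dim,)

# Toy usage with random "encoder outputs"
h = torch.randn(128, 768)                        # one document encoded by a PLM
r = relation_representation(h, (3, 5), (40, 41))
print(r.shape)                                   # torch.Size([1536])
```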

1.3 Entity Discrimination

For entity discrimination, we use the same method described in ERICA. The goal of entity discrimination is to infer the tail entity in a document given a head entity and a relation [Qin2021ERICAIE]. The model distinguishes the ground-truth tail entity from other entities in the text. Given a sampled instance tuple $t^{i}_{jk}=(d_{i},e_{ij},r^{i}_{jk},e_{ik})$, our model is trained to distinguish the tail entity $e_{ik}$ from the other entities in the document $d_{i}$. Specifically, we concatenate the relation name of $r^{i}_{jk}$, the head entity $e_{ij}$, and a special token [SEP] in front of $d_{i}$ to get $d^{*}_{i}$. We then encode $d^{*}_{i}$ to obtain the entity representations using the method from Section 1.2. The contrastive learning objective for entity discrimination is formulated as:

$\mathcal{L}_{\mathrm{ED}}=-\sum_{t_{jk}^{i}\in\mathcal{T}^{\prime}}\log\frac{\exp\left(\cos\left(\mathbf{e}_{ij},\mathbf{e}_{ik}\right)/\tau\right)}{\sum_{l=1,l\neq j}^{\left|\mathcal{E}_{i}\right|}\exp\left(\cos\left(\mathbf{e}_{ij},\mathbf{e}_{il}\right)/\tau\right)}$   (2)

where $\cos(\cdot,\cdot)$ denotes the cosine similarity between two entity representations and $\tau$ is a temperature hyper-parameter.
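The objective in Eq. (2) has the familiar InfoNCE form: the ground-truth tail is the positive and the document's remaining entities are negatives. Below is a hedged PyTorch sketch; the batching scheme and argument names are assumptions, not the authors' code.

```python
# A sketch of the entity-discrimination loss (Eq. 2) for one training instance:
# score the head against candidate entities with temperature-scaled cosine
# similarity and apply cross-entropy with the true tail as the target.
import torch
import torch.nn.functional as F

def entity_discrimination_loss(head: torch.Tensor,     # (hidden_dim,)
                               tail: torch.Tensor,     # (hidden_dim,) positive
                               others: torch.Tensor,   # (num_neg, hidden_dim) other entities in d_i
                               tau: float = 0.05) -> torch.Tensor:
    candidates = torch.cat([tail.unsqueeze(0), others], dim=0)           # positive at index 0
    sims = F.cosine_similarity(head.unsqueeze(0), candidates, dim=-1) / tau
    # Cross-entropy over the candidates with target index 0 is the InfoNCE form of Eq. (2).
    return F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long))

# Toy usage
loss = entity_discrimination_loss(torch.randn(768), torch.randn(768), torch.randn(10, 768))
```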

1.4 Relation Discrimination

To learn representations that transfer effectively to the downstream task of relation extraction, we conduct a Relation Discrimination (RD) task during pre-training. RD aims to distinguish whether two relations are semantically similar [Qin2021ERICAIE]. Existing methods [Peng2020LearningFC, Qin2021ERICAIE] require large amounts of automatically labeled data from distant supervision, which is noisy because not all sentences adequately express a relationship.

Learning order can be introduced here to make the model aware of the noise level of each relation instance. To efficiently incorporate learning order into training, we propose a fine-grained, noise-aware relation discrimination objective.

In this new method, the noise level of each distantly supervised training instance controls the optimization process by re-weighting the contrastive objective. Intuitively, the model should learn more from high-quality, accurately labeled training instances than from noisy, inaccurately labeled ones. Hence, we assign higher weights to earlier-learned instances identified in the learning-order denoising stage.

In practice, we sample a pair of relation instances $t_{A}=(d_{A},e_{A_{1}},r_{A},e_{A_{2}},k_{A})$ and $t_{B}=(d_{B},e_{B_{1}},r_{B},e_{B_{2}},k_{B})$ from $\mathcal{T}^{\prime}$ with $r_{A}=r_{B}$, where $d$ is a document, $e$ is an entity in $d$, $r$ is the relationship between the two entities, and $k$ is the first-learned order introduced in Section LABEL:sec:learning_order. Using the method from Section 1.2, we obtain the positive relation representations $\mathbf{r}_{t_{A}}$ and $\mathbf{r}_{t_{B}}$. To discriminate positive examples from negative ones, the fine-grained RD objective is defined as follows:

$\mathcal{L}_{\mathrm{RD}}=-\sum_{t_{A},t_{B}\in\mathcal{T}^{\prime}}f(k_{A})\log\frac{\exp\left(\cos\left(\mathbf{r}_{t_{A}},\mathbf{r}_{t_{B}}\right)/\tau\right)}{\mathcal{Z}},$   (3)
$\mathcal{Z}=\sum_{t_{C}\in\mathcal{T}^{\prime}/\left\{t_{A}\right\}}^{N}f(k_{C})\exp\left(\cos\left(\mathbf{r}_{t_{A}},\mathbf{r}_{t_{C}}\right)/\tau\right)$

where $\cos(\cdot,\cdot)$ denotes the cosine similarity; $\tau$ is the temperature; $N$ is a hyper-parameter; and $t_{C}$ is a negative instance ($r_{A}\neq r_{C}$) sampled from $\mathcal{T}^{\prime}$. Relation instances $t_{A}$ and $t_{C}$ are re-weighted by the function $f$, which is defined as:

$f(k)=\alpha^{\frac{k_{\mathrm{max}}-k}{k_{\mathrm{max}}-k_{\mathrm{min}}}},$   (4)

where $\alpha$ ($\alpha>1$) is a hyper-parameter that weighs the negative samples, and $k_{\mathrm{max}}$ and $k_{\mathrm{min}}$ are the maximum and minimum first-learned order, respectively. We increase the weight of a negative $t_{C}$ if it is a high-quality training instance (i.e., its $k$ is small). Because all positives and negatives are discriminated from the anchor instance $t_{A}$, we control the overall weight of the term by its learning order $k_{A}$.
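A minimal PyTorch sketch of Eqs. (3) and (4) for a single anchor instance $t_{A}$ is shown below. The shapes, the value of $\alpha$, and the handling of the normalizer follow the description above, but all names are illustrative assumptions rather than the released code.

```python
# A sketch of the fine-grained, noise-aware RD loss: negatives are re-weighted
# by f(k) so that earlier-learned (cleaner) instances contribute more, and the
# whole term is scaled by f(k_A) of the anchor.
import torch
import torch.nn.functional as F

def f_weight(k: torch.Tensor, k_min: float, k_max: float, alpha: float = 2.0) -> torch.Tensor:
    """Eq. (4): larger weight for smaller (earlier) first-learned order k."""
    return alpha ** ((k_max - k) / (k_max - k_min))

def relation_discrimination_loss(r_a: torch.Tensor,     # anchor relation representation
                                 r_b: torch.Tensor,     # positive (same relation as anchor)
                                 r_neg: torch.Tensor,   # (N, dim) sampled negatives
                                 k_a: float,
                                 k_neg: torch.Tensor,   # (N,) first-learned orders of negatives
                                 k_min: float, k_max: float,
                                 tau: float = 0.05,
                                 alpha: float = 2.0) -> torch.Tensor:
    pos = torch.exp(F.cosine_similarity(r_a, r_b, dim=-1) / tau)
    neg_sims = torch.exp(F.cosine_similarity(r_a.unsqueeze(0), r_neg, dim=-1) / tau)
    # Z follows Eq. (3) as stated: a re-weighted sum over the sampled negatives t_C.
    z = (f_weight(k_neg, k_min, k_max, alpha) * neg_sims).sum()
    return -f_weight(torch.tensor(k_a), k_min, k_max, alpha) * torch.log(pos / z)

# Toy usage with random representations and learning orders
loss = relation_discrimination_loss(torch.randn(1536), torch.randn(1536),
                                    torch.randn(64, 1536), k_a=3.0,
                                    k_neg=torch.randint(1, 10, (64,)).float(),
                                    k_min=1.0, k_max=10.0)
```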

1.5 Overall Objective

We include the masked language modeling (MLM) task [Devlin2019BERTPO] to avoid catastrophic forgetting of language understanding [McCloskey1989CatastrophicII], and construct the following overall objective for FineCL:

$\mathcal{L}_{\mathrm{FineCL}}=\mathcal{L}_{\mathrm{ED}}+\mathcal{L}_{\mathrm{RD}}+\mathcal{L}_{\mathrm{MLM}}$   (5)
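For completeness, Eq. (5) is an unweighted sum of the three losses; the placeholder tensors below simply stand in for the outputs of the ED, RD, and MLM objectives in a training step.

```python
# A trivially small sketch of Eq. (5): the three pre-training losses are summed
# with equal weight before backpropagation.
import torch

loss_ed = torch.tensor(1.23, requires_grad=True)    # placeholder for L_ED
loss_rd = torch.tensor(0.87, requires_grad=True)    # placeholder for L_RD
loss_mlm = torch.tensor(2.10, requires_grad=True)   # placeholder for L_MLM

loss_finecl = loss_ed + loss_rd + loss_mlm          # L_FineCL = L_ED + L_RD + L_MLM
loss_finecl.backward()                              # gradients flow to all three objectives
```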

1.6 Pre-training Details

To ensure a fair comparison and highlight the effectiveness of Fine-grained Contrastive Learning, we align our pre-training data and settings with those used by ERICA. The pre-training dataset provided by [Qin2021ERICAIE] is constructed using distant supervision for RE by pairing documents from English Wikipedia with the Wikidata knowledge graph. This dataset creation method mirrors the one used to build the distantly labeled training set in DocRED [docred], but the resulting dataset is much larger and more diverse: it contains 1M documents, 7.2M relation instances, and 1,040 relation types, compared to DocRED’s 100k documents, 1.5M relation instances, and 96 relation types (not including no relation). Additional checks ensure that no fact triples overlap between the pre-training data and the test sets of the various downstream RE tasks.
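The overlap check described above can be sketched as a simple triple-level filter; the data structures below are assumptions for illustration, not the released preprocessing pipeline.

```python
# A sketch of the leakage check: drop any distantly labeled training instance
# whose (head, relation, tail) triple also appears in a downstream test set.
def filter_overlapping_triples(train_instances, test_triples):
    """train_instances: iterable of dicts with 'head', 'relation', 'tail' fields.
    test_triples: set of (head, relation, tail) tuples from downstream test sets."""
    kept = []
    for inst in train_instances:
        triple = (inst["head"], inst["relation"], inst["tail"])
        if triple not in test_triples:
            kept.append(inst)
    return kept

# Toy usage with Wikidata-style identifiers
train = [{"head": "Q76", "relation": "P26", "tail": "Q13133"},
         {"head": "Q42", "relation": "P800", "tail": "Q25169"}]
test = {("Q76", "P26", "Q13133")}
print(len(filter_overlapping_triples(train, test)))  # 1
```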

Size            1%             10%            100%
Metrics         F1     IgF1    F1     IgF1    F1     IgF1
CNN*            -      -       -      -       42.3   40.3
BiLSTM*         -      -       -      -       51.1   50.3
HINBERT*        -      -       -      -       55.6   53.7
CorefBERT*      32.8   31.2    46.0   43.7    57.0   54.5
SpanBERT*       32.2   30.4    46.4   44.5    57.3   55.0
ERNIE*          26.7   25.5    46.7   44.2    56.6   54.2
MTB*            29.0   27.6    46.1   44.1    56.9   54.3
CP*             30.3   28.7    44.8   42.6    55.2   52.7
BERT            19.9   18.8    45.2   43.1    56.6   54.4
RoBERTa         29.6   27.9    47.6   45.7    58.2   55.9
ERICA-BERT      22.9   21.7    48.5   46.4    57.4   55.2
ERICA-RoBERTa   30.0   28.2    50.1   48.1    59.1   56.9
WCL-RoBERTa     22.3   20.8    49.4   47.5    58.5   56.2
FineCL          33.2   31.2    50.3   48.3    59.5   57.1
Table 1: F1-micro scores reported on the DocRED test set. IgF1 ignores performance on fact triples in the test set overlapping with triples in the train/dev sets. (* denotes performance as reported in [Qin2021ERICAIE]; all other numbers are from our implementations).

1.7 Supervised Adaptation

The primary focus of our work is to improve relationship representations learned during pre-training and, in doing so, improve performance on downstream RE tasks. To illustrate the effectiveness of our pre-training method, we use cross-entropy loss, as described in equation LABEL:eq:cross_entropy, to fine-tune our pre-trained FineCL model on document-level and sentence-level RE tasks. The baselines are pre-trained and fine-tuned with identical data and settings for a fair comparison.
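For illustration, supervised adaptation can be sketched as a linear classification head over the entity-pair representation trained with cross-entropy. The head architecture, class count, and hyper-parameters below are our own assumptions, not necessarily the exact fine-tuning setup.

```python
# A sketch of cross-entropy fine-tuning for relation classification on top of
# the concatenated entity-pair representation [e_head; e_tail].
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationClassifier(nn.Module):
    def __init__(self, hidden_dim: int = 768, num_relations: int = 97):
        super().__init__()
        # num_relations is illustrative, e.g., DocRED's 96 relations plus "no relation"
        self.head = nn.Linear(2 * hidden_dim, num_relations)

    def forward(self, relation_reps: torch.Tensor) -> torch.Tensor:
        return self.head(relation_reps)              # unnormalized logits

# Toy fine-tuning step with random features and labels
clf = RelationClassifier()
reps = torch.randn(16, 1536)                         # batch of entity-pair representations
labels = torch.randint(0, 97, (16,))
loss = F.cross_entropy(clf(reps), labels)            # standard cross-entropy objective
loss.backward()
```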

Document-level RE: To assess our framework’s ability to extract document-level relations, we report performance on DocRED [docred]. We compare our model to the following baselines: (1) CNN [zeng-etal-2014-relation], (2) BiLSTM [lstm], (3) BERT [Devlin2019BERTPO], (4) RoBERTa [Liu2019RoBERTaAR], (5) MTB [BaldiniSoares2019MatchingTB], (6) CP [Peng2020LearningFC], (7 & 8) ERICA-BERT & ERICA-RoBERTa [Qin2021ERICAIE], and (9) WCL-RoBERTa [Peng2020LearningFC]. We fine-tune the pre-trained models on DocRED’s human-annotated train/dev/test splits (see Appendix LABEL:app:docred for detailed experimental settings). We implement WCL with settings identical to those of our other pre-training experiments and, for a fair comparison, use RoBERTa instead of BERT as its base model, given the superior performance we observe from RoBERTa in all other experiments. Table 1 reports performance across multiple data-reduction settings (1%, 10%, and 100%), using an overall F1-micro score and an F1-micro score (IgF1) computed by ignoring fact triples in the test set that overlap with fact triples in the training and development splits. FineCL outperforms all baselines in all experimental settings, offering evidence that it produces better relationship representations from noisy data.

Given that learning-order denoising weighs earlier-learned instances over later-learned instances, FineCL may be biased towards easier, more common relation classes. The increase in F1-micro performance could therefore result from improved predictions on common relation classes at the expense of predictions on rare classes. To better understand the performance gains, we also report F1-macro and weighted F1-macro scores in Table 2. The results show that FineCL outperforms the top baselines on both F1-macro metrics, indicating that, on average, our method improves performance across all relation classes. However, the low F1-macro scores of all models highlight an area for improvement: future pre-trained RE models should focus on improving performance on long-tail relation classes.
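For reference, the three aggregates differ only in how per-class F1 scores are averaged. A toy scikit-learn example (with made-up labels, not DocRED predictions):

```python
# F1-micro vs. F1-macro vs. weighted F1-macro on an imbalanced toy label set.
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1, 2]   # class 0 is the "common" class
y_pred = [0, 0, 1, 1, 0, 2]

print(f1_score(y_true, y_pred, average="micro"))     # dominated by frequent classes
print(f1_score(y_true, y_pred, average="macro"))     # every class counts equally
print(f1_score(y_true, y_pred, average="weighted"))  # macro, weighted by class support
```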

Metric          F1-macro   F1-macro-weighted
BERT            37.3       54.9
RoBERTa         39.6       56.9
ERICA-BERT      37.9       55.8
ERICA-RoBERTa   40.1       57.8
WCL-RoBERTa     39.9       57.2
FineCL          40.7       58.2
Table 2: F1-macro and F1-macro-weighted scores reported from the DocRED test set.

Dataset         TACRED                 SemEval
Size            1%     10%    100%     1%     10%    100%
MTB*            35.7   58.8   68.2     44.2   79.2   88.2
CP*             37.1   60.6   68.1     40.3   80.0   88.5
BERT            22.2   53.5   63.7     41.0   76.5   87.8
RoBERTa         27.3   61.1   69.3     43.6   77.7   87.5
ERICA-BERT      34.9   56.0   64.9     46.4   79.8   88.1
ERICA-RoBERTa   41.1   61.7   69.5     50.3   80.9   88.4
WCL-RoBERTa     37.6   61.3   69.7     47.0   80.0   88.3
FineCL          43.7   62.7   70.3     51.2   81.0   88.7
Table 3: F1-micro scores reported from the TACRED and SemEval test sets (* denotes performance as reported in [Qin2021ERICAIE]; all other numbers are from our implementations).

Sentence-level RE: To assess our framework’s ability to extract sentence-level relations, we report performance on TACRED [zhang2017tacred] and SemEval-2010 Task 8 [hendrickx-etal-2010-semeval]. We compare our model to MTB, CP, BERT, RoBERTa, ERICA-BERT, ERICA-RoBERTa, and WCL-RoBERTa (see Appendix LABEL:app:semeval_tacred for detailed experimental settings). Table 3 reports F1 scores across multiple data-reduction settings (1%, 10%, and 100%). Again, FineCL outperforms the baselines in all settings.