
Fine-grained Contrastive Learning for Relation Extraction

William Hogan        Jiacheng Li        Jingbo Shang
Department of Computer Science & Engineering
University of California, San Diego
{whogan,j9li,jshang}@ucsd.edu

1 Contrastive Pre-training

1.1 Overview

This section introduces our pre-training method for learning high-quality entity and relation representations. We first construct informative representations for entities and relations, which we then use in a three-part pre-training objective consisting of entity discrimination, relation discrimination, and masked language modeling.

1.2 Entity & Relation Representation

We construct entity and relation representations following ERICA [Qin2021ERICAIE]. For a document $d_{i}$, we use a pre-trained language model to encode $d_{i}$ and obtain the hidden states $\{\mathbf{h}_{1},\mathbf{h}_{2},\dots,\mathbf{h}_{|d_{i}|}\}$. We then apply mean pooling over the consecutive tokens of entity $e_{j}$ to obtain its entity representation. Assuming $n_{\mathrm{start}}$ and $n_{\mathrm{end}}$ are the start and end indices of entity $e_{j}$ in document $d_{i}$, the representation of $e_{j}$ is:

$\mathbf{m}_{e_{j}}=\mathrm{MeanPool}(\mathbf{h}_{n_{\mathrm{start}}},\dots,\mathbf{h}_{n_{\mathrm{end}}})$   (1)

To form a relation representation, we concatenate the representations of the two entities $e_{j1}$ and $e_{j2}$: $\mathbf{r}_{j1j2}=[\mathbf{e}_{j1};\mathbf{e}_{j2}]$.
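For concreteness, the following PyTorch-style sketch illustrates Eq. (1) and the concatenation step. The function names, span format, and tensor shapes are our own illustrative assumptions, not the released implementation.

```python
# A minimal sketch of the entity and relation representations (Eq. 1):
# mean-pool the hidden states of an entity's token span, then concatenate
# the head and tail entity vectors.
import torch

def entity_representation(hidden_states: torch.Tensor, start: int, end: int) -> torch.Tensor:
    """Mean-pool hidden states h_start..h_end (inclusive) for one entity mention."""
    # hidden_states: (seq_len, hidden_dim) for a single encoded document d_i
    return hidden_states[start:end + 1].mean(dim=0)

def relation_representation(hidden_states: torch.Tensor,
                            span_head: tuple, span_tail: tuple) -> torch.Tensor:
    """Concatenate head and tail entity vectors: r = [e_head; e_tail]."""
    e_head = entity_representation(hidden_states, *span_head)
    e_tail = entity_representation(hidden_states, *span_tail)
    return torch.cat([e_head, e_tail], dim=-1)   # (2 * hidden_dim,)

# Toy usage with random "encoder outputs"
h = torch.randn(128, 768)                        # one document encoded by a PLM
r = relation_representation(h, (3, 5), (40, 41))
print(r.shape)                                   # torch.Size([1536])
```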

1.3 Entity Discrimination

For entity discrimination, we use the same method described in ERICA. The goal of entity discrimination is to infer the tail entity in a document given a head entity and a relation [Qin2021ERICAIE]. The model distinguishes the ground-truth tail entity from other entities in the text. Given a sampled instance tuple $t^{i}_{jk}=(d_{i},e_{ij},r^{i}_{jk},e_{ik})$, our model is trained to distinguish the tail entity $e_{ik}$ from the other entities in the document $d_{i}$. Specifically, we concatenate the relation name of $r^{i}_{jk}$, the head entity $e_{ij}$, and a special token [SEP] in front of $d_{i}$ to get $d^{*}_{i}$. We then encode $d^{*}_{i}$ to obtain the entity representations using the method from Section 1.2. The contrastive learning objective for entity discrimination is formulated as:

$\mathcal{L}_{\mathrm{ED}}=-\sum_{t_{jk}^{i}\in\mathcal{T}^{\prime}}\log\frac{\exp\left(\cos\left(\mathbf{e}_{ij},\mathbf{e}_{ik}\right)/\tau\right)}{\sum_{l=1,l\neq j}^{\left|\mathcal{E}_{i}\right|}\exp\left(\cos\left(\mathbf{e}_{ij},\mathbf{e}_{il}\right)/\tau\right)}$   (2)

where $\cos(\cdot,\cdot)$ denotes the cosine similarity between two entity representations and $\tau$ is a temperature hyper-parameter.
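The objective in Eq. (2) has the familiar InfoNCE form: the ground-truth tail is the positive and the document's remaining entities are negatives. Below is a hedged PyTorch sketch; the batching scheme and argument names are assumptions, not the authors' code.

```python
# A sketch of the entity-discrimination loss (Eq. 2) for one training instance:
# score the head against candidate entities with temperature-scaled cosine
# similarity and apply cross-entropy with the true tail as the target.
import torch
import torch.nn.functional as F

def entity_discrimination_loss(head: torch.Tensor,     # (hidden_dim,)
                               tail: torch.Tensor,     # (hidden_dim,) positive
                               others: torch.Tensor,   # (num_neg, hidden_dim) other entities in d_i
                               tau: float = 0.05) -> torch.Tensor:
    candidates = torch.cat([tail.unsqueeze(0), others], dim=0)           # positive at index 0
    sims = F.cosine_similarity(head.unsqueeze(0), candidates, dim=-1) / tau
    # Cross-entropy over the candidates with target index 0 is the InfoNCE form of Eq. (2).
    return F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long))

# Toy usage
loss = entity_discrimination_loss(torch.randn(768), torch.randn(768), torch.randn(10, 768))
```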

1.4 Relation Discrimination

To learn representations that transfer effectively to the downstream task of relation extraction, we conduct a Relation Discrimination (RD) task during pre-training. RD aims to distinguish whether two relations are semantically similar [Qin2021ERICAIE]. Existing methods [Peng2020LearningFC, Qin2021ERICAIE] require large amounts of automatically labeled data from distant supervision, which is noisy because not all sentences adequately express a relationship.

Learning order can be introduced here to make the model aware of the noise level of each relation instance. To efficiently incorporate learning order into training, we propose a fine-grained, noise-aware relation discrimination objective.

In this new method, the noise level of each distantly supervised training instance controls the optimization process by re-weighting the contrastive objective. Intuitively, the model should learn more from high-quality, accurately labeled training instances than from noisy, inaccurately labeled ones. Hence, we assign higher weights to earlier-learned instances identified in the learning-order denoising stage.

In practice, we sample a pair of relation instances $t_{A}=(d_{A},e_{A_{1}},r_{A},e_{A_{2}},k_{A})$ and $t_{B}=(d_{B},e_{B_{1}},r_{B},e_{B_{2}},k_{B})$ from $\mathcal{T}^{\prime}$ with $r_{A}=r_{B}$, where $d$ is a document, $e$ is an entity in $d$, $r$ is the relationship between the two entities, and $k$ is the first-learned order introduced in Section LABEL:sec:learning_order. Using the method from Section 1.2, we obtain the positive relation representations $\mathbf{r}_{t_{A}}$ and $\mathbf{r}_{t_{B}}$. To discriminate positive examples from negative ones, the fine-grained RD objective is defined as follows:

$\mathcal{L}_{\mathrm{RD}}=-\sum_{t_{A},t_{B}\in\mathcal{T}^{\prime}}f(k_{A})\log\frac{\exp\left(\cos\left(\mathbf{r}_{t_{A}},\mathbf{r}_{t_{B}}\right)/\tau\right)}{\mathcal{Z}},$   (3)
$\mathcal{Z}=\sum_{t_{C}\in\mathcal{T}^{\prime}/\left\{t_{A}\right\}}^{N}f(k_{C})\exp\left(\cos\left(\mathbf{r}_{t_{A}},\mathbf{r}_{t_{C}}\right)/\tau\right)$

where $\cos(\cdot,\cdot)$ denotes the cosine similarity; $\tau$ is the temperature; $N$ is a hyper-parameter; and $t_{C}$ is a negative instance ($r_{A}\neq r_{C}$) sampled from $\mathcal{T}^{\prime}$. Relation instances $t_{A}$ and $t_{C}$ are re-weighted by the function $f$, which is defined as:

$f(k)=\alpha^{\frac{k_{\mathrm{max}}-k}{k_{\mathrm{max}}-k_{\mathrm{min}}}},$   (4)

where $\alpha$ ($\alpha>1$) is a hyper-parameter that weighs the negative samples, and $k_{\mathrm{max}}$ and $k_{\mathrm{min}}$ are the maximum and minimum first-learned order, respectively. We increase the weight of a negative $t_{C}$ if it is a high-quality training instance (i.e., its $k$ is small). Because all positives and negatives are discriminated from the anchor instance $t_{A}$, we control the overall weight of the term by its learning order $k_{A}$.
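A minimal PyTorch sketch of Eqs. (3) and (4) for a single anchor instance $t_{A}$ is shown below. The shapes, the value of $\alpha$, and the handling of the normalizer follow the description above, but all names are illustrative assumptions rather than the released code.

```python
# A sketch of the fine-grained, noise-aware RD loss: negatives are re-weighted
# by f(k) so that earlier-learned (cleaner) instances contribute more, and the
# whole term is scaled by f(k_A) of the anchor.
import torch
import torch.nn.functional as F

def f_weight(k: torch.Tensor, k_min: float, k_max: float, alpha: float = 2.0) -> torch.Tensor:
    """Eq. (4): larger weight for smaller (earlier) first-learned order k."""
    return alpha ** ((k_max - k) / (k_max - k_min))

def relation_discrimination_loss(r_a: torch.Tensor,     # anchor relation representation
                                 r_b: torch.Tensor,     # positive (same relation as anchor)
                                 r_neg: torch.Tensor,   # (N, dim) sampled negatives
                                 k_a: float,
                                 k_neg: torch.Tensor,   # (N,) first-learned orders of negatives
                                 k_min: float, k_max: float,
                                 tau: float = 0.05,
                                 alpha: float = 2.0) -> torch.Tensor:
    pos = torch.exp(F.cosine_similarity(r_a, r_b, dim=-1) / tau)
    neg_sims = torch.exp(F.cosine_similarity(r_a.unsqueeze(0), r_neg, dim=-1) / tau)
    # Z follows Eq. (3) as stated: a re-weighted sum over the sampled negatives t_C.
    z = (f_weight(k_neg, k_min, k_max, alpha) * neg_sims).sum()
    return -f_weight(torch.tensor(k_a), k_min, k_max, alpha) * torch.log(pos / z)

# Toy usage with random representations and learning orders
loss = relation_discrimination_loss(torch.randn(1536), torch.randn(1536),
                                    torch.randn(64, 1536), k_a=3.0,
                                    k_neg=torch.randint(1, 10, (64,)).float(),
                                    k_min=1.0, k_max=10.0)
```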

1.5 Overall Objective

We include the masked language modeling (MLM) task [Devlin2019BERTPO] to avoid catastrophic forgetting of language understanding [McCloskey1989CatastrophicII], and construct the following overall objective for FineCL:

$\mathcal{L}_{\mathrm{FineCL}}=\mathcal{L}_{\mathrm{ED}}+\mathcal{L}_{\mathrm{RD}}+\mathcal{L}_{\mathrm{MLM}}$   (5)
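For completeness, Eq. (5) is an unweighted sum of the three losses; the placeholder tensors below simply stand in for the outputs of the ED, RD, and MLM objectives in a training step.

```python
# A trivially small sketch of Eq. (5): the three pre-training losses are summed
# with equal weight before backpropagation.
import torch

loss_ed = torch.tensor(1.23, requires_grad=True)    # placeholder for L_ED
loss_rd = torch.tensor(0.87, requires_grad=True)    # placeholder for L_RD
loss_mlm = torch.tensor(2.10, requires_grad=True)   # placeholder for L_MLM

loss_finecl = loss_ed + loss_rd + loss_mlm          # L_FineCL = L_ED + L_RD + L_MLM
loss_finecl.backward()                              # gradients flow to all three objectives
```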

1.6 Pre-training Details

To ensure a fair comparison and highlight the effectiveness of Fine-grained Contrastive Learning, we align our pre-training data and settings with those used by ERICA. The pre-training dataset provided by [Qin2021ERICAIE] is constructed using distant supervision for RE by pairing documents from English Wikipedia with the Wikidata knowledge graph. This dataset creation method mirrors the one used to build the distantly labeled training set in DocRED [docred], but the resulting dataset is much larger and more diverse: it contains 1M documents, 7.2M relation instances, and 1,040 relation types, compared to DocRED’s 100k documents, 1.5M relation instances, and 96 relation types (not including no relation). Additional checks ensure that no fact triples overlap between the pre-training data and the test sets of the various downstream RE tasks.
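The overlap check described above can be sketched as a simple triple-level filter; the data structures below are assumptions for illustration, not the released preprocessing pipeline.

```python
# A sketch of the leakage check: drop any distantly labeled training instance
# whose (head, relation, tail) triple also appears in a downstream test set.
def filter_overlapping_triples(train_instances, test_triples):
    """train_instances: iterable of dicts with 'head', 'relation', 'tail' fields.
    test_triples: set of (head, relation, tail) tuples from downstream test sets."""
    kept = []
    for inst in train_instances:
        triple = (inst["head"], inst["relation"], inst["tail"])
        if triple not in test_triples:
            kept.append(inst)
    return kept

# Toy usage with Wikidata-style identifiers
train = [{"head": "Q76", "relation": "P26", "tail": "Q13133"},
         {"head": "Q42", "relation": "P800", "tail": "Q25169"}]
test = {("Q76", "P26", "Q13133")}
print(len(filter_overlapping_triples(train, test)))  # 1
```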

Size            1%             10%            100%
Metrics         F1     IgF1    F1     IgF1    F1     IgF1
CNN*            -      -       -      -       42.3   40.3
BiLSTM*         -      -       -      -       51.1   50.3
HINBERT*        -      -       -      -       55.6   53.7
CorefBERT*      32.8   31.2    46.0   43.7    57.0   54.5
SpanBERT*       32.2   30.4    46.4   44.5    57.3   55.0
ERNIE*          26.7   25.5    46.7   44.2    56.6   54.2
MTB*            29.0   27.6    46.1   44.1    56.9   54.3
CP*             30.3   28.7    44.8   42.6    55.2   52.7
BERT            19.9   18.8    45.2   43.1    56.6   54.4
RoBERTa         29.6   27.9    47.6   45.7    58.2   55.9
ERICA-BERT      22.9   21.7    48.5   46.4    57.4   55.2
ERICA-RoBERTa   30.0   28.2    50.1   48.1    59.1   56.9
WCL-RoBERTa     22.3   20.8    49.4   47.5    58.5   56.2
FineCL          33.2   31.2    50.3   48.3    59.5   57.1
Table 1: F1-micro scores reported on the DocRED test set. IgF1 ignores performance on fact triples in the test set overlapping with triples in the train/dev sets. (* denotes performance as reported in [Qin2021ERICAIE]; all other numbers are from our implementations).

1.7 Supervised Adaptation

The primary focus of our work is to improve relationship representations learned during pre-training and, in doing so, improve performance on downstream RE tasks. To illustrate the effectiveness of our pre-training method, we use cross-entropy loss, as described in equation LABEL:eq:cross_entropy, to fine-tune our pre-trained FineCL model on document-level and sentence-level RE tasks. The baselines are pre-trained and fine-tuned with identical data and settings for a fair comparison.
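For illustration, supervised adaptation can be sketched as a linear classification head over the entity-pair representation trained with cross-entropy. The head architecture, class count, and hyper-parameters below are our own assumptions, not necessarily the exact fine-tuning setup.

```python
# A sketch of cross-entropy fine-tuning for relation classification on top of
# the concatenated entity-pair representation [e_head; e_tail].
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationClassifier(nn.Module):
    def __init__(self, hidden_dim: int = 768, num_relations: int = 97):
        super().__init__()
        # num_relations is illustrative, e.g., DocRED's 96 relations plus "no relation"
        self.head = nn.Linear(2 * hidden_dim, num_relations)

    def forward(self, relation_reps: torch.Tensor) -> torch.Tensor:
        return self.head(relation_reps)              # unnormalized logits

# Toy fine-tuning step with random features and labels
clf = RelationClassifier()
reps = torch.randn(16, 1536)                         # batch of entity-pair representations
labels = torch.randint(0, 97, (16,))
loss = F.cross_entropy(clf(reps), labels)            # standard cross-entropy objective
loss.backward()
```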

Document-level RE: To assess our framework’s ability to extract document-level relations, we report performance on DocRED [docred]. We compare our model to the following baselines: (1) CNN [zeng-etal-2014-relation], (2) BiLSTM [lstm], (3) BERT [Devlin2019BERTPO], (4) RoBERTa [Liu2019RoBERTaAR], (5) MTB [BaldiniSoares2019MatchingTB], (6) CP [Peng2020LearningFC], (7 & 8) ERICA-BERT & ERICA-RoBERTa [Qin2021ERICAIE], and (9) WCL-RoBERTa [Peng2020LearningFC]. We fine-tune the pre-trained models on DocRED’s human-annotated train/dev/test splits (see Appendix LABEL:app:docred for detailed experimental settings). We implement WCL with settings identical to those of our other pre-training experiments and, for a fair comparison, use RoBERTa instead of BERT as its base model, given the superior performance we observe from RoBERTa in all other experiments. Table 1 reports performance across multiple data-reduction settings (1%, 10%, and 100%), using an overall F1-micro score and an F1-micro score (IgF1) computed by ignoring fact triples in the test set that overlap with fact triples in the training and development splits. FineCL outperforms all baselines in all experimental settings, offering evidence that it produces better relationship representations from noisy data.

Given that learning-order denoising weighs earlier-learned instances over later-learned instances, FineCL may be biased towards easier, more common relation classes. The increase in F1-micro performance could therefore result from improved predictions on common relation classes at the expense of predictions on rare classes. To better understand the performance gains, we also report F1-macro and weighted F1-macro scores in Table 2. The results show that FineCL outperforms the top baselines on both F1-macro metrics, indicating that, on average, our method improves performance across all relation classes. However, the low F1-macro scores of all models highlight an area for improvement: future pre-trained RE models should focus on improving performance on long-tail relation classes.
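For reference, the three aggregates differ only in how per-class F1 scores are averaged. A toy scikit-learn example (with made-up labels, not DocRED predictions):

```python
# F1-micro vs. F1-macro vs. weighted F1-macro on an imbalanced toy label set.
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1, 2]   # class 0 is the "common" class
y_pred = [0, 0, 1, 1, 0, 2]

print(f1_score(y_true, y_pred, average="micro"))     # dominated by frequent classes
print(f1_score(y_true, y_pred, average="macro"))     # every class counts equally
print(f1_score(y_true, y_pred, average="weighted"))  # macro, weighted by class support
```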

Metric          F1-macro   F1-macro-weighted
BERT            37.3       54.9
RoBERTa         39.6       56.9
ERICA-BERT      37.9       55.8
ERICA-RoBERTa   40.1       57.8
WCL-RoBERTa     39.9       57.2
FineCL          40.7       58.2
Table 2: F1-macro and F1-macro-weighted scores reported from the DocRED test set.

Dataset         TACRED                 SemEval
Size            1%     10%    100%     1%     10%    100%
MTB*            35.7   58.8   68.2     44.2   79.2   88.2
CP*             37.1   60.6   68.1     40.3   80.0   88.5
BERT            22.2   53.5   63.7     41.0   76.5   87.8
RoBERTa         27.3   61.1   69.3     43.6   77.7   87.5
ERICA-BERT      34.9   56.0   64.9     46.4   79.8   88.1
ERICA-RoBERTa   41.1   61.7   69.5     50.3   80.9   88.4
WCL-RoBERTa     37.6   61.3   69.7     47.0   80.0   88.3
FineCL          43.7   62.7   70.3     51.2   81.0   88.7
Table 3: F1-micro scores reported from the TACRED and SemEval test sets (* denotes performance as reported in [Qin2021ERICAIE]; all other numbers are from our implementations).

Sentence-level RE: To assess our framework’s ability to extract sentence-level relations, we report performance on TACRED [zhang2017tacred] and SemEval-2010 Task 8 [hendrickx-etal-2010-semeval]. We compare our model to MTB, CP, BERT, RoBERTa, ERICA-BERT, ERICA-RoBERTa, and WCL-RoBERTa (see Appendix LABEL:app:semeval_tacred for detailed experimental settings). Table 3 reports F1 scores across multiple data-reduction settings (1%, 10%, and 100%). Again, FineCL outperforms the baselines in all settings.