
ConstGCN: Constrained Transmission-based Graph Convolutional Networks for Document-level Relation Extraction

Ji Qi1, Bin Xu1, Kaisheng Zeng1, Jinxin Liu1,
Jifan Yu1, Qi Gao2, Juanzi Li1, Lei Hou1
1Department of Computer Science and Technology, Tsinghua University
2Beijing Kedong Electric Control System Co. Ltd.
qj20@mails.tsinghua.edu.cn, xubin@tsinghua.edu.cn
Abstract

Document-level relation extraction with graph neural networks faces a fundamental graph construction gap between training and inference: the gold graph structure is only available during training, which forces most methods to adopt heuristic or syntactic rules to construct a prior graph as a pseudo proxy. In this paper, we propose ConstGCN, a novel graph convolutional network that performs knowledge-based information propagation between entities along all relation spaces without any prior graph construction. Specifically, it updates each entity representation by aggregating information from all other entities along each relation space, thus modeling relation-aware spatial information. To control the information flowing through the indeterminate relation spaces, we propose to constrain the propagation using transmitting scores learned via Noise Contrastive Estimation over fact triples. Experimental results show that our method outperforms previous state-of-the-art (SOTA) approaches on the DocRED dataset. The source code is publicly available at https://github.com/THU-KEG/ConstGCN.

1 Introduction

Document-level relation extraction (DocRE) aims to extract heterogeneous relational graphs of the form $\mathcal{G}=(\mathcal{E},\mathcal{R})$ from a document, where typed entities serve as nodes and multiple directional semantic relations serve as edges. In contrast to sentence-level RE Qin et al. (2018); Gao et al. (2020), DocRE has attracted growing interest because it extracts relations beyond sentence boundaries Yao et al. (2019) and thus preserves information integrity.

Figure 1: An example of a DocRED document (bottom) with one of its gold multi-relational graphs (upper left). On the upper right, in contrast to vanilla GNNs, which update an entity by accumulating the representations of syntactically adjacent entities on the pseudo graph, the proposed ConstGCN models a relation-aware structural representation of the entity by performing knowledge-based information propagation.

Previous DocRE methods tend to apply graph neural networks (GNNs) Kipf and Welling (2016); Veličković et al. (2017) as the core component, and numerous GNN variants have been proposed Guo et al. (2019); Christopoulou et al. (2019); Zeng et al. (2020); Xu et al. (2021). Similar to traditional GNNs modeled on observable graph structures (e.g., social networks Huang et al. (2019) and academic citation networks Feng et al. (2020)), these models all require a pre-specified graph construction. They mainly either rely on heuristic rules over intra- and inter-sentential information of entities and mentions Zeng et al. (2020); Christopoulou et al. (2019), or leverage syntactic dependency paths built by an external parser Sahu et al. (2019); Guo et al. (2019) to serve as the prior graph structure for the GNN. We regard such a structure as a pseudo graph structure, since it establishes each edge between a pair of entities as a binary association based on task-independent auxiliary information (heuristic or syntactic rules).

However, the gold edges that describe the relationships between two entities carry abundant multi-type semantics. Previous approaches therefore suffer from two major intrinsic issues. First, the Hindered Propagation Issue: the construction of the pseudo graph structure ignores many actual relational edges between entities, which hinders effective information acquisition and dissemination. Second, the Noisy Representation Issue: the simple information accumulation over binary associative edges on the pseudo graph makes these models struggle to capture relation-aware structural knowledge, which further results in noisy representations and harms performance. For example, in Figure 1, the representation of the entity Saint Paul is updated by simply accumulating the representations of its syntactically adjacent entities, making it similar to the entities Nobles and Wisconsin Territory, even though they are completely different entities connected by the relation place_of_birth.

Instead of introducing a prior pseudo graph structure, we present ConstGCN, a novel Constrained Transmission-based Graph Convolutional Network that performs knowledge-based information propagation between entities along all relation spaces without any prior graph construction, thereby explicitly modeling the semantics of various relationships. Specifically, we propose knowledge-based information propagation for DocRE by folding flexible Knowledge Graph Embedding (KGE) approaches Wang et al. (2017) into a general transmitting operation. At each graph convolution step, an entity representation is updated by aggregating knowledge-based information broadcast from neighboring entities along all relation spaces, so entity representations containing relation-aware structural semantics are learned effectively and directly. Since the gold graph structure of a document is unknown at inference time, it is difficult to rigorously follow relational paths when transmitting information. We therefore propose transmitting scores to constrain the information flow through the indeterminate relational edges, where the scores are learned jointly via Noise Contrastive Estimation (NCE) Mikolov et al. (2013). This allows the model to learn semantic representations while maintaining the original relation-aware structural information. As shown in Figure 1, compared to vanilla GNNs, our model learns representations with a structure isomorphic to the gold heterogeneous graph in the document.

We conduct extensive experiments on DocRED, a large-scale human-annotated dataset containing heterogeneous graphs among the entities of each document. The results show that our model achieves SOTA performance compared to previous methods. Beyond the proposed graph convolutional network, we further demonstrate the compatibility of representation learning from documents and knowledge graphs. The contributions of our work are summarized as follows:

  • We present a novel graph convolutional network, ConstGCN, that can naturally model heterogeneous graph structures with indeterminate edges based on knowledge-based information propagation.

  • We propose the constrained transmission approach, which allows the model to learn entity representations while maintaining the original relation-aware structural information.

  • We conduct experiments on the DocRE task and achieve SOTA performance. This work also demonstrates the compatibility of representation learning from documents and knowledge graphs.

2 Preliminary

2.1 Document-level Relation Extraction

Given a textual document $\mathcal{D}=\{w_{i}\}_{i=1}^{|\mathcal{D}|}$ consisting of a sequence of words and a set of typed entities $\mathcal{E}=\{e_{i}\}_{i=1}^{|\mathcal{E}|}$, where each entity refers to a set of mentions $e=\{m_{i}\}_{i=1}^{|e|}$ and each mention is a sequence of words in the document, the task of DocRE is to extract the heterogeneous graph $\{(e_{i},r_{k},e_{j})\,|\,e_{i},e_{j}\in\mathcal{E},r_{k}\in\mathcal{R}\}$, in which a pair of entity nodes $e_{i},e_{j}$ may have multiple edges, each edge $r_{k}$ refers to a specific relation type, and $\mathcal{R}$ is the set of predefined relations.

Figure 2: Overview of ConstGCN. Given the representations of entities, ConstGCN first computes the transmitting scores between entities in all relation spaces, and then performs graph convolution, updating each entity by transmitting its neighbors' information along the projection spaces of all relations under the constraints of the transmitting scores. Finally, entity representations that model the structural semantics of heterogeneous graphs are learned and used to predict the relational classes.

2.2 Knowledge Graph Embedding

A knowledge graph (KG) is a collection of triples $\mathcal{G}=\{(e_{i},r_{k},e_{j})\,|\,e_{i},e_{j}\in\mathcal{E},r_{k}\in\mathcal{R}\}$ over sets of pre-defined entity and relation types. The task of KGE is to learn vectorial representations $\mathbf{e}_{i},\mathbf{r}_{k},\mathbf{e}_{j}$ that model the heterogeneous structural information based on a scoring function $d_{r}$. Depending on the scoring function used, typical methods fall into two categories: translational distance-based approaches (e.g., TransE Bordes et al. (2013) and RotatE Sun et al. (2019)) and semantic matching-based approaches (e.g., DistMult Yang et al. (2014) and ComplEx Trouillon et al. (2016)). As shown in Table 1, the key idea behind both categories is to transmit the representation of the head entity through a relation-specific projection space so that it approximates the tail entity. We thus define a unified transmitting operation $\oplus$ over these typical KGE approaches,

Methods Scoring Function
TransE $d_{r}(e_{i},r_{k},e_{j})=\gamma-\|\mathbf{e}_{i}+\mathbf{r}_{k}-\mathbf{e}_{j}\|$
RotatE $d_{r}(e_{i},r_{k},e_{j})=\gamma-\|\mathbf{e}_{i}\circ\mathbf{r}_{k}-\mathbf{e}_{j}\|$
DistMult $d_{r}(e_{i},r_{k},e_{j})=\langle\mathbf{e}_{i},\mathbf{r}_{k},\mathbf{e}_{j}\rangle$
ComplEx $d_{r}(e_{i},r_{k},e_{j})=\mathrm{Re}(\langle\mathbf{e}_{i},\mathbf{r}_{k},\bar{\mathbf{e}}_{j}\rangle)$
Table 1: Scoring functions for typical KGE models. The L1 norm is used for all distance-based models; the subscript of $\|\cdot\|_{1}$ is dropped for brevity.
\mathbf{e}\oplus\mathbf{r}=\begin{cases}\mathbf{e}+\mathbf{r},&(\text{TransE})\\\langle\mathbf{e},\mathbf{r}\rangle,&(\text{DistMult})\\\mathrm{Re}(\langle\mathbf{e},\mathbf{r}\rangle),&(\text{ComplEx})\end{cases} (1)

where $\langle\cdot\rangle$ denotes the generalized dot product and $\mathrm{Re}$ returns the real part of a complex value. The transmitting operation above will be used to perform knowledge-based message passing in the document.
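To make the transmitting operation concrete, the following is a minimal PyTorch sketch of Eq. (1) together with the TransE and DistMult scoring functions of Table 1. The interleaved real/imaginary layout for ComplEx and the margin value are our assumptions, not fixed by the paper; note that for ComplEx we keep the full complex product so that a later dot product with the tail entity recovers the scoring function.

```python
# A sketch of the unified transmitting operation e ⊕ r (Eq. 1), assuming
# real-valued embeddings; for ComplEx the two halves of the vector are
# treated as real and imaginary parts (an assumed layout).
import torch

def transmit(e: torch.Tensor, r: torch.Tensor, mode: str = "transe") -> torch.Tensor:
    """Compute e ⊕ r for batches of embeddings of shape (..., d)."""
    if mode == "transe":
        return e + r                          # translation in relation space
    if mode == "distmult":
        return e * r                          # element-wise product
    if mode == "complex":
        e_re, e_im = e.chunk(2, dim=-1)       # split halves into Re/Im parts
        r_re, r_im = r.chunk(2, dim=-1)
        return torch.cat([e_re * r_re - e_im * r_im,
                          e_re * r_im + e_im * r_re], dim=-1)  # complex product
    raise ValueError(f"unknown mode: {mode}")

def score_transe(e_i, r_k, e_j, gamma: float = 20.0):
    # d_r(e_i, r_k, e_j) = γ - ||e_i + r_k - e_j||_1  (Table 1)
    return gamma - (transmit(e_i, r_k, "transe") - e_j).abs().sum(-1)

def score_distmult(e_i, r_k, e_j):
    # d_r(e_i, r_k, e_j) = <e_i, r_k, e_j>
    return (transmit(e_i, r_k, "distmult") * e_j).sum(-1)
```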

3 Methodology

In this section, we introduce the details of the proposed model. The overall framework is illustrated in Figure 2. Based on the entity representations obtained from a PLM encoder, ConstGCN is composed of $T$ graph convolutional layers, and each layer has two computational steps: the computation of transmitting scores, and message passing under the constraints of those scores.

3.1 PLM Encoding

Given an input document consisting of a sequence of words with a set of typed entities, we first insert a special token "*" at the start and end of each entity mention, following the entity marker technique Zhang et al. (2017); Shi and Lin (2019); Baldini Soares et al. (2019). For the processed document $\mathcal{D}=\{x_{i}\}_{i=1}^{|\mathcal{D}|}$, we then employ a pre-trained language model (e.g., BERT Devlin et al. (2019)) to obtain contextual sequence representations:

\mathbf{H}=(\mathbf{x}_{1},...,\mathbf{x}_{|\mathcal{D}|})=\text{PLM}(x_{1},...,x_{|\mathcal{D}|}), (2)

where $\mathbf{x}_{i}\in\mathbb{R}^{d^{w}}$ refers to the contextualized embedding of the $i$-th token. For documents whose sequence length exceeds the maximum input length of the encoder, we compute the representations of overlapping tokens by averaging their embeddings from different windows. We then use the embedding of the special token "*" at the start of the $u$-th mention as the mention representation $\mathbf{m}_{u}$. For the $i$-th entity with mentions $e_{i}=\{m_{u}\}_{u=1}^{|e_{i}|}$, we compute an initial coarse-grained entity representation by aggregating its coreferent mentions with the logsumexp pooling function Zhang et al. (2019):

\mathbf{e}_{i}=\log\sum_{m_{u}\in e_{i}}\exp(\mathbf{m}_{u}), (3)

We use these entity representations obtained from the PLM encoder as the initialization of the entity nodes, $\mathbf{E}^{(0)}=[\mathbf{e}_{1},\mathbf{e}_{2},...,\mathbf{e}_{|\mathcal{E}|}]^{\top}\in\mathbb{R}^{|\mathcal{E}|\times d}$.
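As an illustration, here is a minimal sketch of the entity initialization of Eq. (3), assuming the contextual embeddings H and the start-marker positions of each entity's mentions are already available; the input format is hypothetical.

```python
# A sketch of Eq. (3): logsumexp pooling over the "*" start-marker
# embeddings of an entity's mentions.
import torch

def init_entities(H: torch.Tensor, entity_mentions: list) -> torch.Tensor:
    """H: (doc_len, d) token embeddings; entity_mentions[i] is the list of
    start-marker token positions of entity i's mentions."""
    entities = []
    for starts in entity_mentions:
        m = H[torch.tensor(starts)]                 # (|e_i|, d) mention vectors
        entities.append(torch.logsumexp(m, dim=0))  # smooth-max over mentions
    return torch.stack(entities)                    # E^(0): (|E|, d)
```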

3.2 Constrained Transmission-based GCN

To model the multi-relational graph structure, ConstGCN updates each entity representation by receiving information broadcast from other entities along all relation spaces. We thus introduce the representations of relations. Instead of independently defining an embedding for each relation type, we use a variant of the basis formulation Schlichtkrull et al. (2018) that linearly combines a set of basis vectors to promote generalization. Formally, let $\{\mathbf{z}_{1},...,\mathbf{z}_{\mathcal{B}}\,|\,\mathbf{z}_{b}\in\mathbb{R}^{d}\}$ be a set of basis vectors; a relation representation is given as:

\mathbf{r}_{k}=\sum_{b=1}^{\mathcal{B}}\beta_{k_{b}}\mathbf{z}_{b}, (4)

where $\beta_{k_{b}}\in\mathbb{R}$ is a learnable scalar weight specific to the $k$-th relation space. These vectors serve as the initial representations of all relations, $\mathbf{R}^{(0)}=[\mathbf{r}_{1},...,\mathbf{r}_{|\mathcal{R}|}]^{\top}\in\mathbb{R}^{|\mathcal{R}|\times d}$, for the subsequent constrained transmission-based graph convolution.
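A minimal sketch of this basis formulation (Eq. 4) might look as follows; the initialization scale is our choice, not specified in the paper.

```python
# A sketch of Eq. (4): each relation embedding is a learned linear
# combination of B shared basis vectors.
import torch
from torch import nn

class BasisRelationEmbeddings(nn.Module):
    """R^(0): r_k = Σ_b β_{k_b} z_b over B shared bases."""
    def __init__(self, num_relations: int, num_bases: int, dim: int):
        super().__init__()
        self.bases = nn.Parameter(torch.randn(num_bases, dim) * 0.02)       # {z_b}
        self.coeffs = nn.Parameter(torch.randn(num_relations, num_bases))   # β_{k_b}

    def forward(self) -> torch.Tensor:
        return self.coeffs @ self.bases   # (|R|, d)
```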

Modeling transmitting scores with NCE

Due to the agnostic nature of the relational edges, we need to broadcast entity information along all relation spaces. We therefore need to measure the probability that a specific relationship holds between any two nodes in order to control the information flow.

Specifically, a simplified NCE objective Gutmann and Hyvärinen (2012); Mikolov et al. (2013) can be defined as:

\mathcal{L}_{nce}=\log\sigma(\phi(\mathbf{v}_{i},\mathbf{v}_{j}))+\sum_{k}\mathbb{E}_{v_{k}^{\prime}\sim p_{n}(v)}[\log\sigma(-\phi(\mathbf{v}_{k}^{\prime},\mathbf{v}_{i}))], (5)

where positive sample pairs $(v_{i},v_{j})$ are differentiated from noisy sample pairs $(v_{k}^{\prime},v_{i})$ drawn from the noise distribution $p_{n}$ by means of logistic regression, and $\phi$ is a specific measure function. Inspired by KGE, we naturally adopt the logistic term of NCE and replace the measure function $\phi$ with a KGE scoring function to calculate the transmitting score from entity $e_{i}$ to entity $e_{j}$ through relation space $r_{k}$:

f_{r}(\mathbf{e}^{(t)}_{i},\mathbf{r}^{(t)}_{k},\mathbf{e}^{(t)}_{j})=\sigma(d_{r}(\mathbf{e}^{(t)}_{i},\mathbf{r}^{(t)}_{k},\mathbf{e}^{(t)}_{j})), (6)

where the score indicates the probability that there is a relational edge $r_{k}$ from $e_{i}$ to $e_{j}$.

At the $(t{+}1)$-th layer, by computing equation (6) for all entity pairs in each relation space, we obtain the matrices of transmitting scores within all relation spaces:

[\mathcal{A}^{(t)}]_{kij}=f_{r}(\mathbf{e}^{(t)}_{i},\mathbf{r}^{(t)}_{k},\mathbf{e}^{(t)}_{j}), (7)

where $\mathcal{A}^{(t)}\in\mathbb{R}^{|\mathcal{R}|\times|\mathcal{E}|\times|\mathcal{E}|}$ refers to the tensor of transmitting scores over all relation spaces at the $t$-th convolution layer.

Note that equation (6) can also serve as an objective for KGE learning, and thus it can simultaneously be used to optimize the representations of entities and relations with a specific KGE approach.
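For concreteness, here is a sketch of Eqs. (6)-(7) with the TransE distance; any $d_r$ from Table 1 can be substituted. Materializing the full $|\mathcal{R}|\times|\mathcal{E}|\times|\mathcal{E}|\times d$ broadcast is a simplification for clarity, not an efficiency claim.

```python
# A sketch of Eqs. (6)-(7): transmitting scores for every (r_k, e_i, e_j)
# triple, using the TransE scoring function with L1 norm.
import torch

def transmitting_scores(E: torch.Tensor, R: torch.Tensor, gamma: float = 20.0) -> torch.Tensor:
    """E: (|E|, d) entities, R: (|R|, d) relations -> A: (|R|, |E|, |E|),
    where A[k, i, j] = σ(d_r(e_i, r_k, e_j))."""
    heads = E[None, :, None, :]     # e_i, broadcast over (k, i, j)
    rels  = R[:, None, None, :]     # r_k
    tails = E[None, None, :, :]     # e_j
    d = gamma - (heads + rels - tails).abs().sum(-1)   # TransE distance
    return torch.sigmoid(d)         # Eq. (6) applied to every triple
```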

Passing semantic information under constraints

For a heterogeneous graph consisting of entity nodes and relational edges in a document, an entity carries multifaceted structural semantics grounded in its neighbors under particular relations, and such multifaceted information is difficult to model with hindered message passing over binary associative edges. We propose to update each entity representation by transmitting the representations of its relation-specific neighbors along the relation spaces, thereby enhancing the multifaceted structural information.

Due to the agnostic nature of relational edges in real-world documents, we further constrain the propagation with the transmitting scores learned in the previous step to maintain the original relation-aware structure. For each entity, the update is defined as $\mathbf{e}^{(t+1)}_{i}=\sum_{k=1}^{|\mathcal{R}|}\sum_{j=1}^{|\mathcal{E}|}f_{r}(\mathbf{e}^{(t)}_{j},\mathbf{r}^{(t)}_{k},\mathbf{e}^{(t)}_{i})(\mathbf{e}^{(t)}_{j}\oplus\mathbf{r}^{(t)}_{k})$, where $\oplus$ denotes the transmitting operation of the specific KGE method introduced above. At the scale of all entities, we can rewrite this transmission compactly with tensor multiplication, which also improves computational efficiency:

\mathbf{E}^{(t+1)}_{k}=\tilde{\mathcal{A}}^{(t)}\otimes(\mathbf{E}^{(t)}\oplus\mathbf{M}_{k}^{(t)}), (8)
\mathbf{E}^{(t+1)}=f_{pool}(\mathbf{E}^{(t+1)}_{1},\mathbf{E}^{(t+1)}_{2},...,\mathbf{E}^{(t+1)}_{|\mathcal{R}|}), (9)

where $\mathbf{E}^{(t+1)}_{k}\in\mathbb{R}^{|\mathcal{E}|\times d}$ holds the entity representations corresponding to the $k$-th relation space, $\tilde{\mathcal{A}}^{(t)}$ is the transmitting-score tensor with its second and third dimensions transposed, and $\mathbf{M}_{k}^{(t)}=[\mathbf{r}^{(t)}_{k},...,\mathbf{r}^{(t)}_{k}]^{\top}\in\mathbb{R}^{|\mathcal{E}|\times d}$ is the matrix composed of $|\mathcal{E}|$ identical vectors $\mathbf{r}^{(t)}_{k}$. The operation $\otimes:(\mathbb{R}^{|\mathcal{R}|\times|\mathcal{E}|\times|\mathcal{E}|},\mathbb{R}^{|\mathcal{E}|\times d})\rightarrow\mathbb{R}^{|\mathcal{R}|\times|\mathcal{E}|\times d}$ denotes tensor multiplication. Similar to the pooling functions that aggregate feature maps along the channel dimension in computer vision, the function $f_{pool}(\mathcal{X}):\mathbb{R}^{d_{1}\times d_{2}\times d_{3}}\rightarrow\mathbb{R}^{d_{2}\times d_{3}}$ aggregates the matrices spanned by the $d_{2}$ and $d_{3}$ dimensions of $\mathcal{X}$ along the $d_{1}$ dimension. Multiple pooling operations can be used, such as $f_{pool}^{mean}$, $f_{pool}^{max}$, and $f_{pool}^{sum}$ for mean, max, and sum.
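Putting Eqs. (8)-(9) together, one ConstGCN layer with the TransE transmitting operation and sum pooling could be sketched as follows; batching all relation spaces into a single bmm call is an implementation choice, not prescribed by the paper.

```python
# A sketch of one constrained propagation layer (Eqs. 8-9), TransE variant.
import torch

def constgcn_layer(E: torch.Tensor, R: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
    """E: (|E|, d), R: (|R|, d), A: (|R|, |E|, |E|) transmitting scores.
    Returns E^(t+1): (|E|, d)."""
    A_tilde = A.transpose(1, 2)            # Ã: A_tilde[k, i, j] = f_r(e_j, r_k, e_i)
    msgs = E[None, :, :] + R[:, None, :]   # E ⊕ M_k for all k at once (TransE): (|R|, |E|, d)
    E_k = torch.bmm(A_tilde, msgs)         # ⊗ of Eq. (8): (|R|, |E|, d)
    return E_k.sum(dim=0)                  # f_pool^sum over relation spaces, Eq. (9)
```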

We introduce an attentive pooling function $f^{att}_{pool}$ that aggregates the entity representations according to their importance in each relation space:

\mathbf{Z}^{(t)}=\mathrm{softmax}\left(\frac{\mathbf{R}^{(t)}(\mathbf{E}^{(t)})^{\top}}{\sqrt{d}}\right),\qquad f_{pool}^{att}:\ \sum_{k=1}^{|\mathcal{R}|}(\mathbf{z}_{k}^{(t)})^{\top}\cdot\mathbf{E}^{(t+1)}_{k}, (10)

where $\mathbf{z}_{k}^{(t)}\in\mathbb{R}^{|\mathcal{E}|}$ refers to the $k$-th row of $\mathbf{Z}^{(t)}$, indicating the importance of the entities with respect to relation space $r_{k}$.
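A sketch of $f_{pool}^{att}$ (Eq. 10) follows; we read $(\mathbf{z}_{k})^{\top}\cdot$ as a per-entity weighting of the $k$-th relation space's output before summing over spaces, which is our interpretation of the notation.

```python
# A sketch of the attentive pooling of Eq. (10).
import torch

def attentive_pool(E_k: torch.Tensor, E: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
    """E_k: (|R|, |E|, d) per-relation updates from Eq. (8);
    E: (|E|, d), R: (|R|, d)."""
    d = E.size(-1)
    Z = torch.softmax(R @ E.T / d ** 0.5, dim=-1)   # Z^(t): (|R|, |E|) importances
    return (Z.unsqueeze(-1) * E_k).sum(dim=0)       # weight per entity, sum over k
```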

By performing $T$ layers of the graph convolution, we obtain representations of entities $\mathbf{E}^{(T)}$ and relations $\mathbf{R}^{(T)}$ that model the multi-relational spatial information.

3.3 Learning

Based on the fine-grained entity representations obtained from ConstGCN, we further derive the representation of each entity pair by supplementing related local token information via localized context pooling Zhou et al. (2021). For each entity pair $(e_{i},e_{j})$ in the current document, we obtain the localized representation by an attention-based weighted summation:

\mathbf{c}^{(i,j)}=(\boldsymbol{\alpha}^{(i)}\cdot\boldsymbol{\alpha}^{(j)})\mathbf{H}, (11)

where $\boldsymbol{\alpha}^{(i)}\in\mathbb{R}^{|\mathcal{D}|}$ refers to the attention scores from the $i$-th entity to all tokens, averaged over all layers of the pre-trained language model. The augmented representations are then obtained by:

\begin{cases}\bar{\mathbf{e}}_{i}=\tanh(\mathbf{W}_{s}\mathbf{e}_{i}^{(T)}+\mathbf{W}_{c_{1}}\mathbf{c}^{(i,j)}),\\\bar{\mathbf{e}}_{j}=\tanh(\mathbf{W}_{o}\mathbf{e}_{j}^{(T)}+\mathbf{W}_{c_{2}}\mathbf{c}^{(i,j)}),\end{cases} (12)

where $\mathbf{W}_{s},\mathbf{W}_{o},\mathbf{W}_{c_{1}},\mathbf{W}_{c_{2}}\in\mathbb{R}^{d\times d}$ are learnable weight matrices. Based on the derivation above, we employ two learning objectives to optimize the model: a cross-entropy-based classification objective that learns the final relational classes between entities, and a contrastive learning objective for the representation learning of entities and relations.
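A sketch of Eqs. (11)-(12) for one entity pair follows; the renormalization of the joint attention follows the localized context pooling of Zhou et al. (2021), and the linear modules stand in for the weight matrices.

```python
# A sketch of localized context pooling and the augmented pair
# representations (Eqs. 11-12).
import torch
from torch import nn

def pair_representation(e_i, e_j, attn_i, attn_j, H, W_s, W_o, W_c1, W_c2):
    """e_i, e_j: (d,) entity vectors from E^(T); attn_i, attn_j: (doc_len,)
    averaged PLM attention of each entity over tokens; H: (doc_len, d);
    W_*: nn.Linear(d, d) modules standing in for the weight matrices."""
    a = attn_i * attn_j                       # α^(i) · α^(j), element-wise
    a = a / (a.sum() + 1e-12)                 # renormalize (per Zhou et al., 2021)
    c = a @ H                                 # c^(i,j): localized context, Eq. (11)
    e_bar_i = torch.tanh(W_s(e_i) + W_c1(c))  # Eq. (12)
    e_bar_j = torch.tanh(W_o(e_j) + W_c2(c))
    return e_bar_i, e_bar_j
```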

3.3.1 Classification objective

Category Model Dev (Ign F1 / F1) Test (Ign F1 / F1)
Sequence-based Models:
CNN Yao et al. (2019) 41.58 43.45 40.33 42.26
LSTM Yao et al. (2019) 48.44 50.68 47.71 50.07
BiLSTM Yao et al. (2019) 48.87 50.94 48.78 51.06
ContexAware Yao et al. (2019) 48.94 51.09 48.40 50.07
BERT-base* Wang et al. (2019) - 54.16 - 53.20
BERT-2Phase-base* Wang et al. (2019) - 54.42 - 53.92
HIN-BERT-base* Tang et al. (2020) 54.29 56.31 53.70 55.60
CorefBERT-base* Ye et al. (2020) 55.32 57.51 54.54 56.96
CorefRoBERTa-large* Ye et al. (2020) 57.35 59.43 57.90 60.25
MIUK(three-view)-BERT-base* Li et al. (2021) 58.27 60.11 58.05 59.99
ATLOP-BERT-base* Zhou et al. (2021) 59.22 61.09 59.31 61.30
ATLOP-RoBERTa-large* Zhou et al. (2021) 61.32 63.18 61.39 63.40
Graph-based Models:
GAT Veličković et al. (2017) 45.17 51.44 47.36 49.51
AGGCN Guo et al. (2019) 46.29 52.47 48.89 51.45
EoG Christopoulou et al. (2019) 45.94 52.15 49.48 51.82
GCNN Sahu et al. (2019) 46.22 51.52 49.59 51.62
LSR-BERT-base* Nan et al. (2020) 52.43 59.00 56.97 59.05
GAIN-BERT-base♮ Zeng et al. (2020) 57.75 60.03 57.98 60.42
HeterGSAN-BERT-base* Xu et al. (2021) 58.13 60.18 57.12 59.45
ConstGCN-BERT-base 59.32 61.27 59.58 61.55
ConstGCN-RoBERTa-large 62.01 63.91 62.04 64.00
Table 2: Performance on the dev and test sets of DocRED. The bottom two rows show the results of our models with the TransE implementation. Results with * are reported in the original papers; results for GAT, AGGCN, EoG, and GCNN are those implemented in Nan et al. (2020); results with ♮ are our reproduced performance. Bold results indicate the best performance.

The probabilities of the final relational classes for each pair of entities in the document are calculated as follows:

P(r|e_{i},e_{j})=\sigma(\bar{\mathbf{e}}_{i}\mathbf{W}_{r}\bar{\mathbf{e}}_{j}+b_{r}), (13)

where $\mathbf{W}_{r}\in\mathbb{R}^{d\times d}$ and $b_{r}\in\mathbb{R}$ are learnable model parameters. We utilize the adaptive thresholding method Zhou et al. (2021), which introduces a learnable threshold class TH to differentiate positive from negative classes. The cross-entropy-based classification objective is then defined as:

\mathcal{L}_{cls_{1}}=-\sum_{r_{k}\in\mathcal{T}_{P}}\log\frac{\exp(P(r_{k}|e_{i},e_{j}))}{\sum_{r_{k}^{\prime}\in\mathcal{T}_{P}\cup\{TH\}}\exp(P(r_{k}^{\prime}|e_{i},e_{j}))},
\mathcal{L}_{cls_{2}}=-\log\frac{\exp(P(TH|e_{i},e_{j}))}{\sum_{r_{k}^{\prime}\in\mathcal{T}_{N}\cup\{TH\}}\exp(P(r_{k}^{\prime}|e_{i},e_{j}))},
\mathcal{L}_{cls}=\mathcal{L}_{cls_{1}}+\mathcal{L}_{cls_{2}},

where $\mathcal{T}_{P}$ and $\mathcal{T}_{N}$ are the sets of all positive and all negative relational triples, respectively. At test time, a relation class is returned as positive if its probability is higher than that of the TH class.
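A minimal sketch of this adaptive-thresholding loss for one entity pair, following the formulation of Zhou et al. (2021) and reserving class index 0 for TH (an assumed convention), is shown below.

```python
# A sketch of L_cls = L_cls1 + L_cls2 with a learnable threshold class TH.
import torch
import torch.nn.functional as F

def adaptive_threshold_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: (C,) pair scores with index 0 = TH; labels: (C,) float
    multi-hot positives (labels[0] == 0)."""
    th = torch.zeros_like(labels)
    th[0] = 1.0
    # L_cls1: softmax over {positives ∪ TH}; push each positive above TH
    logit1 = logits.masked_fill((labels + th) == 0, float("-inf"))
    loss1 = -(F.log_softmax(logit1, dim=-1) * labels).sum()
    # L_cls2: softmax over {negatives ∪ TH}; push TH above every negative
    logit2 = logits.masked_fill(labels == 1, float("-inf"))
    loss2 = -(F.log_softmax(logit2, dim=-1) * th).sum()
    return loss1 + loss2
```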

3.3.2 Contrastive learning objective

We optimize the transmission learning with Noise Contrastive Estimation (NCE) Gutmann and Hyvärinen (2012); Mikolov et al. (2013). Instead of uniform negative sampling, we leverage the self-adversarial negative sampling method introduced by Sun et al. (2019) to alleviate the problem of uninformative negatives,

\mathcal{L}_{nce}=-\sum_{i\neq j}\sum_{(e_{i},e_{j},r_{k})\in\mathcal{T}_{P}}\Big(\log\sigma(d_{r}(\bar{\mathbf{e}}_{i},\bar{\mathbf{r}}_{k},\bar{\mathbf{e}}_{j}))+\sum_{(e_{q}^{\prime},r_{k},e_{q}^{\prime\prime})\in\mathcal{T}_{N}^{q}}\varphi_{q}^{k}\log\sigma(-d_{r}(\bar{\mathbf{e}}_{q}^{\prime},\bar{\mathbf{r}}_{k},\bar{\mathbf{e}}_{q}^{\prime\prime}))\Big),
\varphi_{q}^{k}=\frac{\exp(\tau\,\sigma(-d_{r}(\bar{\mathbf{e}}_{q}^{\prime},\bar{\mathbf{r}}_{k},\bar{\mathbf{e}}_{q}^{\prime\prime})))}{\sum_{(e_{l}^{\prime},r_{k},e_{l}^{\prime\prime})\in\mathcal{T}_{N}^{q}}\exp(\tau\,\sigma(-d_{r}(\bar{\mathbf{e}}_{l}^{\prime},\bar{\mathbf{r}}_{k},\bar{\mathbf{e}}_{l}^{\prime\prime})))},

where $\mathcal{T}_{N}^{q}$ refers to the negative triples sampled from the current document $q$, and $\tau$ is the sampling temperature. Finally, the overall objective is a linear combination of the two objectives,

\mathcal{L}=\mathcal{L}_{cls}+\mu\mathcal{L}_{nce}, (14)

where $\mu$ is a hyperparameter balancing the two objectives.
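A sketch of the self-adversarial NCE term and the combined objective follows, using the TransE distance. Detaching the weights $\varphi$ from the gradient is our assumption, following the treatment in Sun et al. (2019); the batch shapes are hypothetical.

```python
# A sketch of the self-adversarial NCE loss and the overall objective (Eq. 14).
import torch
import torch.nn.functional as F

def d_r(h, r, t, gamma: float = 20.0):
    return gamma - (h + r - t).abs().sum(-1)      # TransE scoring

def self_adversarial_nce(pos, neg, tau: float = 1.0) -> torch.Tensor:
    """pos: (h, r, t) tensors of shape (P, d); neg: (h', r, t'') tensors of
    shape (P, K, d) with K negatives per positive triple."""
    pos_term = F.logsigmoid(d_r(*pos)).sum()
    neg_score = d_r(*neg)                         # (P, K)
    # self-adversarial weights φ_q^k, treated as constants (no gradient)
    phi = torch.softmax(tau * torch.sigmoid(-neg_score), dim=-1).detach()
    neg_term = (phi * F.logsigmoid(-neg_score)).sum()
    return -(pos_term + neg_term)

# Overall objective (Eq. 14): loss = L_cls + mu * L_nce
```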

4 Experiments

4.1 Experimental Settings

Dataset.

We evaluate the proposed method on a widely-used human-annotated dataset, DocRED Yao et al. (2019), which is built from Wikipedia and Wikidata and contains 5,053 documents, 132,375 entities, 56,354 relational facts, and 96 relation types. More than 40.7% of the relations can only be extracted across sentences, and 61.1% of the relations require inference skills such as logical inference. We follow the standard split of the dataset, i.e., 3,053 documents for training, 1,000 for development, and 1,000 for test.

Baselines.

We compare ConstGCN with sequence-based methods that adopt sequential encoders as the main architecture, including CNN/LSTM Yao et al. (2019), BERT-2Phase Wang et al. (2019), HIN-GloVe/HIN-BERT Tang et al. (2020), CorefBERT Ye et al. (2020), MIUK Li et al. (2021), and ATLOP Zhou et al. (2021), and graph-based methods leveraging GNNs as the backbone, including GCNN Sahu et al. (2019), GAT Veličković et al. (2017), AGGCN Guo et al. (2019), EoG Christopoulou et al. (2019), GAIN Zeng et al. (2020), LSR Nan et al. (2020), and HeterGSAN Xu et al. (2021).

Implementation.

For the foundational components, we use cased BERT-base or RoBERTa-large Liu et al. (2019) as the encoder, as in all BERT-based baselines. For ConstGCN, we empirically set the number of relation basis vectors to $\mathcal{B}=56$ and the pooling function to att, both tuned on the dev set. We investigate three implementations with different transmitting operations, ConstGCN (TransE), ConstGCN (DistMult), and ConstGCN (ComplEx), where the fixed margin of ConstGCN (TransE) is set to $\gamma=20.0$. The number of graph convolutional layers is set to 2, 2, and 1 for the three implementations, respectively, and the negative sampling proportion in transmitting-score learning is set to $1/|\mathcal{T}_{N}^{q}|=1/40$ for all implementations. All hyperparameters are tuned by grid search on dev-set performance; more training details are available in Appendix A.

Figure 3: The first two figures show the dev F1 score against the transmission cost, where the F1 score strictly increases as the transmission cost decreases until convergence. The third shows the dev F1 score for different numbers of relation types on DocRED. ConstGCN shows greater superiority as the number of relation types grows.

4.2 Main Results

The main results on the DocRED dataset under the standard metrics F1 and Ign F1 are presented in Table 2, where Ign F1 computes the F1 score excluding the relational facts shared between the training and dev/test sets. We group existing models into two categories: (1) Sequence-based Models, which adopt a sequential encoder to obtain representations; and (2) Graph-based Models, which focus on modeling the heterogeneous graph structure with GNNs independent of the encoder.

As shown in Table 2, our method achieves state-of-the-art performance on all metrics and all splits. Compared to the Graph-based Models, our method outperforms the previous best GNN-based model by 1.09 and 2.10 F1 points on the development and test sets, respectively. These results suggest that ConstGCN naturally learns more fine-grained, informative entity representations during the constrained transmission-based graph convolution. With RoBERTa-large as the encoder, ConstGCN further achieves F1 scores of 63.91 and 64.00 on the two evaluation sets, a new state of the art on DocRED.

4.3 Model Analysis

In this section, we further investigate the effectiveness of the transmission learning and the impact of ConstGCN on documents of different complexity.

How does the transmission learning affect the performance?

Figure 3 shows the learning curves of classification performance and transmission cost for all three implementations on the development set during training. As seen, the F1 scores strictly improve as the transmission cost decreases for all three models, and each model achieves its optimal classification performance when the transmission cost converges. This shows that the informative entity representations learned by optimizing the constrained transmission-based graph convolution lead to significant improvements in classification performance. We also notice that the model using the ComplEx transmitting operation performs best in terms of convergence speed and optimal value, suggesting that the model learns especially effective representations in the complex space. A visualization of the transmitting scores learned between entities in a specific document is shown in Appendix B.

What kind of environment is ConstGCN more effective in?

We further investigate the effectiveness of ConstGCN on documents of different relation-aware complexity to demonstrate its applicability. As the third sub-figure of Figure 3 shows, we split the development set of DocRED into 6 disjoint subsets by the number of relation types and evaluate models trained with or without ConstGCN on each subset. Overall, the F1 performance of both models tends to decrease as the number of relation types grows. However, when the number of relation types is larger than 10, the model with ConstGCN consistently exhibits significantly better performance than the model without it. This result indicates that broadcasting knowledge-based transmission in all relation spaces learns relation-aware structural information and is better suited to complex environments. We further conjecture that even when a document's complexity exceeds that of DocRED, good performance can still be obtained by adjusting the transmitting operation in ConstGCN.

Model F1 AUC
ConstGCN (TransE) 61.27 61.32
   #layers $T=1$ 47.52 44.46
   #layers $T=3$ 60.87 61.69
   $f_{pool}$ = mean 61.23 61.32
   $f_{pool}$ = max 61.20 61.40
ConstGCN (DistMult) 61.06 62.21
   #layers $T=1$ 61.06 60.99
   #layers $T=3$ 61.48 61.50
ConstGCN (ComplEx) 61.41 61.73
   #layers $T=2$ 60.89 62.39
   #layers $T=3$ 59.85 61.80
Table 3: Ablation studies of ConstGCN-BERT-base. We change one component implementation at a time. These results show that a 2-layer ConstGCN learns effective representations and benefits classification performance.

4.4 Ablation Study

In this subsection, we examine the contributions of main components under different implementations.

Specifically, we explore the effect of the number of graph convolution layers under different transmitting operations, and the performance of different pooling functions. Table 3 shows the results on the development set.

Overall, we observe that a 2-layer transmission-based graph convolution learns effective entity representations and obtains the best classification F1 for all three models. In particular, the performance of the ConstGCN (TransE) model deteriorates significantly when the number of graph convolution layers is reduced to 1, while the other models remain competitive with 1 layer. This suggests that the translation-based approach requires more transmitting steps to accumulate effective information than the semantic matching-based methods. Moreover, the model performs consistently with the three pooling functions att, mean, and max.

5 Related Work

Relation Extraction

Early research efforts on relation extraction concentrated on extracting relations between entity pairs within a sentence. Various approaches, including feature-based methods Mintz et al. (2009), kernel-based methods Bunescu and Mooney (2005), and deep neural network-based methods Qin et al. (2018); Gao et al. (2020), have been shown effective in handling this problem.

Due to the specificity of data structures in the biomedical domain, some recent studies construct a document graph with heuristics and syntactic dependencies Quirk and Poon (2016); Guo et al. (2019), and then perform inference to extract n-ary interactions between biomedical entities (e.g., a 3-ary relation between a drug, a gene, and a mutation) over the entire document. More recently, with the large-scale general-purpose DocRE dataset proposed by Yao et al. (2019), there has been growing interest in extracting relations in such a multi-mention, multi-label setting Wang et al. (2019); Ye et al. (2020); Tang et al. (2020); Nan et al. (2020); Zeng et al. (2020); Xu et al. (2021); Li et al. (2021).

Knowledge Graph Embedding

Knowledge Graph Embedding (KGE) approaches Wang et al. (2017) aim to model semantic representations of entities and relations in a multi-relational graph structure, and they have been studied extensively in recent years. Most KGE approaches define a scoring function over entity and relation embeddings that constrains valid triples to score higher than invalid ones. KGE methods can be briefly classified into two major categories by the type of scoring function: translation-based methods, including TransE Bordes et al. (2013) and RotatE Sun et al. (2019), measure the score as the translating distance from the head entity to the tail entity along the relation space, while semantic matching-based methods, such as DistMult Yang et al. (2014) and ComplEx Trouillon et al. (2016), compute the score as the semantic similarity between head and tail entities in a relation-specific projection space.

6 Conclusion

This paper proposes a novel graph convolutional network that naturally learns relation-aware entity representations on heterogeneous graphs with indeterminate edges. To avoid a prior pseudo graph structure, we transmit entity representations along all relation spaces to update each entity, constraining the transmission with transmitting scores learned from Noise Contrastive Estimation to maintain the original spatial information. Experimental results show that our method substantially benefits the DocRE task.

Limitation

In this paper, we fold three typical KGE ideas into the knowledge-based graph convolution and demonstrate their effectiveness. However, there are many other approaches, such as KGE in hyperbolic space, that we do not validate here. In training ConstGCN, performing good negative sampling of entity-relation triples within documents is a complicated problem. We follow the strong-weak negative sampling strategy used in many KGE approaches, with a carefully chosen in-document sampling ratio, but this is still not sufficient. Moreover, we have argued for the compatibility of representation learning between natural texts and KGs based on ConstGCN at a general level, rather than verifying it on a case-by-case basis.

References

Appendix A Hyperparameter settings and implementation details

We train ConstGCN on one NVIDIA RTX 3090 for a maximum of 30 epochs, which takes about 5 hours. We tune the optimal hyperparameters by grid search based on the F1 score on the dev set. We apply dropout Srivastava et al. (2014) with rate 0.1 and clip gradients to a max norm of 1.0. The optimizer is AdamW Loshchilov and Hutter (2017) with an initial learning rate of 1e-5 and a linear warmup Goyal et al. (2017) for the first 6% of steps, followed by exponential decay. As shown in Tables 4 and 5, the hyperparameter values we finally adopted are in bold.

Hyperparameter Value
Batch Size 8, 6, 2
Learning Rate 0.001
Activation Function ReLU, Tanh
Positive vs. Negative Ratio 1.0, 0.5, 0.25
Entity Type Embedding Size 20
Coreference Embedding Size 20
Dropout 0.2, 0.6, 0.8
Weight Decay 0.0001
Table 4: Settings for base network
Hyperparameter Value
$T$ 1, 2, 3
$f_{pool}$ Max, Mean, Att
$f_{r}$ TransE, ComplEx, DistMult
$\gamma$ 4, 8, 12, 16, 20, 24, 28
$\mathcal{B}$ 48, 56, 64, 72, 80, 88, 96
$|\mathcal{T}_{N}^{q}|$ 10, 20, 40
$\tau$ 1.0, 2.0
$\mu$ 0.001, 0.01
Table 5: Settings for ConstGCN

Appendix B Visualization of Transmitting Scores

We visualize the transmitting scores between all entities in a specific relation space, country, in a document from the development set, as predicted by the optimal models of the three implementations. The values are shown in Figure 5 and Figure 6. We find that the transmitting scores computed during graph convolution consistently conform to the adjacency matrix of the gold graph, suggesting that our models effectively learn relation-aware structural information for entities. Moreover, the visualization clearly shows that the N-to-1 problem identified in the traditional TransE still exists in ConstGCN, which points to directions for future improvement.

Appendix C Error Analysis

An annotated example on which our model underperforms is shown below. We can see a frustrating picture: there are a large number of mistakes and omissions in the labeled data; in particular, many entities that are actually coreferent are annotated as separate entities. In Figure 4, entities of the same color, except for gray, should be labeled as coreferent entities. These omitted annotations make it difficult for the model to accurately distinguish semantic boundaries.

Edward P. "Ned" McEvoy (born 1886) was an Irish hurler who played for the Dublin and Laois senior teams. Born in Abbeyleix, County Laois, McEvoy first played competitive hurling and Gaelic football in his youth. He arrived on the inter-county scene when he first linked up with the Laois senior team before later joining the Dublin senior team before returning to Laois. McEvoy was a regular member of the starting fifteen, and won one All-Ireland medal and two Leinster medals. He was an All-Ireland runner-up on one occasion. At club level McEvoy won several championship medals as a dual player with Abbeyleix. He also won a championship medal with the Thomas Davis club.

Figure 4: An example input document and its annotated gold relational graphs from DocRED. The labels contains ad. and located in ad. refer to the relations contains administrative territorial entity and located in the administrative territorial entity, respectively. The number in brackets after an entity name indicates the number of coreferences the entity has.
Figure 5: Visualization of the transmitting scores learned between all entities in the specific relation space of country. Left: the golden adjacency matrix that each element represent a golden relation if the value is equal to 1; Right: the transmitting scores learned with the transmitting operation TransE.
Figure 6: Visualization of the transmitting scores learned between all entities in the specific relation space of country. Left: the transmitting scores learned with the transmitting operation DistMult; Right: the transmitting scores learned with the transmitting operation ComplEx.