
ConstGCN: Constrained Transmission-based Graph Convolutional Networks for Document-level Relation Extraction

Ji Qi1, Bin Xu1, Kaisheng Zeng1, Jinxin Liu1,
Jifan Yu1, Qi Gao2, Juanzi Li1, Lei Hou1
1Department of Computer Science and Technology, Tsinghua University
2Beijing Kedong Electric Control System Co. Ltd.
qj20@mails.tsinghua.edu.cn, xubin@tsinghua.edu.cn
Abstract

Document-level relation extraction with graph neural networks faces a fundamental graph construction gap between training and inference: the gold graph structure is only available during training, which forces most methods to adopt heuristic or syntactic rules to construct a prior graph as a pseudo proxy. In this paper, we propose ConstGCN, a novel graph convolutional network that performs knowledge-based information propagation between entities along all relation spaces without any prior graph construction. Specifically, it updates each entity representation by aggregating information from all other entities along each relation space, thus modeling relation-aware spatial information. To control the information flowing through the indeterminate relation spaces, we propose to constrain the propagation using transmitting scores learned via Noise Contrastive Estimation over fact triples. Experimental results show that our method outperforms previous state-of-the-art (SOTA) approaches on the DocRED dataset. The source code is publicly available at https://github.com/THU-KEG/ConstGCN.

1 Introduction

Document-level relation extraction (DocRE) aims to extract heterogeneous relational graphs of the form $\mathcal{G}=(\mathcal{E},\mathcal{R})$ from a document, where typed entities serve as nodes and multiple directional semantic relations serve as edges. In contrast to sentence-level RE Qin et al. (2018); Gao et al. (2020), DocRE has attracted growing interest because it extracts relations beyond sentence boundaries Yao et al. (2019) and thus preserves information integrity.

Figure 1: An example of a DocRED document (bottom) with one of its gold multi-relational graphs (upper left). On the upper right, in contrast to vanilla GNNs, which update an entity by accumulating the representations of syntactically adjacent entities on the pseudo graph, the proposed ConstGCN models a relation-aware structural representation of the entity by performing knowledge-based information propagation.

Previous DocRE methods tend to apply graph neural networks (GNNs) Kipf and Welling (2016); Veličković et al. (2017) as the core component, and numerous GNN variants have been proposed Guo et al. (2019); Christopoulou et al. (2019); Zeng et al. (2020); Xu et al. (2021). Similar to traditional GNNs modeled on observable graph structures (e.g., social networks Huang et al. (2019) and academic citation networks Feng et al. (2020)), these models all require a pre-specified graph construction. They mainly either rely on heuristic rules over intra- and inter-sentential information of entities and mentions Zeng et al. (2020); Christopoulou et al. (2019), or leverage syntactic dependency paths built by an external parser Sahu et al. (2019); Guo et al. (2019) to serve as the prior graph structure for the GNN. We regard such a structure as a pseudo graph structure, since it establishes each edge between a pair of entities as a binary association based on task-independent auxiliary information (heuristic or syntactic rules).

However, the gold edges that describe the relationships between two entities carry abundant multi-type semantics. Previous approaches therefore suffer from two major intrinsic issues. First, the Hindered Propagation Issue: the construction of the pseudo graph structure ignores many actual relational edges between entities, which hinders effective information acquisition and dissemination. Second, the Noisy Representation Issue: the simple information accumulation over binary associative edges on the pseudo graph makes these models struggle to capture relation-aware structural knowledge, which further results in noisy representations and harms performance. For example, in Figure 1, the representation of the entity Saint Paul is updated by simply accumulating the representations of its syntactically adjacent entities, making it similar to the entities Nobles and Wisconsin Territory, even though they are completely different entities connected by the relation place_of_birth.

Instead of introducing a prior pseudo graph structure, we present ConstGCN, a novel Constrained Transmission-based Graph Convolutional Network that performs knowledge-based information propagation between entities along all relation spaces without any prior graph construction, thereby explicitly modeling the semantics of various relationships. Specifically, we propose knowledge-based information propagation for DocRE by folding flexible Knowledge Graph Embedding (KGE) approaches Wang et al. (2017) into a general transmitting operation. At each graph convolution step, an entity representation is updated by aggregating knowledge-based information broadcast from neighboring entities along all relation spaces, so entity representations containing relation-aware structural semantics are learned effectively and directly. Since the gold graph structure of a document is unknown at inference time, it is difficult to rigorously follow relational paths when transmitting information. We therefore propose transmitting scores to constrain the information flow through the indeterminate relational edges, where the scores are learned jointly via Noise Contrastive Estimation (NCE) Mikolov et al. (2013). This allows the model to learn semantic representations while maintaining the original relation-aware structural information. As shown in Figure 1, compared to vanilla GNNs, our model learns representations with a structure isomorphic to the gold heterogeneous graph in the document.

We conduct extensive experiments on DocRED, a large-scale human-annotated dataset containing heterogeneous graphs among the entities of each document. The results show that our model achieves SOTA performance compared to previous methods. Beyond the proposed graph convolutional network, we further demonstrate the compatibility of representation learning from documents and knowledge graphs. The contributions of our work are summarized as follows:

  • We present a novel graph convolutional network, ConstGCN, that can naturally model heterogeneous graph structures with indeterminate edges based on knowledge-based information propagation.

  • We propose the constrained transmission approach, which allows the model to learn entity representations while maintaining the original relation-aware structural information.

  • We conduct experiments on the DocRE task and achieve SOTA performance. This work also demonstrates the compatibility of representation learning from documents and knowledge graphs.

2 Preliminary

2.1 Document-level Relation Extraction

Given a textual document $\mathcal{D}=\{w_{i}\}_{i=1}^{|\mathcal{D}|}$ consisting of a sequence of words and a set of typed entities $\mathcal{E}=\{e_{i}\}_{i=1}^{|\mathcal{E}|}$, where each entity refers to a set of mentions $e=\{m_{i}\}_{i=1}^{|e|}$ and each mention is a sequence of words in the document, the task of DocRE is to extract the heterogeneous graph $\{(e_{i},r_{k},e_{j})\,|\,e_{i},e_{j}\in\mathcal{E},r_{k}\in\mathcal{R}\}$, in which a pair of entity nodes $e_{i},e_{j}$ may have multiple edges, each edge $r_{k}$ refers to a specific relation type, and $\mathcal{R}$ is the set of predefined relations.

Figure 2: Overview of ConstGCN. Given the representations of entities, ConstGCN first computes the transmitting scores between entities in all relation spaces, and then performs graph convolution, updating each entity by transmitting its neighbors' information along the projection spaces of all relations under the constraints of the transmitting scores. Finally, entity representations that model the structural semantics of heterogeneous graphs are learned and used to predict the relational classes.

2.2 Knowledge Graph Embedding

A knowledge graph (KG) is a collection of triples $\mathcal{G}=\{(e_{i},r_{k},e_{j})\,|\,e_{i},e_{j}\in\mathcal{E},r_{k}\in\mathcal{R}\}$ over sets of pre-defined entity and relation types. The task of KGE is to learn vectorial representations $\mathbf{e}_{i},\mathbf{r}_{k},\mathbf{e}_{j}$ that model the heterogeneous structural information based on a scoring function $d_{r}$. Depending on the scoring function used, typical methods fall into two categories: translational distance-based approaches (e.g., TransE Bordes et al. (2013) and RotatE Sun et al. (2019)) and semantic matching-based approaches (e.g., DistMult Yang et al. (2014) and ComplEx Trouillon et al. (2016)). As shown in Table 1, the key idea behind both categories is to transmit the representation of the head entity through a relation-specific projection space so that it approximates the tail entity. We thus define a unified transmitting operation $\oplus$ over these typical KGE approaches,

Methods Scoring Function
TransE $d_{r}(e_{i},r_{k},e_{j})=\gamma-\|\mathbf{e}_{i}+\mathbf{r}_{k}-\mathbf{e}_{j}\|$
RotatE $d_{r}(e_{i},r_{k},e_{j})=\gamma-\|\mathbf{e}_{i}\circ\mathbf{r}_{k}-\mathbf{e}_{j}\|$
DistMult $d_{r}(e_{i},r_{k},e_{j})=\langle\mathbf{e}_{i},\mathbf{r}_{k},\mathbf{e}_{j}\rangle$
ComplEx $d_{r}(e_{i},r_{k},e_{j})=\mathrm{Re}(\langle\mathbf{e}_{i},\mathbf{r}_{k},\bar{\mathbf{e}}_{j}\rangle)$
Table 1: Scoring functions for typical KGE models. The L1 norm is used for all distance-based models; the subscript of $\|\cdot\|_{1}$ is dropped for brevity.
\mathbf{e}\oplus\mathbf{r}=\begin{cases}\mathbf{e}+\mathbf{r},&(\text{TransE})\\\langle\mathbf{e},\mathbf{r}\rangle,&(\text{DistMult})\\\mathrm{Re}(\langle\mathbf{e},\mathbf{r}\rangle),&(\text{ComplEx})\end{cases} (1)

where $\langle\cdot\rangle$ denotes the generalized dot product and $\mathrm{Re}$ returns the real part of a complex value. The transmitting operation above will be used to perform knowledge-based message passing in the document.
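To make the transmitting operation concrete, the following is a minimal PyTorch sketch of Eq. (1) together with the TransE and DistMult scoring functions of Table 1. The interleaved real/imaginary layout for ComplEx and the margin value are our assumptions, not fixed by the paper; note that for ComplEx we keep the full complex product so that a later dot product with the tail entity recovers the scoring function.

```python
# A sketch of the unified transmitting operation e ⊕ r (Eq. 1), assuming
# real-valued embeddings; for ComplEx the two halves of the vector are
# treated as real and imaginary parts (an assumed layout).
import torch

def transmit(e: torch.Tensor, r: torch.Tensor, mode: str = "transe") -> torch.Tensor:
    """Compute e ⊕ r for batches of embeddings of shape (..., d)."""
    if mode == "transe":
        return e + r                          # translation in relation space
    if mode == "distmult":
        return e * r                          # element-wise product
    if mode == "complex":
        e_re, e_im = e.chunk(2, dim=-1)       # split halves into Re/Im parts
        r_re, r_im = r.chunk(2, dim=-1)
        return torch.cat([e_re * r_re - e_im * r_im,
                          e_re * r_im + e_im * r_re], dim=-1)  # complex product
    raise ValueError(f"unknown mode: {mode}")

def score_transe(e_i, r_k, e_j, gamma: float = 20.0):
    # d_r(e_i, r_k, e_j) = γ - ||e_i + r_k - e_j||_1  (Table 1)
    return gamma - (transmit(e_i, r_k, "transe") - e_j).abs().sum(-1)

def score_distmult(e_i, r_k, e_j):
    # d_r(e_i, r_k, e_j) = <e_i, r_k, e_j>
    return (transmit(e_i, r_k, "distmult") * e_j).sum(-1)
```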

3 Methodology

In this section, we introduce the details of the proposed model. The overall framework is illustrated in Figure 2. Based on the entity representations obtained from a PLM encoder, ConstGCN is composed of $T$ graph convolutional layers, and each layer has two computational steps: the computation of transmitting scores, and message passing under the constraints of those scores.

3.1 PLM Encoding

Given an input document consisting of a sequence of words with a set of typed entities, we first insert a special token "*" at the start and end of each entity mention, following the entity marker technique Zhang et al. (2017); Shi and Lin (2019); Baldini Soares et al. (2019). For the processed document $\mathcal{D}=\{x_{i}\}_{i=1}^{|\mathcal{D}|}$, we then employ a pre-trained language model (e.g., BERT Devlin et al. (2019)) to obtain contextual sequence representations:

\mathbf{H}=(\mathbf{x}_{1},...,\mathbf{x}_{|\mathcal{D}|})=\text{PLM}(x_{1},...,x_{|\mathcal{D}|}), (2)

where $\mathbf{x}_{i}\in\mathbb{R}^{d^{w}}$ refers to the contextualized embedding of the $i$-th token. For documents whose sequence length exceeds the maximum input length of the encoder, we compute the representations of overlapping tokens by averaging their embeddings from different windows. We then use the embedding of the special token "*" at the start of the $u$-th mention as the mention representation $\mathbf{m}_{u}$. For the $i$-th entity with mentions $e_{i}=\{m_{u}\}_{u=1}^{|e_{i}|}$, we compute an initial coarse-grained entity representation by aggregating its coreferent mentions with the logsumexp pooling function Zhang et al. (2019):

\mathbf{e}_{i}=\log\sum_{m_{u}\in e_{i}}\exp(\mathbf{m}_{u}), (3)

We use these entity representations obtained from the PLM encoder as the initialization of the entity nodes, $\mathbf{E}^{(0)}=[\mathbf{e}_{1},\mathbf{e}_{2},...,\mathbf{e}_{|\mathcal{E}|}]^{\top}\in\mathbb{R}^{|\mathcal{E}|\times d}$.
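As an illustration, here is a minimal sketch of the entity initialization of Eq. (3), assuming the contextual embeddings H and the start-marker positions of each entity's mentions are already available; the input format is hypothetical.

```python
# A sketch of Eq. (3): logsumexp pooling over the "*" start-marker
# embeddings of an entity's mentions.
import torch

def init_entities(H: torch.Tensor, entity_mentions: list) -> torch.Tensor:
    """H: (doc_len, d) token embeddings; entity_mentions[i] is the list of
    start-marker token positions of entity i's mentions."""
    entities = []
    for starts in entity_mentions:
        m = H[torch.tensor(starts)]                 # (|e_i|, d) mention vectors
        entities.append(torch.logsumexp(m, dim=0))  # smooth-max over mentions
    return torch.stack(entities)                    # E^(0): (|E|, d)
```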

3.2 Constrained Transmission-based GCN

To model the multi-relational graph structure, ConstGCN updates each entity representation by receiving information broadcast from other entities along all relation spaces. We thus introduce the representations of relations. Instead of independently defining an embedding for each relation type, we use a variant of the basis formulation Schlichtkrull et al. (2018) that linearly combines a set of basis vectors to promote generalization. Formally, let $\{\mathbf{z}_{1},...,\mathbf{z}_{\mathcal{B}}\,|\,\mathbf{z}_{b}\in\mathbb{R}^{d}\}$ be a set of basis vectors; a relation representation is given as:

\mathbf{r}_{k}=\sum_{b=1}^{\mathcal{B}}\beta_{k_{b}}\mathbf{z}_{b}, (4)

where $\beta_{k_{b}}\in\mathbb{R}$ is a learnable scalar weight specific to the $k$-th relation space. These vectors serve as the initial representations of all relations, $\mathbf{R}^{(0)}=[\mathbf{r}_{1},...,\mathbf{r}_{|\mathcal{R}|}]^{\top}\in\mathbb{R}^{|\mathcal{R}|\times d}$, for the subsequent constrained transmission-based graph convolution.
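A minimal sketch of this basis formulation (Eq. 4) might look as follows; the initialization scale is our choice, not specified in the paper.

```python
# A sketch of Eq. (4): each relation embedding is a learned linear
# combination of B shared basis vectors.
import torch
from torch import nn

class BasisRelationEmbeddings(nn.Module):
    """R^(0): r_k = Σ_b β_{k_b} z_b over B shared bases."""
    def __init__(self, num_relations: int, num_bases: int, dim: int):
        super().__init__()
        self.bases = nn.Parameter(torch.randn(num_bases, dim) * 0.02)       # {z_b}
        self.coeffs = nn.Parameter(torch.randn(num_relations, num_bases))   # β_{k_b}

    def forward(self) -> torch.Tensor:
        return self.coeffs @ self.bases   # (|R|, d)
```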

Modeling transmitting scores with NCE

Due to the agnostic nature of the relational edges, we need to broadcast entity information along all relation spaces. We therefore need to measure the probability that a specific relationship holds between any two nodes in order to control the information flow.

Specifically, a simplified NCE objective Gutmann and Hyvärinen (2012); Mikolov et al. (2013) can be defined as:

\mathcal{L}_{nce}=\log\sigma(\phi(\mathbf{v}_{i},\mathbf{v}_{j}))+\sum_{k}\mathbb{E}_{v_{k}^{\prime}\sim p_{n}(v)}[\log\sigma(-\phi(\mathbf{v}_{k}^{\prime},\mathbf{v}_{i}))], (5)

where positive sample pairs $(v_{i},v_{j})$ are differentiated from noisy sample pairs $(v_{k}^{\prime},v_{i})$ drawn from the noise distribution $p_{n}$ by means of logistic regression, and $\phi$ is a specific measure function. Inspired by KGE, we naturally adopt the logistic term of NCE and replace the measure function $\phi$ with a KGE scoring function to calculate the transmitting score from entity $e_{i}$ to entity $e_{j}$ through relation space $r_{k}$:

f_{r}(\mathbf{e}^{(t)}_{i},\mathbf{r}^{(t)}_{k},\mathbf{e}^{(t)}_{j})=\sigma(d_{r}(\mathbf{e}^{(t)}_{i},\mathbf{r}^{(t)}_{k},\mathbf{e}^{(t)}_{j})), (6)

where the score indicates the probability that there is a relational edge $r_{k}$ from $e_{i}$ to $e_{j}$.

At the $(t{+}1)$-th layer, by computing equation (6) for all entity pairs in each relation space, we obtain the matrices of transmitting scores within all relation spaces:

[\mathcal{A}^{(t)}]_{kij}=f_{r}(\mathbf{e}^{(t)}_{i},\mathbf{r}^{(t)}_{k},\mathbf{e}^{(t)}_{j}), (7)

where $\mathcal{A}^{(t)}\in\mathbb{R}^{|\mathcal{R}|\times|\mathcal{E}|\times|\mathcal{E}|}$ refers to the tensor of transmitting scores over all relation spaces at the $t$-th convolution layer.

Note that equation (6) can also serve as an objective for KGE learning, and thus it can simultaneously be used to optimize the representations of entities and relations with a specific KGE approach.
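For concreteness, here is a sketch of Eqs. (6)-(7) with the TransE distance; any $d_r$ from Table 1 can be substituted. Materializing the full $|\mathcal{R}|\times|\mathcal{E}|\times|\mathcal{E}|\times d$ broadcast is a simplification for clarity, not an efficiency claim.

```python
# A sketch of Eqs. (6)-(7): transmitting scores for every (r_k, e_i, e_j)
# triple, using the TransE scoring function with L1 norm.
import torch

def transmitting_scores(E: torch.Tensor, R: torch.Tensor, gamma: float = 20.0) -> torch.Tensor:
    """E: (|E|, d) entities, R: (|R|, d) relations -> A: (|R|, |E|, |E|),
    where A[k, i, j] = σ(d_r(e_i, r_k, e_j))."""
    heads = E[None, :, None, :]     # e_i, broadcast over (k, i, j)
    rels  = R[:, None, None, :]     # r_k
    tails = E[None, None, :, :]     # e_j
    d = gamma - (heads + rels - tails).abs().sum(-1)   # TransE distance
    return torch.sigmoid(d)         # Eq. (6) applied to every triple
```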

Passing semantic information under constraints

For a heterogeneous graph consisting of entity nodes and relational edges in a document, an entity carries multifaceted structural semantics grounded in its neighbors under particular relations, and such multifaceted information is difficult to model with hindered message passing over binary associative edges. We propose to update each entity representation by transmitting the representations of its relation-specific neighbors along the relation spaces, thereby enhancing the multifaceted structural information.

Due to the agnostic nature of relational edges in real-world documents, we further constrain the propagation with the transmitting scores learned in the previous step to maintain the original relation-aware structure. For each entity, the update is defined as $\mathbf{e}^{(t+1)}_{i}=\sum_{k=1}^{|\mathcal{R}|}\sum_{j=1}^{|\mathcal{E}|}f_{r}(\mathbf{e}^{(t)}_{j},\mathbf{r}^{(t)}_{k},\mathbf{e}^{(t)}_{i})(\mathbf{e}^{(t)}_{j}\oplus\mathbf{r}^{(t)}_{k})$, where $\oplus$ denotes the transmitting operation of the specific KGE method introduced above. At the scale of all entities, we can rewrite this transmission compactly with tensor multiplication, which also improves computational efficiency:

\mathbf{E}^{(t+1)}_{k}=\tilde{\mathcal{A}}^{(t)}\otimes(\mathbf{E}^{(t)}\oplus\mathbf{M}_{k}^{(t)}), (8)
\mathbf{E}^{(t+1)}=f_{pool}(\mathbf{E}^{(t+1)}_{1},\mathbf{E}^{(t+1)}_{2},...,\mathbf{E}^{(t+1)}_{|\mathcal{R}|}), (9)

where $\mathbf{E}^{(t+1)}_{k}\in\mathbb{R}^{|\mathcal{E}|\times d}$ holds the entity representations corresponding to the $k$-th relation space, $\tilde{\mathcal{A}}^{(t)}$ is the transmitting-score tensor with its second and third dimensions transposed, and $\mathbf{M}_{k}^{(t)}=[\mathbf{r}^{(t)}_{k},...,\mathbf{r}^{(t)}_{k}]^{\top}\in\mathbb{R}^{|\mathcal{E}|\times d}$ is the matrix composed of $|\mathcal{E}|$ identical vectors $\mathbf{r}^{(t)}_{k}$. The operation $\otimes:(\mathbb{R}^{|\mathcal{R}|\times|\mathcal{E}|\times|\mathcal{E}|},\mathbb{R}^{|\mathcal{E}|\times d})\rightarrow\mathbb{R}^{|\mathcal{R}|\times|\mathcal{E}|\times d}$ denotes tensor multiplication. Similar to the pooling functions that aggregate feature maps along the channel dimension in computer vision, the function $f_{pool}(\mathcal{X}):\mathbb{R}^{d_{1}\times d_{2}\times d_{3}}\rightarrow\mathbb{R}^{d_{2}\times d_{3}}$ aggregates the matrices spanned by the $d_{2}$ and $d_{3}$ dimensions of $\mathcal{X}$ along the $d_{1}$ dimension. Multiple pooling operations can be used, such as $f_{pool}^{mean}$, $f_{pool}^{max}$, and $f_{pool}^{sum}$ for mean, max, and sum.
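Putting Eqs. (8)-(9) together, one ConstGCN layer with the TransE transmitting operation and sum pooling could be sketched as follows; batching all relation spaces into a single bmm call is an implementation choice, not prescribed by the paper.

```python
# A sketch of one constrained propagation layer (Eqs. 8-9), TransE variant.
import torch

def constgcn_layer(E: torch.Tensor, R: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
    """E: (|E|, d), R: (|R|, d), A: (|R|, |E|, |E|) transmitting scores.
    Returns E^(t+1): (|E|, d)."""
    A_tilde = A.transpose(1, 2)            # Ã: A_tilde[k, i, j] = f_r(e_j, r_k, e_i)
    msgs = E[None, :, :] + R[:, None, :]   # E ⊕ M_k for all k at once (TransE): (|R|, |E|, d)
    E_k = torch.bmm(A_tilde, msgs)         # ⊗ of Eq. (8): (|R|, |E|, d)
    return E_k.sum(dim=0)                  # f_pool^sum over relation spaces, Eq. (9)
```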

We introduce an attentive pooling function $f^{att}_{pool}$ that aggregates the entity representations according to their importance in each relation space:

\mathbf{Z}^{(t)}=\mathrm{softmax}\left(\frac{\mathbf{R}^{(t)}(\mathbf{E}^{(t)})^{\top}}{\sqrt{d}}\right),\qquad f_{pool}^{att}:\ \sum_{k=1}^{|\mathcal{R}|}(\mathbf{z}_{k}^{(t)})^{\top}\cdot\mathbf{E}^{(t+1)}_{k}, (10)

where $\mathbf{z}_{k}^{(t)}\in\mathbb{R}^{|\mathcal{E}|}$ refers to the $k$-th row of $\mathbf{Z}^{(t)}$, indicating the importance of the entities with respect to relation space $r_{k}$.
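A sketch of $f_{pool}^{att}$ (Eq. 10) follows; we read $(\mathbf{z}_{k})^{\top}\cdot$ as a per-entity weighting of the $k$-th relation space's output before summing over spaces, which is our interpretation of the notation.

```python
# A sketch of the attentive pooling of Eq. (10).
import torch

def attentive_pool(E_k: torch.Tensor, E: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
    """E_k: (|R|, |E|, d) per-relation updates from Eq. (8);
    E: (|E|, d), R: (|R|, d)."""
    d = E.size(-1)
    Z = torch.softmax(R @ E.T / d ** 0.5, dim=-1)   # Z^(t): (|R|, |E|) importances
    return (Z.unsqueeze(-1) * E_k).sum(dim=0)       # weight per entity, sum over k
```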

By performing $T$ layers of the graph convolution, we obtain representations of entities $\mathbf{E}^{(T)}$ and relations $\mathbf{R}^{(T)}$ that model the multi-relational spatial information.

3.3 Learning

Based on the fine-grained entity representations obtained from ConstGCN, we further derive the representation of each entity pair by supplementing related local token information via localized context pooling Zhou et al. (2021). For each entity pair $(e_{i},e_{j})$ in the current document, we obtain the localized representation by an attention-based weighted summation:

\mathbf{c}^{(i,j)}=(\boldsymbol{\alpha}^{(i)}\cdot\boldsymbol{\alpha}^{(j)})\mathbf{H}, (11)

where $\boldsymbol{\alpha}^{(i)}\in\mathbb{R}^{|\mathcal{D}|}$ refers to the attention scores from the $i$-th entity to all tokens, averaged over all layers of the pre-trained language model. The augmented representations are then obtained by:

\begin{cases}\bar{\mathbf{e}}_{i}=\tanh(\mathbf{W}_{s}\mathbf{e}_{i}^{(T)}+\mathbf{W}_{c_{1}}\mathbf{c}^{(i,j)}),\\\bar{\mathbf{e}}_{j}=\tanh(\mathbf{W}_{o}\mathbf{e}_{j}^{(T)}+\mathbf{W}_{c_{2}}\mathbf{c}^{(i,j)}),\end{cases} (12)

where $\mathbf{W}_{s},\mathbf{W}_{o},\mathbf{W}_{c_{1}},\mathbf{W}_{c_{2}}\in\mathbb{R}^{d\times d}$ are learnable weight matrices. Based on the derivation above, we employ two learning objectives to optimize the model: a cross-entropy-based classification objective that learns the final relational classes between entities, and a contrastive learning objective for the representation learning of entities and relations.
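A sketch of Eqs. (11)-(12) for one entity pair follows; the renormalization of the joint attention follows the localized context pooling of Zhou et al. (2021), and the linear modules stand in for the weight matrices.

```python
# A sketch of localized context pooling and the augmented pair
# representations (Eqs. 11-12).
import torch
from torch import nn

def pair_representation(e_i, e_j, attn_i, attn_j, H, W_s, W_o, W_c1, W_c2):
    """e_i, e_j: (d,) entity vectors from E^(T); attn_i, attn_j: (doc_len,)
    averaged PLM attention of each entity over tokens; H: (doc_len, d);
    W_*: nn.Linear(d, d) modules standing in for the weight matrices."""
    a = attn_i * attn_j                       # α^(i) · α^(j), element-wise
    a = a / (a.sum() + 1e-12)                 # renormalize (per Zhou et al., 2021)
    c = a @ H                                 # c^(i,j): localized context, Eq. (11)
    e_bar_i = torch.tanh(W_s(e_i) + W_c1(c))  # Eq. (12)
    e_bar_j = torch.tanh(W_o(e_j) + W_c2(c))
    return e_bar_i, e_bar_j
```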

3.3.1 Classification objective

Category Model Dev (Ign F1 / F1) Test (Ign F1 / F1)
Sequence-based Models:
CNN Yao et al. (2019) 41.58 43.45 40.33 42.26
LSTM Yao et al. (2019) 48.44 50.68 47.71 50.07
BiLSTM Yao et al. (2019) 48.87 50.94 48.78 51.06
ContexAware Yao et al. (2019) 48.94 51.09 48.40 50.07
BERT-base* Wang et al. (2019) - 54.16 - 53.20
BERT-2Phase-base* Wang et al. (2019) - 54.42 - 53.92
HIN-BERT-base* Tang et al. (2020) 54.29 56.31 53.70 55.60
CorefBERT-base* Ye et al. (2020) 55.32 57.51 54.54 56.96
CorefRoBERTa-large* Ye et al. (2020) 57.35 59.43 57.90 60.25
MIUK(three-view)-BERT-base* Li et al. (2021) 58.27 60.11 58.05 59.99
ATLOP-BERT-base* Zhou et al. (2021) 59.22 61.09 59.31 61.30
ATLOP-RoBERTa-large* Zhou et al. (2021) 61.32 63.18 61.39 63.40
Graph-based Models:
GAT Veličković et al. (2017) 45.17 51.44 47.36 49.51
AGGCN Guo et al. (2019) 46.29 52.47 48.89 51.45
EoG Christopoulou et al. (2019) 45.94 52.15 49.48 51.82
GCNN Sahu et al. (2019) 46.22 51.52 49.59 51.62
LSR-BERT-base* Nan et al. (2020) 52.43 59.00 56.97 59.05
GAIN-BERT-base♮ Zeng et al. (2020) 57.75 60.03 57.98 60.42
HeterGSAN-BERT-base* Xu et al. (2021) 58.13 60.18 57.12 59.45
ConstGCN-BERT-base 59.32 61.27 59.58 61.55
ConstGCN-RoBERTa-large 62.01 63.91 62.04 64.00
Table 2: Performance on the dev and test sets of DocRED. The bottom two rows show the results of our models with the TransE implementation. Results with * are reported in the original papers; results for GAT, AGGCN, EoG, and GCNN are those implemented in Nan et al. (2020); results with ♮ are our reproduced performance. Bold results indicate the best performance.

The probabilities of the final relational classes for each pair of entities in the document are calculated as follows:

P(r|e_{i},e_{j})=\sigma(\bar{\mathbf{e}}_{i}\mathbf{W}_{r}\bar{\mathbf{e}}_{j}+b_{r}), (13)

where $\mathbf{W}_{r}\in\mathbb{R}^{d\times d}$ and $b_{r}\in\mathbb{R}$ are learnable model parameters. We utilize the adaptive thresholding method Zhou et al. (2021), which introduces a learnable threshold class TH to differentiate positive from negative classes. The cross-entropy-based classification objective is then defined as:

\mathcal{L}_{cls_{1}}=-\sum_{r_{k}\in\mathcal{T}_{P}}\log\frac{\exp(P(r_{k}|e_{i},e_{j}))}{\sum_{r_{k}^{\prime}\in\mathcal{T}_{P}\cup\{TH\}}\exp(P(r_{k}^{\prime}|e_{i},e_{j}))},
\mathcal{L}_{cls_{2}}=-\log\frac{\exp(P(TH|e_{i},e_{j}))}{\sum_{r_{k}^{\prime}\in\mathcal{T}_{N}\cup\{TH\}}\exp(P(r_{k}^{\prime}|e_{i},e_{j}))},
\mathcal{L}_{cls}=\mathcal{L}_{cls_{1}}+\mathcal{L}_{cls_{2}},

where $\mathcal{T}_{P}$ and $\mathcal{T}_{N}$ are the sets of all positive and all negative relational triples, respectively. At test time, a relation class is returned as positive if its probability is higher than that of the TH class.
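A minimal sketch of this adaptive-thresholding loss for one entity pair, following the formulation of Zhou et al. (2021) and reserving class index 0 for TH (an assumed convention), is shown below.

```python
# A sketch of L_cls = L_cls1 + L_cls2 with a learnable threshold class TH.
import torch
import torch.nn.functional as F

def adaptive_threshold_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: (C,) pair scores with index 0 = TH; labels: (C,) float
    multi-hot positives (labels[0] == 0)."""
    th = torch.zeros_like(labels)
    th[0] = 1.0
    # L_cls1: softmax over {positives ∪ TH}; push each positive above TH
    logit1 = logits.masked_fill((labels + th) == 0, float("-inf"))
    loss1 = -(F.log_softmax(logit1, dim=-1) * labels).sum()
    # L_cls2: softmax over {negatives ∪ TH}; push TH above every negative
    logit2 = logits.masked_fill(labels == 1, float("-inf"))
    loss2 = -(F.log_softmax(logit2, dim=-1) * th).sum()
    return loss1 + loss2
```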

3.3.2 Contrastive learning objective

We optimize the transmission learning with Noise Contrastive Estimation (NCE) Gutmann and Hyvärinen (2012); Mikolov et al. (2013). Instead of uniform negative sampling, we leverage the self-adversarial negative sampling method introduced by Sun et al. (2019) to alleviate the problem of uninformative negatives,

\mathcal{L}_{nce}=-\sum_{i\neq j}\sum_{(e_{i},e_{j},r_{k})\in\mathcal{T}_{P}}\Big(\log\sigma(d_{r}(\bar{\mathbf{e}}_{i},\bar{\mathbf{r}}_{k},\bar{\mathbf{e}}_{j}))+\sum_{(e_{q}^{\prime},r_{k},e_{q}^{\prime\prime})\in\mathcal{T}_{N}^{q}}\varphi_{q}^{k}\log\sigma(-d_{r}(\bar{\mathbf{e}}_{q}^{\prime},\bar{\mathbf{r}}_{k},\bar{\mathbf{e}}_{q}^{\prime\prime}))\Big),
\varphi_{q}^{k}=\frac{\exp(\tau\,\sigma(-d_{r}(\bar{\mathbf{e}}_{q}^{\prime},\bar{\mathbf{r}}_{k},\bar{\mathbf{e}}_{q}^{\prime\prime})))}{\sum_{(e_{l}^{\prime},r_{k},e_{l}^{\prime\prime})\in\mathcal{T}_{N}^{q}}\exp(\tau\,\sigma(-d_{r}(\bar{\mathbf{e}}_{l}^{\prime},\bar{\mathbf{r}}_{k},\bar{\mathbf{e}}_{l}^{\prime\prime})))},

where $\mathcal{T}_{N}^{q}$ refers to the negative triples sampled from the current document $q$, and $\tau$ is the sampling temperature. Finally, the overall objective is a linear combination of the two objectives,

\mathcal{L}=\mathcal{L}_{cls}+\mu\mathcal{L}_{nce}, (14)

where $\mu$ is a hyperparameter balancing the two objectives.
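A sketch of the self-adversarial NCE term and the combined objective follows, using the TransE distance. Detaching the weights $\varphi$ from the gradient is our assumption, following the treatment in Sun et al. (2019); the batch shapes are hypothetical.

```python
# A sketch of the self-adversarial NCE loss and the overall objective (Eq. 14).
import torch
import torch.nn.functional as F

def d_r(h, r, t, gamma: float = 20.0):
    return gamma - (h + r - t).abs().sum(-1)      # TransE scoring

def self_adversarial_nce(pos, neg, tau: float = 1.0) -> torch.Tensor:
    """pos: (h, r, t) tensors of shape (P, d); neg: (h', r, t'') tensors of
    shape (P, K, d) with K negatives per positive triple."""
    pos_term = F.logsigmoid(d_r(*pos)).sum()
    neg_score = d_r(*neg)                         # (P, K)
    # self-adversarial weights φ_q^k, treated as constants (no gradient)
    phi = torch.softmax(tau * torch.sigmoid(-neg_score), dim=-1).detach()
    neg_term = (phi * F.logsigmoid(-neg_score)).sum()
    return -(pos_term + neg_term)

# Overall objective (Eq. 14): loss = L_cls + mu * L_nce
```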

4 Experiments

4.1 Experimental Settings

Dataset.

We evaluate the proposed method on a widely-used human-annotated dataset, DocRED Yao et al. (2019), which is built from Wikipedia and Wikidata and contains 5,053 documents, 132,375 entities, 56,354 relational facts, and 96 relation types. More than 40.7% of the relations can only be extracted across sentences, and 61.1% of the relations require inference skills such as logical inference. We follow the standard split of the dataset, i.e., 3,053 documents for training, 1,000 for development, and 1,000 for test.

Baselines.

We compare ConstGCN with sequence-based methods that adopt sequential encoders as the main architecture, including CNN/LSTM Yao et al. (2019), BERT-2Phase Wang et al. (2019), HIN-GloVe/HIN-BERT Tang et al. (2020), CorefBERT Ye et al. (2020), MIUK Li et al. (2021), and ATLOP Zhou et al. (2021), and graph-based methods leveraging GNNs as the backbone, including GCNN Sahu et al. (2019), GAT Veličković et al. (2017), AGGCN Guo et al. (2019), EoG Christopoulou et al. (2019), GAIN Zeng et al. (2020), LSR Nan et al. (2020), and HeterGSAN Xu et al. (2021).

Implementation.

For the foundational components, we use cased BERT-base or RoBERTa-large Liu et al. (2019) as the encoder, as in all BERT-based baselines. For ConstGCN, we empirically set the number of relation basis vectors to $\mathcal{B}=56$ and the pooling function to att, both tuned on the dev set. We investigate three implementations with different transmitting operations, ConstGCN (TransE), ConstGCN (DistMult), and ConstGCN (ComplEx), where the fixed margin of ConstGCN (TransE) is set to $\gamma=20.0$. The number of graph convolutional layers is set to 2, 2, and 1 for the three implementations, respectively, and the negative sampling proportion in transmitting-score learning is set to $1/|\mathcal{T}_{N}^{q}|=1/40$ for all implementations. All hyperparameters are tuned by grid search on dev-set performance; more training details are available in Appendix A.

Figure 3: The first two figures show the dev F1 score against the transmission cost, where the F1 score strictly increases as the transmission cost decreases until convergence. The third shows the dev F1 score for different numbers of relation types on DocRED. ConstGCN shows greater superiority as the number of relation types grows.

4.2 Main Results

The main results on the DocRED dataset under the standard metrics F1 and Ign F1 are presented in Table 2, where Ign F1 computes the F1 score excluding the relational facts shared between the training and dev/test sets. We group existing models into two categories: (1) Sequence-based Models, which adopt a sequential encoder to obtain representations; and (2) Graph-based Models, which focus on modeling the heterogeneous graph structure with GNNs independent of the encoder.

As shown in Table 2, our method achieves state-of-the-art performance on all metrics and all splits. Compared to the Graph-based Models, our method outperforms the previous best GNN-based model by 1.09 and 2.10 F1 points on the development and test sets, respectively. These results suggest that ConstGCN naturally learns more fine-grained, informative entity representations during the constrained transmission-based graph convolution. With RoBERTa-large as the encoder, ConstGCN further achieves F1 scores of 63.91 and 64.00 on the two evaluation sets, a new state of the art on DocRED.

4.3 Model Analysis

In this section, we further investigate the effectiveness of the transmission learning and the impact of ConstGCN on documents of different complexity.

How does the transmission learning affect the performance?

Figure 3 shows the learning curves of classification performance and transmission cost for all three implementations on the development set during training. As seen, the F1 scores strictly improve as the transmission cost decreases for all three models, and each model achieves its optimal classification performance when the transmission cost converges. This shows that the informative entity representations learned by optimizing the constrained transmission-based graph convolution lead to significant improvements in classification performance. We also notice that the model using the ComplEx transmitting operation performs best in terms of convergence speed and optimal value, suggesting that the model learns especially effective representations in the complex space. A visualization of the transmitting scores learned between entities in a specific document is shown in Appendix B.

What kind of environment is ConstGCN more effective in?

We further investigate the effectiveness of ConstGCN on documents of different relation-aware complexity to demonstrate its applicability. As the third sub-figure of Figure 3 shows, we split the development set of DocRED into 6 disjoint subsets by the number of relation types and evaluate models trained with or without ConstGCN on each subset. Overall, the F1 performance of both models tends to decrease as the number of relation types grows. However, when the number of relation types is larger than 10, the model with ConstGCN consistently exhibits significantly better performance than the model without it. This result indicates that broadcasting knowledge-based transmission in all relation spaces learns relation-aware structural information and is better suited to complex environments. We further conjecture that even when a document's complexity exceeds that of DocRED, good performance can still be obtained by adjusting the transmitting operation in ConstGCN.

Model F1 AUC
ConstGCN (TransE) 61.27 61.32
   #layers $T=1$ 47.52 44.46
   #layers $T=3$ 60.87 61.69
   $f_{pool}$ = mean 61.23 61.32
   $f_{pool}$ = max 61.20 61.40
ConstGCN (DistMult) 61.06 62.21
   #layers $T=1$ 61.06 60.99
   #layers $T=3$ 61.48 61.50
ConstGCN (ComplEx) 61.41 61.73
   #layers $T=2$ 60.89 62.39
   #layers $T=3$ 59.85 61.80
Table 3: Ablation studies of ConstGCN-BERT-base. We change one component implementation at a time. These results show that a 2-layer ConstGCN learns effective representations and benefits classification performance.

4.4 Ablation Study

In this subsection, we examine the contributions of main components under different implementations.

Specifically, we explore the effect of the number of graph convolution layers under different transmitting operations, and the performance of different pooling functions. Table 3 shows the results on the development set.

Overall, we observe that a 2-layer transmission-based graph convolution learns effective entity representations and obtains the best classification F1 for all three models. In particular, the performance of the ConstGCN (TransE) model deteriorates significantly when the number of graph convolution layers is reduced to 1, while the other models remain competitive with 1 layer. This suggests that the translation-based approach requires more transmitting steps to accumulate effective information than the semantic matching-based methods. Moreover, the model performs consistently with the three pooling functions att, mean, and max.

5 Related Work

Relation Extraction

Early research efforts on relation extraction concentrated on extracting relations between entity pairs within a sentence. Various approaches, including feature-based methods Mintz et al. (2009), kernel-based methods Bunescu and Mooney (2005), and deep neural network-based methods Qin et al. (2018); Gao et al. (2020), have been shown effective in handling this problem.

Due to the specificity of data structures in the biomedical domain, some recent studies construct a document graph with heuristics and syntactic dependencies Quirk and Poon (2016); Guo et al. (2019), and then perform inference to extract n-ary interactions between biomedical entities (e.g., a 3-ary relation between a drug, a gene, and a mutation) over the entire document. More recently, with the large-scale general-purpose DocRE dataset proposed by Yao et al. (2019), there has been growing interest in extracting relations in such a multi-mention, multi-label setting Wang et al. (2019); Ye et al. (2020); Tang et al. (2020); Nan et al. (2020); Zeng et al. (2020); Xu et al. (2021); Li et al. (2021).

Knowledge Graph Embedding

Knowledge Graph Embedding (KGE) approaches Wang et al. (2017) aim to model semantic representations of entities and relations in a multi-relational graph structure, and they have been studied extensively in recent years. Most KGE approaches define a scoring function over entity and relation embeddings that constrains valid triples to score higher than invalid ones. KGE methods can be briefly classified into two major categories by the type of scoring function: translation-based methods, including TransE Bordes et al. (2013) and RotatE Sun et al. (2019), measure the score as the translating distance from the head entity to the tail entity along the relation space, while semantic matching-based methods, such as DistMult Yang et al. (2014) and ComplEx Trouillon et al. (2016), compute the score as the semantic similarity between head and tail entities in a relation-specific projection space.

6 Conclusion

This paper proposes a novel graph convolutional network that naturally learns relation-aware entity representations on heterogeneous graphs with indeterminate edges. To avoid a prior pseudo graph structure, we transmit entity representations along all relation spaces to update each entity, constraining the transmission with transmitting scores learned from Noise Contrastive Estimation to maintain the original spatial information. Experimental results show that our method substantially benefits the DocRE task.

Limitation

In this paper, we fold three typical KGE ideas into the knowledge-based graph convolution and demonstrate their effectiveness. However, there are many other approaches, such as KGE in hyperbolic space, that we do not validate here. In training ConstGCN, performing good negative sampling of entity-relation triples within documents is a complicated problem. We follow the strong-weak negative sampling strategy used in many KGE approaches, with a carefully chosen in-document sampling ratio, but this is still not sufficient. Moreover, we have argued for the compatibility of representation learning between natural texts and KGs based on ConstGCN at a general level, rather than verifying it on a case-by-case basis.

References

Appendix A Hyperparameter settings and implementation details

We train ConstGCN on one NVIDIA RTX 3090 for a maximum of 30 epochs, which takes about 5 hours. We tune the optimal hyperparameters by grid search based on the F1 score on the dev set. We apply dropout Srivastava et al. (2014) with rate 0.1 and clip gradients to a max norm of 1.0. The optimizer is AdamW Loshchilov and Hutter (2017) with an initial learning rate of 1e-5 and a linear warmup Goyal et al. (2017) for the first 6% of steps, followed by exponential decay. As shown in Tables 4 and 5, the hyperparameter values we finally adopted are in bold.

Hyperparameter Value
Batch Size 8, 6, 2
Learning Rate 0.001
Activation Function ReLU, Tanh
Positive vs. Negative Ratio 1.0, 0.5, 0.25
Entity Type Embedding Size 20
Coreference Embedding Size 20
Dropout 0.2, 0.6, 0.8
Weight Decay 0.0001
Table 4: Settings for base network
Hyperparameter Value
$T$ 1, 2, 3
$f_{pool}$ Max, Mean, Att
$f_{r}$ TransE, ComplEx, DistMult
$\gamma$ 4, 8, 12, 16, 20, 24, 28
$\mathcal{B}$ 48, 56, 64, 72, 80, 88, 96
$|\mathcal{T}_{N}^{q}|$ 10, 20, 40
$\tau$ 1.0, 2.0
$\mu$ 0.001, 0.01
Table 5: Settings for ConstGCN

Appendix B Visualization of Transmitting Scores

We visualize the transmitting scores between all entities in a specific relation space, country, in a document from the development set, as predicted by the optimal models of the three implementations. The values are shown in Figure 5 and Figure 6. We find that the transmitting scores computed during graph convolution consistently conform to the adjacency matrix of the gold graph, suggesting that our models effectively learn relation-aware structural information for entities. Moreover, the visualization clearly shows that the N-to-1 problem identified in the traditional TransE still exists in ConstGCN, which points to directions for future improvement.

Appendix C Error Analysis

An annotated example on which our model underperforms is shown below. We can see a frustrating picture: there are a large number of mistakes and omissions in the labeled data; in particular, many entities that are actually coreferent are annotated as separate entities. In Figure 4, entities of the same color, except for gray, should be labeled as coreferent entities. These omitted annotations make it difficult for the model to accurately distinguish semantic boundaries.

Edward P. "Ned" McEvoy (born 1886) was an Irish hurler who played for the Dublin and Laois senior teams. Born in Abbeyleix, County Laois, McEvoy first played competitive hurling and Gaelic football in his youth. He arrived on the inter-county scene when he first linked up with the Laois senior team before later joining the Dublin senior team before returning to Laois. McEvoy was a regular member of the starting fifteen, and won one All-Ireland medal and two Leinster medals. He was an All-Ireland runner-up on one occasion. At club level McEvoy won several championship medals as a dual player with Abbeyleix. He also won a championship medal with the Thomas Davis club.

Figure 4: An example input document and its annotated gold relational graphs from DocRED. The labels contains ad. and located in ad. refer to the relations contains administrative territorial entity and located in the administrative territorial entity, respectively. The number in brackets after an entity name indicates the number of coreferences the entity has.
Figure 5: Visualization of the transmitting scores learned between all entities in the specific relation space of country. Left: the golden adjacency matrix that each element represent a golden relation if the value is equal to 1; Right: the transmitting scores learned with the transmitting operation TransE.
Figure 6: Visualization of the transmitting scores learned between all entities in the specific relation space of country. Left: the transmitting scores learned with the transmitting operation DistMult; Right: the transmitting scores learned with the transmitting operation ComplEx.