
Orthogonal Attention: A Cloze-Style Approach to Negation Scope Resolution

Aditya Khandelwal
College of Engineering Pune
Pune, India
khandelwalar16.comp@coep.ac.in
Vahida Attar
College of Engineering Pune
Pune, India
vahida.comp@coep.ac.in
Abstract

Negation Scope Resolution is an extensively researched problem: locating the words in a sentence affected by a negation cue. Recent work has shown that simply finetuning transformer-based architectures yields state-of-the-art results on this task. In this work, we treat Negation Scope Resolution as a Cloze-Style task, with the sentence as the Context and the cue words as the Query. We also introduce a novel Cloze-Style Attention mechanism called Orthogonal Attention, which is inspired by Self Attention. First, we propose a framework for developing Orthogonal Attention variants, and then propose 4 Orthogonal Attention variants: OA-C, OA-CA, OA-EM, and OA-EMB. Using these Orthogonal Attention layers on top of an XLNet backbone, we outperform the finetuned XLNet state-of-the-art for Negation Scope Resolution, achieving the best results to date on all 4 datasets we experiment with: BioScope Abstracts, BioScope Full Papers, SFU Review Corpus and the *sem 2012 Dataset (Sherlock).

1 Introduction

Negation Scope Resolution involves finding the words in a sentence whose meaning was affected by the use of a negation cue (a word that expresses negation). Consider the following examples:

  1. This place is[n’t] familiar.

  2. I do [not] know the answer.

  3. I am [neither] a saint [nor] a sinner.

The words enclosed in square brackets are the negation cues, and the words in italics are their corresponding scopes. As we can see, negation cues can be of multiple types: an affix (1), a single word cue (2) and a multi-word cue (3). A sentence can also have multiple cue words, each of which can have different scopes. Hence, the input to any system performing Negation Scope Resolution is the sentence-negation cue pair.

Approaches to solve the task of negation scope resolution have varied significantly over the years, ranging from simple rule based systems to BiLSTM and CRF classifiers. To represent the cue word input, traditionally, these methods utilized either cue dependent hand-crafted features (for CRF classifiers), or an additional binary input vector representing the cue words in the sentence. More recently though, Khandelwal and Sawant (2020) and Britto and Khandelwal (2020) used transformer-based architectures to address the task, and represented the cue words in the sentence via a preprocessing strategy: by augmenting the input sentence with a special token which is added before each cue word to tell the system that the following word is a cue word.

In this paper, we propose a novel approach to solving the problem of Negation Scope resolution: by viewing it as a Cloze-Style task, where the sentence is used as the Context input and the cue words are used as the Query input. We also develop a novel Cloze-Style Attention mechanism called Orthogonal Attention, which uses the key-query-value structure used in Self-Attention Vaswani et al. (2017).

Cloze-Style Question Answering (and Machine Reading Comprehension) are 2 classes of problems that involve using 2 distinct inputs to produce the desired output. In most cases, the input is a query-context pair $\langle Q,C\rangle$ from which an answer $\langle A\rangle$ is generated. This is akin to posing a Question (Query) over a paragraph containing information (Context). The format of the answer can vary significantly, from pointing to a part of the context that contains the answer, as in SQuAD Rajpurkar et al. (2016), to filling in the blanks of the Query using relevant information from the Context. Thus, these tasks require modelling the interaction between the query and the context, to produce the corresponding answer. Such tasks became popular after the release of the CNN and Daily News Datasets in Hermann et al. (2015).

Given the success of Attention Mechanisms for traditional Natural Language Processing tasks (like translation), researchers have also developed Attention mechanisms to address Cloze-Style tasks (by using them to model the interaction between the Query and Context). A few examples of such Attention mechanisms include: Attention Sum Readers Kadlec et al. (2016), Attention over Attention Cui et al. (2016), Gated-Attention Dhingra et al. (2016), and Dynamic Coattention Networks Xiong et al. (2016). Most attention mechanisms used a similar approach: compute a score for each query word-context word pair, and use that score to perform some form of a weighted summation of certain vectors to generate context-aware representations of the query and query-aware representations of the context. We review a few such attention mechanisms in Section 2.

The novel Cloze-Style Attention mechanism we propose is called Orthogonal Attention. We propose a framework to develop Orthogonal Attention variants, and propose 4 such variants (OA-C, OA-CA, OA-EM, OA-EMB). The exact specifications are covered in Section 3, but at its core, Orthogonal Attention uses the key-query-value structure and multiheaded design found in Self Attention Vaswani et al. (2017).

These layers are then appended to the current state-of-the-art architecture for Negation Scope Resolution (XLNet-base), and we observe that adding Orthogonal Attention layers yields improvements over the current state-of-the-art results.

2 Literature Review

2.1 Attention for Cloze-Style Tasks

Kadlec et al. (2016) proposed Attention Sum Reader, wherein they performed a dot product between the question embedding and each context word embedding to get probability scores per word of the context.

Dhingra et al. (2016) proposed a Gated Attention Reader Network to compute a query-aware representation of the context words. They used a multi-hop architecture with k hops (layers), where each layer incrementally infused the context embeddings with information from the query embeddings. They used a “Gated-Attention Module” for each word of the context, which performs attention over the query words using the context word, followed by an element-wise multiplication between the summarized query and the context word. Each layer computes a different query representation using a separate BiGRU, and the output of the Gated-Attention Modules is also passed through a BiGRU before being fed to the next layer.

Seo et al. (2016) proposed Bidirectional Attention Flow (BiDAF), wherein they first computed a similarity matrix between the context words and the query words, which was then used to compute context-aware query representations and query-aware context representations. The context-to-query attention was modelled as an attention over the context representation using query tokens, where the attention weights came from the similarity matrix. The query-to-context attention produces a weighted sum of the query words, using the attention weights as the maximum across the columns of the similarity matrix. Finally, all the generated matrices and the original context embeddings were concatenated to generate the query-aware representation of each word.

Wang et al. (2017a) introduced a Gated Self-Matching Network to perform Question Answering. Their model included a gated matching layer to match the question and the context, and a self-matching layer to aggregate information from the whole passage. This was done by using an RNN to summarize a concatenation of the context embedding at time $t$ ($u_{t}$) and a special context vector $c_{t}$, which was derived from an attention over the query embeddings using the previous output of the summarizer RNN and $u_{t}$. Thus, they generated a summary of the context and the query. The self-matching layer allowed each context embedding to gain information from its surrounding context.

Xiong et al. (2016) proposed the Dynamic Coattention Network, which uses a coattention encoder to model the interaction between the query and context. The coattention encoder involved computation of an affinity matrix from the dot products between the query and context words: $L=C^{T}Q$. This matrix was then normalized row-wise to produce $A^{Q}$ (attention weights over the context for each query word), and column-wise to produce $A^{C}$ (attention weights over the query for each context word). $A^{Q}$ was further used to compute summaries of the context: $C^{Q}=CA^{Q}$. Then, $A^{C}$ was used to summarise the concatenation of $C^{Q}$ and $Q$, i.e. the summary of context words for each word of the query and the query word itself: $C^{C}=[Q;C^{Q}]A^{C}$. As a final step, they passed the concatenation of the context and the query-aware representation of the context through a BiLSTM.
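The coattention computation described above can be written compactly. The following is a minimal sketch (not the authors' code), assuming column-oriented matrices `C` of shape [d, m] for the context and `Q` of shape [d, n] for the query, with the softmax axes chosen to match the description above:

```python
import torch

def coattention(C: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    # Affinity scores between every context/query word pair: L = C^T Q, shape [m, n]
    L = C.T @ Q
    # A^Q: attention weights over the context for each query word (normalize the context axis)
    A_Q = torch.softmax(L, dim=0)              # [m, n]
    # A^C: attention weights over the query for each context word (normalize the query axis)
    A_C = torch.softmax(L, dim=1).T            # [n, m]
    # C^Q: a summary of the context per query word
    C_Q = C @ A_Q                              # [d, n]
    # C^C = [Q; C^Q] A^C: summarize the query words and their context summaries per context word
    C_C = torch.cat([Q, C_Q], dim=0) @ A_C     # [2d, m]
    return C_C
```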

2.2 Negation Scope Resolution

Khandelwal and Sawant (2020) provide an extensive summary of papers using non-transformer based models addressing Negation Scope Resolution. We summarise their paper, and the follow-up paper by Britto and Khandelwal (2020), which use transformer-based architectures to address this task. They report state-of-the-art results, and we use their models as the baseline.

Khandelwal and Sawant (2020) proposed using BERT Devlin et al. (2018) to address negation scope resolution. The transfer learning capability of such models allowed them to perform much better than all previous approaches. They encoded the cue words in the sentence using a preprocessing strategy, by adding special tokens representing the type of the cue word before the cue words. For example:
I do not know the answer. → I do [cue_tok] not know the answer.
This yielded the best results to date on all datasets (BioScope Full Papers and Abstract Subcorpora, SFU Review Corpus and Sherlock Dataset).

Britto and Khandelwal (2020) went a step further and used XLNet and RoBERTa in place of BERT, improving results even further.

The strategy adopted by these papers could be summarized as follows:

$Transformer \in \{BERT, XLNet, RoBERTa\}$

$X = Transformer(InputSentence)$

$Y_{i} = W X_{i} + b \quad \forall X_{i} \in X$
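A hedged sketch of this baseline strategy is shown below: a pretrained transformer encoder followed by a per-token linear classifier. The class name, label count, and checkpoint string are illustrative choices, not details from the cited papers.

```python
import torch.nn as nn
from transformers import XLNetModel

class BaselineScopeTagger(nn.Module):
    def __init__(self, num_labels: int = 2):
        super().__init__()
        self.encoder = XLNetModel.from_pretrained("xlnet-base-cased")
        self.classifier = nn.Linear(self.encoder.config.d_model, num_labels)

    def forward(self, input_ids, attention_mask):
        # X = Transformer(InputSentence)
        X = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # Y_i = W X_i + b for every token X_i
        return self.classifier(X)
```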

2.3 Self-Attention

Self-attention was introduced in Attention is All you Need Vaswani et al. (2017) as a key part of the transformer architecture. It was designed to obtain a contextual representation of each token. The process was as follows: extract Key ($K^{SA}$), Query ($Q^{SA}$) and Value ($V^{SA}$) vectors from each input vector, and then perform a Scaled Dot-Product Attention operation to compute the attention weights for each token $i$, using the token’s query vector $Q^{SA}_{i}$ and all Key vectors $K^{SA}$. These weights were used to summarise the Value vectors $V^{SA}$, to compute the contextual representation. This operation was repeated multiple times in parallel (using different weight matrices), and their outputs combined using a weight matrix. This was called Multiheaded Self Attention. We can summarise the above process using the following equations:

$Z_{i}^{k}=\sum_{j}softmax\left(\frac{Q^{SA}_{i}\bullet K^{SA}_{j}}{\sqrt{d_{k}}}\right)V^{SA}_{j}$

$Z=Concat_{k}(Z^{k})(W^{O})^{T}$
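For concreteness, a compact re-implementation of multiheaded self-attention corresponding to the equations above is sketched here (for illustration only; masking and dropout are omitted):

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k, self.n_heads = d_model // n_heads, n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # combines the heads (W^O)

    def forward(self, X):                        # X: [batch, seq, d_model]
        B, T, _ = X.shape
        def split(t):                            # -> [batch, heads, seq, d_k]
            return t.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        Q, K, V = split(self.W_q(X)), split(self.W_k(X)), split(self.W_v(X))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)   # scaled dot products
        Z = torch.softmax(scores, dim=-1) @ V                    # per-head summaries Z^k
        Z = Z.transpose(1, 2).reshape(B, T, -1)                  # Concat_k(Z^k)
        return self.W_o(Z)
```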

3 Orthogonal Attention

Inspired by self-attention Vaswani et al. (2017), we look to use the key-query-value structure and the multiheaded process to create an Attention mechanism to address Cloze-Style tasks. We first detail the overall modelling framework, and then explain individual components.

We use the following notational convention: $Q$ is the Query input, $C$ is the Context input, $Q^{S}$ are the Query vectors for Orthogonal Attention, $K^{S}$ are the Key vectors for Orthogonal Attention, $V^{S}$ are the Value vectors for Orthogonal Attention, $Q^{D}$ are the Query vectors for Dot-Product Attention and $C^{D}$ are the Context vectors for Dot-Product Attention.

3.1 Orthogonal Attention Encoder Block (OA)

Figure 1: Orthogonal Attention Block

Multiheaded Orthogonal Attention: $Z=f(C,Q)$

Residual Connection: $X_{1}=Z+C$

Layer Normalization: $X_{2}=LayerNorm(X_{1})$

Multiheaded Self-Attention: $X_{3}=SelfAtt(X_{2})$

2 Fully Connected Layers: $X_{4}=ReLU(X_{3}W_{1}^{T}+b_{1})W_{2}^{T}$

Residual Connection: $X_{5}=X_{4}+X_{2}$

Layer Normalization: $C^{\prime}=LayerNorm(X_{5})$

This is quite similar to the transformer encoder block proposed in Vaswani et al. (2017).

We use a Multiheaded Self-Attention module after the Multiheaded Orthogonal Attention module so as to allow the query-aware contextual representations to exchange information among themselves. This is similar in function to the self-matching layer by Wang et al. (2017b).
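The block can be sketched as follows. This is a minimal illustration of the steps listed above, assuming a multiheaded orthogonal attention module $f(C,Q)$ and a self-attention module are supplied (module names and the feed-forward width are assumptions, not the authors' specification):

```python
import torch.nn as nn

class OAEncoderBlock(nn.Module):
    def __init__(self, d_model, ortho_attn, self_attn, d_ff=2048):
        super().__init__()
        self.ortho_attn = ortho_attn            # multiheaded orthogonal attention f(C, Q)
        self.self_attn = self_attn              # multiheaded self-attention
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, C, Q):
        Z = self.ortho_attn(C, Q)               # Z = f(C, Q)
        X2 = self.norm1(Z + C)                  # residual connection + layer norm
        X3 = self.self_attn(X2)                 # multiheaded self-attention
        X4 = self.ff(X3)                        # 2 fully connected layers
        return self.norm2(X4 + X2)              # residual connection + layer norm -> C'
```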

3.2 Multiheaded Orthogonal Attention

First, we detail the design of an individual head of Orthogonal Attention:

Figure 2: Orthogonal Attention Single Head

We use a function $\alpha(C,Q)$ to generate $\|q\| \cdot \|c\|$ key-value pairs ($K^{S}_{ij}$ and $V^{S}_{ij}$), and $\beta(C,Q)$ to generate $\|q\|$ query vectors ($Q^{S}_{j}$). Then, we summarise the value vectors using the key and query vectors, similar to self-attention. Specifically, the summation is performed for each context word ($i$'s) over all $j$'s in $V^{S}_{ij}$. This generates a context word representation that is query-aware. This operation is performed in a multiheaded fashion. The following equations summarise the entire process:

$C^{Q}_{i}=\sum_{j}softmax\left(\frac{K^{S}_{ij}\bullet Q^{S}_{j}}{\sqrt{d_{k}}}\right)V^{S}_{ij}$

Specifically,

Query: $Q$ ... $[n,d]$

Context: $C$ ... $[m,d]$

$Q^{S}=\beta(Q,C)$ ... $[n,d_{k}]$

$K^{S}=\alpha(Q,C)$ ... $[m,n,d_{k}]$

$V^{S}=\alpha(Q,C)$ ... $[m,n,d_{v}]$

Attention Weights:
$W^{S}=Dropout(Softmax(Diag(K^{S}(Q^{S})^{T})))$ ... $[m,n,1]$

$C^{Q}_{k}=Sum(W^{S}*V^{S},axis=1).squeeze(1)$ ... $[m,d_{v}]$

$C^{Q}_{k}$ represents the output of the $k^{th}$ Orthogonal Attention head. $C^{O}$ represents the output of the Multiheaded Orthogonal Attention layer:

$C^{O}=f(C,Q)=Concat_{k}([C^{Q}_{k}],axis=1)(W^{D})^{T}$ ... $[m,d]$

Like Self-Attention, we choose $d_{k}=d_{v}=\frac{d}{n\_heads}$. $W^{D}$ is used to project the output back to a $d$-dimensional space.
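A single head can be sketched as below, assuming $\alpha$ and $\beta$ have already produced $K^{S}$, $V^{S}$ of shape [m, n, d_k]/[m, n, d_v] and $Q^{S}$ of shape [n, d_k] (the function name is ours, not the paper's):

```python
import math
import torch

def orthogonal_attention_head(K_S, V_S, Q_S, dropout=None):
    d_k = Q_S.shape[-1]
    # Diag(K^S (Q^S)^T): one score per (context word i, query word j) pair
    scores = (K_S * Q_S.unsqueeze(0)).sum(dim=-1) / math.sqrt(d_k)   # [m, n]
    W_S = torch.softmax(scores, dim=1)                                # weights over query words j
    if dropout is not None:
        W_S = dropout(W_S)
    # Weighted sum of the value vectors over j -> query-aware context representation
    return (W_S.unsqueeze(-1) * V_S).sum(dim=1)                       # [m, d_v]

# Heads run in parallel; their outputs are concatenated and projected by W^D:
# C^O = Concat_k(C^Q_k) (W^D)^T
```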

3.3 Orthogonal Attention Variants

Orthogonal Attention models the interaction between the Query and Context using 2 functions, $\alpha$ and $\beta$, which are used to produce the key, query and value vectors $K^{S}$, $Q^{S}$, and $V^{S}$. In this section, we detail 4 choices for $\alpha$ and $\beta$.

We experiment with 2 primary modes of interaction between the query vectors $Q_{j}$ and context vectors $C_{i}$ to generate $K^{S}$ and $V^{S}$:

  • An elementwise multiplication operation (this is used in OA-EM and OA-EMB). This approach is similar to Dhingra et al. (2016), who also relied on element-wise multiplicative interaction between query and context words. This could be thought of as the query acting as a filter for each context dimension.

  • A 1D convolution operation using the Query vectors $Q_{j}$ as filters to convolve over the Context vectors $C_{i}$ (this is used in OA-C and OA-CA). The intuition is that instead of an elementwise multiplicative filter, we generate filters from each query word that can be convolved over the context word embeddings.

To generate $Q^{S}$, we experiment with 2 choices:

  • Only using the Query input $Q$ (this is used in OA-C and OA-EM).

  • Using both the Query $Q$ and the Context $C$ (this is used in OA-CA and OA-EMB). This choice is inspired by Seo et al. (2016), who proposed that using bidirectionality helps the model.

3.3.1 OA-EM

Here, $\alpha^{EM}$ uses Elementwise Multiplication to model the interaction between each context word-query word pair. $\beta^{EM}$ only uses $Q$ to generate $Q^{S}$. Formally,

Figure 3: OA-EM: Internal Design
  • $\alpha^{EM}(C,Q)$:

    Linear: $C^{1}=ReLU(CW_{0}^{T}+b_{0})$ ... $[m,d_{k}]$

    Linear: $Q^{1}=ReLU(QW_{1}^{T}+b_{1})$ ... $[n,d_{k}]$

    Elementwise Multiplication: $X^{1}=C^{1}.reshape(m,1,d_{k})\odot Q^{1}.reshape(1,n,d_{k})$ ... $[m,n,d_{k}]$

    Linear: $\alpha^{EM}(C,Q)=ReLU(X^{1}W_{2}^{T}+b_{2})$ ... $[m,n,d_{k}]$

  • $\beta^{EM}(C,Q)$:

    Linear: $\beta^{EM}(C,Q)=ReLU(QW_{3}^{T}+b_{3})$ ... $[n,d_{k}]$
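A minimal sketch of $\alpha^{EM}$ and $\beta^{EM}$ following the specification above (class names and inputs `C`: [m, d], `Q`: [n, d] are our assumptions):

```python
import torch
import torch.nn as nn

class AlphaEM(nn.Module):
    def __init__(self, d, d_k):
        super().__init__()
        self.lin_c = nn.Linear(d, d_k)
        self.lin_q = nn.Linear(d, d_k)
        self.lin_out = nn.Linear(d_k, d_k)

    def forward(self, C, Q):
        C1 = torch.relu(self.lin_c(C))              # [m, d_k]
        Q1 = torch.relu(self.lin_q(Q))              # [n, d_k]
        X1 = C1.unsqueeze(1) * Q1.unsqueeze(0)      # elementwise product per (i, j) pair, [m, n, d_k]
        return torch.relu(self.lin_out(X1))         # [m, n, d_k]

class BetaEM(nn.Module):
    def __init__(self, d, d_k):
        super().__init__()
        self.lin_q = nn.Linear(d, d_k)

    def forward(self, C, Q):
        return torch.relu(self.lin_q(Q))            # uses only Q, [n, d_k]
```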

3.3.2 OA-EMB

Here, $\alpha^{EMB}$ uses Elementwise Multiplication to model the interaction between each context word-query word pair, just like $\alpha^{EM}$. Unlike OA-EM though, $\beta^{EMB}$ generates $Q^{S}$ as context-aware query representations using elementwise multiplication. Specifically, it uses the dot product attention as implemented in the PyTorchNLP library (https://pytorchnlp.readthedocs.io) to calculate a summary of the context $C$ per query word $Q_{j}$. This generates a vector summary of $C$ for each query word, which is then multiplied elementwise with the query vectors $Q$ themselves to get the context-aware query representation $Q^{S}$. Formally,

Figure 4: OA-EMB: Internal Design
  • $\alpha^{EMB}(C,Q)$:

    Elementwise Multiplication: $\alpha^{EMB}(C,Q)=\alpha^{EM}(C,Q)$ ... $[m,n,d_{k}]$

  • $\beta^{EMB}(C,Q)$:

    Linear: $C^{1}=ReLU(CW_{0}^{T}+b_{0})$ ... $[m,d_{k}]$

    Linear: $Q^{1}=ReLU(QW_{1}^{T}+b_{1})$ ... $[n,d_{k}]$

    Dot Product Attention: $C^{Q}=Attention(Q^{D}=Q^{1},C^{D}=C^{1})$ ... $[n,d_{k}]$

    Elementwise Multiplication: $Q^{2}=Q^{1}\odot C^{Q}$ ... $[n,d_{k}]$

    Linear: $\beta^{EMB}(C,Q)=ReLU(Q^{2}W_{2}^{T}+b_{2})$ ... $[n,d_{k}]$
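A sketch of $\beta^{EMB}$ is given below. The paper uses the PyTorchNLP attention implementation; here the dot-product attention is written out inline (unscaled softmax over the context), which is an approximation of that library call, and the class name is ours:

```python
import torch
import torch.nn as nn

class BetaEMB(nn.Module):
    def __init__(self, d, d_k):
        super().__init__()
        self.lin_c = nn.Linear(d, d_k)
        self.lin_q = nn.Linear(d, d_k)
        self.lin_out = nn.Linear(d_k, d_k)

    def forward(self, C, Q):
        C1 = torch.relu(self.lin_c(C))              # [m, d_k]
        Q1 = torch.relu(self.lin_q(Q))              # [n, d_k]
        # Dot-product attention: a summary of the context for each query word
        weights = torch.softmax(Q1 @ C1.T, dim=-1)  # [n, m]
        C_Q = weights @ C1                          # [n, d_k]
        Q2 = Q1 * C_Q                               # elementwise multiplication
        return torch.relu(self.lin_out(Q2))         # context-aware Q^S, [n, d_k]
```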

3.3.3 OA-C

Here, $\alpha^{C}$ uses a 1D convolution operation to model the interaction between each context word-query word pair. Specifically, we use $Q$ to generate 1D convolutional filters, which are then convolved over each context word $C_{i}$ separately. We use $\sqrt{d_{k}}$ filters of size $\sqrt{d_{k}}$, with a stride of $\sqrt{d_{k}}$. The resulting feature maps are flattened to produce embeddings for each context word-query word pair, from which $K^{S}$ and $V^{S}$ are generated. $\beta^{C}$ only uses $Q$ to generate $Q^{S}$, just like $\beta^{EM}$. Formally,

Figure 5: OA-C: Internal Design
  • $\alpha^{C}(C,Q)$:

    Linear: $C^{1}=ReLU(CW_{0}^{T}+b_{0})$ ... $[m,d_{k}]$

    Linear (Generate convolutional filters): $W_{conv}=QW_{1}^{T}+b_{1}$ ... $[n,\sqrt{d_{k}}]$

    Linear (Generate convolutional biases): $b_{conv}=QW_{2}^{T}+b_{2}$ ... $[n,1]$

    1D Convolution: $X=Conv1D(input=C^{1},filters=(W_{conv},b_{conv}),stride=\sqrt{d_{k}})$ ... $[m,n,d_{k}]$

    Linear: $\alpha^{C}(C,Q)=ReLU(XW_{3}^{T}+b_{3})$ ... $[m,n,d_{k}]$

  • $\beta^{C}(C,Q)$:

    Linear: $\beta^{C}(C,Q)=\beta^{EM}(C,Q)$ ... $[n,d_{k}]$
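The sketch below illustrates $\alpha^{C}$. The exact filter shapes are slightly open in the text; as an assumption we map each query word to $\sqrt{d_{k}}$ filters of size $\sqrt{d_{k}}$ (i.e. a $d_{k}$-dimensional linear output reshaped into a filter bank), and exploit the fact that a stride-$\sqrt{d_{k}}$ convolution over a $d_{k}$-dimensional vector is a window-by-filter matrix product:

```python
import math
import torch
import torch.nn as nn

class AlphaC(nn.Module):
    def __init__(self, d, d_k):
        super().__init__()
        self.s = int(math.isqrt(d_k))
        assert self.s * self.s == d_k, "d_k must be a perfect square"
        self.lin_c = nn.Linear(d, d_k)
        self.lin_w = nn.Linear(d, d_k)      # filter generator, reshaped to [s, s] per query word
        self.lin_b = nn.Linear(d, 1)        # bias generator
        self.lin_out = nn.Linear(d_k, d_k)

    def forward(self, C, Q):
        m, n, s = C.shape[0], Q.shape[0], self.s
        C1 = torch.relu(self.lin_c(C)).view(m, s, s)      # s windows of size s per context word
        W_conv = self.lin_w(Q).view(n, s, s)              # s filters of size s per query word
        b_conv = self.lin_b(Q)                            # [n, 1]
        # Strided 1D convolution == window-by-filter product; flatten the feature maps
        X = torch.einsum("mps,nfs->mnpf", C1, W_conv)     # [m, n, positions, filters]
        X = X.reshape(m, n, s * s) + b_conv.view(1, n, 1) # [m, n, d_k]
        return torch.relu(self.lin_out(X))                # [m, n, d_k]
```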

3.3.4 OA-CA

Here, $\alpha^{CA}$ uses 1D convolutions to model the interaction between each context word-query word pair, just like $\alpha^{C}$. Unlike OA-C though, $\beta^{CA}$ uses both $C$ and $Q$ to generate $Q^{S}$. Specifically, we perform dot product attention over $C$ using each individual query word $Q_{j}$ to generate a vector summary of $C$ for each query word (like in OA-EMB), which is then used to generate filters for a 1D convolution over the corresponding query word (the one from which the filter was generated). We use $\sqrt{d_{k}}$ filters of size $\sqrt{d_{k}}$, with a stride of $\sqrt{d_{k}}$. The resulting feature maps are flattened to produce embeddings for each query word, from which $Q^{S}$ is generated. Formally,

Figure 6: OA-CA: Internal Design
  • $\alpha^{CA}(C,Q)$:

    Conv1D: $\alpha^{CA}(C,Q)=\alpha^{C}(C,Q)$ ... $[m,n,d_{k}]$

  • $\beta^{CA}(C,Q)$:

    Linear: $C^{1}=ReLU(CW_{0}^{T}+b_{0})$ ... $[m,d_{k}]$

    Linear: $Q^{1}=ReLU(QW_{4}^{T}+b_{4})$ ... $[n,d_{k}]$

    Dot Product Attention: $C^{Q}=Attention(Q^{D}=Q^{1},C^{D}=C^{1})$ ... $[n,d_{k}]$

    Linear (Generate convolutional filters): $W_{conv}=C^{Q}W_{5}^{T}+b_{5}$ ... $[n,\sqrt{d_{k}}]$

    Linear (Generate convolutional biases): $b_{conv}=C^{Q}W_{6}^{T}+b_{6}$ ... $[n,1]$

    1D Convolution: $\beta^{CA}(C,Q)=Conv1D(input=Q^{1},filters=(W_{conv},b_{conv}),stride=\sqrt{d_{k}})$ ... $[n,d_{k}]$
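A sketch of $\beta^{CA}$ follows, reusing the convolution-as-matrix-product trick from the $\alpha^{C}$ sketch; the filter shapes are again our reading of the text rather than a stated specification, and the unscaled dot-product attention stands in for the PyTorchNLP implementation:

```python
import math
import torch
import torch.nn as nn

class BetaCA(nn.Module):
    def __init__(self, d, d_k):
        super().__init__()
        self.s = int(math.isqrt(d_k))
        assert self.s * self.s == d_k, "d_k must be a perfect square"
        self.lin_c = nn.Linear(d, d_k)
        self.lin_q = nn.Linear(d, d_k)
        self.lin_w = nn.Linear(d_k, d_k)    # filters generated from the context summary C^Q
        self.lin_b = nn.Linear(d_k, 1)

    def forward(self, C, Q):
        n, s = Q.shape[0], self.s
        C1 = torch.relu(self.lin_c(C))                      # [m, d_k]
        Q1 = torch.relu(self.lin_q(Q))                      # [n, d_k]
        weights = torch.softmax(Q1 @ C1.T, dim=-1)          # dot-product attention, [n, m]
        C_Q = weights @ C1                                  # context summary per query word, [n, d_k]
        W_conv = self.lin_w(C_Q).view(n, s, s)              # [n, filters, window]
        b_conv = self.lin_b(C_Q)                            # [n, 1]
        Q_windows = Q1.view(n, s, s)                        # [n, positions, window]
        # Convolve each query word with the filters generated from its own context summary
        X = torch.einsum("nps,nfs->npf", Q_windows, W_conv)
        return X.reshape(n, s * s) + b_conv                 # Q^S, [n, d_k]
```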

4 Experimentation Details

The model architecture is a straightforward modification of Khandelwal and Sawant (2020), by passing the output of the XLNet-base model to 2 Orthogonal Attention Encoder blocks. Finally, we add a linear layer to get the output probabilities. We also use Dropout and Residual Connections to regularize the model and stabilize training.

The model architecture can be summarised as follows:

Sentence Embeddings: $X_{1}=XLNet\text{-}base(X)$

Dropout: $X_{2}=Dropout(X_{1})$

Orthogonal Attention Encoder: $X_{3}=OA(X_{2},X_{2}[cue\_ids])$

Orthogonal Attention Encoder: $X_{4}=OA(X_{3},X_{3}[cue\_ids])$

Dropout: $X_{5}=Dropout(X_{4})$

Residual Connection: $X_{6}=X_{5}+X_{1}$

Dropout: $X_{7}=Dropout(X_{6})$

Linear: $Y=X_{7}W^{T}+b$
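The overall architecture can be sketched as below. This is an illustrative composition, not the released code: `cue_ids` is assumed to index the cue-token positions of a single example (batch handling is simplified), and the OA encoder blocks are assumed to accept the batched tensors as given:

```python
import torch.nn as nn
from transformers import XLNetModel

class OAScopeModel(nn.Module):
    def __init__(self, oa_block_1, oa_block_2, num_labels=2, p_drop=0.3):
        super().__init__()
        self.xlnet = XLNetModel.from_pretrained("xlnet-base-cased")
        self.oa1, self.oa2 = oa_block_1, oa_block_2   # Orthogonal Attention encoder blocks
        self.dropout = nn.Dropout(p_drop)
        self.classifier = nn.Linear(self.xlnet.config.d_model, num_labels)

    def forward(self, input_ids, attention_mask, cue_ids):
        X1 = self.xlnet(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        X2 = self.dropout(X1)
        X3 = self.oa1(X2, X2[:, cue_ids])   # context = sentence tokens, query = cue-token embeddings
        X4 = self.oa2(X3, X3[:, cue_ids])
        X6 = self.dropout(X4) + X1          # dropout + residual connection to the XLNet output
        return self.classifier(self.dropout(X6))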

We experiment with 4 datasets:

  • BioScope Abstracts Subcorpora (BA): 1075 samples

  • BioScope Full Papers Subcorpora (BF): 235 samples

  • SFU Review Corpus (SFU): 2205 samples

  • *sem 2012 Shared Task dataset (Sherlock): 615 samples

We perform a 10-fold cross validation. When the Sherlock dataset is involved, we use the train-dev-test split provided by the *sem 2012 organizers, so the 10-fold CV becomes a 10-run average over the provided test set. We report the Token-level F1 score, which is an F1 score over word-level label matches. Using the proposed model architecture, we experiment with the 4 proposed variants of Orthogonal Attention. For each variant, we also experiment with 2 preprocessing techniques for the input sentence:

  • Normal: The input sentence is passed as is. We hypothesise that this approach will require the model to learn the dependencies between the cue words and their scope via leveraging the multiheaded orthogonal attention blocks, as the XLNet layer does not know which words are the cue words in the sentence.

  • Augment: Similar to the Augment method used in prior work (Khandelwal and Sawant (2020), Britto and Khandelwal (2020)), the cue words are preceded with a special token (tok[0]). This way, we explicitly tell the XLNet layer about the cue words as well, and thus the Orthogonal Attention Encoder layers would be able to enhance the representation of each token, making it easier for the linear layer to find the scope of the cue.

While the Normal preprocessing strategy would be the default method for a Cloze-style task, we observed that our model overfit the training set, due to the limited amount of training data (ranging from 200 samples to 2200 samples). Each Orthogonal Attention layer has around 4-7 million parameters which have to be learnt from scratch (as shown in Table 1). To reduce overfitting, we experimented with also giving the XLNet layer access to cue information, via the Augment preprocessing technique, so that it generates meaningful cue-aware embeddings. In our analysis, we show that all the Orthogonal Attention variants give us a gain in performance over using XLNet-base only.

We use a differential learning rate strategy, where XLNet-base is finetuned using an initial learning rate of 3e-5 and the Orthogonal Attention Encoder Layers are trained using an initial learning rate of 1e-4, both using the Adam optimizer.

We use a batch size of 32, a dropout probability of 0.3, and a maximum input sequence length of 128. We train for 60 epochs, with early stopping on the validation set F1 with a patience of 6 epochs. Of the k folds the data is divided into, one fold is kept as the test set, while a validation set (for early stopping) of the same size as the test set is sampled from the remaining k-1 folds.
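The differential learning-rate setup described above amounts to two Adam parameter groups; a minimal sketch (parameter-group construction only, with the `xlnet` attribute name assumed from the model sketch earlier; the training loop and early stopping are omitted) is:

```python
import torch

def build_optimizer(model):
    return torch.optim.Adam([
        {"params": model.xlnet.parameters(), "lr": 3e-5},                 # finetuned XLNet-base backbone
        {"params": [p for n, p in model.named_parameters()
                    if not n.startswith("xlnet")], "lr": 1e-4},           # OA encoder layers + classifier
    ])
```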

We also perform a 10-fold Cross-Validation for the model proposed by Britto and Khandelwal (2020), using XLNet as the base model (which is the state-of-the-art model for BF, BA and SFU), to accurately compare the results. This model is referred to as Baseline in all further results.

5 Results and Analysis

Table 1: Number of Model Parameters

Table 1 contains a summary of the model sizes and the execution times for an inference run with various batch sizes. We see that OA-C and OA-EM have similar execution profiles, and that OA-CA and OA-EMB (the bidirectional variants) have similar execution profiles.

Table 2: Results for Augment Preprocessing Method
Table 3: Results for Normal Preprocessing Method

Tables 2 and 3 contain the results for the Augment and Normal preprocessing methods respectively (Token-level F1 score, macro-averaged over a 10-fold CV). Across a row, the best model for that train dataset-test dataset combination is highlighted in bold. The best score for a given test dataset is highlighted in yellow. We observe that:

  • Table 4: Performance Difference (F1 points)

    Table 4 shows that for the Augment preprocessing method, all models yield improvements over the non-OA baseline. To compute the second column, we take the difference between a model's score and the best score that any model achieved on that train dataset-test dataset combination. It also shows that the 2 most promising variants are OA-EMB and OA-CA (the bidirectional variants).

  • Figure 7: Performance Difference between Augment and Normal (F1 Points)

    Figure 7 shows the difference in performance between a model trained with the Augment preprocessing method and the same model trained with the Normal method, averaged over models. We observe that while the Normal preprocessing method performs noticeably worse than Augment, this difference decreases with increasing dataset size (supporting our earlier hypothesis of overfitting).

Table 5: Summary of Results

A summary of the best results by each model variant is shown in Table 5. A final summary of the results in comparison to the previous state-of-the-art results is shown in Table 6.

Table 6: Final Summary Table

6 Conclusion and Future Scope

In this paper, we proposed a novel approach to solving Negation Scope Resolution by viewing it as a Cloze-Style task, and also proposed a novel Cloze-Style Attention mechanism called Orthogonal Attention, along with 4 variants of it. Our results show that Orthogonal Attention is very effective as a Cloze-Style Attention mechanism, and that using it on top of the current state-of-the-art model (XLNet-base) gives an increase in performance. Thus, we report the best results to date on Negation Scope Resolution.

Future work could utilize Orthogonal Attention to address Question Answering and other Machine Reading Comprehension tasks. The Orthogonal Attention framework could also be used to create new variants that model the interaction between the Query and Context more effectively.

References