Document-Level Definition Detection in Scholarly Documents:
Existing Models, Error Analyses, and Future Directions
Abstract
The task of definition detection is important for scholarly papers, because papers often make use of technical terminology that may be unfamiliar to readers. Despite prior work on definition detection, current approaches are far from being accurate enough to use in real-world applications.
In this paper, we first perform in-depth error analysis of the current best performing definition detection system and discover major causes of errors. Based on this analysis, we develop a new definition detection system, HEDDEx, that utilizes syntactic features, transformer encoders, and heuristic filters, and evaluate it on a standard sentence-level benchmark. Because current benchmarks evaluate randomly sampled sentences, we propose an alternative evaluation that assesses every sentence within a document. This allows for evaluating recall in addition to precision.
HEDDEx outperforms the leading system on both the sentence-level and the document-level tasks, by 12.7 F1 points and 14.4 F1 points, respectively. We note that performance on the high-recall document-level task is much lower than on the standard sentence-level evaluation, suggesting the need to incorporate document structure as features. We discuss the remaining challenges in document-level definition detection, ideas for improvement, and potential issues in developing reading-aid applications.
1 Introduction
Automatic definition detection is an important task in natural language processing (NLP). Definitions can be used for a variety of downstream tasks, such as ontology matching and construction Bovi et al. (2015), paraphrasing Hashimoto et al. (2011), and word sense disambiguation Banerjee and Pedersen (2002); Huang et al. (2019). Prior work in automated definition detection has addressed the domain of scholarly articles Reiplinger et al. (2012); Jin et al. (2013); Espinosa-Anke and Schockaert (2018); Vanetik et al. (2020); Veyseh et al. (2020). Definition detection is especially important for scholarly papers because they often use unfamiliar technical terms that readers must understand to properly comprehend the article.
[Table 1: Example sentences containing terms and definitions]
In formal terms, definition detection comprises two tasks: classifying sentences as containing definitions or not, and identifying which spans within these sentences contain terms and definitions. As the performance of definition extractors continues to improve, these algorithms could pave the way for new types of intelligent assistance for readers of dense technical documents. For example, one could envision future interfaces that reveal definitions of jargon like “biLM” or the symbol “” when a reader hovers over the terms in a reading application Head et al. (2020). Examples of sentences containing terms and definitions are shown in Table 1.
Despite recent advances in definition detection, much work remains to be done before models are capable of extracting definitions with an accuracy appropriate for real-world applications. The first challenge is one of recall: existing systems are typically not trained to identify all definitions in a document, but rather to classify individual sentences arbitrarily sampled from a large corpus. The second challenge is one of precision: the state of the art misclassifies upwards of 30% of sentences Veyseh et al. (2020). This raises the questions of why definition extractors fall short, and how these shortcomings can be overcome.
In this paper, we contribute the following:
• An in-depth error analysis of the current best-performing model. This analysis characterizes the state of the field and illustrates future directions for improvement;
• A new model, Heuristically-Enhanced Deep Definition Extraction (HEDDEx), that extends a state-of-the-art model with improvements designed to address the problems found in the error analysis. An evaluation shows that this improved model outperforms the state of the art by a large margin (+12.7 F1);
• An introduction of the challenging task of full-document definition detection. In this task, models are evaluated based on their ability to identify definitions across an entire document’s sentences. We believe this framing of definition detection is critical to preparing future algorithms for real-world use;
• A preliminary analysis of previous models and our model on the document-level definition detection task using a small test set of scholarly papers where every term and definition has been labeled. This analysis shows that HEDDEx outperforms the state of the art, while revealing opportunities for future improvements.
In summary, this paper draws attention to the work yet to be done in addressing the task of document-level definition detection for scholarly documents. We emphasize that a seemingly straightforward task like definition detection still poses significant challenges to NLP, and that this is an area that needs more focus in the scholarly document processing community.
2 Related Work
Definition detection has been tackled in several ways in prior research. Traditional rule-based systems Muresan and Klavans (2002); Westerhout and Monachesi (2008); Westerhout (2009a) used hand-written definition patterns (e.g., “is defined as”) and linguistic features (e.g., pronouns, verbs, punctuation), providing high-precision but low-recall detection. To address the low recall problem, model-driven approaches Fahmi and Bouma (2006); Westerhout (2009b); Navigli and Velardi (2010); Reiplinger et al. (2012) were developed using statistical and syntactic features such as bag-of-words, sentence position, and part-of-speech (POS) tags, and their combination with hand-written rules. Notably, Jin et al. (2013) used a conditional random field (CRF) Lafferty et al. (2001) to predict a tag for each token in a sentence, such as TERM for term tokens, DEF for definition tokens, and O for neither. Recently, sophisticated neural models such as convolutional networks Espinosa-Anke and Schockaert (2018) and graph convolutional networks Veyseh et al. (2020) have been applied to obtain better sentence representations in combination with syntactic features. However, our analysis found that the state of the art is still far from solving the problem, achieving an F1 score of only 60 points on a standard test set.
Sentences | Cause | Patterns | Solutions
[Equal] is open in something of type collection where that collection is a {partition of something} . | Overgeneralization: technical term bias; description (is) | (none applicable) | ?
A [graph - based operator] defines a transformation on a multi-document graph (MDG) G which preserves {some of its properties while reducing the number}… | Complicated sentence structure | <term> defines <def> | parsing features
The Inductive Logic Programming learning method that we have developed enables us to automatically extract from a corpus N - V pairs whose elements axe linked by one of the semantic relations defined in the qualia structure … | Unfamiliar or unseen vocabulary; Unseen patterns | <term> that we have developed enables us to automatically <def> | generalize patterns
3 Error Analysis of the Leading System
In order to inform our efforts to develop a more advanced system, we performed an in-depth error analysis of the results of the current leading approach to definition and term identification, the joint model by Veyseh et al. (2020). We analyzed the model’s predictions on the W00 dataset Jin et al. (2013), since it matches our target domain of scholarly papers and is the dataset that the joint model was evaluated on. Of the 224 test sentences, the Veyseh et al. (2020) system got 111 correct. (Note that we use the test set of W00 for manual analysis, which is only 10% of the entire dataset. In our experiment in §4, we did not use the test set used in this error analysis, but instead performed cross-validation on the train set, following the experimental setup in Veyseh et al. (2020).) The first author annotated the remaining 113 sentences for which the algorithm was partially or fully incorrect to ascertain the root causes of the errors.
We discovered four (for terms) and five (for definitions) major causes for the erroneous predictions, as summarized in Table 3. We illustrate three examples in Table 2. For each example, we also labeled surface patterns between the term and definition (e.g., “<term> defines <def>”), and potential algorithm improvements to address the underlying problem.
For instance, in the bottom-most example in Table 2, the system did not predict any term or definition, although the sentence includes the term “Inductive Logic Programming Learning method” and the definition “extract from a corpus…”. Our conjecture is that the underlying surface pattern is unseen in the training set and too complicated to be generalized; we annotate a potential solution as pattern generalization.
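To make the notion of a surface pattern concrete, the sketch below shows how a pattern such as “<term> defines <def>” could be expressed as a simple rule; the regular expression and example sentence are illustrative only and are not the annotation tooling used in this error analysis.

```python
import re

# An illustrative rule for the surface pattern "<term> defines <def>"
# (a sketch for exposition, not the tooling used in the error analysis).
DEFINES_PATTERN = re.compile(
    r"^(?P<term>[A-Z][\w\s\-]+?)\s+defines?\s+(?P<definition>.+)$"
)

sentence = "A graph-based operator defines a transformation on a multi-document graph."
match = DEFINES_PATTERN.match(sentence)
if match:
    print("TERM:", match.group("term"))        # "A graph-based operator"
    print("DEF :", match.group("definition"))  # "a transformation on a multi-document graph."
```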
Top hypothesized causes of error | (%)
Terms: |
Overgeneralization: technical term bias | 48.6%
Unfamiliar or unseen vocabulary | 25.7%
Complicated sentence structure | 12.9%
Entity detection | 4.3%
Definitions: |
Overgeneralization: technical term bias | 28.9%
Overgeneralization: surface pattern bias | 23.3%
Unseen patterns | 14.4%
Complicated sentence structure | 12.2%
Overgeneralization: description | 3.3%
We rank the causes of errors by frequency and summarize the results in Table 3. For detection of terms, nearly half of the error cases fall into overgeneralization of technical terms: overly predicting words like “equal” and “model” as terms (e.g., the top example in Table 2).
Proposed error correction solution types | (%) |
Syntactic (POS, parse tree, entity, acronym) | 29.2% |
Heuristics | 23.6% |
Better encoder/tokenizer, UNK | 18.0% |
Rules (surface patterns) | 11.3% |
Annotations* | 9.4% |
Pattern generalization | 5.7% |
Mathematical symbol detection | 1.9% |
More context | 0.9% |
We again rank the error correction solutions by frequency (Table 4). We predict that of errors can be fixed by informing the system about syntactic features of the sentence, such as part-of-speech tags, parse tree annotations, entities, or acronyms, for more accurate detection. Surprisingly, simple heuristics, such as stitching up discontiguous token spans or discarding output that does not successfully predict both a term and a definition, seem likely to be highly effective at addressing the errors in Table 3. In the next section, we implement the first three solution types on top of the state-of-the-art system and report the resulting performance improvements.
[Figure 1: Overview of the HEDDEx model]
4 Definition Sentence Detection Model
To address the errors identified in §3, we designed HEDDEx, a new sentence-level definition detection model. The model incorporates a set of syntactic features, heuristic filters, and encoders. Each of these was designed to address a common class of error revealed in the error analysis. The model achieves superior performance over the state of the art for the task of sentence-level definition detection.
4.1 Proposed Model: Heuristically-Enhanced Deep Definition Extraction (HEDDEx)
HEDDEx extends the joint model proposed by Veyseh et al. (2020). The joint model is comprised of two components. The first component is a CRF-based sequence prediction model for slot tagging. The model assigns each token in a sentence one of five tags: term (“B-TERM”, “I-TERM”), definition (“B-DEF”,“I-DEF”), or other (“O”). The second component is a binary classifier that labels each sentence as containing a definition or not.
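As a concrete illustration of this tagging scheme (a minimal sketch, not the model implementation), the example below labels a sentence from Table 10 with the five slot tags and shows how term and definition spans are recovered from them.

```python
# A minimal illustration of the slot-tagging scheme (not the actual model code):
# each token receives one of the five tags, and the sentence as a whole receives
# a binary "contains a definition" label.
tokens = ["A", "biLM", "combines", "both", "a", "forward", "and", "backward", "LM", "."]
tags   = ["O", "B-TERM", "B-DEF", "I-DEF", "I-DEF", "I-DEF", "I-DEF", "I-DEF", "I-DEF", "O"]
contains_definition = True  # sentence-level classification target

def spans(tags, slot):
    """Recover (start, end) token spans for a slot type, e.g. 'TERM' or 'DEF'."""
    out, start = [], None
    for i, tag in enumerate(tags):
        if tag == f"B-{slot}":
            start = i
        elif tag != f"I-{slot}" and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(tags)))
    return out

print(spans(tags, "TERM"))  # [(1, 2)] -> "biLM"
print(spans(tags, "DEF"))   # [(2, 9)] -> "combines both a forward and backward LM"
```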
HEDDEx has three new modules (Figure 1). First, it encodes input with a transformer encoder fine-tuned on the task of definition extraction, whereas the joint model encodes input with a combination of a graph convolutional network and a BERT encoder without fine-tuning. (However, we note that we were unable to replicate the accuracy reported by Veyseh et al. (2020) when using the code provided by the authors.) We evaluate several state-of-the-art encoders for this task, including BERT Devlin et al. (2019), RoBERTa Liu et al. (2019b), and SciBERT Beltagy et al. (2019).
Second, HEDDEx is provided with additional syntactic features as input. These features include parts of speech, syntactic dependencies, and the token-level labels provided by entity recognizers and abbreviation detectors Schwartz and Hearst (2003). The features are extracted using off-the-shelf tools such as spaCy (https://spacy.io/) and SciSpacy Neumann et al. (2019).
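The sketch below shows how such token-level features could be obtained with off-the-shelf tools; it assumes the SciSpacy “en_core_sci_sm” model is installed and illustrates the kind of pipeline involved rather than the exact feature extraction code in HEDDEx.

```python
# A sketch of token-level syntactic feature extraction with spaCy and SciSpacy
# (assumes scispacy and the en_core_sci_sm model are installed; illustrative only).
import spacy
from scispacy.abbreviation import AbbreviationDetector  # registers the pipe factory

nlp = spacy.load("en_core_sci_sm")
nlp.add_pipe("abbreviation_detector")  # Schwartz & Hearst (2003)-style detector

doc = nlp("A bidirectional language model (biLM) combines a forward and a backward LM.")

# Part-of-speech tags, dependency labels, and entity labels for each token.
for token in doc:
    print(token.text, token.pos_, token.dep_, token.ent_type_ or "-")

# Abbreviation/long-form pairs, usable as additional token-level labels.
for abrv in doc._.abbreviations:
    print(abrv.text, "->", abrv._.long_form)
```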
Third, the output of the CRF and sentence classifier is refined using heuristic rules. The rules clean up the slot tags produced by the CRF and override predictions made by the sentence classifier. Among others, the rules include the following (a sketch of these two filters follows the list):
• Do not classify a sentence as a definition if it only contains a term without a definition, or a definition without a term.
• Stitch up discontiguous token spans for terms and definitions by assigning all contiguous tokens between two term or definition labels the same label.
These three enhancements developed in HEDDEx were selected specifically to suit the shortcomings of the models identified in the error analysis (§3), leading to significant improvements for definition detection in our experiments.
4.2 Baseline Models
To evaluate the impact of these improvements on the definition detection task, HEDDEx was compared to four baseline systems: (1) DefMiner Jin et al. (2013), a CRF-based sequence prediction model that makes use of hand-written features; (2) the model of Li et al. (2016), comprised of a CRF with an LSTM encoder; (3) GCDT Liu et al. (2019a), a global and local context encoder; (4) the joint model of Veyseh et al. (2020) described above. The experimental setup for the models followed the setup described by Veyseh et al. (2020).
4.3 Metrics
The models were compared using a set of metrics for both slot tagging and sentence classification on the W00 test set. To evaluate the slot tagger, macro-averaged precision, recall, and F1 score were measured (column “Macro P/R/F” in Table 5). However, the Macro scores do not show performance specific to terms or definitions, and macro-averaging over the position tags (B, I) makes it difficult to interpret general performance. Therefore, we also measured these three metrics only for the term tags B-TERM and I-TERM (“TERM P/R/F”) and the definition tags B-DEF and I-DEF (“DEF P/R/F”). To evaluate the precision of the bounds of term and definition spans, we also evaluated the degree of overlap between each detected term or definition span and the corresponding span in the gold dataset (“Partial F”). Furthermore, the accuracy of sentence classification was measured (column “Clsf.”). For each of these metrics, a higher score indicates superior performance. We averaged each score across 10-fold cross validation.
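As an illustration of how such slot-tagging scores could be computed, the sketch below uses scikit-learn on a toy tag sequence; the paper’s exact aggregation (e.g., its treatment of the O tag) may differ.

```python
# A sketch of computing the tag-level metrics with scikit-learn (illustrative only).
from sklearn.metrics import precision_recall_fscore_support

gold = ["B-TERM", "I-TERM", "O", "B-DEF", "I-DEF", "I-DEF", "O"]
pred = ["B-TERM", "O",      "O", "B-DEF", "I-DEF", "O",     "O"]

# Macro P/R/F over the four non-O tags.
macro = precision_recall_fscore_support(
    gold, pred, labels=["B-TERM", "I-TERM", "B-DEF", "I-DEF"],
    average="macro", zero_division=0
)
# TERM-only and DEF-only scores restrict the label set to one slot type.
term = precision_recall_fscore_support(
    gold, pred, labels=["B-TERM", "I-TERM"], average="macro", zero_division=0
)
defn = precision_recall_fscore_support(
    gold, pred, labels=["B-DEF", "I-DEF"], average="macro", zero_division=0
)

print("Macro P/R/F:", macro[:3])
print("TERM  P/R/F:", term[:3])
print("DEF   P/R/F:", defn[:3])
```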
4.4 Setup
Due to computing limitations, we chose the best hyper-parameter set through parameter sweeping with HEDDEx using the BERT encoder only, and used the best hyper-parameters for all other models. The ranges of each parameter we tuned were: batch sizes in {8, 16, 32}, number of training epochs in {30, 50, 100}, maximum sentence lengths in {80, 256, 512}, and learning rates in {2e-4, 5e-4, 2e-5, 5e-5}.
The other parameters used as defaults in our experiments were as follows: the dropout ratio was 10%, the layer size for POS embeddings was 50, and the hidden size for slot prediction was 512. We followed the default hyper-parameters for each transformer model of each size (base or large) using HuggingFace’s Transformers library (https://github.com/huggingface/transformers).
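Written out as a configuration grid, the sweep and defaults described above look roughly as follows (a sketch: the parameter names and the commented-out training call are hypothetical, not the actual training script).

```python
# The hyper-parameter sweep and defaults described above, written as a simple grid.
# (A sketch: parameter names and train_and_evaluate() are hypothetical.)
from itertools import product

search_space = {
    "batch_size": [8, 16, 32],
    "num_epochs": [30, 50, 100],
    "max_seq_length": [80, 256, 512],
    "learning_rate": [2e-4, 5e-4, 2e-5, 5e-5],
}
defaults = {"dropout": 0.1, "pos_embedding_dim": 50, "slot_hidden_size": 512}

configs = [
    {**defaults, **dict(zip(search_space.keys(), values))}
    for values in product(*search_space.values())
]
print(len(configs), "configurations to sweep")  # 3 * 3 * 3 * 4 = 108
# for config in configs:
#     train_and_evaluate(config)  # hypothetical training call
```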
Macro P/R/F | TERM P/R/F | DEF P/R/F | Partial F | Clsf. | |
DefMiner Jin et al. (2013) | 52.5 / 49.5 / 50.5 | - | - | - | - |
LSTM-CRF Li et al. (2016) | 57.1 / 55.9 / 56.2 | - | - | - | - |
GCDT Liu et al. (2019a) | 57.9 / 56.6 / 57.4 | - | - | - | - |
Joint Veyseh et al. (2020) | 60.9 / 60.3 / 60.6 | - | - | - | - |
Joint* Veyseh et al. (2020) | 61.0 / 60.2 / 60.7 | - | - | - | 70.5 |
HEDDEx:
Joint* + BERT-base | 59.5 / 61.3 / 60.3 | 66.6 / 70.0 / 68.2 | 72.1 / 74.0 / 72.8 | 74.3 | 83.4 |
Joint* + BERT-large | 60.4 / 61.4 / 60.7 | 67.5 / 71.0 / 69.0 | 72.3 / 73.9 / 72.9 | 74.5 | 83.2 |
Joint* + RoBERTa-large | 60.3 / 61.6 / 60.7 | 67.3 / 70.3 / 68.6 | 72.8 / 74.6 / 73.5 | 73.2 | 84.2 |
Joint* + SciBERT | 61.9 / 61.2 / 61.5 | 71.1 / 69.1 / 69.9 | 74.0 / 74.6 / 74.2 | 75.7 | 85.1 |
Joint* + SciBERT + Syntactic | 61.6 / 61.8 / 61.6 | 70.7 / 71.3 / 70.9 | 73.3 / 72.4 / 72.8 | 74.3 | 84.3
Joint* + SciBERT + Syntactic + Heuristic | 72.9 / 74.3 / 73.4 | 69.8 / 72.1 / 70.8 | 75.4 / 71.8 / 73.3 | 74.3 | 84.5
4.5 Results
Outcomes of the evaluation for all measurements are presented in Table 5. The pre-trained language model encoders (BERT, RoBERTa, SciBERT) achieve comparable performance to more complex neural architectures like the graph convolutional networks used in Veyseh et al. (2020). Models that included SciBERT Beltagy et al. (2019), rather than BERT or RoBERTa, achieved higher accuracy on most measurements. We attribute this to the domain similarity between the scholarly documents that SciBERT was trained on and those used in our evaluation.
With SciBERT as the base encoder, the incorporation of syntactic features led to further accuracy gains. Of particular note are the improvements in recall for term spans. During our evaluation, we observed that the gains from syntactic features were more pronounced for encoders with a smaller model size (i.e., the “-base” models). We conjecture that this is because the larger encoder models were capable of learning linguistic patterns comparable to those captured by the syntactic features.
The addition of heuristic rules led to significant improvement (+11.8 Macro F1) over the combination of Joint and SciBERT. Given the modest improvement in term and definition tagging, we suspect that much of this improvement can be accounted for by the correction of position markers in the slot tags (i.e., distinguishing between B and I in the tag assignments).
In the following experiments, HEDDEx refers to the combination of three components: the encoder (SciBERT or RoBERTa), syntactic features, and heuristic filters.
5 Document-Level Definition Detection
Although HEDDEx attains reasonable performance on individual sentences, it faces new challenges when applied to the scenario of document-level analysis. In this section, we evaluate definition detection for full papers in two novel ways. First, we assess the precision of the HEDDEx model across all of the sentences of 50 documents in §5.1. Second, we assess both the precision and the recall of the algorithm across all of the sentences of 2 full documents (§5.2, §5.3).
5.1 Error Analysis on Predicted Definitions
To assess how well HEDDEx works at the document level, we randomly sampled 50 ACL papers from the S2ORC dataset Lo et al. (2020), a large corpus of 81.1M English-language academic papers spanning many academic disciplines. We ran the pretrained HEDDEx model on every sentence of every document; if the model detected a term/definition pair, the corresponding sentence was output for assessment. (Note that this analysis can estimate precision but not recall, as false negatives are not detected.)
We replace all citations and references to figures, tables, and sections with corresponding placeholders (e.g., CITATION, FIGURE), but keep raw TeX format of mathematical symbols in order to retain the structure of the equations. From the 50 ACL papers, the model detected definitions out of sentences and the average number of definitions per paper is .
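A sketch of this preprocessing step is shown below; it assumes S2ORC-style paragraph records that list citation and figure/table reference spans as character offsets (the field names are illustrative of that format rather than guaranteed).

```python
# A sketch of the placeholder-masking step, assuming S2ORC-style paragraph records
# with citation and reference spans given as character offsets (illustrative field names).
def mask_spans(paragraph):
    text = paragraph["text"]
    spans = [(s["start"], s["end"], "CITATION") for s in paragraph.get("cite_spans", [])]
    for s in paragraph.get("ref_spans", []):
        ref_id = s.get("ref_id") or ""
        placeholder = "FIGURE" if ref_id.startswith("FIG") else \
                      "TABLE" if ref_id.startswith("TAB") else "SECTION"
        spans.append((s["start"], s["end"], placeholder))
    # Replace from right to left so earlier offsets stay valid.
    for start, end, placeholder in sorted(spans, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text  # TeX math in the text is left untouched

paragraph = {
    "text": "ELMo outperforms CoVe [3] (see Figure 2).",
    "cite_spans": [{"start": 22, "end": 25}],
    "ref_spans": [{"start": 31, "end": 39, "ref_id": "FIGREF1"}],
}
print(mask_spans(paragraph))  # "ELMo outperforms CoVe CITATION (see FIGURE)."
```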
Term | (%) | Definition | (%)
Textual term | 45.2% | Textual Def. | 58.7% |
Incorrect term | 27.3% | Other: implausible | 24.8% |
Math symbol term | 22.7% | Other: plausible | 11.8% |
Acronym | 3.3% | Short name / Synonym | 3.5% |
Acronym and text | 1.3% | Textual & Formula Def. | 0.6% |
| | Formula Def. | 0.4%
The third author evaluated the predicted terms and definitions separately by choosing one among the labels shown in Table 6. For terms, the algorithm correctly labeled . We subdivide these correctly labeled terms into standard terms (45.2%), math symbols (22.7%), acronyms (3.3%), and acronyms with text (1.3%). Among the correctly labeled definitions (total 63.2% = 58.7% + 3.5% + 0.6% + 0.4%), 92.6% are textual definitions, 5.6% are short names or synonyms, and 1.7% include mathematical symbols. We divided the non-definitional text into two types: plausible () and implausible (), where the latter signals an error. The plausible text refers to explanations or secondary information (similar to DEFT Spala et al. (2019)’s secondary definitions, but without sentence crossings).
Term Span | (%) | Definition Span | (%)
Correct | 83.4% | Correct | 89.9% |
Too Long (to the right) | 10.1% | Cut Off (to the right) | 3.4% |
Cut Off (to the right) | 3.2% | Too Long (to the left) | 2.6% |
Cut Off (to the left) | 1.6% | Cut Off (to the left) | 2.4% |
Too Long (to the left) | 1.4% | Too Long (to the right) | 1.3% |
5.2 Full Document Definition Annotation
Prior definition annotation collections select unrelated sentences from across a document collection. As mentioned in the introduction, we are interested in annotating full papers, which requires finding every definition within a given paper. Therefore, we created a new collection in which we annotate every sentence within a document, allowing assessment of recall as well as precision. Two annotators annotated two full papers using an annotation scheme similar to that used in DEFT Spala et al. (2019) except for omitting cross-sentence links.
We chose to annotate two award-winning ACL papers, ELMo Peters et al. (2018) and LISA Strubell et al. (2018), resulting in 485 total sentences, from which we identified 98 definitional and 387 non-definitional sentences. Similar to DEFT Spala et al. (2019), we measured inter-annotator agreement using Krippendorff’s alpha Krippendorff (2011) with the MASI distance metric Passonneau (2006). We obtained 0.626 for terms and 0.527 for definitions; the agreement score for terms is lower than that of the DEFT annotations (0.80). This may be because our annotations for terms include various types such as textual terms, acronyms, and math symbols, while terms in DEFT are only textual terms. The task was quite difficult: each annotator took two and a half hours to annotate a single paper. Future work will include refining the annotation scheme to ensure more consistency among annotators and annotating more documents.
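For reference, agreement of this kind can be computed with NLTK; the sketch below uses toy set-valued labels rather than our actual annotations.

```python
# A sketch of computing Krippendorff's alpha with the MASI distance using NLTK
# (toy labels for illustration, not the actual annotation data).
from nltk.metrics.agreement import AnnotationTask
from nltk.metrics.distance import masi_distance

# Each record is (annotator, item, set-valued label), e.g. the set of token
# indices an annotator marked as part of a term span in a given sentence.
data = [
    ("annotator_1", "sent_1", frozenset({3, 4, 5})),
    ("annotator_2", "sent_1", frozenset({3, 4})),
    ("annotator_1", "sent_2", frozenset({0})),
    ("annotator_2", "sent_2", frozenset({0})),
]

task = AnnotationTask(data=data, distance=masi_distance)
print("Krippendorff's alpha (MASI):", task.alpha())
```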
5.3 Evaluation on Document-level Definitions
We evaluated document-level performance using the same metrics used in §4.3. All metrics were averaged over scores from the 10-fold validation models. The ensemble model aggregates the ten system predictions from the 10-fold validation models and chooses the final label via majority voting. For model ensembling, we use the best single system, HEDDEx with RoBERTa (RoBERTa and SciBERT show comparable performance on the document-level definition detection task).
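The ensembling step amounts to a per-token majority vote over the tag sequences produced by the ten models (and, analogously, over the sentence-level labels). A minimal sketch, with three models shown for brevity:

```python
# A sketch of the ensembling step: aggregate token-level predictions of the
# cross-validation models by majority vote (three models shown for brevity).
from collections import Counter

def majority_vote(per_model_tags):
    """per_model_tags: list of tag sequences, one per model, all the same length."""
    return [Counter(tags_at_i).most_common(1)[0][0]
            for tags_at_i in zip(*per_model_tags)]

predictions = [
    ["B-TERM", "I-TERM", "O", "B-DEF"],
    ["B-TERM", "O",      "O", "B-DEF"],
    ["B-TERM", "I-TERM", "O", "O"],
]
print(majority_vote(predictions))  # ['B-TERM', 'I-TERM', 'O', 'B-DEF']
```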
Macro | TERM | DEF | Partial | Clf. | |
Joint model | 36.0 | 30.1 | 38.1 | 34.1 | 86.8 |
HEDDEx w/ BERT | 45.3 | 32.4 | 39.0 | 34.6 | 89.1
HEDDEx w/ RoBERTa | 47.7 | 36.4 | 47.2 | 37.2 | 88.1 |
HEDDEx ensemble | 50.4 | 38.7 | 49.5 | 39.0 | 89.8
Compared to the joint model by Veyseh et al. (2020), HEDDEx showed significant improvements on every evaluation metric, and the margin is slightly larger than that of the sentence-level evaluation (Table 8). With model ensembling, compared to the state-of-the-art system, HEDDEx achieved gains of +14.4 Macro F1 points, +8.7 TERM F1 points, +11.4 DEF F1 points, +4.9 Partial Matching F1 points, and +3.0 points of classification accuracy.
Precision | Recall | F1 | |
Macro | 55.3 | 46.7 | 50.4 |
TERM | 44.8 | 34.0 | 38.7 |
DEF | 55.6 | 44.7 | 49.5 |
However, document-level definition detection is a much harder task than sentence-level detection. Compared to the sentence-level task in Table 5, the document-level task showed relatively lower performance (73.4 Macro F1 at the sentence level versus 50.4 Macro F1 at the document level). In particular, recall is much lower than precision in the document-level task (Table 9), whereas in the sentence-level task precision and recall are almost the same, indicating the need to incorporate document structure as additional features (see further discussion in §6).
Predicted definition sentences | Type | |
1 | Our word vectors are learned functions of the internal states of a {deep bidirectional language model} ([biLM]), which is pre-trained on a large text corpus. | term |
2 | We use vectors derived from a bidirectional LSTM that is trained with a coupled {language model} ([LM]) objective on a large text corpus. | term |
3 | Using intrinsic evaluations, we show that the higher-level [LSTM states] capture context-dependent aspects of word meaning (e.g., they can be used without modification to perform well on supervised word sense disambiguation tasks) while lower-level states {model aspects of syntax}. | term |
4 | We first show that they can be easily added to existing models for six diverse and challenging language understanding problems , including textual entailment, question answering and sentiment analysis. | term |
5 | For tasks where direct comparisons are possible, outperforms [CoVe] CITATION, which {computes contextualized representations using a neural machine translation encoder}. | term |
6 | context2vec CITATION uses a bidirectional Long Short Term Memory LSTM ; CITATION to encode the context around a pivot word . | term |
7 | Unlike most widely used word embeddings CITATION, word representations are functions of the entire input sentence, as described in this section. | term |
8 | This setup allows us to do [semi-supervised learning], where the biLM is pretr{ained at a large scale} (Sec. SECTION) and easily incorporated into a wide range of existing neural NLP architectures (Sec. SECTION). | term |
9 | Given a sequence of tokens, , a forward language model computes the probability of the sequence by modeling the probability of token given the history : | term |
10 | A [backward LM] is {similar to a forward LM, except it runs over the sequence in reverse, predicting the previous token given the future context}: | term |
11 | A [biLM] {combines both a forward and backward LM}. | term |
12 | where [] is {the token layer} and , for each biLSTM layer. | symbol |
13 | In (EQUATION), [] are {softmax-normalized weights} and the scalar parameter allows the task model to scale the entire vector. | symbol |
14 | For each token , a -layer biLM computes a set of representations EQUATION where is the token layer and , for each biLSTM layer. | symbol, term |
15 | In (EQUATION), are softmax-normalized weights and the [scalar parameter ] {allows the task model to scale the entire vector}. | symbol |
16 | The [Stanford Question Answering Dataset (SQuAD) CITATION] {contains 100K+ crowd sourced question-answer pairs where the answer is a span in a given Wikipedia paragraph}. | term |
17 | [Textual entailment] is {the task of determining whether a “hypothesis” is true, given a “premise”}. | term |
18 | The [Stanford Natural Language Inference (SNLI) corpus CITATION] {provides approximately 550K hypothesis/premise pairs}. | term |
19 | A [semantic role labeling (SRL) system] {models the predicate-argument structure of a sentence}, and is often described as answering. | term |
20 | CITATION modeled [SRL] {as a BIO tagging problem and used an 8-layer deep biLSTM with forward and backward directions interleaved}, following CITATION. | term |
21 | [Coreference resolution] is {the task of clustering mentions in text that refer to the same underlying real world entities}. | term |
22 | The [CoNLL] 2003 NER task CITATION {consists of newswire from the Reuters RCV1 corpus tagged with four different entity types (PER, LOC, ORG, MISC)}. | term |
23 | The [fine-grained sentiment classification] task in the Stanford Sentiment Treebank SST-5 {involves selecting one of five labels (from very negative to very positive) to describe a sentence from a movie review}. | term |
24 | The sentences contain diverse linguistic phenomena such as idioms and complex syntactic constructions such as negations that are difficult for models to learn. | multi-term |
25 | Intuitively, the [biLM] must be {disambiguating the meaning of words using their context}. | term |
26 | a fine grained {word sense disambiguation} ([WSD]) task and a POS tagging task. | term |
Table 10 shows the predicted terms and definitions as well as the annotated gold labels. Acronym patterns (e.g., “biLM,” “WSD”), definitions of newly-proposed terms (e.g., “LISA”), re-definitions of prior work (e.g., “SQuAD,” “SRL,” coreference resolution), and some mathematical symbols were detected well. However, as sentences became more complex, the system made incorrect predictions. Additionally, sub-words or parentheses in abbreviations are sometimes only partially predicted (e.g., the beginning of the word “pretrained” is cut off in the definition of “semi-supervised learning” in example 8 of Table 10).
The aforementioned problem of low recall is particularly severe for this task, since the model often fails to detect mathematical symbols or combinations of textual terms and mathematical symbols (e.g., “-layer biLM”). Moreover, when a sentence contains multiple terms and/or multiple symbols, the system only ever detects one of them.
6 Discussion
Detecting definitions is a very challenging task, and it is far from solved. Here we discuss remaining challenges and ideas for improvements, and motivate the need for high-precision, high-recall definition detection in an academic document reading aid application.
Outstanding technical challenges include:
• Poor recognition of mathematical symbols: As shown in our experiments, our system is less successful at detecting math symbols than textual terms. This is mainly because of the lack of coverage of mathematical symbols in our training dataset (W00).
• Contextual disambiguation of symbols: In our study, we observed that some symbols are used with multiple meanings. For example, one symbol in the LISA paper is used for both token representation and matrix transpose. Disambiguating terms based on their context of use will be an interesting future direction.
• Description vs. definition: In our annotation and error analysis, the most difficult distinction was between definitions and descriptions; they have quite similar surface patterns, although they refer to entirely different meanings. A definition is the exact denotation of a word, while a description provides more detail and can vary from person to person. Training a model that distinguishes these types should lead to better and more useful results.
Potential ideas for improvements of the system include:
• Annotation of mathematical definitions: A solution for poor math symbol detection is to annotate math symbols and use them for training. One option is to add span information to the binary judgements in the math definition collection of Vanetik et al. (2020).
• Utilization of document-level features: Document structure and positional information may improve detection. For instance, the section in which a term appears would be an important feature for recognizing whether the term is being introduced for the first time.
• Data augmentation or domain-specific fine-tuning for a high-recall system: Existing definition training sets are small (W00 contains only 731 definitional sentences). To obtain more data, the training set can be augmented via seed patterns, or existing pre-trained language models such as SciBERT can be fine-tuned for the domain.
Lastly, as the performance of definition detection systems increases, these systems can be applied to real-world reading or writing aids. We discuss potential issues for our system in such realistic settings:
• Metrics for usefulness: Currently, we measure precision, recall, and F1 scores with the document-level annotations. However, we have not explored how useful the predicted definitions are for readability when they are used in real-world applications like ScholarPhi Head et al. (2020). Deciding when and where to show definitions based on context and information density remains an important future direction.
• Categorization of definitions: We observe that terms and definitions can be grouped into multiple categories: short names, acronyms, textual definitions, formula definitions, and more. Automatically categorizing these and showing structured definitions might be helpful for organizing and ranking definitions in a user interface.
• Repeated definitions and terms within documents: We observed a pattern in which the same term is referred to multiple times in slightly different ways. Newly proposed terms are especially likely to exhibit this pattern. Grouping and summarizing these in a glossary table would be helpful for an academic document reader application.
7 Conclusion
This work sets the stage for bridging the gap between a well-known NLP task, definition detection, and real-world applications of the technique, which require both high precision and high recall. To achieve this goal, we proposed a more realistic setup for the definition detection task, called document-level definition detection, which requires high recall, mathematical symbol recognition, and document-level feature engineering. Our proposed definition detection system, HEDDEx, achieved significant gains on both the sentence-level and document-level tasks. Yet the problem is far from solved. We suggest that better coverage of variability of expression, recognition of mathematical symbols and notation, and other nuances of the task must still be addressed.
Acknowledgements
We would like to thank Amir Pouran Ben Veyseh for his help in sharing his code and preprocessed data, and for general advice on replicating his work. We also thank Raymond Fok, Vivek Aithal, the Hearst lab members at UC Berkeley, and the anonymous reviewers at SDP 2020 for their helpful comments. This research received funding from the Alfred P. Sloan Foundation, the Allen Institute for AI, Office of Naval Research grant N00014-15-1-2774, NSF Convergence Accelerator award 1936940, NSF RAPID award 2040196, and the University of Washington Washington Research Foundation/Thomas J. Cable Professorship.
References
- Satanjeev Banerjee and Ted Pedersen. 2002. An adapted Lesk algorithm for word sense disambiguation using WordNet. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 136–145. Springer.
- Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3606–3611.
- Claudio Delli Bovi, Luca Telesca, and Roberto Navigli. 2015. Large-scale information extraction from textual definitions through deep syntactic and semantic analysis. Transactions of the Association for Computational Linguistics, 3:529–543.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Luis Espinosa-Anke and Steven Schockaert. 2018. Syntactically aware neural architectures for definition extraction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 378–385, New Orleans, Louisiana. Association for Computational Linguistics.
- Ismail Fahmi and Gosse Bouma. 2006. Learning to identify definitions using syntactic features. In Proceedings of the Workshop on Learning Structured Information in Natural Language Applications.
- Chikara Hashimoto, Kentaro Torisawa, Stijn De Saeger, Sadao Kurohashi, et al. 2011. Extracting paraphrases from definition sentences on the web. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1087–1097.
- Andrew Head, Kyle Lo, Dongyeop Kang, Raymond Fok, Sam Skjonsberg, Daniel S. Weld, and Marti A. Hearst. 2020. Augmenting scientific papers with just-in-time, position-sensitive definitions of terms and symbols. arXiv preprint arXiv:2009.14237.
- Luyao Huang, Chi Sun, Xipeng Qiu, and Xuanjing Huang. 2019. GlossBERT: BERT for word sense disambiguation with gloss knowledge. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3509–3514, Hong Kong, China. Association for Computational Linguistics.
- Yiping Jin, Min-Yen Kan, Jun-Ping Ng, and Xiangnan He. 2013. Mining scientific terms and their definitions: A study of the ACL Anthology. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 780–790, Seattle, Washington, USA. Association for Computational Linguistics.
- Klaus Krippendorff. 2011. Computing Krippendorff's alpha-reliability. Technical report, University of Pennsylvania. Retrieved from https://repository.upenn.edu/asc_papers/43/.
- John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
- SiLiang Li, Bin Xu, and Tong Lee Chung. 2016. Definition extraction with LSTM recurrent neural networks. In Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data, pages 177–189. Springer.
- Yijin Liu, Fandong Meng, Jinchao Zhang, Jinan Xu, Yufeng Chen, and Jie Zhou. 2019a. GCDT: A global context enhanced deep transition architecture for sequence labeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2431–2441, Florence, Italy. Association for Computational Linguistics.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld. 2020. S2ORC: The Semantic Scholar Open Research Corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4969–4983, Online. Association for Computational Linguistics.
- A. Muresan and Judith Klavans. 2002. A method for automatically building and evaluating dictionary resources. In Proceedings of the Language Resources and Evaluation Conference (LREC).
- Roberto Navigli and Paola Velardi. 2010. Learning word-class lattices for definition and hypernym extraction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1318–1327, Uppsala, Sweden. Association for Computational Linguistics.
- Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. 2019. ScispaCy: Fast and robust models for biomedical natural language processing. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 319–327.
- Rebecca J. Passonneau. 2006. Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06).
- Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.
- Melanie Reiplinger, Ulrich Schäfer, and Magdalena Wolska. 2012. Extracting glossary sentences from scholarly articles: A comparative evaluation of pattern bootstrapping and deep analysis. In Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries, pages 55–65, Jeju Island, Korea. Association for Computational Linguistics.
- Ariel S. Schwartz and Marti A. Hearst. 2003. A simple algorithm for identifying abbreviation definitions in biomedical text. In Pacific Symposium on Biocomputing, pages 451–462.
- Sasha Spala, Nicholas A. Miller, Yiming Yang, Franck Dernoncourt, and Carl Dockhorn. 2019. DEFT: A corpus for definition extraction in free- and semi-structured text. In Proceedings of the 13th Linguistic Annotation Workshop, pages 124–131, Florence, Italy. Association for Computational Linguistics.
- Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5027–5038, Brussels, Belgium. Association for Computational Linguistics.
- Natalia Vanetik, Marina Litvak, Sergey Shevchuk, and Lior Reznik. 2020. Automated discovery of mathematical definitions in text. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2086–2094, Marseille, France. European Language Resources Association.
- Amir Pouran Ben Veyseh, Franck Dernoncourt, Dejing Dou, and Thien Huu Nguyen. 2020. A joint model for definition extraction with syntactic connection and semantic consistency. In Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20).
- Eline Westerhout. 2009a. Definition extraction using linguistic and structural features. In Proceedings of the 1st Workshop on Definition Extraction, pages 61–67.
- Eline Westerhout. 2009b. Extraction of definitions using grammar-enhanced machine learning. In Proceedings of the Student Research Workshop at EACL 2009, pages 88–96, Athens, Greece. Association for Computational Linguistics.
- Eline Westerhout and Paola Monachesi. 2008. Creating glossaries using pattern-based and machine learning techniques. In LREC 2008.