
AdaK-NER: An Adaptive Top-K Approach
for Named Entity Recognition with Incomplete Annotations

Hongtao Ruan1, Liying Zheng2, Peixian Hu3
Abstract

State-of-the-art Named Entity Recognition (NER) models rely heavily on large amounts of fully annotated training data. However, accessible data are often incompletely annotated in industries such as Financial Technology. Unannotated tokens are normally regarded as non-entities by default, whereas we underline that these tokens could be either non-entities or parts of entities. Here, we study NER modeling with incompletely annotated data where only a fraction of the named entities are labeled, and each unlabeled token is equivalently multi-labeled by every possible label. Taking multi-labeled tokens into account, the numerous possible paths can distract the training model from the gold path (the ground-truth label sequence) and thus hinder learning. In this paper, we propose AdaK-NER, an adaptive top-K approach, which helps the model focus on a smaller feasible region where the gold path is more likely to be located. We demonstrate the superiority of our approach through extensive experiments on both English and Chinese datasets, improving the F-score by 2% on average on CoNLL-2003 and by over 10% on two Chinese datasets from Financial Technology, compared with prior state-of-the-art works.

1 Introduction

Named Entity Recognition (NER) Li et al. (2020); Sang and De Meulder (2003); Peng et al. (2019) is a fundamental task in Natural Language Processing (NLP). The NER task aims at recognizing the meaningful entities occurring in text, which can benefit various downstream tasks such as question answering Cao et al. (2019), event extraction Wei et al. (2020), and opinion mining Poria et al. (2016). In particular, by detecting relevant entities in Financial Technology, NER helps financial professionals efficiently leverage the information in news, which is paramount for making high-quality decisions.

Strides in statistical models, such as Conditional Random Fields (CRF) Lafferty et al. (2001), and pre-trained language models like BERT Devlin et al. (2018) have equipped NER with new learning principles Li et al. (2020). Pre-trained models with rich representation ability can discover hidden features automatically, while CRF captures the dependencies between labels with the BIO or BIOES tagging scheme.

However, most existing methods rely on large amounts of fully annotated data for training NER models Li et al. (2020); Jia et al. (2020). Fulfilling such requirements is expensive and laborious in industries such as Financial Technology. Annotators, who are unlikely to be fully equipped with comprehensive domain knowledge, only annotate the named entities they recognize and leave the others out, resulting in incomplete annotations. They typically do not mark non-entities Surdeanu et al. (2010), so the recognized entities are the only available annotations. Figure 1(a) shows examples of both a gold path (a path is a label sequence for a given sentence) and an incomplete path.

Figure 1: (a) The sentence is annotated with the BIO tagging scheme. The entity types are person (PER) and location (LOC). Only the entity 'Bahrain' of LOC is recognized, while 'Barbados' of LOC and 'Franz Fischler' of PER are missing. (b) The weighted CRF model considers all 3125 possible paths over the 5 unannotated tokens for q estimation. (c) In our model, we build a candidate mask to filter out the less likely labels (labels in faded color). Therefore, our model considers significantly fewer possible paths than weighted CRF.

For a corpus with incomplete annotations, each unannotated token can be either part of an entity or a non-entity, making the token equivalently multi-labeled by every possible label. Since conventional CRF algorithms require fully annotated sentences, a strand of literature suggests assigning weights to possible labels Shang et al. (2018); Jie et al. (2019). Fuzzy CRF Shang et al. (2018) fills in the unannotated tokens by assigning equal probability to every possible path. Further, Jie et al. (2019) introduced a weighted CRF method that seeks a more reasonable distribution q over all possible paths, attempting to pay more attention to the paths with high potential to be the gold path.

Ideally, through comprehensive learning of the q distribution, the gold path can be correctly discovered. However, this ideal situation is difficult to achieve in practice. Intuitively, taking all possible paths into consideration distracts the model from the gold path: the feasible region (the set of possible paths in which we search for the gold path) grows exponentially with the number of unannotated tokens, which can cause a failure to identify the gold path.

To address this issue, one promising direction is to prune the feasible region during training. We assume the unknown gold path is among, or very close to, the top-K paths with the highest probabilities. Specifically, we utilize a novel adaptive K-best loss to help the training model focus on a smaller feasible region where the gold path is likely to be located. Furthermore, once a path is identified as a disqualified sequence, it is removed from the feasible region. This operation drastically eliminates redundancy without undermining effectiveness. For this purpose, a candidate mask is built to filter out the less likely paths and thereby restrict the size of the feasible region.

Trained in this way, our AdaK-NER overcomes the shortcomings of Fuzzy CRF and weighted CRF, resulting in a significant improvement in both precision and recall, and a higher F1 score as well.

In summary, the contributions of this work are:

  • We present a K-best mechanism for improving NER with incomplete annotations, aiming to focus on the gold path effectively among the most probable candidates.

  • We demonstrate both qualitatively and quantitatively that our approach achieves state-of-the-art performance compared to various baselines on both English and Chinese datasets.

2 Proposed Approach

Let x = (x_1, x_2, ..., x_n) be a training sentence of length n, with token x_i ∈ X. Correspondingly, y = (y_1, y_2, ..., y_n) denotes the complete label sequence, with y_i ∈ Y. The NER problem can be defined as inferring y from x.

Under the incomplete annotation framework, we introduce the following terminology. A possible path refers to a complete label sequence consistent with the annotated tokens. For example, a possible incompletely annotated label sequence for x can be y_u = (−, y_2, −, ..., −), where token x_2 is annotated as y_2 and the missing labels are marked as −. Then y_c = (y_{c_1}, y_2, ..., y_{c_n}) with y_{c_i} ∈ Y is a possible path for x, where all the missing labels − are replaced by elements of Y. The set C(y_u) denotes all possible complete paths for x with incomplete annotation y_u. D = {(x^i, y_u^i)} is the available incompletely annotated dataset.
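To make the feasible region concrete, here is a minimal sketch that enumerates C(y_u) by brute force. The toy label set and the use of `None` to mark missing labels are our own assumptions for illustration:

```python
from itertools import product

def possible_paths(y_u, labels):
    """Enumerate C(y_u): all complete label sequences consistent with
    the incomplete annotation y_u, where None marks a missing label."""
    slots = [[y] if y is not None else list(labels) for y in y_u]
    return [tuple(p) for p in product(*slots)]

# Toy label set; a real BIO scheme would also forbid invalid transitions.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
y_u = [None, "B-LOC", None]          # only the second token is annotated
paths = possible_paths(y_u, labels)
print(len(paths))                     # 5 * 1 * 5 = 25 completions
```

Even this 3-token toy example yields 25 completions, which is why the feasible region explodes for longer sentences.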

For the NER task, CRF Lafferty et al. (2001) is a traditional approach that captures the dependencies between labels by modeling the conditional probability p_w(y|x) of a label sequence y given an input sequence x of length n as:

pπ’˜β€‹(π’š|𝒙)=exp⁑(π’˜β‹…Ξ¦β€‹(𝒙,π’š))βˆ‘π’šβˆˆπ’€nexp⁑(π’˜β‹…Ξ¦β€‹(𝒙,π’š)).p_{\boldsymbol{w}}({\boldsymbol{y}}|{\boldsymbol{x}})=\frac{\exp({\boldsymbol{w}}\cdot\Phi({\boldsymbol{x}},{\boldsymbol{y}}))}{\sum_{{\boldsymbol{y}}\in{\boldsymbol{Y}}^{n}}\exp({\boldsymbol{w}}\cdot\Phi({\boldsymbol{x}},{\boldsymbol{y}}))}. (1)

Φ(x, y) denotes the map from a pair (x, y) to an arbitrary feature vector, w is the model parameter, and p_w(y|x) is the probability of y predicted by the model. Once w has been estimated by minimizing the negative log-likelihood:

L​(π’˜,𝒙)=βˆ’log⁑pπ’˜β€‹(π’š|𝒙),L({\boldsymbol{w}},{\boldsymbol{x}})=-\log p_{\boldsymbol{w}}({\boldsymbol{y}}|{\boldsymbol{x}}), (2)

the label sequence can be inferred by:

π’š^=arg⁑maxπ’šβˆˆπ’€n⁑pπ’˜β€‹(π’š|𝒙).\hat{{\boldsymbol{y}}}=\arg\max_{{\boldsymbol{y}}\in{\boldsymbol{Y}}^{n}}p_{\boldsymbol{w}}({\boldsymbol{y}}|{\boldsymbol{x}}).\vspace{-1pt} (3)

The original CRF learning algorithm requires a fully annotated sequence y, so incompletely annotated data is not directly applicable to it. Jie et al. (2019) modified the loss function as follows:

L​(π’˜,𝒙)=βˆ’logβ€‹βˆ‘π’šβˆˆC​(π’šu)q​(π’š|𝒙)​pπ’˜β€‹(π’š|𝒙),L({\boldsymbol{w}},{\boldsymbol{x}})=-\log\sum_{{\boldsymbol{y}}\in C({\boldsymbol{y}}_{u})}q({\boldsymbol{y}}|{\boldsymbol{x}})p_{\boldsymbol{w}}({\boldsymbol{y}}|{\boldsymbol{x}}),\vspace{-1pt} (4)

where q​(π’š|𝒙)q({\boldsymbol{y}}|{\boldsymbol{x}}) is an estimated distribution of all possible complete paths π’šβˆˆC​(π’šu){\boldsymbol{y}}\in C({\boldsymbol{y}}_{u}) for 𝒙{\boldsymbol{x}}.

We illustrate their model in Figure 1(b). Note that when q is a uniform distribution, the above CRF model is Fuzzy CRF Shang et al. (2018), which puts equal probability on all possible paths in C(y_u). Jie et al. (2019) claimed that q should be highly skewed rather than uniformly distributed, and therefore presented a k-fold cross-validation stacking method to approximate the distribution q.

Nonetheless, as Figure 1(b) shows, a sentence with only 6 words (1 annotated, 5 unannotated) has 3125 possible paths. We argue that identifying the gold path among all possible paths is like looking for a needle in a haystack. This motivates us to reduce redundant paths during training. We propose two major strategies (adaptive K-best loss and candidate mask) to induce the model to focus on the gold path (Figure 1(c)), and two minor strategies (an annealing technique and iterative sample selection) to further improve the model's effectiveness on the NER task. The workflow is summarized in Algorithm 1.

2.1 Adaptive K-best Loss

The Viterbi decoding algorithm is a dynamic programming technique that finds the most probable path in linear time, so it can be used to predict a path for an input given the parameters of the NER model. K-best Viterbi decoding Huang and Chiang (2005) extends the original Viterbi algorithm to extract the top-K paths with the highest probabilities. In incomplete data the gold path is unknown; we hypothesize that it is very likely to be identical or close to one of the top-K paths. This inspires us to introduce an auxiliary K-best loss component to help the model focus on a smaller yet promising region. A weight balances the weighted CRF loss and the auxiliary loss, and thus we modify (4) into:

Lk​(π’˜,𝒙)=\displaystyle L_{k}({\boldsymbol{w}},{\boldsymbol{x}})= βˆ’(1βˆ’Ξ»)​logβ€‹βˆ‘π’šβˆˆC​(π’šu)q​(π’š|𝒙)​pπ’˜β€‹(π’š|𝒙)\displaystyle-(1-\lambda)\log\sum_{{\boldsymbol{y}}\in C({\boldsymbol{y}}_{u})}q({\boldsymbol{y}}|{\boldsymbol{x}})p_{\boldsymbol{w}}({\boldsymbol{y}}|{\boldsymbol{x}}) (5)
βˆ’Ξ»β€‹logβ€‹βˆ‘π’šβˆˆKπ’˜β€‹(x)pw​(π’š|𝒙),\displaystyle-\lambda\log\sum_{{\boldsymbol{y}}\in K_{\boldsymbol{w}}(x)}p_{w}({\boldsymbol{y}}|{\boldsymbol{x}}),

where K_w(x) represents the top-K paths decoded by the constrained K-best Viterbi algorithm (the constrained decoding ensures the resulting complete paths are compatible with the incomplete annotations) with parameters w, and λ is an adaptive weight coefficient.
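For intuition only, the constrained top-K selection can be sketched by brute force: score every completion of y_u and keep the K best. Real K-best Viterbi decoding Huang and Chiang (2005) achieves this in polynomial time; the integer labels and score matrices below are illustrative assumptions:

```python
from itertools import product

def topk_paths(emissions, transitions, y_u, k):
    """Exponential-time stand-in for constrained K-best Viterbi decoding:
    score every completion of the incomplete annotation y_u and keep the
    K highest-scoring ones. None marks an unannotated token."""
    n = len(emissions)
    num_labels = len(emissions[0])
    slots = [[y] if y is not None else range(num_labels) for y in y_u]
    scored = []
    for path in product(*slots):
        s = emissions[0][path[0]]
        for t in range(1, n):
            s += transitions[path[t-1]][path[t]] + emissions[t][path[t]]
        scored.append((s, path))
    scored.sort(key=lambda sp: -sp[0])   # highest score first
    return scored[:k]
```

The constraint shows up in `slots`: annotated positions admit exactly one label, so every returned path is compatible with y_u.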

2.2 Estimating q with Candidate Mask

We divide the training data into k folds and employ k-fold cross-validation stacking to estimate q for each hold-out fold Jie et al. (2019). We propose an interpolation mode that adjusts q by increasing the probabilities of paths with high confidence and decreasing the others. The probability of each possible path is a temperature softmax of log p_{w_i}:

qπ’˜i​(π’š|𝒙)=exp⁑(log⁑pπ’˜i​(π’š|𝒙)/T)βˆ‘π’šexp⁑(log⁑pπ’˜i​(π’š|𝒙)/T),\displaystyle q_{{\boldsymbol{w}}_{i}}({\boldsymbol{y}}|{\boldsymbol{x}})=\frac{\exp\left(\log p_{{\boldsymbol{w}}_{i}}({\boldsymbol{y}}|{\boldsymbol{x}})/T\right)}{\sum_{{\boldsymbol{y}}}\exp\left(\log p_{{\boldsymbol{w}}_{i}}({\boldsymbol{y}}|{\boldsymbol{x}})/T\right)}, (6)

where T > 0 denotes the temperature and w_i is the model trained by holding out the i-th fold. A higher temperature produces a softer probability distribution over the paths, resulting in more diversity but also more mistakes Hinton et al. (2015). We iterate the cross-validation until q converges.
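Eq. (6) is a standard temperature softmax over path log-probabilities; a small sketch, where the toy path probabilities are assumptions:

```python
import numpy as np

def estimate_q(log_p, T=1.0):
    """Temperature softmax over the log-probabilities of candidate paths
    (Eq. 6): higher T flattens q, lower T sharpens it."""
    z = np.asarray(log_p, dtype=float) / T
    z -= z.max()                       # stabilise the exponentials
    q = np.exp(z)
    return q / q.sum()

log_p = np.log([0.7, 0.2, 0.1])        # three candidate paths
q_sharp = estimate_q(log_p, T=0.5)     # more mass on the best path
q_soft  = estimate_q(log_p, T=5.0)     # closer to uniform
```

At T = 1 the softmax simply renormalizes the path probabilities; moving T away from 1 interpolates between a near-one-hot and a near-uniform q, which is exactly the skewness knob the method needs.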

Jie et al. (2019) estimated q(y|x) for each y ∈ C(y_u), but the size of C(y_u) grows exponentially in the number of unannotated tokens in x. To reduce the number of possible paths for q estimation, we build a candidate mask based on the K-best candidates and the self-built candidates.

Data: D = {(x^i, y_u^i)}
Randomly divide D into k folds: D_1, D_2, ..., D_k
Entity dictionary 𝓗 ← ∅
Initialize model M with parameters ŵ
Initialize q distributions {q(·|x^i)}
Sample importance scores s_i ← 1
Hyper-parameters s and c
for iteration = 1, ..., N do
    % Sample selection
    D' ← D
    for j = 1, ..., k do
        D'_j ← D_j
        for (x^i, y_u^i) ∈ D'_j do
            if s_i < s then
                remove (x^i, y_u^i) from D'_j and D'
            end if
        end for
    end for
    % q distribution estimation
    for j = 1, ..., k do
        Train M(w_j) on D' \ D'_j: Eq. (7)
        for (x^i, y_u^i) ∈ D'_j do
            Predict K_b(x^i) with M(ŵ)
            Extract H(x^i) from 𝓗
            Possible paths S = S(y_u^i, K_b(x^i), H(x^i))
            Estimate q(y|x^i) for y ∈ S: Eq. (6)
            s_i = max_y p_{w_j}(y|x^i)
            e_i ← {entities} predicted by M(w_j)
        end for
    end for
    % Dictionary 𝓗 update
    𝓗 ← ∅
    for entity ∈ ∪ e_i do
        if entity ∉ 𝓗 and freq(entity) > c then
            𝓗 ← add_entity(𝓗, entity)
        end if
    end for
    Train M(w') on D with q: Eq. (7)
    if F1 of M(w') > F1 of M(ŵ) on Dev then
        ŵ ← w'
    end if
end for
Return the final NER model M(ŵ)
Algorithm 1: AdaK-NER

K-best Candidates.

At the end of each iteration, we train a model M(ŵ) on the whole training data D. In the next iteration, we use the trained model M(ŵ) to identify the K-best candidate set K_b(x) for each sample x by constrained K-best Viterbi decoding. K_b(x) = {K̂_i(x)}_{i=1,...,K} contains the top-K paths with the highest probabilities, where K̂_i(x) = [K̂_i(x_1), K̂_i(x_2), ..., K̂_i(x_n)].

Self-built Candidates.

In the current iteration, after training a model M(w_i) on (k−1) folds, we use M(w_i) to predict a path for each sample in the hold-out fold and extract entities from the predicted path. We then merge all entities identified by the k models {M(w_i)}_{i=1,...,k}, resulting in an entity dictionary 𝓗. For each sample x we conjecture that its named entities should lie in the dictionary 𝓗. Consequently, in the next iteration we form a self-built candidate H(x) = [H(x_1), H(x_2), ..., H(x_n)] for each x of length n. H(x_j) is the corresponding entity label if the token x_j is part of an entity in 𝓗; otherwise H(x_j) is the O label.

We utilize the above candidates (i.e., the K-best candidate set K_b(x) and the self-built candidate H(x)) to construct a candidate mask for x. For each unannotated x_j in x, the possible label set consists of (1) the O label, (2) H(x_j), and (3) ∪_{i=1}^{K} K̂_i(x_j).
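The mask construction for a single token is a simple set union; a sketch, where the toy annotation, K-best paths, and H(x) are assumptions:

```python
def candidate_labels(j, annotation, kbest_paths, h_path):
    """Possible labels for token j under the candidate mask: the
    annotated label if present; otherwise O, the self-built candidate
    H(x_j), and the labels the K-best paths assign to position j."""
    if annotation[j] is not None:          # annotated tokens keep their label
        return {annotation[j]}
    allowed = {"O", h_path[j]}
    allowed.update(path[j] for path in kbest_paths)
    return allowed

annotation = [None, "B-LOC", None]                     # incomplete path y_u
kbest = [["B-PER", "B-LOC", "O"], ["B-LOC", "B-LOC", "O"]]   # K = 2 paths
h = ["O", "B-LOC", "B-PER"]                            # self-built candidate
print(candidate_labels(0, annotation, kbest, h))       # O, B-PER and B-LOC
```

All labels outside the returned set are masked out, so the feasible region for q estimation is the product of these (much smaller) per-token sets.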

For example, as Figure 1(c) shows, the unannotated token 'Barbados' is predicted as B-PER and B-LOC in the candidate paths above, so we treat B-PER, B-LOC and the O label as the possible labels of 'Barbados' and mask the other labels.

With this masking scheme, we can significantly narrow down the feasible region of x (Figure 1(c)) when estimating q(·|x). After estimating q(·|x), we can train a model through the modified loss:

Lm​(π’˜,𝒙)=\displaystyle L_{m}({\boldsymbol{w}},{\boldsymbol{x}})= βˆ’(1βˆ’Ξ»)​logβ€‹βˆ‘π’šβˆˆSq​(π’š|𝒙)​pπ’˜β€‹(π’š|𝒙)\displaystyle-(1-\lambda)\log\sum_{{\boldsymbol{y}}\in S}q({\boldsymbol{y}}|{\boldsymbol{x}})p_{\boldsymbol{w}}({\boldsymbol{y}}|{\boldsymbol{x}}) (7)
βˆ’Ξ»β€‹logβ€‹βˆ‘π’šβˆˆKπ’˜β€‹(𝒙)pπ’˜β€‹(π’š|𝒙),\displaystyle-\lambda\log\sum_{{\boldsymbol{y}}\in K_{\boldsymbol{w}}({\boldsymbol{x}})}p_{\boldsymbol{w}}({\boldsymbol{y}}|{\boldsymbol{x}}),

where S = S(y_u, K_b(x), H(x)) contains the possible paths restricted by the candidate mask.

Dataset      | Train           | Dev             | Test
             | #entity  #sent  | #entity  #sent  | #entity  #sent
CoNLL-2003   | 23499    14041  | 5942     3250   | 5648     3453
Taobao       | 29397    6000   | 4941     998    | 4866     1000
Youku        | 12754    8001   | 1580     1000   | 1570     1001
Table 1: Data statistics for CoNLL-2003, Taobao and Youku. '#entity' represents the number of entities, and '#sent' the number of sentences.

2.3 Annealing Technique for λ

Intuitively, the top-K paths decoded by the algorithm can be of poor quality at the beginning of training, because the model parameters used in decoding have not yet been trained adequately. Therefore, we employ an annealing technique to adapt λ during training:

λ(b) = exp[γ(b/B − 1)],

where b is the current training step, B is the total number of training steps, and γ is a hyper-parameter that controls how fast λ grows. The coefficient λ increases rapidly in the latter half of training, encouraging the model to extract more information from the top-K paths.
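The schedule is straightforward to implement; a sketch, with γ = 5 chosen arbitrarily for illustration since the paper treats it as a hyper-parameter:

```python
import math

def annealed_lambda(b, B, gamma=5.0):
    """Annealing schedule for the K-best loss weight: lambda rises
    monotonically from exp(-gamma) at step 0 to 1 at the final step B."""
    return math.exp(gamma * (b / B - 1.0))

B = 1000
print(annealed_lambda(0, B))      # exp(-5), roughly 0.0067
print(annealed_lambda(B, B))      # 1.0
```

Early in training the weighted CRF term dominates; by the end the auxiliary K-best term carries weight λ = 1, matching the intuition that the decoded top-K paths only become trustworthy later on.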

2.4 Iterative Sample Selection

Due to the incomplete annotation, some samples have poorly estimated q distributions. We use sample selection to deal with these samples. In each iteration, after training a model M(w_i) on (k−1) folds, we utilize M(w_i) to decode the most probable path ŷ for each x ∈ D_i, and meanwhile assign a probability score s = p_{w_i}(ŷ|x) to x. Iterative sample selection then selects the samples with probability scores beyond a threshold to construct new training data, which is used in the training phase of k-fold cross-validation in the next iteration (see Algorithm 1 for details). This strategy helps the model identify the gold path effectively with high-quality samples.
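The selection step itself is a simple threshold filter; a sketch, where the threshold value and sample names are assumptions:

```python
def select_samples(fold, scores, threshold=0.5):
    """Keep only samples whose best-path probability from the hold-out
    model meets the threshold; the rest are dropped from the next round
    of k-fold training. The threshold value here is an assumption."""
    return [x for x, s in zip(fold, scores) if s >= threshold]

fold = ["sent-1", "sent-2", "sent-3"]
scores = [0.9, 0.3, 0.7]              # s_i = max_y p_w(y | x_i)
print(select_samples(fold, scores))   # ['sent-1', 'sent-3']
```

Dropped samples are only excluded from the cross-validation training folds; they still receive fresh q estimates and can re-enter once their scores improve.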

3 Experiments

3.1 Dataset

To benchmark AdaK-NER against its state-of-the-art alternatives in realistic settings, we consider one standard English dataset and two Chinese datasets from the Financial Technology industry: (i) CoNLL-2003 English Sang and De Meulder (2003): annotated with person (PER), location (LOC), organization (ORG) and miscellaneous (MISC). (ii) Taobao (http://www.taobao.com): a Chinese e-commerce site; model type (PATTERN), product type (PRODUCT), brand type (BRAND) and other entities (MISC) are recognized in the dataset. (iii) Youku (http://www.youku.com): a Chinese video-streaming website with videos from various domains; figure type (FIGURE), program type (PROGRAM) and the others (MISC) are annotated. Statistics for the datasets are presented in Table 1.

We randomly remove a proportion of the entities and all O labels to construct the incomplete annotation, with ρ representing the ratio of entities that remain annotated. We employ two schemes for removing entities:

  • Random-based Scheme simply removes entities at random Jie et al. (2019); Li et al. (2021), which simulates the situation in which a given entity is not recognized by an annotator.

  • Entity-based Scheme removes all occurrences of a randomly selected entity until the desired amount remains Mayhew et al. (2019); Effland and Collins (2021); Wang et al. (2019). For example, if the entity 'Bahrain' is selected, then every occurrence of 'Bahrain' is removed. This slightly more complicated scheme matches the situation in which some entities in a special domain cannot be recognized by non-expert annotators.
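The two removal schemes can be sketched as follows; the mention representation and the stopping rule are our own assumptions, not the authors' exact procedure:

```python
import random

def remove_random(annotations, rho, rng):
    """Random-based scheme: keep each entity mention independently
    with probability rho; dropped mentions become unannotated."""
    return [m for m in annotations if rng.random() < rho]

def remove_by_entity(annotations, rho, rng):
    """Entity-based scheme: drop every mention of randomly chosen
    surface forms until roughly a rho fraction of mentions remains."""
    kept = list(annotations)
    surfaces = sorted({m[0] for m in kept})   # distinct surface forms
    rng.shuffle(surfaces)
    while surfaces and len(kept) > rho * len(annotations):
        s = surfaces.pop()
        kept = [m for m in kept if m[0] != s]
    return kept

rng = random.Random(0)
mentions = [("Bahrain", "LOC"), ("Bahrain", "LOC"), ("Fischler", "PER")]
print(remove_by_entity(mentions, 0.4, rng))
```

The entity-based scheme is harsher: once a surface form is selected, the model never sees any annotated occurrence of it, mimicking a systematic blind spot of non-expert annotators.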

In line with the low recall of entities tagged by non-speaker annotators reported by Mayhew et al. (2019), we set ρ = 0.2 and ρ = 0.4 in our experiments.

3.2 Experiment Setup

Evaluation Metrics.

We consider the following performance metrics: precision (P), recall (R), and balanced F-score (F1). These metrics are calculated from the true entities and the recognized entities. The F1 score is the main metric used to evaluate the final NER models of the baselines and our approach.

Ratio    Model               | CoNLL-2003          | Taobao              | Youku
                             | P↑     R↑     F1↑   | P↑     R↑     F1↑   | P↑     R↑     F1↑
ρ=0.2    BERT CRF            | 81.42  15.05  25.40 | 83.11  24.06  37.32 | 64.85  20.45  31.09
         BERT Fuzzy CRF      | 17.94  88.14  29.81 | 41.48  80.39  54.72 | 22.74  84.65  35.85
         BERT weighted CRF   | 85.03  82.65  83.82 | 70.06  57.85  63.37 | 70.18  38.98  50.12
         AdaK-NER            | 87.05  86.74  86.89 | 74.24  78.89  76.50 | 78.21  79.96  78.09
ρ=0.4    BERT CRF            | 80.07  51.25  62.49 | 84.76  47.68  61.03 | 78.89  50.70  61.73
         BERT Fuzzy CRF      | 14.89  86.61  25.41 | 43.51  85.02  57.56 | 30.88  84.01  45.16
         BERT weighted CRF   | 85.40  88.69  87.01 | 73.17  81.09  76.93 | 74.99  82.29  78.47
         AdaK-NER            | 87.47  88.70  88.08 | 74.08  80.13  76.99 | 78.38  81.53  79.93
ρ=1.0    BERT CRF            | 91.34  92.36  91.85 | 86.01  88.20  87.09 | 83.20  84.52  83.85
Table 2: Performance comparison between different models on three datasets with the Random-based Scheme.
Ratio    Model               | CoNLL-2003          | Taobao              | Youku
                             | P↑     R↑     F1↑   | P↑     R↑     F1↑   | P↑     R↑     F1↑
ρ=0.2    BERT CRF            | 86.79  18.36  30.31 | 39.62  10.58  16.70 | 69.10  15.67  25.55
         BERT Fuzzy CRF      | 15.99  86.30  26.98 | 42.49  82.33  56.05 | 27.79  86.37  42.05
         BERT weighted CRF   | 83.40  70.96  76.68 | 73.49  52.63  61.33 | 74.71  32.55  45.34
         AdaK-NER            | 86.32  71.72  78.35 | 73.24  76.59  74.88 | 78.86  75.54  76.20
ρ=0.4    BERT CRF            | 86.68  34.26  49.11 | 78.43  39.68  52.70 | 62.16  35.16  44.91
         BERT Fuzzy CRF      | 13.84  84.60  23.79 | 42.24  81.07  55.54 | 32.10  82.87  46.27
         BERT weighted CRF   | 84.68  76.91  80.61 | 74.65  79.57  77.03 | 75.67  80.64  78.08
         AdaK-NER            | 85.48  77.85  81.49 | 74.58  80.54  77.44 | 79.01  81.02  80.00
ρ=1.0    BERT CRF            | 91.34  92.36  91.85 | 86.01  88.20  87.09 | 83.20  84.52  83.85
Table 3: Performance comparison between different models on three datasets with the Entity-based Scheme.

Baselines.

We consider several strong baselines to compare with the proposed method, including BERT with conventional CRF (CRF for short) Lafferty et al. (2001), BERT with Fuzzy CRF Shang et al. (2018), and BERT with weighted CRF presented by Jie et al. (2019). CRF regards all unannotated tokens as the O label to form complete paths, while Fuzzy CRF treats all possible paths compatible with the incomplete path with equal probability. Weighted CRF assigns an estimated distribution to all possible paths derived from the incomplete path to train the model.

Training details.

We employ BERT Devlin et al. (2018) as the neural architecture for the baselines and our AdaK-NER. Specifically, we use pretrained Chinese BERT with whole word masking Cui et al. (2019) for the Chinese datasets and pretrained BERT with case-preserving WordPiece Devlin et al. (2018) for the CoNLL-2003 English dataset. Unless otherwise specified, we set the hyperparameter top-K to 5 by default for illustrative purposes. Based on the fact that a larger k-fold value has a negligible effect Jie et al. (2019), we split the training data into 2 folds (i.e., k = 2). We initialize the q distribution by assigning each unannotated token the O label to form complete paths, and iteratively update q by k-fold cross-validation stacking. Empirically, we set the iteration number to 10, which is enough for our model to converge.

3.3 Experimental Results

To validate the utility of our model, we experiment on a wide range of real-world tasks with entity keeping ratios ρ = 0.2 and ρ = 0.4. We present the results with the Random-based Scheme in Table 2 and the Entity-based Scheme in Table 3. We compare the performance of our method to the other competing solutions, with each baseline carefully tuned to ensure fairness. In all cases, CRF has high precision and low recall because it labels all the unannotated tokens as O. In contrast, Fuzzy CRF takes all possible paths into account, which mismatches the gold path but recalls more entities. Weighted CRF outperforms both CRF and Fuzzy CRF, indicating that the distribution q should be highly skewed rather than uniform.

With the adaptive K-best loss, candidate mask, annealing technique and iterative sample selection, our AdaK-NER performs strongly, exhibiting both high precision and high recall on all datasets and giving the best F1 scores among the four models. The improvement is especially remarkable on the Chinese Taobao and Youku datasets for ρ = 0.2, delivering over 13% and 27% increases in F1 score with the Random-based Scheme, and over 13% and 30% increases with the Entity-based Scheme.

Note that on CoNLL-2003 and Youku, the F1 score of AdaK-NER with the Random-based Scheme is only roughly 5% lower than that of CRF trained on complete data (ρ = 1), even though AdaK-NER is trained on data with only 20% of the entities available (ρ = 0.2). On the other Chinese dataset, our model also achieves encouraging improvements over the other methods and presents a step toward more accurate NER with incomplete annotations.

The Entity-based Scheme is more restrictive; nevertheless, our model still achieves the best F1 score among all methods. The overall results show that AdaK-NER achieves state-of-the-art performance compared to various baselines on both English and Chinese datasets with incomplete annotations.

Figure 2: (left) Sensitivity analysis of the top-K truncation; a smaller K is more sensitive to the truncation. (right) F1 score comparison between Fuzzy CRF, weighted CRF and our model on the Taobao dataset across different ρ values with the Random-based Scheme.
Model                       | CoNLL-2003          | Taobao              | Youku
                            | P↑     R↑     F1↑   | P↑     R↑     F1↑   | P↑     R↑     F1↑
w/o K-best loss             | 88.55  80.49  84.33 | 78.64  53.12  63.41 | 62.26  39.30  48.18
w/o weighted loss           | 83.79  82.84  83.32 | 47.85  66.15  55.53 | 73.46  71.59  72.52
w/o annealing               | 88.36  84.01  86.13 | 76.09  60.05  67.13 | 80.98  62.29  70.79
w/o K-best candidates       | 84.42  73.76  78.73 | 72.75  56.62  63.68 | 73.94  58.03  65.02
w/o self-built candidates   | 87.52  86.33  86.92 | 72.38  77.44  74.82 | 77.88  74.46  76.13
w/o candidate mask          | 84.97  86.51  85.73 | 68.29  79.16  73.32 | 73.40  79.81  76.47
w/o sample selection        | 86.64  86.03  86.33 | 72.88  79.59  76.09 | 78.48  79.43  78.95
AdaK-NER                    | 87.05  86.74  86.89 | 74.24  78.89  76.50 | 78.21  79.96  78.09
Table 4: Ablation study for AdaK-NER on three datasets with the Random-based Scheme for ρ = 0.2.

The Effect of K.

As discussed in Sections 2.1 and 2.2, the parameter K affects the learning procedure in two ways. We compare the performance of different K on the Taobao dataset with the Random-based Scheme and ρ = 0.2, selecting the top-K hyperparameter from {1, 3, 5, 7, 9} on the validation set. As illustrated in Figure 2, a relatively large K delivers better empirical results, and the metrics (precision, recall and F1) are very close for K = 5, 7, 9. Meanwhile, a smaller K narrows down the possible paths more effectively in theory. Hence we favor K = 5 as a balanced choice.
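K-best candidate paths of the kind discussed here are typically obtained with top-K (K-best) Viterbi decoding Huang and Chiang (2005). Below is a minimal beam-based sketch assuming log-space emission and transition scores; the exact decoder used in AdaK-NER may differ in implementation details.

```python
import heapq
import numpy as np

def topk_viterbi(emissions, transitions, k):
    """Return the k highest-scoring label sequences as (score, path) pairs.
    emissions: (T, L) log-potentials; transitions: (L, L) log-potentials.
    Keeping the k best partial paths per end-state is sufficient, because any
    globally top-k path's prefix must be among the top-k prefixes at its state."""
    T, L = emissions.shape
    # beam[y]: up to k best (score, path) partial sequences ending in label y
    beam = {y: [(emissions[0, y], (y,))] for y in range(L)}
    for t in range(1, T):
        new_beam = {}
        for y in range(L):
            candidates = []
            for partials in beam.values():
                for score, path in partials:
                    candidates.append(
                        (score + transitions[path[-1], y] + emissions[t, y],
                         path + (y,)))
            new_beam[y] = heapq.nlargest(k, candidates)
        beam = new_beam
    finals = [sp for partials in beam.values() for sp in partials]
    return heapq.nlargest(k, finals)
```

With K = 1 this reduces to ordinary Viterbi decoding; for K = 5 on a sequence of length T it keeps at most 5L partial paths per step rather than enumerating all L^T label sequences.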

The Effect of ρ.

We further examine how the annotation rate ρ interacts with learning. Figure 2 plots the F1 score on the Taobao dataset with the Random-based Scheme across varying annotation rates. The annotations removed at a larger ρ are a subset of those removed at a smaller ρ (nested removals). All models deliver better results as ρ increases. Our model consistently outperforms weighted CRF and Fuzzy CRF, and the improvement is largest when ρ is relatively small, indicating that our model is especially powerful when the annotated tokens are fairly sparse.
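The nesting of removals across ρ values can be reproduced by ranking the entity spans once with a fixed seed and keeping a prefix of that ranking. A minimal sketch of this simulation (the seed value and span representation are illustrative, not from the paper):

```python
import random

def keep_entities(entity_spans, rho, seed=13):
    """Simulate incomplete annotation: keep a rho fraction of entity spans.
    Shuffling once with a fixed seed and taking a prefix guarantees that the
    spans kept at a smaller rho are a subset of those kept at a larger rho,
    i.e. the removals are nested across rho values."""
    order = list(entity_spans)
    random.Random(seed).shuffle(order)
    n_keep = round(rho * len(order))
    return set(order[:n_keep])
```

All spans outside the returned set have their labels dropped and their tokens treated as unannotated, matching the keeping-ratio setup used in the experiments.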

3.4 Ablation Study

To investigate the effectiveness of the proposed strategies used in AdaK-NER, we conduct the following ablation with the Random-based Scheme and ρ = 0.2. As shown in Table 4, the adaptive K-best loss contributes the most on all three datasets: it helps the model achieve higher recall while preserving acceptable precision. On the Youku dataset in particular, removing it causes a significant drop of about 40% in recall. The weighted CRF loss is indispensable, and the annealing method helps the model achieve better results. The candidate mask promotes precision while keeping recall high. Both the K-best candidates and the self-built candidates improve model performance. Iterative sample selection makes a positive contribution on CoNLL-2003 and Taobao, whereas it slightly hurts performance on Youku. In general, incorporating these techniques enhances model performance on incompletely annotated data.

4 Related Works

Pre-trained Language Models

have been an emerging direction in NLP since Google released BERT Devlin et al. (2018) in 2018. Built on the powerful Transformer architecture, several pre-trained models, such as BERT, the generative pre-training model (GPT), and their variants, have achieved state-of-the-art performance on various NLP tasks including NER Devlin et al. (2018); Liu et al. (2019). Yang et al. (2019) proposed a pre-trained permutation language model (XLNet) to overcome the limitations of denoising-autoencoding-based pre-training. Liu et al. (2019) demonstrated that more data and more careful parameter tuning benefit pre-trained language models, and released a new pre-trained model (RoBERTa). Following this trend, we use BERT as our neural model in this work.

Statistical Modeling

has been widely employed in sequence labeling. Classical models learn label sequences through graph-based representations, with prominent examples including the Hidden Markov Model (HMM), Maximum Entropy Markov Models (MEMM) and Conditional Random Fields (CRF) Lafferty et al. (2001). Among them, CRF is generally preferred, since it resolves the label bias issue of MEMM and does not require the strong independence assumptions of HMM. However, the conventional CRF is not directly applicable to incomplete annotations. Ni et al. (2017) selected the sentences with the highest confidence and regarded missing labels as O. Another line of work replaces the CRF with a Partial CRF Nooralahzadeh et al. (2019); Huang et al. (2019) or Fuzzy CRF Shang et al. (2018), which assigns unlabeled words all possible labels and maximizes the total probability Yang et al. (2018). Although these works have produced promising results, they still require external knowledge for high-quality performance. Jie et al. (2019) presented a weighted CRF model, which is most closely related to our work: they estimated a proper distribution over all possible paths derived from the incomplete annotations. Our work enhances Fuzzy CRF by reducing the number of possible paths by a large margin, so as to better focus on the gold path.
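For concreteness, writing C(ỹ) for the set of complete label sequences consistent with a partial annotation ỹ, the two objectives discussed above are commonly stated as follows (this is a standard formulation; the papers' exact notation may differ):

```latex
% Fuzzy CRF: maximize the total probability of all consistent paths
\mathcal{L}_{\text{fuzzy}}(\theta)
  = -\log \sum_{\mathbf{y} \in \mathcal{C}(\tilde{\mathbf{y}})}
      p_{\theta}(\mathbf{y} \mid \mathbf{x})

% Weighted CRF: weight each consistent path by an estimated distribution q
\mathcal{L}_{\text{weighted}}(\theta)
  = -\sum_{\mathbf{y} \in \mathcal{C}(\tilde{\mathbf{y}})}
      q(\mathbf{y} \mid \mathbf{x}) \, \log p_{\theta}(\mathbf{y} \mid \mathbf{x})
```

The fuzzy objective treats every consistent path alike, while the weighted objective concentrates mass on paths that q deems likely; our approach instead shrinks C(ỹ) itself so that either objective operates on a smaller feasible region around the gold path.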

5 Conclusion

In this paper, we explore how to build an effective NER model using only incompletely annotated data. We propose two major strategies: a novel adaptive K-best loss, and a mask based on K-best candidates and self-built candidates, both of which help the model focus on the gold path. The results show that our approach can significantly improve the performance of NER models with incomplete annotations.

References

  • Cao et al. [2019] Yu Cao, Meng Fang, and Dacheng Tao. BAG: Bi-directional attention entity graph convolutional network for multi-hop reasoning question answering. arXiv preprint arXiv:1904.04969, 2019.
  • Cui et al. [2019] Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, and Guoping Hu. Cross-lingual machine reading comprehension. arXiv preprint arXiv:1909.00361, 2019.
  • Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Effland and Collins [2021] Thomas Effland and Michael Collins. Partially supervised named entity recognition via the expected entity ratio loss. arXiv preprint arXiv:2108.07216, 2021.
  • Hinton et al. [2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • Huang and Chiang [2005] Liang Huang and David Chiang. Better k-best parsing. In Proceedings of the Ninth International Workshop on Parsing Technology, 2005.
  • Huang et al. [2019] Xiao Huang, Li Dong, Elizabeth Boschee, and Nanyun Peng. Learning a unified named entity tagger from multiple partially annotated corpora for efficient adaptation. arXiv preprint arXiv:1909.11535, 2019.
  • Jia et al. [2020] Chen Jia, Yuefeng Shi, Qinrong Yang, and Yue Zhang. Entity enhanced BERT pre-training for Chinese NER. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6384–6396, 2020.
  • Jie et al. [2019] Zhanming Jie, Pengjun Xie, Wei Lu, Ruixue Ding, and Linlin Li. Better modeling of incomplete annotations for named entity recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 729–734, 2019.
  • Lafferty et al. [2001] John Lafferty, Andrew McCallum, and Fernando CN Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), 2001.
  • Li et al. [2020] Xiaonan Li, Hang Yan, Xipeng Qiu, and Xuanjing Huang. FLAT: Chinese NER using flat-lattice transformer. arXiv preprint arXiv:2004.11795, 2020.
  • Li et al. [2021] Yangming Li, Lemao Liu, and Shuming Shi. Empirical analysis of unlabeled entity problem in named entity recognition. In International Conference on Learning Representations, 2021.
  • Liu et al. [2019] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • Mayhew et al. [2019] Stephen Mayhew, Snigdha Chaturvedi, Chen-Tse Tsai, and Dan Roth. Named entity recognition with partially annotated training data. arXiv preprint arXiv:1909.09270, 2019.
  • Ni et al. [2017] Jian Ni, Georgiana Dinu, and Radu Florian. Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection. arXiv preprint arXiv:1707.02483, 2017.
  • Nooralahzadeh et al. [2019] Farhad Nooralahzadeh, Jan Tore Lønning, and Lilja Øvrelid. Reinforcement-based denoising of distantly supervised NER with partial annotation. Association for Computational Linguistics, 2019.
  • Peng et al. [2019] Minlong Peng, Xiaoyu Xing, Qi Zhang, Jinlan Fu, and Xuanjing Huang. Distantly supervised named entity recognition using positive-unlabeled learning. arXiv preprint arXiv:1906.01378, 2019.
  • Poria et al. [2016] Soujanya Poria, Erik Cambria, and Alexander Gelbukh. Aspect extraction for opinion mining with a deep convolutional neural network. Knowledge-Based Systems, 108:42–49, 2016.
  • Sang and De Meulder [2003] Erik F Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050, 2003.
  • Shang et al. [2018] Jingbo Shang, Liyuan Liu, Xiang Ren, Xiaotao Gu, Teng Ren, and Jiawei Han. Learning named entity tagger using domain-specific dictionary. arXiv preprint arXiv:1809.03599, 2018.
  • Surdeanu et al. [2010] Mihai Surdeanu, Ramesh Nallapati, and Christopher Manning. Legal claim identification: Information extraction with hierarchically labeled data. In Workshop Programme, page 22, 2010.
  • Wang et al. [2019] Zihan Wang, Jingbo Shang, Liyuan Liu, Lihao Lu, Jiacheng Liu, and Jiawei Han. CrossWeigh: Training named entity tagger from imperfect annotations. arXiv preprint arXiv:1909.01441, 2019.
  • Wei et al. [2020] Qiang Wei, Zongcheng Ji, Zhiheng Li, Jingcheng Du, Jingqi Wang, Jun Xu, Yang Xiang, Firat Tiryaki, Stephen Wu, Yaoyun Zhang, et al. A study of deep learning approaches for medication and adverse drug event extraction from clinical text. Journal of the American Medical Informatics Association, 27(1):13–21, 2020.
  • Yang et al. [2018] Yaosheng Yang, Wenliang Chen, Zhenghua Li, Zhengqiu He, and Min Zhang. Distantly supervised NER with partial annotation learning and reinforcement learning. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2159–2169, 2018.
  • Yang et al. [2019] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5753–5763, 2019.