JaMIE: A Pipeline Japanese Medical Information Extraction System
Abstract
We present an open-access natural language processing toolkit for Japanese medical information extraction. We first propose a novel relation annotation schema for investigating the medical and temporal relations between medical entities in Japanese medical reports. We experiment with practical annotation scenarios by separately annotating two different types of reports. We design a pipeline system with three components for recognizing medical entities, classifying entity modalities, and extracting relations. The empirical results show accurate analysis performance and suggest satisfactory annotation quality, an effective annotation strategy for targeting report types, and the superiority of the latest contextual embedding models.
1 Introduction
Electronic medical record systems have been widely adopted in hospitals. In the past decade, research efforts have been devoted to automated Information Extraction (IE) from raw medical reports. This approach should relieve users of the burden of manually reading and understanding large volumes of records. While substantial progress has already been made in medical IE, it still suffers from the following limitations.
First, language boundaries hinder the reuse of existing research across languages. Progress on English corpora and approaches does not readily carry over to other languages. Morita et al. (2013); Aramaki et al. (2014, 2016) present a series of Japanese clinical IE shared tasks. However, more semantic-aware tasks such as medical relation extraction Uzuner et al. (2011) and temporal relation extraction Bethard et al. (2017) are still undeveloped. Second, most existing medical IE datasets focus on general report content such as discharge summaries, instead of more specific report types and diseases. Such settings potentially sacrifice accuracy when analyzing specific report types, such as radiography interpretation reports.
In this work, we first propose a novel relation annotation schema for investigating the medical and temporal relations in Japanese medical reports. Then, we explore the correlation between the annotation effort spent on specific report types and the accuracy of analyzing them, which is especially important for practical medical applications. Therefore, we compare the analysis of two report types involving diseases with high death rates: (1) specific radiography interpretation reports of lung cancer (LC), and (2) medical history reports (containing multiple types of reports relevant to a patient) of idiopathic pulmonary fibrosis (IPF). The relation annotation is built on the existing entities presented by Yada et al. (2020), who annotated the medical entities (e.g. disease, anatomical) and their modality information (e.g. positive, suspicious) in Japanese medical reports.
While rich English NLP tools for medical IE have been developed, such as cTAKES Savova et al. (2010) and MetaMap Aronson and Lang (2010), few Japanese tools were available until MedEx/J Aramaki et al. (2018). MedEx/J extracts only diseases and their negation information. In this paper, we present JaMIE: a pipeline Japanese Medical IE system, which can extract a wider range of medical information, including medical entities, entity modalities, and relations, from raw medical reports.
Table 1: Examples of the medical and temporal relation types.

Category | Relation Type | Example |
---|---|---|
Medical | | The <A>intrahepatic bile ducts</A> are <C>dilated</C>. |
Medical | | <C>Not much has changed</C> since <TIMEX3>September 2003</TIMEX3>. |
Medical | | No <F>pathologically significant</F> <D>lymph node enlargement</D>. |
Medical | | There are no <D>abnormalities</D> in the <A>liver</A>. |
Medical | | <T-key>Smoking</T-key>: <T-val>20 cigarettes</T-val> |
Temporal | | On <TIMEX3>Sep 20XX</TIMEX3>, diagnosed as <D>podagra</D>. |
Temporal | | She <CC>attended a cardiology clinic</CC>, during <TIMEX3>11–22 April</TIMEX3>. |
Temporal | | PSL 10mg/day had been kept since <TIMEX3>11 Aug</TIMEX3>, but it was <C>normalized</C>. |
Temporal | | <M-key>Equa</M-key> started at <TIMEX3>23 April</TIMEX3> |
Temporal | | On <TIMEX3>17 Nov</TIMEX3>, quitting <R>HOT</R>. |
In summary, our contributions are three-fold:
- We present a novel annotation schema for both medical and temporal relations in Japanese medical reports.
- We manually annotate the relations for two types of reports and empirically analyze their performance and the required annotation amount.
- We release an open-access toolkit, JaMIE, for automatically and accurately annotating medical entities (F1: 95.65/85.49), entity modalities (F1: 94.10/78.06), and relations (F1: 86.53/71.04) for the two report types.
Although the annotated corpus cannot be released because of the raised anonymization level, the system code and trained models are released (https://github.com/racerandom/JaMIE/tree/demo).
2 Japanese Medical IE Annotation
2.1 Entity and Modality Annotation
We adopt the following medical entity types defined in Yada et al. (2020): Diseases and symptoms <D>, Anatomical entities <A>, Features and measurements <F>, Change <C>, Time <TIMEX3>, Test <T-test/key/val>, Medicine <M-key/val>, Remedy <R>, Clinical Context <CC>. For the complete entity and modality definitions, see the original paper.
2.2 Relation Annotation
On top of the entity and modality annotation above, we designed relation types between two entities. They can be categorized into medical relations and temporal relations. An example of each relation type is presented in Table 1.
2.2.1 Medical Relations
A medical relation rel(<X>, <Y>) denotes that an entity of type <X> has a relation of type rel toward another entity of type <Y>, in which <X> and <Y> can be any entity type defined above (including the case where <X> is the same type as <Y>).
- change: A <C> entity changes the status of another entity, the type of which can be <D>, <A>, or <T/M-key>. A <C> is often presented as ‘dilate’, ‘shrink’, ‘appear’, etc.
- compare: A <C> entity’s change is compared to a certain point <Y>, typically <TIMEX3>.
- feature: An <F> entity describes a certain entity <Y>. An <F> is often presented as ‘significant’, ‘mild’, the size (of a tumor), etc.
- region: An object entity includes or contains another object entity (often <D> or <A>).
- value: The correspondence relation between <T/M-key> and <T/M-val>. In rare cases, however, other entities of the type <TIMEX3> or <D> may correspond to the value of an <X-key> entity.
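The type constraints in the definitions above can be made concrete with a small record type and a consistency check. This is a minimal sketch, not part of JaMIE: the names `Entity`, `Relation`, `REQUIRED_TYPE`, and `is_well_typed` are hypothetical, and the check only encodes the constraint that each definition states explicitly (for example, that ‘change’ involves a <C> entity). It deliberately ignores the rare exceptions mentioned for ‘value’.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class Entity:
    text: str
    etype: str  # e.g. "C", "A", "D", "F", "TIMEX3", "T-key", "T-val"


@dataclass(frozen=True)
class Relation:
    rel: str
    source: Entity
    target: Entity


# Entity types that at least one endpoint must have, as read from the
# definitions above. None means the definition fixes no specific type.
# This table is an illustrative assumption, not the official schema file.
REQUIRED_TYPE: Dict[str, Optional[Set[str]]] = {
    "change": {"C"},
    "compare": {"C"},
    "feature": {"F"},
    "region": None,
    "value": {"T-key", "M-key"},
}


def is_well_typed(relation: Relation) -> bool:
    """Return True if the relation name is known and its endpoints fit it."""
    if relation.rel not in REQUIRED_TYPE:
        return False
    required = REQUIRED_TYPE[relation.rel]
    if required is None:
        return True
    endpoint_types = {relation.source.etype, relation.target.etype}
    return bool(required & endpoint_types)


dilated = Entity("dilated", "C")
ducts = Entity("intrahepatic bile ducts", "A")
ok = is_well_typed(Relation("change", dilated, ducts))
bad = is_well_typed(Relation("feature", Entity("abnormalities", "D"), ducts))
```

Such a check could flag annotation slips early, before a relation reaches model training.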

2.2.2 Temporal Relations

Based on an existing medical temporal-relation annotation schema, THYME Bethard et al. (2017), we propose a simplified temporal-relation set below. Note that any temporal relation is defined in the form rel(<X>, <TIMEX3>), where <X> can also be another <TIMEX3> entity. Figure 2 visualizes a comparison of the proposed temporal relations.
- on: An <X> entity happens within the time span described by a <TIMEX3> entity.
- before: An <X> entity happens before the time span described by a <TIMEX3> entity.
- after: An <X> entity happens after the time span described by a <TIMEX3> entity.
- start: An <X> entity starts at the time span described by a <TIMEX3> entity.
- finish: An <X> entity finishes at the time span described by a <TIMEX3> entity.
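One way to read the five temporal relations is as a comparison between two date intervals. The sketch below is an illustrative interpretation only: in the corpus these labels are assigned by annotators from the text, the event dates are usually not explicit, and the function name `temporal_relation` and the inclusive-interval convention are assumptions.

```python
from datetime import date
from typing import Optional, Tuple

Span = Tuple[date, date]  # inclusive (start, end)


def temporal_relation(event: Span, timex: Span) -> Optional[str]:
    """Map an event interval and a TIMEX3 interval to one of the five labels."""
    e_start, e_end = event
    t_start, t_end = timex
    if e_end < t_start:
        return "before"
    if e_start > t_end:
        return "after"
    starts_inside = t_start <= e_start <= t_end
    ends_inside = t_start <= e_end <= t_end
    if starts_inside and ends_inside:
        return "on"
    if starts_inside:
        return "start"   # begins within the span and continues past it
    if ends_inside:
        return "finish"  # began earlier and ends within the span
    return None          # event covers the whole span: outside the five labels


april_span = (date(2020, 4, 11), date(2020, 4, 22))
clinic_visit = (date(2020, 4, 12), date(2020, 4, 12))
label = temporal_relation(clinic_visit, april_span)
```

The `None` case shows that the simplified label set does not cover every interval configuration, which is consistent with it being a reduced version of a richer schema.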
We show an XML-style radiography interpretation report example with the entity-level information and our relation annotation in Figure 1. The test ‘<T-test>CT scan</T-test>’ is executed ‘on’ the day ‘<TIMEX3>July 26, 2016</TIMEX3>’. A disease ‘<D>right pleural effusion</D>’ is observed in the ‘region’ of the anatomical entity ‘<A>the upper lobe of the lung</A>’. A ‘<F>new</F>’ disease ‘<D>nodules</D>’ is in the ‘region’ of ‘<A>the lung field</A>’. The ‘<brel>’ and ‘<trel>’ tags distinguish the medical relations and temporal relations. JaMIE supports this XML-style format for training models and for outputting system predictions. The complete Japanese annotation guideline is available (https://sociocom.naist.jp/real-mednlp/wp-content/uploads/sites/3/2021/07/PRISM_Annotation_Guidelines.pdf).
2.3 Annotation
In practice, we annotated two datasets: 1,000 radiography interpretation reports of LC and 156 medical history reports of IPF. We annotated all reports in two passes. One annotator conducted the first-pass relation annotation for a report. In the second pass, an expert examined the annotation and led the final adjudication by discussing inconsistencies with the first-pass annotator. This procedure balances quality and cost, since it does not rely on full expert annotation.
Table 2 shows the statistics of the relation annotation. Though the number of medical history reports is smaller, they usually contain more content per report and cover a wider range of entity types. Considering that the popular English 2010 i2b2/VA medical dataset contains 170 documents (3,106 relations) for training, our annotation scale is comparable to or even larger than it. The results show very different relation type distributions in the two types of reports. As the medical history reports of IPF can be viewed as a mixture of several types of reports such as radiography reports, examination reports, test results, etc., they show a more balanced coverage of relation types, while the radiography interpretation reports of LC are more narrowly distributed among the disease-relevant relation types such as ‘region’ and ‘feature’.
3 System Architecture of JaMIE

Figure 3 shows the overview of our Japanese medical IE system with a pipeline process of three components: medical entity recognition, modality classification, and relation extraction.
3.1 Sentence Encoder
Recent medical IE research Si et al. (2019); Alsentzer et al. (2019); Peng et al. (2019) suggests that contextual pre-trained models such as ELMo Peters et al. (2018) and BERT Devlin et al. (2019) markedly outperform traditional word embedding methods (e.g., word2vec, GloVe, and fastText). In our pipeline system, we adopt the Japanese pre-trained BERT (https://alaginrc.nict.go.jp/nict-bert/index.html) as the sentence encoder for retrieving token embeddings.
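Formally, a sentence $S = [w_1, \dots, w_n]$ is encoded by contextual BERT, or by word embeddings with a bidirectional Long Short-Term Memory (LSTM) Hochreiter and Schmidhuber (1997), into token embeddings $X = [x_1, \dots, x_n]$ as:
$$X = \mathrm{Encoder}(S)$$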
3.2 Medical Entity Recognition
Medical entity recognition (MER) aims to predict the token spans of entities and their types from the text.
We formulate MER as sequence tagging with BIO (begin, inside, outside) tags. The outputs are constrained by a conditional random field (CRF) Lafferty et al. (2001) layer. For a tag sequence $y = [y_1, \dots, y_n]$, the probability of $y$ given the encoded sentence $X$ is the softmax over all possible tag sequences $\tilde{y}$:
$$p(y \mid X) = \frac{\exp(s(X, y))}{\sum_{\tilde{y}} \exp(s(X, \tilde{y}))}$$
where the score function $s(X, y)$ represents the sum of the transition scores and tag probabilities along the sequence.
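The sequence-level softmax can be illustrated by brute force on a toy example. This is a didactic sketch under stated assumptions, not the JaMIE implementation: the function names are hypothetical, the emission and transition values are made up, and enumerating every tag sequence is only feasible for tiny inputs (practical CRF layers compute the normalizer with the forward algorithm).

```python
import itertools
import math


def sequence_score(emissions, transitions, tags):
    """Sum of per-token tag scores plus tag-to-tag transition scores."""
    score = sum(emissions[i][t] for i, t in enumerate(tags))
    score += sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return score


def sequence_probability(emissions, transitions, tags):
    """Softmax of the sequence score over all possible tag sequences."""
    n_tags = len(transitions)
    all_sequences = itertools.product(range(n_tags), repeat=len(emissions))
    partition = sum(
        math.exp(sequence_score(emissions, transitions, seq))
        for seq in all_sequences
    )
    return math.exp(sequence_score(emissions, transitions, tags)) / partition


# Toy tag set: 0 = B, 1 = I, 2 = O (hypothetical scores for three tokens).
emissions = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 2.0]]
# transitions[a][b]: score of moving from tag a to tag b; O -> I is penalized.
transitions = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, -10.0, 0.0]]

p_valid = sequence_probability(emissions, transitions, (0, 1, 2))    # B I O
p_invalid = sequence_probability(emissions, transitions, (2, 1, 2))  # O I O
```

The strongly negative O-to-I transition is what lets the CRF layer suppress ill-formed BIO sequences that a per-token classifier could emit.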
Table 2: Number of annotated relations in the 1,000 radiography interpretation reports (LC) and the 156 medical history reports (IPF).

Med REL | LC #Num | IPF #Num | Temp REL | LC #Num | IPF #Num |
---|---|---|---|---|---|
region | 6,794 | 631 | on | 696 | 1,583 |
change | 689 | 465 | start | 5 | 219 |
feature | 5,077 | 294 | finish | 2 | 43 |
value | 2 | 1,932 | after | 3 | 22 |
compare | 615 | 229 | before | 1 | 14 |
Total (medical + temporal) | 13,884 | 5,432 | | | |
3.3 Modality Classification
The modality classification (MC) component classifies the modality types of the given entities. For a multi-token entity predicted by the MER model, we represent the entity embedding as the element-wise sum of the embeddings in the entity span. To enrich the context for predicting the modality, we concatenate the entity embedding with the auxiliary entity type embedding. The $i$-th step modality prediction is:
$$\hat{m}_i = \mathrm{softmax}(W [e_i; t_i] + b)$$
where $e_i$ denotes the $i$-th entity embedding, $t_i$ denotes the entity type embedding predicted by the MER model, $[\cdot;\cdot]$ denotes concatenation, and $W$ and $b$ are the classifier parameters.
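The entity representation used for modality classification (element-wise sum over the span, concatenated with a type embedding) can be sketched in a few lines of plain Python. The linear-plus-softmax classifier, the function names, and all numbers below are assumptions made for illustration; the actual JaMIE layers may differ.

```python
import math


def element_sum(vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(column) for column in zip(*vectors)]


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_modality(token_embeddings, span, type_embedding, weight, bias):
    """Return a probability distribution over modality labels for one entity.

    span is (start, end) over token positions, with end exclusive.
    """
    start, end = span
    entity_embedding = element_sum(token_embeddings[start:end])
    features = entity_embedding + type_embedding  # list concatenation
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weight, bias)
    ]
    return softmax(logits)


# Toy inputs: three tokens with 2-d embeddings, a 1-d type embedding,
# and a 2-label classifier (all values hypothetical).
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
probs = predict_modality(
    tokens, span=(0, 2), type_embedding=[1.0],
    weight=[[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]], bias=[0.0, 0.0],
)
```

Summing rather than averaging keeps the representation sensitive to span length; whether that matters in practice is not something the text here settles.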
3.4 Relation Extraction
The relation extraction (RE) component predicts the relations, and their types, between pairs of named entities. We formulate relation extraction as multiple head selection Zhang et al. (2017) for each entity in the sentence. Given each entity $e_i$ in the sentence, the model predicts whether another entity $e_j$ is the head of this entity with a relation $r$. The probability of a relation is defined as $P(e_j, r \mid e_i) = \sigma(f(e_i, e_j, r))$, where $f$ denotes a single fully connected layer and $\sigma$ the sigmoid function. An additional ‘N’ relation represents no relation between two entities. The final representation of an entity is the concatenation of the embeddings of the entity, entity type, and modality type.
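Decoding in a head-selection formulation can be sketched as thresholding a score for every (entity, head, relation) combination. The sketch assumes an independent sigmoid per combination and a 0.5 threshold, which is one common reading of multiple head selection rather than something the text here confirms; `decode_relations` and the toy scores are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_relations(scores, relation_labels, threshold=0.5):
    """Turn raw scores into (entity index, relation, head index) triples.

    scores[i][j][k] is the score that entity j is a head of entity i
    with relation relation_labels[k]. The 'N' label means no relation.
    """
    triples = []
    for i, heads in enumerate(scores):
        for j, relation_scores in enumerate(heads):
            if i == j:
                continue  # an entity is not treated as its own head here
            for k, score in enumerate(relation_scores):
                label = relation_labels[k]
                if label != "N" and sigmoid(score) >= threshold:
                    triples.append((i, label, j))
    return triples


labels = ["N", "region", "on"]
# Two entities; only entity 0 selects entity 1 as a head, with 'region'.
toy_scores = [
    [[0.0, 0.0, 0.0], [-3.0, 4.0, -2.0]],
    [[5.0, -4.0, -4.0], [0.0, 0.0, 0.0]],
]
triples = decode_relations(toy_scores, labels)
```

Because each combination is scored independently, one entity can select several heads, which is what distinguishes this setup from single-head dependency parsing.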
4 Experiments
4.1 Settings
For each dataset, we conduct patient-level 5-fold cross-validation to evaluate the performance of our system. 10% of the training data is held out as the validation set. At each stage of the pipeline, the current component is trained with gold inputs. The Japanese text is segmented into tokens by the morphological analyzer MeCab Kudo et al. (2004). We adopt PyTorch Transformers (https://github.com/huggingface/transformers) to implement the system. The following hyper-parameters are empirically chosen: 10 fine-tuning epochs, a batch size of 16, and AdamW with a learning rate of 5e-5. The checkpoints that perform best on the validation set are used to produce the test results.
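A patient-level split keeps all reports of one patient in the same partition, so that no patient appears in both training and test data. The following is a minimal sketch of such a split with a 10% validation hold-out. The function name, the seed, and the choice to hold out validation data by patient rather than by report are assumptions, since the text does not specify them.

```python
import random
from collections import defaultdict


def patient_level_folds(report_patient_ids, n_folds=5, valid_ratio=0.1, seed=0):
    """Yield (train, valid, test) lists of report indices for each fold."""
    reports_by_patient = defaultdict(list)
    for report_index, patient_id in enumerate(report_patient_ids):
        reports_by_patient[patient_id].append(report_index)

    patients = sorted(reports_by_patient)
    random.Random(seed).shuffle(patients)

    def expand(patient_list):
        return [i for p in patient_list for i in reports_by_patient[p]]

    for k in range(n_folds):
        test_patients = patients[k::n_folds]
        held_out = set(test_patients)
        remaining = [p for p in patients if p not in held_out]
        n_valid = max(1, round(valid_ratio * len(remaining)))
        valid_patients = remaining[:n_valid]
        train_patients = remaining[n_valid:]
        yield expand(train_patients), expand(valid_patients), expand(test_patients)


# Toy data: 100 reports belonging to 20 hypothetical patients.
patient_ids = ["p%d" % (i % 20) for i in range(100)]
folds = list(patient_level_folds(patient_ids))
```

Splitting by report instead would let near-duplicate follow-up reports of the same patient leak across partitions and inflate the scores.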
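Table 3: System performance (F1) on the two report types.

Report Type | Encoder | MER F1 | MC F1 | RE F1 |
---|---|---|---|---|
Radiography Interpretation Reports (LC) | LSTM + word2vec | 93.63 | 93.01 | 77.88 |
Radiography Interpretation Reports (LC) | BERT | 95.65 | 94.10 | 86.53 |
Radiography Interpretation Reports (LC) | (Yada et al., 2020) | 95.30 | - | - |
Medical History Reports (IPF) | LSTM + word2vec | 82.73 | 75.26 | 60.42 |
Medical History Reports (IPF) | BERT | 85.49 | 78.06 | 71.04 |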
4.2 Evaluation
Instead of applying the usual pipeline evaluation with the gold inputs at each stage, we are more interested in the practical performance of the system and adopt the joint evaluation Zheng et al. (2017) as described in the following:
- Medical entity recognition identifies medical entities from raw reports. We evaluate each {entity, entity type} against the reference.
- Modality classification classifies the modality types of the entities identified by the former stage. The evaluation is on each {entity, entity type, modality type}.
- Relation extraction extracts the relations between the entities identified by the former stages. The evaluation is on each triplet {entity, relation, entity2}.
We measure the micro-F1 of the system prediction against the gold reference at each pipeline stage.
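Micro-F1 over predicted tuples can be computed by pooling true positives across all reports before taking precision and recall. The sketch below assumes that a prediction counts as correct only on an exact match of the whole tuple; the function name and the toy triplets are illustrative, not taken from the released toolkit.

```python
def micro_f1(predicted, gold):
    """Micro-averaged F1 over per-report sets of tuples.

    predicted and gold are parallel lists with one set of tuples per report.
    """
    true_positives = sum(len(p & g) for p, g in zip(predicted, gold))
    n_predicted = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold_triplets = [{
    ("nodules", "region", "the lung field"),
    ("new", "feature", "nodules"),
}]
predicted_triplets = [{
    ("nodules", "region", "the lung field"),
    ("new", "feature", "the lung field"),  # wrong second entity
}]
score = micro_f1(predicted_triplets, gold_triplets)
```

Under joint evaluation, an entity error made at the MER stage makes every relation that involves that entity incorrect, so RE scores are bounded by upstream accuracy.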
5 Experiment Results
5.1 Main Performance of JaMIE
Table 3 shows our system performance on two types of reports: radiography interpretation reports of LC and medical history reports of IPF. The performance on the radiography interpretation reports suggests that, by concentrating annotation efforts on a specific report type, the system achieves high F1 with sufficient training data. Compared to the 95.30 MER F1 reported by Yada et al. (2020), our MER score is higher by 0.35 with the additional CRF layer. The RE model obtains 86.53 F1 on the radiography interpretation reports and 71.04 F1 on the medical history reports.
As a baseline encoder, we use an LSTM on top of word2vec embeddings Mikolov et al. (2013) trained on Japanese Wikipedia (the LSTM and word2vec hidden sizes are both 256). We observe drops in all three tasks, especially in the final relation extraction. On the radiography interpretation reports of LC and the medical history reports of IPF, the BERT-based RE models lead ‘LSTM + word2vec’ by 8.65 and 10.62 F1 points, respectively. We suggest that solving relation extraction requires long-range information between entities. BERT naturally models such long-range dependencies with the self-attention mechanism, while word2vec is trained with a fixed local window and the LSTM can also struggle over long sequences.
Table 4: RE F1 of each relation type on the radiography interpretation reports (LC).

Med REL | RE F1 | Temp REL | RE F1 |
---|---|---|---|
region | 84.59 | on | 81.92 |
change | 76.23 | start | 20.00 |
feature | 90.16 | finish | - |
value | - | after | - |
compare | 80.86 | before | - |
Table 5: RE F1 of each relation type on the medical history reports (IPF).

Med REL | RE F1 | Temp REL | RE F1 |
---|---|---|---|
region | 71.73 | on | 70.48 |
change | 58.66 | start | 49.33 |
feature | 60.54 | finish | 12.02 |
value | 83.12 | after | - |
compare | 75.47 | before | 11.38 |
While the medical history reports contain a broader range of relation types and the data size is relatively small, the system still obtains satisfactory performance. In addition, we present the F1 of each relation type in Table 5. Except for the three rarely appearing relations, i.e. ‘finish’, ‘after’ and ‘before’, the F1 scores on the other types are balanced and match the statistics in Table 2. As for the radiography interpretation report results in Table 4, the major relations ‘region’ and ‘feature’ achieve high performance with 84.59 and 90.16 F1. The moderately frequent relations ‘change’, ‘compare’ and ‘on’ obtain a satisfactory 76.23 to 81.92 F1.
5.2 Correlation between Report Types and Demanding Annotation Efforts
Table 6: RE F1 with comparable training data sizes.

Report Type | RE F1 |
---|---|
Radiography Interpretation Reports | 86.53 |
- with 39% training data | 82.33 |
Medical History Reports | 71.04 |
One question is whether concentrating annotation efforts on a specific report type can quickly yield accuracy high enough to meet the requirements of practical applications. A valid approach is to compare the RE performance on the two report types with comparable annotation effort, i.e. training data size. The medical history reports of IPF contain 5,432 relations in total, which is approximately 39% of the 13,884 relations in the radiography interpretation reports. We designed the experiment by reducing the training set of the radiography interpretation reports to a comparable 39% of the original. The results in Table 6 show that even with a comparable training size, the specific radiography interpretation reports lead by 11.29 F1 points.
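The two figures used in this comparison follow directly from the relation totals in Table 2 and the RE F1 scores in Table 6, as the short check below shows. The variable names are illustrative only.

```python
# Relation totals from Table 2.
lc_relations = 13_884   # radiography interpretation reports (LC)
ipf_relations = 5_432   # medical history reports (IPF)

# Share of the LC annotation that matches the IPF annotation size.
size_ratio = ipf_relations / lc_relations
size_percent = round(size_ratio * 100)

# RE F1 scores from Table 6.
f1_lc_reduced = 82.33   # LC trained on 39% of its training data
f1_ipf = 71.04          # IPF trained on all of its training data

f1_gap = round(f1_lc_reduced - f1_ipf, 2)
```

Here `size_percent` evaluates to 39 and `f1_gap` to 11.29, matching the values quoted in the paragraph above.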
To be clear, the two results are still not exactly comparable because of the different relation distributions. However, the relations in the radiography interpretation reports are more concentrated in a few types such as ‘region’ and ‘feature’ (Table 2), which usually means that fewer reports are needed to achieve a similar overall accuracy compared to the medical history reports. In scenarios that demand high accuracy for practical medical applications, the results suggest that the annotation strategy of starting from a specific type of report and gradually increasing the coverage of report types is feasible.
6 System Application
6.1 User Interface
JaMIE provides an easy-to-use Command-Line Interface (CLI). We design our training/testing scripts similar to the official Transformers examples, in order to be friendly to the Transformers users. We demonstrate how to train/test a relation model with the following script:
6.2 Use Case
To annotate raw medical reports with our trained models, users need to download the models from the JaMIE GitHub repository beforehand. Users then execute the pipeline ‘test’ scripts to annotate entities, modalities, and relations step by step. At each stage, the model generates predictions that serve as the input of the next-stage model. The predictions are presented in the same XML-style format as shown in Figure 1.
Our medical IE annotation schema serves to encode a wide range of general medical information not limited to any specific disease, report types or languages. Users can manually annotate other types of medical reports by following our guideline. Users can apply the ‘train’ scripts to train the pipeline models on their newly annotated corpus for providing automatic annotation.
7 Conclusion
We propose a novel annotation schema for investigating medical and temporal relations between medical entities in Japanese medical reports. We empirically compare the annotation on two types of reports: specific radiography interpretation reports of LC and medical history reports of IPF. The system obtains overall satisfactory performance in three tasks, which supports three findings: good annotation quality, a feasible annotation strategy for targeting report types, and the superior performance of the contextual BERT encoder. The system code and the models trained on our annotation are open-access. In the future, we plan to continue focusing on LC and IPF, cover more specific report types involved with LC, and increase the annotation amount of medical history reports of IPF.
References
- Alsentzer et al. (2019) Emily Alsentzer, John R. Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew B. A. McDermott. 2019. Publicly available clinical BERT embeddings. CoRR, abs/1904.03323.
- Aramaki et al. (2014) Eiji Aramaki, Mizuki Morita, Yoshinobu Kano, and Tomoko Ohkuma. 2014. Overview of the NTCIR-11 MedNLP-2 task. In Proceedings of the 11th NTCIR Workshop Meeting on Evaluation of Information Access Technologies.
- Aramaki et al. (2016) Eiji Aramaki, Mizuki Morita, Yoshinobu Kano, and Tomoko Ohkuma. 2016. Overview of the NTCIR-12 MedNLPDoc task. In Proceedings of the 12th NTCIR Workshop Meeting on Evaluation of Information Access Technologies.
- Aramaki et al. (2018) Eiji Aramaki, Ken Yano, and Shoko Wakamiya. 2018. MedEx/J: A one-scan simple and fast NLP tool for Japanese clinical texts. In MEDINFO 2017: Precision Healthcare Through Informatics: Proceedings of the 16th World Congress on Medical and Health Informatics, volume 245, page 285. IOS Press.
- Aronson and Lang (2010) Alan R Aronson and François-Michel Lang. 2010. An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association, 17(3):229–236.
- Bethard et al. (2017) Steven Bethard, Guergana Savova, Martha Palmer, and James Pustejovsky. 2017. SemEval-2017 Task 12: Clinical TempEval. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 565–572, Stroudsburg, PA, USA. Association for Computational Linguistics.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
- Kudo et al. (2004) Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto. 2004. Applying conditional random fields to Japanese morphological analysis. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 230–237, Barcelona, Spain. Association for Computational Linguistics.
- Lafferty et al. (2001) John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, page 282–289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
- Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space.
- Morita et al. (2013) Mizuki Morita, Yoshinobu Kano, Tomoko Ohkuma, Mai Miyabe, and Eiji Aramaki. 2013. Overview of the NTCIR-10 MedNLP task. In Proceedings of NTCIR-10.
- Peng et al. (2019) Yifan Peng, Shankai Yan, and Zhiyong Lu. 2019. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. CoRR, abs/1906.05474.
- Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
- Savova et al. (2010) Guergana K Savova, James J Masanz, Philip V Ogren, Jiaping Zheng, Sunghwan Sohn, Karin C Kipper-Schuler, and Christopher G Chute. 2010. Mayo clinical text analysis and knowledge extraction system (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association, 17(5):507–513.
- Si et al. (2019) Yuqi Si, Jingqi Wang, Hua Xu, and Kirk Roberts. 2019. Enhancing clinical concept extraction with contextual embeddings. Journal of the American Medical Informatics Association, 26(11):1297–1304.
- Uzuner et al. (2011) Özlem Uzuner, Brett R South, Shuying Shen, and Scott L DuVall. 2011. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556.
- Yada et al. (2020) Shuntaro Yada, Ayami Joh, Ribeka Tanaka, Fei Cheng, Eiji Aramaki, and Sadao Kurohashi. 2020. Towards a versatile medical-annotation guideline feasible without heavy medical knowledge: Starting from critical lung diseases. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4565–4572, Marseille, France. European Language Resources Association.
- Zhang et al. (2017) Yue Zhang, Zhenghua Li, Jun Lang, Qingrong Xia, and Min Zhang. 2017. Dependency parsing with partial annotations: An empirical comparison. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 49–58, Taipei, Taiwan. Asian Federation of Natural Language Processing.
- Zheng et al. (2017) Suncong Zheng, Feng Wang, Hongyun Bao, Yuexing Hao, Peng Zhou, and Bo Xu. 2017. Joint extraction of entities and relations based on a novel tagging scheme. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1227–1236, Vancouver, Canada. Association for Computational Linguistics.