Automated Knee X-ray Report Generation
Abstract
Gathering manually annotated images for training a predictive model is far more challenging in the medical domain than for natural images, as it requires the expertise of qualified radiologists. We therefore propose to take advantage of past radiological exams (specifically, knee X-ray examinations) and formulate a framework that learns the correspondence between images and reports, and can hence generate diagnostic reports for a given X-ray examination consisting of an arbitrary number of image views. We demonstrate how aggregating the image features of each exam and using them as conditional inputs when training a language generation model yields auto-generated exam reports that correlate well with radiologist-generated reports.
1 Introduction
The burden on diagnostic radiologists has been increasing rapidly with the advancement of a variety of modern imaging modalities, which, while allowing for higher-resolution images in 3D and even 4D, dramatically increase the complexity of the diagnostic process. It has become common for radiologists to rely on various image analysis and automated decision support systems to facilitate the interpretation process. These computer-aided diagnosis (CAD) systems have made great advances with the help of machine learning algorithms and large amounts of clinical imaging data, and have been successfully tested in clinical settings [1, 2, 3, 4].
Many of these machine learning algorithms are supervised learning approaches that require large amounts of annotated image data for training. Gathering suitable data is especially challenging in the medical domain, as it is extremely time-consuming for radiologists to produce ground-truth annotations of the quality and volume required for training a predictive model. An alternative approach is to use past clinical images and corresponding radiological reports available through a hospital’s picture archiving and communication system (PACS); the advantage being that, although this data is largely unstructured (free text), it is available in high volumes and removes the need for manual annotation.
Hence, it is necessary to develop an appropriate learning framework: one that can take advantage of past radiological examinations and their corresponding textual reports, and formulate a method for prediction that best expedites the diagnostic process. An additional motivation for learning from clinical reports is that reports provide a much richer context for a pathology than a single class label, including its severity and location.
Here we present a model for captioning knee X-ray exams that is capable of generating radiological reports summarising present pathologies and the contexts within which they are presented. A key contribution is that the proposed model is able to handle an arbitrary number of input image views (which is typical in the context of knee X-ray exams) and is able to learn from free-text reports. The model is based on the neural image caption (NIC) model of [5], where a word-level language generation model is conditioned on image features. In our implementation, the X-ray image features for each input image are derived from a pre-trained classification network, and aggregated into a fixed-sized input to the language model.
2 Related Work
In the field of computer vision, the use of human-generated visually descriptive text to infer the contents of an image has primarily been applied to image caption generation. Earlier models relied on linking template-based language models to objects and spatial contexts in the image [6, 7, 8]. More recently, interest has moved to the combined potential of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) for describing images using natural language [9, 5, 10, 11, 12]. The advantage of using neural networks for caption generation is that the model is not constrained by hard-coded language templates and is able to learn more freely from the training data.
By contrast, using clinical reports to perform automated pathology detection in medical images has only been explored in a few studies [13, 14, 15, 16]. Moreover, these models are limited to learning from carefully curated reports, either medical subject heading annotations (MeSH® [17]) [13, 14] or templated reports created specifically for training [15, 16], both of which are very time-consuming for radiologists/pathologists to produce.
To our knowledge, this is the first study exploring the use of past medical examinations and their corresponding raw reports in order to develop a predictive model for an arbitrary number of input radiological images.
3 Dataset
The knee X-ray dataset was extracted from the PACS of St Thomas’ Hospital (part of Guy’s and St Thomas’ NHS Foundation Trust) and has been fully anonymised to remove sensitive patient information. It comprises a total of 330 knee X-ray exams collected over the years 2015 and 2016. Each exam consists of a textual report and one or more X-ray images (left/right knees or both, taken from different views: anteroposterior (AP), lateral (L) and skyline (S), and different positions: weight-bearing (WB) and non-weight-bearing (nonWB)). The most common exam consists of both AP and L views of the left and right knees separately, making up 42% of all exams. The reports vary in length between 2 and 145 words, with an average of 30 and standard deviation of 18.7, and between 1 and 16 sentences, with an average of 2.7 per report. The X-ray images vary in size.
Prior to using the raw reports for training an image captioning model, they were cleaned to remove phrases not directly describing the images, for instance, instructions to compare to past exams (which are not available to us due to the anonymisation process). These phrases do not allow any inferences about current images, so are not useful for finding image-text correspondence. In addition, common ‘stopwords’ (such as ‘and’, ‘the’) were removed, and all punctuation was replaced with either commas or full stops where appropriate. All report cleaning was done using regular expressions. Exams were randomly split into 80%-10%-10% for training, testing and validation respectively.
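To make the preprocessing concrete, the following Python sketch applies the same kinds of operations; the regular expression, stopword list and example report are illustrative stand-ins rather than the exact patterns used in this work.

```python
import re

# Illustrative cleaning step: drop comparison phrases, normalise punctuation
# and remove common stopwords. Patterns and stopwords are examples only.
STOPWORDS = {"and", "the", "a", "an", "of", "is", "are"}
COMPARISON_PATTERN = re.compile(r"compar\w*\s+(?:to|with)\s+previous[^.]*\.", re.IGNORECASE)

def clean_report(text: str) -> str:
    text = COMPARISON_PATTERN.sub("", text)      # remove comparisons to past exams
    text = re.sub(r"[;:]", ",", text)            # map other punctuation to commas
    text = re.sub(r"[^\w\s.,]", "", text)        # strip remaining punctuation
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return " ".join(tokens)

print(clean_report("The joint spaces are preserved; compare to previous exam dated 2014."))
```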
4 Medical Image Report Generation Model

4.1 Image Modelling
We adopt the GoogLeNet [18] CNN architecture, pre-trained on the ImageNet dataset [19], and extract image features for each X-ray image view, $I_1, \ldots, I_N$, from the last spatial average pooling layer. The maximum value of each feature is aggregated across the exam images to create a fixed-size (1024-dimensional) input, which is then passed through a fully connected layer to reduce its dimension to the RNN input size.
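The following sketch illustrates this aggregation using a torchvision GoogLeNet backbone; the 224×224 input size and the 512-dimensional projection are illustrative assumptions rather than the exact values used here.

```python
import torch
import torchvision

# Sketch of per-view feature extraction and element-wise max aggregation.
backbone = torchvision.models.googlenet(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()          # expose the 1024-d pooled features
backbone.eval()

project = torch.nn.Linear(1024, 512)       # assumed RNN input size of 512

def exam_features(images: torch.Tensor) -> torch.Tensor:
    """images: (num_views, 3, 224, 224) tensor holding one exam's X-ray views."""
    with torch.no_grad():
        per_view = backbone(images)        # (num_views, 1024)
    pooled, _ = per_view.max(dim=0)        # element-wise max across views
    return project(pooled)                 # (512,) conditional input to the LSTM

views = torch.randn(3, 3, 224, 224)        # e.g. AP, lateral and skyline views
print(exam_features(views).shape)          # torch.Size([512])
```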
4.2 Language Modelling
We use a recurrent neural network, specifically the Long Short-Term Memory (LSTM) implementation proposed in [20], to model the report sequence. LSTM models are widely used in machine translation [21, 22, 23] and natural image and video captioning [9, 24, 12] due to their ability to capture long-term dependencies and to reduce the problem of vanishing gradients in vanilla RNNs. Each LSTM unit has three sigmoid gates to control the internal state: ‘input’, ‘output’ and ‘forget’. At each time step, the gates control how much of the previous time steps is propagated through to determine the output. For an input word sequence $(x_1, \ldots, x_T)$, the internal hidden state $h_t$ and memory state $c_t$ are updated as follows:

$$
\begin{aligned}
i_t &= \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i), \\
f_t &= \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f), \\
o_t &= \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o), \\
g_t &= \tanh(W_{gx} x_t + W_{gh} h_{t-1} + b_g), \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t, \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}
$$

where $x_t$ is the word embedding at time step $t$, the $W$ and $b$ terms are the trainable weight and bias parameters, and $i_t$, $o_t$ and $f_t$ are the input, output and forget gates respectively.
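The sketch below implements a single LSTM update step corresponding to these equations; the input and hidden dimensions are illustrative.

```python
import torch

# One LSTM step: gate activations, memory update and hidden-state update.
def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """W_x: (4H, D), W_h: (4H, H), b: (4H,); rows ordered as input/forget/output/candidate."""
    gates = W_x @ x_t + W_h @ h_prev + b
    i, f, o, g = gates.chunk(4)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)                      # candidate memory
    c_t = f * c_prev + i * g               # memory state update
    h_t = o * torch.tanh(c_t)              # hidden state update
    return h_t, c_t

D, H = 8, 16                               # illustrative embedding and hidden sizes
x, h, c = torch.randn(D), torch.zeros(H), torch.zeros(H)
h, c = lstm_step(x, h, c, torch.randn(4 * H, D), torch.randn(4 * H, H), torch.zeros(4 * H))
```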
In order for the sequence generation model to be conditioned on the set of input images, the combined image features for each exam are input at the first time step, and the words of the report are input at the subsequent time steps. The complete network, including a bounding box (BBox) regressor to extract the knee joint, is illustrated in Figure 1.
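The sketch below illustrates this conditioning scheme: the projected exam features occupy the first time step and the embedded report words follow. The module name and the vocabulary, embedding and hidden sizes are illustrative.

```python
import torch
import torch.nn as nn

# Hypothetical decoder: image features first, then embedded report words.
class ReportDecoder(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, word_ids):
        # image_feats: (B, embed_dim); word_ids: (B, T)
        words = self.embed(word_ids)                          # (B, T, embed_dim)
        inputs = torch.cat([image_feats.unsqueeze(1), words], dim=1)
        hidden, _ = self.lstm(inputs)                         # (B, T+1, hidden_dim)
        return self.out(hidden)                               # scores over the vocabulary

decoder = ReportDecoder()
scores = decoder(torch.randn(2, 512), torch.randint(0, 1000, (2, 32)))
print(scores.shape)                                           # torch.Size([2, 33, 1000])
```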
5 Training
For training the report generation model, the LSTM is unrolled for 33 time steps (1 for the image features, 1 each for the start and end tokens, and 30 for the average number of words in a report). The images were rescaled and padded (preserving the aspect ratio) to a fixed input size. Words were one-hot encoded, and words occurring fewer than 5 times were removed from the vocabulary.
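The vocabulary step can be sketched as follows; the special-token names and truncation behaviour are illustrative assumptions.

```python
from collections import Counter

# Illustrative vocabulary construction: words occurring fewer than 5 times in
# the training reports are dropped; rare words map to an assumed <unk> token.
def build_vocab(reports, min_freq=5):
    counts = Counter(word for report in reports for word in report.split())
    kept = sorted(w for w, c in counts.items() if c >= min_freq)
    vocab = {"<start>": 0, "<end>": 1, "<unk>": 2}
    vocab.update({w: i + 3 for i, w in enumerate(kept)})
    return vocab

def encode(report, vocab, max_words=30):
    words = report.split()[:max_words]                     # truncate long reports
    ids = [vocab.get(w, vocab["<unk>"]) for w in words]
    return [vocab["<start>"]] + ids + [vocab["<end>"]]     # image features fill the remaining step
```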
To improve the X-ray image features, a simple CNN BBox regressor was built and trained on a subset of the training images with ground-truth BBoxes manually created for this purpose (231 image–BBox pairs). In addition, the training set was augmented eight-fold by randomly cropping the original images (in the case of the BBox images, small random translations, between -5 and 5 in x and y, were applied to the BBox of each image), flipping the images along the vertical axis, and shuffling the sentences in the reports.
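A simplified sketch of this augmentation is shown below; the crop size and the use of a random (rather than exhaustive eight-fold) combination of transforms are illustrative simplifications.

```python
import random
import numpy as np

# Illustrative augmentation: random crop, optional flip about the vertical axis,
# and shuffling of the report sentences.
def augment(image: np.ndarray, report: str, crop=(224, 224)):
    h, w = image.shape[:2]
    ch, cw = crop
    top = random.randint(0, max(h - ch, 0))
    left = random.randint(0, max(w - cw, 0))
    patch = image[top:top + ch, left:left + cw]          # random crop
    if random.random() < 0.5:
        patch = patch[:, ::-1]                           # flip along the vertical axis
    sentences = [s.strip() for s in report.split(".") if s.strip()]
    random.shuffle(sentences)                            # shuffle report sentences
    return patch, ". ".join(sentences) + "."
```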
We then trained the LSTM model for report generation, conditioned on the combined image features, by minimising the negative log-likelihood of the true sequence given the output sequence:

$$
L(I, S) = -\sum_{t=1}^{N} \log p_t\big(S_t \mid \mathrm{CNN}(I), S_0, \ldots, S_{t-1}\big),
$$

where $p_t(S_t \mid \cdot)$ is the probability that the predicted word equals the true word $S_t$ at time step $t$ given the image features $\mathrm{CNN}(I)$ and the previous words $S_0, \ldots, S_{t-1}$, and $N$ is the LSTM sequence length. At training time, the loss is minimised over the training set using mini-batch stochastic gradient descent (batch size 20), with parameters updated using the Adam [25] optimiser.
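The loss can be computed directly from the LSTM output scores, as in the sketch below; the batch, sequence and vocabulary sizes are illustrative.

```python
import torch
import torch.nn.functional as F

# Summed negative log-likelihood of the true words given the decoder scores.
# Shapes are illustrative: batch of 20, 33 unrolled steps, vocabulary of 1000.
scores = torch.randn(20, 33, 1000, requires_grad=True)    # decoder outputs
targets = torch.randint(0, 1000, (20, 33))                # true word indices
log_probs = F.log_softmax(scores, dim=-1)
loss = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1).sum(dim=1).mean()
loss.backward()                                           # gradients for the Adam updates
```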
6 Evaluation
Reports are generated for the test set by max-aggregating the image features of an exam, feeding them as the first input to the LSTM, and sampling subsequent words. The quality of the generated reports was evaluated by measuring BLEU [26] and METEOR [27] scores averaged over all reports; these are n-gram-based metrics commonly used for evaluating image captioning and machine translation, as they maintain high correlation with human judgement. As a comparison, we trained the model on single image inputs, duplicating the report for each image in an exam. We also evaluated whether performance improved if we narrowed the input to the CNN to a BBox surrounding the knee joint.
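For reference, BLEU-1 between a true and a generated sentence can be computed with NLTK as follows; the sentences are adapted from Figure 2, and the simplified tokenisation means the value will not exactly reproduce the figure's scores.

```python
from nltk.translate.bleu_score import sentence_bleu

# BLEU-1 between a reference report and a generated report (tokenised by
# whitespace after lowercasing and dropping punctuation).
reference = "joint spaces articular surfaces appear preserved".split()
candidate = "joint spaces articular surfaces appear preserved bilaterally".split()
bleu1 = sentence_bleu([reference], candidate, weights=(1.0,))
print(round(100 * bleu1, 1))   # ~85.7 with this simplified tokenisation
```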
The report generation model performed better when conditioned on the combined image features than on single images (see Table 1), suggesting that all images of an exam contribute to the pathology assessment. This is in line with expectations, since pathologies may only be present in one leg, or may only be visible from a particular view. Contrary to expectation, bounding box extraction did not improve report generation performance; it may instead have removed pertinent image features. A sample exam with the original report and two auto-generated reports of contrasting BLEU-1 scores is presented in Figure 2.
Table 1: BLEU and METEOR scores on the training (tr) and test (te) sets.

| Model | BLEU-1 (tr / te) | BLEU-2 (tr / te) | BLEU-3 (tr / te) | BLEU-4 (tr / te) | METEOR (tr / te) |
|---|---|---|---|---|---|
| Baseline, single image input | 42.2 / 33.2 | 13.3 / 5.7 | 3.8 / 1.9 | 1.3 / 1.1 | 26.7 / 22.2 |
| Max-aggr. of image features | 60.7 / 40.4 | 32.6 / 10.1 | 19.4 / 2.6 | 12.3 / 1.2 | 41.1 / 35.7 |
| Max-aggr. + BBoxes | 38.9 / 37.4 | 11.3 / 7.1 | 3.4 / 1.1 | 1.2 / 0.2 | 28.3 / 28.9 |
Figure 2: Sample exam with the original report and two auto-generated reports of contrasting BLEU-1 scores. Original: Joint spaces articular surfaces appear preserved. Significant degenerative erosive change seen. Good prediction (B1=87.5): Joint spaces articular surfaces appear preserved bilaterally. Poor prediction (B1=28.6): Joint space narrowing medial compartments bilaterally.
7 Conclusion
We demonstrate how past knee X-ray exams, consisting of sets of images and corresponding radiological reports, can be used as part of a learning framework to determine image–text correspondence and hence automate the generation of such reports for new X-ray exams. Preliminary results look promising, as the auto-generated reports correlate well with the true reports, and we hope to train the model on additional knee X-ray exams as these become available to us. Further developments to the model could include incorporating knowledge of the view type of each image by keeping the views as separate inputs, and finding correspondence between each image and parts of the text (for instance, through the use of an attention mechanism [28]). We will explore this in a future study.
References
- Freer and Ulissey [2001] Timothy W Freer and Michael J Ulissey. Screening mammography with computer-aided detection: Prospective study of 12,860 patients in a community breast center. Radiology, 220(3):781–786, 2001.
- Ramírez et al. [2013] Javier Ramírez, JM Górriz, Diego Salas-Gonzalez, A Romero, Míriam López, Ignacio Álvarez, and Manuel Gómez-Río. Computer-aided diagnosis of Alzheimer’s type dementia combining support vector machines and discriminant set of features. Information Sciences, 237:59–72, 2013.
- Litjens et al. [2015] Geert J S Litjens, Jelle O. Barentsz, Nico Karssemeijer, and Henkjan J. Huisman. Clinical evaluation of a computer-aided diagnosis system for determining cancer aggressiveness in prostate MRI. European Radiology, 25(11):3187–3199, 2015.
- Kobayashi et al. [2016] Hajime Kobayashi, Masaki Ohkubo, Akihiro Narita, Janaka C Marasinghe, Kohei Murao, Toru Matsumoto, Shusuke Sone, and Shinichi Wada. A method for evaluating the performance of computer-aided detection of pulmonary nodules in lung cancer ct screening: detection limit for nodule size and density. The British Journal of Radiology, 90(0):20160313, 2016.
- Vinyals et al. [2015] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, pages 3156–3164, 2015.
- Farhadi et al. [2010] Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, pages 15–29. Springer, 2010.
- Kulkarni et al. [2013] Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, and Tamara L Berg. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2891–2903, 2013.
- Fidler et al. [2013] Sanja Fidler, Abhishek Sharma, and Raquel Urtasun. A sentence is worth a thousand pixels. In CVPR, pages 1995–2002, 2013.
- Kiros et al. [2014] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
- Karpathy and Li [2015] Andrej Karpathy and Fei-fei Li. Deep visual-semantic alignments for generating image descriptions. In CVPR, pages 3128–3137, 2015.
- Chen and Lawrence Zitnick [2015] Xinlei Chen and C Lawrence Zitnick. Mind’s eye: A recurrent visual representation for image caption generation. In CVPR, pages 2422–2431, 2015.
- Xu et al. [2015a] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C Courville, Ruslan Salakhutdinov, Richard S Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, volume 14, pages 77–81, 2015a.
- Shin et al. [2016a] Hoo-Chang Shin, Le Lu, Lauren Kim, Ari Seff, Jianhua Yao, and Ronald M Summers. Interleaved text/image deep mining on a large-scale radiology database for automated image interpretation. Journal of Machine Learning Research (JMLR), 17:1–31, 2016a.
- Shin et al. [2016b] Hoo-chang Shin, Kirk Roberts, Le Lu, Dina Demner-Fushman, Jianhua Yao, and Ronald M Summers. Learning to read chest x-rays: Recurrent neural cascade model for automated image annotation. In CVPR, pages 2497–2506, 2016b.
- Zhang et al. [2017a] Zizhao Zhang, Yuanpu Xie, Fuyong Xing, Mason Mcgough, and Lin Yang. Mdnet: A semantically and visually interpretable medical image diagnosis network. In CVPR, July 2017a.
- Zhang et al. [2017b] Zizhao Zhang, Pingjun Chen, Manish Sapkota, and Lin Yang. Tandemnet: Distilling knowledge from medical images using diagnostic reports as optional semantic references. In MICCAI, 2017b.
- Lipscomb [2000] Carolyn E Lipscomb. Medical subject headings (MeSH). Bulletin of the Medical Library Association, 88(3):265, 2000.
- Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
- Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
- Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Sutskever et al. [2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112, 2014.
- Cho et al. [2014] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
- Luong et al. [2015] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
- Donahue et al. [2015] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, pages 2625–2634, 2015.
- Kingma and Ba [2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Papineni et al. [2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002.
- Banerjee and Lavie [2005] Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, volume 29, pages 65–72, 2005.
- Xu et al. [2015b] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, pages 2048–2057, 2015b.