
Sepsis Prediction with
Temporal Convolutional Networks

Xing Wang
Splunk Inc.
xingw@splunk.com

Yuntian He
Georgia Institute of Technology
yhe353@gatech.edu

Abstract

We design and implement a temporal convolutional network (TCN) model to predict sepsis onset. The model is trained on data extracted from the MIMIC-III database, based on a retrospective analysis of patients admitted to the intensive care unit who did not meet the definition of sepsis at the time of admission. Benchmarked against several machine learning models, our model is superior on this binary classification task. The results demonstrate the predictive power of convolutional networks for temporal patterns and show the significant impact of a longer look-back time on sepsis prediction.

1 Introduction

Sepsis is one of the leading causes of mortality worldwide, and its incidence and mortality rates have failed to decrease over the last few decades. Early recognition and early aggressive treatment may improve outcomes in sepsis: early and accurate onset predictions allow more aggressive and focused treatment, which can reduce the incidence of multi-organ dysfunction and significantly decrease the in-hospital mortality rate [1]. Due to the complexity of the disease in the clinical context, early recognition can be difficult [2, 3]. Existing detection methods cannot provide both high performance and real-time laboratory test results, which leads to delays in treatment. With the development of machine learning as a tool to analyze large amounts of data, prediction models can be built to diagnose sepsis ahead of time, which may assist clinicians in decreasing mortality.

Recurrent neural networks (RNNs) and their variants within the sequence to sequence (Seq2Seq) framework [4] have achieved great success in many sequential modeling tasks, such as machine translation [5], speech recognition [6], natural language processing [7], and extensions to autoregressive time series forecasting [8, 9] in recent years. There has also been work on applying RNNs to sepsis prediction [10]. However, RNNs suffer from several major challenges. Due to their inherent temporal nature (i.e., the hidden state is propagated through time), training cannot be parallelized across time steps. Moreover, when trained with backpropagation through time (BPTT) [11], RNNs severely suffer from vanishing gradients and thus often cannot capture long time dependencies [12]. More elaborate RNN architectures use gating mechanisms to alleviate the vanishing gradient problem; the long short-term memory (LSTM) [13] and its simplified variant, the gated recurrent unit (GRU) [14], are the two canonical architectures commonly used in practice. On the other hand, convolutional neural networks (CNNs) [15] can be easily parallelized, and recent advances effectively eliminate the vanishing gradient issue and hence help build very deep CNNs. These works include the residual network (ResNet) [16] and its variants such as the highway network [17], DenseNet [18], etc. In the area of sequential modeling, 1D convolutional networks have offered an alternative to RNNs for decades [19]. In recent years, Oord et al. [20] proposed WaveNet, a dilated causal convolutional network used as an autoregressive generative model. Since then, multiple research efforts have shown that with a few modifications, certain convolutional architectures achieve state-of-the-art performance in various fields [20, 21, 22]. In particular, Bai et al. [23] abandoned the gating mechanism in WaveNet and proposed the temporal convolutional network (TCN). The authors benchmarked TCN against LSTM and GRU on several sequence modeling problems and demonstrated that TCN exhibits substantially longer memory and achieves better performance. In this paper, we build a WaveNet-style TCN for learning temporal information from historical data, with an application to sepsis prediction.

The rest of this report is organized as follows. Section 2 formulates the problem and introduces the collection and preprocessing of the data. Section 3 illustrates the building blocks as well as the implementation and learning details of the TCN model. We benchmark our TCN against several other models; the evaluation results are presented in Section 4. Finally, we conclude our findings and discuss directions for future work in Section 5.

2 Problem Formulation and Data

We aim at predicting sepsis onset. As our goal is to distinguish patients who developed sepsis at any point in time during their stay in the intensive care unit from those who did not, two classes are defined: the sepsis class and the non-sepsis class. The point in time from which sepsis is to be predicted is determined by the difference between the sepsis onset time and the prediction time.

2.1 Data Collection and Inclusion Criteria

The MIMIC-III database was recorded between 2001 and 2012 at the Beth Israel Deaconess Medical Center in Boston, Massachusetts. We use the most recent version (v1.4) for this work. The database contains 58,976 admissions of 46,520 patients.

After extracting the data needed for this paper, we apply the tool from MIT-LCP to generate the pivoted SOFA table, which provides the sequential organ failure assessment scores for patients in the ICU. The SOFA score [24] is used to track a patient's status during a stay in an intensive care unit (ICU) and to determine the extent of organ function or rate of failure. It is based on six sub-scores, one each for the respiratory, cardiovascular, hepatic, coagulation, renal, and neurological systems, and is calculated every hour during the patient's ICU stay. Meanwhile, the tool from [25] was applied to obtain the suspected infection time and the sepsis cohort. The parameters used by the query are listed in Table 1.

2.2 Data pre-processing

From the suspected infection time, the sepsis cohort, and the pivoted SOFA table, we apply our sepsis-labeling criteria. We observe the SOFA score from 2 days before to 1 day after the suspected infection time; if the score rises by 2 or more points within this window, the patient is labeled as sepsis onset. The window is additionally required to lie between the ICU admission and discharge times.
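To make the labeling rule concrete, the following is a minimal sketch of how it could be applied per ICU stay, assuming the hourly SOFA scores are available as a pandas DataFrame; the column names and the choice of measuring the increase against the first score in the window are illustrative assumptions rather than the exact implementation used here.

```python
import pandas as pd

def label_sepsis_onset(sofa: pd.DataFrame, suspicion_time: pd.Timestamp,
                       icu_in: pd.Timestamp, icu_out: pd.Timestamp) -> bool:
    """Label a stay as sepsis onset if the hourly SOFA score rises by >= 2
    points within [suspicion_time - 48h, suspicion_time + 24h], clipped to
    the ICU stay. Column names ('charttime', 'sofa') are assumptions."""
    win_start = max(suspicion_time - pd.Timedelta(hours=48), icu_in)
    win_end = min(suspicion_time + pd.Timedelta(hours=24), icu_out)
    window = sofa[(sofa["charttime"] >= win_start) & (sofa["charttime"] <= win_end)]
    if window.empty:
        return False
    # Increase measured relative to the first (baseline) score in the window.
    baseline = window["sofa"].iloc[0]
    return bool((window["sofa"] - baseline).max() >= 2)
```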

Figure 1: Observation window and prediction window.

The look-back window is the sequence of values used to predict whether a sepsis onset will occur, i.e., whether a given look-back window is classified as belonging to the sepsis class or the non-sepsis class. We use a look-back time of 13 hours and a prediction window of 2 hours. The length of the look-back time has a significant impact on the performance of the classifier. Jozwiak et al. [26] indicate that the earlier treatment is started, the lower the mortality; compliance with early resuscitation treatments, within the first 3 hours, has been shown to be associated with a lower probability of being eligible for later resuscitation and maintenance bundle elements.
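As a rough illustration of how such look-back windows could be assembled, the sketch below slices an hourly feature matrix into 13-hour windows with a 2-hour prediction horizon; the exact alignment between window end, horizon, and the onset label is an assumption made for illustration, not the precise procedure used in this work.

```python
from typing import Optional

import numpy as np

def make_windows(vitals: np.ndarray, onset_hour: Optional[int],
                 look_back: int = 13, horizon: int = 2):
    """Slice an hourly (T, n_features) matrix into fixed-length look-back windows.

    A window is labeled 1 (sepsis class) if onset occurs `horizon` hours after
    the window ends, and 0 otherwise."""
    X, y = [], []
    T = vitals.shape[0]
    for end in range(look_back, T - horizon + 1):
        X.append(vitals[end - look_back:end])
        y.append(int(onset_hour is not None and onset_hour == end + horizon - 1))
    return np.stack(X), np.array(y)

# e.g. a 48-hour stay with 9 features and onset at hour 30:
X, y = make_windows(np.random.randn(48, 9), onset_hour=30)  # X: (34, 13, 9)
```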

Parameters ITEMIDs
patient age ≥ 18
systolic blood pressure 220050, 225309, 220179, 51, 455, 6701
diastolic blood pressure 220051, 225309, 225310, 8386, 8441, 8555
blood oxygen saturation (SO_2) 220227, 220277, 834, 646
temperature 223762, 223761, 676, 678
heart rate 220045, 211
respiratory rate 220210, 224422, 224689, 224690, 618, 651, 615, 614
CO_2 partial pressure (PaCO_2) 220235, 778
Table 1: Extracted parameters from the MIMIC III database. The ITEMID represents the identification number of a measurement in the MIMIC III database.

As a tentative illustration, Figure 2 shows the ten most important features for predicting sepsis onset, according to the decision tree model we trained for benchmarking.

Figure 2: Feature importance plot of the decision tree model.
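For reference, a plot like Figure 2 can be produced directly from a fitted scikit-learn tree via its `feature_importances_` attribute; the snippet below uses synthetic stand-in data, since the actual extracted features and fitted model are not reproduced here.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in the paper the features are the extracted vitals.
rng = np.random.default_rng(0)
feature_names = [f"feature_{i}" for i in range(10)]
X, y = rng.normal(size=(500, 10)), rng.integers(0, 2, size=500)

clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
order = np.argsort(clf.feature_importances_)[::-1][:10]   # top-10 features
plt.barh([feature_names[i] for i in order][::-1],
         clf.feature_importances_[order][::-1])
plt.xlabel("importance")
plt.tight_layout()
plt.show()
```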

3 Approach and Implementation

3.1 TCN Architecture

As mentioned above, we use a temporal CNN in place of the recurrent layers usually applied as the temporal module that learns from historical data. CNNs have shown success in time series applications, where a 1D convolution is simply a sliding dot product between the input vector and the kernel vector. We make several modifications to the traditional 1D convolution following recent advances. The building blocks of our temporal CNN are illustrated in the following subsections.

3.1.1 Causal Convolutions

In a traditional 1D convolutional layer, the filters are slid across the input series, so each output is connected to inputs both before and after it. As shown in Figure 3(a), by applying a filter of width 2 without padding, the predicted outputs \hat{x}_1, \cdots, \hat{x}_T are generated from the input series x_1, \cdots, x_T. The most severe problem with this structure is that the future is used to predict the past, e.g., x_2 is used to generate \hat{x}_1, which is not appropriate in time series analysis. To avoid this issue, causal convolutions are used, in which the output at time t is convolved only with inputs from the previous layer at time t and earlier. We achieve this by explicit zero padding of length (kernel\_size - 1) at the beginning of the input series; as a result, the outputs are shifted by a number of time steps. In this way, the prediction at time t is only allowed to connect to historical information, i.e., it has a causal structure, so the future cannot affect the past and information leakage is avoided. The resulting causal convolution is visualized in Figure 3(b).

Figure 3: Visualization of a stack of 1D convolutional layers: (a) standard (non-causal) vs. (b) causal.
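A minimal PyTorch sketch of this left-padding construction is shown below; the class name and defaults are ours for illustration, not the exact layer used in our implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution that only sees the current and past time steps."""
    def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
        super().__init__()
        # Pad only on the left so the output at time t never sees inputs after t.
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              dilation=dilation)

    def forward(self, x):                     # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))      # zero padding of length k - 1 (dilated)
        return self.conv(x)

# Output length equals input length, e.g.:
y = CausalConv1d(8, 16, kernel_size=2)(torch.randn(4, 8, 13))  # -> (4, 16, 13)
```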

3.1.2 Dilated Convolutions

Time series often exhibit long-term autoregressive dependencies, so we require the receptive field of the output neuron to be large. That is, the output neuron should be connected to neurons that receive input data from many time steps in the past. A major disadvantage of the basic causal convolution above is that, to obtain a large receptive field, either very wide filters are required, or many layers must be stacked. With the former, the merit of the CNN architecture is lost; with the latter, the model can become computationally intractable. Following Oord et al. [20], we instead adopt dilated convolutions, defined as

F(s) = (\mathbf{x} \ast_d f)(s) = \sum_{i=0}^{k-1} f(i) \cdot \mathbf{x}_{s - d \times i},

where \mathbf{x} \in \mathbb{R}^T is a 1-D input series, f:\{0,\cdots,k-1\}\to\mathbb{R} is a filter of size k, d is the dilation rate, and (s - d \times i) accounts for the direction of the past. In a dilated convolutional layer, filters are not convolved with the inputs in a simple sequential manner, but instead skip a fixed number (d) of inputs in between. By increasing the dilation rate multiplicatively with the layer depth (e.g., a common choice is d = 2^j at depth j), the receptive field grows exponentially, i.e., 2^{l-1} k inputs in the first layer can affect the output in the l-th hidden layer. Figure 4 compares non-dilated and dilated causal convolutional layers.

Figure 4: Visualization of a stack of causal convolutional layers: (a) non-dilated vs. (b) dilated.
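As a quick numerical check of this exponential growth, the helper below computes the receptive field of a stack of causal convolutions whose dilation doubles at every layer; it is a sketch under the assumption d = 2^j at depth j, counting layers from zero.

```python
def receptive_field(kernel_size: int, n_layers: int) -> int:
    """Number of past inputs visible to one output of a stack of dilated
    causal convolutions with dilation d = 2**j at layer j (j = 0, 1, ...)."""
    rf = 1
    for j in range(n_layers):
        rf += (kernel_size - 1) * 2 ** j
    return rf

# With kernel width 2 the receptive field doubles with every added layer:
print([receptive_field(2, l) for l in (1, 2, 4, 8)])  # [2, 4, 16, 256]
```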

3.1.3 Residual Connections

In traditional neural networks, each layer feeds only into the next. In a network with residual blocks, skip connections allow a layer to also short-cut over several others. The residual network (ResNet) [16] has proven very successful and has become the standard way of building deep CNNs. The core idea of ResNet is the shortcut connection, which skips one or more layers and connects directly to later layers (the so-called identity mapping), in addition to the standard stacked-layer mapping \mathcal{F}.

Figure 5: Comparison between (a) a standard block and (b) a residual block. In the latter, the convolution is short-circuited.

Figure 5 illustrates a residual block, the basic unit of ResNet. A residual block consists of the two branches described above, and its output is g(\mathcal{F}(x) + x), where x denotes the input to the residual block and g is the activation function. By reusing activations from a previous layer until the adjacent layer has learned its weights, deep CNNs effectively avoid the problem of vanishing gradients.
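A hedged PyTorch sketch of such a residual block, built from two dilated causal convolutions, is given below; it assumes equal input and output channel counts so the identity shortcut applies directly (otherwise a 1x1 convolution would be needed), and the placement of dropout and the final activation is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two dilated causal convolutions with a skip connection: out = g(F(x) + x)."""
    def __init__(self, channels, kernel_size=2, dilation=1, dropout=0.2):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation       # left-only (causal) padding
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                             # x: (batch, channels, time)
        h = F.relu(self.conv1(F.pad(x, (self.pad, 0))))
        h = self.dropout(h)
        h = self.conv2(F.pad(h, (self.pad, 0)))
        return F.relu(h + x)                          # identity shortcut, then activation
```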

3.2 Implementation Details and Benchmark Models

Our TCN model was implemented in PyTorch [27]. A 9-layer dilated causal CNN was built as the component that captures the time-dependent features. Each layer contains 16 filters, and each filter has a width of 2. Every two consecutive convolutional layers form a residual block, after which the previous inputs are added to the flow. The dilation rate increases exponentially across the stacked residual blocks, i.e., 1, 2, 4, 8, \cdots, 128. Within each TCN block, we use dropout (with probability 0.2) for the convolutional layers in order to limit the influence that earlier data have on learning [28], followed by a batch normalization layer [29]. Both dropout and batch normalization provide a regularization effect that helps avoid overfitting. The most widely used activation function, the rectified linear unit (ReLU) [30], is applied after each layer except the last one. The convolutional output of the TCN module is a feature map of size 4, which is then fed to a fully connected classifier with 2 outputs; the model is optimized with the binary cross-entropy cost function, a standard approach for dichotomous classification tasks. Note that the TCN module can be replaced by any other architecture that learns temporal patterns, for example an RNN-type network. As a benchmark deep learning model, we built an LSTM consisting of 2 hidden layers with 4 neurons each that plays the role of the TCN module, while the rest of the model remains the same as described above. The deep learning models were trained using stochastic gradient descent (SGD) with a batch size of 64; in particular, the Adam optimizer [31] with an initial learning rate of 0.005 was used for both the LSTM and the TCN, and both models were trained for 10 epochs. Figure 6 shows the training and validation loss and accuracy over the epochs; as learning proceeds, the model performance keeps improving.

Figure 6: Learning curves of the TCN.
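The sketch below assembles a simplified version of the configuration described above (16 filters of width 2, causal padding, batch normalization, ReLU, dropout 0.2, a small fully connected head, and the Adam optimizer with learning rate 0.005); the residual connections and the exact dilation schedule are simplified, so this is an illustrative stand-in rather than the exact model.

```python
import torch
import torch.nn as nn

class SimpleTCNClassifier(nn.Module):
    """Illustrative stand-in for the model of Section 3.2: a stack of dilated
    causal Conv1d layers followed by a small fully connected classifier."""
    def __init__(self, n_features, n_layers=9, channels=16, kernel_size=2):
        super().__init__()
        layers, in_ch = [], n_features
        for i in range(n_layers):
            dilation = 2 ** (i // 2)                       # doubles every residual block
            pad = (kernel_size - 1) * dilation
            layers += [nn.ConstantPad1d((pad, 0), 0.0),    # causal (left-only) padding
                       nn.Conv1d(in_ch, channels, kernel_size, dilation=dilation),
                       nn.BatchNorm1d(channels),
                       nn.ReLU(),
                       nn.Dropout(0.2)]
            in_ch = channels
        self.tcn = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.Linear(channels, 4), nn.ReLU(), nn.Linear(4, 2))

    def forward(self, x):                # x: (batch, time, features)
        h = self.tcn(x.transpose(1, 2))  # -> (batch, channels, time)
        return self.head(h[:, :, -1])    # classify from the last time step

model = SimpleTCNClassifier(n_features=9)   # the paper extracts 9 features
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
criterion = nn.CrossEntropyLoss()           # cross-entropy over the 2 output logits
```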

In addition to the LSTM, we benchmark against several other machine learning models. Random forest [32] and AdaBoost [33] are ensemble models that deploy enhanced bagging and boosting, respectively; we pick these two models because both have shown powerful predictive ability. Scikit-learn [34] was used to build these models, and both ensembles use decision trees as the base learners. We split the data into training and test sets by a ratio of 80% to 20%, and the optimal hyperparameters of the traditional machine learning models were obtained with K-fold cross-validation based on grid search. For the two deep learning models, we further split the training data into training and validation sets and tuned hyperparameters such as the learning rate and hidden size according to the performance on the validation data.
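A minimal scikit-learn sketch of this benchmark setup follows; the synthetic data, parameter grids, and random seeds are placeholders rather than the values actually used.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Placeholder data: flattened 13-hour windows over the extracted features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 13 * 9))
y = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Both ensembles use decision trees as base learners (scikit-learn's default).
rf = GridSearchCV(RandomForestClassifier(random_state=0),
                  {"n_estimators": [100, 300], "max_depth": [None, 10]}, cv=5)
ada = GridSearchCV(AdaBoostClassifier(random_state=0),
                   {"n_estimators": [50, 200], "learning_rate": [0.5, 1.0]}, cv=5)
rf.fit(X_train, y_train)
ada.fit(X_train, y_train)
```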

4 Experiment Results

In the computational experiments, we compare the performance of several models, decision tree, random forest, AdaBoost, and LSTM, along with our TCN. Table 2 reports the evaluation results on the test set. With bagging or boosting, the random forest and AdaBoost models perform better than the decision tree, demonstrating the power of ensembles; however, tree models do not capture temporal dependencies and thus perform worse overall. The deep learning models, on the other hand, raise the performance to a different level by emphasizing time-dependent patterns. Our TCN exhibits the highest accuracy, recall, F1 score, and area under the ROC curve (AUC). This demonstrates that such networks can improve the prediction of sepsis in intensive care and thus potentially further reduce sepsis mortality; in particular, the TCN shows a stronger ability to extract temporal features than the LSTM. The LSTM does achieve the highest precision; however, considering the high mortality rate of sepsis, a higher recall is arguably more valuable for a sepsis prediction model, as it is safer to have fewer false negatives. We show the confusion matrix and ROC curve of our TCN model in Figure 7, from which we observe a recall of 62%. As shown in the ROC curve, we may accept sacrificing a small amount of accuracy to achieve a lower false negative rate, as the data are expected to be extremely imbalanced.

Model           Accuracy  Precision  Recall  F1 Score  AUC    MCC
Decision Tree   0.667     0.668      0.421   0.517     0.634  0.300
Random Forest   0.677     0.669      0.458   0.544     0.646  0.319
AdaBoost        0.695     0.681      0.522   0.591     0.672  0.362
LSTM            0.779     0.857      0.571   0.685     0.754  0.551
TCN             0.785     0.829      0.619   0.700     0.770  0.558
Table 2: Evaluation metrics comparison of different models on test data.
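The metrics in Table 2 can be computed with scikit-learn as in the sketch below; `y_score` is assumed to be the predicted probability of the sepsis class, which the AUC requires.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, matthews_corrcoef)

def evaluate(y_true, y_pred, y_score):
    """Compute the metrics reported in Table 2. `y_score` is the predicted
    probability of the positive (sepsis) class, needed for the AUC."""
    return {"Accuracy": accuracy_score(y_true, y_pred),
            "Precision": precision_score(y_true, y_pred),
            "Recall": recall_score(y_true, y_pred),
            "F1 Score": f1_score(y_true, y_pred),
            "AUC": roc_auc_score(y_true, y_score),
            "MCC": matthews_corrcoef(y_true, y_pred)}

# e.g. evaluate(y_test, rf.predict(X_test), rf.predict_proba(X_test)[:, 1])
```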

Although we do not provide a full comparative analysis here, during hyperparameter tuning we did find that increasing the number of convolutional layers in the TCN (which enlarges the receptive field) improved model performance. This might explain why the TCN outperforms the LSTM, as LSTMs are known to struggle to retain information over horizons of many tens of time steps. It also indicates that improved prediction based on a longer look-back could strengthen our knowledge about sepsis. As argued by Scherpf et al. [10], the symptoms and related vital sign patterns of sepsis appear quite early, and effective deep learning algorithms such as our TCN are capable of detecting such complex interdependencies between different physiological parameters.

Figure 7: TCN performance plots: (a) confusion matrix and (b) ROC curve.

5 Conclusion and Discussion

In this paper, we built a temporal convolutional network for the binary classification task of sepsis prediction. The experimental results not only show that the deep learning models improve forecasting performance significantly over traditional machine learning models such as random forest and AdaBoost, but also emphasize the value of temporal information and the gradual development of sepsis. In particular, we demonstrate that convolutional layers have several advantages over the widely used recurrent layers for temporal data. There are several directions for future work. First, our dataset is relatively small for deep learning models, and we only extracted 9 features from the MIMIC-III data; with better domain knowledge, a larger dataset and more relevant features could be obtained, providing more useful information to increase the predictive ability of our models. In addition, the layers that learn time-dependent patterns in our deep learning models could be replaced by any other type; for instance, recent advances in attention models [35] could be a good candidate. Finally, as sepsis labels are usually extremely imbalanced, techniques to address imbalanced data could be applied; for instance, the use of focal loss [36] might improve model performance.
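As a pointer for that last direction, a common PyTorch formulation of the binary focal loss [36] is sketched below; it assumes a single-logit output head (our model uses two outputs with cross-entropy), and the gamma and alpha values are the defaults from Lin et al. rather than tuned choices.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss on raw logits: down-weights easy examples by
    (1 - p_t)^gamma and re-balances the classes with alpha."""
    t = targets.float()
    ce = F.binary_cross_entropy_with_logits(logits, t, reduction="none")
    p_t = torch.exp(-ce)                    # probability assigned to the true class
    alpha_t = alpha * t + (1 - alpha) * (1 - t)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# e.g. loss = focal_loss(model(x).squeeze(-1), labels) with a single-logit head
```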

References

  • 1. Hwan Il Kim and Sunghoon Park. Sepsis: Early recognition and optimized treatment. Tuberculosis and respiratory diseases, 82(1):6–14, 2019.
  • 2. Dileepa Senajith Ediriweera, Anuradhani Kasturiratne, Arunasalam Pathmeswaran, Nipul Kithsiri Gunawardena, Buddhika Asiri Wijayawickrama, Shaluka Francis Jayamanne, Geoffrey Kennedy Isbister, Andrew Dawson, Emanuele Giorgi, Peter John Diggle, et al. Mapping the risk of snakebite in sri lanka-a national survey with geospatial analysis. PLoS neglected tropical diseases, 10(7):e0004813, 2016.
  • 3. Victor B Talisa, Sachin Yende, Christopher W Seymour, and Derek C Angus. Arguing for adaptive clinical trials in sepsis. Frontiers in immunology, 9:1502, 2018.
  • 4. Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
  • 5. Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
  • 6. Hasim Sak, Andrew W Senior, and Françoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. 2014.
  • 7. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.
  • 8. David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. Deepar: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 2019.
  • 9. Syama Sundar Rangapuram, Matthias W Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and Tim Januschowski. Deep state space models for time series forecasting. In Advances in neural information processing systems, pages 7785–7794, 2018.
  • 10. Matthieu Scherpf, Felix Gräßer, Hagen Malberg, and Sebastian Zaunseder. Predicting sepsis with a recurrent neural network using the mimic iii database. Computers in biology and medicine, 113:103395, 2019.
  • 11. Paul J Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
  • 12. Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International conference on machine learning, pages 1310–1318, 2013.
  • 13. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • 14. Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
  • 15. Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
  • 16. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • 17. Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In Advances in neural information processing systems, pages 2377–2385, 2015.
  • 18. Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • 19. Alex Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and Kevin J Lang. Phoneme recognition using time-delay neural networks. IEEE transactions on acoustics, speech, and signal processing, 37(3):328–339, 1989.
  • 20. Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
  • 21. Mikolaj Binkowski, Gautier Marti, and Philippe Donnat. Autoregressive convolutional neural networks for asynchronous time series. In International Conference on Machine Learning, pages 580–589, 2018.
  • 22. Yitian Chen, Yanfei Kang, Yixiong Chen, and Zizhuo Wang. Probabilistic forecasting with temporal convolutional neural network. Neurocomputing, 2020.
  • 23. Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
  • 24. Mervyn Singer, Clifford S Deutschman, Christopher Warren Seymour, Manu Shankar-Hari, Djillali Annane, Michael Bauer, Rinaldo Bellomo, Gordon R Bernard, Jean-Daniel Chiche, Craig M Coopersmith, et al. The third international consensus definitions for sepsis and septic shock (sepsis-3). Jama, 315(8):801–810, 2016.
  • 25. Alistair EW Johnson, Jerome Aboab, Jesse D Raffa, Tom J Pollard, Rodrigo O Deliberato, Leo Anthony Celi, and David J Stone. A comparative analysis of sepsis identification methods in an electronic database. Critical care medicine, 46(4):494, 2018.
  • 26. Mathieu Jozwiak, Xavier Monnet, and Jean-Louis Teboul. Implementing sepsis bundles. Annals of translational medicine, 4(17), 2016.
  • 27. Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
  • 28. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
  • 29. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • 30. Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.
  • 31. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • 32. Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
  • 33. Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
  • 34. Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. the Journal of machine Learning research, 12:2825–2830, 2011.
  • 35. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30:5998–6008, 2017.
  • 36. Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.