
Supervised topic models for clinical interpretability

Michael C. Hughes, School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA
Huseyin Melih Elibol, School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA
Thomas McCoy, M.D., Massachusetts General Hospital, Boston, MA, USA
Roy Perlis, M.D., Massachusetts General Hospital, Boston, MA, USA
Finale Doshi-Velez, School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA
Abstract

Supervised topic models can help clinical researchers find interpretable co-occurrence patterns in count data that are relevant for diagnostics. However, standard formulations of supervised Latent Dirichlet Allocation (sLDA) have two problems. First, when documents have many more words than labels, the influence of the labels will be negligible. Second, due to conditional independence assumptions in the graphical model, the impact of supervised labels on the learned topic-word probabilities is often minimal, leading to poor predictions on heldout data. We investigate penalized optimization methods for training sLDA that produce interpretable topic-word parameters and useful heldout predictions, using recognition networks to speed up inference. We report preliminary results on synthetic data and on predicting successful anti-depressant medication given a patient's diagnostic history.

1 Introduction

Abundant count data (procedures, diagnoses, meds) are produced during clinical care. An important question is how such data can assist treatment decisions. Standard pipelines usually involve some dimensionality reduction (there are over 14,000 diagnostic ICD9-CM codes alone) followed by training on the task of interest. Topic models such as latent Dirichlet allocation (LDA) (Blei, 2012) are a popular tool for such dimensionality reduction (e.g. Paul and Dredze (2014) or Ghassemi et al. (2014)). However, especially given noise and irrelevant signal in the data, this two-stage procedure may not produce the best predictions; thus many efforts have tried to incorporate observed labels into the dimensionality reduction model. The most natural extension is supervised LDA (McAuliffe and Blei, 2007), though other attempts exist (Zhu et al., 2012; Lacoste-Julien et al., 2009).

Unfortunately, a recent survey by Halpern et al. (2012) finds that many of these approaches have little benefit, if any, over standard LDA. We take inspiration from recent work (Chen et al., 2015) to develop an optimization algorithm that prioritizes document-topic embedding functions useful for heldout data and allows a penalized balance of generative and discriminative terms, overcoming problems with traditional maximum likelihood point estimation as well as more Bayesian approximate posterior estimation. We extend this work with a recognition network that lets us scale to a dataset of over 800,000 patient encounters by approximating the ideal but expensive embedding required for each document.

2 Methods

We consider models for collections of $D$ documents, each drawn from the same finite vocabulary of $V$ possible word types. Each document consists of a supervised binary label $y_d \in \{0,1\}$ (extensions to non-binary labels are straightforward) and $N_d$ observed word tokens $x_d = \{x_{dn}\}_{n=1}^{N_d}$, with each word token an indicator of a vocabulary type. We can compactly write $x_d$ as a sparse count histogram, where $x_{dv}$ counts how many words of type $v$ appear in document $d$.
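As a concrete illustration of this count-histogram representation (a minimal sketch; the vocabulary size and token indices are invented for the example):

```python
import numpy as np

V = 5                                     # toy vocabulary size
tokens = [0, 2, 2, 4, 2]                  # word tokens x_dn, each an index into the vocabulary
x_d = np.bincount(tokens, minlength=V)    # count histogram: x_dv = number of tokens of type v
# x_d == array([1, 0, 3, 0, 1]); the document length N_d is x_d.sum() == 5
```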

2.1 Supervised LDA and Its Drawbacks

Supervised LDA (McAuliffe and Blei, 2007) is a generative model with the following log-likelihoods:

$$\log p(x_d \mid \phi, \pi_d) = \log \mathrm{Mult}\Big(x_d \,\Big|\, N_d, \sum_{k=1}^{K} \pi_{dk} \phi_k\Big) = \sum_{v=1}^{V} x_{dv} \log\Big(\sum_{k=1}^{K} \pi_{dk} \phi_{kv}\Big) \quad (1)$$
$$\log p(y_d \mid \pi_d, \eta) = \log \mathrm{Bern}\big(y_d \mid \sigma(\eta^T \pi_d)\big) = y_d \log \sigma(\eta^T \pi_d) + (1 - y_d) \log\big(1 - \sigma(\eta^T \pi_d)\big)$$

where $\pi_{dk}$ is the probability of topic $k$ in document $d$, $\phi_{kv}$ is the probability of word $v$ in topic $k$, $\eta_k$ are coefficients for predicting label $y_d$ from the doc-topic probabilities $\pi_d$ via logistic regression, and $\sigma(\cdot)$ is the sigmoid function. Conjugate Dirichlet priors $p(\pi_d)$ and $p(\phi_k)$ can be easily incorporated.
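For concreteness, here is a minimal NumPy sketch of the two log-likelihood terms in Eq. (1), dropping the constant multinomial coefficient; the function and variable names are ours:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_lik_words(x_d, pi_d, phi):
    """First line of Eq. (1), up to an additive constant.
    phi: K x V topic-word probabilities; pi_d: length-K doc-topic probabilities."""
    word_probs = pi_d @ phi                          # length-V mixture: sum_k pi_dk phi_kv
    return np.sum(x_d * np.log(word_probs + 1e-100))

def log_lik_label(y_d, pi_d, eta):
    """Second line of Eq. (1): log Bern(y_d | sigma(eta^T pi_d))."""
    p = sigmoid(eta @ pi_d)
    return y_d * np.log(p) + (1.0 - y_d) * np.log(1.0 - p)
```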

For many applications, we wish to either make predictions of $y_d$ or inspect the topic-word probabilities $\phi$ directly. In these cases, point estimation is a simple and effective training goal, via the objective:

$$\max_{\phi, \pi, \eta}\; w_y \Big(\sum_{d=1}^{D} \log p(y_d \mid \eta, \pi_d)\Big) + w_x \Big(\log p(\phi) + \sum_{d=1}^{D} \big[\log p(x_d \mid \pi_d, \phi) + \log p(\pi_d)\big]\Big) \quad (2)$$

We include penalty weights $w_x > 0, w_y > 0$ to allow adjusting the relative importance of the unsupervised data term and the supervised label term. Taddy (2012) gives a coordinate ascent algorithm for the totally unsupervised objective ($w_x = 1, w_y = 0$), using a natural parameterization to obtain simple updates. Similar algorithms exist for all valid penalty weights.
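Assembled from these terms, the penalized objective of Eq. (2) might be evaluated as follows (a sketch reusing the likelihood functions above; the symmetric Dirichlet hyperparameters alpha and beta are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def objective(phi, pi, eta, X, y, w_x=1.0, w_y=1.0, alpha=1.1, beta=1.1):
    """Eq. (2) with penalty weights w_x, w_y. X: D x V counts; y: length-D labels;
    pi: D x K instantiated doc-topic vectors; phi: K x V topic-word probabilities."""
    supervised = sum(log_lik_label(y[d], pi[d], eta) for d in range(len(y)))
    # log Dir priors on phi and pi, each up to an additive constant
    unsupervised = np.sum((beta - 1.0) * np.log(phi + 1e-100))
    for d in range(X.shape[0]):
        unsupervised += log_lik_words(X[d], pi[d], phi)
        unsupervised += np.sum((alpha - 1.0) * np.log(pi[d] + 1e-100))
    return w_y * supervised + w_x * unsupervised
```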

Two problems arise in practice with such training. First, the standard supervised LDA model sets $w_x = w_y = 1$. However, when $x_d$ contains many words but $y_d$ provides only a few binary labels, the $\log p(x)$ term dominates the objective. We see in Fig. 1 that the estimated topic-word parameters $\phi$ barely change between $w_x = 1, w_y = 0$ and $w_x = 1, w_y = 1$ under this standard training.

Second, the impact of observed labels $y$ on topic-word probabilities $\phi$ can be negligible. Under the model, conditioned on the document-topic probabilities $\pi_d$, the variables $\phi$ are independent of $y$. At training time the $\pi_d$ may be coerced by direct updates using the observed $y_d$ labels to make good predictions, but this quality is not available at test time, when $\pi_d$ must be updated using $\phi$ and $x_d$ alone. Intuitively, this problem comes from the objective treating $x_d$ and $y_d$ as "equal" observations when they are not: our testing scenario always predicts labels $y_d$ from the words $x_d$. Ignoring this can lead to severe overfitting, particularly when the word weight $w_x$ is small.

2.2 End-to-End Optimization

Introducing weights $w_x$ and $w_y$ can help address the first concern (and is equivalent to imposing a threshold on prediction quality). To address the second concern, we pursue gradient-based inference of a modified version of the objective in Eq. (2) that respects the need to use the same embedding of observed words $x_d$ into the low-dimensional $\pi_d$ in both training and test scenarios:

$$\max_{\phi, \eta}\; w_y \Big(\sum_{d=1}^{D} \log p\big(y_d \mid f^*(x_d, \phi), \eta\big)\Big) + w_x \Big(\sum_{d=1}^{D} \log p\big(x_d \mid f^*(x_d, \phi), \phi\big)\Big) \quad (3)$$

The function $f^*$ maps the counts $x_d$ and topic-word parameters $\phi$ to the optimal unsupervised LDA proportions $\pi_d$. The question, of course, is how to define $f^*$. One can estimate $\pi_d$ by solving a maximum a-posteriori (MAP) optimization problem over the space of valid $K$-dimensional probability vectors $\Delta^K$:

$$\pi_d' = \arg\max_{\pi_d \in \Delta^K} \ell(\pi_d), \qquad \ell(\pi_d) = \log p(x_d \mid \pi_d, \phi) + \log \mathrm{Dir}(\pi_d \mid \alpha). \quad (4)$$

We can compute $\pi_d'$ via the exponentiated gradient algorithm (Kivinen and Warmuth, 1997), as described by Sontag and Roy (2011). We begin with a uniform probability vector and iteratively reweight each entry by the exponentiated gradient until convergence, using a fixed stepsize $\xi > 0$:

$$\text{init: } \pi_d^0 \leftarrow \Big[\frac{1}{K} \ldots \frac{1}{K}\Big]. \qquad \text{until converged: } \pi_{dk}^t \leftarrow \frac{p_{dk}^t}{\sum_{j=1}^{K} p_{dj}^t}, \quad p_{dk}^t = \pi_{dk}^{t-1} \cdot e^{\xi \nabla_k \ell(\pi_d^{t-1})}, \quad (5)$$
where $\nabla_k \ell$ denotes the partial derivative of $\ell$ with respect to $\pi_{dk}$.

We can view the final result after $T \gg 1$ iterations, $\pi_d' \approx \pi_d^T$, as a deterministic function $f^*(x_d, \phi)$ of the input document $x_d$ and topic-word parameters $\phi$.
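A sketch of this ideal embedding $f^*$ under the MAP objective of Eq. (4), using the exponentiated-gradient updates of Eq. (5) (the step size, iteration budget, and lack of an explicit convergence check are illustrative choices, not the paper's settings):

```python
import numpy as np

def ideal_embedding(x_d, phi, alpha=1.1, step=0.005, n_iters=100):
    """Approximate f*(x_d, phi): MAP doc-topic proportions via exponentiated gradient."""
    K = phi.shape[0]
    pi_d = np.ones(K) / K                          # init: uniform on the simplex
    for _ in range(n_iters):                       # fixed budget stands in for "until converged"
        mix = pi_d @ phi                           # length-V: sum_k pi_dk phi_kv
        grad = phi @ (x_d / (mix + 1e-100))        # gradient of log p(x_d | pi_d, phi)
        grad += (alpha - 1.0) / (pi_d + 1e-100)    # gradient of log Dir(pi_d | alpha)
        pi_d = pi_d * np.exp(step * grad)          # multiplicative update of Eq. (5)
        pi_d = pi_d / pi_d.sum()                   # renormalize onto Delta^K
    return pi_d
```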

End-to-end training with ideal embedding.

The procedure above does not directly lead to a way to estimate $\phi$ to maximize the objective in Eq. (3). Recently, Chen et al. (2015) developed backpropagation supervised LDA (BP-sLDA), which optimizes Eq. (3) under the extreme discriminative setting $w_y = 1, w_x = 0$ by pushing gradients through the exponentiated gradient updates above. We extend this approach to estimate $\phi$ under any valid penalty weights. We call this "training with ideal embedding", because the embedding is optimal under the unsupervised model.
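Since every update in Eq. (5) is differentiable, an automatic-differentiation package can unroll the embedding inside the loss and push gradients back into $\phi$ and $\eta$. A minimal sketch with the autograd package (the softmax parameterization of $\phi$ and all hyperparameter values are our illustrative choices, not the paper's exact implementation):

```python
import autograd.numpy as np
from autograd import grad

def neg_objective(params, X, y, w_x=0.01, w_y=1.0, alpha=1.1, step=0.005, T=50):
    """Negative Eq. (3): pi_d = f*(x_d, phi) is unrolled so gradients reach phi and eta."""
    phi_logits, eta = params
    e = np.exp(phi_logits - np.max(phi_logits, axis=1, keepdims=True))
    phi = e / np.sum(e, axis=1, keepdims=True)       # rows of phi live on the simplex
    total = 0.0
    for d in range(X.shape[0]):
        pi = np.ones(phi.shape[0]) / phi.shape[0]
        for _ in range(T):                           # unrolled Eq. (5) iterations
            mix = np.dot(pi, phi)
            g = np.dot(phi, X[d] / (mix + 1e-100)) + (alpha - 1.0) / (pi + 1e-100)
            pi = pi * np.exp(step * g)
            pi = pi / np.sum(pi)
        total += w_x * np.sum(X[d] * np.log(np.dot(pi, phi) + 1e-100))
        p = 1.0 / (1.0 + np.exp(-np.dot(eta, pi)))   # sigma(eta^T pi_d)
        total += w_y * (y[d] * np.log(p) + (1.0 - y[d]) * np.log(1.0 - p))
    return -total

neg_objective_grad = grad(neg_objective)             # backpropagates through the embedding
```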

End-to-end training with approximate embedding.

Direct optimization through the ideal embedding function $f^*$, as done by Chen et al. (2015), has high implementation complexity and runtime cost. We find in practice that each document requires dozens or even hundreds of iterations of Eq. (5) to converge reasonably. Performing such iterations at scale and back-propagating through them is possible with a careful C++ implementation, but remains the computational bottleneck. Instead, we suggest an approximation: use a simpler embedding function $f^{\lambda}(x_d, \phi)$ trained to approximate the ideal embedding. Initial experiments suggest a simple multi-layer perceptron (MLP) recognition network with one hidden layer of size $H \approx 50$ does reasonably well:

$$f_k^{\lambda}(x_d, \phi) = \mathrm{softmax}\Big(\sum_{h=1}^{H} \lambda_{hk}^{\text{output}}\, \sigma\Big(\sum_{v=1}^{V} \lambda_{hv}^{\text{hidden}}\, x_{dv}\, \phi_{kv}\Big)\Big). \quad (6)$$

During training, we periodically pause our gradient descent over $\eta, \phi$ and update $\lambda$ to minimize a KL-divergence loss between the approximate embedding $f^{\lambda}$ and the ideal but expensive embedding $f^*$.
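A sketch of the recognition network in Eq. (6) and the periodic $\lambda$ update, reusing the ideal_embedding function above (the KL direction, with $f^*$ as the target, and the weight shapes are our assumptions):

```python
import numpy as np

def approx_embedding(x_d, phi, lam_hidden, lam_output):
    """Eq. (6). lam_hidden: H x V weights; lam_output: H x K weights; phi: K x V."""
    inputs = x_d[np.newaxis, :] * phi              # K x V: x_dv * phi_kv for each topic k
    hidden = 1.0 / (1.0 + np.exp(-(lam_hidden @ inputs.T)))  # H x K sigmoid activations
    scores = np.sum(lam_output * hidden, axis=0)   # length-K: sum_h lam_hk^output * (...)
    e = np.exp(scores - np.max(scores))
    return e / e.sum()                             # softmax over topics

def embedding_kl_loss(lam_hidden, lam_output, docs, phi):
    """KL(f* || f^lambda) averaged over documents, for the periodic lambda updates."""
    loss = 0.0
    for x_d in docs:
        target = ideal_embedding(x_d, phi)         # expensive Eq. (5) target
        approx = approx_embedding(x_d, phi, lam_hidden, lam_output)
        loss += np.sum(target * (np.log(target + 1e-100) - np.log(approx + 1e-100)))
    return loss / len(docs)
```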

[Figure 1: image grid. Columns compare three training methods: instantiated $\pi$ ($\max \log p(y, x \mid \eta, \phi, \pi)$), ideal embedding ($\max \log p(y, x \mid \eta, \phi, f^*(x, \phi))$), and approximate embedding ($\max \log p(y, x \mid \eta, \phi, f^{\lambda}(x, \phi))$). Rows compare penalty weights $(w_x, w_y) = (1, 0), (1, 1), (.01, 1), (0, 1)$; the instantiated-$\pi$ method is not applicable when $w_x = 0$. The top row shows the true topics and example documents.]

Fig. 1: Toy Bars Case Study. Top row: true topic-word parameters and example documents. Only the last 6 single bars were used to generate data $x$, but all 10 topics produce binary labels $y$ given $x$. Remaining rows: estimated topic-word parameters and prediction results for different training algorithms (columns), where each row is a setting of the penalty weights in the objective: $w_y$ weights the supervised loss term, $w_x$ the unsupervised data term. We perform 3 separate initializations of each method with $K = 6$ topics and report the one with the best (lowest) training error rate. Test-set error rates are computed by using only the observed words $x_d$ and the estimated topics $\phi$ as input for each document, then computing $\pi_d$ via $f^*(x_d, \phi)$ in Eq. (5).

3 Case study: Toy bars data

To understand how different training objectives impact both predictive performance and the interpretability of topic-word parameters, we consider a version of the toy bars dataset inspired by Griffiths and Steyvers (2004), modified so that the optimal $\phi$ parameters are distinct for the unsupervised and supervised LDA objectives. Our dataset has 144 vocabulary words, visualized as pixels in a square grid in Fig. 1. To generate the observed words $x$, we use 6 true topics: 3 horizontal bars and 3 vertical bars. However, we generate the label $y_d$ using an expanded set of 10 topics, where the extra topics are combinations of the 6 bars. Some combinations produce positive labels, but no single bar does. We train multiple initializations of each possible training objective and penalty weight setting, and show the best run of each method in Fig. 1. Our conclusions are listed below:
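For reference, the 6 generating bar topics can be constructed as below (a sketch; the grid positions of the bars are our illustrative choices, since the paper does not specify them):

```python
import numpy as np

def make_bar_topics(grid=12):
    """6 bar topics on a grid*grid (= 144 word) vocabulary: 3 horizontal + 3 vertical bars."""
    topics = []
    for i in (1, 5, 9):                            # illustrative bar positions
        h = np.zeros((grid, grid)); h[i, :] = 1.0  # horizontal bar topic
        v = np.zeros((grid, grid)); v[:, i] = 1.0  # vertical bar topic
        topics.extend([h.ravel(), v.ravel()])
    phi = np.array(topics)                         # 6 x 144 topic-word matrix
    return phi / phi.sum(axis=1, keepdims=True)    # rows sum to 1: word distributions
```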

Standard training that instantiates $\pi$ can either ignore labels or overfit. Fig. 1's first column shows two problematic behaviors of the optimization objective in Eq. (2). First, when $w_x = 1$, the topic-word parameters are essentially identical whether labels are ignored ($w_y = 0$) or included ($w_y = 1$). Second, when the observed data is weighted very low ($w_x = 0.01$), we see severe overfitting: the learned embeddings at training time are not reproducible at test time.

Ideal end-to-end training can be more predictive but has expensive runtime. In contrast to the problems with standard training, the middle column of Fig. 1 shows that using the ideal test-time embedding function $f^*$ during training as well can produce much lower error rates on heldout data. Varying the data weight $w_x$ interpolates between interpretable topic-word parameters $\phi$ and good predictions. One caveat to the ideal embedding is its expense: completing 100 sweeps through this 1,000-document toy dataset takes about 2.5 hours using our vectorized pure-Python implementation with autograd.

Approximate end-to-end training is much cheaper and often does as well. The far right column of Fig. 1 shows that our proposed approximate embedding $f^{\lambda}$ often yields similar predictive power and interpretable topic-word parameters when $w_x > 0$. Furthermore, it is about 3.6X faster to train because it avoids the expensive embedding iterations at every document.

(prevalence) DRUG | approx $f^{\lambda}$, $w_x{=}0$, $w_y{=}1$ | ideal $f^{*}$, $w_x{=}0$, $w_y{=}1$ | ideal $f^{*}$, $w_x{=}0.01$, $w_y{=}1$ | ideal $f^{*}$, $w_x{=}1$, $w_y{=}1$ | ideal $f^{*}$, $w_x{=}1$, $w_y{=}0$ | BoW
(0.215) citalopram 0.65 0.64 0.63 0.62 0.61 0.72
(0.135) fluoxetine 0.66 0.64 0.64 0.63 0.63 0.76
(0.133) sertraline 0.66 0.66 0.63 0.63 0.63 0.75
(0.119) trazodone 0.64 0.66 0.64 0.61 0.62 0.65
(0.115) bupropion 0.64 0.64 0.59 0.56 0.58 0.71
(0.070) amitriptyline 0.77 0.76 0.77 0.75 0.75 0.78
(0.059) venlafaxine 0.64 0.62 0.62 0.61 0.61 0.73
(0.059) paroxetine 0.68 0.73 0.74 0.76 0.75 0.76
(0.047) mirtazapine 0.70 0.69 0.70 0.71 0.70 0.67
(0.046) duloxetine 0.71 0.69 0.70 0.69 0.70 0.74
(0.041) escitalopram 0.65 0.62 0.61 0.61 0.61 0.80
(0.038) nortriptyline 0.71 0.73 0.70 0.70 0.71 0.71
(0.007) fluvoxamine 0.70 0.72 0.74 0.77 0.76 0.93
(0.007) imipramine 0.40 0.56 0.50 0.48 0.48 0.82
(0.006) desipramine 0.47 0.57 0.54 0.57 0.54 0.72
(0.003) nefazodone 0.71 0.65 0.71 0.72 0.72 0.80
Table 1: Heldout AUC scores for predicting which of 16 drugs for treating depression will succeed for each patient. Each drug decision is an independent binary task, since multiple drugs might be given to a patient. "BoW" is a logistic regression baseline using the raw counts $x_d$ as features.

4 Case study: Predicting drugs to treat depression

We study a cohort of 875,080 encounters from 49,322 patients drawn from two large academic medical centers; each patient has at least one ICD9 diagnostic code for major depressive disorder (296.2x, 296.3x, or 311, or the ICD10 equivalent). Each included patient had an identified successful treatment: a prescription repeated at least 3 times in 6 months with no change.

We extracted all procedures, diagnoses, labs, and meds from the EHR (22,000 total codewords). For each encounter, we built $x_d$ by concatenating count histograms from the last three months and from all prior history, as sketched below. To simplify, we reduced this to the 9,621 codewords that occurred in at least 1,000 distinct encounters. The prediction goal was to identify which of 16 common anti-depressant drugs would be successful for each patient. (Predicting all 25 primaries and 166 augments is future work.)
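A sketch of this feature construction (the array names and layout are hypothetical; the paper specifies only the codeword filtering and the two-window concatenation):

```python
import numpy as np

def build_encounter_features(recent_counts, history_counts, encounter_freq, min_encounters=1000):
    """x_d for one encounter: concatenated count histograms over the retained codewords.
    recent_counts / history_counts: length-22000 counts for the last 3 months / all prior history;
    encounter_freq: number of distinct encounters in which each codeword appears."""
    keep = encounter_freq >= min_encounters        # retains 9,621 of ~22,000 codewords
    return np.concatenate([recent_counts[keep], history_counts[keep]])
```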

Table 1 compares each method's area under the ROC curve (AUC) with $K = 50$ topics on a held-out set of 10% of patients. Our training algorithm using the ideal embedding $f^*$ improves its predictions over a baseline unsupervised LDA model as the weight $w_x$ is driven to zero. Our approximate embedding $f^{\lambda}$ is roughly 2-6X faster, allowing a full pass through all 800K encounters in about 8 hours, yet offers competitive performance on many drug tasks, except for drugs like desipramine or imipramine for which fewer than 1% of encounters have a positive label. Unfortunately, our best sLDA model is inferior to simple bag-of-words features plus a logistic regression classifier (rightmost column, "BoW"), which we suspect is due to local optima. To remedy this, future work can explore improved data-driven initializations.

References

  • Blei (2012) D. M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.
  • Chen et al. (2015) J. Chen, J. He, Y. Shen, L. Xiao, X. He, J. Gao, X. Song, and L. Deng. End-to-end learning of LDA by mirror-descent back propagation over a deep architecture. In Neural Information Processing Systems, 2015.
  • Ghassemi et al. (2014) M. Ghassemi, T. Naumann, F. Doshi-Velez, N. Brimmer, R. Joshi, A. Rumshisky, and P. Szolovits. Unfolding physiological state: mortality modelling in intensive care units. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 75–84. ACM, 2014.
  • Griffiths and Steyvers (2004) T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 2004.
  • Halpern et al. (2012) Y. Halpern, S. Horng, L. A. Nathanson, N. I. Shapiro, and D. Sontag. A comparison of dimensionality reduction techniques for unstructured clinical text. In ICML workshop on clinical data analysis, 2012.
  • Kivinen and Warmuth (1997) J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.
  • Lacoste-Julien et al. (2009) S. Lacoste-Julien, F. Sha, and M. I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. In Neural Information Processing Systems, 2009.
  • McAuliffe and Blei (2007) J. D. McAuliffe and D. M. Blei. Supervised topic models. In Neural Information Processing Systems, 2007.
  • Paul and Dredze (2014) M. J. Paul and M. Dredze. Discovering health topics in social media using topic models. PLoS One, 9(8):e103408, 2014.
  • Sontag and Roy (2011) D. Sontag and D. Roy. Complexity of inference in latent Dirichlet allocation. In Neural Information Processing Systems, 2011.
  • Taddy (2012) M. Taddy. On estimation and selection for topic models. In Artificial Intelligence and Statistics, 2012.
  • Zhu et al. (2012) J. Zhu, A. Ahmed, and E. P. Xing. MedLDA: maximum margin supervised topic models. The Journal of Machine Learning Research, 13(1):2237–2278, 2012.