Hierarchical Pretraining for Biomedical Term Embeddings

Bryan Cai¹,∗,†, Sihang Zeng²,†, Yucong Lin³, Zheng Yuan⁴, Doudou Zhou⁵, and Lu Tian⁶
¹ Department of Computer Science, Stanford University, bxcai@stanford.edu, 0000-0001-9335-5828
² Department of Electronic Engineering, Tsinghua University, Beijing, China. zengsh19@mails.tsinghua.edu.cn, 0009-0003-2921-829X
³ Institute of Engineering Medicine, Beijing Institute of Technology, Beijing, China, linyucong@bit.edu.cn, 0000-0002-9039-0318
⁴ Alibaba Damo Academy, Hangzhou, China, yuanzheng.yuanzhen@alibaba-inc.com, 0000-0001-7179-2437
⁵ Department of Biostatistics, Harvard T.H. Chan School of Public Health, doudouzhou@hsph.harvard.edu, 0000-0002-0830-2287
⁶ Department of Biomedical Data Science, Stanford University, lutian@stanford.edu, 0000-0002-5893-0169

† Equal contribution
∗ Corresponding author

Abstract

Electronic health records (EHR) contain narrative notes that provide extensive details on the medical condition and management of patients. Natural language processing (NLP) of clinical notes can use observed frequencies of clinical terms as predictive features for downstream applications such as clinical decision making and patient trajectory prediction. However, due to the vast number of highly similar and related clinical concepts, a more effective modeling strategy is to represent clinical terms as semantic embeddings via representation learning and use the low-dimensional embeddings as feature vectors for predictive modeling. To achieve efficient representation, fine-tuning pretrained language models with biomedical knowledge graphs may generate better embeddings for biomedical terms than those from standard language models alone. These embeddings can effectively discriminate synonymous pairs from those that are unrelated. However, they often fail to capture different degrees of similarity or relatedness for concepts that are hierarchical in nature. To overcome this limitation, we propose HiPrBERT, a novel biomedical term representation model trained on additionally compiled data that contains hierarchical structures for various biomedical terms. We modify an existing contrastive loss function to extract information from these hierarchies. Our numerical experiments demonstrate that HiPrBERT effectively learns the pair-wise distance from hierarchical information, resulting in substantially more informative embeddings for further biomedical applications.

Keywords: medical term representation, knowledge graph embedding, contrastive learning

1 Introduction

Biomedical term representations condense the semantic meanings of terms into a low-dimensional space, which is useful for various downstream applications, such as clinical decision making, patient trajectory modeling, and automated phenotyping. Current state-of-the-art methods [1, 2, 3] employ pretrained language models (PLMs) with a contrastive learning loss to generate contextual embeddings from biomedical knowledge graphs like the Unified Medical Language System (UMLS) [4]. These methods focus on term normalization or entity linking problems and expect similar terms to be close in the embedding space. While they excel at similarity modeling, even in challenging tasks like unsupervised synonym grouping [2], they do not perform well in modeling hierarchies between biomedical terms [5].

Recent studies have attempted to incorporate hierarchical information into biomedical term representations. For example, Kalyan and Sangeetha (2021) used a retrofitting algorithm and UMLS relationships to incorporate ontology relationship knowledge into term representations [5]. However, this method treats all relationships equally. Another approach, proposed by Yang et al. (2022), is based on a hierarchical triplet loss with a dynamic margin learned from the hierarchy of ICD codes [6, 7], which improved performance on the ICD coding task. However, this method is less flexible because it requires an explicit parametrization of the dynamic margin, which can be difficult in the presence of many different classes of term pairs.

To incorporate specific biomedical term hierarchies into training the embedding, we select a set of terms based on these hierarchies for each anchor term. Our model learns to improve the concordance between the cosine similarities of embedded term pairs and their similarities within hierarchies. Existing techniques for optimizing the rank loss require the specification of margins between adjacent categories [8], which is delicate and time-consuming [9, 10].

In this paper, we present a novel hierarchical biomedical term representation model that leverages both the synonyms in UMLS and hierarchies in EHR codified data. To this end, we gathered medication terms from RxNorm [11], phenotype terms from PheCode [12], procedure terms from CPT [13], and laboratory terms from LOINC [14], and organized them into hierarchical structures for embedding training. Taking advantage of the constructed hierarchies, we adapt the existing contrastive loss function to handle any number of ordered categories without needing to specify any between-category margin. We name our model Hierarchical Pretrained BERT (HiPrBERT).

2 Related Works

Biomedical term representation is the foundation of biomedical language understanding. Word embeddings are generally trained with the word2vec algorithm on a biomedical corpus [15]. Cui2vec factorizes a shifted positive pointwise mutual information matrix to obtain a lower-dimensional embedding of the words [16]. CODER and SapBERT extend the fixed vocabulary of word2vec models to arbitrary inputs by using pretrained language models and contrastive learning to learn from the synonyms in UMLS. To encode hierarchies in biomedical term representations, Yang et al. (2022) design a hierarchical triplet loss with a pre-assigned dynamic margin to learn from the hierarchy of ICD codes [6], while Kalyan and Sangeetha (2021) use a retrofitting algorithm to refine the representations using UMLS relationships [5]. These methods facilitate the development of biomedical NLP, but remain limited in exploiting the fine-grained information in various types of hierarchies.

3 Data and Methods

We will introduce the structure of the input data, the general model architecture that we use to build embeddings, the hard pair mining strategy, and the loss functions.

3.1 UMLS and Medical Hierarchies

HiPrBERT leverages two main sources of data. The first is the UMLS, a knowledge graph that encodes relations across many different medical vocabularies. These terms have no inherent order, and there are many different types of relations between pairs of terms. In addition to the UMLS knowledge graph, we leverage a collection of hierarchies. Specifically, PheCode is a hierarchy containing ICD codes that can be represented as a forest of trees. The root of each tree is a separate concept, and the children of a node represent more specific concepts. LOINC is another hierarchy representing laboratory observations, containing 171,191 nodes from 27 trees whose depths vary from 2 to 13. Similarly, RxNorm and CPT are also represented as forests focusing on medication and procedure terms, respectively. PheCode contains 1,601 nodes, RxNorm contains 192,683 nodes, and CPT contains 10,360 nodes. In these hierarchies, the structure contains more information than the UMLS on the “closeness” between various biomedical terms, which can be used to obtain a finer embedding that better discriminates closely related terms from moderately related terms.

It is worth noting that although the number of terms in each hierarchy is significantly lower than the number of terms in the UMLS, we expect that we can obtain enough high-quality training pairs from the hierarchy to enhance the embeddings in most relevant regions of the embedding space. In practical terms, each hierarchy consists of two mappings: one from parents to children and one from codes to the biomedical term strings.

Table 1: Hierarchy map

Parent	Child
LP29693-6	LP158133-1
LP29693-6	LP7798-4

Table 2: String map

Code	String
LP29693-6	Laboratory
LP158133-1	HNA
LP7798-4	Fertility testing
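
As an illustration of the two mappings, the following minimal sketch loads the rows of Tables 1 and 2 into lookup tables. The function name `build_maps` and the in-memory row format are assumptions made for this example, and the sketch assumes each node has a single parent, as in a forest of trees.

```python
from collections import defaultdict

# Rows copied from Table 1 (parent code, child code).
hierarchy_rows = [
    ("LP29693-6", "LP158133-1"),
    ("LP29693-6", "LP7798-4"),
]

# Rows copied from Table 2 (code, term string).
string_rows = [
    ("LP29693-6", "Laboratory"),
    ("LP158133-1", "HNA"),
    ("LP7798-4", "Fertility testing"),
]


def build_maps(hierarchy_rows, string_rows):
    """Return (children, parent, strings) lookup tables for one hierarchy.

    Hypothetical helper: assumes every child code has exactly one parent.
    """
    children = defaultdict(list)  # parent code -> list of child codes
    parent = {}                   # child code -> parent code
    for parent_code, child_code in hierarchy_rows:
        children[parent_code].append(child_code)
        parent[child_code] = parent_code
    strings = dict(string_rows)   # code -> biomedical term string
    return dict(children), parent, strings


children, parent, strings = build_maps(hierarchy_rows, string_rows)

# Roots are the codes that never appear as a child.
roots = [code for code in strings if code not in parent]

# Print each parent-child pair as term strings.
for child_code in children["LP29693-6"]:
    print(strings["LP29693-6"], "->", strings[child_code])
```

Keeping the code-level structure separate from the strings means the tree can be traversed by code while the strings are only looked up when a term is fed to the encoder.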

3.2 Term Embeddings

HiPrBERT takes an input term $s$ and outputs a corresponding embedding $\mathbf{e}_{s}\in\mathbb{R}^{d}$. Specifically, the input $s$ is first converted into a series of tokens, which are then encoded by HiPrBERT into a series of $d$-dimensional hidden state vectors:

$$[\text{CLS}],\mathbf{t}_{0},\mathbf{t}_{1},...,\mathbf{t}_{n},[\text{SEP}]\xrightarrow{\text{HiPrBERT}}\mathbf{h}_{[\text{CLS}]},\mathbf{h}_{0},\mathbf{h}_{1},...,\mathbf{h}_{n},\mathbf{h}_{[\text{SEP}]}.$$

The embedding of $s$ is defined to be the latent vector corresponding to the [CLS] token:

$$s\rightarrow\mathbf{e}_{s}=\mathbf{h}_{[\text{CLS}]}\in\mathbb{R}^{d}.$$

3.3 Distance metric

Similar to SapBERT, our approach learns term representations by maximizing the embedding similarity between term-term pairs that are “close” and minimizing the embedding similarity between term-term pairs that are “far”. We define the embedding similarity between terms $s_{i}$ and $s_{j}$ as $S_{ij}=\cos(\mathbf{e}_{i},\mathbf{e}_{j})$. We also define the following distances to quantify the resemblance between terms $s_{i}$ and $s_{j}$. The particular numerical values are not important; only their order matters when training the embeddings.

1. If $s_{i}$ and $s_{j}$ are from the UMLS, $d(s_{i},s_{j})=\begin{cases}0&\text{$s_{i}$ and $s_{j}$ are synonyms;}\\ 3&\text{otherwise.}\end{cases}$
2. If $s_{i}$ and $s_{j}$ are from a hierarchy, $d(s_{i},s_{j})=\begin{cases}0&\text{$s_{i}$ and $s_{j}$ are synonyms;}\\ 1&\text{$s_{i}$ and $s_{j}$ have the same parent (a sibling pair);}\\ 2&\text{$s_{i}$ and $s_{j}$ are a parent-child pair;}\\ 3&\text{otherwise.}\end{cases}$
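
The hierarchy case can be sketched as a small function over a child-to-parent map. This is a minimal illustration, not the training code: the name `hierarchy_distance`, the toy codes `A` to `D`, and the convention that two strings attached to the same code count as synonyms are all assumptions of this example.

```python
def hierarchy_distance(code_i, code_j, parent):
    """Ordinal distance between two hierarchy codes, following the cases above.

    `parent` maps each child code to its parent code; roots are absent from it.
    Assumption: terms attached to the same code are treated as synonyms.
    """
    if code_i == code_j:
        return 0  # synonyms
    parent_i, parent_j = parent.get(code_i), parent.get(code_j)
    if parent_i is not None and parent_i == parent_j:
        return 1  # sibling pair: both share the same parent
    if parent_i == code_j or parent_j == code_i:
        return 2  # parent-child pair
    return 3      # every other pair


# Hypothetical toy tree: A is the root, B and C are its children, D is a child of B.
toy_parent = {"B": "A", "C": "A", "D": "B"}

print(hierarchy_distance("B", "B", toy_parent))  # same code
print(hierarchy_distance("B", "C", toy_parent))  # siblings
print(hierarchy_distance("A", "B", toy_parent))  # parent-child
print(hierarchy_distance("A", "D", toy_parent))  # grandparent-grandchild
```

Under this definition, grandparent-grandchild and uncle-nephew pairs fall into the "otherwise" category together with unrelated pairs.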

3.4 Hard Pair Mining

When sampling UMLS term data, we use an online triplet miner to select negative pairs. Specifically, we consider all triplets of terms $(s_{a},s_{p},s_{n})$ in which $(s_{a},s_{p})$ are synonymous and $(s_{a},s_{n})$ are non-synonymous. Based on the initial embeddings, $(s_{a},s_{p},s_{n})\rightarrow(\mathbf{e}_{a},\mathbf{e}_{p},\mathbf{e}_{n})$, we compute the difference between $\cos(\mathbf{e}_{a},\mathbf{e}_{p})$ and $\cos(\mathbf{e}_{a},\mathbf{e}_{n})$ and select the triplets with this difference $>0.25$ to be included in our minibatch for further training. We do the same for UMLS relational data.
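
The triplet enumeration can be sketched in plain Python as follows. This is an illustrative sketch only: the function names, the toy embeddings, and the concept labels `C1` to `C3` are hypothetical, and the selection rule is passed in as a predicate so that the threshold on the similarity difference stays explicit. The call below applies the rule exactly as stated above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_triplets(embeddings, labels, select):
    """Enumerate (anchor, positive, negative) index triplets within a batch.

    Terms with equal labels are treated as synonyms. A triplet is kept when
    select(gap) is true, where gap = cos(e_a, e_p) - cos(e_a, e_n).
    """
    kept = []
    n = len(labels)
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue  # p must be a different term with the same label
            for neg in range(n):
                if labels[neg] == labels[a]:
                    continue  # negatives must carry a different label
                gap = cosine(embeddings[a], embeddings[p]) - cosine(
                    embeddings[a], embeddings[neg]
                )
                if select(gap):
                    kept.append((a, p, neg, gap))
    return kept


# Hypothetical 2-d embeddings; terms 0 and 1 share a concept label.
toy_embeddings = [(1.0, 0.0), (0.8, 0.6), (0.0, 1.0), (0.6, 0.8)]
toy_labels = ["C1", "C1", "C2", "C3"]

# Selection rule as stated in the text: difference greater than 0.25.
selected = mine_triplets(toy_embeddings, toy_labels, lambda gap: gap > 0.25)
for a, p, neg, gap in selected:
    print(a, p, neg, round(gap, 3))
```

The triple loop is cubic in the batch size and is meant only to make the criterion concrete; a batched implementation would compute the full similarity matrix once.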

For hierarchical data, we leverage the structure of the tree to construct minibatches. For example, we use distance 0 pairs as positive samples, and distance $>0$ pairs as negative samples. We do this with every distance to encourage separation between varying levels of similarity.

3.5 Loss Function

Given an anchor term $s_{i}$ and a set of terms $\Omega_{i}$, we can define the sets

$$\Omega_{i}^{(0)}(d_{0})=\left\{j\in\Omega_{i}\mid d(s_{i},s_{j})\leq d_{0}\right\}\quad\text{and}\quad\Omega_{i}^{(1)}(d_{0})=\left\{j\in\Omega_{i}\mid d(s_{i},s_{j})>d_{0}\right\}.$$

In other words, $\Omega_{i}^{(0)}(d_{0})$ contains all terms that are at most distance $d_{0}$ from $s_{i}$, and $\Omega_{i}^{(1)}(d_{0})$ contains all terms that are further than $d_{0}$ away. Our goal is to create embeddings such that the similarity between $s_{i}$ and terms in $\Omega_{i}^{(0)}(d_{0})$ is greater than that between $s_{i}$ and terms in $\Omega_{i}^{(1)}(d_{0})$. We use the multi-similarity (MS) loss [17]. For UMLS data, we use the standard MS loss function:

$$\sum_{i=1}^{k}\left[\alpha^{-1}\log\left(1+\sum_{j\in\Omega_{i}^{(0)}(0)}e^{-\alpha\left(S_{ij}-\lambda\right)}\right)+\beta^{-1}\log\left(1+\sum_{j\in\Omega_{i}^{(1)}(0)}e^{\beta\left(S_{ij}-\lambda\right)}\right)\right],$$

where $\alpha=2,\beta=2,\lambda=0.5$. Note that the terms in $\Omega_{i}^{(1)}(0)$ come from the triplet mining procedure. For hierarchical data, we use a modified loss:

$$\sum_{d_{0}=0}^{2}\sum_{i=1}^{k}\left[\alpha^{-1}\log\left(1+\sum_{j\in\Omega_{i}^{(0)}(d_{0})}e^{-\alpha\left(S_{ij}-\lambda\right)}\right)+\beta^{-1}\log\left(1+\sum_{j\in\Omega_{i}^{(1)}(d_{0})}e^{\beta\left(S_{ij}-\lambda\right)}\right)\right],$$

with the same set of tuning parameters.
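
Both losses can be written around a single per-anchor term evaluated at a threshold $d_{0}$. The sketch below is a plain-Python illustration under stated assumptions: the function names and the batch layout (one list of similarities and one list of distances per anchor) are hypothetical, the similarities are given as precomputed numbers rather than produced by the encoder, and no gradient computation is shown.

```python
import math

ALPHA, BETA, LAMBDA = 2.0, 2.0, 0.5  # tuning parameters stated in the text


def ms_term(sims, dists, d0, alpha=ALPHA, beta=BETA, lam=LAMBDA):
    """Bracketed term of the loss for one anchor at threshold d0.

    sims[j]  : cosine similarity S_ij between the anchor and term j
    dists[j] : ordinal distance d(s_i, s_j) between the anchor and term j
    """
    near = [s for s, d in zip(sims, dists) if d <= d0]  # Omega_i^(0)(d0)
    far = [s for s, d in zip(sims, dists) if d > d0]    # Omega_i^(1)(d0)
    pull = math.log(1.0 + sum(math.exp(-alpha * (s - lam)) for s in near)) / alpha
    push = math.log(1.0 + sum(math.exp(beta * (s - lam)) for s in far)) / beta
    return pull + push


def umls_loss(batch):
    """Standard MS loss: a single threshold d0 = 0 (synonym vs. non-synonym)."""
    return sum(ms_term(sims, dists, 0) for sims, dists in batch)


def hierarchical_loss(batch):
    """Modified loss: the same term summed over thresholds d0 = 0, 1, 2."""
    return sum(
        ms_term(sims, dists, d0) for d0 in range(3) for sims, dists in batch
    )


# One anchor with four candidate terms at distances 0, 1, 2, 3 (toy values).
distances = [0, 1, 2, 3]
well_ordered = [([0.9, 0.7, 0.5, 0.1], distances)]  # similarity falls with distance
mis_ordered = [([0.1, 0.5, 0.7, 0.9], distances)]   # similarity rises with distance

print(hierarchical_loss(well_ordered) < hierarchical_loss(mis_ordered))
```

Because each threshold only splits the candidates into a nearer and a farther group, the modified loss depends on the order of the distance categories but never on a margin between them.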

4 Experiments

4.1 Model Training

Our training process is similar to that of SapBERT, with the key difference being the loss functions used. Using PyTorch [18] and the transformers library [19], our model was initialized from PubMedBERT [20] and trained using AdamW [21] with a learning rate of $2\times 10^{-5}$, a weight decay rate of 0.01, and a linear learning rate scheduler. We use a training batch size of 256, and train on the preprocessed UMLS synonym data, UMLS relation data, and hierarchical data for one epoch. This equates to about 120 thousand iterations, and takes less than 10 hours on a single GPU machine.
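
For readers who want to see what a linear schedule looks like with these numbers, here is a small sketch. The text does not state whether a warmup phase is used or what the final learning rate is, so the default of no warmup and the decay to zero are assumptions of this example, as is the function name `linear_lr`.

```python
BASE_LR = 2e-5         # learning rate stated in the text
TOTAL_STEPS = 120_000  # "about 120 thousand iterations"


def linear_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS, warmup_steps=0):
    """Learning rate at a given step under a linear schedule.

    Assumptions: optional linear warmup (none by default), then linear
    decay from base_lr to zero at total_steps.
    """
    if warmup_steps and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the start, midpoint, and end of training.
for step in (0, 60_000, 120_000):
    print(step, linear_lr(step))
```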

4.2 Model Evaluation

To objectively evaluate our models, we randomly selected evaluation pairs from hierarchies that were not used in model training. For each evaluation pair, we calculated the cosine similarity between the respective embeddings to determine their relatedness. The quality of the embedding was measured using the area under the ROC curve (AUC) for discriminating between distance $i$ pairs and distance $j$ pairs, where $0\leq i<j\leq 3$. In addition, we also evaluated the embedding performance via Spearman’s correlation and the precision-recall curve.
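
The AUC for one pair of distance categories equals the probability that a randomly drawn closer pair has a higher cosine similarity than a randomly drawn farther pair, with ties counted as one half. The sketch below computes it by direct comparison for every $(i, j)$ with $i<j$. The function names and the toy similarity values are made up for illustration and are not taken from the evaluation data.

```python
def pairwise_auc(closer_sims, farther_sims):
    """ROC AUC for separating closer pairs from farther pairs by similarity.

    Computed as the fraction of (closer, farther) comparisons in which the
    closer pair has the higher similarity; ties count as one half.
    """
    wins = 0.0
    for s_close in closer_sims:
        for s_far in farther_sims:
            if s_close > s_far:
                wins += 1.0
            elif s_close == s_far:
                wins += 0.5
    return wins / (len(closer_sims) * len(farther_sims))


def auc_table(sims_by_distance):
    """AUC for every pair of distance categories (i, j) with i < j."""
    keys = sorted(sims_by_distance)
    return {
        (i, j): pairwise_auc(sims_by_distance[i], sims_by_distance[j])
        for pos, i in enumerate(keys)
        for j in keys[pos + 1:]
    }


# Hypothetical cosine similarities of evaluation pairs, grouped by distance.
toy_sims = {
    0: [0.90, 0.80],
    1: [0.85, 0.60],
    2: [0.50, 0.40],
    3: [0.10, 0.45],
}

table = auc_table(toy_sims)
for categories, auc in sorted(table.items()):
    print(categories, auc)
```

The double loop is quadratic in the number of pairs; for large evaluation sets a rank-based computation gives the same value more efficiently.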

For relatedness tasks, we used pairs of terms in our holdout set for various relations to test the models. There are many different types of relationships, and we report three of clinical importance, as well as the average of the 28 most common relations. We also included performance on the Cadec term normalization task.

We compare HiPrBERT with a set of competitors including SapBERT, CODER, PubMedBERT, BioBERT, BioGPT, and DistilBERT. To ensure a fair comparison, SapBERT is retrained without the testing data.

4.3 Evaluation Results

The AUC values for discriminating pairs of different distances are reported in Table 3. HiPrBERT, fine-tuned on hierarchical datasets, outperforms all its competitors in every category, except for 1 vs 3, where its performance is very close to that of CODER. The most noteworthy improvement is in the 0 vs 1 task, where models have to distinguish synonyms from very closely related pairs, such as “Type 1 Diabetes” and “Type 2 Diabetes”. We also report the results using Spearman’s rank correlation coefficient in Table 4, and the conclusions are similar.

We also see improvements in all relatedness tasks (Table 5). For example, the AUC in the “Causative” category improves from 91.9% to 98.1% in comparison with the second best embedding, generated by CODER. Similar improvements are also observed in detecting “May Cause/Treat” and “Method of” relations. Overall, the average performance of the model in detecting the 28 most common relationships improves from 88.6% to 93.7% in comparison with the next best embedding. This demonstrates a substantial improvement in our ability to capture more nuanced information. It is worth noting that HiPrBERT’s performance on Cadec is on par with other existing models, indicating that our model does not compromise on performance in similarity tasks while achieving improvements in other areas. Lastly, the comparison results based on Spearman’s correlation (Table 4) and the precision-recall curve (not reported) are similar.

5 Discussion

Our model is one of the first to include terms from medical term hierarchies (PheCode, LOINC, RxNorm), and these trees contain terms critical for structured EHR data. Existing methods such as CODER and SapBERT do not train on this specific vocabulary. By improving embeddings for these strings in particular, our embeddings have the potential to integrate better with structured EHR data, enhancing the representation of patients. This in turn can lead to improvements in downstream tasks such as extracting prediction features and patient clustering.

The use of induced distance from hierarchies helps improve model performance, and can be expanded in several ways. One may consider more pair types within each hierarchy; for example, the distance metric can be expanded to include grandparent-child and uncle-nephew pairs. Alternatively, the distance metric can take into account the global structure of the tree. Currently, pairwise resemblance only takes into account the local information around the term, looking only at immediate connections. However, nodes closer to the root of a hierarchy typically represent broader concepts that are further apart, whereas nodes closer to the leaves represent more specific concepts that are closer together. This can either be explicitly coded into the training process or, ideally, learnt on the fly. In addition, different hierarchies will naturally differ in structure and therefore in pairwise distance, so this adjustment would be hierarchy specific. Our simple choice here is for computational convenience and can be improved.

6 Conclusion

In this paper, we present a novel method for training embeddings that better discriminate pairs of different similarity by taking advantage of additional hierarchical structures. Operationally, the method only requires ordering the term-term similarities, which is much simpler than assigning quantitative margins between similarities as in the rank loss. The new model outperforms existing ones at separating weakly related terms from closely related terms without sacrificing performance on other metrics.

Table 3: Tree Results (ROC AUC)

Model	Distance Categories
	(0, 1)	(0, 2)	(0, 3)	(1, 2)	(1, 3)	(2, 3)
SapBERT⁰	0.636	0.779	0.967	0.702	0.971	0.905
CODER	0.599	0.737	0.977	0.679	0.979	0.931
PubMedBERT¹	0.539	0.609	0.744	0.575	0.721	0.653
BioBERT	0.497	0.576	0.599	0.586	0.611	0.523
BioGPT	0.571	0.667	0.795	0.604	0.754	0.662
DistilBERT	0.544	0.631	0.729	0.589	0.694	0.614
HiPrBERT	0.657	0.796	0.986	0.704	0.977	0.936

⁰: Representation trained after removing evaluation data
¹: Initial representation for our model training

Table 4: Tree Results (Spearman’s Correlation)

Model	Distance Categories
	(0, 1)	(0, 2)	(0, 3)	(1, 2)	(1, 3)	(2, 3)
SapBERT	0.198	0.483	0.800	0.294	0.627	0.693
CODER	0.144	0.411	0.817	0.262	0.638	0.738
PubMedBERT	0.057	0.190	0.417	0.109	0.294	0.262
BioBERT	-0.004	0.131	0.169	0.125	0.148	0.039
BioGPT	0.104	0.289	0.506	0.152	0.338	0.277
DistilBERT	0.065	0.227	0.393	0.130	0.259	0.196
HiPrBERT	0.229	0.512	0.832	0.298	0.635	0.747

Table 5: Other Tasks

Model	Relatedness Tasks
	Causative	May Cause/Treat	Method of	Top Rel.⁰	Cadec (Top 1/3)
SapBERT	0.889	0.791	0.880	0.869	0.610/0.801
CODER	0.919	0.750	0.902	0.886	0.585/0.760
PubMedBERT	0.711	0.706	0.554	0.572	0.172/0.238
BioBERT	0.508	0.499	0.503	0.534	0.083/0.116
BioGPT	0.771	0.687	0.582	0.612	0.221/0.299
DistilBERT	0.683	0.605	0.472	0.579	0.177/0.245
HiPrBERT	0.981	0.803	0.923	0.937	0.605/0.805

⁰: Average performance over 28 most common relations in UMLS

References

[1] Z. Yuan, Z. Zhao, and S. Yu, “Coder: Knowledge infused cross-lingual medical term embedding for term normalization,” Journal of Biomedical Informatics, p. 103983, 2020.
[2] S. Zeng, Z. Yuan, and S. Yu, “Automatic biomedical term clustering by learning fine-grained term representations,” in Workshop on Biomedical Natural Language Processing, 2022.
[3] F. Liu, E. Shareghi, Z. Meng, M. Basaldella, and N. Collier, “Self-alignment pretraining for biomedical entity representations,” in North American Chapter of the Association for Computational Linguistics, 2020.
[4] O. Bodenreider, “The unified medical language system (umls): integrating biomedical terminology,” Nucleic Acids Research, vol. 32 Database issue, pp. D267–70, 2004.
[5] K. S. Kalyan and S. Sangeetha, “A Hybrid Approach to Measure Semantic Relatedness in Biomedical Concepts,” Jan. 2021, arXiv:2101.10196 [cs]. [Online]. Available: http://arxiv.org/abs/2101.10196
[6] Z. Yang, S. Wang, B. P. S. Rawat, A. Mitra, and H. Yu, “Knowledge Injected Prompt Based Fine-tuning for Multi-label Few-shot ICD Coding,” Oct. 2022, arXiv:2210.03304 [cs]. [Online]. Available: http://arxiv.org/abs/2210.03304
[7] G. Braemer, “International statistical classification of diseases and related health problems. tenth revision.” World Health Statistics Quarterly. Rapport Trimestriel de Statistiques Sanitaires Mondiales, vol. 41 1, pp. 32–6, 1988.
[8] Y. Liu, P. Liu, D. Radev, and G. Neubig, “BRIO: Bringing order to abstractive summarization,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 2890–2903. [Online]. Available: https://aclanthology.org/2022.acl-long.207
[9] Z. Yuan, H. Yuan, C. Tan, W. Wang, S. Huang, and F. Huang, “Rrhf: Rank responses to align language models with human feedback without tears,” 2023.
[10] Y. LeCun, S. Chopra, R. Hadsell, A. Ranzato, and F. J. Huang, “A tutorial on energy-based learning,” 2006.
[11] S. J. Nelson, K. Zeng, J. Kilbourne, T. Powell, and R. Moore, “Normalized names for clinical drugs: Rxnorm at 6 years,” Journal of the American Medical Informatics Association : JAMIA, vol. 18 4, pp. 441–8, 2011.
[12] P. Wu, A. Gifford, X. Meng, X. Li, H. Campbell, T. Varley, J. Zhao, R. J. Carroll, L. A. Bastarache, J. C. Denny, E. Theodoratou, and W.-Q. Wei, “Mapping icd-10 and icd-10-cm codes to phecodes: Workflow development and initial evaluation,” JMIR Medical Informatics, vol. 7, 2019.
[13] P. Dotson, “Cpt® codes: What are they, why are they necessary, and how are they developed?” Advances in Wound Care, vol. 2 10, pp. 583–587, 2013.
[14] C. J. McDonald, S. M. Huff, J. G. Suico, G. Hill, D. Leavelle, R. D. Aller, A. W. Forrey, K. Mercer, G. J. E. DeMoor, J. Hook, W. G. Williams, J. Case, and P. Maloney, “Loinc, a universal standard for identifying laboratory observations: a 5-year update.” Clinical Chemistry, vol. 49 4, pp. 624–33, 2003.
[15] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” ArXiv, vol. abs/1310.4546, 2013.
[16] A. Beam, B. Kompa, A. Schmaltz, I. Fried, G. M. Weber, N. P. Palmer, X. Shi, T. Cai, and I. S. Kohane, “Clinical concept embeddings learned from massive sources of multimodal medical data,” Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, vol. 25, pp. 295 – 306, 2018.
[17] X. Wang, X. Han, W. Huang, D. Dong, and M. R. Scott, “Multi-similarity loss with general pair weighting for deep metric learning,” CoRR, vol. abs/1904.06627, 2019. [Online]. Available: http://arxiv.org/abs/1904.06627
[18] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An Imperative Style, High-Performance Deep Learning Library,” in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett, Eds. Curran Associates, Inc., 2019, pp. 8024–8035. [Online]. Available: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
[19] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush, “Transformers: State-of-the-Art Natural Language Processing.” Association for Computational Linguistics, Oct. 2020, pp. 38–45. [Online]. Available: https://www.aclweb.org/anthology/2020.emnlp-demos.6
[20] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon, “Domain-specific language model pretraining for biomedical natural language processing,” CoRR, vol. abs/2007.15779, 2020. [Online]. Available: https://arxiv.org/abs/2007.15779
[21] I. Loshchilov and F. Hutter, “Fixing weight decay regularization in adam,” CoRR, vol. abs/1711.05101, 2017. [Online]. Available: http://arxiv.org/abs/1711.05101