Multiple Word Embeddings for Increased Diversity of Representation
Abstract
Most state-of-the-art models in natural language processing (NLP) are neural models built on top of large, pre-trained, contextual language models that generate representations of words in context and are fine-tuned for the task at hand. The improvements afforded by these “contextual embeddings” come with a high computational cost. In this work, we explore a simple technique that substantially and consistently improves performance over a strong baseline with negligible increase in run time. We concatenate multiple pre-trained embeddings to strengthen our representation of words. We show that this concatenation technique works across many tasks, datasets, and model types. We analyze aspects of pre-trained embedding similarity and vocabulary coverage and find that the representational diversity between different pre-trained embeddings is the driving force of why this technique works. We provide open source implementations of our models in both TensorFlow and PyTorch.
1 Introduction
Much of the recent work in NLP has focused on better feature representations via contextual word embeddings Peters et al. (2018, 2017); Radford et al. (2018); Akbik et al. (2018); Devlin et al. (2019). These models vary in architecture and pre-training objective but they all encode the input based on the surrounding context in some way. These papers normally compare to baselines like a bidirectional LSTM-CRF (biLSTM-CRF) where words are represented by a single pre-trained word embedding.
Peters et al. (2018, 2017) and Akbik et al. (2018) pre-train large language models based on LSTMs. Task-specific architectures are then built on top of these pre-trained models. Peters et al. (2018) introduce a technique for extracting word representations as a linear combination of layers in the pre-trained model. Gradient updates are only applied to this weighting factor, which simplifies training to some extent, but forward propagation is still required through the full network, which makes the model slow to train and evaluate.
Radford et al. (2018), followed by Devlin et al. (2019), pre-train deep transformers Vaswani et al. (2017) on massive corpora. Both use a simple output layer on top of the pre-trained model and tune the parameters of the whole model. In this case, training requires the forward and backward pass of the entire pre-trained model, which has a significant impact on model size and speed. Devlin et al. (2019) used specialized hardware, which may be unrealistic for many inference scenarios.
The prevailing wisdom is that, because these pre-trained models are contextual, they can produce different representations of the same word in different contexts. For example, a polysemous word can be represented by different vectors when its context suggests different senses, while a context-independent word vector must represent a mixture of all of the word's senses. The majority of NLP models have a similar “contextualization” step, typically done via a biLSTM, convolutional layers, or self-attention, but it is learned only from a smaller, task-specific corpus, in contrast to the massive corpora used to train contextual embeddings.
Contextual embeddings and transfer learning architectures are slow to train and evaluate, which may make them infeasible for many types of deployments. Using multiple pre-trained embeddings trained on different datasets, we can exploit the biases of those datasets, which result in different representations of the same word. By combining these embeddings, we can create richer representations of words without the high computational overhead required by contextual alternatives. We find that the concatenation of multiple pre-trained word embeddings shows consistent improvements over single embeddings, yielding results much closer to those of contextual alternatives.
2 Experiments & Results
Table 1: Test performance (mean, standard deviation, minimum, and maximum across runs) for single pre-trained embeddings versus their concatenation.

Task | Dataset | Model | Embeddings | mean | std | min | max
---|---|---|---|---|---|---|---
NER | CoNLL | biLSTM-CRF | 6B | 91.12 | 0.21 | 90.62 | 91.37
NER | CoNLL | biLSTM-CRF | Senna | 90.48 | 0.27 | 90.02 | 90.81
NER | CoNLL | biLSTM-CRF | 6B, Senna | 91.61 | 0.25 | 91.15 | 92.00
NER | WNUT-17 | biLSTM-CRF | 27B | 39.20 | 0.71 | 37.98 | 40.33
NER | WNUT-17 | biLSTM-CRF | 27B, w2v-30M | 39.52 | 0.83 | 38.09 | 40.39
NER | WNUT-17 | biLSTM-CRF | 27B, w2v-30M, 840B | 40.33 | 1.13 | 38.38 | 41.99
NER | OntoNotes | biLSTM-CRF | 6B | 87.02 | 9.15 | 86.75 | 87.24
NER | OntoNotes | biLSTM-CRF | 6B, Senna | 87.41 | 0.16 | 87.14 | 87.74
Slot Filling | Snips | biLSTM-CRF | 6B | 95.84 | 0.29 | 95.39 | 96.21
Slot Filling | Snips | biLSTM-CRF | GN | 95.28 | 0.41 | 94.51 | 95.81
Slot Filling | Snips | biLSTM-CRF | 6B, GN | 96.04 | 0.28 | 95.39 | 96.35
POS | TW-POS | biLSTM-CRF | w2v-30M | 89.21 | 0.28 | 88.72 | 89.74
POS | TW-POS | biLSTM-CRF | 27B | 89.63 | 0.19 | 89.35 | 89.92
POS | TW-POS | biLSTM-CRF | 27B, w2v-30M | 90.35 | 0.20 | 89.99 | 90.60
POS | TW-POS | biLSTM-CRF | 27B, w2v-30M, 840B | 90.75 | 0.14 | 90.53 | 91.02
Classification | SST2 | LSTM | 840B | 88.39 | 0.45 | 87.42 | 89.07
Classification | SST2 | LSTM | GN | 87.58 | 0.54 | 86.16 | 88.19
Classification | SST2 | LSTM | 840B, GN | 88.57 | 0.44 | 87.59 | 89.24
Classification | AG-NEWS | LSTM | 840B | 92.53 | 0.45 | 87.42 | 89.07
Classification | AG-NEWS | LSTM | GN | 92.20 | 0.18 | 91.80 | 92.40
Classification | AG-NEWS | LSTM | 840B, GN | 92.60 | 0.20 | 92.30 | 92.86
Classification | Snips | Conv | 840B | 97.47 | 0.33 | 97.01 | 97.86
Classification | Snips | Conv | GN | 97.40 | 0.27 | 97.00 | 97.86
Classification | Snips | Conv | 840B, GN | 97.63 | 0.52 | 97.00 | 98.29
We use three sequential prediction tasks to test the performance of our concatenated embeddings: NER (CoNLL 2003 Tjong Kim Sang and De Meulder (2003), WNUT-17 Derczynski et al. (2017), and OntoNotes Hovy et al. (2006)), slot filling (Snips Coucke et al. (2018)), and POS tagging (TW-POS Gimpel et al. (2011)). We also show results on three classification datasets: SST2 Socher et al. (2013), Snips intent classification Coucke et al. (2018), and AG-News (http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html). For each (task, dataset) pair we use the most common embedding in the literature; for example, GloVe embeddings were used for CoNLL 2003 in Ma and Hovy (2016) and Senna embeddings in Chiu and Nichols (2016); Peters et al. (2018). Embeddings were also chosen based on how well the embedding training data fit the task, e.g., we used GloVe vectors trained on Twitter for the Twitter part-of-speech tagging task. Once we developed the tests in Section 3 for which embeddings work well together, we checked whether there were additional embedding combinations worth trying but did not find any. For all tagging tasks, a biLSTM-CRF model with convolutional character compositional inputs, following Ma and Hovy (2016), is used. For all classification tasks, a single-layer LSTM model is used except for the Snips classification dataset, where a convolutional word-based model Kim (2014) is used. The hyperparameters are omitted here for brevity but can be found in our implementation.
The results are presented in Table 1. 6B, 27B, and 840B are well-known, pre-trained GloVe embeddings Pennington et al. (2014) distributed via the authors' site; w2v-30M Pressel et al. (2018) and GN Mikolov et al. (2013) are Word2Vec embeddings trained on a corpus of 30 million tweets and Google News respectively; and the Senna embeddings were trained by Collobert et al. (2011).
We leverage multiple pre-trained embeddings in a model by creating one embedding table per pre-trained embedding. Each input token is embedded into each vector space and the resulting vectors are concatenated into a single vector. This means that it is possible for a type to be unattested in one pre-trained embedding vocabulary but present in the other, in which case a pre-trained vector from one embedding is concatenated with a randomly initialized vector from the other embedding space.
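The core of the technique is a simple lookup-and-concatenate step. The following PyTorch sketch illustrates it over a shared vocabulary; the module name and the random stand-in weight matrices are ours for illustration and are not taken from the MEAD/Baseline implementation.

```python
import torch
import torch.nn as nn


class ConcatEmbeddings(nn.Module):
    """Embed each token into several pre-trained spaces and concatenate.

    `pretrained` is a list of float tensors, one per pre-trained embedding,
    each of shape [vocab_size, dim_i], indexed by a shared vocabulary. Rows
    for types missing from a given pre-trained vocabulary are expected to be
    randomly initialized by the caller before being passed in.
    """

    def __init__(self, pretrained, freeze=False):
        super().__init__()
        self.tables = nn.ModuleList(
            [nn.Embedding.from_pretrained(weights, freeze=freeze)
             for weights in pretrained]
        )
        self.output_dim = sum(t.embedding_dim for t in self.tables)

    def forward(self, token_ids):
        # token_ids: [batch, time] indices into the shared vocabulary.
        vectors = [table(token_ids) for table in self.tables]
        return torch.cat(vectors, dim=-1)  # [batch, time, sum of dims]


# Example: a 100d and a 50d embedding yield 150d word representations.
glove_like = torch.randn(20000, 100)  # stand-in for real pre-trained weights
senna_like = torch.randn(20000, 50)
embed = ConcatEmbeddings([glove_like, senna_like])
words = torch.randint(0, 20000, (8, 25))
print(embed(words).shape)  # torch.Size([8, 25, 150])
```

The concatenated output then feeds the usual biLSTM-CRF, LSTM, or convolutional layers unchanged.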
As hypothesized, we see improvements across tasks, datasets, and model architectures when using multiple embeddings.
Models using the concatenation of pre-trained and randomly initialized embeddings do % worse on average compared to models that only use a single pre-trained embedding. This demonstrates that the performance gains are from the combination of different pre-trained embeddings rather than the increase in the number of parameters in the model. In some cases we were able to improve results further by adding several sets of additional embeddings.
Table 2: Relative change in performance (%) from using multiple pre-trained embeddings on internal datasets.

Task | Domain | Relative Change (%)
---|---|---
NER | General NER | 0.51
Slot Filling | Automotive | 0.14
Slot Filling | Cyber Security | 0.06
Slot Filling | Customer Service | 0.34
Intent | Automotive | 0.52
Intent | Cyber Security | 0.03
Intent | Customer Service | 0.16
Table 3: Relative change in performance (%) on the public datasets, provided to help frame the numbers in Table 2.

Task | Dataset | Relative Change (%)
---|---|---
NER | CoNLL | 0.54
NER | WNUT-17 | 2.88
NER | OntoNotes | 0.45
Slot Filling | Snips | 0.21
POS | TW-POS | 1.25
Classification | SST2 | 0.20
Classification | AG-NEWS | 0.08
Classification | Snips | 0.16
Table 2 summarizes the results of using the multiple-embedding approach on internal datasets. These datasets are drawn from the tasks defined earlier and span a variety of specialized domains. Due to the nature of these datasets, the results are presented as the relative change in performance. Table 3 is provided to help frame the relative performance numbers from the internal datasets.
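The relative change is the percentage change of the concatenated-embedding score over the single-embedding score; for example, the CoNLL entry in Table 3 follows from the corresponding rows of Table 1, as in this trivial sketch:

```python
def relative_change(single, combined):
    """Relative change in performance (%) from concatenating embeddings."""
    return 100.0 * (combined - single) / single


# CoNLL NER from Table 1: 6B alone (91.12) vs. 6B, Senna (91.61).
print(round(relative_change(91.12, 91.61), 2))  # 0.54, the CoNLL row of Table 3
```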
The models were trained with MEAD/Baseline Pressel et al. (2018), an open-source framework for developing, training, and deploying NLP models.
3 Analysis
Table 4: Ablation results isolating the effects of model capacity, vocabulary coverage, and representational diversity.

Dataset | Embeddings | mean | std | max
---|---|---|---|---
SST2 | GN | 87.58 | 0.54 | 88.19
SST2 | GN + Random init | 87.62 | 0.22 | 88.64
SST2 | GN + 840B complement to GN | 87.72 | 0.23 | 88.02
SST2 | GN + 840B matched to GN | 88.53 | 0.55 | 89.45
SST2 | GN + 840B | 88.57 | 0.44 | 89.24
CoNLL | 6B | 91.12 | 0.21 | 91.37
CoNLL | 6B + Random init | 90.77 | 0.17 | 91.11
CoNLL | 6B + Senna complement to 6B | 90.73 | 0.29 | 91.19
CoNLL | 6B + Senna matched to 6B | 91.47 | 0.18 | 91.78
CoNLL | 6B + Senna | 91.61 | 0.25 | 92.00
Table 5: Nearest-neighbor overlap (%) with GloVe 6B (100d), percentage of unique CoNLL types attested in each vocabulary, and CoNLL performance when each embedding is combined with GloVe 6B.

Embeddings | Overlap (train) | Overlap (dev) | Attested (train) | Attested (dev) | Performance (mean) | Performance (std)
---|---|---|---|---|---|---
Senna | 18.9 | 20.8 | 74.3 | 80.3 | 91.610 | 0.247
GloVe twitter 27B | 24.9 | 27.2 | 68.1 | 76.1 | 91.098 | 0.135
GloVe 840B | 41.7 | 40.6 | 83.2 | 88.5 | 91.011 | 0.228
GloVe 42B | 45.5 | 45.3 | 90.4 | 93.8 | 91.163 | 0.146
GoogleNews | 25.2 | 26.8 | 55.9 | 65.1 | 90.948 | 0.180
There are three logical places where the observed improvements could come from. 1) The use of multiple pre-trained embeddings creates a slightly larger model, increasing the network capacity—the embeddings are larger and therefore the projection from the embeddings to the first layer of the model will also be slightly bigger. 2) The use of a second pre-trained embedding increases the vocabulary size and more words are attested. A word that has a pre-trained representation will start the model in a better spot than a randomly initialized representation. 3) The second set of pre-trained embeddings gives a different perspective of the words. Most pre-trained embeddings are trained on different data and encode different biases and senses into the embedding that reflect the quirks and unique contexts found in the pre-training data. This representational diversity will allow a model to capitalize on different senses, or the combination of senses, that would not be present when using a single embedding.
In order to tease apart which of these factors are at play we designed a series of models that aim to isolate each effect and report results in Table 4. First, we train a model that uses a single pre-trained embedding and a second set of vectors that are initialized randomly. If the main improvement is due to increased model capacity this configuration should perform well. The second model uses a special version of the second pre-trained embedding where we remove all the words that already appear in the original pre-trained vocabulary. In this second set of embeddings, randomly initialized vectors are used for the words that are covered in the original vocabulary in order to keep the embeddings size consistent with the previous model. If the main reason for improvement is the increased vocabulary coverage, this model should perform well. The final version of the model also uses a customized version of the second pre-trained embedding. This time we only keep embeddings that are already represented in the original vocabulary. This is designed to test if the main source of improvement is the difference in the representations each pre-trained embedding brings to the table.
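A rough sketch of how these two variants of the second embedding can be built is shown below; the function name, the uniform initialization range, and the treatment of the non-overlapping words in the "matched" variant are illustrative assumptions rather than a description of our exact code.

```python
import numpy as np


def ablation_variants(primary_vocab, second_vectors, dim, scale=0.25, seed=1234):
    """Build the 'complement' and 'matched' variants of a second embedding.

    complement: keep the second embedding's vector only for words *not* in the
        primary vocabulary; overlapping words get random vectors so the table
        keeps the same size.
    matched: keep the vector only for words that *are* in the primary
        vocabulary; the remaining words get random vectors (an assumption made
        here to keep the table size fixed).
    """
    rng = np.random.RandomState(seed)
    complement, matched = {}, {}
    for word, vec in second_vectors.items():
        rand = rng.uniform(-scale, scale, size=dim).astype(np.float32)
        if word in primary_vocab:
            complement[word], matched[word] = rand, vec
        else:
            complement[word], matched[word] = vec, rand
    return complement, matched
```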
From our ablation studies using the above variations on both the SST2 and CoNLL datasets, we find that the most important factor is the representational diversity of the pre-trained embeddings. This dovetails nicely with our observation that embeddings trained on distinct datasets tend to perform well together. To further test this hypothesis, we look at the “similarity” of various pre-trained embeddings. We define “similarity” using the overlap of nearest neighbors in the embedding space, as in Wendlandt et al. (2018). Specifically, we use the average Jaccard overlap percentage between the 10 nearest neighbors for each of the 200 most frequent words in the dataset. Table 5 shows the overlap of different embeddings with the GloVe 6B 100-dimensional embedding and how each combination affects performance. As can be seen, Senna has the lowest overlap and yields the biggest performance gain.
However, this does not hold for the GoogleNews embedding, which also has a low overlap yet actually causes a drop in performance when combined. This can be explained by coverage: the percentage of unique types in the data that are attested in the pre-trained vocabulary. That number is surprisingly low for GoogleNews, so its representations are used too rarely to help and instead hurt performance.
In summary, one should look for two characteristics when combining embeddings: the word representations should have low “similarity” and the unique types in the dataset should be highly attested in both pre-trained vocabularies.
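Both checks are cheap to run before training. Below is a minimal sketch, assuming word vectors are stored as unit-normalized arrays keyed by word and using brute-force nearest-neighbor search; the function names are ours.

```python
def knn(word, emb, k=10):
    """k nearest neighbors of `word` by cosine similarity (vectors assumed
    unit-normalized); brute force is fine for a few hundred query words."""
    q = emb[word]
    scores = {w: float(q @ v) for w, v in emb.items() if w != word}
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def similarity(emb_a, emb_b, frequent_words, k=10):
    """Average Jaccard overlap (%) of k-NN sets between two embedding spaces,
    computed over the most frequent words of the dataset."""
    overlaps = []
    for w in frequent_words:
        if w in emb_a and w in emb_b:
            a, b = knn(w, emb_a, k), knn(w, emb_b, k)
            overlaps.append(len(a & b) / len(a | b))
    return 100.0 * sum(overlaps) / len(overlaps)


def coverage(dataset_tokens, pretrained_vocab):
    """Percentage of unique types in the data attested in a vocabulary."""
    types = set(dataset_tokens)
    return 100.0 * sum(t in pretrained_vocab for t in types) / len(types)
```

Low `similarity` and high `coverage` for both embeddings are the conditions under which we observed gains.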
4 Conclusion
Recent large-scale, contextual, pre-trained models are exciting but relatively slow to train and run. We propose a simple, lightweight technique: concatenation of pre-trained embeddings. We show that this technique has a significant impact on error reduction and a negligible effect on speed.
However, the concatenation of any two arbitrary pre-trained embeddings is not guaranteed to work well. From our analysis, we are able to suggest a recipe for finding an effective combination: there should be a high degree of coverage of the dataset's unique types in each of the pre-trained embedding vocabularies, and the word vectors should exhibit representational diversity. In future work, we intend to try other methods of combining embeddings that remain computationally cheap. We also plan to find more principled ways to quantify the diversity of pre-trained embeddings, which could suggest ways to induce representational diversity into the embedding pre-training procedure itself.
References
- Akbik et al. (2018) Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual String Embeddings for Sequence Labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
- Chiu and Nichols (2016) Jason P.C. Chiu and Eric Nichols. 2016. Named Entity Recognition with Bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357–370.
- Collobert et al. (2011) Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. 2011. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12:2493–2537.
- Coucke et al. (2018) Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, Maël Primet, and Joseph Dureau. 2018. Snips Voice Platform: an Embedded Spoken Language Understanding System for Private-by-design Voice Interfaces. arXiv preprint, arXiv:1805.10190.
- Derczynski et al. (2017) Leon Derczynski, Eric Nichols, Marieke van Erp, and Nut Limsopatham. 2017. Results of the WNUT2017 Shared Task on Novel and Emerging Entity Recognition. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 140–147, Copenhagen, Denmark. Association for Computational Linguistics.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Gimpel et al. (2011) Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for twitter: annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2, HLT ’11, pages 42–47, Stroudsburg, PA, USA. Association for Computational Linguistics.
- Hovy et al. (2006) Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: The 90% Solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, NAACL-Short ’06, pages 57–60, Stroudsburg, PA, USA. Association for Computational Linguistics.
- Kim (2014) Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar. Association for Computational Linguistics.
- Ma and Hovy (2016) Xuezhe Ma and Eduard Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074, Berlin, Germany. Association for Computational Linguistics.
- Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg S. Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint, arXiv:1301.3781.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
- Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
- Peters et al. (2017) Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised Sequence Tagging with Bidirectional Language Models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1756–1765.
- Pressel et al. (2018) Daniel Pressel, Sagnik Ray Choudhury, Brian Lester, Yanjie Zhao, and Matt Barta. 2018. Baseline: A Library for Rapid Modeling, Experimentation and Development of Deep Learning Algorithms Targeting NLP. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 34–40. Association for Computational Linguistics.
- Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving Language Understanding by Generative Pre-Training. Technical report, OpenAI.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642. Association for Computational Linguistics.
- Tjong Kim Sang and De Meulder (2003) Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL ’03, pages 142–147, Stroudsburg, PA, USA. Association for Computational Linguistics.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
- Wendlandt et al. (2018) Laura Wendlandt, Jonathan K. Kummerfeld, and Rada Mihalcea. 2018. Factors Influencing the Surprising Instability of Word Embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2092–2102, New Orleans, Louisiana. Association for Computational Linguistics.
Appendix A Reproducibility
A.1 Hyperparameters
MEAD/Baseline is a configuration-file-driven model training framework. All hyperparameters are fully specified in the configuration files included with the source code for our experiments.
A.2 Computational Resources
All models were trained on a single NVIDIA 1080Ti. While multiple GPUs were used to train many models in parallel, both to facilitate testing on many datasets and to estimate the variability of the method, any individual model can easily be trained on a single GPU.
A.3 Evaluation
Entity-level F1 is used for NER and slot filling. Entities are first constructed from the token-level labels and compared to the gold entities; an entity is counted as correct only when both its type and its boundaries match, while a mismatch in either is an error. The F1 score is then calculated over these entities. Accuracy is used for classification and part-of-speech tagging, and is defined as the proportion of correct elements out of all elements. In classification, a single example is an element; in part-of-speech tagging, each token is an element, so accuracy is the number of correct tokens divided by the total number of tokens in the dataset. We use the evaluation code that ships with the framework we use, MEAD/Baseline, which we have bundled with the source code of our experiments.
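The sketch below illustrates the idea of entity-level F1 over IOBES tags (exact type-and-boundary matching, micro-averaged). It glosses over some ill-formed tag sequences; the bundled MEAD/Baseline evaluation code is what we actually use.

```python
def iobes_spans(tags):
    """Extract (type, start, end) entity spans from an IOBES tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("S-"):
            spans.append((tag[2:], i, i + 1))
            start, etype = None, None
        elif tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("E-") and start is not None and etype == tag[2:]:
            spans.append((etype, start, i + 1))
            start, etype = None, None
        elif tag.startswith("I-") and etype == tag[2:]:
            continue  # inside an open span of the same type
        else:  # "O", or an ill-formed sequence, closes any open span
            start, etype = None, None
    return set(spans)


def entity_f1(gold_seqs, pred_seqs):
    """Micro-averaged entity-level F1 over lists of tag sequences."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = iobes_spans(gold), iobes_spans(pred)
        tp += len(g & p)  # correct: type and boundaries both match
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```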
Table 6: Number of parameters for each model configuration.

Task | Dataset | Model | Embeddings | Number of parameters
---|---|---|---|---
NER | CoNLL | biLSTM-CRF | 6B | 3,234,440
NER | CoNLL | biLSTM-CRF | Senna | 1,810,690
NER | CoNLL | biLSTM-CRF | 6B, Senna | 4,658,190
NER | WNUT-17 | biLSTM-CRF | 27B | 3,849,632
NER | WNUT-17 | biLSTM-CRF | 27B, w2v-30M | 6,499,532
NER | WNUT-17 | biLSTM-CRF | 27B, w2v-30M, 840B | 12,090,032
NER | OntoNotes | biLSTM-CRF | 6B | 5,569,382
NER | OntoNotes | biLSTM-CRF | 6B, Senna | 7,673,632
Slot Filling | Snips | biLSTM-CRF | 6B | 1,819,466
Slot Filling | Snips | biLSTM-CRF | GN | 4,567,066
Slot Filling | Snips | biLSTM-CRF | 6B, GN | 5,940,866
POS | TW-POS | biLSTM-CRF | w2v-30M | 1,241,332
POS | TW-POS | biLSTM-CRF | 27B | 1,788,982
POS | TW-POS | biLSTM-CRF | 27B, w2v-30M | 2,908,132
POS | TW-POS | biLSTM-CRF | 27B, w2v-30M, 840B | 5,408,332
Classification | SST2 | LSTM | 840B | 6,456,702
Classification | SST2 | LSTM | GN | 6,456,702
Classification | SST2 | LSTM | 840B, GN | 12,109,002
Classification | AG-NEWS | LSTM | 840B | 20,842,604
Classification | AG-NEWS | LSTM | GN | 20,842,604
Classification | AG-NEWS | LSTM | 840B, GN | 41,522,804
Classification | Snips | Conv | 840B | 4,003,807
Classification | Snips | Conv | GN | 4,003,807
Classification | Snips | Conv | 840B, GN | 8,005,207
A.4 Dataset Information
Table 7: Dataset sizes (number of examples and tokens) per split.

Dataset | Count | Train | Dev | Test | Total
---|---|---|---|---|---
CoNLL | Examples | 14,987 | 3,466 | 3,684 | 22,137
CoNLL | Tokens | 204,567 | 51,578 | 46,666 | 302,811
WNUT-17 | Examples | 3,394 | 1,009 | 1,287 | 5,690
WNUT-17 | Tokens | 62,730 | 15,733 | 23,394 | 101,857
OntoNotes | Examples | 59,924 | 8,528 | 8,262 | 76,714
OntoNotes | Tokens | 1,088,503 | 147,724 | 152,728 | 1,388,955
Snips | Examples | 13,084 | 700 | 700 | 14,484
Snips | Tokens | 117,700 | 6,384 | 6,354 | 130,438
TW-POS | Examples | 1,000 | 327 | 500 | 1,827
TW-POS | Tokens | 14,619 | 4,823 | 7,152 | 26,594
SST2 | Examples | 76,961 | 872 | 1,821 | 79,654
SST2 | Tokens | 717,127 | 17,046 | 35,023 | 769,196
AG-NEWS | Examples | 110,000 | 10,000 | 7,600 | 127,600
AG-NEWS | Tokens | 4,806,909 | 433,659 | 329,617 | 5,570,185
Relevant information about the datasets can be found in Table 7. The majority of the data is used as distributed, except that we convert the NER and slot-filling datasets to the IOBES format (a minimal conversion sketch follows the dataset overviews below). All public datasets used are included in the supplementary material. A quick overview of each dataset follows:
CoNLL: An NER dataset based on news text. We converted the IOB labels into the IOBES format. There are 4 entity types, MISC, LOC, PER, and ORG.
WNUT-17: An NER dataset of new and emerging entities based on noisy user text. We converted the BIO labels into the IOBES format. There are 6 entity types, corporation, creative-work, group, location, person, and product.
OntoNotes: A much larger NER dataset. We converted the labels into the IOBES format. There are 18 entity types, CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, and WORK_OF_ART.
Snips: A slot-filling dataset focusing on commands one would give a virtual assistant. We converted the dataset from its original format of two associated files, one containing surface terms and one containing labels, to the more standard CoNLL file format, and converted the labels to the IOBES format. There are 39 entity types, album, artist, best_rating, city, condition_description, condition_temperature, country, cuisine, current_location, entity_name, facility, genre, geographic_poi, location_name, movie_name, movie_type, music_item, object_location_type, object_name, object_part_of_series_type, object_select, object_type, party_size_description, party_size_number, playlist, playlist_owner, poi, rating_unit, rating_value, restaurant_name, restaurant_type, served_dish, service, sort, spatial_relation, state, timeRange, track, and year.
TW-POS: A Twitter part-of-speech dataset. There are 25 parts of speech, !, #, $, &, ,, @, A, D, E, G, L, M, N, O, P, R, S, T, U, V, X, Y, Z, ^, and ~.
SST2: A binary sentiment analysis dataset based on movie reviews. We use the version where the training data is made up of phrases.
AG-NEWS: A four-class text classification dataset for categorizing news data based on the 4 most common categories. There is no standardized train and development split (there is a defined test set), so we created our own split, which is included in the supplementary material.
Snips-Intent: The intent classification portion of the Snips dataset. Again, the intents pertain to requests one would make to a virtual assistant. There are 7 intents, SearchScreeningEvent, PlayMusic, AddToPlaylist, BookRestaurant, RateBook, SearchCreativeWork, and GetWeather.
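For reference, the sketch below shows a minimal IOBES conversion of the kind mentioned above, assuming BIO/IOB2-style input tags (the original CoNLL IOB1 scheme requires an extra normalization step not shown here); it is illustrative rather than the exact code we used.

```python
def iob_to_iobes(tags):
    """Convert a BIO/IOB2 tag sequence to IOBES.

    A B- token with no following I- of the same type becomes S-, and the
    last I- token of an entity becomes E-.
    """
    iobes = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag.startswith("B-"):
            iobes.append(tag if nxt == "I-" + tag[2:] else "S-" + tag[2:])
        elif tag.startswith("I-"):
            iobes.append(tag if nxt == "I-" + tag[2:] else "E-" + tag[2:])
        else:
            iobes.append(tag)  # "O" stays unchanged
    return iobes


# e.g. ["B-PER", "I-PER", "O", "B-LOC"] -> ["B-PER", "E-PER", "O", "S-LOC"]
print(iob_to_iobes(["B-PER", "I-PER", "O", "B-LOC"]))
```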