
M. Atif Qureshi
Insight Centre for Data Analytics, University College Dublin, Dublin, Ireland
email: muhammad.qureshi@ucd.ie

Derek Greene
Insight Centre for Data Analytics, University College Dublin, Dublin, Ireland
email: derek.greene@ucd.ie

EVE: Explainable Vector Based Embedding Technique Using Wikipedia

M. Atif Qureshi    Derek Greene
Abstract

We present an unsupervised explainable word embedding technique, called EVE, which is built upon the structure of Wikipedia. The proposed model defines the dimensions of a semantic vector representing a word using human-readable labels, thereby making it readily interpretable. Specifically, each vector is constructed using the Wikipedia category graph structure together with the Wikipedia article link structure. To test the effectiveness of the proposed word embedding model, we consider its usefulness in three fundamental tasks: 1) intruder detection — to evaluate its ability to identify a non-coherent vector from a list of coherent vectors, 2) ability to cluster — to evaluate its tendency to group related vectors together while keeping unrelated vectors in separate clusters, and 3) sorting relevant items first — to evaluate its ability to rank vectors (items) relevant to the query at the top of the result list. For each task, we also propose a strategy to generate a task-specific, human-interpretable explanation from the model. These results demonstrate the overall effectiveness of the explainable embeddings generated by EVE. Finally, we compare EVE with the Word2Vec, FastText, and GloVe embedding techniques across the three tasks, and report improvements over the state-of-the-art.

Keywords:
Distributional semantics · Unsupervised learning · Wikipedia

1 Introduction

Recently, the European Union approved a regulation which requires that citizens have a “right to explanation” in relation to any algorithmic decision-making (Goodman and Flaxman, 2016). According to this regulation, due to come into force in 2018, a user who is subject to an automatic algorithmic decision is entitled to a clear explanation of how that decision was made. With this in mind, we present an explainable decision-making approach to generating word embeddings, called the EVE model. Word embeddings refer to a family of techniques that describe a concept (i.e. a word or phrase) as a vector of real numbers (Pennington et al, 2014). These vectors have been shown to be useful in a variety of applications, such as topic modelling (Liu et al, 2015), information retrieval (Diaz et al, 2016), and document classification (Kusner et al, 2015).

Generally, word embedding vectors are defined by the context in which those words appear (Baroni et al, 2014). Put simply, “a word is characterized by the company it keeps” (Firth, 1957). To generate these vectors, a number of unsupervised techniques have been proposed which includes applying neural networks (Mikolov et al, 2013a, b; Bojanowski et al, 2016), constructing a co-occurrence matrix followed by dimensionality reduction (Levy and Goldberg, 2014; Pennington et al, 2014), probabilistic models (Globerson et al, 2007; Arora et al, 2016), and explicit representation of words appearing in a context (Levy et al, 2014, 2015).

Existing word embedding techniques do not benefit from the rich semantic information present in structured or semi-structured text. Instead they are trained over a large corpus, such as a Wikipedia dump or collection of news articles, where any structure is ignored. However, in this contribution we propose a model that uses the semantic benefits of structured text for defining embeddings. Moreover, to the best of our knowledge, previous word embedding techniques do not provide human-readable vector dimensions, thus are not readily open to human interpretation. In contrast, EVE associates human-readable semantic labels with each dimension of a vector, thus making it an explainable word embedding technique.

To evaluate EVE, we consider its usefulness in the context of three fundamental tasks that form the basis for many data mining activities – discrimination, clustering, and ranking. We argue for the need for objective evaluation strategies, in order to discourage the subjective judgements that can arise in tasks such as finding word analogies (Mikolov et al, 2013a). These tasks are applied to seven annotated datasets which differ in terms of topical content and complexity, where we demonstrate not only the ability of EVE to successfully perform these tasks, but also its ability to generate meaningful explanations to support its outputs.

The remainder of the paper is organized as follows. In Section 2, we provide an overview of research relevant to this work. In Section 3, we provide background material covering the structure of Wikipedia, and then describe the methodology of the EVE model in detail. In Section 4, we provide a detailed experimental evaluation on the three tasks mentioned above, and also demonstrate the novelty of the EVE model in generating explanations. Finally, in Section 5, we conclude the paper with further discussion and future directions. The relevant dataset and source code for this work can be publicly accessed at http://mlg.ucd.ie/eve.

2 Related Work

Assessing the similarity between words is a fundamental problem in natural language processing (NLP). Research in this area has largely proceeded along two directions: 1) techniques built upon the distributional hypothesis, whereby contextual information serves as the main source for word representations; 2) techniques built upon knowledge bases, whereby encyclopedic knowledge is utilized to determine word associations. In this section, we provide an overview of these directions, along with a description of some works attempting to bridge the gap between (1) and (2) through knowledge-powered word embeddings. Finally, we conclude the section with an explanation of the novelty of EVE.

2.1 From Distributional Semantic Models to Word Embeddings

Traditional computational linguistics has shown the utility of contextual information for tasks involving word meanings, in line with the distributional hypothesis which states that “linguistic items with similar distributions have similar meanings” (Harris, 1954). Concretely, distributional semantic models (DSMs) keep count-based vectors corresponding to co-occurring words, followed by a transformation of the vectors via weighting schemes or dimensionality reduction (Baroni and Lenci, 2010; Gallant et al, 1992; Schütze, 1992). A new family of methods, generally known as “word embeddings”, learns word representations in a vector space, where vector weights are set to maximize the probability of the contexts in which the word is observed in the corpus (Bengio et al, 2003; Collobert and Weston, 2008).

A more recent word embedding technique, word2vec, called into question the utility of deep models for learning useful representations, instead proposing continuous bag-of-words (Mikolov et al, 2013a) and skip-gram (Mikolov et al, 2013b) models built upon a simple single-layer architecture. Another recent word embedding technique by Pennington et al (2014) aims to combine the best of both strategies, i.e. the usage of the global corpus statistics available to traditional distributional semantic models together with meaningful linear substructures. Finally, Bojanowski et al (2016) proposed an improvement over word2vec by incorporating character n-grams into the model, thereby accounting for sub-word information.

2.2 Knowledge Base Approaches for Semantic Similarity and Relatedness

Another category of work which measures semantic similarity and relatedness between textual units relies on pre-existing knowledge resources (e.g. thesauri, taxonomies or encyclopedias). Among the works proposed in the literature, the key differences lie in the knowledge base employed, the technique used for measuring semantic distances, and the application domain (Hoffart et al, 2012). Both Budanitsky and Hirst (2006) and Jarmasz (2012) used generalization (‘is a’) relations between words using WordNet-based techniques; Metzler et al (2007) used web search logs for measuring similarity between short texts, and both Strube and Ponzetto (2006) and Gabrilovich and Markovitch (2007) used rich encyclopedic knowledge derived from Wikipedia. Witten and Milne (2008) made use of tf.idf-like measures on Wikipedia links, and Yeh et al (2009) made use of a random walk algorithm over a graph derived from Wikipedia’s hyperlink structure, infoboxes, and categories. More recently, Jiang et al (2015) utilized various aspects of page organization within a Wikipedia article to extract Wikipedia-based feature sets for calculating semantic similarity between concepts. Qureshi (2015) presented a Wikipedia-based semantic relatedness framework which uses Wikipedia categories and their sub-categories, up to a certain depth, to define the relatedness between two Wikipedia articles whose categories overlap with the generated hierarchies.

2.3 Knowledge-Powered Word Embeddings

In order to resolve semantic ambiguities associated with text data, researchers have recently attempted to increase the effectiveness of word embeddings by incorporating knowledge bases when learning vector representations for words (Xu et al, 2014). Two categories of work exist in this direction: 1) encoding entities and relations in a knowledge graph within a vector space, with the goal of knowledge base completion (Bordes et al, 2011; Socher et al, 2013); 2) enriching the learned vector representations with external knowledge (from within a knowledge base) in order to improve the quality of word embeddings (Bian et al, 2014). The works in the first category aim to train neural tensor networks for learning a d-dimensional vector for each entity and relation in a given knowledge base. The works in the second category leverage morphological and semantic knowledge from within knowledge bases as an additional input during the process of learning word representations.

The EVE model relates to the works described in Section 2.1 in the sense that these models all attempt to construct word embeddings in order to characterize relatedness between words. However, like the approaches described in Section 2.2, EVE also benefits from semantic information present in structured text, albeit with the different aim of producing word embeddings. The EVE model is different from knowledge-powered word embeddings in that we produce a more general framework by learning vector representations for concepts rather than limiting the model to entities and/or relations. Furthermore, we utilize the structural organization of entities and concepts within a knowledge base to enrich the word vectors. A relevant recent work-in-progress, called ConVec (Sherkat and Milios, 2017), attempts to learn Wikipedia concept embeddings by making use of anchor texts (i.e. linked Wikipedia articles). In contrast, EVE gives a more powerful representation through the combination of Wikipedia categories and articles. Finally, a key characteristic that distinguishes EVE from all existing models is its expressive mode of explanations, as enabled by the use of Wikipedia categories and articles.

3 The EVE Model

3.1 Background on Wikipedia

Before we present the methodology of the proposed EVE model, we firstly provide background information on Wikipedia, whose underlying graph structure forms the basic building blocks of the model.

Wikipedia is a multilingual, collaboratively-constructed encyclopedia which is actively updated by a large community of volunteer editors. Figure 1 shows the typical Wikipedia graph structure for a set of articles and associated categories. Each article can receive inlinks from other Wikipedia articles, and can also outlink to other Wikipedia articles. In our example, article A1 receives an inlink from A4, and A1 outlinks to A2. In addition, each article can belong to a number of categories, which are used to group together articles on a similar subject. In Fig. 1, A1 belongs to categories C1 and C9. Furthermore, Wikipedia categories are arranged in a category taxonomy, i.e. each category can have an arbitrary number of super-categories and sub-categories. In our case, C5, C6, and C7 are sub-categories of C4, whereas C2 and C3 are super-categories of C4.

To motivate this with a simple real example, the Wikipedia article “Espresso” receives an inlink from the article “Drink” and outlinks to the article “Espresso machine”. The article “Espresso” belongs to several categories, including “Coffee drinks” and “Italian cuisine”. The category “Italian cuisine” itself has a number of super-categories (e.g. “Italian culture”, “Cuisine by nationality”) and sub-categories (e.g. “Italian desserts”, “Pizza”). These Wikipedia categories serve as semantic tags for the articles that belong to them (Zesch and Gurevych, 2007).

Figure 1: An example Wikipedia graph structure for a set of four articles and ten associated categories.

3.2 Methodology

We now present the methodology for generating word embedding vectors with the EVE model. Firstly, a target word or concept is mapped to a single Wikipedia concept article (this can be an exact match, or a partial best match found using an information retrieval algorithm). The vector for this concept is then composed of two distinct types of dimensions. The first type quantifies the association of the concept with other Wikipedia articles, while the second type quantifies the association of the concept with Wikipedia categories. The intuition here is that related words or concepts will share both similar article link associations and similar category associations within the Wikipedia graph, while unrelated concepts will differ with respect to both criteria. The methods used to define these associations are explained next.

3.2.1 Vector dimensions related to Wikipedia articles

We firstly define the strategy for generating vector dimensions corresponding to individual Wikipedia articles. Given the target concept, which is mapped to a Wikipedia article denoted $A_{concept}$, we enumerate all incoming and outgoing links between this article and all other articles. We then create a dimension corresponding to each of those linked articles, where the strength of association for a dimension is defined as the sum of the number of incoming and outgoing links between that article and $A_{concept}$. After creating dimensions for all linked articles, we also add a self-link dimension (this is the most relevant dimension defining the concept, since it corresponds to the concept article itself), where the association of $A_{concept}$ with itself is defined to be twice the maximum count received from the linking articles.

Fig. 2 shows an example of the strategy. In the first step, all inlinks and outlinks are counted for the other non-concept articles (e.g. $A_{concept}$ has 3 inlinks from and 1 outlink to $A_{3}$). In the next step, the self-link score is computed as twice the maximum sum of inlinks and outlinks over all other articles (which is 8 in this case). In the final step, the scores are normalized by dividing by the maximum score (again 8 in this case). In the case of the best-match strategy, where more than one article is mapped to a concept, i.e. $A_{concept1}, A_{concept2}, \ldots$, the computed scores are further scaled by the relevance score of each of the top-k articles, reduced by vector addition, and normalized again. Articles having no links to or from $A_{concept}$ receive a score of 0. Given the sparsity of the Wikipedia link graph, the article-based dimensions are also naturally sparse.

Figure 2: An example of the assignment of the normalized $article_{score}$ for the concept article $A_{concept}$, based on inlink and outlink structure.
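For concreteness, the article-dimension scoring can be sketched as follows. This is an illustrative Python reading of the procedure, not the released implementation; the function name and the placeholder label for the self-link dimension are ours (in the model itself, that dimension carries the concept article's own name).

```python
def article_dimensions(inlinks, outlinks):
    """Compute normalized article-dimension scores for a concept article.

    inlinks / outlinks: dicts mapping a linked article title to the number of
    incoming / outgoing links shared with the concept article.
    """
    scores = {}
    for article in set(inlinks) | set(outlinks):
        scores[article] = inlinks.get(article, 0) + outlinks.get(article, 0)
    # Self-link dimension: twice the maximum count over the linking articles.
    max_count = max(scores.values()) if scores else 1
    scores["<self>"] = 2 * max_count
    # Normalize by the maximum score (which is the self-link score).
    norm = max(scores.values())
    return {article: count / norm for article, count in scores.items()}

# If A_3 were the only linked article (3 inlinks, 1 outlink, as in Fig. 2),
# the self-link score would be 8 and A_3's normalized score would be 0.5.
```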

3.2.2 Vector dimensions related to Wikipedia categories

Next we define the method for generating vector dimensions corresponding to all Wikipedia categories which are related to the concept article. The strategy to assign a score to the related Wikipedia categories proceeds as follows:

  1. Start by propagating the score uniformly to the categories to which the concept article directly belongs (see Fig. 1).

  2. A portion of the score is further propagated, according to the probability of jumping from a category to the categories in its neighborhood.

  3. Score propagation continues until a certain hop count is reached (i.e. a threshold value $category_{depth}$), or there are no further categories in the neighborhood.

Fig. 3 illustrates the process, where the concept article $A_{concept}$ has a score $s$, which is 1 for an exact match (in the case of a partial best match, it is the relevance score returned by the BM25 algorithm). First, the score is propagated uniformly across the Wikipedia categories, and their tree structures, to which the article belongs (the $C_{1}$ and $C_{7}$ trees each receive $s/2$ from $A_{concept}$). In the next step, the directly-related categories ($C_{1}$ and $C_{7}$) further propagate the score to their super- and sub-categories, while retaining a portion of the score. $C_{1}$ retains a fraction $1 - jump_{prob}$ of the score that it propagates to its super- and sub-categories, where $jump_{prob}$ is the probability of jumping from a category to a connected super- or sub-category. $C_{7}$ retains the full score, since it has no super- or sub-category for further propagation. From step 3 onwards, the score continues to propagate in a given direction (towards either super- or sub-categories) until the hop count $category_{depth}$ is reached, or until there is no further category to which the score could propagate. In Fig. 3, $C_{0}$ and $C_{3}$ are cases where the score cannot propagate further, while $C_{4}$ is where propagation stops when using a threshold $category_{depth}=2$.

Figure 3: Assignment of scores for the category dimensions, from the mapped article to its related categories.
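The propagation scheme can be sketched as follows. Since the text does not fully specify how the passed-on score is split among super- and sub-categories, the even split below is an assumption, as are the function and variable names; treat this as one possible reading of Fig. 3 rather than the exact implementation.

```python
def category_dimensions(direct_categories, neighbours, s=1.0,
                        jump_prob=0.5, category_depth=2):
    """Propagate an article score s into the Wikipedia category graph.

    direct_categories: categories the concept article belongs to.
    neighbours: dict mapping a category to its super- and sub-categories.
    """
    scores = {}
    # Step 1: split s uniformly over the directly linked categories.
    share = s / len(direct_categories)
    frontier = [(c, share, None) for c in direct_categories]
    for _ in range(category_depth):
        next_frontier = []
        for cat, score, previous in frontier:
            onward = [c for c in neighbours.get(cat, []) if c != previous]
            if not onward:
                # Nowhere left to propagate: the category keeps the full score.
                scores[cat] = scores.get(cat, 0.0) + score
                continue
            # Retain (1 - jump_prob) and pass jump_prob onward, split evenly.
            scores[cat] = scores.get(cat, 0.0) + score * (1 - jump_prob)
            passed = score * jump_prob / len(onward)
            next_frontier.extend((c, passed, cat) for c in onward)
        frontier = next_frontier
    # Any score still travelling when the depth limit is hit stays where it is.
    for cat, score, _ in frontier:
        scores[cat] = scores.get(cat, 0.0) + score
    return scores
```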

3.2.3 Overall vector dimensions

Once the sets of dimensions for related Wikipedia articles and categories have been created, we construct an overall vector for the concept article as follows. Eq. 1 shows the vector representation of a concept, where $norm$ is a normalization function, $articles_{score}$ and $categories_{score}$ are the two sets of dimensions, and $bias_{article}$ and $bias_{category}$ are the bias weights which control the importance of the associations with the Wikipedia articles and categories respectively. The bias weights can be tuned to give more importance to either type of association. In Eq. 2, we normalize the entire vector such that the sum of the scores of all dimensions equals 1, so that a unit-length vector is obtained.

Vector(concept) = < norm(articles_{score}) * bias_{article}, norm(categories_{score}) * bias_{category} >   (1)

Vector(concept) = norm(Vector(concept))   (2)
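A minimal sketch of Eqs. 1 and 2, assuming that $norm$ divides each score by the sum of scores (so that the final dimensions sum to 1, as stated above); the helper names are ours.

```python
def build_vector(articles_score, categories_score,
                 bias_article=0.5, bias_category=0.5):
    """Combine article and category dimensions into one vector (Eqs. 1-2).

    articles_score / categories_score: dicts mapping a dimension label
    (an article or category name) to its score.
    """
    def norm(d):
        # Assumed normalization: divide by the sum so the scores add up to 1.
        total = float(sum(d.values()))
        return {k: v / total for k, v in d.items()} if total else dict(d)

    vector = {("article", k): v * bias_article
              for k, v in norm(articles_score).items()}
    vector.update({("category", k): v * bias_category
                   for k, v in norm(categories_score).items()})
    return norm(vector)  # Eq. 2: final normalization of the whole vector
```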

The process above is repeated for each word or concept in the input dataset to generate a set of vectors, representing an embedding of the data. In this embedding, each vector dimension is labeled with a tag which corresponds to either a Wikipedia article name or a Wikipedia category name. Therefore, each dimension carries a direct human-interpretable meaning. As we see in the next section, these labeled dimensions prove useful for the generation of algorithmic explanations.

4 Evaluation

In this section we investigate the extent to which embeddings generated using the EVE model are useful in three fundamental data mining tasks. Firstly, we describe a number of alternative baseline methods, along with the relevant parameter settings. Then we describe the dataset which is used for the evaluations, and finally we report the experimental results to showcase the effectiveness of the model. We also highlight the benefits of the explanations generated as part of this process.

4.1 Baselines and Parameters

We compare EVE with three popular word embedding algorithms: Word2Vec, FastText, and GloVe. For Word2Vec and FastText, we trained two well-known variants of each – i.e. the continuous bag of words model (CBOW) and the skip-gram model (SG). For GloVe, we trained the standard model.

For each baseline, we use the default implementation parameter values (window_size=5, vector_dimensions=100), except for the minimum document frequency threshold, which is set to 1 so that all word vectors are generated, even for rare words. This enables direct comparisons to be made with EVE. For EVE, we use uniform bias weights (i.e. $bias_{article}=0.5$, $bias_{category}=0.5$), which gives equal importance to both dimension types. The parameter $jump_{prob}=0.5$ was chosen arbitrarily, so that a category retains half of its score while the rest is propagated.
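For illustration, the baseline settings could be reproduced roughly as follows with the gensim library; the choice of tooling is our assumption (the paper does not name the implementations used), and the toy corpus merely stands in for the actual Wikipedia-derived documents.

```python
from gensim.models import Word2Vec, FastText

# Toy corpus standing in for the article labels, redirects, category labels
# and long abstracts used in the paper (one tokenized entry per document).
sentences = [["espresso", "coffee", "drinks"], ["pizza", "italian", "cuisine"]]

common = dict(vector_size=100, window=5, min_count=1)  # gensim >= 4.0 names
w2v_cbow = Word2Vec(sentences, sg=0, **common)
w2v_sg = Word2Vec(sentences, sg=1, **common)
ft_cbow = FastText(sentences, sg=0, **common)
ft_sg = FastText(sentences, sg=1, **common)
# GloVe has no gensim trainer; the reference implementation would be used
# with analogous settings.
```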

4.2 Dataset

To evaluate the performance of the different models, we constructed a new dataset from the complete 2015 English-language Wikipedia dump, composed of seven different topical types, each containing at least five sub-topical categories. On average, each sub-topical category contains a list of 20 items or concepts. The usefulness of the dataset lies in the fact that its organization, from topics to categories to items, is based on factual grounds.

Table 1: Summary statistics of the dataset.

Topical Type | Categories | Mean Items per Category | Example (Category: Items)
Animal class | 5 | 20 | Mammal: Baleen whale, Elephant, Primate
Continent to country | 6 | 17 | Europe: Albania, Belgium, Bulgaria
Cuisine | 5 | 20 | Italian cuisine: Agnolotti, Pasta, Pizza
European cities | 5 | 20 | Germany: Berlin, Bielefeld, Bonn
Movie genres | 5 | 20 | Science fiction film: RoboCop, The Matrix, Westworld
Music genres | 5 | 20 | Grunge: Alice in Chains, Chris Cornell, Eddie Vedder
Nobel laureates | 5 | 20 | Nobel laureates in Physics: Albert Einstein, Niels Bohr
Table 2: Dataset topical types and corresponding sub-topical categories.

Topical Type | Categories
Animal classes | Mammal, Reptile, Bird, Amphibian, Fish
Continent to country | Africa, Europe, Asia, South America, North America, Oceania
Cuisine | Italian cuisine, Mexican cuisine, Pakistani cuisine, Swedish cuisine, Vietnamese cuisine
European cities | France, Germany, Great Britain, Italy, Spain
Movie genres | Animation, Crime film, Horror film, Science fiction film, Western (genre)
Music genres | Jazz, Classical music, Grunge, Hip hop music, Britpop
Nobel laureates | Nobel laureates in Chemistry, Nobel Memorial Prize laureates in Economics, Nobel laureates in Literature, Nobel Peace Prize laureates, Nobel laureates in Physics

Table 1 shows a statistical summary of the dataset. In this table, the column “Example (Category: Items)” shows an example of a category name for the “Topical Type”, together with a subset of the items belonging to that category. For instance, in the first row the “Topical Type” is Animal class and Mammal is one of the categories belonging to this type, while Baleen whale is an item within the category Mammal. Similarly, there are other categories of the type Animal class, such as Reptile. Table 2 shows the list of categories for each topical type.

All embedding algorithms in our comparison were trained on this dataset. In the case of the baseline models, we use “article labels”, “article redirects”, “category labels”, and “long abstracts”, with each entry as a separate document. Note that, prior to training, we filter out four non-informative Wikipedia categories which can be viewed as being analogous to stopwords: {“articles contain video clips”, “hidden categories”, “articles created via the article wizard”, “unprintworthy redirects”}.

4.3 Experiments

To compare the EVE model with the various baseline methods, we define three general purpose data mining tasks: intruder detection, ability to cluster, ability to sort relevant items first. In the following sections we define the tasks separately, each accompanied by experimental results and explanations.

4.3.1 Experiment 1: Intruder detection

First, we evaluate the performance of EVE when attempting to detect an unrelated “intruder” item from a list of n items, where the rest of the items in the list are semantically related to one another. The ground truth for the correct relations between items is based on the “topical types” in the dataset.

Task definition:

For a given “topical type”, we randomly choose four items belonging to one category and one intruder item from a different category of the same “topical type”. After repeating this process exhaustively for all combinations across all topical types, we generated 13,532,280 queries for this task. Table 3 shows the breakdown of the total number of queries for each of the “topical types”.

Example of a query:

For the “topical type” European cities, we randomly choose four related items from the “category” Great Britain such as London, Birmingham, Manchester, Liverpool, while we randomly choose an intruder item Berlin from the “category” Germany. Each of the models is presented with the five items, where the challenge is to identify Berlin as the intruder – the rest of the items are related to each other as they are cities in Great Britain, while Berlin is a city in Germany.

Strategy:

In order to discover the intruder item, we formulate the problem as a maximization of pairwise similarity across all items; the item receiving the lowest score is least similar to all other items, and is thus identified as the intruder. Formally, for each model we compute

score(item_{(k)}) = \sum_{i=1}^{5} similarity(item_{(k)}, item_{(i)}), \quad i \neq k   (3)

where the $similarity$ function is cosine similarity (Manning et al, 2008), $k$ and $i$ are the item positions in the list of items, and $item_{(k)}$ and $item_{(i)}$ are the vectors returned by the model under consideration.
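A minimal sketch of Eq. 3 using cosine similarity over a matrix of item vectors (the function name is ours):

```python
import numpy as np

def find_intruder(vectors):
    """Identify the intruder as the item with the lowest total cosine
    similarity to the other items (Eq. 3). vectors: array of shape (5, d)."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed.T
    scores = sims.sum(axis=1) - 1.0  # drop each item's self-similarity term
    return int(np.argmin(scores))
```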

Table 3: Intruder detection task — Statistics for the number of queries.
Topical Types No. of Queries
Animal class 1,938,000
Continent to country 1,904,280
Cuisine 1,938,000
European cities 1,938,000
Movie genres 1,938,000
Music genres 1,938,000
Nobel laureates 1,938,000
Total 13,532,280
Results:

To evaluate the effectiveness of the EVE model against the baselines for this task, we use accuracy (Manning et al, 2008) as the measure for finding the intruder item. Accuracy is defined as the ratio of correct results (i.e. correctly identified intruder items) to the total number of results returned by the model:

accuracy = \frac{|Results_{Correct}|}{|Results_{Total}|}   (4)

Table 4 shows the experimental results for the six models in this task. From the table it is evident that the EVE model significantly outperforms the rest of the models overall. However, in the case of two “topical types”, Continent to country and European cities, the FastText skip-gram model yields better results. To explain this, we next show explanations generated by the EVE model while it makes decisions for the intruder detection task.

Table 4: Intruder detection task — Detection accuracy results.

Topical Type | EVE | Word2Vec CBOW | Word2Vec SG | FastText CBOW | FastText SG | GloVe
Animal class | 0.77 | 0.39 | 0.42 | 0.36 | 0.43 | 0.31
Continent to country | 0.75 | 0.70 | 0.76 | 0.79 | 0.79 | 0.73
Cuisine | 0.97 | 0.34 | 0.43 | 0.62 | 0.75 | 0.25
European cities | 0.94 | 0.93 | 0.98 | 0.91 | 0.99 | 0.74
Movie genres | 0.71 | 0.23 | 0.24 | 0.22 | 0.25 | 0.21
Music genres | 0.87 | 0.56 | 0.59 | 0.50 | 0.57 | 0.38
Nobel laureates | 0.91 | 0.28 | 0.28 | 0.23 | 0.27 | 0.24
Average | 0.85 | 0.50 | 0.52 | 0.52 | 0.58 | 0.41

Note: all p-values are < 10^{-157} for EVE with respect to all baselines.

Explanation from the EVE model:

Using the labeled dimensions in the vectors produced by EVE, we define the process for generating explanations for the intruder detection task in Algorithm 1. The inputs to this algorithm are the item vectors and the intruder item identified by the EVE model. In step 1, we calculate the mean of all the vectors. In steps 2 and 3, we subtract the intruder vector and the mean vector from each other to obtain the dominant vector spaces representing the detected coherent items and the intruder item respectively. In steps 4 and 5, we order the labeled dimensions by their informativeness (i.e. the dimension with the highest score is the most informative). Finally, we return a ranked list of informative vector dimensions for both the non-intruders and the intruder as an explanation for the output of the task.

Algorithm 1 Explanation strategy for the intruder detection task
Input: EVE vector_space, vector_intruder
1: space_mean = Mean(vector_space)
2: coherentSpace_leftover = space_mean - vector_intruder
3: intruder_leftover = vector_intruder - space_mean
4: coherentSpace_info_features = order_by_info_features(coherentSpace_leftover)
5: intruder_info_features = order_by_info_features(intruder_leftover)
6: return coherentSpace_info_features, intruder_info_features
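A compact NumPy reading of Algorithm 1 might look as follows; the dimension labels are assumed to be supplied alongside the vectors, and ordering “by informativeness” is taken to mean sorting dimensions by descending score.

```python
import numpy as np

def explain_intruder(vector_space, vector_intruder, dim_labels, top_k=9):
    """Sketch of Algorithm 1. vector_space: (n_items, n_dims) EVE vectors;
    vector_intruder: the detected intruder vector; dim_labels: dimension names."""
    space_mean = vector_space.mean(axis=0)                     # step 1
    coherent_leftover = space_mean - vector_intruder           # step 2
    intruder_leftover = vector_intruder - space_mean           # step 3
    def order_by_info(v):                                      # steps 4-5
        return [dim_labels[i] for i in np.argsort(v)[::-1][:top_k]]
    return order_by_info(coherent_leftover), order_by_info(intruder_leftover)
```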
Table 5: Sample explanation generated for the intruder detection task, for the query: {Hawk, Penguin, Gull, Parrot, Snake}. Correct intruder detected: Snake. All top-9 features are Wikipedia categories.

Non-Intruder | Intruder
falconiformes | turonian first appearances
birds of prey | snakes
seabirds | squamata
ypresian first appearances | predators
psittaciformes | lepidosaurs
parrots | predation
rupelian first appearances | carnivorous animals
gulls | venomous snakes
bird families | snakes in art

Tables 5 and 6 show sample explanations generated by the EVE model, for cases where the model detected a correct and an incorrect intruder item respectively. In Table 5, the query has items selected from the “topical type” animal class, where four of the items belong to the “category” birds, while the item ‘Snake’ belongs to the “category” reptile. As can be seen from the table, the bold features in the non-intruder and intruder columns clearly represent the bird family and snakes respectively, which is the correct inference. Furthermore, the non-bold features in the non-intruder and intruder columns represent deeper relevant relations which may require some domain expertise. For instance, Falconiformes is an order of 60+ bird species, and the Turonian is a geological age associated with the first appearances of particular genera.

Table 6: Sample explanation generated for the intruder detection task, for the query: {I Am Legend (film), Insidious (film), A Nightmare on Elm Street, Final Destination (film), Children of Men}. Incorrect intruder detected: Final Destination (film). All top-9 features are Wikipedia categories.

Non-Intruder | Intruder
english-language films | studiocanal films
american independent films | splatter films
american horror films | final destination films
universal pictures films | films shot in vancouver
post-apocalyptic films | films shot in toronto
films based on science fiction novels | films shot in san francisco, california
2000s science fiction films | films set in new york
ghost films | films set in 1999
films shot in los angeles, california | film scores by shirley walker

In the example in Table 6, the query has items selected from the “topical type” movie genres, where four of the items belong to the “category” horror film, while the intruder item ‘Children of Men’ belongs to the “category” science fiction film. In this example, EVE identifies the wrong intruder item according to the ground truth, recommending instead the item ‘Final Destination (film)’. From the explanation in the table, it becomes clear why the model made this recommendation. We observe that the non-intruder items have a coherent relationship with ‘post-apocalyptic films’ and ‘films based on science fiction novels’ (both ‘I Am Legend (film)’ and ‘Children of Men’ belong to these categories), whereas ‘Final Destination (film)’ was recommended by the model based on features relating to filming locations. A key advantage of having an explanation from the model is that it allows us to understand why a mistake occurs and how we might improve the model. In this case, one way to make an improvement might be to add a rule that filters out Wikipedia categories relating to locations when considering movie genres.

4.3.2 Experiment 2: Ability to cluster

In this experiment, we evaluate the extent to which the distances computed on EVE embeddings can help to group semantically-related items together, while keeping unrelated items apart. This is a fundamental requirement for distance-based methods for cluster analysis.

Task definition:

For all items in a specific “topical type”, we construct an embedding space without using information about the category to which the items belong. The purpose is then to measure the extent to which these items cluster together in the space relative to the ground truth categories. This is done by measuring distances in the space between items that should belong together (i.e. intra-cluster distances) and items that should be kept apart (i.e. inter-cluster distances), as determined by the categories. Since there are seven “topical types”, there are also seven queries in this task.

Example of a query:

For the “topical type” Cuisine, we are provided with a list of 100 items in total, where each of the five categories has 20 items. These correspond to cuisine items from five different countries. The idea is to measure the ability of each embedding model to cluster these 100 items back into five categories.

Strategy:

To formally measure the ability of a model to cluster items, we conduct a two-step strategy as follows:

  1. Calculate a pairwise similarity matrix between all items of a given “topical type”. The similarity function that we use for this task is cosine similarity.

  2. Transform the similarity matrix into a distance matrix (simply 1 minus the normalized similarity score over each dimension), which is used to measure inter- and intra-cluster distances relative to the ground truth categories (see the sketch after this list).
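The sketch below illustrates the two steps; the min-max normalization of the similarity scores is an assumption, since the description above ("1 minus the normalized similarity score") leaves the exact normalization open.

```python
import numpy as np

def distance_matrix(vectors):
    """Pairwise cosine similarities turned into distances for the
    ability-to-cluster task. vectors: array of shape (n_items, n_dims)."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = normed @ normed.T
    sim = (sim - sim.min()) / (sim.max() - sim.min())  # assumed normalization
    return 1.0 - sim
```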

Results:

To evaluate the ability to cluster, there are typically two objectives: within-cluster cohesion and between-cluster separation. To this end, we use three well-known cluster validity measures in this task. Firstly, the within-cluster distance (Everitt et al, 2001) is the total of the squared distances between each item xix_{i} and the centroid vector μc\mu_{c} of the cluster CcC_{c} to which it has been assigned:

within = \sum_{c=1}^{k} \sum_{x_i \in C_c} d(x_i, \mu_c)^2   (5)

Typically this value is normalized with respect to the number of clusters k. The lower the score, the more coherent the clusters. Secondly, the between-cluster distance is the total of the squares of the distances between each cluster centroid and the centroid of the entire dataset, denoted \hat{\mu}:

Table 7: Ability to cluster task — Mean within-cluster distance scores.

Topical Type | EVE | Word2Vec CBOW | Word2Vec SG | FastText CBOW | FastText SG | GloVe
Animal class | 2.00 | 13.03 | 6.23 | 10.31 | 7.71 | 12.20
Continent to country | 2.34 | 2.63 | 2.25 | 2.83 | 2.56 | 2.60
Cuisine | 2.92 | 17.31 | 8.88 | 9.74 | 6.25 | 12.36
European cities | 3.13 | 7.72 | 5.46 | 8.30 | 5.75 | 6.86
Movie genres | 6.92 | 11.98 | 6.04 | 9.81 | 5.61 | 17.96
Music genres | 1.90 | 8.25 | 5.25 | 6.72 | 5.77 | 7.72
Nobel laureates | 2.88 | 14.56 | 8.99 | 12.40 | 10.59 | 15.13
Average | 3.16 | 10.78 | 6.16 | 8.59 | 6.32 | 10.69
Table 8: Ability to cluster task — Mean between-cluster distance scores.

Topical Type | EVE | Word2Vec CBOW | Word2Vec SG | FastText CBOW | FastText SG | GloVe
Animal class | 0.47 | 1.30 | 0.74 | 1.14 | 1.13 | 0.46
Continent to country | 3.33 | 3.86 | 1.78 | 4.08 | 2.83 | 1.63
Cuisine | 8.18 | 2.12 | 2.12 | 14.52 | 10.80 | 0.88
European cities | 2.39 | 17.14 | 7.45 | 13.24 | 10.86 | 3.84
Movie genres | 1.58 | 0.40 | 0.18 | 0.41 | 0.18 | 0.48
Music genres | 2.23 | 2.79 | 1.60 | 1.16 | 0.18 | 1.68
Nobel laureates | 1.95 | 0.79 | 0.39 | 0.56 | 1.38 | 0.20
Average | 2.88 | 4.06 | 2.04 | 5.02 | 3.96 | 1.31
Table 9: Ability to cluster task — Overall CH-Index validation scores.

Topical Type | EVE | Word2Vec CBOW | Word2Vec SG | FastText CBOW | FastText SG | GloVe
Animal class | 7.64 | 5.98 | 4.09 | 3.91 | 4.44 | 5.46
Continent to country | 15.83 | 11.84 | 8.19 | 13.69 | 12.29 | 7.52
Cuisine | 54.18 | 2.38 | 3.51 | 14.25 | 16.00 | 2.23
European cities | 29.08 | 48.57 | 28.98 | 33.73 | 41.88 | 15.53
Movie genres | 12.45 | 1.36 | 1.43 | 1.51 | 1.87 | 1.27
Music genres | 25.04 | 18.01 | 14.80 | 13.06 | 12.93 | 6.09
Nobel laureates | 21.85 | 3.58 | 3.34 | 1.73 | 3.16 | 2.91
Average | 23.72 | 13.10 | 9.19 | 11.70 | 13.22 | 5.86
between = \sum_{c=1}^{k} |C_c| \, d(\mu_c, \hat{\mu})^2,   where   \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i   (6)

This value is also normalized with respect to the number of clusters k. The higher the score, the better separated the clusters. Finally, the two objectives above are combined via the CH-Index (Caliński and Harabasz, 1974), using the ratio:

CH = \frac{between/(k-1)}{within/(n-k)}   (7)

The higher the value of this measure, the better the overall clustering.
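For reference, the three measures in Eqs. 5–7 can be computed as follows; this sketch uses squared Euclidean distances directly on the item vectors, whereas the evaluation above derives distances from cosine similarities, so it is illustrative only.

```python
import numpy as np

def cluster_validity(X, assignments):
    """Within-cluster, between-cluster and CH-Index scores (Eqs. 5-7).
    X: (n, d) item vectors; assignments: ground-truth cluster label per item."""
    X, assignments = np.asarray(X), np.asarray(assignments)
    n, labels = len(X), np.unique(assignments)
    k = len(labels)
    overall = X.mean(axis=0)
    within = between = 0.0
    for c in labels:
        members = X[assignments == c]
        centroid = members.mean(axis=0)
        within += ((members - centroid) ** 2).sum()
        between += len(members) * ((centroid - overall) ** 2).sum()
    ch = (between / (k - 1)) / (within / (n - k))
    return within / k, between / k, ch  # first two reported per cluster
```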

From Table 7, we can see that EVE generally performs better than the rest of the embedding methods for the within-cluster measure. In Table 8, for the between-cluster measure, EVE is outperformed by FastText CBOW, Word2Vec CBOW, and FastText SG, mainly due to the “topical types” Cuisine and European cities, where EVE does not perform well. Finally, in Table 9, where the combined aim of clustering is captured through the CH-Index, EVE outperforms the rest of the methods, except in the case of the “topical type” European cities.

Explanation from the EVE model:

Using the labeled dimensions from the EVE model, we define a similar explanation strategy to that used in the previous task. However, instead of discovering an intruder item, the goal is now to characterize the categories formed by the items and the overall space. Algorithm 2 shows the strategy, which requires three inputs: the $vector_{space}$ representing the entire embedding; the list of categories $categories$; and $categories\_vector_{space}$, the vector space of the items belonging to each category. In step 1, we calculate the mean vector representing the entire space. In step 2, we order the labeled dimensions of the mean vector by informativeness. In steps 3–6, we iterate over the list of categories (of a “topical type” such as Cuisine), calculate the mean vector for each category's vector space, and order the dimensions of that mean vector by informativeness. Finally, we return the most informative features of the entire space and of each category's vector space.

Algorithm 2 Explanation strategy for the ability to cluster task
Input: EVE vector_space, categories, categories_vector_space
1: space_mean = Mean(vector_space)
2: space_info_features = order_by_info_features(space_mean)
3: for category in categories do
4:   category_mean = Mean(categories_vector_space[category])
5:   categories_info_features[category] = order_by_info_features(category_mean)
6: end for
7: return space_info_features, categories_info_features
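A compact NumPy reading of Algorithm 2 (the helper names are ours):

```python
import numpy as np

def explain_clusters(vector_space, categories, categories_vector_space,
                     dim_labels, top_k=6):
    """Sketch of Algorithm 2: the most informative dimensions of the overall
    space and of each category's own vector space."""
    def order_by_info(v):
        return [dim_labels[i] for i in np.argsort(v)[::-1][:top_k]]
    space_features = order_by_info(vector_space.mean(axis=0))           # steps 1-2
    category_features = {c: order_by_info(categories_vector_space[c].mean(axis=0))
                         for c in categories}                           # steps 3-6
    return space_features, category_features
```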
Table 10: Sample explanation generated for the ability to cluster task, for the query: {items of “topical type” Cuisine}. All top-6 features are Wikipedia categories, except those beginning with ‘α:’, which correspond to Wikipedia articles.

Overall space | Italian category | Mexican category | Pakistani category | Swedish category | Vietnamese category
vietnamese cuisine | italian cuisine | mexican cuisine | pakistani cuisine | swedish cuisine | vietnamese cuisine
swedish cuisine | cuisine of lombardy | tortilla-based dishes | indian cuisine | finnish cuisine | vietnamese words and phrases
mexican cuisine | types of pasta | cuisine of the south-western united states | indian desserts | α:swedish cuisine | α:vietnamese cuisine
italian cuisine | pasta | cuisine of the western united states | pakistani breads | desserts | α:vietnam
dumplings | dumplings | α:list of mexican dishes | iranian breads | α:sweden | α:gà nướng sả
pakistani cuisine | italian-american cuisine | maize dishes | pakistani meat dishes | potato dishes | α:thit kho tau

Tables 10 and 11 show the explanations generated by the EVE model, for the cases where the model performed best and worst against the baselines respectively. In Table 10, the query is the list of items from the “topical type” cuisine. As can be seen from the bold entries in the table, the explanation conveys the main idea of both the overall space and the individual categories. For example, in the overall space we can see the cuisines of different nationalities, and likewise we can see the nationality from which each cuisine originates (e.g. Italian cuisine for the “Italian category” and Pakistani breads for the “Pakistani category”). As for the non-bold entries, we can also observe relevant features, but at a deeper semantic level. For example, cuisine of Lombardy in the “Italian category”, where Lombardy is a region in Italy, and likewise tortilla-based dishes in the Mexican category, where the tortilla is a primary ingredient in Mexican cuisine.

Table 11: Sample explanation for the ability to cluster task, for the query: {items of “topical type” European cities}. All top-6 features are Wikipedia categories.

Overall space | France category | Great Britain category | Germany category | Italy category | Spain category
prefectures in france | prefectures in france | articles including recorded pronunciations (uk english) | university towns in germany | world heritage sites in italy | university towns in spain
university towns in germany | port cities and towns on the french atlantic coast | county towns in england | members of the hanseatic league | mediterranean port cities and towns in italy | populated coastal places in spain
members of the hanseatic league | cities in france | metropolitan boroughs | german state capitals | populated coastal places in italy | roman sites in spain
articles including recorded pronunciations (uk english) | subprefectures in france | university towns in the united kingdom | cities in north rhine-westphalia | cities and towns in emilia romagna | port cities and towns on the spanish atlantic coast
capitals of former nations | world heritage sites in france | populated places established in the 1st century | rhine province | former capitals of italy | tourism in spain
german state capitals | communes of nord (french department) | fortified settlements | populated places on the rhine | capitals of former nations | mediterranean port cities and towns in spain

In Table 11, the query is the list of items from the “topical type” European cities, and this is the example where the EVE model performs worst. However, the explanation allows us to understand why this is the case. As can be seen from the explanation table, the bold features show historic relationships across different countries, such as “capitals of former nations”, “fortified settlements”, and “roman sites in spain”. Similarly, this can also be observed in non-bold features such as “former capitals of italy”. Based on this explanation, we could potentially decide to apply a rule that would exclude any historical articles or categories when generating the embedding for this type of task in future.

(a) EVE model
(b) GloVe model
Figure 4: Visualizations of model embeddings generated for the ability to cluster task, for the query: {items of “topical type” Country to Continent}. Colors and shapes indicate items belonging to different ground truth categories.
Visualization:

Since scatter plots are often used to represent the output of a cluster analysis process, we generate a visualization of all embeddings using t-SNE (Maaten and Hinton, 2008), a tool that visually represents high-dimensional data by reducing it to 2–3 dimensions for presentation. (The full set of experimental visualizations is available at http://mlg.ucd.ie/eve/.) Fig. 4 shows the visualizations generated using EVE and GloVe when the list of items is selected from the “topical type” country to continent. As can be seen from the plot, the ground truth categories exhibit better clustering behavior in the space from the EVE model than in the space from the GloVe model. This is also reflected in the corresponding scores in Tables 7, 8, and 9.
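A generic sketch of how such a plot can be produced with scikit-learn's t-SNE implementation (not the authors' plotting code):

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_embedding(vectors, category_labels):
    """Project an embedding to 2-D with t-SNE and colour the points by their
    ground-truth category, in the spirit of Fig. 4."""
    coords = TSNE(n_components=2, random_state=0).fit_transform(vectors)
    for category in sorted(set(category_labels)):
        idx = [i for i, c in enumerate(category_labels) if c == category]
        plt.scatter(coords[idx, 0], coords[idx, 1], label=category)
    plt.legend()
    plt.show()
```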

4.3.3 Experiment 3: Sorting relevant items first

Task definition:

The objective of this task is to rank a list of items based on their relevance to a given query item. According to the ground truth associated with our dataset, items which belong to the same ‘category’ of a “topical type” as the query should be ranked above items which do not belong to that ‘category’ (i.e. they are irrelevant to the query). In this task, the total number of queries is equal to the total number of categories in the dataset, i.e. 36 (see Table 1).

Example of a query:

Unlike the previous tasks, a ‘category’ is used as the query in this task. For example, for the ‘category’ Nobel laureates in Physics, the task is to sort all items from the “topical type” Nobel laureates such that the items from the ‘category’ Nobel laureates in Physics are ranked ahead of the rest of the items. Thus, Niels Bohr, who is a laureate in Physics, should appear near the top of the ranking, unlike Elihu Root, who is a prize winner in Peace.

Strategy:

In order to sort items relevant to a category, we define a simple two-step strategy as follows:

  1. Calculate the cosine similarity between all items and a category belonging to the “topical type” in the model space.

  2. Sort the list of items in descending order according to their similarity scores with the category.

Based on this strategy, a successful model should rank items with the same ‘category’ before irrelevant items.
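A minimal sketch of this ranking strategy (the function name is ours):

```python
import numpy as np

def rank_items(category_vector, item_vectors, item_names):
    """Sort items by descending cosine similarity to the category vector."""
    q = category_vector / np.linalg.norm(category_vector)
    normed = item_vectors / np.linalg.norm(item_vectors, axis=1, keepdims=True)
    sims = normed @ q
    order = np.argsort(sims)[::-1]
    return [(item_names[i], float(sims[i])) for i in order]
```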

Results:

We use precision-at-k (P@k) and average precision (AP) (Manning et al, 2008) as the measures to evaluate the effectiveness of the sorting ability of each embedding model with respect to the relevance of items to a category. P@k captures how many relevant items are retrieved at a certain rank (i.e. in the top-k results), while AP captures how early relevant items are retrieved on average. It may happen that two models have the same value of P@k, while one of them retrieves relevant items earlier in the ranking, thus achieving a higher AP value. P@k is defined as the ratio of relevant items retrieved within the top-k retrieved items, whereas AP is the average of the P@k values computed after each relevant item is retrieved. Equations 8 and 9 show the formal definitions of both measures.

P@k = \frac{|Items_{Relevant}|}{|Items_{Top\text{-}k}|}   (8)

AP = \frac{1}{|Items_{Relevant}|} \sum_{k=1}^{|Items|} P@k \cdot rel(k)   (9)

where rel(k) = 1 if item(k) is relevant, and 0 otherwise.
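The two measures can be computed directly from the ranked 0/1 relevance flags, for example:

```python
def precision_at_k(relevances, k):
    """relevances: 0/1 relevance flags of the items in ranked order (Eq. 8)."""
    return sum(relevances[:k]) / k

def average_precision(relevances):
    """Average of P@k at each rank where a relevant item appears (Eq. 9)."""
    hits, total = 0, 0.0
    for k, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / max(sum(relevances), 1)
```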

Tables 12 and 13 show the experimental results for the sorting relevant items first task. We choose P@20 (k=20), since on average there are 20 items in each category of the dataset. As can be seen from the tables, the EVE model generally outperforms the rest of the models, except for the “topical type” European cities, where it is outperformed by factors of 1.05 and 1.09 in terms of P@k and AP respectively; in all other cases EVE outperforms the other algorithms by at least 1.51 and 1.37 times in terms of P@k and AP respectively. On average, the EVE model outperforms the second best algorithm by factors of 1.8 and 1.67 in terms of P@k and AP respectively. In the next section, we show the corresponding explanations generated by the EVE model for this task.

Table 12: Sorting relevant items first task – Precision (P@20) scores.

Topical Type | EVE | Word2Vec CBOW | Word2Vec SG | FastText CBOW | FastText SG | GloVe
Animal class | 0.72 | 0.34 | 0.38 | 0.41 | 0.47 | 0.22
Continent to country | 0.95 | 0.54 | 0.51 | 0.63 | 0.59 | 0.31
Cuisine | 0.97 | 0.36 | 0.49 | 0.54 | 0.54 | 0.24
European cities | 0.91 | 0.85 | 0.91 | 0.86 | 0.96 | 0.61
Movie genres | 0.87 | 0.30 | 0.31 | 0.24 | 0.29 | 0.24
Music genres | 0.90 | 0.33 | 0.30 | 0.28 | 0.37 | 0.21
Nobel laureates | 0.99 | 0.27 | 0.22 | 0.20 | 0.25 | 0.20
Average | 0.90 | 0.43 | 0.45 | 0.45 | 0.50 | 0.29
Table 13: Sorting relevant items first task – Average Precision (AP) scores.

Topical Type | EVE | Word2Vec CBOW | Word2Vec SG | FastText CBOW | FastText SG | GloVe
Animal class | 0.72 | 0.38 | 0.42 | 0.45 | 0.52 | 0.27
Continent to country | 0.92 | 0.55 | 0.54 | 0.65 | 0.67 | 0.33
Cuisine | 0.99 | 0.39 | 0.58 | 0.59 | 0.59 | 0.27
European cities | 0.91 | 0.91 | 0.97 | 0.93 | 0.99 | 0.65
Movie genres | 0.88 | 0.32 | 0.35 | 0.29 | 0.34 | 0.29
Music genres | 0.91 | 0.35 | 0.34 | 0.33 | 0.40 | 0.29
Nobel laureates | 1.00 | 0.26 | 0.26 | 0.24 | 0.29 | 0.24
Average | 0.90 | 0.45 | 0.49 | 0.49 | 0.54 | 0.33
Explanation from the EVE model:

Using the labeled dimensions provided by the EVE model, we define a strategy for generating explanations for the sorting relevant items first task in Algorithm 3. The strategy requires three inputs. The first is the $vector_{space}$, which is composed of the category vector and the item vectors. The second is $Sim_{wrt\_category}$, a column matrix composed of the similarity scores between the category vector and itself and between the category vector and the item vectors; the first entry of this matrix is 1.0 because of the self-similarity of the category vector. The final input is the list of items $items$. In steps 1 and 2, a weighted mean vector of the space is calculated, where the weights are the similarity scores between the vectors in the space and the category vector. In steps 3–6, we iterate over the list of items, take the product of the weighted mean vector of the space and each item vector, and then order the resulting dimensions by informativeness. Finally, we return the ranked list of informative features for each item.

Algorithm 3 Explanation strategy for the sorting relevant items first task
Input: EVE vector_space, Sim_wrt_category, items
1: BiasedSpace = vector_space × Sim_wrt_category
2: BiasedSpace_mean = Mean(BiasedSpace)
3: for item in items do
4:   item_projection = BiasedSpace_mean × vector_space[item]^T
5:   items_info_features[item] = order_by_info_features(item_projection)
6: end for
7: return items_info_features
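A NumPy reading of Algorithm 3; the product in step 4 is interpreted as an element-wise product (so that individual dimensions can then be ranked), which is our assumption, as are the helper names.

```python
import numpy as np

def explain_ranking(vector_space, sim_wrt_category, items, dim_labels, top_k=6):
    """Sketch of Algorithm 3. vector_space: (n, d) matrix of the category
    vector plus item vectors; sim_wrt_category: (n,) similarities to the
    category vector; items: dict mapping an item name to its row vector."""
    biased_space = vector_space * sim_wrt_category[:, None]   # step 1
    biased_mean = biased_space.mean(axis=0)                   # step 2
    explanations = {}
    for name, vec in items.items():                           # steps 3-6
        projection = biased_mean * vec   # element-wise product (our reading)
        top = np.argsort(projection)[::-1][:top_k]
        explanations[name] = [dim_labels[i] for i in top]
    return explanations
```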

Tables 14 and 15 show sample explanations generated by the EVE model. For illustration purposes we select the “topical types” Nobel laureates and Music genres for explanations, as these are the only remaining “topical types” which we have not looked at so far in the other tasks.

Table 14: Sample explanation for the sorting relevant items first task, for the query: {Nobel laureates in Chemistry}. All top-6 features are Wikipedia categories.

Kurt Alder (Chemistry), first correct found at #1 | Linus Pauling (Peace), first incorrect found at #20
nobel laureates in chemistry | nobel laureates in chemistry
german nobel laureates | guggenheim fellows
organic chemists | american nobel laureates
university of kiel faculty | national medal of science laureates
university of kiel alumni | american physical chemists
university of cologne faculty | american people of scottish descent

In Table 14, the query is the ‘category’ Nobel laureates in Chemistry from the “topical type” nobel laureates. We show the informative features for two cases: the first correct result, which appears at rank 1 in the sorted list produced by EVE, and the first incorrect result, which appears at rank 20. The bold features indicate that both individuals are Nobel laureates in Chemistry. However, Linus Pauling also appears to be associated with the Peace category. This reflects the fact that Linus Pauling is a two-time Nobel laureate in two different categories, Chemistry and Peace. While generating the dataset used in our evaluations, the annotators randomly selected items to belong to a category from the full set of available items, without taking into account occasional cases where an item may belong to two categories. This case highlights the fact that EVE explanations are meaningful and can inform the choices made by human annotators.

Table 15: Sample explanation for the sorting relevant items first task, for the query: {Classical music}. All top-6 features are Wikipedia categories, except those beginning with ‘α:’, which are Wikipedia articles.

Ludwig van Beethoven (Classical), first correct found at #1 | Herbie Hancock (Jazz), first incorrect found at #18
romantic composers | 20th-century american musicians
19th-century classical composers | α:classical music
composers for piano | american jazz composers
german male classical composers | grammy award winners
german classical composers | α:herbie hancock
19th-century german people | american jazz bandleaders

In Table 15, the query is ‘category’ Classical music from the “topical type” music genres. We see that the first correct result is observed at rank 1 and the first incorrect result is at rank 18. The bold features show that both individuals are associated with classical music. Looking at the biography of the musician Herbie Hancock more closely, we find that he received an education in classical music and he is also well known in the classical genre, although not as strongly as he is known for Jazz music. This again goes to show that explanations generated using the EVE model are insightful and can support the activity of manual annotators.

5 Conclusion and Future Directions

In this contribution, we presented a novel technique, EVE, for generating vector representations of words using information from Wikipedia. This work represents a first step in the direction of explainable word embeddings, where the core of this interpretability lies in the use of labeled vector dimensions corresponding to either Wikipedia categories or Wikipedia articles. We have demonstrated that, not only are the resulting embeddings useful for fundamental data mining tasks, but the provision of labeled dimensions readily supports the generation of task-specific explanations via simple vector operations. We do not argue that embeddings generated on structured data, such as those produced by the EVE model, would replace the prevalent existing word embedding models. Rather, we have shown that using structured data can provide additional benefits beyond those afforded by existing approaches. An interesting aspect to consider in future would be the use of hybrid models, generated on both structured data and unstructured text, which could still retain aspects of explanations as proposed in this work.

In future, we would like to investigate the effect of the popularity of a word or concept (i.e. the number of non-zero dimensions in the embedding). For example, a cuisine-related item might have fewer non-zero dimensions when compared to a country-related item. Similarly, an interesting direction might be to analyze embedding spaces and sub-spaces to learn more about correlations of dimensions, while addressing a task or the effects of dimensionality reduction (even though spaces may be sparse). Another interesting avenue for future work could be to explore different ways of generating task-specific explanations, and to investigate how these explanations might be presented effectively to a user.


Acknowledgements. This publication has emanated from research conducted with the support of Science Foundation Ireland (SFI), under Grant Number SFI/12/RC/2289.

References

  • Arora et al (2016) Arora S, Li Y, Liang Y, Ma T, Risteski A (2016) A latent variable model approach to pmi-based word embeddings. Tr Assoc Computational Linguistics 4:385–399
  • Baroni and Lenci (2010) Baroni M, Lenci A (2010) Distributional memory: A general framework for corpus-based semantics. Computational Linguistics 36(4):673–721
  • Baroni et al (2014) Baroni M, Dinu G, Kruszewski G (2014) Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In: ACL (1), pp 238–247
  • Bengio et al (2003) Bengio Y, Ducharme R, Vincent P, Jauvin C (2003) A neural probabilistic language model. JMLR 3(Feb):1137–1155
  • Bian et al (2014) Bian J, Gao B, Liu TY (2014) Knowledge-powered deep learning for word embedding. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, pp 132–148
  • Bojanowski et al (2016) Bojanowski P, Grave E, Joulin A, Mikolov T (2016) Enriching word vectors with subword information. arXiv preprint arXiv:160704606
  • Bordes et al (2011) Bordes A, Weston J, Collobert R, Bengio Y (2011) Learning structured embeddings of knowledge bases. In: Conference on artificial intelligence, EPFL-CONF-192344
  • Budanitsky and Hirst (2006) Budanitsky A, Hirst G (2006) Evaluating wordnet-based measures of lexical semantic relatedness. Computational Linguistics 32(1):13–47
  • Caliński and Harabasz (1974) Caliński T, Harabasz J (1974) A dendrite method for cluster analysis. Communications in Statistics-theory and Methods 3(1):1–27
  • Collobert and Weston (2008) Collobert R, Weston J (2008) A unified architecture for natural language processing: Deep neural networks with multitask learning. In: Proc. ICML’2008, ACM, pp 160–167
  • Diaz et al (2016) Diaz F, Mitra B, Craswell N (2016) Query expansion with locally-trained word embeddings. arXiv preprint arXiv:160507891
  • Everitt et al (2001) Everitt B, Landau S, Leese M (2001) Cluster Analysis. Hodder Arnold Publication, Wiley
  • Firth (1957) Firth J (1957) A synopsis of linguistic theory 1930-1955. Studies in linguistic analysis pp 1–32
  • Gabrilovich and Markovitch (2007) Gabrilovich E, Markovitch S (2007) Computing semantic relatedness using wikipedia-based explicit semantic analysis. In: Proc. IJCAI’07, vol 7, pp 1606–1611
  • Gallant et al (1992) Gallant SI, Caid WR, Carleton J, Hecht-Nielsen R, Qing KP, Sudbeck D (1992) Hnc’s matchplus system. In: ACM SIGIR Forum, ACM, vol 26, pp 34–38
  • Globerson et al (2007) Globerson A, Chechik G, Pereira F, Tishby N (2007) Euclidean embedding of co-occurrence data. JMLR 8(Oct):2265–2295
  • Goodman and Flaxman (2016) Goodman B, Flaxman S (2016) European union regulations on algorithmic decision-making and a “right to explanation”. arXiv preprint arXiv:160608813
  • Harris (1954) Harris ZS (1954) Distributional structure. Word 10(2-3):146–162
  • Hoffart et al (2012) Hoffart J, Seufert S, Nguyen DB, Theobald M, Weikum G (2012) Kore: Keyphrase overlap relatedness for entity disambiguation. In: Proc. 21st ACM International Conference on Information and Knowledge Management, pp 545–554
  • Jarmasz (2012) Jarmasz M (2012) Roget’s thesaurus as a lexical resource for natural language processing. arXiv preprint arXiv:12040140
  • Jiang et al (2015) Jiang Y, Zhang X, Tang Y, Nie R (2015) Feature-based approaches to semantic similarity assessment of concepts using wikipedia. Info Processing & Management 51(3):215–234
  • Kusner et al (2015) Kusner MJ, Sun Y, Kolkin NI, Weinberger KQ (2015) From word embeddings to document distances. In: Proc. ICML’2015, pp 957–966
  • Levy and Goldberg (2014) Levy O, Goldberg Y (2014) Neural word embedding as implicit matrix factorization. In: Proc. NIPS’2014, pp 2177–2185
  • Levy et al (2014) Levy O, Goldberg Y, Ramat-Gan I (2014) Linguistic regularities in sparse and explicit word representations. In: CoNLL, pp 171–180
  • Levy et al (2015) Levy O, Goldberg Y, Dagan I (2015) Improving distributional similarity with lessons learned from word embeddings. Tr Assoc Computational Linguistics 3:211–225
  • Liu et al (2015) Liu Y, Liu Z, Chua TS, Sun M (2015) Topical word embeddings. In: AAAI, pp 2418–2424
  • Maaten and Hinton (2008) Maaten Lvd, Hinton G (2008) Visualizing data using t-sne. JMLR 9(Nov):2579–2605
  • Manning et al (2008) Manning CD, Raghavan P, Schütze H (2008) Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA
  • Metzler et al (2007) Metzler D, Dumais S, Meek C (2007) Similarity measures for short segments of text. In: European Conference on Information Retrieval, Springer, pp 16–27
  • Mikolov et al (2013a) Mikolov T, Chen K, Corrado G, Dean J (2013a) Efficient estimation of word representations in vector space. arXiv preprint arXiv:13013781
  • Mikolov et al (2013b) Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013b) Distributed representations of words and phrases and their compositionality. In: Proc. NIPS’2013, pp 3111–3119
  • Pennington et al (2014) Pennington J, Socher R, Manning CD (2014) Glove: Global vectors for word representation. In: Empirical Methods in Natural Language Processing (EMNLP), pp 1532–1543
  • Qureshi (2015) Qureshi MA (2015) Utilising wikipedia for text mining applications. PhD thesis, National University of Ireland Galway
  • Schütze (1992) Schütze H (1992) Word space. In: Proc. NIPS’1992, pp 895–902
  • Sherkat and Milios (2017) Sherkat E, Milios E (2017) Vector embedding of wikipedia concepts and entities. arXiv preprint arXiv:170203470
  • Socher et al (2013) Socher R, Chen D, Manning CD, Ng A (2013) Reasoning with neural tensor networks for knowledge base completion. In: Proc. NIPS’2013, pp 926–934
  • Strube and Ponzetto (2006) Strube M, Ponzetto SP (2006) Wikirelate! computing semantic relatedness using wikipedia. In: Proc. 21st national conference on Artificial intelligence, pp 1419–1424
  • Witten and Milne (2008) Witten I, Milne D (2008) An effective, low-cost measure of semantic relatedness obtained from wikipedia links. In: AAAI Workshop on Wikipedia and Artificial Intelligence: an Evolving Synergy, pp 25–30
  • Xu et al (2014) Xu C, Bai Y, Bian J, Gao B, Wang G, Liu X, Liu TY (2014) Rc-net: A general framework for incorporating knowledge into word representations. In: Proc. 23rd ACM International Conference on Conference on Information and Knowledge Management, pp 1219–1228
  • Yeh et al (2009) Yeh E, Ramage D, Manning CD, Agirre E, Soroa A (2009) Wikiwalk: random walks on wikipedia for semantic relatedness. In: Proc. 2009 Workshop on Graph-based Methods for Natural Language Processing, pp 41–49
  • Zesch and Gurevych (2007) Zesch T, Gurevych I (2007) Analysis of the wikipedia category graph for nlp applications. In: Proc. TextGraphs-2 Workshop (NAACL-HLT 2007), pp 1–8