Top Gear or Black Mirror: Inferring Political Leaning from Non-political Content
Abstract
Polarization and echo chambers are often studied in the context of explicitly political events such as elections, and little scholarship has examined the mixing of political groups in non-political contexts. A major obstacle to studying political polarization in non-political contexts is that political leaning (i.e., left vs right orientation) is often unknown. Nonetheless, political leaning is known to correlate (sometimes quite strongly) with many lifestyle choices leading to stereotypes such as the “latte-drinking liberal.” We develop a machine learning classifier to infer political leaning from non-political text and, optionally, the accounts a user follows on social media. We use Voter Advice Application results shared on Twitter as our ground truth and train and test our classifier on a Twitter dataset comprising the 3,200 most recent tweets of each user after removing any tweets with political text. We correctly classify the political leaning of most users (F1 scores range from 0.70 to 0.85 depending on coverage). We find no relationship between the level of political activity and our classification results. We apply our classifier to a case study of news sharing in the UK and discover that, in general, the sharing of political news exhibits a distinctive left–right divide while sports news does not.
Introduction
Word choice, grammatical structures, and other linguistic features often correlate with demographic variables such as age and gender, although audience and other performative aspects can play a significant role Wang et al. (2019); Nguyen et al. (2016).
Latent attribute inference describes the field of research in computer science that tries to infer demographic and other attributes from individuals’ behaviours. While much attention has been paid to inferring gender, age, and location from social media data Wang et al. (2019); Liu et al. (2019); Zagheni et al. (2017); Zhang et al. (2016); Chen et al. (2015); Sap et al. (2014); Nguyen et al. (2013); Rosenthal and McKeown (2011); Rao et al. (2010), less work has been done on inferring political orientation. Nonetheless, political orientation is known to correlate (sometimes quite strongly) with many lifestyle choices, which leads to stereotypes such as the “latte-drinking liberal” or “bird-hunting conservative” DellaPosta et al. (2015). Individuals’ perception of their own political identities also correlates with the degree of moral value adoption and behaviour Talaifar and Swann (2019). Furthermore, political leaning may influence perceptions of non-political topics: Ahn et al. (2014) were able to infer political leaning from fMRI data in which subjects were shown non-political images as stimuli. Even academic publications and their findings contain some signals of the political leanings of their authors Jelveh et al. (2014).
It is therefore very likely that the content people share, the accounts they follow, and the words they use on social media also contain traces of their political orientation, and that a properly trained machine learning system can infer individuals’ political orientations even in non-political contexts.
Although there are several political spectrum models Eysenck (1964); Sznajd-Weron and Sznajd (2005); Kitschelt (1994), in this study, we focus specifically on a unidimensional left–right spectrum, given its ubiquity in political science and popular culture alike.
We develop and freely share our machine learning system, which achieves an F1 score of more than 0.85 even when considering non-political content. Our ground-truth data come from individuals who used Voter Advice Applications (VAAs) during the 2015 and 2017 UK General Elections. Many respondents shared their VAA results on social media during the elections, and we use these social media accounts and their VAA results to train our system. We gather multiple datasets, including these users’ political and non-political Twitter activity.
After developing our classifier, we use the system to infer the political orientations of individuals sharing news articles to investigate the level of polarization in sharing different types of news and from different sources. Specifically, we investigate the sharing of political and sports news from The Telegraph (right-leaning), the BBC (centrist), and The Guardian (left-leaning).
We first present our classifier’s motivation, development, and evaluation before turning to the news sharing case study. Finally, we conclude with broader implications and future research directions.
Inferring political leaning
Political ideology prediction from digital trace data has become a core interest for researchers. Prior work has focused on predicting ideological stances using text data, network data, or a combination.
Social media, news outlets, and parliamentary discussions are the most popular sources of textual data: work on social media has used both users’ and politicians’ textual data Preoţiuc-Pietro et al. (2017); Conover et al. (2011). Newsgroups are also often political, and the political leanings of their text have been studied Kulkarni et al. (2018); Iyyer et al. (2014). Although legislators are already officially affiliated with a political party, several studies have classified the members of the US Congress Iyyer et al. (2014); Diermeier et al. (2012); Yu et al. (2008).
Text-based approaches focus on the words people choose to use Preoţiuc-Pietro et al. (2017), the style of their messages, and social-media specific attributes such as hashtags Weber et al. (2013); Conover et al. (2011). These approaches work well for polarised topics where the left and right use different phrases to talk about the same things: for example, “gun control” and “gun rights.”
Network approaches have built networks both from textual elements such as hashtags, mentions, and URLs Gu et al. (2016) and from social links Barberá (2015).
Within Twitter, there is a divide between using either retweets and mentions (e.g., Conover et al., 2011; Gu et al., 2016; Hale, 2014) or followers and friends (e.g., King et al., 2016; Barberá, 2015; Golbeck and Hansen, 2014; Pennacchiotti and Popescu, 2011; Zamal et al., 2012; Compton et al., 2014) to infer latent attributes. Networks based on mentions and retweets are often preferred over follower networks for two reasons. First, taking a snapshot of an extensive follower network is difficult given the Twitter API rate limits. Second, retweets and mentions often have greater recency, can be analysed temporally, and can form weighted networks (e.g., based on how many times users mention each other). As retweeting is more publicly visible than following an account, there is also a greater social cost to retweeting, suggesting that users consider more carefully whom they retweet or mention. Retweeting has been used as a sign of support (e.g., Conover et al., 2011; Wong et al., 2016), but it is important to note that some mentions, retweets, and hashtag uses specifically call attention to content with which a user disagrees.
As detailed below, we examine the use of both words (through topic modeling) in political and non-political text and network features (the accounts a user follows) to estimate the political leaning of social media users.
We find network features and political text reveal political leaning equally well, but network features have lower coverage. Inferring political leaning from non-political text is less accurate, but network data helps in that task, and performance is on par with previous work inferring political leaning from other sources.
Ground truth data
We collected all mentions of social media accounts and URLs associated with the three most-used voter advice applications (VAAs) during the 2015 and 2017 General Elections in the United Kingdom. (We used commercial firehose access for the 2015 election and elevated direct access during the 2017 election; we believe our dataset represents the entire population as closely as possible.) VAAs ask individuals a series of questions on political and policy issues to help them understand their preferences and how these align with political candidates running for office. Users of VAAs have the opportunity (but not the requirement) to share their results on social media.
The three VAAs from which we collected data were “I Side With You” (ISW), “Who Should You Vote For” (WSYVF), and “Vote Match” (VM). We found 4,118 unique users who shared their VAA results: 2,975 from ISW, 794 from VM, and 387 from WSYVF. Our data includes only original tweets, not retweets of VAA result URLs. Thirty-eight users shared results from multiple VAAs.
After collecting the data during the elections, we extracted all VAA result URLs in July 2018, downloaded the content of VAA result pages, and extracted users’ matches with political parties. We furthermore obtained the most recent 3,200 public tweets as of January 2019 of each user who shared a VAA result by using the user_timeline endpoint of Twitter’s public RESTful API.
All VAAs provided the voter’s alignment with individual political parties. WSYVF also gave a left–right score, but the exact details of this score are unknown. Therefore, we develop our own left–right or political leaning score using only the matches to parties, which can be applied consistently across all three VAAs.
We calculate political leaning scores by subtracting a user’s match to Labour from their match to the Conservatives (the UK’s two largest parties). We also experimented with summing a user’s matches with left-leaning parties (Labour, SNP, Greens, Plaid Cymru, Sinn Féin) and subtracting this sum from the sum of the user’s matches with right-leaning parties (Conservatives, UKIP, British National Party); this produced similar results but was less transparent, as not all parties are available in all constituencies. We normalize all scores into the [-1,1] interval for each platform before combining data from all the VAAs together.
Users with a political leaning score greater than zero are labelled “right-leaning,” while those with scores less than zero are labelled “left-leaning”; we drop all users with scores equal to zero. A small number of users shared results from multiple VAAs: we checked these for consistency, removed one inconsistent user, and assigned the remaining users with multiple results the mean of their political leaning scores across VAAs.
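To make the scoring concrete, here is a minimal sketch in Python. This is not the authors’ released code; the assumption that VAA party matches arrive as percentages in [0, 100] and all function names are ours.

```python
from typing import List, Optional

def leaning_score(conservative_match: float, labour_match: float) -> float:
    """Conservative match minus Labour match, scaled to [-1, 1].

    Assumes matches are percentages in [0, 100]; in the paper, each VAA's
    scores are normalized separately before platforms are combined.
    """
    return (conservative_match - labour_match) / 100.0

def leaning_label(score: float) -> Optional[str]:
    """Scores above zero are right-leaning, below zero left-leaning."""
    if score > 0:
        return "right"
    if score < 0:
        return "left"
    return None  # users scoring exactly zero are dropped

def combined_score(scores: List[float]) -> float:
    """Users with consistent results from several VAAs get the mean score."""
    return sum(scores) / len(scores)
```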
In total, our ground-truth dataset contains 2,694 unique users from all three VAAs. We collected the most recent 3,200 tweets of 1,921 of these users in January 2019.
We detect the language of all tweets using the Compact Language Detector 2 Sites (2013) and remove all users with fewer than 75% of their most recent 3,200 tweets in English. Last, we remove users who tweeted fewer than 100 times. The final dataset comprises 1,760 users: 1,396 left-leaning and 364 right-leaning (Figure 1).
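A sketch of this filtering step, assuming the pycld2 Python bindings to CLD2 and per-user lists of tweet texts; the helper names are ours:

```python
import pycld2 as cld2

def share_english(tweets):
    """Fraction of tweets whose best-guess language is English."""
    english = 0
    for text in tweets:
        try:
            _reliable, _bytes_found, details = cld2.detect(text)
        except cld2.error:
            continue  # skip tweets CLD2 cannot process
        # details[0] is the top guess: (languageName, languageCode, percent, score)
        if details[0][1] == "en":
            english += 1
    return english / max(len(tweets), 1)

def keep_user(tweets, min_tweets=100, min_english=0.75):
    """Apply the filters above: at least 100 tweets, at least 75% English."""
    return len(tweets) >= min_tweets and share_english(tweets) >= min_english
```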

Predictive variables
We build five datasets using users’ tweets and Twitter ‘friends’ (i.e., the accounts a user follows). The first three datasets are (1) network information, (2) political text, and (3) non-political text. The final two hybrid datasets are constructed by joining the textual and network data.
Textual data
We create two datasets of texts from tweets authored by the users. The first is a set of political text from tweets containing a political term (e.g., #ge2015, labour) primarily collected during the 2015 and 2017 elections with firehose or elevated streaming API access. The second dataset, non-political text, is mainly from users’ most recent 3,200 tweets collected with the RESTful API. We find that some recent tweets are political (e.g., discussing Brexit) and exclude tweets with political words from the non-political dataset.
Determining political tweets
To detect political tweets, we develop a two-step procedure. In the first step, we create an index from election-period and non-election-period word frequencies and use it to extract an initial list of political words. In the second step, we train a word embedding model to extend this list of political terms.
In the first step, we select words that appear in at least 250 different tweets during an election period. (The election periods are: GE2010, 2010-04-12 to 2010-06-06; GE2015, 2014-11-19 to 2015-06-07; and GE2017, 2017-04-18 to 2017-07-08.) Then, we count each word’s occurrences both within and outside of the election periods and normalize these two frequencies by the total tweet count of each period. Finally, we divide the out-of-election rate by the in-election rate to obtain a political index, so more political words receive lower scores.
In our case, setting a political index threshold of 0.25 yields an initial list of 316 words. However, this list contains several ambiguous words, so we manually examine all of them and remove 78 vague terms (#eurovision2015, #pensioners, youtuber, tuition, etc.) before continuing to the second step. We also add political words that were present throughout the whole period, including ‘brexit’, ‘eu’, ‘trump’, ‘clinton’, ‘sanders’, ‘johnson’, ‘gop’, and ‘referendum’.
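The first step can be sketched as follows, assuming tweets are already tokenised; the function name and data layout are ours:

```python
from collections import Counter

def political_word_candidates(in_tweets, out_tweets,
                              min_tweets=250, threshold=0.25):
    """Words concentrated in election periods (political index < threshold).

    in_tweets / out_tweets: lists of token lists for tweets sent inside /
    outside the election windows.
    """
    # Count each word at most once per tweet (the >=250-distinct-tweets rule).
    in_freq = Counter(w for tweet in in_tweets for w in set(tweet))
    out_freq = Counter(w for tweet in out_tweets for w in set(tweet))

    candidates = {}
    for word, n_in in in_freq.items():
        if n_in < min_tweets:
            continue
        # Normalize both frequencies by the period's total tweet count,
        # then divide the out-of-election rate by the in-election rate.
        index = (out_freq.get(word, 0) / len(out_tweets)) / (n_in / len(in_tweets))
        if index < threshold:
            candidates[word] = index
    return candidates
```

Candidates returned this way are then screened manually, as described above.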
In the second step, we use word embeddings to find the terms closest to the political words extracted in the first step. We train a skip-gram embedding model (skip window: 5, minimum frequency: 100) and, using Euclidean distance, choose the three closest words to each political term in the list. We then manually inspect all words again and remove ambiguous terms, resulting in 433 political words. The complete list of words is available in the supplemental materials.
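The expansion step might look as follows with gensim (version 4+; the paper does not name its embedding implementation, so the library choice is an assumption). gensim’s built-in neighbour lookup ranks by cosine similarity, so the Euclidean nearest neighbours are computed directly on the vectors:

```python
import numpy as np
from gensim.models import Word2Vec

def expand_political_words(tokenised_tweets, seed_words, k=3):
    """Return each seed's k Euclidean nearest neighbours in a skip-gram
    embedding (skip window 5, minimum frequency 100, as reported above)."""
    model = Word2Vec(sentences=tokenised_tweets, sg=1, window=5,
                     min_count=100, workers=4)

    def closest(term):
        if term not in model.wv:
            return []
        # Euclidean distance from `term` to every vocabulary vector.
        dists = np.linalg.norm(model.wv.vectors - model.wv[term], axis=1)
        ranked = [model.wv.index_to_key[i] for i in np.argsort(dists)]
        return [w for w in ranked if w != term][:k]

    return {w for seed in seed_words for w in closest(seed)}
```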
We label any tweet containing a term from this final list as political. Other tweets are labelled non-political. In the end, the non-political dataset consists of 3.66 million tweets, and the political dataset contains 888 thousand tweets (Figure 2).

Network data
We collected the accounts that users in our dataset followed (their ‘friends’) on Twitter for 1,742 of the 1,760 users who shared VAA results. We then create a binary matrix in which the columns are Twitter accounts followed by more than one user and the rows are the users in our dataset.
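A sketch of this construction with pandas, where friends_by_user (a mapping from user ID to the set of account IDs the user follows) is an assumed input:

```python
import pandas as pd

def build_follow_matrix(friends_by_user, min_followers=2):
    """Binary users-by-accounts matrix, keeping only accounts followed
    by more than one user in the dataset."""
    edges = pd.DataFrame(
        [(user, friend) for user, friends in friends_by_user.items()
         for friend in friends],
        columns=["user", "friend"],
    )
    matrix = pd.crosstab(edges["user"], edges["friend"]).clip(upper=1)
    return matrix.loc[:, matrix.sum(axis=0) >= min_followers]
```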
Approach
Our goal is to estimate political leaning at the user level, not the tweet level. We therefore sort each user’s tweets by date and concatenate them into two text documents per user: one of political text and one of non-political text.
To construct document-feature matrices (DFMs), we remove URLs and non-alphabetic characters from the text, stem all words using Porter’s algorithm Porter (1980), and exclude words on the SMART stop word list Lewis et al. (2004). Finally, we use unigrams, bigrams, and trigrams as the input features to our classifier; each document is the combined tweets of one user.
To reduce the feature dimensionality, we remove sparse terms, setting the sparsity threshold to 0.9 for the political dataset and 0.85 for the non-political dataset. Consequently, the political DFM consists of 1,760 users and 4,676 features, and the non-political DFM contains 1,760 users and 6,613 features. We also trim the network DFM with a sparsity threshold of 0.88, leaving 1,742 users and 55 features (followed accounts).
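A sketch of the preprocessing and trimming pipeline; NLTK’s Porter stemmer and scikit-learn’s CountVectorizer stand in for the authors’ actual toolchain, and a sparsity threshold of 0.9 is translated into “appears in at least 10% of documents” (min_df=0.10):

```python
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()

def clean(document, stop_words):
    """Strip URLs and non-alphabetic characters, drop stop words, stem."""
    document = re.sub(r"https?://\S+", " ", document.lower())
    document = re.sub(r"[^a-z\s]", " ", document)
    tokens = (t for t in document.split() if t not in stop_words)
    return " ".join(stemmer.stem(t) for t in tokens)

def build_dfm(user_documents, stop_words, min_doc_share=0.10):
    """user_documents: one concatenated string of tweets per user.

    min_doc_share=0.10 mimics trimming at sparsity 0.9; 0.15 would
    correspond to the non-political DFM's 0.85 threshold.
    """
    vectoriser = CountVectorizer(ngram_range=(1, 3), min_df=min_doc_share)
    dfm = vectoriser.fit_transform(clean(d, stop_words) for d in user_documents)
    return dfm, vectoriser
```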
Finally, our dataset contains more left-leaning than right-leaning users, so we take all of the right-leaning users and an equal number of randomly selected left-leaning users. We combine and shuffle these two subsets, yielding political and non-political sample datasets of 728 users balanced across our two classes (left-leaning and right-leaning). When network data is used on its own or in combination with text, the numbers are slightly lower because network data was not available for all users.
Table 1: Average F1 scores from 10-fold cross-validation on the training data, across five balanced samples.

Dataset | NB | NN | SVMlin | SVMpoly | SVMrad
---|---|---|---|---|---
net | 0.27 | 0.75 | 0.73 | 0.75 | 0.75
non-pol | 0.54 | 0.64 | 0.60 | 0.65 | 0.65
non-pol+net | 0.59 | 0.70 | 0.68 | 0.71 | 0.72
pol | 0.66 | 0.75 | 0.74 | 0.73 | 0.72
pol+net | 0.67 | 0.69 | 0.66 | 0.71 | 0.67
Table 2: F1, precision (P), and recall (R) on the held-out test data, averaged across five balanced samples.

Classifier | net F1 | net P | net R | non-pol F1 | non-pol P | non-pol R | non-pol+net F1 | non-pol+net P | non-pol+net R | pol F1 | pol P | pol R | pol+net F1 | pol+net P | pol+net R
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Naïve Bayes | 0.29 | 0.17 | 1.00 | 0.64 | 0.63 | 0.67 | 0.70 | 0.69 | 0.72 | 0.75 | 0.79 | 0.71 | 0.73 | 0.75 | 0.71
Neural Network | 0.77 | 0.71 | 0.83 | 0.69 | 0.67 | 0.73 | 0.77 | 0.74 | 0.80 | 0.81 | 0.80 | 0.82 | 0.74 | 0.70 | 0.78
SVMlin | 0.74 | 0.66 | 0.85 | 0.65 | 0.59 | 0.78 | 0.72 | 0.66 | 0.80 | 0.79 | 0.76 | 0.83 | 0.74 | 0.68 | 0.82
SVMpoly | 0.77 | 0.72 | 0.83 | 0.69 | 0.72 | 0.71 | 0.78 | 0.74 | 0.83 | 0.74 | 0.78 | 0.79 | 0.75 | 0.70 | 0.81
SVMrad | 0.78 | 0.74 | 0.83 | 0.70 | 0.81 | 0.67 | 0.75 | 0.70 | 0.81 | 0.74 | 0.77 | 0.78 | 0.73 | 0.65 | 0.83
Classification procedure
Our classification procedure follows these steps: (1) we split each sample dataset into training (80%) and test (20%) sets; (2) we create a topic model using only the training data; (3) we train the political leaning classifiers on the training data using 10-fold cross-validation; (4) we apply the topic model to the test data; and (5) we predict the leanings of the test data. Following this procedure, we completely isolate the test dataset from both topic generation and classifier training.
We use Structural Topic Models (STM) to create our topic models. However, we do not provide prevalence covariates, to avoid biasing topic generation; we feed the topic modelling algorithm only the text data. In this case, the algorithm becomes an implementation of the Correlated Topic Model (CTM) Roberts et al. (2019).
We treat the number of topics as a hyperparameter that we tune through cross-validation on the training data, and find 150 topics gives the most meaningful results given the large corpus (3.8M tweets) and long time period covered (approx. nine years). We use spectral initialization, as advised by the package authors, since it generates ‘better’ and more ‘consistent’ topics by using “a spectral decomposition (non-negative matrix factorization) of the word co-occurrence matrix" Roberts et al. (2019). Once the model is fit, we extract the theta distributions, which give the probability of each document belonging to each topic, and use them to classify documents (users).
To infer political leanings and compare the results, we train and compare neural network (NN), Naïve Bayes (NB), and support-vector machine (SVM) classifiers. The SVMs are trained with linear, radial and polynomial kernels. SVM, NB and NN are three of the most popular machine learning methods to classify text, although the ‘best’ approach depends heavily on the data and the specific task being performed Wang and Manning (2012); Ng and Jordan (2001).
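In outline, steps (1) to (5) might look as follows. The paper fits its topic model with the R stm package; scikit-learn’s LatentDirichletAllocation serves here only as a stand-in to illustrate how the test split is isolated from topic fitting, so the topic model choice and parameter defaults are assumptions:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

def run_pipeline(dfm, labels, n_topics=150):
    # (1) Hold out 20% of users as a test set.
    X_train, X_test, y_train, y_test = train_test_split(
        dfm, labels, test_size=0.2, stratify=labels, random_state=0)

    # (2) Fit the topic model on the training documents only.
    topic_model = LatentDirichletAllocation(n_components=n_topics,
                                            random_state=0)
    theta_train = topic_model.fit_transform(X_train)

    # (3) Train a classifier on theta with 10-fold cross-validation.
    clf = SVC(kernel="rbf", probability=True, random_state=0)
    cv_f1 = cross_val_score(clf, theta_train, y_train, cv=10,
                            scoring="f1_macro")
    clf.fit(theta_train, y_train)

    # (4) Apply the fitted topic model to the held-out documents, and
    # (5) predict their leanings.
    theta_test = topic_model.transform(X_test)
    return cv_f1.mean(), clf.predict(theta_test), y_test
```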
After performing a classification, we can apply a threshold to the probabilistic results from the classifier. If the probability of a user’s classification is above the threshold, we assign them the predicted label; otherwise, we assign a value of ‘unknown.’ Table 3 reports results for the best-performing model on each dataset; the ‘Unknown’ column gives the percentage of users in each classification task with probabilities below the stated threshold.
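Applying the threshold is straightforward given a classifier with a scikit-learn-style predict_proba interface (a sketch; the ‘unknown’ label matches Table 3):

```python
import numpy as np

def predict_with_threshold(clf, theta, threshold=0.7):
    """Keep predictions whose winning class probability clears `threshold`;
    mark all others as 'unknown'."""
    proba = clf.predict_proba(theta)
    labels = np.asarray(clf.classes_, dtype=object)[proba.argmax(axis=1)]
    labels[proba.max(axis=1) < threshold] = "unknown"
    return labels
```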
Classification results
We first calculate F1 scores using 10-fold cross-validation on the training data (Table 1). For those results, we repeat the classification with five different balanced random samples and report the average F1 scores across all folds and samples. We also report F1, precision, and recall scores on the entirely held-out test data (Table 2). This test data had no role in creating the topic models or other features and better approximates real-world performance.
In Table 1, we compare the performance of our models on the training data. The neural network, SVMpoly, and SVMrad models perform similarly on the network dataset, and the SVMlin and neural network models perform comparably on the political text and network-only datasets. The F1 scores for the dataset combining political text and network data are generally lower than for either alone, suggesting the signals from these two sources partly overlap.
In Table 2, we compare the neural network, SVM, and NB model performances on the unseen test data, averaged across five balanced samples. The best performing classifier changes with the task: the neural network model performs best on the political dataset, the SVMpoly classifier has the highest F1 score for both hybrid datasets, and the SVMrad model has the highest F1 score for the non-political and network datasets.
Our main objective in this paper is to understand how well political leaning can be predicted from non-political text, so we further explore how different probability thresholds can be applied to the best performing classifiers’ outputs and report these in Table 3.
For the non-political dataset we select the SVMrad model, and for the non-political+network dataset we select the SVMpoly model. Classifying political leaning from non-political tweets is more challenging than using political tweets, and incorporating network data increases coverage and accuracy in the non-political case. However, even without network data, it is possible to infer a political leaning for nearly half of the users with high performance (for a probability threshold of 0.51, the F1 score is 0.90, but 56% of users are classified as having an unknown political leaning).
Estimating political leaning from political tweets has a similar F1 score to using network data alone. However, combining network and political text data results in lower-confidence estimates (and a higher percentage of unknown users than either text or network data alone at each threshold).
Applying a probability threshold makes comparison across models easier: users whose classifications have lower certainty are excluded and marked as unknown. The impact of each threshold is reasonably consistent across input datasets: F1 scores increase with higher thresholds, but so too do the percentages of users classified as unknown. For example, for the non-political input dataset, a 0.62 probability threshold increases the F1 score by 23 percentage points but also marks 97% of users in the dataset as unknown. The effects are similar for other input datasets but less extreme.
Table 3: Performance of the best-performing model for each input dataset at increasing probability thresholds. ‘Unknown’ gives the share of users whose classification probability falls below the threshold.

Dataset (model) | Prob. threshold | F1 | Prec. | Rec. | Unknown
---|---|---|---|---|---
net (SVMrad) | 0.50 | 0.86 | 0.85 | 0.87 | 0.00
net (SVMrad) | 0.64 | 0.90 | 0.88 | 0.92 | 0.23
net (SVMrad) | 0.68 | 0.95 | 0.95 | 0.95 | 0.41
non-pol (SVMrad) | 0.50 | 0.77 | 0.75 | 0.79 | 0.00
non-pol (SVMrad) | 0.51 | 0.90 | 0.93 | 0.87 | 0.56
non-pol (SVMrad) | 0.62 | 1.00 | 1.00 | 1.00 | 0.97
non-pol+net (SVMpoly) | 0.50 | 0.85 | 0.82 | 0.88 | 0.00
non-pol+net (SVMpoly) | 0.70 | 0.90 | 0.90 | 0.90 | 0.36
non-pol+net (SVMpoly) | 0.78 | 0.95 | 0.95 | 0.95 | 0.47
pol (NN) | 0.50 | 0.85 | 0.83 | 0.87 | 0.00
pol (NN) | 0.84 | 0.90 | 0.88 | 0.92 | 0.27
pol (NN) | 0.94 | 0.95 | 0.92 | 0.97 | 0.41
pol+net (SVMpoly) | 0.50 | 0.81 | 0.75 | 0.87 | 0.00
pol+net (SVMpoly) | 0.74 | 0.91 | 0.90 | 0.92 | 0.52
pol+net (SVMpoly) | 0.76 | 0.96 | 0.97 | 0.94 | 0.59
Post-hoc analysis
We extract the eleven most important topics and the ten most influential Twitter accounts from our best performing non-political+network SVMpoly model.
Table 4 shows the ten most significant Twitter accounts in the SVMpoly non-political+network model. Using our ground-truth dataset, we compute the normalized percentages of left- and right-leaning users following each account and find that a higher proportion of the users following @10DowningStreet, @realDonaldTrump, and @JeremyClarkson are right-leaning. The remaining accounts are followed more by left-leaning users than right-leaning users. Although most of these accounts are political, two are not: Jeremy Clarkson, the former Top Gear presenter, and Charlie Brooker, the creator of Black Mirror.
Table 4: The ten most significant Twitter accounts in the SVMpoly non-political+network model and the normalized percentage of left- and right-leaning ground-truth users following each.

Account | Screen Name | Left (%) | Right (%)
---|---|---|---
Jeremy Corbyn | @jeremycorbyn | 47 | 11 |
Owen Jones | @OwenJones84 | 33 | 4 |
UK Prime Minister | @10DowningStreet | 17 | 36 |
Donald J. Trump | @realDonaldTrump | 16 | 39 |
Charlie Brooker | @charltonbrooker | 30 | 8 |
Caroline Lucas | @CarolineLucas | 25 | 2 |
Jeremy Clarkson | @JeremyClarkson | 15 | 35 |
The Green Party | @TheGreenParty | 20 | 3 |
Jon Snow | @jonsnowC4 | 25 | 7 |
NHS Million | @NHSMillion | 21 | 3 |
After detecting the most important topics, we assign a label to each based on its most prevalent words and estimate the effect of political leaning on topic prevalence. Figure 3 shows the contrast between leanings. The top words belonging to these topics are displayed in Table 5, and a complete list of all topics is available in the supplemental materials. The word lists are developed in two steps. First, we list the top 10 words by FREX, LIFT, and log scores. Then we concatenate these three lists in the given order and extract the top 15 unique words. (“FREX is the weighted harmonic mean of the word’s rank in terms of exclusivity and frequency,” “LIFT weights words by dividing by their frequency in other topics, therefore giving higher weight to words that appear less frequently in other topics,” and “Score divides the log frequency of the word in the topic by the log frequency of the word in other topics” Roberts et al. (2019).)
Figure 3 shows that topics discussing social democratic values are more prevalent among left-leaning users. A gardening-related topic is also more left-leaning, while the Premier League and a pollution-related topic are more right-leaning. It should be noted that most 95% confidence intervals cross zero, indicating that any left–right distinction is subtle.

Table 5: Top (stemmed) words for the most important topics in the non-political model.

Topic Name | Features
---|---
London/Beer | unit_kingdom, kingdom, ben, unit, beer, pub, london, januari, station, tim, wimbledon, cheat, rip, slice, peak |
Pollution | plastic, contain, ocean, pollut, dr, usa, sea, recycl, dwp, path, locat, commit, toe, current, planet |
Follow/Unfollow | unfollow, automat, peopl_follow, stat, follow, check, person, ireland, mention, reach, irish, track, retweet, big_fan, joe |
Technology/EU | android, dj, ee, eu, app, greec, independ, mobil, scotland, ipad, euro, europ, io, beat, scottish |
Entertainment | beth, xx, haha, xxx, deploy, sean, artist, nicki, wire, hahaha, jay, spain, boyfriend, til, willi |
Africa/Corruption | ni, nigeria, lawyer, african, corrupt, parcel, bbcradio, lol, journo, arrest, wk, individu, brazil, pregnant, id |
MUFC/Premier League | mufc, manutd, unit, rooney, manchest_unit, utd, goal, golf, moy, mourinho, ronaldo, pogba, score, fifa, leagu |
Sugar Tax | frog, level, skill, c***, sugar, committe, divin, nhs, suspect, holland, hunter, jeremi, bullshit, jeremi_hunt, ww |
Premier League | west_ham, spur, ham, salah, everton, kane, chelsea, klopp, pogba, lukaku, goal, mourinho, tottenham, midfield, liverpool |
Soc. Dem. Values | foodbank, poverti, nhs, wage, worker, crisi, fund, wick, solidar, food_bank, grenfel, bank, auster, polit, privat |
Gardening | bloom, garden, cute, layer, kinda, ship, charact, rich, bless, hug, glad, mom, damn, lmao, gonna |
Effect of political activity
After labelling political tweets as described above, we compute a political activity index by dividing the number of political tweets by the total number of tweets per user. To determine whether there is a relationship between political activity and classifier performance, we apply a Pearson correlation test to the classifier probabilities and our measure of political activity. For the best performing non-political textual model (SVMrad), we find only very weak correlations between the classifier probability and political activity (), the number of tokens (), and leaning scores (). The best-performing political textual model (NN) similarly has weak correlations between its predicted probabilities and political activity (), the number of tokens (), and leaning scores ().
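The index and test are simple to reproduce; below is a sketch with SciPy, where the inputs are aligned per-user lists and the function name is ours:

```python
from scipy.stats import pearsonr

def activity_correlation(classifier_probabilities, n_political, n_total):
    """Pearson correlation between classifier confidence and the political
    activity index (political tweets divided by all tweets per user)."""
    activity = [p / t for p, t in zip(n_political, n_total)]
    return pearsonr(classifier_probabilities, activity)  # (r, p-value)
```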
Case study: Polarization and news sharing
Having developed a classifier that successfully predicts users’ left–right political leaning even with non-political text as input, we now apply this classifier in a short case study. Social media has become intertwined with ideas of political polarization and echo chambers in popular culture Pariser (2011). There are, of course, many studies that show polarization in explicitly political contexts, such as the hyperlinks of political bloggers Adamic and Glance (2005), the sharing of political news Flaxman et al. (2016), and Twitter activity in the context of politics Barberá (2015); Hong and Kim (2016).
Despite this, few studies have examined the polarization and echo chamber hypotheses in diverse contexts (for more, see Dubois and Blank, 2018). The Internet offers access to a wide variety of information sources, and both enthusiasm for politics and consumption of more media sources correlate negatively with the likelihood of being in an echo chamber Dubois and Blank (2018); Barberá et al. (2015).
In this case study, we apply our non-political classifier to the sharing of news from three UK media outlets: The Telegraph (right-leaning), the BBC (centrist), and The Guardian (left-leaning). The case study fulfils two objectives. First, it demonstrates the face validity of our classifier by matching common expectations: political news from The Telegraph should be shared chiefly by right-leaning users, political news from The Guardian should be shared mostly by left-leaning users, and the BBC should be the least polarized of the three. Second, the case study advances the scholarly conversation on polarization by examining the sharing of non-political news, namely sports news, from these three outlets. Our results show that many right-leaning users consume sports news from The Guardian, suggesting more right-leaning users engage with at least the sports section of The Guardian than conventional wisdom would expect.
Data and methods
We examine the number of left- and right-leaning accounts sharing news in a two-by-three design. Our three news sources are The Telegraph, the BBC, and The Guardian, and our news types are political and sports news.
We first use elevated Twitter Streaming API access to collect tweets containing URLs from theguardian.com, telegraph.co.uk, and bbc.co.uk. We then manually examine these URLs to identify URL patterns that are explicitly associated with political news and sports news. For example, all URLs containing bbc.co.uk/sport/ are labelled as sports news from the BBC, and all URLs containing theguardian.com/politics/2018/ are labelled as political news from The Guardian. A complete list of URL patterns is available in the supplemental materials.
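Labelling a shared link then reduces to substring matching; the six patterns below are those listed in the supplemental materials, while the function itself is an illustrative sketch:

```python
URL_PATTERNS = {
    ("Guardian", "political"):  "theguardian.com/politics/2018/",
    ("Guardian", "sport"):      "theguardian.com/sport/2018/",
    ("BBC", "political"):       "bbc.co.uk/news/uk-politics-",
    ("BBC", "sport"):           "bbc.co.uk/sport/",
    ("Telegraph", "political"): "telegraph.co.uk/politics/2018/",
    ("Telegraph", "sport"):     "telegraph.co.uk/football/2018/",
}

def label_url(url):
    """Return (source, news type) for a shared URL, or None if nothing matches."""
    for (source, news_type), pattern in URL_PATTERNS.items():
        if pattern in url:
            return source, news_type
    return None
```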
Returning to the Twitter Streaming API, we collect all instances of URLs matching any of our patterns shared between 1 September 2018 and 8 September 2018. We then query the 3,200 most recent tweets of each user and collect the friends of each user in our dataset using the Twitter RESTful API statuses/user_timeline and friends/ids endpoints.
Each user’s recent tweets and friends are used as input to estimate their left–right political leaning. We clean the text as described in the development of our classifier and combine all tweets from a given user into one document per user. The resulting DFMs are extremely sparse (approx. 25M features), so we remove terms whose total frequency is lower than three, which makes the DFM dimensions computationally feasible (approx. 5M features). We then drop features that are not present in our non-political DFM, which also removes political terms. In the last step, we apply the non-political topic model to the news data to predict the political leaning of each user.
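Restricting the news DFM to the training vocabulary before applying the fitted topic model might be sketched as follows, assuming SciPy sparse matrices; zero columns are inserted for training terms absent from the news corpus, and the function name is ours:

```python
from scipy import sparse

def align_to_training_vocabulary(news_dfm, news_vocab, training_vocab):
    """Reorder and restrict the news DFM's columns to the training DFM's
    vocabulary so the fitted topic model can be applied directly."""
    position = {term: j for j, term in enumerate(news_vocab)}
    n_docs = news_dfm.shape[0]
    columns = [
        news_dfm[:, position[term]] if term in position
        else sparse.csr_matrix((n_docs, 1))  # zero column for missing terms
        for term in training_vocab
    ]
    return sparse.hstack(columns).tocsr()
```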
Our setup matches the non-political+network approach described above. To classify users sharing political and sports news, we select the SVMpoly classifier that yielded the best performance on the non-political+network task and set the probability threshold to 0.70.
News sharing results
The results of our classification are shown in Table 6. They conform with our hypothesis that, for political news, the political leaning of users matches the political leaning of the publisher. For example, among users sharing political news who were classified as left-leaning, over 75% shared articles from the left-leaning Guardian. Similarly, over half of the users who shared political news links from the right-leaning Telegraph are classified as right-leaning.
In contrast to politics, we find The Guardian is the most popular source for sports news among users classified as either left- or right-leaning. In general, right-leaning users share more sports news than left-leaning users. Even so, the number of right-leaning users sharing sports news from The Telegraph is 8.5 times the number of left-leaning users sharing articles from that source, the largest proportional difference in the dataset. We suspect sports news sharing is influenced by availability: The Telegraph has a paywall while the BBC and The Guardian do not, so it makes sense that subscribers paying for The Telegraph would get both political and sports news from it while non-subscribers look elsewhere. That many right-leaning users share sports news from the left-leaning but free-to-access Guardian suggests that factors beyond political leaning, such as availability and cost, heavily influence the news sources people read.
Overall, the case study demonstrates our classifier’s face validity: it detects the expected leanings of users sharing political news. At the same time, it suggests that the sharing of sports news is not as polarized and points toward new avenues of study made possible by a classifier that can predict political leaning from non-political text.
Table 6: Number of users classified as left-leaning, right-leaning, or unknown who shared political and sports news from each source.

Leaning | Guardian | BBC | Telegraph | Total
---|---|---|---|---
Political | | | |
Left | 1,223 | 240 | 161 | 1,624
Right | 812 | 377 | 1,230 | 2,419
Unknown | 650 | 238 | 334 | 1,222
Sport | ||||
Left | 314 | 295 | 44 | 653 |
Right | 1,631 | 869 | 378 | 2,878 |
Unknown | 956 | 564 | 115 | 1,635 |
Discussion
Polarization is often studied in explicitly political contexts; however, understanding the actual effects of polarization requires researchers to explore non-political contexts as well. This, in turn, requires the ability to estimate political leaning in these non-political contexts. Building on the concept of ‘lifestyle politics’ DellaPosta et al. (2015), this paper has shown it is possible to estimate political leaning from non-political text.
What exactly counts as non-political text is unclear. Our preliminary analysis found that tweets sent outside of election periods were still often political in nature (e.g., containing discussion of the UK’s relationship with Europe). In response, we developed an approach to define and expand a list of political keywords and only allowed tweets without these keywords into our tasks with non-political text. The topics displayed in Table 5 suggest this worked reasonably well. However, a qualitative inspection of the topics reveals that a small number of political tweets remain in the non-political dataset: eight of the 150 topics (5.3%) are political (Israel–Palestine, Scottish independence, Brexit, Macron–Merkel, the Green Party, UK Politics, US Politics, Westminster). Nonetheless, these form a tiny part of the dataset and cannot be responsible for the performance of our classifiers. Indeed, applying the best performing classifier trained on political text to the non-political text results in accuracy no better than random chance.
We find only weak correlations between classifier performance and either political activity or political leaning. This suggests that individuals with different activity levels and at different points on the left–right spectrum all leave traces of their political leaning.
The use of Voter Advice Applications (VAAs) for ground truth data represents an exciting way to build larger datasets for training and testing machine learning classifiers. In contrast to human coders (e.g., Conover et al., 2011) or crowdsourcing (e.g., Iyyer et al., 2014), VAA results are self-reported. Preoţiuc-Pietro et al. (2017) used self-report surveys, but VAAs can yield larger sample sizes, provide a longer assessment, and rely on the genuine motivation of users, who also learn how their political preferences align with major parties. Nonetheless, it will be important to understand any biases in who shares VAA results on social media platforms. For instance, it may be that only the most politically engaged users share VAA results on social media.
In addition to the variety of input data used, prior research has also tried various machine learning approaches to classifying political leaning including SVM, Naïve Bayes, and neural networks. Reported performance scores range from 60% to 90% Preoţiuc-Pietro et al. (2017); Gu et al. (2016); Jelveh et al. (2014); Iyyer et al. (2014); Yu et al. (2008); Kulkarni et al. (2018). Thus, our F1 score of 0.81 for political text is good, and achieving an F1 score of 0.7 for non-political text alone or 0.75 for non-political text and network data is in line with other work using explicitly political text.
Ideology, of course, is much broader than a simple left–right or liberal–conservative spectrum. Our work, therefore, complements approaches such as Graham et al. (2012) who apply moral foundation theory and find liberals and conservatives do not differ dramatically on moral foundations. The topics we find echo the moral foundations Graham et al. identify: liberals give higher importance to individualizing foundations—harm/care and fairness/reciprocity—and conservatives give higher importance to binding foundations—in-group/loyalty, authority/respect, and purity/sanctity. That the topic on pollution is more associated with the right may seem counter-intuitive, but we note the effect sizes are small and the analysis only considers prevalence, not attitude or sentiment.
We find the accounts that a user follows often reveal that user’s political leaning. While some of these accounts are explicitly political, others, such as Jeremy Clarkson, a former Top Gear presenter, and Charlie Brooker, the creator of Black Mirror, are not. In other words, interest in Top Gear/The Grand Tour or Black Mirror may hint at individuals’ political leaning more than expected. It may be unexpected that following an official government account such as that of the UK Prime Minister is so politically revealing, but it is worth noting that, in the UK context, the Conservative Party has been in power since 2010.
Although our F1 scores indicate there are left–right differences, claiming that all right-leaning or all left-leaning users behave alike would be a gross oversimplification. There may be significant differences in other countries and cultures, and our data is not representative. We cannot generalize the results of this study to explain complex structures such as conservative or social-democratic ideologies.
Our brief case study of news sharing hints at promising ways in which classifiers of non-political content may be used. We found that political news sharing on Twitter is associated with the political leaning of users, but that sports news sharing is less polarised. Many accounts classified as right-leaning shared sports news from The Guardian, a left-leaning source. Telegraph sports news, however, was still heavily associated with right-leaning users. We suspect paywalls partially drive this result: The Telegraph is the only source with a paywall in our data. It may be that many right-leaning users look to The Guardian and other sources for sports news rather than pay for access to The Telegraph. On the other hand, individuals who already pay for access to The Telegraph likely also choose to get their sports news there.
The level of polarization observed on social media thus depends in part on the type of content examined. As mentioned, most studies examine polarization in the context of the sharing of explicitly political content. However, our results indicate that a more holistic view of the content shared beyond politics may show less polarization. Our results also suggest that editorial decisions such as having a paywall or not affect the level of polarization observed in link-sharing on social media.
Conclusions
As political orientation influences and correlates with many aspects of non-political life, there is good reason to expect that political leaning can be inferred even from non-political text.
We first developed a classifier using political text from tweets primarily associated with general elections in the UK, achieving an F1 score of 0.81. We then collected recent tweets from users outside of any election period and removed tweets with political keywords. Using this non-political text data to estimate the political leaning of users, we were still able to achieve an F1 score of 0.70. Furthermore, incorporating the network information of these users alongside the textual features from their tweets further increased the F1 score.
Using our classifier, we were able to classify the political leaning of users sharing political and sports news on Twitter from three UK national news sources. Our analysis indicated the importance of examining polarization in political and non-political contexts.
This study has shown the possibility for the classification of political leaning from non-political text and highlighted the importance of studying political leaning in non-political contexts. We hope that this paper can help spur research to understand polarization in non-political contexts.
References
- Adamic and Glance (2005) Adamic LA, Glance N (2005) The political blogosphere and the 2004 u.s. election: Divided they blog. In: Proceedings of the 3rd International Workshop on Link Discovery, ACM, New York, NY, USA, LinkKDD ’05, pp 36–43, doi:10.1145/1134271.1134277, URL http://doi.acm.org/10.1145/1134271.1134277
- Ahn et al. (2014) Ahn WY, Kishida K, Gu X, Lohrenz T, Harvey A, Alford J, Smith K, Yaffe G, Hibbing J, Dayan P, Montague PR (2014) Nonpolitical images evoke neural predictors of political ideology. Current Biology 24(22):2693–2699, doi:10.1016/j.cub.2014.09.050, URL http://www.sciencedirect.com/science/article/pii/S0960982214012135
- Barberá (2015) Barberá P (2015) Birds of the same feather tweet together: Bayesian ideal point estimation using Twitter data. Political Analysis 23(1):76–91
- Barberá et al. (2015) Barberá P, Jost JT, Nagler J, Tucker JA, Bonneau R (2015) Tweeting from left to right: Is online political communication more than an echo chamber? Psychological Science 26(10):1531–1542, doi:10.1177/0956797615594620, URL http://journals.sagepub.com/doi/10.1177/0956797615594620
- Chen et al. (2015) Chen X, Wang Y, Agichtein E, Wang F (2015) A comparative study of demographic attribute inference in twitter p 4
- Compton et al. (2014) Compton R, Jurgens D, Allen D (2014) Geotagging one hundred million twitter accounts with total variation minimization. In: 2014 IEEE International Conference on Big Data (Big Data), IEEE, pp 393–401, doi:10.1109/BigData.2014.7004256, URL http://ieeexplore.ieee.org/document/7004256/
- Conover et al. (2011) Conover MD, Goncalves B, Ratkiewicz J, Flammini A, Menczer F (2011) Predicting the Political Alignment of Twitter Users. In: 2011 IEEE Third International Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third International Conference on Social Computing, pp 192–199, doi:10.1109/PASSAT/SocialCom.2011.34
- DellaPosta et al. (2015) DellaPosta D, Shi Y, Macy M (2015) Why do liberals drink lattes? American Journal of Sociology 120(5):1473–1511, doi:10.1086/681254, URL https://doi.org/10.1086/681254
- Diermeier et al. (2012) Diermeier D, Godbout JF, Yu B, Kaufmann S (2012) Language and ideology in Congress. British Journal of Political Science 42(1):31–55, doi:10.1017/S0007123411000160, URL https://www.cambridge.org/core/journals/british-journal-of-political-science/article/language-and-ideology-in-congress/1063F5509BC2ABC3F9A0E164E58157EE
- Dubois and Blank (2018) Dubois E, Blank G (2018) The echo chamber is overstated: the moderating effect of political interest and diverse media. Information, Communication & Society 21(5):729–745, doi:10.1080/1369118X.2018.1428656, URL https://doi.org/10.1080/1369118X.2018.1428656
- Eysenck (1964) Eysenck HJ (1964) Sense and nonsense in psychology. Penguin Books
- Flaxman et al. (2016) Flaxman S, Goel S, Rao JM (2016) Filter bubbles, echo chambers, and online news consumption. Public Opinion Quarterly 80(S1):298–320, doi:10.1093/poq/nfw006, URL https://doi.org/10.1093/poq/nfw006
- Golbeck and Hansen (2014) Golbeck J, Hansen D (2014) A method for computing political preference among Twitter followers. Social Networks 36:177–184, doi:10.1016/j.socnet.2013.07.004, URL http://www.sciencedirect.com/science/article/pii/S0378873313000683
- Graham et al. (2012) Graham J, Nosek BA, Haidt J (2012) The moral stereotypes of liberals and conservatives: Exaggeration of differences across the political spectrum. PLoS ONE 7(12):e50092, doi:10.1371/journal.pone.0050092, URL https://dx.plos.org/10.1371/journal.pone.0050092
- Gu et al. (2016) Gu Y, Chen T, Sun Y, Wang B (2016) Ideology Detection for Twitter Users with Heterogeneous Types of Links. arXiv:161208207 [cs] URL http://arxiv.org/abs/1612.08207, arXiv: 1612.08207
- Hale (2014) Hale SA (2014) Global connectivity and multilinguals in the twitter network. In: Proceedings of the 32nd annual ACM conference on Human factors in computing systems - CHI ’14, ACM Press, pp 833–842, doi:10.1145/2556288.2557203, URL http://dl.acm.org/citation.cfm?doid=2556288.2557203
- Hong and Kim (2016) Hong S, Kim SH (2016) Political polarization on Twitter: Implications for the use of social media in digital governments. Government Information Quarterly 33(4):777–782
- Iyyer et al. (2014) Iyyer M, Enns P, Boyd-Graber J, Resnik P (2014) Political ideology detection using recursive neural networks. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, pp 1113–1122, doi:10.3115/v1/P14-1105, URL http://aclweb.org/anthology/P14-1105
- Jelveh et al. (2014) Jelveh Z, Kogut B, Naidu S (2014) Detecting latent ideology in expert text: Evidence from academic papers in economics. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, pp 1804–1809, doi:10.3115/v1/D14-1191, URL http://aclweb.org/anthology/D14-1191
- King et al. (2016) King AS, Orlando FJ, Sparks DB (2016) Ideological Extremity and Success in Primary Elections: Drawing Inferences From the Twitter Network. Social Science Computer Review 34(4):395–415, doi:10.1177/0894439315595483, URL https://doi.org/10.1177/0894439315595483
- Kitschelt (1994) Kitschelt H (1994) The Transformation of European Social Democracy by Herbert Kitschelt. doi:10.1017/CBO9780511622014, URL /core/books/transformation-of-european-social-democracy/C92F284FC17302253C3B5B14123BBA80
- Kulkarni et al. (2018) Kulkarni V, Ye J, Skiena S, Wang WY (2018) Multi-view models for political ideology detection of news articles URL http://arxiv.org/abs/1809.03485, 1809.03485
- Lewis et al. (2004) Lewis DD, Yang Y, Rose TG, Li F (2004) RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research 5:361–397
- Liu et al. (2019) Liu Y, Cheng D, Pei T, Shu H, Ge X, Ma T, Du Y, Ou Y, Wang M, Xu L (2019) Inferring gender and age of customers in shopping malls via indoor positioning data. Environment and Planning B: Urban Analytics and City Science, doi:10.1177/2399808319841910, URL https://doi.org/10.1177/2399808319841910
- Ng and Jordan (2001) Ng AY, Jordan MI (2001) On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic (NIPS), MIT Press, pp 841–848, URL http://dl.acm.org/citation.cfm?id=2980539.2980648
- Nguyen et al. (2013) Nguyen D, Gravel R, Trieschnigg D, Meder T (2013) "how old do you think i am?" a study of language and age in twitter. In: ICWSM
- Nguyen et al. (2016) Nguyen D, Doğruöz AS, Rosé CP, de Jong F (2016) Computational sociolinguistics: A survey. Computational Linguistics 42(3):537–593, doi:10.1162/COLI_a_00258, URL https://doi.org/10.1162/COLI_a_00258
- Pariser (2011) Pariser E (2011) The filter bubble: What the Internet is hiding from you. Viking, London
- Pennacchiotti and Popescu (2011) Pennacchiotti M, Popescu AM (2011) Democrats, republicans and starbucks afficionados: user classification in twitter. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), ACM
- Porter (1980) Porter MF (1980) An algorithm for suffix stripping. Program 40:211–218, doi:10.1108/00330330610681286
- Preoţiuc-Pietro et al. (2017) Preoţiuc-Pietro D, Liu Y, Hopkins D, Ungar L (2017) Beyond Binary Labels: Political Ideology Prediction of Twitter Users. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Vancouver, Canada, pp 729–740, doi:10.18653/v1/P17-1068, URL http://aclweb.org/anthology/P17-1068
- Rao et al. (2010) Rao D, Yarowsky D, Shreevats A, Gupta M (2010) Classifying latent user attributes in Twitter. In: Proceedings of the 2nd international workshop on Search and mining user-generated contents - SMUC ’10, ACM Press, Toronto, ON, Canada, p 37, doi:10.1145/1871985.1871993, URL http://portal.acm.org/citation.cfm?doid=1871985.1871993
- Roberts et al. (2019) Roberts ME, Stewart BM, Tingley D (2019) stm: An R package for structural topic models. Journal of Statistical Software 91(1):1–40, doi:10.18637/jss.v091.i02, URL https://www.jstatsoft.org/index.php/jss/article/view/v091i02
- Rosenthal and McKeown (2011) Rosenthal S, McKeown K (2011) Age prediction in blogs: A study of style, content, and online behavior in pre- and post-social media generations p 10
- Sap et al. (2014) Sap M, Park G, Eichstaedt J, Kern M, Stillwell D, Kosinski M, Ungar L, Schwartz HA (2014) Developing age and gender predictive lexica over social media. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, pp 1146–1151, doi:10.3115/v1/D14-1121, URL http://aclweb.org/anthology/D14-1121
- Sites (2013) Sites D (2013) Compact language detector 2. URL https://github.com/CLD2Owners/cld2
- Sznajd-Weron and Sznajd (2005) Sznajd-Weron K, Sznajd J (2005) Who is left, who is right? Physica A: Statistical Mechanics and its Applications 351(2):593–604, doi:10.1016/j.physa.2004.12.038, URL http://www.sciencedirect.com/science/article/pii/S0378437104016061
- Talaifar and Swann (2019) Talaifar S, Swann WB (2019) Deep alignment with country shrinks the moral gap between conservatives and liberals. Political Psychology 40(3):657–675, doi:10.1111/pops.12534, URL https://onlinelibrary.wiley.com/doi/abs/10.1111/pops.12534
- Wang and Manning (2012) Wang S, Manning CD (2012) Baselines and bigrams: Simple, good sentiment and topic classification. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2 (ACL ’12), Association for Computational Linguistics, pp 90–94, URL http://dl.acm.org/citation.cfm?id=2390665.2390688
- Wang et al. (2019) Wang Z, Hale S, Adelani DI, Grabowicz P, Hartman T, Flöck F, Jurgens D (2019) Demographic inference and representative population estimates from multilingual social media data. In: The World Wide Web Conference (WWW ’19), ACM, pp 2056–2067, doi:10.1145/3308558.3313684, URL http://doi.acm.org/10.1145/3308558.3313684
- Weber et al. (2013) Weber I, Garimella VRK, Teka A (2013) Political hashtag trends. In: ECIR
- Wong et al. (2016) Wong FMF, Tan CW, Sen S, Chiang M (2016) Quantifying Political Leaning from Tweets, Retweets, and Retweeters. IEEE Transactions on Knowledge and Data Engineering 28(8):2158–2172, doi:10.1109/TKDE.2016.2553667
- Yu et al. (2008) Yu B, Kaufmann S, Diermeier D (2008) Classifying party affiliation from political speech. Journal of Information Technology & Politics 5(1):33–48, doi:10.1080/19331680802149608, URL https://doi.org/10.1080/19331680802149608
- Zagheni et al. (2017) Zagheni E, Weber I, Gummadi K (2017) Leveraging Facebook’s advertising platform to monitor stocks of migrants. Population and Development Review 43(4):721–734, doi:10.1111/padr.12102, URL https://onlinelibrary.wiley.com/doi/abs/10.1111/padr.12102
- Zamal et al. (2012) Zamal FA, Liu W, Ruths D (2012) Homophily and latent attribute inference: Inferring latent attributes of Twitter users from neighbors. In: Proceedings of the Sixth International AAAI Conference on Weblogs and Social Media (ICWSM)
- Zhang et al. (2016) Zhang J, Hu X, Zhang Y, Liu H (2016) Your age is no secret: Inferring microbloggers’ ages via content and interaction analysis p 10
Appendix A Supplemental materials
This supplemental material includes additional detail on how we identify political content, the URL patterns we use to identify the sharing of political and sports news, and all 150 topics in the topic models for both non-political and political content.
Identifying political content
Political Words
#askmay, #battlefornumber10, #bbcdebate, #bbcelection, #bbcqt, #bbcsp, #believeinbritain, #brexit, #brexitdeal, #bristolgreenmp, #bristolwest, #budget2015, #budget2017, #cameronmustgo, #conservative, #conservatives, #corbyn, #cpc17, #dementiatax, #dupcoalition, #ed4pm, #election, #election2015, #election2017, #electionday, #electionday2017, #forthemany, #ge15, #ge17, #ge2015, #ge2017, #generalelection, #generalelection17, #generalelection2017, #getcameronout, #greens, #greensurge, #hungparliament, #imvotinglabour, #itvdebate, #jc4pm, #jc4pm2019, #jeremycorbyn, #jezwecan, #labour, #labourdoorstep, #labourleadership, #labourmanifesto, #labourmustwin, #le2017, #leadersdebate, #libdem, #libdemfightback, #libdems, #makejunetheendofmay, #marrshow, #mayvcorbyn, #newsnight, #nhscyberattack, #nigelfarage, #peston, #plaid15, #pmqs, #queensspeech, #registertovote, #ridge, #saveilf, #snp, #snpout, #tories, #toriesout, #toriesoutnow, #tory, #toryelectionfraud, #torymanifesto, #ukip, #uklabour, #victorialive, #vote, #vote2017, #votecameronout, #voteconservative, #votegreen2015, #votegreen2017, #votelabour, #votelibdem, #votematch, #votesnp, #voteukip, #weakandwobbly, #whyimvotingukip, #whyvote, #wsyvf, @aaronbastani, @ainemichellel, @amberruddhr, @amelia_womack, @andrewspoooner, @andysearson, @angelarayner, @angiemeader, @angry_voice, @angrysalmond, @annaturley, @barrygardiner, @bbcbreaking, @bbclaurak, @bbcnews, @bbcpolitics, @bbcr4today, @bbcscotlandnews, @bonn1egreer, @borisjohnson, @brexitbin, @bristolgreen, @britainelects, @campbellclaret, @carolinelucas, @cchqpress, @charliewoof81, @christinasnp, @chukaumunna, @chunkymark, @conservatives, @corbyn_power, @corbynator2, @d_raval, @daily_politics, @darrenhall2015, @david_cameron, @davidjfhalliday, @davidjo52951945, @davidlammy, @davidschneider, @dawnhfoster, @debbie_abrahams, @dmreporter, @dpjhodges, @dvatw, @ed_miliband, @el4jc, @emilythornberry, @evolvepolitics, @faisalislam, @fight4uk, @frasernelson, @gdnpolitics, @georgeaylett, @georgeeaton, @grantshapps, @guardian, @guardiannews, @guidofawkes, @hackneyabbott, @harrietharman, @harryslaststand, @hephaestus7, @hrtbps, @huffpostuk, @huffpostukpol, @iainmartin1, @iandunt, @imajsaclaimant, @independent, @ipsosmori, @itvnews, @jacob_rees_mogg, @james4labour, @jameskelly, @jamesmelville, @jamieross7, @jeremy_hunt, @jeremycorbyn, @jeremycorbyn4pm, @jimwaterson, @joglasg, @johnmcdonnellmp, @johnprescott, @johnrentoul, @jolyonmaugham, @jon_swindon, @jonashworth, @joswinson, @keir_starmer, @kevin_maguire, @kezdugdale, @krishgm, @laboureoin, @labourleft, @labourlewis, @labourpress, @ladydurrant, @leannewood, @liamyoung, @libdempress, @libdems, @louisemensch, @lucympowell, @mancman10, @marcherlord1, @markfergusonuk, @mhairihunter, @michaelrosenyes, @mikegalsworthy, @mirrorpolitics, @mmaher70, @molly4bristol, @mollymep, @montie, @mrmalky, @msmithsonpb, @natalieben, @newsthump, @nhaparty, @nhsmillion, @nick_clegg, @nickreeves9876, @nicolasturgeon, @nigel_farage, @normanlamb, @nw_nicholas, @owenjones84, @patricianpino, @paulmasonnews, @paulnuttallukip, @paulwaugh, @peoplesmomentum, @peston, @peterstefanovi2, @petewishart, @plaid_cymru, @politicshome, @prisonplanet, @rachael_swindon, @realdonaldtrump, @reclaimthenews, @redhotsquirrel, @redpeter99, @rhonddabryant, @richardburgon, @richardjmurphy, @robmcd85, @rosscolquhoun, @ruthdavidsonmsp, @sarahchampionmp, @scottishlabour, @scottories, @screwlabour, @skwawkbox, @skynews, @skynewsbreak, @socialistvoice, @standardnews, @stvnews, @sunny_hundal, @survation, 
@suzanneevans1, @telegraph, @telegraphnews, @telepolitics, @thecanarysays, @thegreenparty, @themingford, @thepileus, @theredrag, @theresa_may, @theresa_may’s, @therightarticle, @thesnp, @thisisamy_, @timfarron, @timothy_stanley, @tnewtondunn, @tomlondon6, @toryfibs, @trevdick, @trobinsonnewera, @uk_rants, @ukip, @uklabour, @vincecable, @wesstreeting, @willblackwriter, @wingsscotland, @wowpetition, @yougov, @yvettecoopermp, abbot, abbott, bennett, bernie, boris, brexit, cameron, cameron’s, campaign, campaigning, candidate, candidates, canvassing, centrist, clegg, clegg’s, clinton, coalition, communist, comres, con, conservative, conservatives, constituency, constituents, corbyn, corbyn’s, councillor, councillors, cymru, davidson, debates, dem, democrat, democrats, dems, dup, election, elections, electoral, entitlement, eu, fallon, farage, farage’s, farron, fein, fiscal, gardiner, ge, gop, gove, govern, government, government’s, govt, govt’s, greens, grn, guardian, hamas, hammond’s, hillary, hustings, icm, ifs, ind, ira, isis, islamist, johnson, johnson’s, jones, labour, labour’s, labours, ld, ldem, leanne, lib, libdem, libdems, liberal, manifesto, manifestos, may’s, mcdonnell, miliband, miliband’s, milliband, minister, mogg, mps, nuttall, obama, oth, palin, paxman, plaid, pm’s, policies, policy, poll, polling, polls, portillo, putin, queen’s, ref, referendum, republican, republicans, rudd, russia, salmond, sanders, scrapping, sinn, snp, snp’s, socialist, soros, sturgeon, sturgeon’s, terror, terrorism, terrorist, theresa, thornberry, tns, tories, tories’, tory, trident, trump, ukip, ukip’s, unionist, vaughan, vladimir, vote, voter, voters, votes, voting, yougov
Ambiguous Words
free, #amazonbasket, #beastfromtheeast, #bournemouth, #colchester, #eurovision2015, #finland, #firearms, #foxhunting, #freebies, #glastonbury, #grenfell, #grenfelltower, #leeds, #londonbridge, #marr, #onelovemanchester, #spanishgp, #thursdaythoughts, #trident, #uk, #york, #youtube, @daraobriain, @emmakennedy, @jamin2g, @janeygodley, @kindleuk, @marcuschown, @paulbernaluk, @pinknews, @rustyrockets, @sophiareed1, @swanseacity1, 10m, 1bn, 2015, 34, 8bn, 8th, allowance, amber, ars, attack, austerity, baptist, behave, benz, branch, burrows, buts, chaos, clarkson, conference, cons, costed, costings, cuts, davey, david, deal, debate, deceased, defend, dementia, diane, diaz, dodging, donald, dunk, ed, emily, esther, exc, factions, fairer, fees, firearms, firefighters, foxes, freddy, general, glastonbury, grenfell, gy, hamlets, hs2, hsbc, ht, htt, hung, inheritance, intention, jeremy, june, kensington, kyle, lab, lambert, landslide, leader, leaders’, leaflet, leafleting, leaflets, lorde, marginal, matched, may, mcintyre, meeting, meetup, members, membership, methodist, middlesbrough, middleton, muppets, murphy, natalie, nicola, nigel, observer, participate, party, pensioners, pledge, pledges, plymouth, posters, progressive, prop, reception, register, registered, rehearsing, results, rodgers, rt, ruth, scarpping, scrap, seat, seats, sentencing, sheeran, slater, sos, source, stable, sub, supporting, surge, tactical, tactically, th, theo, tim, transformative, tuition, uc, uk, violence, watson, weak, weir, wheat, youtuber
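These two lists drive corpus construction: tweets containing any political marker are removed before building the non-political dataset, and the ambiguous list supports a more conservative filter. Below is a minimal sketch of such keyword filtering; the set contents shown are small illustrative subsets of the full lists above, and the tokenizer and `strict` option are simplifications rather than our exact released pipeline.

```python
import re

# Illustrative subsets of the lists above; the full lists are used in practice.
POLITICAL_WORDS = {"#ge2017", "@jeremycorbyn", "brexit", "labour", "tory"}
AMBIGUOUS_WORDS = {"austerity", "grenfell", "leaflets"}

# Tokens are lowercased words, hashtags, or mentions.
TOKEN_RE = re.compile(r"[#@]?\w[\w']*")

def is_political(tweet, strict=False):
    """True if the tweet contains a political marker; with strict=True,
    ambiguous vocabulary is flagged as well."""
    tokens = {tok.lower() for tok in TOKEN_RE.findall(tweet)}
    if tokens & POLITICAL_WORDS:
        return True
    return strict and bool(tokens & AMBIGUOUS_WORDS)

tweets = ["Loved last night's Top Gear!", "Vote Labour tomorrow #ge2017"]
non_political = [t for t in tweets if not is_political(t)]
print(non_political)  # ["Loved last night's Top Gear!"]
```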
News case study URL patterns
Source | Type | URL pattern
---|---|---
Guardian | political | theguardian.com/politics/2018/
Guardian | sport | theguardian.com/sport/2018/
BBC | political | bbc.co.uk/news/uk-politics-
BBC | sport | bbc.co.uk/sport/
Telegraph | political | telegraph.co.uk/politics/2018/
Telegraph | sport | telegraph.co.uk/football/2018/
Telegraph | sport | telegraph.co.uk/cycling/2018/
Telegraph | sport | telegraph.co.uk/cricket/2018/
Telegraph | sport | telegraph.co.uk/rugby-union/2018/
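Each shared link is labelled by matching its host and path against these prefixes. The sketch below shows one way to do this, assuming links have already been resolved to their final (unshortened) URLs; the function and dictionary names are ours, for illustration only.

```python
from urllib.parse import urlparse

# Prefixes from the table above, keyed by (source, article type).
URL_PATTERNS = {
    ("Guardian", "political"): ("theguardian.com/politics/2018/",),
    ("Guardian", "sport"): ("theguardian.com/sport/2018/",),
    ("BBC", "political"): ("bbc.co.uk/news/uk-politics-",),
    ("BBC", "sport"): ("bbc.co.uk/sport/",),
    ("Telegraph", "political"): ("telegraph.co.uk/politics/2018/",),
    ("Telegraph", "sport"): ("telegraph.co.uk/football/2018/",
                             "telegraph.co.uk/cycling/2018/",
                             "telegraph.co.uk/cricket/2018/",
                             "telegraph.co.uk/rugby-union/2018/"),
}

def classify_url(url):
    """Return (source, type) for a shared link, or None if no pattern matches."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    normalised = host + parsed.path
    for label, prefixes in URL_PATTERNS.items():
        if normalised.startswith(prefixes):
            return label
    return None

print(classify_url("https://www.bbc.co.uk/sport/football/44657418"))
# -> ('BBC', 'sport')
```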
Full topic models
This section provides the full non-political and political topic models.
Non-Political Topics
The topics we name in the paper are: 38 (Entertainment), 46 (Technology/EU), 48 (Gardening), 58 (MUFC/Premier League), 75 (London/Beer), 77 (Follow/Unfollow), 91 (Africa/Corruption), 95 (Sugar Tax), 98 (Premier League), 114 (Soc. Dem. Values), 148 (Pollution).
The explicitly political topics we identified in the non-political model are 27 (Israel–Palestine), 37 (Scottish independence), 54 (UK politics), 62 (EU politics), 69 (the Green Party), 84 (Brexit), 86 (the US), 106 (Westminster).
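The features in the tables below are stemmed tokens, with multiword features joined by underscores (e.g., mental_health). A model of this shape can be reproduced with a standard LDA pipeline; the sketch below uses gensim and is illustrative only — the toy documents, bigram thresholds, and number of passes are assumptions, not our published configuration.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Phrases
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Toy stand-in for the real input: one token list per user, built from
# their tweets after the keyword filtering described above.
docs = [["mental", "health", "stigma", "anxiety"],
        ["mental", "health", "issues", "therapy"],
        ["premier", "league", "goal", "united"]]

# Stem tokens, matching the truncated forms in the tables (issues -> issu).
docs = [[stemmer.stem(tok) for tok in doc] for doc in docs]

# Join frequent bigrams (default delimiter "_"), yielding features
# such as mental_health.
bigram = Phrases(docs, min_count=1, threshold=1)
docs = [bigram[doc] for doc in docs]

# Bag-of-words corpus and LDA fit; 150 topics matches the non-political
# table below, while the remaining settings are guesses.
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus, num_topics=150, id2word=dictionary, passes=10)

for topic_id, words in lda.show_topics(num_topics=3, num_words=8,
                                       formatted=False):
    print(topic_id, [w for w, _ in words])
```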
Topic | Features
---|---
1 | risk, daili, de, hr, late, leadership, insur, manag, thetim, innov, aug |
2 | youtub, youtub_video, video, video_youtub, islam, lyric, hd, feat, music_video, trailer, video_make, parodi, muslim |
3 | anxieti, ur, mental, mental_health, depress, mental_ill, ill, stigma, base, disord, anxious, therapist, mental_health_issu, insta, xx |
4 | man_love, steel, andrew, everton, liam, lfc, liverpool, gallagh, jim, lg, oasi, arena, easter_egg, klopp, giveaway |
5 | comic, health, daili, technolog, mentalhealth, healthcar, late, tech, busi, dc, medicin, mm, book |
6 | fm, mile, en, gb, step, travel, insid, leadership, eu, construct, stem, sustain |
7 | gay, today_find, pride, tho, netflixuk, porn, coffe, sean, ah, memori, twitter_account, dublin, domino, work_today, ugh |
8 | ink, illustr, draw, japanes, lesbian, omg, lil, irish, queer, ireland, submiss, dublin, plz, freelanc, eurovis |
9 | bronz, trophi, ps, earn, silver, gold, destini, broadcast, assassin, giveaway, gran, game |
10 | tourism, student, research, academ, univers, publish, phd, colleg, studi, organis, leisur, philosoph, prof, philosophi, sustain |
11 | lewi, xx, haha, chris, happi_birthday, birthday, gari, cheer, liam, bf, spencer, haha_good, oop, lol |
12 | sampl, repli, bargain, thriller, kindl, free, mysteri, buy, author, book, suspens, amazonuk, trilog, romanc |
13 | academi, chariti, budget, kent, educ, sector, trust, tax, manufactur, school, sussex, cyber, mat, implic |
14 | chanc_win, giveaway, competit, follow_chanc, enter, follow_chanc_win, chanc, follow_win, simpli, prize, winner_announc, follow_enter, follow_retweet, bundl, win |
15 | playlist, folk, tea, tuesday, thursday, monday, wednesday, august, decemb, novemb, reel, hardi, rhythm, elvi, octob |
16 | angel, pari, sam, bastard, danni, barber, pump, shite, lad, jim, barclay, glenn, innit, daft, casual |
17 | ebay, volum, nobl, barn, check, dvd, review, ukchang, sign_petit, petit, sign_petit_ukchang, petit_ukchang, disc, fossil |
18 | fb, mk, past_week, donat, past, awesom, geek, airport, cake, cycl, browni, amazonuk, phase, protein, workout |
19 | autom, schedul, script, web, window, marcus, loop, solut, technic, wizard, stop_work, scrape, boundari, blog, datum |
20 | gt_gt, gt, gt_gt_gt, album, bird, song, music, vibe, mega, moth, dope, make_sick, bloodi_good, blur, ala |
21 | railway, carriag, rail, destini, train, ps, delay, se, marvel, sigh, molli, virginmedia, berri, geek, lbc |
22 | lodg, swansea, ken, swan, leicest, anna, squar, mason, ed, instal, surnam, ceremoni, virginmedia, cardiff |
23 | canal, trip, boat, sun, great_day, festiv, weekend, rugbi, lock, cycl, sail, countrysid, great_weekend, visitor, temp |
24 | sw, pool, waterloo, citi, pr, xxxx, leader, anna, andrea, kyli, compassion, abba, samuel, yawn, grayl |
25 | sibl, studi, immigr, patient, diseas, evid, effect, outcom, increas, clinic, diabet, mortal, norm, intervent, genet |
26 | shaw, race, bet, cricket, bbcsport, bat, ffs, ball, bowl, counti, fort, moor, hardi, prove_wrong, tbh |
27 | antisemit, israel, jew, palestinian, jewish, isra, racist, anti_semit, semit, gaza, reluct |
28 | post_photo, photo, facebook, post, fisher, hors, race, xxx, parti, indi, sim, happi_day, rider, adel, xx |
29 | subscript, content, exclus, access, fan, renew, month, join, utd, portug, ab, manchest_unit, ferguson, leed, man_utd |
30 | durham, newcastl, laura, lauren, xo, hun, isl, dubai, cute, nah, luci, mc, lmao, tbh |
31 | develop, birmingham, role, derbi, coventri, net, softwar, job, hire, engin, infrastructur, midland, bbc_news |
32 | ur, pl, gal, wanna, uni, liter, babe, xo, bc, omg, ew, soz, makeup, sophi, casualti |
33 | detect, youtub, trailer, cinema, video_youtub, director, color, magazin, lee, ai, maggi, nasa, arch, depict, vice |
34 | imaceleb, doctorwho, wale, gbbo, nurs, swansea, flood, die_age, doctor, bbcone, phillip, minut_silenc, iain, evacu, sharon |
35 | eurovis, sync, brit, fave, song, spice, fab, chart, singl, battl, woo, wk, love_song, mexico, album |
36 | mi, run, race, marathon, runner, nike, crush, pace, jedi, paul, hm, endur, sprint, gym |
37 | scottish, scotland, independ, proport, glasgow, scot, averag, edinburgh, 3, wee, ruth, good_support, tier, uefa, ranger |
38 | beth, xx, haha, xxx, deploy, sean, artist, nicki, wire, hahaha, knit, navi, bbc_radio, willi, lol |
39 | tshirt, york, wed, edinburgh, magic, earn, badg, awesom, film, packag, surf, tee, yorkshir, chanc_win |
40 | keith, trek, samuel, christ, faith, god, spirit, israel, august, promis, ministri, profound, divin, amanda, revel |
41 | sheffield, yorkshir, south, vine, centr, ago_today, year_ago_today, sarah, year_ago, shop, hillsborough, world_big, arthur, xbox, lol |
42 | disabl, dwp, disabl_peopl, benefit, univers_credit, poverti, welfar, credit, homeless, auster, pip, sanction, luther, duncan |
43 | consult, insight, appl, io, iphon, karl, server, tim, read_tweet, cook, smartphon, broadband, samsung, softwar |
44 | afc, arsenal, wenger, chelsea, spur, arsenal_fan, alexi, sanchez, tottenham, emir, au, bayern, dortmund |
45 | legal, law, court, lawyer, lynn, ms, divorc, student, aid, justic, suprem_court, profess, mainten, academ |
46 | android, dj, ee, eu, app, greec, independ, mobil, scotland, ipad, tutori, currenc, rubi, versus, workout |
47 | lt, oxford, input, lt_lt, hey, stream, ill, nowplay, rank, hall, aha, acoust, meter, gordon, ah_good |
48 | bloom, garden, cute, layer, kinda, ship, charact, rich, bless, hug, hope_feel, roller, ldn, shoutout, gosh |
49 | thanksgiv, km, coffe, cat, tourist, holiday, gym, tube, latin, breakfast, sticker, latt, florida, unlock, kitti |
50 | connect, typo, lbc, garylinek, coin, sugar, lord, bot, threaten, climat_chang, fork, lord_sugar, edl, man_make, russian |
51 | gay, eurovis, lgbt, money_make, pride, australia, marriag, equal, sydney, homophob, gay_man, lgbtq, homophobia, abba, queer |
52 | bbc_news, bbc, news, great_good, portrait, exhibit, irish, ireland, wonder, museum, serena, twitter_follow, austria, thame, sad_news |
53 | opinion, fact, agre, understand, genuin, sens, fair, forbid, disagre, interest, logic, necessarili, piti, grasp, satir |
54 | bruce, anna, remain, parliament, migrant, illeg, franc, countri, politician, puppi, griev, leaver, traitor, des, imparti |
55 | fuck, cunt, fuckin, connor, yer, kyle, shite, ye, shit, lmao, cum, shag, fanni, fuck_hate, wank |
56 | writer, kickstart, movi, film, cinema, write, author, doctorwho, book, episod, haul, andr, crowdfund, day_leav, big_screen |
57 | follow_follow, blog, blogger, instagram, shoe, outfit, fashion, dress, gorgeous, beauti, follow_back, skirt, aso, luxuri, wardrob |
58 | mufc, manutd, unit, rooney, manchest_unit, utd, goal, golf, moy, mourinho, bale, trafford, ucl, cristiano, pogba |
59 | shade, film, star_war, hairdress, starwar, movi, hell, war, jedi, damn, shaun, laser, film_make, titan, awaken |
60 | lol, bald, lmao, im, omg, tho, tbh, ur, ppl, wtf, lol_good, sooo, goin, carniv, sooooo |
61 | cat, eye, eden, mill, insect, fuck_fuck, biscuit, ian, tree, poem, badger, windsor, good_god, andr, ha |
62 | deficit, el, la, los, remind, econom, en, tax, america, del, es, nationalist, merkel, macron, lo |
63 | turkish, lisa, bing, miss, mailonlin, harvey, louis, bloke, lbc, custom_servic, leo, robbi, vip, xx |
64 | art, galleri, artist, exhibit, studio, paint, print, workshop, beach, landscap, contemporari, tide, artwork |
65 | cricket, broad, bat, ash, stoke, test, england, bowl, root, ball, aussi, ol, over, jonni, fuck |
66 | thor, teacher, physic, activ, educ, child, school, teach, student, earli_year, nurseri, pe, classroom, toddler, egypt |
67 | stain, graphic, websit, van, ff, sign, newcastl, vehicl, updat, wall, saturday_morn, copper, frost, banner, high_street |
68 | exam, collin, uni, revis, essay, err, sister, lectur, graduat, hull, abbey, gcse, time_aliv, snapchat |
69 | green_parti, climat, green, climat_chang, ride, bike, peac, nhs, mag, hunt, pollut, heathrow, countrysid, environment, finland |
70 | stamp, paul, writer, morri, blue, andr, scotland, fabric, craft, recip, thumb, sew, stripe, xx, tho |
71 | jennif, hahaha, danni, haha, lawrenc, em, il, fav, nottingham, stop_watch, hunger, larri, fuck, im |
72 | market, strategi, content, social_medium, brand, social, digit, medium, trend, tip, optim, linkedin, influenc, algorithm, facebook |
73 | mia, xfactor, sing, factor, song, sterl, anthoni, england, fuck_fuck, gonna, fluffi, big_brother, week_day, carrot, blade |
74 | dec, furi, tyson, fight, fighter, box, mate, jr, pal, champ, boxer, ko, parker, big_man, aj |
75 | unit_kingdom, kingdom, ben, unit, beer, pub, london, januari, station, tim, buffet, slice, high_street, poet, buff |
76 | ppl, didnt, im, sandra, ive, doesnt, isnt, cuz, folk, what, your, hmmm, lol, hes |
77 | unfollow, automat, peopl_follow, stat, follow, check, person, ireland, mention, reach, shane, big_fan, monitor, vinc, find_peopl |
78 | long_term, ear, lfc, defend, poor, injuri, midfield, el, defens, hes, dale, trent, analys, salli, your |
79 | jane, lt, xxx, xo, nugget, ad, robert, hay, laura, lt_lt, sob, xxxx |
80 | blaze, nfl, wrestl, season, coach, defens, chief, matt, game, offens, vardi, boston, houston, panther, ref |
81 | maintain, action, figur, base, statement, appar, talk, link, prefer, give, mental_health, connect, make, time |
82 | council, communiti, wellb, local, west, committe, servic, resid, fund, volunt, social_care, neighbourhood, district, turnout, great_hear |
83 | nathan, josh, hall, eve, exam, jame, gonna, shit, revis, preston, homework, kennedi, quid, il, fuck |
84 | deal, trade, remain, democraci, agreement, union, border, european, freedom, good_pay, david_davi, european_union, leaver, negoti, norway |
85 | sign_petit, petit, frack, ukchang, degre, nhs, petit_ukchang, sign_petit_ukchang, sign, gaza, privatis, open_letter, georg_osborn |
86 | presid, america, syria, cnn, health_care, american, gun, senat, white_hous, potus, congress, fox_news, presid_unit, presid_unit_state, fbi |
87 | ted, steve, neighbour, chris, math, podcast, sum, wealth, hey, parliament, ch, sky_news, gorilla, puzzl, worth_watch |
88 | barrel, rick, turner, swansea, church, tire, brighton, rugbi, mate, morn, julia, tobi, chick, dissert, briton |
89 | chef, restaur, menu, manchest, lunch, food, meal, open, dish, roast, good_food, dine, norman, citi_centr, pm_pm |
90 | librari, museum, cat, omg, jo, daniel, liter, ur, staff, pay, patron, windsor, charlott, printer, citizenship |
91 | ni, nigeria, lawyer, african, corrupt, parcel, bbcradio, lol, journo, arrest, astronaut, eye_open, hoo, detain, wk |
92 | karaok, drink, gin, bake, tire, cake, nottingham, beer, dave, lil, gemma, cider, burnley, lush, buffet |
93 | choir, experi, care, young_peopl, ne, scotland, wee, kenni, glasgow, edinburgh, portray, peopl_care, jk_rowl, show_support, rowl |
94 | carter, saint, billi, st, fa, dan, ranger, bbcsport, band, radio, show_tonight, nra, curl, di, andi_murray |
95 | frog, level, skill, cunt, sugar, committe, divin, nhs, suspect, holland, jeremi_hunt, coffin, rag, blog, fascist |
96 | theapprentic, ps, nintendo, xbox, switch, mario, episod, game, star_war, trailer, ea, batman, starwar, consol |
97 | raf, york, scotland, poppi, veteran, fraser, soldier, jim, store, boss, primark, lili, today_rememb, case_miss, bbcradio |
98 | west_ham, spur, ham, salah, everton, kane, chelsea, klopp, pogba, lukaku, cl, mane, hazard, fuck, midfield |
99 | blackpool, wimbledon, itv, liverpool, strict, fabul, tenni, eurovis, ff, manutd, casualti, handsom, watson, saturday_night, im |
100 | vegan, hunt, anim, lash, rescu, wildlif, fox, rickygervai, protect, meat, rhino, endang, extinct, eleph |
101 | dart, mark, piersmorgan, lbc, pint, proud, midland, yep, kthopkin, twat, wetherspoon, hood, lap, chuck, rickygervai |
102 | leed, veggi, daughter, gbbo, rat, mum, charlott, sew, kim, advert, elton, usernam, crappi, amber, fuck |
103 | rio, euro, olymp, gold, medal, doctorwho, bridg, wale, silver, itv, andi_murray, world_record, showcas, pursuit, bbcsport |
104 | print, font, leed, design, poster, gig, mount, beer, brew, peter, ale, pale, punk, fest |
105 | album, gaga, omg, song, ariana, kate, meghan, singl, tour, icon, good_song, rihanna, banger, beyonc, nicki |
106 | economi, rural, britain, unemploy, budget, growth, chancellor, union, bn, westminst, commonwealth, margaret_thatcher, good_futur, gdp, booth |
107 | bowi, gig, david_bowi, album, david, band, music, ticket, tour, vinyl, great_weekend, tix, acoust, happi_friday, pre_order |
108 | mate, lincoln, lad, haha, newcastl, sunderland, hahaha, aye, dean, cunt, mate_good, morri, hahahahaha, ashley, fuck |
109 | muslim, islam, migrant, rape, tommi, ukrain, religion, gang, tommi_robinson, immigr, mosqu, free_speech, hate_crime, merkel, groom |
110 | celtic, ranger, rodger, wit, hes, scottish, bn, robert, brendan, mcgregor, iv, lennon, gerrard, daft, midfield |
111 | pin, fanci, newcastl, iphon, favorit, fab, nowplay, nois, bike, wear, christma_dinner, cube, lamp, rickygervai, bicycl |
112 | retail, water, ceo, energi, custom, market, gas, sector, regul, innov, supplier, icymi, storag, resili |
113 | yoga, theatr, delici, fabul, cast, workout, drag, mac, audit, dinner, repost, luv, nightclub, popcorn, rehears |
114 | foodbank, poverti, nhs, wage, worker, crisi, fund, wick, solidar, food_bank, grayl, live_wage, privatis, hunger, comrad |
115 | irl, ass, pokemon, damn, movi, dude, meme, bc, charact, shit, wizard, behold, vampir, worm, demon |
116 | franci, scout, shift, tattoo, beer, german, tbh, world_cup, worldcup, incid, navi, ballot, houston, troop, sadiq |
117 | black, smh, nah, bro, ass, mad, girl, lmao, drake, tryna, pierc, tl, black_man, black_peopl, hoe |
118 | cruis, safeti, driver, bbc_news, fire, leicest, car, vegetarian, member, union, exposur, fleet, trail, england_wale, fog |
119 | world_cup, goal, cup, fulham, footbal, leagu, worldcup, premier_leagu, sterl, england, huddersfield, pep, southgat, manchest_citi, colombia |
120 | franchis, agent, properti, episod, bro, soundcloud, ms, review, lol, itun, buyer, estat, mortgag, seller, xx |
121 | lfc, liverpool, belfast, klopp, anfield, suarez, xx, hillsborough, gerrard, lui, ski, salah |
122 | ironi, racist, rail, remain, ban, odd, invent, lefti, sane, tho, appropri, dodgi, mph, impli, nearbi |
123 | panel, energi, solar, hill, climat, local, council, sustain, green, winner, borough, effici, org, st_centuri, coal |
124 | outdoor, mountain, spotifi, lake, peak, trail, summit, district, honey, walk, moss, sculptur, vintag, nowplay, yorkshir |
125 | loveisland, georgia, alex, megan, laura, wes, love_island, liverpool, adam, beyonc, madonna, kyli, loyal, xx |
126 | church, fr, pray, mass, bishop, priest, cathol, mari, feast, worship, ministri, christ |
127 | villa, aston, cricket, dr, premier_leagu, xi, premier, england, score, mini, sydney, russia, off, good_lad, hugh |
128 | xxx, fab, xx, xxxx, wonder, love_show, gorgeous, smile, daughter, ador, make_smile, daisi, mt, hope_feel, bravo |
129 | photographi, cornwal, photograph, bird, sunset, essex, natur, wolf, photo, van, je, mere, sunris, bonfir, wildlif |
130 | carbon, thread, cox, bark, artist, andrea, tran, cc, jersey, energi, orient, fossil, solar, mt, word_day |
131 | calm, london, friday, airport, languag, uniqu, googl, ad, gatwick, german, camden, fur, den, paddington, croatia |
132 | welsh, cardiff, wale, languag, member, independ, william, group, english, facebook, renam, mud, royal_famili, christma_card, dump |
133 | recruit, warehous, graduat, virginmedia, email, glasgow, author, bobbi, web, slight, rupert, ten_minut, murdoch, amanda, email_address |
134 | mrjamesob, gun, gender, feminist, dumb, argument, dude, disagre, ideolog, nazi, alt, rifl, notch, ration, virtu |
135 | tom, jen, tit, nowplay, pint, doo, wrestl, fool, telli, tbh, minc_pie, minc, preston, fuck |
136 | christian, bibl, jesus, christ, ye, church, sin, pray, god, faith, satan, worship, vers, atheist, flesh |
137 | bristol, cook, bang, tape, record, arm, releas, bear, hideous, field, bleach, trace, gag, twelv, cd |
138 | googl, search, index, mobil, site, rank, content, tool, updat, link, algorithm, thx, optim, crawl, fetch |
139 | clarkson, cycl, favorit, eastend, driver, pride, bike, alan, gun, cyclist, waitros, potus, adida, paedophil, bicycl |
140 | bed, jodi, sleep, boyfriend, tire, flavour, eat, wash, nap, tesco, bedtim, payday, bra, lose_weight, crave |
141 | io, appl, raspberri, app, microsoft, iphon, window, ipad, code, mac, chrome, beta, browser, os, butler |
142 | airway, jet, nowplay, flight, rout, airport, leed, pepper, ryanair, airlin, paradis, ba, heathrow |
143 | norwich, neil, wes, gut, ticket, bet, championship, superb, refere, season, game_today, season_ticket, alloc, applaus, win_ticket |
144 | ha_ha, ha, wright, nurs, placement, ami, jo, disabl, luke, hug, brace, ar, jo_cox, orlando, lol |
145 | korean, riot, na, player, team, split, tournament, korea, region, champ, passiv, reddit, tier, leagu, game |
146 | gender, tran, feminist, sex, rape, woman, sexual, male, femal, femin, gum, misogyni, sexism, prostitut, sexual_assault |
147 | greg, cfc, chelsea, cont, sadiqkhan, iran, racist, mourinho, hazard, staff, diego, fifti, grenfelltow, fuck, jose |
148 | plastic, contain, ocean, pollut, dr, usa, sea, recycl, dwp, path, curios, litter, tonn, live_uk, crook |
149 | southampton, player, saint, squad, club, fc, season, midfield, loan, leagu, crook, start_season, make_chang, scorer, delight_announc |
150 | blake, chees, babe, holli, dale, aha, haha, joe, ahh, mac, fml, aso, dissert, deffo, bruis |
Political Topic Model
This is the topic model trained on the political text dataset.
Topic | Features
---|---
1 | blue, chicken, stori, ukchang, thegreenparti, vote_green, food, stuff, breakfast, philip, fri, lover, elect_manifesto, clair, green |
2 | life, world, fulfil, promis, breakdown, announc, labour, admir, check, gap, precis |
3 | mate, teen, fiction, stori, wolf, truth, true_stori, beach, heart, dirti, stori_good, top_stori, love_stori |
4 | tomorrow, tonight_vote, tonight, vote_tomorrow, forget, sum, asham, home, centr, releas, vote_govern, liam, mp_back, week_ago, vote |
5 | jeremycorbyn, marri, jeremycorbyn_theresa, peopl, jeremycorbyn_labour, stand, make, peopl_vote, vote_ge, sad, theresa_jeremycorbyn, behav, neighbour, vote_jeremycorbyn, februari |
6 | green, parti, polit_parti_side, green_parti, today, brand, campaign, labour_back, labour_polit, labour_polit_parti, parti_ge, conserv |
7 | prime, prime_minist, minist, categori, presid, live, chief, birthday, call, offic, sincer, minist_theresa, life |
8 | soar, vote_retweet, retweet, sampl, retweet_vote, maguir, sampl_size, kevin_maguir, size, kevin, dailymailuk, tomorrow_vote, poll, faisalislam |
9 | bbcqt, labour_ge, hrtbps, audienc, bbcquestiontim, owenjon, dimblebi, carolineluca, soubri, anna_soubri, interrupt, hmm, audienc_member, white_man, vote_record |
10 | agre, matter, true, wrong, politician, vote, vote_vote, peopl_vote, chang, vote_matter, vote_parti, parti_vote, chang_vote |
11 | ref, vote_support, cheer, lee, wow, morn, appoint, mr, jack, laugh, coach, merri_christma, moan, asap, milliband |
12 | chanc_win, competit, enter, select, follow, winner, chanc, product, prize, copi, tag, entri, fanci, win |
13 | cameron, david_cameron, david, miliband, surgeri, ed, ed_miliband, telegraph, osborn, pay, reshuffl, georg_osborn, labour_leader, pmqs |
14 | gb, york, endur, sajidjavid, javid, ruthdavidsonmsp, squar, arrang, remoan, good_luck, oop, curious, sajid, corbyn |
15 | instagram, snapchat, girl, storylin, artist, gonna, music, fuck, favorit, song, cute, good_stori, album, horror_stori, true_stori |
16 | jean, william, plaid, lib_dem, lib, leann, dem, cardiff, bbcdebat, ld, erm, bad_idea, leannewood, mare, tori_major |
17 | rhetor, energi, water, sector, industri, connect, blog, invest, berlin, confer, ceo, sustain, retail, cc, leadsom |
18 | read, small, outrag, act, make_big, differ, big, luck, govt, power, wash, mouth, yep, door, love |
19 | bbc, bbc_news, make_happen, news, davi, bbcnew, david_davi, news_brexit, blair, fish, news_tori, river, banner, botch, brexit |
20 | revis, exam, histori, ur, uni, lol, god, haha, obama, annoy, horror_stori, histori_book, toy_stori, presid_obama, snapchat |
21 | clegg, nick_clegg, nick, send_messag, leadersdeb, mind, divis, tuition_fee, end, tuition, xenophobia, sin, childish, generous, goodby |
22 | hollywood, front_page, page, newspap, front, soldier, beer, host, soviet, journal, hillsborough, genocid, murray, flee, instal |
23 | septemb, vote, labour, check, card, life, announc, world, wild, stori, dog |
24 | support_peopl, owner, shop, pride, mine, wolf, oop, agent, wall, remain_win, govern_support, peopl_peopl, merri_christma, support, wto |
25 | carolin_luca, liberti, guidofawk, telegraphnew, brighton, compromis, fish, afneil, carolin, dpjhodg, clap, luca, retail, cricket, wilson |
26 | peoplesvot, peoplesvot_uk, ordinari_peopl, martin, insult, edinburgh, deliv, outcom, attempt, approach, bell, call_peopl, bbc_live, bad_brexit, uk |
27 | conserv, vote_conserv, parti_side, conserv_parti, conserv_win, parti_work, parti, conserv_mp, conserv_govern, today, good_good, conserv_vote, campaign, polit |
28 | sign_petit, ukchang, petit, sign_petit_ukchang, sign, petit_ukchang, theresa_mp, anim, hunt, farm, trophi, xx, petit_call, call_govern |
29 | bold, elect_quiz, quiz, uk_elect, elect_quiz_result, quiz_result, wsyvf, uk_elect_quiz, result_labour, result, exhibit |
30 | rend, properti, agenc, photograph, landlord, insur, provid, embrac, agent, inform, cloud, tenant, aspect, packag, vat |
31 | leadersdeb, russel_brand, libdem, bbcdebat, barackobama, farron, tim_farron, ge, candid, liber_democrat, natali_bennett, hust, charl, natali, elect_debat |
32 | technolog, comic, health, daili, late, health_care, busi, tech, con, conserv_win, ep, top_stori, great_stori, stori |
33 | kent, scienc, plastic, environ, natur, climat, theatr, scheme, nato, beauti, recycl, museum, marin, robot, exhibit |
34 | life, check, promis, world, stori, life_stori, announc, vote, true, match, wed |
35 | gibraltar, collabor, spanish, wto, escap, spain, freedom, de, sovereignti, evil, 4, convent, dictatorship, anti_democrat, oppress |
36 | lib_dem, lib, dem, vote_futur, vote_lib_dem, vote_lib, lol, tim, amber_rudd, rudd, snake, yup, gear, comic, sigh |
37 | cancer, healthi, test, research, mum, amaz, share, inspir, share_stori, birmingham, top_stori, fab, love_stori, charli, grammar_school |
38 | hiv, experi, student, care, sexual, woman, welsh, lgbt, young, inspir, share_stori, launch_campaign, sister, peopl_live, harass |
39 | salmond, indyref, telegraph, shot, scot, currenc, scottish_independ, scottish, scotland, independ, milliband, scotland_vote, sterl, undecid, snp |
40 | battlefornumb, nhsmillion, leadersdeb, doctor, nhs, dr, jeremi_hunt, ge_labour, paxman, privatis, exempt, privatis_nhs, uk_polit, ukchang, dwp |
41 | unbeliev, wit, ref, thick, pathet, folk, disgrac, newsnight, pr, daili_mail, fiasco, carpet, slaveri, year_year, bitter |
42 | termin, cornwal, googl, comput, datum, facebook, art, search, site, websit, storytel, web, museum, traffic, exhibit |
43 | polit, parti, polit_parti, campaign, today, judg, debat, mp, reform, system, parti_leader, british_polit, conserv |
44 | manchest, wisdom, uklabour, abbott, wood, dian_abbott, dian, stephen, euref, resid, grammar_school, grammar, blast, champagn, steven |
45 | reward, swindon, pool, leadsom, andrea, rachael, rachael_swindon, wale, voteleav, vote_leav, compassion, badg, dpjhodg, britainelect |
46 | isi, saudi, yemen, obama, plea, syria, clinton, refuge, donald_trump, arabia, hunger, humanitarian, barack_obama, presid_obama, barack |
47 | vote_lose, irish, im, gonna, magic, abort, tree, money_tree, magic_money_tree, magic_money, dublin, wanna, black_woman, lend, cathol |
48 | gt, gt_gt, gay, marriag, miss, facebook, news, blog, sex, datum, privaci, berlin, nake, sex_marriag, bake |
49 | singl_market, hard_brexit, procedur, davi, brexit_brexit, market, brexit, singl, citizen, negoti, brexit_impact, ken_clark, ep, eea, soft_brexit |
50 | disabl, welfar, laboureoin, welfar_cut, poverti, jon, cut, mirrorpolit, swindon, tori, tax_credit, corbyn_leader, bedroom_tax, dailymirror, harryslaststand |
51 | innov, guardian, ian, drink, creativ, advertis, art, ad, food, de, storytel, gif, ted, magazin |
52 | peoplesvot, stopbrexit, proport, brexit_deal, fbpe, represent, deal_brexit, squeez, democraci, stop_brexit, vote_system, make_voic_hear, miser, peoplesvotemarch, brexit |
53 | ree_mogg, mogg, ree, bori_johnson, bori, jacob, jacob_ree_mogg, jacob_ree, johnson, dear_theresa, shower, tobi, call_jeremi, michael_gove, cartoon |
54 | cambridg, council, stark, councillor, fbpe, resid, citi, amend, motion, local_elect, scrutini, final_deal, localelect, vote_local, district |
55 | ft, victoria, iandunt, folk, music, tuesday, monday, wednesday, podcast, bill, februari, final_brexit_deal, final_brexit, hillaryclinton, guardian |
56 | launch, campaign, today, commit, polit, parti, confer, pledg, green, futur, parti_confer |
57 | councillor, local, joy, council, tim, canvass, candid, resid, cpc, ward, borough, district, afternoon, session, local_govern |
58 | welsh, ceremoni, wale, cymru, cardiff, plaid, dawn, deputi, plaid_cymru, councillor, deputi_leader, local_govern |
59 | elector_commiss, custom_union, brussel, custom, commiss, 2_referendum, chequer, elector, trade, union, darren, remain_campaign, grime, reckon, regulatori |
60 | fb, regist_vote, regist, young_peopl, ireland, northern_ireland, northern, girl, young, obligatori, peopl_regist, make_voic_hear, mini, make_voic, young_peopl_vote |
61 | comprehens, smear, tori_mp, jeremi_corbyn, video, msm, accus, expos, jeremi, homeless, bomb_syria, blairit, nerv, attack_corbyn, huffpostukpol |
62 | deal_brexit, ree, mogg, ree_mogg, deal, jacob, erg, jacob_ree, jacob_ree_mogg, brexit_deal, raab, domin_raab, irish_border, white_paper, countri_brexit |
63 | skynew, ukip_vote, skynewsbreak, gold, airport, hour, leed, maguir, idiot, abbott, olymp, flight, maker, yorkshir, kevin_maguir |
64 | vote_labour, labour_parti, labour_win, labour, vote_labour_vote, labour_govern, votelabour, dup, tori_govern, labour_vote, theresa_lose, lose_major, labour_gain, labour_elect, labour_labour |
65 | realdonaldtrump, piersmorgan, potus, crook, presid, trump, nigel_farag, nigel, presid_trump, fake, anti_trump, haha, cnn, william, fbi |
66 | today_good, id, bbclaurak, king, enter, broadcast, code, user, text, garden, santa, privaci, marvel, februari, communic |