Inferring Human Traits From Facebook Statuses
Abstract
This paper explores the use of language models to predict 20 human traits from users’ Facebook status updates. The data was collected by the myPersonality project, and includes user statuses along with their personality, gender, political identification, religion, race, satisfaction with life, IQ, self-disclosure, fair-mindedness, and belief in astrology. A single interpretable model matches state-of-the-art results on well-studied tasks such as predicting gender and personality, and sets the standard on other traits such as IQ, sensational interests, political identity, and satisfaction with life. Additionally, highly weighted words are published for each trait. These lists are valuable for creating hypotheses about human behavior, as well as for understanding what information a model is extracting. Using performance and extracted features, we analyze models built on social media. The real-world problems we explore include gendered classification bias and Cambridge Analytica’s use of psychographic models.
Keywords:
Social Media, Psychographic Prediction, NLP
1 Introduction
Facebook’s 2 billion users spend an average of 50 minutes a day on Facebook, Messenger, or Instagram [1]. Industry seeks to obtain, model and actualize this mountain of data in a variety of ways. For example, social media can be used to establish creditworthiness [2, 3], persuade voters [4, 5], or seek cognitive behavioral therapy from a chatbot [6]. Many of these tasks depend on knowing something about the personal life of the user. When determining the risk of default, a creditor may be interested in a debtor’s impulsiveness or strength of support network. A user’s home town could disambiguate a search term. Or, reflecting society’s values, a social media company may be less willing to flag inflammatory language when the speaker is criticizing their own group [7].
Social media’s endlessly logged interactions have also been a boon to understanding human behavior. Researchers have used various social networks to model bullying [8], urban mobility [9], and the interplay of friendship and shared interests [10]. Such studies do not have the benefit of a controlled setting where a single variable can be isolated. However, orders of magnitude more observations in participants’ natural habitat offer more fidelity to lived experience [11]. Additionally, subjects can be sampled from countries not so singularly Western, Educated, Industrialized, Rich, and Democratic (WEIRD, in the parlance of Henrich et al. [12]).
In this paper we show how readily different personality and demographic information can be extracted from Facebook statuses. Our reported performance is useful to learn how traits are related to online behavior. For example, sensational interests as measured by the Sensational Interest Questionnaire (SIQ) have been studied for internal reliability [13], relationship to physical aggression [14], and role in intrasexual competition [15]. Yet work connecting SIQ with social media use relies on individually labeling sensational interests in statuses and is only predictive among males [16]. Our model performs well for both males and females without hand-labeling statuses. Similarly, other research found no relationship between satisfaction with life (SWL) and status updates [17]; we show modest test set performance. Finally, although Facebook Likes have been shown to be highly predictive of many personal traits [18], language models with good performance on this dataset have been limited to predicting personality, age, and gender [19, 20, 21].
The benchmark also helps assess the efficacy of services that explicitly or implicitly rely on inferring these traits. This is valuable to those developing new services as well as to users concerned about privacy. Of particular interest is the role of psychographic models in Cambridge Analytica’s (CA) marketing strategy. According to leaked internal communications, in 2014 CA amassed a dataset of Facebook profiles and traits almost identical to those in the myPersonality dataset [22]. The week after CA’s project became public, Facebook’s market value fell by $75 billion [23]. One factor in that drop was the belief that Facebook had allowed a third party to create a powerful marketing tool that could manipulate elections [24, 22]. There are dozens of publications on the myPersonality dataset. However, this is the first to predict SIQ, fair-mindedness, and self-disclosure, which CA discussed in relation to building user models [22].
Besides performance benchmarks, the other major contribution of this paper is the list of most highly weighted words for predicting each trait. The weights also say something about human behavior. The interpretation here is more complex: regression on tens of thousands of features is fraught with over-fitting and collinearity. Despite those problems, in Section 3 we argue that the weights can still be treated as a data exploration tool similar to clustering. We provide examples of previously studied relationships that are borne out in the word lists, and believe the lists are a useful tool to develop yet unstudied hypotheses.
Highly weighted features are also an important way to analyze models. We argue in Section 4.4 that a militarism predictor CA may have built is accurate, but extracts obvious features. Additionally, by inspecting the features in an Atheist vs. Agnostic classifier we find many gendered words. We demonstrate the bias empirically, then fix the classifier to be more fair. This approach is instructive for interrogating more critical models built on social media data.
This paper includes many contributions that could stand alone. We show that the text of Facebook statuses can predict user SWL and SIQ. We expand the prediction of political identity from a single spectrum (liberal/conservative) to twelve distinct ideologies with varying levels of overlap and popularity. On that task, we establish state of the art performance with a model that also provides informative features for every pairwise political comparison. We recreate models CA may have built, and report their performance and the type of information they extracted. We bring character level deep learning to gender prediction. To our knowledge, we also set the standard for predicting IQ, fair-mindedness, self-disclosure, race, and religion from Facebook statuses. Finally, we propose a novel method to make classification less biased.
Given the broad scope of this paper, some contributions are given less space than they would typically merit. Even so, we believe it is important to report results on many traits in a single paper. This demonstrates the power of a simple model and allows task difficulty and extracted features to be compared across traits without concerns about changing experimental setup.
2 Background
2.1 myPersonality Dataset
From 2008 to 2012, over 7 million Facebook users took the myPersonality quiz produced by the psychologist David Stillwell [11]. After answering at least 20 questions, users were scored on the Big Five personality axes: openness, conscientiousness, extraversion, agreeableness, and neuroticism. Over 3 million of those users agreed to give researchers access to their extant Facebook profile and their personality scores. A much smaller subset of users answered additional questionnaires about their interests, Friends’ personality, belief in astrology, and other personal information. The research community has added to the dataset by providing race labels for several hundred thousand users; representing the text of statuses in terms of their Linguistic Inquiry and Word Count (LIWC) statistics [25]; and much more. Labels used in this study are listed in Tables 1 and 2, along with descriptive statistics. To see all available labels, visit myPersonality.org.
myPersonality.org lists 43 publications that use this data. Most work explores the relationship between personality and easily extractable features such as number of Friends or Likes, geographic location, or user-Like pairs. For example, user-Like pairs are shown to predict a person’s personality better than the judgments of that person’s spouse [26]. In 2013, Schwartz et al. introduced the open vocabulary approach (or bag of words) to personality, gender, and age prediction [19]. This significantly outperforms closed-vocabulary approaches such as LIWC that rely on domain knowledge to assign each word to one or more of 69 categories. For an excellent overview of related work, we direct readers to that paper’s introduction [19].
2.2 Language Models
2.2.1 Bag of Words
The majority of our experiments use bag of words (BoW) term frequency-inverse document frequency (tf-idf) preprocessing followed by regularized regression. First, the vocabulary is limited to the most common words in a given training set. Then a matrix of word counts, $C$, is constructed, where $C_{ij}$ refers to how often word $j$ is used by subject $i$. Each row is normalized to sum to one, moved to a log scale, and divided by $d_j$, the ratio of documents in which word $j$ appears. The result is the tf-idf matrix $X$.
$X$ is then normalized so each row lies on the unit sphere. $X$ can now be used for linear classification or regression with $\ell_2$ regularization on the parameters. This is commonly called Ridge Regression. For binary classification problems, labels are assigned values of $\pm 1$ and a threshold determines the predicted label. For categorical data with more than two labels, we train a classifier on each pair of labels. The predicted label is decided by majority vote of the $k(k-1)/2$ classifiers, where $k$ is the number of classes.
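The preprocessing and regression steps described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: the function names are hypothetical, `log1p` is an assumed choice for the log scale (it keeps zero counts at zero), and the decision threshold of zero is also an assumption.

```python
import numpy as np

def tfidf_rows(counts):
    """Turn a (subjects x words) count matrix into unit-norm tf-idf rows."""
    counts = np.asarray(counts, dtype=float)
    # Term frequency: each row (subject) is normalized to sum to one.
    tf = counts / counts.sum(axis=1, keepdims=True)
    # Assumption: log1p is used as the log scale so zero counts stay at zero.
    tf = np.log1p(tf)
    # Ratio of documents (subjects) in which each word appears.
    doc_ratio = (counts > 0).mean(axis=0)
    x = tf / doc_ratio
    # Project each row onto the unit sphere.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def ridge_fit(x, y, alpha=1.0):
    """Closed-form ridge weights: (X^T X + alpha I)^-1 X^T y."""
    n_features = x.shape[1]
    return np.linalg.solve(x.T @ x + alpha * np.eye(n_features), x.T @ y)

# Toy data: 4 subjects, 3 words, binary labels coded as +1 / -1.
counts = np.array([[3, 0, 1], [0, 2, 2], [1, 1, 0], [0, 4, 1]])
labels = np.array([1.0, -1.0, 1.0, -1.0])
X = tfidf_rows(counts)
w = ridge_fit(X, labels, alpha=0.1)
# Assumed threshold of zero on the regression output.
predicted = np.where(X @ w >= 0.0, 1.0, -1.0)
```

The sketch assumes every subject has at least one word and every word is used by at least one subject, which the vocabulary limits in Section 4.1 would guarantee in practice.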
2.2.2 Character-Level Convolutional Neural Network
For gender prediction, we also train a 49-layer character-level convolutional neural network (char-CNN) described in [27]. Much like successful computer vision architectures [28], each character is embedded in continuous space and combined with neighbors by many layers of convolutional filters. Unlike BoW models, CNNs preserve the temporal dimension, allowing the use of syntactic information. While a great advantage, and theoretically more similar to human cognition, this requires different preprocessing. During training, all inputs must be the same length along the temporal axis despite the wide variation in total length of users’ statuses. We chose to split users’ concatenated statuses into chunks of no more than 4000 characters, and no fewer than 1000, as this is enough text for humans to perform gender classification [29]. A full-length chunk contains roughly 800 words. Chunks from the same user are assigned entirely to either the training or test set. Unfortunately, preprocessing differences do not allow for a direct comparison between methods. However, enforcing the same preprocessing for both models would necessarily limit one.
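The chunking and user-level split can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, a trailing chunk shorter than 1000 characters is simply dropped, and the 80/20 ratio is borrowed from the BoW setup; the text does not state how the char-CNN pipeline handled either detail.

```python
import random

def chunk_text(text, max_len=4000, min_len=1000):
    """Split text into consecutive chunks of at most max_len characters.

    Assumption: a trailing chunk shorter than min_len is dropped.
    """
    chunks = [text[i:i + max_len] for i in range(0, len(text), max_len)]
    return [c for c in chunks if len(c) >= min_len]

def split_by_user(user_texts, test_ratio=0.2, seed=0):
    """Assign all chunks of a user to either the training or the test set."""
    users = sorted(user_texts)
    random.Random(seed).shuffle(users)
    n_test = int(len(users) * test_ratio)
    test_users = set(users[:n_test])
    train, test = [], []
    for user, text in user_texts.items():
        target = test if user in test_users else train
        target.extend((user, chunk) for chunk in chunk_text(text))
    return train, test
```

Splitting on users rather than on chunks prevents text from one person appearing on both sides of the train/test boundary.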
2.3 Labels
Tables 1 and 2 provide statistics of the continuous and categorical data respectively. What follows is a brief description of each label and how it was collected.
2.3.1 Gender
is the binary label users supplied when setting up their Facebook account. Offering this information was common before 2008, and mandatory from 2008-2014. In 2014, after the collection of this dataset, Facebook added 56 more gender options but still uses a binary representation to monetize users [30].
2.3.2 Race
labels provided in the dataset are inferred from profile pictures using the Faceplusplus.com algorithm, which can identify races termed White, Black, and Asian. A noisy measure of visual phenotype is not the gold standard for the study of race; however, our results indicate it is related to social media use.
2.3.3 Political identity
is limited to the twelve most common responses: IPA, anarchist, centrist, conservative, democrat, doesn’t care, hates politics, independent, liberal, libertarian, republican, and very liberal. These are heterogeneous categories from an open-ended question. No work was done to limit labels to political parties (e.g. remove “doesn’t care”), disambiguate misspelled or similar responses (e.g. combine “anarchy” and “anarchist” or “liberal” and “very liberal”), or limit responses to one country. To produce the word list for Liberals and Conservatives in Table 15, we combine “liberal”, “very liberal”, and “democrat” as well as “conservative”, “very conservative”, and “republican”. The most likely meaning of IPA is the Independence Party of America, which was in its nascence during this survey. The party is most popular among young people disaffected by the two party system, a sentiment reflected by the users who report IPA.
2.3.4 Religion
categories were limited to the nine most common responses, and similar labels were combined. Three variants of Catholic (“catholic”, “christian-catholic”, and “romancatholic”) were merged to form Catholic. Likewise, Christian refers to “christian”, “christian-baptist” and “christian-evangelical”. The entire list includes: Atheist, Agnostic, Catholic, Christian, Hindu, and None.
2.3.5 Belief in star sign
is the user’s response to “Horoscopes provide useful information to help guide my decisions?” Options include: Strongly Agree, Slightly Agree, No Opinion, Slightly Disagree, and Strongly Disagree.
2.3.6 Personality
is determined on five axes (Openness, Conscientiousness, Extroversion, Agreeableness, and Neuroticism) by a survey. Users answer 20-300 questions which are used to score each personality component on a scale of 1-5. There is a large body of research showing that five factor analysis is explanatory for behavior [31], and its measurement is reproducible [32]. That work is now adapting to larger datasets collected online [11].
2.3.7 Sensational Interests
include Militarism, Violent-Occult, Intellectual Recreation, Occult Credulousness, and Wholesome Activities. Users can indicate “Great Dislike”, “Slight Dislike”, “No Opinion”, “Slight Interest”, or “Great Interest” for 28 different items including: “Drugs”, “Paganism”, “Philosophy”, “Survivalism”, and “Vampires and Wolves”. Interest levels are calculated by summing responses from relevant items. The full calculation can be found in [13].
2.3.8 IQ
is determined by 20 questions that conform to Raven’s Standard Progressive Matrices. The development and validation of these questions is explained in [33] and [34]. Because performance on IQ tests has been rising at roughly 0.3 points a year over the past century and IQ is defined as mean 100, the scoring of a test is properly defined over an age cohort [35]. These scores do not take age into account and the mean is 114.
2.3.9 Satisfaction with life, self-disclosure, and fair-mindedness
are assessed by separate questionnaires. SWL is a measure of global well-being somewhat robust to short term mood fluctuations [36].
3 The Interpretation of Feature Weights
A common approach to understand traits in social science is to solve
$$Y = XW + \epsilon$$
where $X$ is observations of subjects, $Y$ is the traits of subjects, $W$ is a transition matrix, and $\epsilon$ is model error [3, 13, 37, 38, 39, 40, 41, 42, 43]. Traits are preferred to be orthogonal to promote compactness without sacrificing modeling power. The Big 5 personality model is both criticized and defended on grounds of trait independence, explanatory power, and measurability, which conforms to the linear model above [44]. Because the traits are defined by language they will not be completely orthogonal. Additionally, observations are not independent. As such, values in $W$ will have dependencies across both rows and columns. Some traits like personality are used to predict other traits or life events [13, 40]. Learning those relationships can be interpreted as informing our beliefs about column dependencies for $W$ when both traits are part of $Y$.
In this paper, $X$ is the tf-idf word matrix, $Y$ is defined by our labels, and the model weights are some estimate of $W$ we define as $\hat{W}$. Row dependencies in $W$ are based on how words function. For example, ‘camp’ and ‘camping’ perform similar roles in a status. Likewise, the relationship between IQ and agreeableness will be embedded in the columns of $W$. However, many of the tasks have little training data and the problem is ill-posed. Regularization encourages generalization, but does not provide any guarantees. Further, sometimes $\epsilon$ dominates the model when observations are not very explanatory or the relationship to a trait is not linear. Given these challenges, what confidence can be placed in the estimate $\hat{W}$?
These problems mirror those faced when clustering data. Clustering does not come with guarantees it will yield sensible answers in diverse scenarios [45]. However, it is broadly useful when exploring large sets of data [46, 47, 48]. Similarly, $\hat{W}$ can be viewed as a way of ranking features for exploration. A highly ranked observation is not proof it is important. But several highly ranked observations with functional coherence may suggest a hypothesis, particularly when coupled with domain knowledge of row and column dependencies in $W$.
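Treating the learned weights as a ranking can be illustrated with a short sketch. The function name, vocabulary, and weight values below are hypothetical and chosen only for illustration; they are not taken from the fitted models.

```python
import numpy as np

def rank_features(weights, vocabulary, k=3):
    """Return the k most positive and k most negative words for one trait."""
    order = np.argsort(weights)  # indices sorted by ascending weight
    most_negative = [vocabulary[i] for i in order[:k]]
    most_positive = [vocabulary[i] for i in order[::-1][:k]]
    return most_positive, most_negative

# Hypothetical weights for illustration only; not results from the paper.
vocabulary = ["camping", "bored", "success", "hate", "today", "sick"]
w_hat = np.array([0.8, -0.6, 0.5, -0.9, 0.01, -0.4])
positive, negative = rank_features(w_hat, vocabulary, k=2)
```

A word list produced this way is a starting point for exploration: the ordering reflects one fitted model and, as with cluster assignments, should be read alongside domain knowledge.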
The 55 most highly weighted features for each label are reported in the Appendix. Though the word lists are shown in order of importance, this ranking is not strict. Different regularization, preprocessing, or train/test splits can alter the ordering, especially when there are few examples. Additionally, more common words with lower weights may be used more often in a model’s prediction, but may not appear at the top of a list. One may use $\ell_1$ regularization to obtain an arbitrarily small number of non-zero weights [49]. This encourages weighting common words and provides more stable rankings. We demonstrate that approach with our IQ model in Section 4.2.5.
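As a hedged sketch of how sparsity-inducing regularization can yield a fixed number of non-zero weights, the code below fits a lasso by iterative soft-thresholding (ISTA) and decreases the penalty until more than `k` features would be selected. It assumes an $\ell_1$ (lasso) penalty is the intended regularizer; the solver, the penalty grid, and the function names are illustrative choices, not the procedure used for the IQ model.

```python
import numpy as np

def lasso_ista(x, y, alpha, n_iter=500):
    """Minimize (1/2n)||y - Xw||^2 + alpha * ||w||_1 by soft-thresholding."""
    n, d = x.shape
    # Step size = 1 / Lipschitz constant of the smooth part of the objective.
    step = n / (np.linalg.norm(x, 2) ** 2)
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = x.T @ (x @ w - y) / n
        z = w - step * grad
        w = np.sign(z) * np.maximum(np.abs(z) - step * alpha, 0.0)
    return w

def select_k_words(x, y, k, n_alphas=50):
    """Weaken the penalty until more than k weights become non-zero."""
    n = x.shape[0]
    alpha_max = np.max(np.abs(x.T @ y)) / n  # at or above this, all weights are zero
    best = np.zeros(x.shape[1])
    for alpha in alpha_max * np.logspace(0, -3, n_alphas):
        w = lasso_ista(x, y, alpha)
        if np.count_nonzero(w) > k:
            break
        best = w
    return best

# Synthetic data (assumption): only the first two features carry signal.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 20))
y = 2.0 * x[:, 0] - 1.5 * x[:, 1] + 0.1 * rng.normal(size=200)
w_sparse = select_k_words(x, y, k=2)
```

Because the penalty is swept from strong to weak, the features that survive are those the model can least afford to drop, which tends to favor common, informative words.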
There are many well-studied phenomena embedded in the estimated weights $\hat{W}$ produced by our work. For example, Sarah Palin is the only politician indicated in the liberal word list in Table 15. Likewise, Nancy Pelosi ranks just below Ronald Reagan among conservative words. This accords with literature on the memorability of negative ads [50], importance of outgroup prejudice for social identity [51, 52], and biases women face in politics [53, 54]. We hope the many word lists in the appendix will be useful to researchers in the development of new hypotheses.
The weight estimate $\hat{W}$ is also useful to understand models built on social media data. Until recently, the models themselves were not very important. However, machine learning can now be used to estimate sensitive traits such as criminal recidivism [43]. Given the literalness with which estimates are often interpreted, it is essential to note that model weights are causal for the predicted label. In Section 4.5 we use our understanding of the input features to characterize information the model extracts to predict religion. This dataset also includes demographic labels, which show predicted religion labels are more gendered than the ground truth.
We hope the included word lists (a) highlight unstudied relationships about these traits and (b) illustrate what kind of information is extracted from social media by machine learning systems.
4 Results and Discussion
4.1 Experimental Setup
All BoW experiments employ the same preprocessing. Users must have over 500 words in the sum of all their statuses. 80% of the data is randomly assigned to the training set; the remaining samples constitute the test set. The vocabulary is limited to the 40,000 most common words in each training set. Words must be used by at least 10 users but no more than 60% of users in the training set. The regularization parameter is tuned via efficient leave one out cross validation [55] for smaller datasets, and $k$-fold cross validation for larger datasets. All BoW models are implemented using the sklearn library [56]. Table 1 reports the number of samples and explained variance (EV) of the predictions on continuous data. Table 2 reports the number of classes, ratio of samples in the dominant class, homogeneity, and performance on tasks with categorical data.
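For ridge regression, leave-one-out error can be computed from a single fit using the diagonal of the hat matrix. The sketch below shows that standard identity for a model without an intercept; whether it coincides exactly with the method of [55] is an assumption, and the function names are hypothetical.

```python
import numpy as np

def ridge_loo_mse(x, y, alpha):
    """Leave-one-out MSE for ridge (no intercept) from a single fit."""
    gram_inv = np.linalg.inv(x.T @ x + alpha * np.eye(x.shape[1]))
    # Diagonal of the hat matrix H = X (X^T X + alpha I)^-1 X^T.
    hat_diag = np.einsum("ij,jk,ik->i", x, gram_inv, x)
    residuals = y - x @ (gram_inv @ (x.T @ y))
    # Held-out residual for sample i equals residual_i / (1 - H_ii).
    loo_residuals = residuals / (1.0 - hat_diag)
    return float(np.mean(loo_residuals ** 2))

def choose_alpha(x, y, alphas):
    """Pick the regularization strength with the lowest leave-one-out MSE."""
    scores = [ridge_loo_mse(x, y, a) for a in alphas]
    return alphas[int(np.argmin(scores))]
```

The shortcut replaces one refit per sample with a single matrix inverse, which is what makes leave-one-out tuning affordable on the smaller datasets.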
Label | N | EV |
---|---|---|
Personality | ||
Openness | 84451 | 0.171 |
Conscientiousness | 84451 | 0.120 |
Extroversion | 84451 | 0.141 |
Agreeableness | 84451 | 0.090 |
Neuroticism | 84451 | 0.100 |
Sensational Interests | ||
Militarism | 4074 | 0.165 |
Violent-Occult | 4074 | 0.192 |
Intellectual Recreation | 4074 | 0.033 |
Occult Credulousness | 4074 | 0.144 |
Wholesome Activities | 4074 | 0.108 |
Satisfaction With Life | 2502 | 0.034 |
Self Disclosure | 2006 | 0.092 |
Fair-Mindedness | 2006 | 0.064 |
IQ | 1807 | 0.128 |
Table 1: Explained Variance (EV) is $1 - \mathrm{Var}(y - \hat{y}) / \mathrm{Var}(y)$, where $y$ is the true label and $\hat{y}$ is the predicted label.
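The explained variance statistic can be computed directly from its standard definition, one minus the variance of the residuals over the variance of the labels. The function name below is illustrative.

```python
import numpy as np

def explained_variance(y_true, y_pred):
    """EV = 1 - Var(y - y_hat) / Var(y)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 1.0 - np.var(y_true - y_pred) / np.var(y_true)
```

Under this definition a model that always predicts the mean scores zero, and a constant offset in the predictions is not penalized because it does not change the variance of the residuals.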
Label | N | Classes | Mode | Homogeneity | F1-score | Acc |
---|---|---|---|---|---|---|
Gender | 109104 | 2 | 0.598 | 0.519 | 0.92 | 0.903 |
Race | 22059 | 3 | 0.682 | 0.52 | 0.74 | 0.766 |
Political identity | 19769 | 12 | 0.213 | 0.133 | 0.33 | 0.337 |
Religious identity | 8388 | 5 | 0.488 | 0.318 | 0.54 | 0.541 |
Belief in Star Sign | 7115 | 5 | 0.331 | 0.245 | 0.32 | 0.334 |
Table 2: Mode is the ratio of the dominant class. Homogeneity is the probability two random samples will be of the same class. The F1-Score is the harmonic mean of precision and recall. For non-binary labels, the precision and recall for each class is weighted by its support.
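The two descriptive statistics in Table 2 can be sketched as below, assuming homogeneity is computed by sampling with replacement, so that it equals the sum of squared class proportions. Under that assumption a two-class label with a mode of 0.598 gives $0.598^2 + 0.402^2 \approx 0.519$, which agrees with the gender row. The class names are placeholders.

```python
from collections import Counter

def mode_ratio(labels):
    """Fraction of samples that belong to the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

def homogeneity(labels):
    """Probability that two samples drawn with replacement share a class."""
    counts = Counter(labels)
    total = len(labels)
    return sum((c / total) ** 2 for c in counts.values())

# Placeholder labels reproducing the class balance of the gender row.
labels = ["class_a"] * 598 + ["class_b"] * 402
```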
4.2 Performance
4.2.1 Gender
Table 3 compares our gender predictor to several other methods. The BoW model with a vocabulary of 500,000 yields accuracy of 92.8%, 1.4% more accurate than the tri-gram model reported by Schwartz et al. [19]. Even though the same dataset is used, the comparison is not direct. The tri-gram model seeks to remove the age information from words, has a larger vocabulary, preserves some temporal relationships in the tri-grams, and draws a different train/test split. Moreover, the preprocessing is more restrictive and only includes users with at least 1000 words. Notwithstanding these discrepancies, which may boost or dampen performance, the results are very similar. When the LIWC representation is added to the tri-grams, there is a slight improvement to 91.6% accuracy. Preprocessing is even less similar for the char-CNN described in Section 2.2.2. The human baseline of 84.0% consists of volunteer judgments based on 20-40 user tweets as reported by Nguyen et al. [29]. This is less text than is available to the other models, and from a different social media platform. But, with 210 volunteer guesses per user, it provides a relevant human baseline.
4.2.2 Personality
After gender, personality is the most studied trait in this paper. As with gender, Schwartz et al. achieve the best results to date [19]. They report the square root of EV to two significant digits: 0.42, 0.35, 0.38, 0.31, 0.31. In that format, we are just 0.01 beneath the state of the art for openness and agreeableness, 0.01 better for neuroticism, and equivalent for the remaining traits. As with gender, we achieve this with a simpler model.
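The comparison can be checked by taking the square root of each EV value in Table 1 and rounding to two decimals. The sketch assumes the five values quoted from Schwartz et al. are listed in the order openness, conscientiousness, extroversion, agreeableness, neuroticism.

```python
import math

# Explained variance from Table 1.
ours_ev = {
    "Openness": 0.171,
    "Conscientiousness": 0.120,
    "Extroversion": 0.141,
    "Agreeableness": 0.090,
    "Neuroticism": 0.100,
}
# Square root of EV quoted from [19]; the trait order is an assumption.
reference = dict(zip(ours_ev, [0.42, 0.35, 0.38, 0.31, 0.31]))

ours_sqrt = {trait: round(math.sqrt(ev), 2) for trait, ev in ours_ev.items()}
difference = {t: round(ours_sqrt[t] - reference[t], 2) for t in ours_ev}
```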
4.2.3 Political Identity
Prediction accuracy of 33.7% is a gain of 11.7% over the baseline strategy of always predicting the mode, ‘doesn’t care’. As noted in the experiments section, training samples are weighted inversely to their class representation; therefore, ignoring any class will result in an equal loss. This does not provide the highest classification accuracy. However, we believe when some classes are sparsely populated an MSE optimal classifier that is highly biased toward the mode should not be the standard. For reference, equal sample weights and the same training scheme yield classification accuracy of 36.3% and a weighted F1 score of 31.6%. With equal weights, five classes (IPA, hates politics, independent, libertarian, and very liberal) have no representation in the test set predictions, whereas the weighted classifier predicts each class at least once.
According to Preotiuc-Pietro et al., all previous research on predicting political ideology from social media text has used binary labels such as liberal vs conservative or Democrat vs Republican. They broaden the classification task to include seven gradations on the liberal to conservative spectrum [57]. When predicting ideological tilt from tweets, they achieve a 2.6% boost over baseline (19.6%) with BoW followed by logistic regression. Word2Vec feature embeddings [58] and multi-target learning with some hand-crafted labels yield an 8.0% boost. From classification along grades of a single spectrum, we significantly expand the task to twelve diverse identities with varying levels of representation and ideological overlap while maintaining classification accuracy.
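Inverse class-frequency weighting can be sketched as follows. The normalization $n / (k \cdot n_c)$, where $n_c$ is the size of class $c$, is an assumed convention (it matches the "balanced" heuristic common in machine learning libraries); the exact weighting used in the experiments is not specified, and the function names are hypothetical.

```python
import numpy as np

def balanced_sample_weights(labels):
    """Weight each sample by n / (k * n_c) so every class has equal total weight."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    per_class = len(labels) / (len(classes) * counts)
    lookup = dict(zip(classes, per_class))
    return np.array([lookup[label] for label in labels])

def weighted_ridge(x, y, sample_weight, alpha=1.0):
    """Solve (X^T S X + alpha I) w = X^T S y with S = diag(sample_weight)."""
    xw = x * sample_weight[:, None]
    return np.linalg.solve(x.T @ xw + alpha * np.eye(x.shape[1]), xw.T @ y)
```

With these weights the total loss contributed by a rare class equals that of a common one, so a model that ignores any class pays the same price.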
In Table 6 we report the matrix of highest weighted words for separating users in each pairwise class comparison. As with race, belief in star sign, and religion, we plan on making expanded pairwise lists available online. In Table 7 we report the confusion matrix. Note that many errors are between similar labels, such as liberal and democrat. Ease of training, strong performance, and representation of minority classes make a majority vote system of shallow pairwise classifiers a good approach for this task.
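A majority vote over shallow pairwise classifiers can be sketched with closed-form ridge fits on labels coded as $\pm 1$. This is an illustration under assumptions: the function names are hypothetical, the decision threshold is zero, no intercept is fit, and ties are broken in favor of the class that sorts first. For twelve classes the scheme trains $12 \cdot 11 / 2 = 66$ classifiers.

```python
import numpy as np
from itertools import combinations

def fit_pairwise(x, labels, alpha=1.0):
    """Fit one ridge classifier per pair of classes, coded as +1 / -1."""
    labels = np.asarray(labels)
    models = {}
    for a, b in combinations(sorted(set(labels)), 2):
        mask = np.isin(labels, [a, b])
        xs = x[mask]
        y = np.where(labels[mask] == a, 1.0, -1.0)
        models[(a, b)] = np.linalg.solve(
            xs.T @ xs + alpha * np.eye(x.shape[1]), xs.T @ y
        )
    return models

def predict_vote(models, x):
    """Each pairwise model casts one vote per sample; the most votes wins."""
    classes = sorted({c for pair in models for c in pair})
    index = {c: i for i, c in enumerate(classes)}
    votes = np.zeros((x.shape[0], len(classes)), dtype=int)
    for (a, b), w in models.items():
        scores = x @ w
        votes[scores >= 0, index[a]] += 1  # assumed threshold of zero
        votes[scores < 0, index[b]] += 1
    return np.array(classes)[votes.argmax(axis=1)]
```

The weight vector stored for each pair is also what a pairwise word list would be read from: sorting it gives the words that most separate the two classes.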
For binary comparison, by pooling {‘very liberal’,‘liberal’,‘democrat’} and {‘very conservative’,‘conservative’,‘republican’} we achieve 76.4% accuracy; 12.1% above baseline. Table 15 shows the top 55 liberal and conservative words.
4.2.4 Religion
Religion seems to be more difficult to glean from statuses than political identity. At 54.1%, accuracy is a modest 5.3% above guessing the mode. The most highly weighted pairwise words are in Table 8, and Table 9 shows the confusion matrix. The most highly weighted word to distinguish someone who is agnostic from an atheist is ‘boyfriend’. This led us to look deeper at that pairwise classifier in Section 4.5. Binary labels were constructed by pooling {‘catholic’, ‘christian-catholic’, ‘romancatholic’, ‘christian’, ‘christian-baptist’} and {‘atheist’, ‘agnostic’, ‘none’}. We achieve 78.0% accuracy, 5.2% above baseline. Those words are in Table 15. To our knowledge, there is no other multi-class religion predictor to which our results can be compared.
4.2.5 IQ
In a genome wide association meta study of 78,308 individuals, 336 single nucleotide polymorphisms were found to explain 2.1-4.8% of the IQ variance among the test population [59]. We achieve 12.8% EV with a model trained on fewer than 2000 users and their statuses. Using regularization to limit the vocabulary to the ten most informative words (final, physics; ayaw, family, friend, heart, lmao, nite, strong, ur) still yields 5.6% EV. The relative accuracy of such a trivial model that leverages intuitive features is a helpful comparison for any project predicting this important trait. To our knowledge, this is the only work to date that infers IQ from social media.
The selected features are also informative. Words suggesting intelligence (‘final’ and ‘physics’) are parsimonious and singularly academic. Whereas the university experience is sufficient to find users with high IQ, features inversely related to IQ are more focused on disposition. From Table 10, agreeableness is implied by ‘family’ and ‘heart’; conscientiousness is implied by ‘family’ and ‘lmao’; and low openness is implied by ‘ur’. Overall, the list can be characterized as prosocial, or at least concerned with social relationships. Predicting low IQ with prosocial features seems to challenge some previous research.
Gottlieb et al observed that learning disabled children were more likely to engage in solitary play [60]. Play has also been observed to be more aggressive [61]. More directly related to our task, McConaughy and Ritter showed a positive correlation between the IQ of learning disabled boys and social competence scores; and a negative correlation between IQ and behavior problem scores [62]. For further review of the subject see [63].
An MSE optimal classifier seeks to generalize information about samples near the average. This can cause bias when classifying minorities, but is instructive when interpreting features. Features should say something about the majority of our sample, those with IQ near the mean. This explains why antisocial behavior among those with extremely low IQ does not preclude prosocial behavior indicating moderately lower IQ. Reflecting the limitations of this type of study, words like ‘family’, ‘friend’, and ‘heart’ could also be caused by differing norms for social media use or many other factors. Prosocial words predicting lower IQ does however suggest interesting future work.
4.2.6 Sensational Interests
In this study, SIQ is the easiest continuous variable to predict, even with an order of magnitude less training data than personality. The SIQ lists 28 discrete interests like ‘black magic’ and ‘the armed forces’. Very similar terms can be recovered from statuses: ‘zombie’, ‘blood’, ‘vampire’; ‘military’, ‘marines’, ‘training’. Personality tests, on the other hand, ask more abstract questions like ‘I shirk my duties’ for conscientiousness. Many of these duties seem to be extracted in Table 10: ‘studying’, ‘busy’, ‘obstacles’. But many more training examples are required for similar performance.
This is the first work to demonstrate an automatic system for predicting SIQ. Previous research relied on manually counting the number of sensational interests in statuses. The count was only correlated with militarism among men; the relationship was negative for women [16].
4.2.7 Satisfaction With Life
Previous research cast doubt on the relationship between status updates and SWL [17]. The number of positive words used on Facebook nationwide in a given day, week, or month, is inversely correlated with the SWL of that time period’s myPersonality participants. The interpretation of that result is that it “challenges the assumption that linguistic analysis of internet messages is related to underlying psychological states.” Here we show that a BoW model accounts for 3.4% of the variance in SWL scores. Moreover, the most important words the model finds are intuitive. Lower SWL is implied by “fucking”, “hate”, “bored”, “interview”, “sick”, “hospital”, “insomnia”, “farmville”, and “video”. The deleterious effects of joblessness, anger, chronic illness, and isolation are well documented. Words positively associated with SWL—“camping”, “imagination”, “epic”, “cleaned”, “success”—make similar sense.
Conversational AI on Facebook Messenger is an efficacious and scalable way to administer cognitive behavioral therapy [6]. Our results show linguistic analysis can shed light on underlying psychological states. This is important to find users that could benefit from such treatment.
4.2.8 Belief in Star Sign
Compared to political identity, BSS has seven fewer classes and a far more homogeneous distribution. Even so, the BSS classifier performs slightly worse than the politics classifier and roughly on par with the baseline of predicting the mode. Unlike our race, gender, politics and sensational interests, we don’t wear belief in astrology on our sleeve.
4.3 Model Selection
BoW models are somewhat unintuitive. Humans use syntactic information when decoding language, which the model discards. Yet, for many tasks they achieve state of the art performance. We compare our BoW to a character-level CNN on gender prediction, our most data rich problem. A character-level CNN is well suited to large amounts of messy, user generated data. Pooling layers in a CNN allow generalization of words like “gooooooooo” and “gooooooo”, while BoW must learn distinct weights. Surprisingly, the CNN does not outperform the simple BoW as shown in Table 3.
We found the choice of prediction model is not as important as preprocessing. In initial experiments, Support Vector Machines [64], logistic regression, and regularized regression yielded similar performance, depending on choice of $n$-grams and whether Singular Value Decomposition was used [65]. We implement ridge regression and classification for simplicity.
Inferring human traits from social media is now being done using deep models [66, 57]. That may be useful in some cases, but for this project the deep model offered no performance boost or intuition to underlying human behavior. Perhaps a continuous bag of words [58] and recurrent neural network [67] would have done better, but researchers should not consider deep learning essential for this field. Moreover, any performance gains should be weighed against loss of interpretability.
4.4 Cambridge Analytica
With current technology, Facebook statuses are a better predictor of someone’s IQ than the totality of their genetic material [59]. When a marketing firm adds such a tool to its arsenal it is natural to be suspicious. Indeed, The Guardian article that broke the CA story was headlined “‘I made Steve Bannon’s psychological warfare tool’: meet the data war whistleblower” [24]. (Steve Bannon is the former chief executive of the Trump presidential campaign.) However, closer inspection of psychographic models casts doubt on their ability to add value to an advertising campaign, even when the predictions are accurate. In this paper we show that militarism is one of the most easily inferred traits. At 16.5% explained variance, it is more predictable than any of the Big 5 personality traits except openness, even with just 5% of the training data. SIQ is also a much stronger predictor of aggressive behavior than the Big 5 [14]. If this trait were actionable for the Trump campaign, it is interesting that the two most highly weighted features are ‘xbox’ and ‘man’. Gaming interest and gender are already available via Facebook’s advertising platform; reaching that demographic does not require an independent model. Additionally, Steve Bannon’s belief in the political power of gamers predates CA’s psychographic model by a decade [68].
Readers are encouraged to view the word lists in the Appendix through the lens of the task accuracy reported in Tables 1 and 2. They may come to the same conclusion as the Trump campaign who, according to CBS News, “never used the psychographic data at the heart of a whistleblower who once worked to help acquire the data’s reporting – principally because it was relatively new and of suspect quality and value.” [69]. Performance results and extracted features allow for more informed discussion, particularly for SIQ, fair-mindedness and self-disclosure, for which we report the first accurate prediction model.
There are limitations to this analysis. Our models only use statuses; Likes and network statistics could increase accuracy. Further, other psychographic traits beyond militarism may be politically useful but have no obvious demographic stand-in. Finally, we don’t have access to CA’s exact dataset and instead built our models on the myPersonality dataset.
Predicted (Men) | Agnostic | Atheist | Total |
---|---|---|---|
True Agnostic | 36 | 33 | 69 |
True Atheist | 28 | 58 | 86 |
Total | 64 | 91 | |

Predicted (Women) | Agnostic | Atheist | Total |
---|---|---|---|
True Agnostic | 86 | 21 | 107 |
True Atheist | 34 | 16 | 50 |
Total | 120 | 37 | |
Predicted (Men) | Agnostic | Atheist | Total |
---|---|---|---|
True Agnostic | 40 | 29 | 69 |
True Atheist | 31 | 55 | 86 |
Total | 71 | 84 | |

Predicted (Women) | Agnostic | Atheist | Total |
---|---|---|---|
True Agnostic | 85 | 22 | 107 |
True Atheist | 31 | 19 | 50 |
Total | 116 | 41 | |
4.5 Gender Bias in Atheist vs Agnostic Classifier
Highly weighted atheist words include “fucking”, “bloody”, “maths”, “degrees”, “disease”, “wifey”, and “religion”. Meanwhile, “beautiful”, “santa”, “friggin”, “thank”, “hubby”, “miles”, and “paperwork” imply the user is agnostic. This paints a picture of academic, male, disagreeable and British atheists. Agnostic words are more positive, female, and related to mundane preparation. A more complete list is shown in Table 15. What follows is an empirical analysis of our estimator’s gender bias, a discussion of fairness, and results from debiasing the model.
In this dataset, atheists and agnostics are 33.5% and 50.3% female respectively. This is a stronger female preference for agnosticism than in random surveys across the United States, which report 32% and 38%, respectively [70]. Table 4 shows the confusion matrices for men and women. The ratio of predicted to true agnostics is 0.945 for men and 1.35 for women. Similarly, the ratio of false atheist to false agnostic predictions is 90.8% larger for men than women. The classification of women, the minority in this dataset, is highly distorted.
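The 90.8% figure can be reproduced from the confusion matrices with a few lines of arithmetic. The sketch below assumes that the first pair of matrices (men and women) corresponds to the classifier without any intervention, which is consistent with the number quoted above; the variable and function names are illustrative.

```python
# Confusion-matrix counts, keyed as (true label, predicted label).
men = {("agnostic", "agnostic"): 36, ("agnostic", "atheist"): 33,
       ("atheist", "agnostic"): 28, ("atheist", "atheist"): 58}
women = {("agnostic", "agnostic"): 86, ("agnostic", "atheist"): 21,
         ("atheist", "agnostic"): 34, ("atheist", "atheist"): 16}

def false_atheist_to_false_agnostic(cm):
    """Agnostics labelled atheist divided by atheists labelled agnostic."""
    return cm[("agnostic", "atheist")] / cm[("atheist", "agnostic")]

ratio_men = false_atheist_to_false_agnostic(men)      # 33 / 28
ratio_women = false_atheist_to_false_agnostic(women)  # 21 / 34

# How much larger the ratio is for men than for women.
excess = ratio_men / ratio_women - 1
print(f"{ratio_men:.3f} {ratio_women:.3f} {excess:.1%}")
# prints: 1.179 0.618 90.8%
```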
Models built to generalize information often amplify biases in training data. Cooking videos elicit female pronouns in machine-generated captions 68% more than male pronouns, even though the training data shows only 33% more women cooking [71]. Word embeddings used in machine translation [72], information retrieval [73], and student grade prediction [74] produce analogies such as “man is to computer programmer as woman is to homemaker” [75].
There are many notions of fairness defined over an individual [76, 77, 78], population [79, 80], or information available to the model [81]. Building a fair estimator often requires domain knowledge to define a similarity metric [76], make corpus-level constraints [71], or construct a causal model that separates protected information from other latent variables [78]. In this paper, we will use the notion of Disparate Mistreatment to measure fairness [79]. That is, if protected classes experience disparate rates of false positive, false negative or overall misclassification, the estimator is unfair.
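To make the definition concrete, the following sketch computes the three rates named above for each gender and reports the gaps between groups. It is an illustration under stated assumptions: "atheist" is arbitrarily treated as the positive class, the counts are taken from the first pair of confusion matrices for men and women, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Confusion:
    tn: int  # true agnostic, predicted agnostic
    fp: int  # true agnostic, predicted atheist
    fn: int  # true atheist, predicted agnostic
    tp: int  # true atheist, predicted atheist

    def rates(self):
        total = self.tn + self.fp + self.fn + self.tp
        return {
            "false_positive_rate": self.fp / (self.fp + self.tn),
            "false_negative_rate": self.fn / (self.fn + self.tp),
            "misclassification_rate": (self.fp + self.fn) / total,
        }

def mistreatment_gaps(group_a, group_b):
    """Difference in each error rate between two protected groups.
    A gap of zero on all three rates means no disparate mistreatment."""
    ra, rb = group_a.rates(), group_b.rates()
    return {name: ra[name] - rb[name] for name in ra}

men = Confusion(tn=36, fp=33, fn=28, tp=58)
women = Confusion(tn=86, fp=21, fn=34, tp=16)
gaps = mistreatment_gaps(men, women)

for name, gap in gaps.items():
    print(f"{name}: {gap:+.3f}")
```

A positive gap means the rate is higher for men, a negative gap that it is higher for women. Overall misclassification can look similar across groups even when the false positive and false negative rates diverge in opposite directions, which is why all three are inspected.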
To mitigate Disparate Mistreatment we explicitly encode gender ({-1, 0, 1} for {male, unknown, female}) in the feature vector at training time. At test time the gender of all samples is encoded as unknown. The intuition is that latent variables are amplified when they are easy to extract and correlated with the target. As demonstrated by the accuracy of our race and gender predictors, that is often the case for protected information. There often exist more informative, if more subtle, traits than the protected features. For example, the share of atheists and agnostics who don’t believe in God differs sharply, at 92% and 41% respectively [70]. Additionally, religiosity is shown to be correlated with both Agreeableness and Conscientiousness [82]. But gender is much easier to extract than belief in God or personality. By explicitly giving the model gender information, we hope that the model will do more to extract those other features.
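A minimal sketch of this encoding step is given below. The sparse-dictionary feature format, the column name `__gender__` and the helper function are hypothetical, and the specific codes for male and female (-1 and 1 around an unknown value of 0) should be read as an assumption of this example.

```python
# Assumed codes: a symmetric encoding around 0 for "unknown".
GENDER_CODE = {"male": -1, "unknown": 0, "female": 1}

def add_gender_feature(features, gender, training):
    """Return a copy of a sparse feature dict with a gender column.

    During training the reported gender is encoded, so the model can
    attribute gendered variance to this column. At test time every
    sample is encoded as unknown, so the column contributes nothing.
    """
    code = GENDER_CODE[gender] if training else GENDER_CODE["unknown"]
    out = dict(features)
    out["__gender__"] = code
    return out

# Hypothetical word counts for one status.
counts = {"hubby": 1, "thank": 2}
train_row = add_gender_feature(counts, "female", training=True)
test_row = add_gender_feature(counts, "female", training=False)
```

With a linear model, a zero in the gender column at test time removes that column's contribution from the prediction, leaving only the weights learned for the words themselves.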
This approach produces much less Disparate Mistreatment of men and women. The ratio of predicted to true agnostics moves closer to parity at 1.02 for men and 1.22 for women. Additionally, the ratio of false atheist to false agnostic predictions is now only 31.8% larger for men, compared to 90.8% without intervention. The most highly weighted agnostic words for the new fair classifier are also less gendered; “hair”, “wifey”, and “boyfriend” are no longer in the top 55, as reported in Table 15. We also saw no decay in classification rate.
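The claims about classification rate and the 31.8% figure can be checked against the confusion matrices. This sketch assumes the first pair of matrices is the classifier without intervention and the second pair is the classifier trained with the explicit gender feature; the tuple layout and function names are illustrative.

```python
# Each tuple: (agnostic->agnostic, agnostic->atheist,
#              atheist->agnostic,  atheist->atheist), as true->predicted.
before = {"men": (36, 33, 28, 58), "women": (86, 21, 34, 16)}
after = {"men": (40, 29, 31, 55), "women": (85, 22, 31, 19)}

def accuracy(tables):
    """Share of correctly classified users across both genders."""
    correct = sum(t[0] + t[3] for t in tables.values())
    total = sum(sum(t) for t in tables.values())
    return correct / total

def excess_for_men(tables):
    """How much larger the false-atheist / false-agnostic ratio is
    for men than for women."""
    men, women = tables["men"], tables["women"]
    return (men[1] / men[2]) / (women[1] / women[2]) - 1

print(f"accuracy: {accuracy(before):.3f} -> {accuracy(after):.3f}")
print(f"excess for men after intervention: {excess_for_men(after):.1%}")
# prints:
# accuracy: 0.628 -> 0.638
# excess for men after intervention: 31.8%
```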
The gender bias of the atheism classifier is clear by simply inspecting its most heavily weighted features. More opaque models should be subjected to more rigorous inspection for bias.
5 Conclusion and Future Work
We match or set the state of the art for the 20 traits in this paper. Additionally, we provide the top words for many pairwise classification problems, and top 55 words for regression or binary classification problems. We hope researchers from many fields find the benchmarks and word lists useful. Our analysis of psychographic models in marketing as well as gender bias in a religion classifier are examples of how these performance measures and extracted features can be used together.
In future work we hope to explore what types of unfairness can be solved by our approach in Section 4.5. Further, models built on traits with few examples are well suited to be augmented by transfer learning. This is especially pressing for detecting states like low satisfaction with life, which can be somewhat ameliorated at low cost.
References
- [1] J. B. Stewart, “Facebook has 50 minutes of your time each day. it wants more,” The New York Times, vol. 5, 2016.
- [2] SunCorp, “Digitising reputation pays off in the rental market,” 2017.
- [3] A. E. Khandani, A. J. Kim, and A. W. Lo, “Consumer credit-risk models via machine-learning algorithms,” Journal of Banking & Finance, vol. 34, no. 11, pp. 2767–2787, 2010.
- [4] D. L. Cogburn and F. K. Espinoza-Vasquez, “From networked nominee to networked nation: Examining the impact of web 2.0 and social media on political participation and civic engagement in the 2008 obama campaign,” Journal of Political Marketing, vol. 10, no. 1-2, pp. 189–213, 2011.
- [5] R. J. González, “Hacking the citizenry?: Personality profiling, ‘big data’ and the election of donald trump,” Anthropology Today, vol. 33, no. 3, pp. 9–12, 2017.
- [6] K. K. Fitzpatrick, A. Darcy, and M. Vierhile, “Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (woebot): a randomized controlled trial,” JMIR mental health, vol. 4, no. 2, 2017.
- [7] R. Allan, “Hard questions: Who should decide what is hate speech in an online global community?,” 2017.
- [8] J. Cheng, C. Danescu-Niculescu-Mizil, and J. Leskovec, “Antisocial behavior in online discussion communities.,” in ICWSM, pp. 61–70, 2015.
- [9] A. Noulas, S. Scellato, R. Lambiotte, M. Pontil, and C. Mascolo, “A tale of many cities: universal patterns in human urban mobility,” PloS one, vol. 7, no. 5, p. e37027, 2012.
- [10] S.-H. Yang, B. Long, A. Smola, N. Sadagopan, Z. Zheng, and H. Zha, “Like like alike: joint friendship and interest propagation in social networks,” in Proceedings of the 20th international conference on World wide web, pp. 537–546, ACM, 2011.
- [11] M. Kosinski, S. C. Matz, S. D. Gosling, V. Popov, and D. Stillwell, “Facebook as a research tool for the social sciences: Opportunities, challenges, ethical considerations, and practical guidelines.,” American Psychologist, vol. 70, no. 6, p. 543, 2015.
- [12] J. Henrich, S. J. Heine, and A. Norenzayan, “The weirdest people in the world?,” Behavioral and Brain Sciences, vol. 33, no. 2-3, pp. 61–83, 2010.
- [13] V. Egan, J. Auty, R. Miller, S. Ahmadi, C. Richardson, and I. Gargan, “Sensational interests and general personality traits,” The Journal of Forensic Psychiatry, vol. 10, no. 3, pp. 567–582, 1999.
- [14] V. Egan and V. Campbell, “Sensational interests, sustaining fantasies and personality predict physical aggression,” Personality and Individual Differences, vol. 47, no. 5, pp. 464–469, 2009.
- [15] A. Weiss, V. Egan, and A. J. Figueredo, “Sensational interests as a form of intrasexual competition,” Personality and Individual Differences, vol. 36, no. 3, pp. 563–573, 2004.
- [16] G. Hagger-Johnson, V. Egan, and D. Stillwell, “Are social networking profiles reliable indicators of sensational interests?,” Journal of Research in Personality, vol. 45, no. 1, pp. 71–76, 2011.
- [17] N. Wang, M. Kosinski, D. Stillwell, and J. Rust, “Can well-being be measured using facebook status updates? validation of facebook’s gross national happiness index,” Social Indicators Research, vol. 115, no. 1, pp. 483–491, 2014.
- [18] M. Kosinski, D. Stillwell, and T. Graepel, “Private traits and attributes are predictable from digital records of human behavior,” Proceedings of the National Academy of Sciences, vol. 110, no. 15, pp. 5802–5805, 2013.
- [19] H. A. Schwartz, J. C. Eichstaedt, M. L. Kern, L. Dziurzynski, S. M. Ramones, M. Agrawal, A. Shah, M. Kosinski, D. Stillwell, M. E. Seligman, et al., “Personality, gender, and age in the language of social media: The open-vocabulary approach,” PloS one, vol. 8, no. 9, p. e73791, 2013.
- [20] G. Farnadi, G. Sitaraman, S. Sushmita, F. Celli, M. Kosinski, D. Stillwell, S. Davalos, M.-F. Moens, and M. De Cock, “Computational personality recognition in social media,” User modeling and user-adapted interaction, vol. 26, no. 2-3, pp. 109–142, 2016.
- [21] M. Sap, G. Park, J. Eichstaedt, M. Kern, D. Stillwell, M. Kosinski, L. Ungar, and H. A. Schwartz, “Developing age and gender predictive lexica over social media,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1146–1151, 2014.
- [22] N. Y. Times, “How trump consultants exploited the data of millions,” 2018.
- [23] M. Watch, “Facebook valuation drops $75 billion in week after cambridge analytica scandal,” 2018.
- [24] T. Guardian, “‘i made steve bannon’s psychological warfare tool’: meet the data war whistleblower,” 2018.
- [25] J. W. Pennebaker, M. E. Francis, and R. J. Booth, “Linguistic inquiry and word count: Liwc 2001,” Mahwah: Lawrence Erlbaum Associates, 2001.
- [26] W. Youyou, M. Kosinski, and D. Stillwell, “Computer-based personality judgments are more accurate than those made by humans,” Proceedings of the National Academy of Sciences, vol. 112, no. 4, pp. 1036–1040, 2015.
- [27] A. Conneau, H. Schwenk, L. Barrault, and Y. Lecun, “Very deep convolutional networks for text classification,” in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, vol. 1, pp. 1107–1116, 2017.
- [28] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105, 2012.
- [29] D. Nguyen, D. Trieschnigg, A. S. Doğruöz, R. Gravel, M. Theune, T. Meder, and F. De Jong, “Why gender and age prediction from tweets is hard: Lessons from a crowdsourcing experiment,” in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pp. 1950–1961, 2014.
- [30] R. Bivens, “The gender binary will not be deprogrammed: Ten years of coding gender on facebook,” New Media & Society, vol. 19, no. 6, pp. 880–898, 2017.
- [31] J. M. Digman, “Personality structure: Emergence of the five-factor model,” Annual review of psychology, vol. 41, no. 1, pp. 417–440, 1990.
- [32] R. R. McCrae and P. T. Costa, “Validation of the five-factor model of personality across instruments and observers.,” Journal of personality and social psychology, vol. 52, no. 1, p. 81, 1987.
- [33] M. LLC, “The development and piloting of an online iq test,” 2014.
- [34] M. Kosinski, “Measurement and prediction of individual and group differences in the digital environment,” Department of Psychology University of Cambridge, 2014.
- [35] J. R. Flynn, “Massive iq gains in 14 nations: What iq tests really measure.,” Psychological bulletin, vol. 101, no. 2, p. 171, 1987.
- [36] E. Diener, R. A. Emmons, R. J. Larsen, and S. Griffin, “The satisfaction with life scale,” Journal of personality assessment, vol. 49, no. 1, pp. 71–75, 1985.
- [37] L. Cooke, J. Wardle, E. Gibson, M. Sapochnik, A. Sheiham, and M. Lawson, “Demographic, familial and trait predictors of fruit and vegetable consumption by pre-school children,” Public health nutrition, vol. 7, no. 2, pp. 295–302, 2004.
- [38] M. Peciña, H. Azhar, T. M. Love, T. Lu, B. L. Fredrickson, C. S. Stohler, and J.-K. Zubieta, “Personality trait predictors of placebo analgesia and neurobiological correlates,” Neuropsychopharmacology, vol. 38, no. 4, p. 639, 2013.
- [39] L. C. Quilty, M. Sellbom, J. L. Tackett, and R. M. Bagby, “Personality trait predictors of bipolar disorder symptoms,” Psychiatry Research, vol. 169, no. 2, pp. 159–163, 2009.
- [40] R. P. Tett, D. N. Jackson, and M. Rothstein, “Personality measures as predictors of job performance: a meta-analytic review,” Personnel psychology, vol. 44, no. 4, pp. 703–742, 1991.
- [41] G. Park, H. A. Schwartz, J. C. Eichstaedt, M. L. Kern, M. Kosinski, D. J. Stillwell, L. H. Ungar, and M. E. Seligman, “Automatic personality assessment through social media language.,” Journal of personality and social psychology, vol. 108, no. 6, p. 934, 2015.
- [42] N. Cesare, C. Grant, and E. O. Nsoesie, “Detection of user demographics on social media: A review of methods and recommendations for best practices,” arXiv preprint arXiv:1702.01807, 2017.
- [43] J. Kleinberg, S. Mullainathan, and M. Raghavan, “Inherent trade-offs in the fair determination of risk scores,” arXiv preprint arXiv:1609.05807, 2016.
- [44] O. P. John and S. Srivastava, “The big five trait taxonomy: History, measurement, and theoretical perspectives,” Handbook of personality: Theory and research, vol. 2, no. 1999, pp. 102–138, 1999.
- [45] J. M. Kleinberg, “An impossibility theorem for clustering,” in Advances in neural information processing systems, pp. 463–470, 2003.
- [46] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: a review,” ACM computing surveys (CSUR), vol. 31, no. 3, pp. 264–323, 1999.
- [47] R. Shamir and R. Sharan, “Algorithmic approaches to clustering gene expression data,” Current Topics in Computational Molecular Biology, p. 269, 2002.
- [48] S. Dixon, E. Pampalk, and G. Widmer, “Classification of dance music by periodicity patterns,” 2003.
- [49] N. Meinshausen and B. Yu, “Lasso-type recovery of sparse representations for high-dimensional data,” The Annals of Statistics, pp. 246–270, 2009.
- [50] R. R. Lau, L. Sigelman, and I. B. Rovner, “The effects of negative political campaigns: a meta-analytic reassessment,” Journal of Politics, vol. 69, no. 4, pp. 1176–1209, 2007.
- [51] L. Huddy, “Group identity and political cohesion,” Emerging Trends in the Social and Behavioral Sciences: An Interdisciplinary, Searchable, and Linkable Resource, 2003.
- [52] N. R. Branscombe and D. L. Wann, “Collective self-esteem consequences of outgroup derogation when a valued social identity is on trial,” European Journal of Social Psychology, vol. 24, no. 6, pp. 641–657, 1994.
- [53] M. C. Schneider and A. L. Bos, “Measuring stereotypes of female politicians,” Political Psychology, vol. 35, no. 2, pp. 245–266, 2014.
- [54] K. Dolan, “The impact of gender stereotyped evaluations on support for women candidates,” Political Behavior, vol. 32, no. 1, pp. 69–88, 2010.
- [55] A. Vehtari, A. Gelman, and J. Gabry, “Efficient implementation of leave-one-out cross-validation and waic for evaluating fitted bayesian models,” arXiv preprint arXiv:1507.04544, 2015.
- [56] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
- [57] D. Preoţiuc-Pietro, Y. Liu, D. Hopkins, and L. Ungar, “Beyond binary labels: political ideology prediction of twitter users,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 729–740, 2017.
- [58] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, pp. 3111–3119, 2013.
- [59] S. Sniekers, S. Stringer, K. Watanabe, P. R. Jansen, J. R. Coleman, E. Krapohl, E. Taskesen, A. R. Hammerschlag, A. Okbay, D. Zabaneh, et al., “Genome-wide association meta-analysis of 78,308 individuals identifies new loci and genes influencing human intelligence,” Nature genetics, vol. 49, no. 7, p. 1107, 2017.
- [60] B. W. Gottlieb, J. Gottlieb, D. Berkell, and L. Levy, “Sociometric status and solitary play of ld boys and girls,” Journal of Learning Disabilities, vol. 19, no. 10, pp. 619–622, 1986.
- [61] T. Bryan, R. Wheeler, J. Felcan, and T. Henek, ““come on, dummy” an observational study of children’s communications,” Journal of Learning Disabilities, vol. 9, no. 10, pp. 661–669, 1976.
- [62] S. H. McConaughy and D. R. Ritter, “Social competence and behavioral problems of learning disabled boys aged 6-11,” Journal of Learning Disabilities, vol. 19, no. 1, pp. 39–45, 1986.
- [63] C. J. Bellanti and K. L. Bierman, “Disentangling the impact of low cognitive ability and inattention on social behavior and peer relationships,” Journal of Clinical Child Psychology, vol. 29, no. 1, pp. 66–75, 2000.
- [64] J. A. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural processing letters, vol. 9, no. 3, pp. 293–300, 1999.
- [65] G. H. Golub and C. Reinsch, “Singular value decomposition and least squares solutions,” Numerische mathematik, vol. 14, no. 5, pp. 403–420, 1970.
- [66] M. Iyyer, P. Enns, J. Boyd-Graber, and P. Resnik, “Political ideology detection using recursive neural networks,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 1113–1122, 2014.
- [67] B. Felbo, A. Mislove, A. Søgaard, I. Rahwan, and S. Lehmann, “Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm,” arXiv preprint arXiv:1708.00524, 2017.
- [68] Wired, “The decline and fall of an ultra rich online gaming empire,” 2008.
- [69] C. News, “Trump campaign phased out use of cambridge analytica data before election,” 2018.
- [70] Pew, “Religious landscape study.,” 2014.
- [71] J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K.-W. Chang, “Men also like shopping: Reducing gender bias amplification using corpus-level constraints,” arXiv preprint arXiv:1707.09457, 2017.
- [72] W. Y. Zou, R. Socher, D. Cer, and C. D. Manning, “Bilingual word embeddings for phrase-based machine translation,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1393–1398, 2013.
- [73] S. Clinchant and F. Perronnin, “Aggregating continuous word embeddings for information retrieval,” in Proceedings of the workshop on continuous vector space models and their compositionality, pp. 100–109, 2013.
- [74] J. Luo, S. E. Sorour, K. Goda, and T. Mine, “Predicting student grade based on free-style comments using word2vec and ann by considering prediction results obtained in consecutive lessons.,” International Educational Data Mining Society, 2015.
- [75] T. Bolukbasi, K.-W. Chang, J. Y. Zou, V. Saligrama, and A. T. Kalai, “Man is to computer programmer as woman is to homemaker? debiasing word embeddings,” in Advances in Neural Information Processing Systems, pp. 4349–4357, 2016.
- [76] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel, “Fairness through awareness,” in Proceedings of the 3rd innovations in theoretical computer science conference, pp. 214–226, ACM, 2012.
- [77] M. Joseph, M. Kearns, J. Morgenstern, S. Neel, and A. Roth, “Rawlsian fairness for machine learning,” arXiv preprint arXiv:1610.09559, 2016.
- [78] M. J. Kusner, J. Loftus, C. Russell, and R. Silva, “Counterfactual fairness,” in Advances in Neural Information Processing Systems, pp. 4069–4079, 2017.
- [79] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi, “Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment,” in Proceedings of the 26th International Conference on World Wide Web, pp. 1171–1180, International World Wide Web Conferences Steering Committee, 2017.
- [80] M. Hardt, E. Price, N. Srebro, et al., “Equality of opportunity in supervised learning,” in Advances in neural information processing systems, pp. 3315–3323, 2016.
- [81] N. Grgic-Hlaca, M. B. Zafar, K. P. Gummadi, and A. Weller, “The case for process fairness in learning: Feature selection for fair decision making,” in NIPS Symposium on Machine Learning and the Law, vol. 1, p. 2, 2016.
- [82] V. Saroglou, “Religiousness as a cultural adaptation of basic traits: A five-factor model perspective,” Personality and social psychology review, vol. 14, no. 1, pp. 108–125, 2010.
Label | IPA | anarchist | centrist | conserv. | dem. | doesn’t care | hates pol. | indep. | lib. | liber. | repub. | v. lib. |
---|---|---|---|---|---|---|---|---|---|---|---|---|
IPA | | fuck | wishes | wishes | smh | yay | rain | congrats | wishes | money | church | damn |
anarchist | excited | | wishes | driving | excited | lol | dont | driving | excited | ready | ready | excited |
centrist | xd | fuck | | lord | today | tattoo | shit | surgery | shit | government | school | damn |
conservative | xd | fuck | damn | | fb | anymore | shit | damn | damn | art | school | damn |
democrat | xd | fuck | wishes | tonight | | stupid | fuck | died | wishes | government | church | wishes |
doesn’t care | packers | fuck | wishes | lord | smh | | shit | definitely | wishes | government | church | damn |
hates politics | class | music | dey | loves | fb | tht | | movie | wishes | email | camp | damn |
independent | xd | fuck | wishes | lord | valentine | sitting | fuck | | wishes | beer | parents | damn |
liberal | xd | fuck | final | lord | im | xd | im | gonna | | government | church | damn |
libertarian | xd | fuck | headache | lord | walk | xd | dont | till | packing | | girls | vacation |
republican | xd | fuck | wishes | wishes | smh | mum | fuck | minute | wishes | fucking | | damn |
very liberal | xd | xd | boy | lord | im | xd | xd | school | missing | im | im | |
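True label \ Predicted label | IPA | anar. | centrist | conserv. | dem. | doesn’t care | hates pol. | indep. | lib. | liber. | repub. | v. lib. | Total |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
IPA | 0 | 2 | 3 | 3 | 11 | 18 | 2 | 1 | 3 | 1 | 16 | 1 | 61 |
anarchist | 0 | 24 | 4 | 3 | 5 | 21 | 1 | 3 | 15 | 5 | 4 | 3 | 88 |
centrist | 2 | 9 | 74 | 40 | 52 | 66 | 3 | 6 | 95 | 7 | 43 | 4 | 401 |
conservative | 2 | 5 | 29 | 113 | 26 | 31 | 0 | 7 | 53 | 5 | 62 | 0 | 333 |
democrat | 5 | 17 | 53 | 36 | 321 | 101 | 4 | 18 | 80 | 9 | 89 | 3 | 736 |
doesn’t care | 3 | 39 | 51 | 29 | 122 | 373 | 12 | 12 | 105 | 12 | 102 | 9 | 869 |
hates politics | 0 | 4 | 6 | 1 | 6 | 30 | 5 | 3 | 6 | 0 | 2 | 0 | 63 |
independent | 0 | 8 | 16 | 13 | 35 | 22 | 1 | 8 | 29 | 4 | 25 | 1 | 162 |
liberal | 1 | 18 | 51 | 27 | 74 | 51 | 6 | 6 | 223 | 15 | 24 | 13 | 509 |
libertarian | 0 | 12 | 17 | 9 | 17 | 28 | 0 | 6 | 32 | 11 | 12 | 4 | 148 |
republican | 1 | 8 | 19 | 57 | 67 | 64 | 1 | 8 | 29 | 3 | 179 | 3 | 439 |
very liberal | 0 | 4 | 25 | 2 | 11 | 22 | 2 | 2 | 67 | 1 | 6 | 3 | 145 |
Total | 14 | 150 | 348 | 333 | 747 | 827 | 37 | 80 | 737 | 73 | 564 | 44 | 3954 |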
Label | atheist | agnostic | catholic | christian | none |
---|---|---|---|---|---|
atheist | | boyfriend | thank | church | lol |
agnostic | fucking | | prayers | church | lol |
catholic | fucking | fucking | | lol | lol |
christian | fucking | fucking | mass | | xmas |
none | fucking | apartment | god | church | |
The most highly weighted word from each pairwise classifier. Word implies top label.
True label \ Predicted label | Atheist | Agnostic | Catholic | Christian | None | Total |
---|---|---|---|---|---|---|
Atheist | 68 | 29 | 17 | 16 | 21 | 151 |
Agnostic | 54 | 69 | 27 | 55 | 11 | 216 |
Catholic | 27 | 37 | 172 | 130 | 9 | 375 |
Christian | 35 | 48 | 126 | 560 | 26 | 795 |
None | 22 | 11 | 19 | 50 | 39 | 141 |
Total | 206 | 194 | 361 | 811 | 106 | 1678 |
In the remaining tables the top 55 words are listed in order for each trait.
Openness | Conscientious | Extroversion | |||
---|---|---|---|---|---|
- | + | - | + | - | + |
bored | art | lost | gym | internet | party |
boring | poetry | fucking | ready | quiet | guys |
husband | beautiful | xd | weekend | bored | amazing |
attitude | universe | phone | excited | listening | audition |
shopping | peace | im | success | apparently | baby |
dinner | poem | bored | finished | computer | haha |
tv | writing | fuck | studying | stupid | dance |
game | books | gonna | busy | pc | girls |
proud | theatre | sick | vacation | hmm | fabulous |
ur | dream | procrastination | arm | anime | blast |
dentist | mind | internet | officially | tt | ready |
daughter | book | computer | family | dark | im |
dont | woman | probably | relax | probably | wine |
haha | guitar | cousins | tennis | sims | success |
stupid | damn | hates | wonderful | didn | lets |
ni | awesome | sims | special | watching | excited |
ipod | tea | anybody | win | slow | super |
bed | apartment | charger | glad | depressing | text |
justin | insomnia | sister | piano | calculus | chill |
gift | xd | playing | scholarship | kind | phone |
2nd | adventure | grounded | received | anymore | dear |
hurt | cali | poker | lmao | repost | parties |
ohh | far | tt | degrees | maybe | support |
baseball | philosophy | status | state | draw | loves |
mum | sigh | momma | tons | yay | pics |
pray | nature | ftw | motor | trying | hey |
school | maybe | press | obstacles | books | big |
repost | music | dead | research | shadow | hit |
booked | blues | failed | extremely | bother | met |
lord | chill | forgot | circumstances | damned | pirate |
ops | fam | depression | workout | suppose | ben |
nice | epic | lazy | paid | reading | rocked |
tmr | places | youtube | 100 | cat | gang |
dam | rights | 420 | hit | poor | sex |
idol | dragons | school | surgery | depression | sing |
snowing | woot | http | law | sigh | btw |
pissed | vampire | awsome | university | games | gorgeous |
shut | soul | pokemon | anatomy | drawing | musical |
maths | eclipse | woke | blessings | odd | cali |
msn | drawing | dammit | hmmmm | 10th | girlfriend |
aldean | strange | hair | husband | pokemon | stoked |
vodka | planet | wished | counting | nice | folks |
comes | yay | cleaning | calc | essay | ponder |
eid | dreams | fine | louis | pointless | wanna |
alot | blood | dunno | delhi | managed | hahahaha |
waste | sushi | enemy | final | looks | pool |
worst | smoking | social | drive | grr | tanning |
kiero | contact | yo | lets | darkness | hello |
soo | lines | procrastinator | iphone | saw | pumped |
mas | deep | black | lunch | crying | chillin |
staff | genius | magic | yankees | lonely | theatre |
12 | novel | wasn | running | laptop | kiss |
piss | smh | fans | weather | shouldn | office |
transformers | worried | kinda | zone | paranoid | cock |
car | folks | trying | smart | walking | lauren |
Agreeable | Neurotic | Satisfaction With Life | |||
---|---|---|---|---|---|
- | + | - | + | - | + |
fucking | wonderful | loving | sick | bored | family |
stupid | amazing | girlfriend | nervous | fuck | loving |
kill | awesome | wife | stressed | fucking | hope |
shopping | haha | awesome | depression | hates | thankful |
shit | smile | parties | depressed | bday | india |
burn | happiness | party | anymore | apparently | wonderful |
bitch | phone | weekend | lonely | damn | busy |
pissed | urself | haha | stress | internet | friend |
punch | family | doing | fucking | zero | heart |
hates | blessed | game | tired | chem | man |
death | status | sunday | trying | wat | yum |
hell | music | kansas | depressing | supposed | fb |
suck | woop | guy | sims | ma | glad |
freak | hands | delicious | anxiety | hating | beautiful |
piss | heart | beach | worst | spend | lauren |
dead | spirit | definitely | hair | la | lord |
xmas | smiles | swag | fed | dumb | wine |
karma | guy | started | scream | young | swim |
fight | moment | ready | fine | british | energy |
blood | beautiful | hunting | nightmare | killed | lunch |
awful | movie | power | rip | hmm | locked |
deal | theres | funniest | tears | france | woot |
misery | car | melody | horrible | chances | sons |
fuck | dancing | hawaii | flu | simply | special |
enemies | lord | action | worse | exams | trust |
fake | guitar | hit | issues | mum | wish |
pathetic | sore | chillin | scared | main | weeks |
irony | sara | workout | stressful | hate | day |
dumb | help | flow | fml | edge | father |
cunt | walk | portland | care | dnt | tried |
care | excited | seat | shes | party | journey |
devil | prayers | smart | stressing | kept | hospital |
black | knowing | snowboarding | ugh | dat | |
ich | valentines | knowing | sad | didn | business |
russian | borrow | sore | gary | months | santa |
idiots | laura | greatest | hates | du | walked |
cunts | notifications | success | die | rain | lights |
wtf | beard | basketball | actually | pass | kingdom |
crap | reli | update | scary | bus | work |
truck | snowboarding | gf | boyfriend | okay | lol |
deleted | sorry | women | pills | australia | mommy |
anger | chillin | gotta | crying | shooting | turkey |
die | hill | followed | kitty | england | nap |
tu | whats | jumping | awful | africa | revenge |
nightmare | hearts | fool | hurt | rachel | truly |
annoyed | kindness | dancing | bored | fml | son |
rip | study | greatness | fair | metal | final |
bloody | worry | blast | screaming | uk | reached |
drama | clients | woke | dreading | school | survived |
bitches | smells | ass | friggin | wtf | dont |
stupidity | troops | hitting | suicide | matt | 0 |
hair | sing | cock | miserable | freakin | god |
wifi | goood | wise | quiet | 15 | kitchen |
fat | holy | kiss | xd | 200 | normal |
rage | faster | toes | sadness | free | blessing |
Militaristic | Violent-Occult | Intellectual Recreation | |||
---|---|---|---|---|---|
- | + | - | + | - | + |
sleeping | man | lord | hell | im | life |
ugh | xbox | pray | zombie | course | jon |
sad | gets | cousins | damn | boring | beautiful |
excited | gotta | church | fuck | painful | dancing |
lovely | good | michael | bitch | decision | yoga |
oh | training | allah | ass | hurts | thankful |
hair | headed | jesus | drink | bus | peace |
shopping | truck | game | blood | game | kinda |
husband | guitar | 0 | lmao | stupid | truly |
sick | guys | summer | xd | bak | la |
cares | bro | gosh | woot | hero | ich |
mum | gun | praise | halloween | problem | miss |
boyfriend | boom | sunday | play | yeah | likes |
lady | epic | dad | guys | christ | comfort |
concert | work | loving | drunk | gona | lol |
today | weight | mum | thanx | id | wtf |
gaga | gym | team | animal | sittin | insomnia |
okay | bike | hospital | sanity | die | chicken |
pic | dang | 10 | fucking | horse | children |
adorable | game | tv | dragons | yell | tired |
sunday | blast | christ | burn | chuck | lovely |
ordered | lol | heal | vampires | 2day | ap |
birth | war | usa | blah | tommorrow | funny |
lots | black | personal | man | ow | things |
poor | fish | best | loved | bored | man |
ben | military | ray | pissed | fukin | simple |
fine | woot | nervous | lil | inbox | thank |
settings | 12 | thing | bday | race | period |
birthday | till | look | send | basketball | countdown |
cousins | ppl | week | body | word | baby |
shoes | brave | 2morrow | metal | rhys | beach |
art | 17 | quite | head | tell | hey |
omg | fight | poor | piss | step | depression |
stop | success | brazil | blast | wats | jobs |
wear | marines | cup | theyre | coke | cure |
prince | hrs | zumba | cause | football | manage |
round | sword | account | gun | penguins | sugar |
come | make | website | death | won | aware |
neighbours | ko | tryna | vampire | facebookers | singing |
basement | friend | study | bleh | letters | egg |
music | hit | haha | tattoo | awsome | taste |
speak | play | soccer | ppl | dont | rains |
thoughts | pics | feeling | dead | blah | log |
story | hahaha | christmas | woman | till | taught |
weird | troops | round | purple | playing | coolest |
awful | army | youth | peaceful | dead | yellow |
quite | running | story | message | fact | cheers |
rachel | mag | bible | shit | learned | small |
hear | strong | woah | angel | visit | society |
alice | knw | grace | kinda | address | fly |
tea | beer | prayers | tongue | 14 | social |
promised | hehehe | plan | sushi | chilling | boo |
jesus | comwatch | feat | wolf | win | beauty |
actually | xoxo | anybody | poke | pokemon | world |
counting | run | stressed | kick | sees | sunshine |

Occult Credulousness | | Wholesome Activities | | Belief in Star Sign | |
---|---|---|---|---|---|
- | + | - | + | No | Yes |
church | zombie | coke | woot | minutes | omg |
praise | ass | michigan | camping | didn | im |
jesus | bitch | stupid | fish | church | ready |
lord | halloween | pathetic | life | praise | friend |
bible | animal | ops | yesterday | jesus | mind |
christ | sign | husband | beautiful | probably | ass |
team | omg | didn | rain | physics | butt |
quite | xd | hurts | man | jess | stay |
loving | job | kurwa | mexico | white | tom |
pray | woot | evil | wish | religion | tomarrow |
paper | wish | afternoon | river | iv | october |
game | cure | problem | love | officially | promise |
blessed | street | taylor | path | imagine | lol |
salvation | vampire | idea | moon | christ | searching |
ops | guys | jess | haha | germany | bitch |
summer | send | glee | snow | giants | bleh |
michael | lol | mum | bike | saw | eye |
spent | thanx | mental | hahaha | wants | cute |
youth | luck | meg | ghost | north | family |
cousins | wtf | mad | baking | decided | halloween |
word | nature | 360 | grandma | discovered | hanging |
god | cancer | pissed | live | 11th | haunted |
homework | woohoo | club | goin | ouch | japanese |
alarm | miss | uni | sky | skin | mother |
0 | barely | lyrics | cat | doesn | dinner |
haha | moment | head | animal | bacon | card |
player | bar | recently | netflix | train | help |
sunday | safe | internet | birds | hahaha | bored |
college | proud | min | smile | lasts | luv |
wedding | woman | lesson | happiness | america | luck |
prayer | mom | bus | mom | haven | neighbors |
glory | away | rly | yum | burning | yum |
forgiveness | dare | debate | fishing | pray | fireworks |
ann | inches | kevin | truly | thursday | lmao |
mm | boyfriend | inbox | fell | jessica | tt |
political | il | jeez | make | prince | tired |
fact | nd | official | clean | knew | person |
greatest | pls | nite | portland | umm | nd |
confused | aware | ms | smells | quiero | watch |
appreciated | xmas | lack | lake | deserves | ya |
algebra | hell | saw | create | heres | prom |
brazil | solstice | troy | making | finds | crazy |
travel | date | sims | 2010 | kim | upload |
daughter | vampires | school | josh | heard | elf |
bacon | copy | thinks | children | punch | hehe |
laura | purple | thanking | laughing | groups | crack |
personal | haunted | die | sa | car | bell |
week | theyre | hates | law | amazing | human |
greater | lmao | stuff | jobs | sick | finish |
statement | later | band | earth | tape | lnk |
messed | interview | thieves | gets | drink | june |
tv | peeps | feels | hehehe | morn | change |
em | peaceful | elm | swimming | dallas | costume |
poor | drunk | germany | wa | cops | shit |
trust | dunno | sat | monkeys | waters | decorating |

Self-Disclosure | | Fair-Mindedness | | IQ | |
---|---|---|---|---|---|
- | + | - | + | - | + |
bored | family | bored | excited | nite | exam |
fuck | loving | wat | business | ur | hours |
fucking | hope | soon | says | lmao | sigh |
hates | thankful | dad | apartment | alot | camping |
bday | india | xd | great | family | finish |
apparently | wonderful | stage | delicious | omg | paper |
damn | busy | pass | sure | 2011 | wtf |
internet | friend | moon | needed | city | il |
zero | heart | haha | seattle | lol | finds |
chem | man | kitty | uni | help | important |
wat | yum | tired | airport | wew | read |
supposed | fb | mum | thankful | boy | physics |
ma | glad | farmville | dallas | heart | |
hating | beautiful | face | learn | com | ra |
spend | lauren | drank | weekend | angie | xd |
la | lord | fuk | definitely | www | wifi |
dumb | wine | fuck | dinner | ha | text |
young | swim | ma | card | 333 | weeks |
british | energy | sun | amazing | tom | studying |
killed | lunch | crap | tonight | goodnight | training |
hmm | locked | bday | exciting | history | course |
france | woot | shit | degrees | xxx | student |
chances | sons | hopefully | classes | xdd | magic |
simply | special | feel | support | friend | kinda |
exams | trust | fails | priceless | morning | everytime |
mum | wish | va | oh | mum | raining |
main | weeks | big | certainly | christmas | yea |
hate | day | nd | government | eid | maths |
edge | father | smoke | ticket | kay | semester |
dnt | tried | yay | food | gives | maybe |
party | journey | watchin | january | din | exciting |
kept | hospital | sick | couple | beautiful | point |
dat | wedding | php | folks | kno | |
didn | business | regret | journey | luv | excited |
months | santa | seconds | universe | 0 | imma |
du | walked | im | 21 | hacked | months |
rain | lights | ignore | grateful | secrets | flying |
pass | kingdom | tt | pay | iam | final |
bus | work | lose | size | forgiveness | nah |
okay | lol | marriage | class | strong | library |
australia | mommy | lolz | situation | busy | used |
shooting | turkey | fukin | duke | jo | chem |
england | nap | picture | honesty | hate | brain |
africa | revenge | blessing | austin | ti | everybody |
rachel | truly | slow | tires | nightmare | awesome |
fml | son | anxiety | 29 | ayaw | groups |
metal | final | cy3 | sisters | prayer | progress |
uk | reached | library | mother | fought | champion |
school | survived | tmr | heading | ow | calculus |
wtf | dont | fucking | bc | sana | behave |
matt | 0 | epic | piece | tired | den |
freakin | god | il | summer | afraid | badly |
15 | kitchen | marie | breakfast | para | times |
200 | normal | bunch | answer | sum | mobil |
free | blessing | loaded | surgery | movie | fun |

Agnostic vs Atheist | | Agnostic vs Atheist (Fair) | | Religious vs Not | | Conservative vs Liberal | |
---|---|---|---|---|---|---|---|
extra | physics | miles | fucking | church | fucking | church | damn |
miles | fucking | working | physics | pray | fuck | truck | happy |
turn | snowing | extra | wat | prayers | xmas | government | fb |
hair | shit | awhile | fuck | god | damn | america | smh |
packing | wat | packing | bloody | easter | shit | pray | marriage |
awhile | write | turn | shit | lord | bloody | haha | xmas |
insane | bloody | super | write | blessed | hell | prayers | chicago |
working | enter | hubby | maths | christmas | ass | deer | sex |
hubby | fuck | chill | xx | ugh | india | christmas | hell |
points | sigh | free | snowing | praying | zombie | country | fam |
friggin | thinks | sleepy | enter | hw | fuckin | tonight | lovely |
santa | talk | santa | thinks | ppl | halloween | 17 | halloween |
heck | weeks | heck | talk | prayer | car | lord | health |
wishes | town | ready | science | game | yay | awesome | saw |
child | science | friggin | sigh | believe | social | god | yoga |
free | maths | vacation | hai | family | xx | military | celebrate |
boyfriend | degrees | work | cancer | ready | quite | texas | gay |
lady | lolz | thursday | person | fb | religion | freedom | apartment |
learn | record | late | coursework | bless | drink | savior | wtf |
super | xmas | points | town | im | oh | dad | thoughts |
houston | tom | pack | xd | calling | using | bible | shit |
service | hai | houston | weeks | dang | shitty | jesus | glee |
pack | person | insane | tom | paper | internet | supper | gaga |
late | dat | ya | film | jesus | fucked | girls | da |
wanting | tyler | relax | dat | school | damned | huge | palin |
hasn | cod | join | kill | camp | omfg | praying | 2010 |
mai | afraid | busy | lolz | gosh | meh | camp | help |
sleepy | untill | learn | msn | heart | indian | soldiers | mexico |
worked | present | child | english | success | post | byu | mother |
fly | wifey | headed | xmas | mary | head | christ | indian |
chill | movie | favorite | chemistry | strength | cricket | disney | lady |
join | xx | beautiful | afraid | butt | any1 | risen | studies |
kyle | cancer | season | na | fishing | dragon | beach | social |
dun | boring | san | pierced | brother | lovely | tournament | art |
thursday | rape | fly | dick | military | body | troops | holiday |
taken | month | worked | anatomy | sad | new | schools | shitty |
childhood | kill | service | bbc | uncle | boyfriend | leave | ve |
mother | welcome | spring | tell | senior | teeth | ill | free |
thank | clinton | wanting | untill | fair | nice | blonde | earthquake |
headed | nicht | halloween | memory | mom | fml | armed | street |
ya | ay | lady | bothered | tan | warped | xbox | phone |
london | brother | thank | horse | watching | woke | reagan | lakers |
beautiful | tell | childhood | record | em | bleh | utah | ur |
jail | hadn | mai | cod | president | wednesday | served | fine |
hates | pierced | hair | ki | smh | gods | tide | relationship |
paperwork | wild | paperwork | nicht | love | afford | gators | asshole |
wanna | use | 4th | sheep | haha | japanese | pelosi | worried |
clear | perfect | hopefully | chem | future | tongue | husband | purple |
san | return | missed | brother | best | robert | stinks | putting |
til | needed | peace | fancy | emails | sophie | trial | omg |
halloween | paid | hasn | degrees | goin | holy | picked | nature |
bring | half | trip | disease | football | eye | beep | prop |
kindle | horse | mother | realised | latest | tattoo | gun | black |
vida | disease | sunshine | room | thank | decent | trailer | live |
powers | chuck | kyle | religion | matthew | odd | ready | eid |

White vs. Black | | White vs. Asian | | Black vs. Asian | |
---|---|---|---|---|---|
tonight | smh | tonight | asian | smh | korea |
dad | fb | blonde | tt | fb | sa |
stupid | lord | town | tmr | lord | na |
exited | fam | fuckin | korea | wit | asian |
thinks | nigeria | ass | chinese | aint | gay |
ends | yall | college | ng | da | chinese |
journey | black | gas | na | yall | internet |
meet | fathers | dope | korean | lol | korean |
hahahahahaha | mj | worse | china | say | monday |
fun | yuh | night | ang | fam | xd |
awesome | gon | men | aq | jackson | tmr |
ability | birthday | sons | asians | cos | shooting |
night | mad | adult | chen | michael | philippines |
mas | lol | pretty | guys | finals | 3d |
wouldnt | finish | theres | thailand | ass | babe |
chargers | dey | idea | taiwan | yuh | heaven |
bein | asap | hope | karaoke | black | important |
aftr | tryna | ability | sa | ny | tan |
pretty | jackson | melissa | chan | sooooo | thailand |
eh | came | state | dream | mad | yummy |
tom | degrassi | unique | company | mind | completely |
exhausted | wat | weekend | craving | season | woot |
tough | iz | screaming | zzz | wat | smell |
great | hw | mamaya | holiday | birthday | bought |
running | pple | tune | wanna | degrassi | fly |
exciting | jus | figure | ms | hell | tt |
yankees | braids | inside | nguyen | chelsea | worry |
politics | haters | exited | singapore | woman | ruin |
mirror | females | wine | yang | figure | passed |
pepsi | misfits | 5th | hu | african | skating |
roll | god | superman | fat | nigeria | english |
animal | man | emotionally | ftw | episode | belong |
grr | omg | sell | gg | iz | shot |
gay | african | sitting | rice | smart | mas |
tattoo | desires | february | tttt | saying | grandpa |
2nite | chelsea | easter | damnit | asap | lazy |
spend | female | months | 555 | attention | sacrifice |
monday | cousin | saying | wong | knowing | grr |
sorrow | holla | expecting | achieve | ki | broken |
ed | smart | rollin | pa | meeting | yang |
healthy | laker | wheres | mode | hw | beer |
enjoyable | favour | eminem | lmao | sings | chatting |
actually | dis | apparently | pride | india | meet |
charity | money | does | bbq | gas | shoulder |
delete | happy | status | super | self | ang |
iron | mii | legit | 1st | ready | funn |
blonde | aye | 30 | long | college | shoes |
comforted | hard | wen | skating | mj | wood |
standards | wuz | eric | mean | search | dad |
shot | ready | yelled | heart | years | apart |
chose | nigga | mis | dx | misfits | aj |
chatting | jamaica | breaking | faith | blessed | line |
damage | bus | homework | expectation | advice | jack |
innocent | actually | research | boys | totally | |
thnx | cos | wishes | hard | fathers | tomorrow |