Towards Improved Model Design for Authorship Identification:
A Survey on Writing Style Understanding

Weicheng Ma Ruibo Liu Lili Wang Soroush Vosoughi soroush.vosoughi@dartmouth.edu

Abstract

Authorship identification tasks, which rely heavily on linguistic styles, have always been an important part of Natural Language Understanding (NLU) research. While other tasks based on linguistic style understanding benefit from deep learning methods, these methods have not performed as well as traditional machine learning methods in many authorship-based tasks. With these tasks becoming more and more challenging, however, traditional machine learning methods based on handcrafted feature sets are approaching their performance limits. Thus, in order to inspire future applications of deep learning methods in authorship-based tasks in ways that benefit the extraction of stylistic features, we survey authorship-based tasks and other tasks related to writing style understanding. We first describe our survey results on the current state of research in both sets of tasks and summarize existing achievements and problems in authorship-related tasks. We then describe outstanding methods in style-related tasks in general and analyze how they are used in combination in the top-performing models. We are optimistic about the applicability of these models to authorship-based tasks and hope our survey will help advance research in this field.

1 Introduction

Writing style understanding is a key topic in Natural Language Processing (NLP) research. While the content varies strongly across documents written by the same author, stylistic features, as defined by Herrmann et al. (2015), remain constant and are the cause of a wide range of phenomena, e.g. text genre, metaphor, and irony. An important application of stylistic features is the identification of authorship. Because of the importance of stylistic features in NLP tasks related to authorship and other stylistic phenomena Kestemont et al. (2016), useful model architectures and feature sets are likely to be shared. This motivates us to survey recent research achievements in style-related tasks to find possible ways of advancing the research on authorship-based tasks (our survey covers NLU tasks only).

Our paper consists of three main parts. We first introduce the NLU tasks included in our survey. While we cannot cover all possible tasks depending on writing style understanding, we take into account the five most-studied authorship-based tasks and nine other prevalent tasks related to writing style understanding. Second, we provide a summary of linguistic features and neural network architectures that appear frequently in related papers. While existing reviews have examined various solutions to these tasks, our survey emphasizes newly-emerged neural network architectures. For example, Stamatatos (2009) lists 21 linguistic features in five categories which are prevalent in authorship identification tasks and Zhang et al. (2018a) introduce three neural network architectures widely used in sentiment-related tasks. We not only include these methods, but we also provide detailed analysis of techniques such as Capsule Networks (CapsNets) and Transformer networks. In the last part, we list the models that achieve outstanding performance in authorship-based tasks and analyze their feature engineering and model designs. Finally, to inspire the development of effective model design for authorship-based tasks, we analyze top-performing models in other style-based tasks.

The main contributions of this paper are:

• By expanding the scope of our review to a wider range of NLU tasks related to writing styles, we bridge the authorship-based tasks and other style-related tasks and enable knowledge transfer of feature engineering and model designs to authorship-based tasks.

• By including the most up-to-date neural network architectures in our survey, we demonstrate the strong possibility of applying these architectures to authorship-based tasks.

• By examining the detailed usage of models and/or linguistic features for extracting stylistic features, we provide guidance for future model design in authorship-based tasks.

2 Tasks and Datasets

Our survey covers a wide range of style-related tasks besides authorship-related tasks. We provide definitions of these tasks and regularly-used standard evaluation datasets in this section. To be consistent, in all the task definitions we use $d=\{w_{1},w_{2},...,w_{n}\}$ to represent a document with $n$ words and $L$ to denote task-specific label sets.

2.1 Authorship-related Tasks

Authorship Attribution (AA)

Given a document $d$ and $k$ groups of documents $D_{i}=\{d_{1}^{(i)},d_{2}^{(i)},...,d_{j}^{(i)}\}$ each authored by one of $k$ authors, the goal of AA Juola (2006) is to decide which $D_{i}$ ($0<i\leq k$), if any, $d$ belongs to.

AA benchmark datasets include the CCAT10 and CCAT50 Stamatatos (2008), IMDB62 Seroussi et al. (2010), Blogs10 and Blogs50 Schler et al. (2006), and Novel9 Feng and Hirst (2013) datasets. Recently Preoţiuc-Pietro and Devlin Marier (2019) also released a Twitter-based AA dataset. PAN shared tasks (https://pan.webis.de/shared-tasks.html) provide additional datasets for AA. The most up-to-date in-domain AA dataset was released in the PAN-12 challenges and the most up-to-date cross-domain AA dataset in the PAN-19 shared tasks.

Authorship Verification (AV)

With two documents $d_{1}$ and $d_{2}$ , the AV task aims at predicting whether they are written by the same author. The Webis Authorship Verification dataset Bevendorff et al. (2019) is benchmarked for AV. PAN shared tasks provide other AV datasets in PAN-13, PAN-14, and PAN-15 challenges as well.

Authorship Profiling (AP)

Given a document $d$ , the AP task classifies $d$ into author groups. There can be multiple different label sets for AP, e.g. $L=$ {male, female} in PAN-18 AP task and $L=$ {bot, human_male, human_female} in PAN-19 AP challenge. Celebrity Profiling (CP) is similar to AP, with the documents coming from celebrity Twitter accounts.

PAN has hosted AP challenges since 2013 and CP challenges since 2019. Evaluation datasets are available from the PAN shared tasks.

Style Change Detection (SCD)

In a document with $k$ paragraphs $d=\{p_{1},p_{2},...,p_{k}\}$ , the goal of SCD is to detect, for each pair of adjacent paragraphs $p_{i}$ and $p_{i+1}$ , whether they are written by the same author. SCD datasets have been available in PAN challenges since 2017.

2.2 Other Style-Related Tasks

Sentiment Analysis (SA)

Given a document $d$ , the aim of SA Pang et al. (2002) is to predict the sentiment class or sentiment polarity associated with $d$ . SA labels can be either from a discrete label set $L=$ {Negative[, Neutral], Positive} or a sentiment polarity score in $L=[-1,1]$ .

Stanford Sentiment Treebank Socher et al. (2013) is a standard evaluation bed for SA. The RT dataset Pang and Lee (2005) is constructed from Internet movie reviews and it provides two types of sentiment labels. Yelp (https://www.yelp.com/dataset) also provides a large-scale SA dataset based on business reviews.

Aspect-Based Sentiment Analysis (ABSA)

Different from SA, ABSA Pang and Lee (2008) evaluates the sentiment of $d$ on each aspect $a$ in $d$ . Each aspect $a=\{w^{a}_{1},w^{a}_{2},...,w^{a}_{k}\}$ is either a substring in $d$ or an aspect type (e.g. restaurant). There can be multiple aspects in $d$ and the sentiment polarity on each aspect does not have to be towards the same direction Saeidi et al. (2016). The International Workshop on Semantic Evaluation (SemEval) provides multiple ABSA evaluation datasets Pontiki et al. (2014); Cortis et al. (2017); Pontiki et al. (2015, 2016). Dong et al. (2014) also release an ABSA dataset based on Twitter posts.

Stance Detection (SD)

SD Thomas et al. (2006); Hasan and Ng (2013) takes two inputs, a document $d$ and a target $c$ . $c$ can be either an entity (e.g. a person) or a claim. The goal is to classify the stance of $d$ towards $c$ into the label set $L=$ {Agree, Disagree[, Discuss, Unrelated]}. $c$ does not have to physically appear in $d$ . The number of $c$ per $d$ is not limited either.

One benchmark dataset for SD is from SemEval-2016 Task 6 Mohammad et al. (2016b). The Brexit Blog Corpus Simaki et al. (2017), US Election Tweets Corpus Mohammad et al. (2016a), and Moral Foundations Twitter Corpus Hoover et al. (2019) have also received much attention recently. Siddiqua et al. (2019) publish an SD dataset where each document contains two targets and two stance labels. Some other related datasets are provided in the review by Küçük and Can (2020).

Emotion Recognition (ER)

ER is closely related to SA, but with more fine-grained labels and better expressiveness Zhou et al. (2018). ER classifies $d$ into $L$ containing pre-defined emotion types or draws intensity scores of $d$ over the emotion types in $L$ . The most common ER label set contains eight basic emotion types plus a neutral class, i.e. $L=$ {Joy, Sadness, Anger, Fear, Anticipation, Surprise, Love, Disgust, Neutral} Ekman (1992).

The standard evaluation datasets of uni-modal ER consist of the EmoBank Buechel and Hahn (2017) and MELD Poria et al. (2019) datasets. Benchmark datasets on multi-modal ER include the IEMOCAP Busso et al. (2008), MOUD Pérez-Rosas et al. (2013), ICT-MMMO Wöllmer et al. (2013), MOSI Zadeh et al. (2016) and MOSEI Bagher Zadeh et al. (2018) datasets.

Metaphor Detection (MD)

A word or phrase is metaphorical if it bears different meaning from its literal semantic meaning in a context. MD is designed to detect whether $d$ or any word group $w_{i},...,w_{i+m}$ in $d$ is metaphorical. Benchmarked datasets on MD include MOH Mohammad et al. (2016c) and its subset MOH-X, TSV Tsvetkov et al. (2014), TroFi Birke and Sarkar (2006), VUA Steen et al. (2010), and LCC Mohler et al. (2016).

Other Tasks

The Irony Detection (ID), Offense Detection (OD), Formality Classification (FC), and Humor Detection (HD) tasks are all document-level classification tasks. ID detects whether the contextual meaning of $d$ is opposite to its literal meaning. SARC Khodak et al. (2018) is a common evaluation dataset on ID. Castro et al. (2019) also release a multi-modal ID dataset named MUStARD. OD classifies $d$ into $L=$ {Offensive, Non-offensive}. Subtasks of OD include the detection of hate speech, cyberbullying, and cyber-aggression Zampieri et al. (2019). Zampieri et al. (2019) and Ousidhoum et al. (2019) provide two benchmarked OD datasets. FC evaluates whether $d$ is formal or informal and HD identifies humorous writing styles from $d$ . Rao and Tetreault (2018) provide a large-scale dataset for formality transfer whose labels also fit the FC task. Weller and Seppi (2019) label an HD dataset based on data from Reddit, Kaggle, and Pun of the Day. Zhang et al. (2019b) make HD more fine-grained by introducing eight humor categories and five humor levels to the annotations.

3 Methods

In this section, we summarize useful features and common neural network architectures in style-related tasks. We also display the various feature engineering methods and model designs to inspire future research on writing style understanding.

3.1 Linguistic Features

Language Model Features

N-gram features on both word and character levels are among the most frequently used language model features for representing writing styles. The most common choices are word-level 1-, 2-, and 3-gram and character-level 3-, 4-, and 5-gram features. Muttenthaler et al. (2019) additionally show the importance of punctuation marks in the AA task using an n-gram model in which everything except punctuation marks is masked by asterisk symbols (*). Markov et al. (2017) claim that digits and named entities are also key identifiers of writing styles. Sapkota et al. (2015) attribute the advantage of character-level n-gram features to the high importance of subword features (e.g. suffixes and prefixes) in authorship-related tasks. Zhao et al. (2019) loosen the constraint of n-gram features and consider the co-occurrence of word pairs instead. Statistical features, e.g. TF-IDF scores, are regularly used in combination with the language-model-based features to avoid overemphasizing stop words Vu et al. (2018). As text representations generated from n-gram features tend to be high-dimensional and sparse, Niu et al. (2017) apply Principal Component Analysis to compress them into low-dimensional vectors. Zhou et al. (2018) achieve a similar goal using topic models (e.g., Latent Dirichlet Allocation). Recently, many researchers have turned to neural language models (e.g. the Skip-gram model, Mikolov et al. (2013)) for text representations Zhang and Singh (2018); Kumar et al. (2017); Rei et al. (2017). Word vectors from neural language models can be compared, clustered, and used in any statistical method to form document representations Preoţiuc-Pietro and Devlin Marier (2019); Aragón et al. (2019).
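To make the character-level features concrete, the following is a minimal sketch of character n-gram counting and of punctuation-only masking in the spirit of Muttenthaler et al. (2019). The function names are hypothetical, and keeping whitespace unmasked is our assumption rather than a detail confirmed by the cited work.

```python
from collections import Counter
import string


def char_ngrams(text, n):
    """Count overlapping character n-grams in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def mask_non_punctuation(text):
    """Replace every character that is neither punctuation nor whitespace with '*'.

    Assumption: whitespace is preserved so that the spacing around
    punctuation marks stays visible to the n-gram model.
    """
    return "".join(
        ch if ch in string.punctuation or ch.isspace() else "*"
        for ch in text
    )


doc = "Well, well... who is it?"
profile = char_ngrams(doc.lower(), 3)   # character trigram counts
masked = mask_non_punctuation(doc)      # only punctuation and spacing survive
masked_profile = char_ngrams(masked, 3)
```

The masked profile can be fed to the same classifier as the plain n-gram profile, which isolates how much of the authorship signal is carried by punctuation habits alone.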

Syntactic Features

Part-of-Speech (POS) tags and dependency relations are the two main sets of syntactic features used in style-related tasks. The co-occurrence of a set of POS tags forms syntactic patterns that can be used to recognize stylistic phenomena, e.g. complaints Preoţiuc-Pietro et al. (2019). Flekova et al. (2016) also argue that the pattern of using each POS in a document is related to the formality of the text. Dependency parsing is in most cases used in combination with POS tagging to help extract meaningful syntactic patterns. Basile et al. (2019) claim that pure POS tags are simplistic and are only able to model shallow syntactic features, while dependency parsing provides richer authorship information. Their experiments suggest that jointly using POS and dependency features results in the best performance. The reason why syntactic features are important in style-based tasks is clear: writing styles are usually specific to the choice and arrangement of words in a document, so the interrelationship among words is crucial.

Lexical Features

When addressing stylistic phenomena including sentiments and emotions, specifically-designed lexicons provide solid help, especially to short and simple documents. Preoţiuc-Pietro et al. (2019), for example, approach the SA task with the help of MPQA Wiebe et al. (2005) lexicon for sentiment, and the NRC Mohammad and Turney (2010, 2013) and Volkova & Bachrach Volkova and Bachrach (2016) lexicons for sentiment and emotion. The function of lexicons, however, is limited as stylistic phenomena are often caused by linguistic expressions in a long span of text. According to Preoţiuc-Pietro et al. (2019), the best-performing lexicon in their experiments produces worse results than a simple bag-of-words model. The semantic relatedness between words is beneficial to style-related tasks as well. WordNet Miller (1998) is the most heavily used lexical database of semantic relation information among words, according to our survey. Mao et al. (2018) use hypernyms and synonyms of words from WordNet to disambiguate them when discovering nuanced metaphors from text. Since different authors tend to express different attitudes towards specific events, lexical features can help solve simple cases in authorship-related tasks with high accuracy and thus reduce noise when training complex models.

Other Features

In our survey, we came across some linguistic features that are used for specific tasks. We group these features here since we think they can potentially help detect authorship information. Downgraders and politeness are used by Preoţiuc-Pietro et al. (2019) to identify complaints and Mihalcea and Strapparava (2005) introduce alliteration, slang, and rhyming to the HD task. These features are all author-specific language-use features and qualify for coarse-grained authorship identification. Hashtags, URLs, retweets, tweet popularity, elongated words, hapax legomena, and superlatives are valuable features for AA Preoţiuc-Pietro and Devlin Marier (2019), and they fit all the authorship-based tasks in the social media domain well, e.g. the CP task.

3.2 Neural Network Architectures

Convolutional Neural Networks (CNNs)

CNNs LeCun et al. (1999) extract local dependency features from positionally nearby tokens, which makes them act similarly to n-gram models with n bounded by the sizes of the convolution kernels. Thanks to parameter sharing, CNNs are easy and time-efficient to train. Xue and Li (2018) combine the features extracted by multiple one-layer CNNs, each with a different kernel size, into document representations. While CNN-based models on style-related tasks generally share the same architecture, their inputs and external resources usually vary. Zhang et al. (2018b) apply CNNs on text embeddings explicitly enriched by n-gram and syntactic features. They show by experiments that these additional features help their model achieve better performance than vanilla CNNs. Ruder et al. (2016) argue that character-level CNNs are superior in capturing stylistic features. This is attributed to the ability of character-level CNNs to model sub-word features, e.g. prefixes, which are important style markers. Similarly, Ferracane et al. (2017) convert documents into character-level bigram embeddings and apply CNN feature extractors on these feature maps. Lexical features can also be used in conjunction with CNNs. For example, Dong and de Melo (2018) keep pre-trained sentiment embeddings of words from multiple domains in a memory module and inject the lexical information into CNN encodings of documents by appending normal word vectors attended by the sentiment embeddings to the end of the CNN output.
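The n-gram analogy can be illustrated with a small, dependency-free sketch of a one-layer text CNN with several kernel widths followed by max-over-time pooling. This is not the implementation of any cited model: the function names are hypothetical, and the random weights stand in for trained parameters.

```python
import random


def conv1d_max_pool(embeddings, kernel):
    """Slide one kernel over the token sequence and keep the largest response.

    embeddings: list of T token vectors, each of dimension d
    kernel:     list of k weight vectors, each of dimension d (kernel width k)
    """
    k = len(kernel)
    responses = []
    for start in range(len(embeddings) - k + 1):
        window = embeddings[start:start + k]
        score = sum(
            w * x
            for w_vec, x_vec in zip(kernel, window)
            for w, x in zip(w_vec, x_vec)
        )
        responses.append(max(score, 0.0))  # ReLU activation
    # Documents shorter than the kernel yield no window; fall back to 0.
    return max(responses, default=0.0)


def encode_document(embeddings, kernels):
    """Concatenate the pooled responses of all kernels into one vector."""
    return [conv1d_max_pool(embeddings, kern) for kern in kernels]


rng = random.Random(0)
dim, length = 4, 6
tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(length)]
# One kernel per width; widths 2, 3, and 4 mimic 2-, 3-, and 4-gram detectors.
kernels = [
    [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(width)]
    for width in (2, 3, 4)
]
doc_vector = encode_document(tokens, kernels)
```

Each entry of `doc_vector` records how strongly the best-matching window of the corresponding width responds, which is why the kernel width bounds the n of the implicit n-gram detector.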

Graph Neural Networks (GNNs)

GNNs have recently attracted a lot of research attention in the NLP community. Popular graph models on style-related tasks include Graph Convolutional Networks (GCNs) Kipf and Welling (2017) and Graph Attention Networks (GATs) Velickovic et al. (2018). While GCNs are similar to CNNs, they convolve on graphs, treating nodes connected by edges as neighbors regardless of the absolute positions of words. In addition to connectivity, GATs generate the representation of each node by applying a self-attention mechanism over all its neighbor nodes. These characteristics make GNNs appropriate for encoding dependency information. Researchers have been using multi-layer GCNs Sun et al. (2019); Zhang et al. (2019a) and multi-layer GATs Huang and Carley (2019) to encode documents after syntactic parsing for the ABSA task, for example. In these settings, GCNs and GATs are usually used jointly with other feature extractors (e.g. a Long Short-Term Memory (LSTM) model) since syntactic features alone are not enough for style-related tasks. From this perspective, GCNs and GATs are valuable model architectures in authorship-related tasks where syntactic information is crucial. As Ghosal et al. (2019) additionally use GCN to capture high-level features across dialogue utterances, we can foresee the application of GNNs in SCD tasks by viewing the paragraphs as dialogue utterances.
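For reference, the layer-wise propagation rule of a GCN as introduced by Kipf and Welling (2017) can be written as follows. The notation is ours and is only meant as a reading aid: $A$ is assumed to be the adjacency matrix of the dependency graph over the $n$ words of $d$, $H^{(l)}$ the matrix of word representations at layer $l$, and $W^{(l)}$ a trainable weight matrix.

$$\tilde{A}=A+I,\qquad \tilde{D}_{ii}=\sum_{j}\tilde{A}_{ij},\qquad H^{(l+1)}=\sigma\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l)}\,W^{(l)}\right)$$

Each word thus aggregates the representations of its syntactic neighbors and of itself, so that stacking $L$ layers lets information travel along dependency paths of length up to $L$, independently of how far apart the words are in the surface string.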

Recurrent Neural Networks (RNNs)

The use of RNNs is widespread in NLP tasks since RNNs are designed to encode sequential data, and because they are able to preserve long-term dependency information. LSTM and Gated Recurrent Units (GRU) are the two most prevalent model architectures in the family of RNNs. Biswas et al. (2015) show an example of approaching the SA task with a one-layer uni-directional GRU model. Similar to CNNs, RNN-based models are also used to extract character-level dependency features in style-related tasks Bagnall (2015). Multiple RNN-based model architectures have been invented to improve their encoding ability. Hierarchical RNNs Hihi and Bengio (1995) are designed to extract features at different abstraction levels. Chen et al. (2016) break documents down to word- and sentence-level pieces and use two LSTM models to generate sentence- and document-level encodings, respectively. Stacked RNNs are relatively rare in our survey. Akhtar et al. (2017) examine two-layer LSTM and GRU models on the SA task and their performances are very close to that of a single-layer CNN model. Rather, bidirectional RNNs (BiRNNs, Schuster and Paliwal (1997)) or stacked BiRNNs are popular for their ability to use information flows from both directions of the input sequence, which helps with a lot of style-related tasks Qian et al. (2017); Xu et al. (2018); Hu (2019); Chen et al. (2017). The attention mechanism Bahdanau et al. (2015) is applicable to all the RNN architectures we describe above. In pure RNN models, each token contributes to the output embeddings equally, while some words are in fact more important to the predictions than the rest. Impaired verbs and objects in a sentence are strong signs of metaphors, for example. The attention mechanism teaches the RNN-based models to dynamically weigh the token embeddings and thus improves their performances Felbo et al. (2017); Gao et al. (2018); Yang et al. (2017). In recent research, RNNs are frequently used in combination with CNNs as well. There are two main forms of assembling CNNs and RNNs. The first is to encode the document separately with RNN and CNN models and either concatenate the output vectors Rouvier (2017) or combine the predictions made by both models through majority voting Cliche (2017). The second is to stack the CNN and RNN models and have the CNN model extract local dependency features on the output of the RNN model or the other way around Li et al. (2018).
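The effect of attention on top of RNN outputs can be sketched as a weighted pooling of per-token hidden states. Note that this sketch uses plain dot-product scoring against a hypothetical query vector for brevity, not the additive scoring function of Bahdanau et al. (2015), and the hidden states are toy values rather than real RNN outputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Weigh each token's hidden state by its dot-product score with `query`.

    Returns the pooled document vector and the attention weights.
    """
    scores = [sum(h * q for h, q in zip(state, query)) for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [
        sum(w * state[i] for w, state in zip(weights, hidden_states))
        for i in range(dim)
    ]
    return pooled, weights


# Toy hidden states for three tokens; the query favors the first dimension.
states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_vector, weights = attention_pool(states, query=[2.0, 0.0])
```

With a zero query the weights are uniform and the pooling reduces to averaging, which corresponds to the case where every token contributes equally to the output.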

Memory Networks (MemNNs)

MemNNs Weston et al. (2015) encode sequences with the help of external memories storing global information and preceding states. Initialization, read, and update are the three basic operations on the memory cells in MemNNs, the latter two of which are usually implemented using the attention mechanism. Multi-hop memories are predominantly used in style-related tasks as they preserve features from multiple different abstraction levels LeCun et al. (2015). Tang et al. (2016) apply MemNNs in an ABSA task. Their model stacks embeddings of context words to initialize external memory and queries the memory with embeddings of the aspect terms. In recent research, MemNNs are usually used in parallel with other neural networks, e.g. RNNs and CNNs, to keep richer contextual information in the memories. Hazarika et al. (2018b) initialize and update the memories both with the help of GRU, while in Hazarika et al. (2018a) the memory cells are initialized by an ensembled CNN-GRU feature extractor. Memory cells in a MemNN model can save author-specific information, so we are especially optimistic about its use in AA and AP tasks.
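A simplified sketch of one attention-based read over the memory, and of how several hops are chained, is given below. The notation is ours and the formulation is a common end-to-end variant with shared embeddings, not necessarily the exact one used in the cited models: $m_{1},...,m_{N}$ are assumed to be the memory cells and $q^{(h)}$ the query at hop $h$ (e.g. the aspect-term embedding at $h=1$).

$$p_{i}^{(h)}=\frac{\exp\left(q^{(h)\top}m_{i}\right)}{\sum_{j=1}^{N}\exp\left(q^{(h)\top}m_{j}\right)},\qquad o^{(h)}=\sum_{i=1}^{N}p_{i}^{(h)}\,m_{i},\qquad q^{(h+1)}=q^{(h)}+o^{(h)}$$

After the last hop, the final query vector is passed to the classifier. Each hop re-reads the memory with a query that already contains what the previous hops retrieved, which is how multi-hop memories build up increasingly abstract features.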

AA model	Architecture	CCAT10	CCAT50	IMDB62	PAN-12
Markov et al. (2017)	CN-gram+SVM	79.60	-	-	-
Sari et al. (2017)	CN-gram+FastText	74.80	72.60	94.80	-
Zhang et al. (2018b)	CN-gram+CNN	86.80	76.50	95.21	-
Zhang et al. (2018b)	CN-gram+Syntax+CNN	88.20*	81.00*	96.16*	-
Jafariakinabad et al. (2020)	Syntax+CNN+RNN	-	-	-	100.00*
Akiva (2012)	SVM	-	-	-	85.71

Table 1: Model architectures and accuracy scores on four mainstream AA datasets. CN-gram refers to character-level n-gram. PAN-12 displays accuracy scores on task I of the 2012 PAN Authorship Attribution shared task. The best score on each dataset is marked with an asterisk (*).

Capsule Networks (CapsNets)

CapsNets Sabour et al. (2017) feature spatial relation modeling and the dynamic routing mechanism. CapsNets internally cluster the encodings of similar entities (e.g. words with the same sentiment labels) together in a capsule and adjust the encodings' attributes (e.g. sentiment polarity) with affine operations. The dynamic routing mechanism enables CapsNets to reduce the effect of noise words on the representation of the entire text. These features make CapsNets sensitive to writing styles and robust to irrelevant information. Chen and Qian (2019) use dynamic routing to select appropriate n-gram groups for predictions in the ABSA task. Jiang et al. (2019) show by experiments that even with a simple dynamic routing strategy (e.g. using averaged embeddings of sentiment words), CapsNets are able to boost the performance of already complex models (e.g. BERT Devlin et al. (2019), a pre-trained Transformer-based model). Xia et al. (2018) utilize CapsNets to address the zero-shot learning problem. All the above examples show that CapsNets are efficient in extracting and clustering style markers, and in making predictions accordingly. The use of CapsNets, alone or combined with other neural networks, on authorship-based tasks is therefore promising and deserves more research attention.
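For readers unfamiliar with dynamic routing, one routing iteration as defined by Sabour et al. (2017) can be summarized as follows. The symbols are ours: $\hat{u}_{j|i}$ is assumed to be the prediction of lower-level capsule $i$ (e.g. an n-gram feature) for higher-level capsule $j$ (e.g. a sentiment or style class), and the routing logits $b_{ij}$ are initialized to zero.

$$c_{ij}=\frac{\exp(b_{ij})}{\sum_{k}\exp(b_{ik})},\qquad s_{j}=\sum_{i}c_{ij}\,\hat{u}_{j|i},\qquad v_{j}=\frac{\lVert s_{j}\rVert^{2}}{1+\lVert s_{j}\rVert^{2}}\,\frac{s_{j}}{\lVert s_{j}\rVert},\qquad b_{ij}\leftarrow b_{ij}+\hat{u}_{j|i}\cdot v_{j}$$

Lower-level capsules whose predictions agree with the output $v_{j}$ receive larger coupling coefficients $c_{ij}$ in the next iteration, while those that disagree, such as noise words, are progressively down-weighted.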

Transformer Networks

Transformer Networks Vaswani et al. (2017) are a recently emerged family of neural networks, at the core of which are self-attention and positional encoding. Models based on Transformer Networks are usually deep, with very large numbers of parameters. In most cases, Transformer-based models, e.g. BERT, are pre-trained on large unlabeled corpora and fine-tuned on task-specific datasets. Thanks to their large numbers of parameters and amounts of training data, pre-trained Transformer-based models have been setting new records on style-related tasks since their emergence. For example, Weller and Seppi (2019) use BERT to encode text for the HD task and their model achieves results comparable to human annotators without feature engineering or model restructuring. Zhong et al. (2019) use external knowledge from ConceptNet Speer et al. (2017) and the NRC_VAD lexicon Mohammad (2018) to augment word embeddings with emotion information, and they introduce dialogue structure information to the self-attention mechanism in Transformer networks. This shows the advantage of combining Transformer-based models with handcrafted feature sets or other neural network architectures, providing possibilities of improving the performance of Transformer networks in authorship-based tasks where handcrafted feature sets are extensively researched.
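The two core components named above are defined in Vaswani et al. (2017) as follows, where $Q$, $K$, and $V$ are the query, key, and value matrices obtained from the token representations, $d_{k}$ is the key dimension, $pos$ is the token position, and $d_{\text{model}}$ is the embedding dimension. The sinusoidal encoding shown here is that of the original Transformer; pre-trained models may use other position embeddings.

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_{k}}}\right)V$$

$$PE_{(pos,2i)}=\sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right),\qquad PE_{(pos,2i+1)}=\cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Because every token attends to every other token in a single step, the distance between two style markers in the text does not limit how directly they can interact.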

4 Discussions

In this section, we first explain in more detail the feature engineering methods and model designs that achieve good results in extracting authorship information. We also analyze why some methods do not perform as well as expected. Next, we perform a similar analysis for the SA, ABSA, and ER tasks to infer whether any specific model design can be transferred to authorship-based tasks. These three tasks are among the most well-developed style-based tasks and the models we discuss are also top-performing models in these tasks.

We show the best-performing models with different classifier architectures on four AA datasets in Table 1 to discuss current research on the AA task. We choose PAN-12 instead of PAN-19 since the PAN-12 AA task is under in-domain settings, similar to the other tasks we list. According to the PAN official reports Kestemont et al. (2019); Juola (2012), existing systems are based on similar feature sets and classifiers, so the task choice does not affect our analysis. From the table we can see the importance of character-level n-gram features for the AA task. Syntactic features additionally show great potential in this task, according to our observations. As for the choice of model architectures, a shallow neural model (e.g. FastText) does not beat SVM classifiers with similar feature sets, while CNNs show superior encoding ability on the AA task. The combination of CNNs and RNNs with POS features also outperforms the SVM model on the PAN-12 Task I, achieving 100% accuracy on the test set. The task, however, is too constrained in data size, with 28 training examples and 14 test examples Juola (2012). As a result, models with handcrafted features and probabilistic models also perform well. Evaluations of the model by Jafariakinabad et al. (2020) on more complex AA tasks should be performed to show its strength and robustness.

Existing models on the PAN AV challenges rely almost entirely on handcrafted feature sets. Character-level n-gram features are again the most favored linguistic features. Bevendorff et al. (2019) construct character trigram vectors for the documents and evaluate the differences between each pair of documents as features using seven distance measures. Bagnall (2015) uses a character-level RNN-based model to verify authorship and achieves higher scores than Bevendorff et al. (2019) on the PAN-15 AV task, which demonstrates the power of deep neural networks on authorship-based tasks.
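The pairwise-distance idea behind such AV features can be sketched as follows. The two distance measures shown (cosine and Manhattan) are our illustrative choices and are not claimed to be among the seven measures used by Bevendorff et al. (2019); all function names are hypothetical.

```python
from collections import Counter
import math


def trigram_profile(text):
    """Relative frequencies of the character trigrams in `text`."""
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def cosine_distance(p, q):
    """1 - cosine similarity of two sparse profiles (1.0 if either is empty)."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 1.0
    return 1.0 - dot / (norm_p * norm_q)


def manhattan_distance(p, q):
    """Sum of absolute frequency differences over the union of trigrams."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))


def pair_features(doc_a, doc_b):
    """Feature vector for one document pair, to be fed to a verifier."""
    p, q = trigram_profile(doc_a), trigram_profile(doc_b)
    return [cosine_distance(p, q), manhattan_distance(p, q)]


same = pair_features("the cat sat on the mat", "the cat sat on the mat")
different = pair_features("aaaa", "bbbb")
```

A binary classifier trained on such feature vectors then predicts whether the two documents share an author; small distances across all measures point to the same author.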

AP model	Architecture	Bot	Gender
Johansson (2019)	Random Forest	95.95	83.79
Valencia et al. (2019)	Logistic Regression	90.61	84.32
Joo and Hwang (2019)	Transformer	93.33	83.60
Bolonyai et al. (2019)	RNN	91.36	75.72
Polignano et al. (2019)	CNN	91.82	79.73
Petrík and Chudá (2019)	CNN+RNN	90.08	77.58

Table 2: Accuracy scores on PAN-19 AP challenge. Bot refers to the Bot/user detection subtask and Gender is the author gender prediction subtask of the challenge.

SA model	Architecture	SST-2	SST-5	RT	Yelp
Xu et al. (2019)	CNN	81.60	41.99	76.12	61.18
Xu et al. (2019)	RNN	81.58	41.67	76.22	61.86
Xu et al. (2019)	Transformer	93.03	55.38	88.68	67.85

ABSA model	Architecture	Twitter	Laptop14	Rest14	Rest16
Jiang et al. (2019)	CapsNets+Transformer	-	-	85.93	-
Sun et al. (2019)	RNN+GCN	73.66	72.99	74.02	69.93
Tang et al. (2019)	RNN+CNN+Attention	77.72	73.84	72.90	-
Huang and Carley (2019)	RNN+Transformer+GAT	-	80.10	83.00	-

ER model	Architecture	IEMOCAP	MELD
Hazarika et al. (2018a)	CNN+RNN+MemNN	63.50	-
Ghosal et al. (2019)	RNN+Attention+GCN	64.18	58.10
Zhong et al. (2019)	Transformer	59.56	58.18

Table 3: Evaluation results on the SA, ABSA, and ER tasks. SST-2 and SST-5 refer to the binary and fine-grained Stanford Sentiment Treebank datasets, respectively; Laptop14 and Rest14 are the laptop and restaurant datasets from SemEval 2014 Task 4; Rest16 is from SemEval 2016 Task 5. Accuracy scores are reported for the SST-2, SST-5, RT, and Yelp challenge datasets and F-1 scores for the remaining tasks.

Table 2 lists the performances of six systems, each with a different model architecture, on the PAN-19 AP shared task Pardo and Rosso (2019). It is worth noting that the RNN- and CNN-based models perform worse than the Random Forest model on both subtasks, with the largest gaps on gender prediction. This is most likely due to the limited training data size (2,060 records for bot/user specification and 2,060 for gender prediction subtasks). The BERT-based model Joo and Hwang (2019) generates better results than the RNN- or CNN-based models, probably because it has been pre-trained on a large corpus. In the AP task, word- and character-level n-gram features are still among the top choices and are used in all the models listed in Table 2. For example, Johansson (2019) bases their predictions on the TF-IDF scores of the 300 most frequent word unigrams in the training set. On the other hand, since the corpus is made up of tweets, almost all the systems consider tweet-specific linguistic features including the counts of URLs, retweets, mentions, emojis, wrongly-spelled words, hashtags, etc. Valencia et al. (2019) and Petrík and Chudá (2019) replace emojis with their respective descriptive phrases at the preprocessing stage to provide the models with enriched stylistic information. The knowledge learned from the AP tasks applies well to the PAN-19 CP challenge Wiegmann et al. (2019), where the number of instances in each class is very imbalanced. For example, Martinc et al. (2019) show that directly using BERT or other deep neural networks is not as competitive as statistical or probabilistic methods.
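The tweet-specific counts mentioned above are simple to extract. The sketch below shows one possible way to do so with regular expressions; the patterns, the feature names, and the "RT @" retweet convention are our assumptions and do not reproduce the preprocessing of any PAN-19 system.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# A word containing the same character three or more times in a row.
ELONGATED_RE = re.compile(r"\b\w*(\w)\1{2,}\w*\b")


def tweet_features(tweet):
    """Count tweet-specific style markers in a single tweet."""
    # URLs are removed before looking for elongated words so that
    # fragments such as "www" are not counted as elongations.
    without_urls = URL_RE.sub(" ", tweet)
    return {
        "urls": len(URL_RE.findall(tweet)),
        "mentions": len(MENTION_RE.findall(without_urls)),
        "hashtags": len(HASHTAG_RE.findall(without_urls)),
        "is_retweet": int(tweet.startswith("RT @")),
        "elongated": sum(1 for _ in ELONGATED_RE.finditer(without_urls)),
    }


example = "RT @alice: soooo good!!! #nlp #style https://example.com/x"
features = tweet_features(example)
```

Such counts are typically concatenated with the n-gram or TF-IDF vector of the author's tweets before being passed to a classifier such as a Random Forest.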

The design of neural models has been the major topic in a wide range of style-based tasks, while neural networks are mostly used in the simplest way in authorship-related tasks. To better fit deep neural networks to authorship-related tasks, we also review the usage of neural models in other style-related tasks. Table 3 displays current top-ranking results and their model architectures on SA, ABSA, and ER tasks. Directly applying pre-trained Transformer-based models on the text generates good results except when the utterances are short and transitions between them are frequent Zhong et al. (2019). This suggests that Transformer Networks can potentially be applied to AA and AV tasks without much modification, if the training instances in each class are enough to fine-tune a Transformer-based model. Using Transformer-based models in the SCD task is also promising since they are stronger than RNNs and CNNs in encoding long text pieces, e.g. paragraphs, in an article. CapsNets are effective extractors for stylistic features that can probably be used to solve authorship-based tasks since their dynamic routing mechanism benefits the identification of style markers. Jiang et al. (2019) demonstrate the power of CapsNets by the combined use of CapsNets and Transformer networks in an ABSA task. Xu et al. (2019) additionally introduce adversarial training Goodfellow et al. (2014) to train a robust classifier. Similar methods can be used in authorship-based tasks, e.g. the AP and CP tasks, to compensate for the lack of training data in small classes, relieving the problem of data imbalance.

It is also noteworthy that the attention mechanism is widely used in the style-related tasks we surveyed, while participants in the authorship identification tasks seldom take advantage of attention. The attention mechanism weighs each token in the text differently according to the tokens' interrelationships, so that the classifier is not confused by unimportant tokens. It can be transferred to authorship-related tasks as well, since style markers are sparse compared to other words and the information useful for classification can often be located easily even with handcrafted features alone. The combined use of CNNs and RNNs or MemNNs also shows strong ability in style-related tasks, which is consistent with the results shown in Table 1. Besides hard-coding POS or dependency relationship information into the inputs of CNNs, GNNs used in combination with dependency features show strong ability on the ABSA and ER tasks, implying the importance of syntactic features. These lightweight neural network architectures, though they do not outperform Transformer-based models on many tasks, are probably more suitable for the authorship-related shared tasks in PAN, since they require much less data to train. They can also be combined easily with handcrafted feature sets, without the need to restructure the models or retrain the weights.
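
As a rough illustration of the token weighting described above, the sketch below implements a simplified dot-product attention pooling in plain Python. It scores each token vector against a single query vector, whereas the surveyed models learn such weights jointly with the classifier and often attend between tokens. The function names, the query vector, and the toy inputs are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(token_vectors, query):
    """Weight each token vector by its dot product with `query` and
    return the weighted sum together with the attention weights."""
    scores = [sum(t * q for t, q in zip(vec, query)) for vec in token_vectors]
    weights = softmax(scores)
    dim = len(query)
    pooled = [
        sum(w * vec[i] for w, vec in zip(weights, token_vectors))
        for i in range(dim)
    ]
    return pooled, weights


# Toy usage: the first token is aligned with the (hypothetical) query,
# so it receives the largest weight; the other two tokens are down-weighted.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
query = [2.0, 0.0]
pooled, weights = attention_pool(tokens, query)
```

In an authorship setting, the query would play the role of a learned "style marker" detector, so that sparse but informative tokens dominate the pooled representation passed to the classifier.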

5 Conclusion

The identification of authorship is a foundational task in NLP research. In this paper, we analyzed linguistic features and model architectures in recent authorship-based research. Our survey indicates that although feature engineering has been studied extensively for this task, research on the use of neural network models lags behind that in other style-related tasks. As authorship-based tasks become increasingly complex, the sole use of traditional machine learning models with handcrafted feature sets will become less competitive. To better take advantage of the power of deep neural networks, we examined the model designs used in various writing style understanding tasks, which we hope will inspire future research on authorship-based tasks.

References

Akhtar et al. (2017) Md Shad Akhtar, Abhishek Kumar, Deepanway Ghosal, Asif Ekbal, and Pushpak Bhattacharyya. A multilayer perceptron based ensemble technique for fine-grained financial sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 540–546, Copenhagen, Denmark. Association for Computational Linguistics.
Akiva (2012) Navot Akiva. Authorship and plagiarism detection using binary bow features. In CLEF (Online Working Notes/Labs/Workshop).
Aragón et al. (2019) Mario Ezra Aragón, Adrian Pastor López-Monroy, Luis Carlos González-Gurrola, and Manuel Montes-y Gómez. Detecting depression in social media using fine-grained emotions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1481–1486, Minneapolis, Minnesota. Association for Computational Linguistics.
Bagher Zadeh et al. (2018) AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246, Melbourne, Australia. Association for Computational Linguistics.
Bagnall (2015) Douglas Bagnall. Author identification using multi-headed recurrent neural networks. In Working Notes of CLEF 2015 - Conference and Labs of the Evaluation forum, Toulouse, France, September 8-11, 2015, volume 1391 of CEUR Workshop Proceedings. CEUR-WS.org.
Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
Basile et al. (2019) Angelo Basile, Albert Gatt, and Malvina Nissim. You write like you eat: Stylistic variation as a predictor of social stratification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2583–2593, Florence, Italy. Association for Computational Linguistics.
Bevendorff et al. (2019) Janek Bevendorff, Matthias Hagen, Benno Stein, and Martin Potthast. Bias analysis and mitigation in the evaluation of authorship verification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6301–6306, Florence, Italy. Association for Computational Linguistics.
Birke and Sarkar (2006) Julia Birke and Anoop Sarkar. A clustering approach for nearly unsupervised recognition of nonliteral language. In 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy. Association for Computational Linguistics.
Biswas et al. (2015) Shamim Biswas, Ekamber Chadda, and Faiyaz Ahmad. Sentiment analysis with gated recurrent units. Department of Computer Engineering. Annual Report Jamia Millia Islamia New Delhi, India.
Bolonyai et al. (2019) Flora Bolonyai, Jakab Buda, and Eszter Katona. Bot or not: A two-level approach in author profiling. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings. CEUR-WS.org.
Buechel and Hahn (2017) Sven Buechel and Udo Hahn. EmoBank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 578–585, Valencia, Spain. Association for Computational Linguistics.
Busso et al. (2008) Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4):335.
Castro et al. (2019) Santiago Castro, Devamanyu Hazarika, Verónica Pérez-Rosas, Roger Zimmermann, Rada Mihalcea, and Soujanya Poria. Towards multimodal sarcasm detection (an Obviously perfect paper). In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4619–4629, Florence, Italy. Association for Computational Linguistics.
Chen et al. (2016) Huimin Chen, Maosong Sun, Cunchao Tu, Yankai Lin, and Zhiyuan Liu. Neural sentiment classification with user and product attention. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1650–1659, Austin, Texas. Association for Computational Linguistics.
Chen et al. (2017) Peng Chen, Zhongqian Sun, Lidong Bing, and Wei Yang. Recurrent attention network on memory for aspect sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 452–461, Copenhagen, Denmark. Association for Computational Linguistics.
Chen and Qian (2019) Zhuang Chen and Tieyun Qian. Transfer capsule network for aspect level sentiment classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 547–556, Florence, Italy. Association for Computational Linguistics.
Cliche (2017) Mathieu Cliche. BB_twtr at SemEval-2017 task 4: Twitter sentiment analysis with CNNs and LSTMs. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 573–580, Vancouver, Canada. Association for Computational Linguistics.
Cortis et al. (2017) Keith Cortis, André Freitas, Tobias Daudert, Manuela Huerlimann, Manel Zarrouk, Siegfried Handschuh, and Brian Davis. SemEval-2017 task 5: Fine-grained sentiment analysis on financial microblogs and news. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 519–535, Vancouver, Canada. Association for Computational Linguistics.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Dong et al. (2014) Li Dong, Furu Wei, Chuanqi Tan, Duyu Tang, Ming Zhou, and Ke Xu. Adaptive recursive neural network for target-dependent twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 49–54, Baltimore, Maryland. Association for Computational Linguistics.
Dong and de Melo (2018) Xin Dong and Gerard de Melo. A helping hand: Transfer learning for deep sentiment analysis. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2524–2534, Melbourne, Australia. Association for Computational Linguistics.
Ekman (1992) Paul Ekman. An argument for basic emotions. Cognition & emotion, 6(3-4):169–200.
Felbo et al. (2017) Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1615–1625, Copenhagen, Denmark. Association for Computational Linguistics.
Feng and Hirst (2013) Vanessa Wei Feng and Graeme Hirst. Patterns of local discourse coherence as a feature for authorship attribution. Literary and Linguistic Computing, 29(2):191–198.
Ferracane et al. (2017) Elisa Ferracane, Su Wang, and Raymond Mooney. Leveraging discourse information effectively for authorship attribution. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 584–593, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Flekova et al. (2016) Lucie Flekova, Daniel Preoţiuc-Pietro, and Lyle Ungar. Exploring stylistic variation with age and income on twitter. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 313–319, Berlin, Germany. Association for Computational Linguistics.
Gao et al. (2018) Ge Gao, Eunsol Choi, Yejin Choi, and Luke Zettlemoyer. Neural metaphor detection in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 607–613, Brussels, Belgium. Association for Computational Linguistics.
Ghosal et al. (2019) Deepanway Ghosal, Navonil Majumder, Soujanya Poria, Niyati Chhaya, and Alexander Gelbukh. DialogueGCN: A graph convolutional neural network for emotion recognition in conversation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 154–164, Hong Kong, China. Association for Computational Linguistics.
Goodfellow et al. (2014) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2672–2680.
Hasan and Ng (2013) Kazi Saidul Hasan and Vincent Ng. Stance classification of ideological debates: Data, models, features, and constraints. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 1348–1356, Nagoya, Japan. Asian Federation of Natural Language Processing.
Hazarika et al. (2018a) Devamanyu Hazarika, Soujanya Poria, Rada Mihalcea, Erik Cambria, and Roger Zimmermann. ICON: Interactive conversational memory network for multimodal emotion detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2594–2604, Brussels, Belgium. Association for Computational Linguistics.
Hazarika et al. (2018b) Devamanyu Hazarika, Soujanya Poria, Amir Zadeh, Erik Cambria, Louis-Philippe Morency, and Roger Zimmermann. Conversational memory network for emotion recognition in dyadic dialogue videos. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2122–2132, New Orleans, Louisiana. Association for Computational Linguistics.
Herrmann et al. (2015) J Berenike Herrmann, Karina van Dalen-Oskam, and Christof Schöch. Revisiting style, a key concept in literary studies. Journal of literary theory, 9(1):25–52.
Hihi and Bengio (1995) Salah El Hihi and Yoshua Bengio. Hierarchical recurrent neural networks for long-term dependencies. In Advances in Neural Information Processing Systems 8, NIPS, Denver, CO, USA, November 27-30, 1995, pages 493–499. MIT Press.
Hoover et al. (2019) Joseph Hoover, Gwenyth Portillo-Wightman, Leigh Yeh, Shreya Havaldar, Aida Mostafazadeh Davani, Ying Lin, Brendan Kennedy, Mohammad Atari, Zahra Kamel, Madelyn Mendlen, et al. Moral foundations twitter corpus: A collection of 35k tweets annotated for moral sentiment.
Hu (2019) Shengli Hu. Detecting concealed information in text and speech. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 402–412, Florence, Italy. Association for Computational Linguistics.
Huang and Carley (2019) Binxuan Huang and Kathleen Carley. Syntax-aware aspect level sentiment classification with graph attention networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5469–5477, Hong Kong, China. Association for Computational Linguistics.
Jafariakinabad et al. (2020) Fereshteh Jafariakinabad, Sansiri Tarnpradab, and Kien A. Hua. Syntactic neural model for authorship attribution. In Proceedings of the Thirty-Third International Florida Artificial Intelligence Research Society Conference, Originally to be held in North Miami Beach, Florida, USA, May 17-20, 2020, pages 234–239. AAAI Press.
Jiang et al. (2019) Qingnan Jiang, Lei Chen, Ruifeng Xu, Xiang Ao, and Min Yang. A challenge dataset and effective models for aspect-based sentiment analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6280–6285, Hong Kong, China. Association for Computational Linguistics.
Johansson (2019) Fredrik Johansson. Supervised classification of twitter accounts based on textual content of tweets. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings. CEUR-WS.org.
Joo and Hwang (2019) Youngjun Joo and Inchon Hwang. Author profiling on social media: An ensemble learning approach using various features. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings. CEUR-WS.org.
Juola (2006) Patrick Juola. Authorship attribution. Found. Trends Inf. Retr., 1(3):233–334.
Juola (2012) Patrick Juola. An overview of the traditional authorship attribution subtask. In CLEF 2012 Evaluation Labs and Workshop, Online Working Notes, Rome, Italy, September 17-20, 2012, volume 1178 of CEUR Workshop Proceedings. CEUR-WS.org.
Kestemont et al. (2019) Mike Kestemont, Efstathios Stamatatos, Enrique Manjavacas, Walter Daelemans, Martin Potthast, and Benno Stein. Overview of the cross-domain authorship attribution task at PAN 2019. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings. CEUR-WS.org.
Kestemont et al. (2016) Mike Kestemont, Justin Anthony Stover, Moshe Koppel, Folgert Karsdorp, and Walter Daelemans. Authenticating the writings of Julius Caesar. Expert Syst. Appl., 63:86–96.
Khodak et al. (2018) Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli. A large self-annotated corpus for sarcasm. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Kipf and Welling (2017) Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.
Küçük and Can (2020) Dilek Küçük and Fazli Can. Stance detection: A survey. ACM Comput. Surv., 53(1):12:1–12:37.
Kumar et al. (2017) Abhishek Kumar, Abhishek Sethi, Md Shad Akhtar, Asif Ekbal, Chris Biemann, and Pushpak Bhattacharyya. IITPB at SemEval-2017 task 5: Sentiment prediction in financial text. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 894–898, Vancouver, Canada. Association for Computational Linguistics.
LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444.
LeCun et al. (1999) Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. Object recognition with gradient-based learning. In Shape, Contour and Grouping in Computer Vision, volume 1681 of Lecture Notes in Computer Science, page 319. Springer.
Li et al. (2018) Xin Li, Lidong Bing, Wai Lam, and Bei Shi. Transformation networks for target-oriented sentiment classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 946–956, Melbourne, Australia. Association for Computational Linguistics.
Mao et al. (2018) Rui Mao, Chenghua Lin, and Frank Guerin. Word embedding and WordNet based metaphor identification and interpretation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1222–1231, Melbourne, Australia. Association for Computational Linguistics.
Markov et al. (2017) Ilia Markov, Efstathios Stamatatos, and Grigori Sidorov. Improving cross-topic authorship attribution: The role of pre-processing. In International Conference on Computational Linguistics and Intelligent Text Processing, pages 289–302. Springer.
Martinc et al. (2019) Matej Martinc, Blaz Skrlj, and Senja Pollak. Who is hot and who is not? profiling celebs on twitter. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings. CEUR-WS.org.
Mihalcea and Strapparava (2005) Rada Mihalcea and Carlo Strapparava. Making computers laugh: Investigations in automatic humor recognition. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 531–538, Vancouver, British Columbia, Canada. Association for Computational Linguistics.
Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings.
Miller (1998) George A Miller. WordNet: An electronic lexical database. MIT press.
Mohammad (2018) Saif Mohammad. Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 174–184, Melbourne, Australia. Association for Computational Linguistics.
Mohammad et al. (2016a) Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. A dataset for detecting stance in tweets. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 3945–3952, Portorož, Slovenia. European Language Resources Association (ELRA).
Mohammad et al. (2016b) Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. SemEval-2016 task 6: Detecting stance in tweets. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 31–41, San Diego, California. Association for Computational Linguistics.
Mohammad et al. (2016c) Saif Mohammad, Ekaterina Shutova, and Peter Turney. Metaphor as a medium for emotion: An empirical study. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, pages 23–33, Berlin, Germany. Association for Computational Linguistics.
Mohammad and Turney (2010) Saif Mohammad and Peter Turney. Emotions evoked by common words and phrases: Using mechanical turk to create an emotion lexicon. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pages 26–34, Los Angeles, CA. Association for Computational Linguistics.
Mohammad and Turney (2013) Saif M Mohammad and Peter D Turney. Crowdsourcing a word–emotion association lexicon. Computational Intelligence, 29(3):436–465.
Mohler et al. (2016) Michael Mohler, Mary Brunson, Bryan Rink, and Marc Tomlinson. Introducing the LCC metaphor datasets. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 4221–4227, Portorož, Slovenia. European Language Resources Association (ELRA).
Muttenthaler et al. (2019) Lukas Muttenthaler, Gordon Lucas, and Janek Amann. Authorship attribution in fan-fictional texts given variable length character and word n-grams. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings. CEUR-WS.org.
Niu et al. (2017) Xing Niu, Marianna Martindale, and Marine Carpuat. A study of style in machine translation: Controlling the formality of machine translation output. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2814–2819, Copenhagen, Denmark. Association for Computational Linguistics.
Ousidhoum et al. (2019) Nedjma Ousidhoum, Zizheng Lin, Hongming Zhang, Yangqiu Song, and Dit-Yan Yeung. Multilingual and multi-aspect hate speech analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4667–4676, Hong Kong, China. Association for Computational Linguistics.
Pang and Lee (2005) Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 115–124, Ann Arbor, Michigan. Association for Computational Linguistics.
Pang and Lee (2008) Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Found. Trends Inf. Retr., 2(1-2):1–135.
Pang et al. (2002) Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pages 79–86. Association for Computational Linguistics.
Pardo and Rosso (2019) Francisco M. Rangel Pardo and Paolo Rosso. Overview of the 7th author profiling task at PAN 2019: Bots and gender profiling in twitter. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings. CEUR-WS.org.
Pérez-Rosas et al. (2013) Verónica Pérez-Rosas, Rada Mihalcea, and Louis-Philippe Morency. Utterance-level multimodal sentiment analysis. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 973–982, Sofia, Bulgaria. Association for Computational Linguistics.
Petrík and Chudá (2019) Juraj Petrík and Daniela Chudá. Bots and gender profiling with convolutional hierarchical recurrent neural network. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings. CEUR-WS.org.
Polignano et al. (2019) Marco Polignano, Marco Giuseppe de Pinto, Pasquale Lops, and Giovanni Semeraro. Identification of bot accounts in twitter using 2d cnns on user-generated contents. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings. CEUR-WS.org.
Pontiki et al. (2016) Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, Mohammad AL-Smadi, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphée De Clercq, Véronique Hoste, Marianna Apidianaki, Xavier Tannier, Natalia Loukachevitch, Evgeniy Kotelnikov, Nuria Bel, Salud María Jiménez-Zafra, and Gülşen Eryiğit. SemEval-2016 task 5: Aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 19–30, San Diego, California. Association for Computational Linguistics.
Pontiki et al. (2015) Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Suresh Manandhar, and Ion Androutsopoulos. SemEval-2015 task 12: Aspect based sentiment analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 486–495, Denver, Colorado. Association for Computational Linguistics.
Pontiki et al. (2014) Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. SemEval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 27–35, Dublin, Ireland. Association for Computational Linguistics.
Poria et al. (2019) Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. MELD: A multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 527–536, Florence, Italy. Association for Computational Linguistics.
Preoţiuc-Pietro and Devlin Marier (2019) Daniel Preoţiuc-Pietro and Rita Devlin Marier. Analyzing linguistic differences between owner and staff attributed tweets. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2848–2853, Florence, Italy. Association for Computational Linguistics.
Preoţiuc-Pietro et al. (2019) Daniel Preoţiuc-Pietro, Mihaela Gaman, and Nikolaos Aletras. Automatically identifying complaints in social media. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5008–5019, Florence, Italy. Association for Computational Linguistics.
Qian et al. (2017) Qiao Qian, Minlie Huang, Jinhao Lei, and Xiaoyan Zhu. Linguistically regularized LSTM for sentiment classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1679–1689, Vancouver, Canada. Association for Computational Linguistics.
Rao and Tetreault (2018) Sudha Rao and Joel Tetreault. Dear sir or madam, may I introduce the GYAFC dataset: Corpus, benchmarks and metrics for formality style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 129–140, New Orleans, Louisiana. Association for Computational Linguistics.
Rei et al. (2017) Marek Rei, Luana Bulat, Douwe Kiela, and Ekaterina Shutova. Grasping the finer point: A supervised similarity network for metaphor detection. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1537–1546, Copenhagen, Denmark. Association for Computational Linguistics.
Rouvier (2017) Mickael Rouvier. LIA at SemEval-2017 task 4: An ensemble of neural networks for sentiment classification. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 760–765, Vancouver, Canada. Association for Computational Linguistics.
Ruder et al. (2016) Sebastian Ruder, Parsa Ghaffari, and John G Breslin. Character-level and multi-channel convolutional neural networks for large-scale authorship attribution. arXiv preprint arXiv:1609.06686.
Sabour et al. (2017) Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 3856–3866.
Saeidi et al. (2016) Marzieh Saeidi, Guillaume Bouchard, Maria Liakata, and Sebastian Riedel. SentiHood: Targeted aspect based sentiment analysis dataset for urban neighbourhoods. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1546–1556, Osaka, Japan. The COLING 2016 Organizing Committee.
Sapkota et al. (2015) Upendra Sapkota, Steven Bethard, Manuel Montes, and Thamar Solorio. Not all character n-grams are created equal: A study in authorship attribution. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–102, Denver, Colorado. Association for Computational Linguistics.
Sari et al. (2017) Yunita Sari, Andreas Vlachos, and Mark Stevenson. Continuous n-gram representations for authorship attribution. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 267–273, Valencia, Spain. Association for Computational Linguistics.
Schler et al. (2006) Jonathan Schler, Moshe Koppel, Shlomo Argamon, and James W. Pennebaker. Effects of age and gender on blogging. In Computational Approaches to Analyzing Weblogs, Papers from the 2006 AAAI Spring Symposium, Technical Report SS-06-03, Stanford, California, USA, March 27-29, 2006, pages 199–205.
Schuster and Paliwal (1997) Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE transactions on Signal Processing, 45(11):2673–2681.
Seroussi et al. (2010) Yanir Seroussi, Ingrid Zukerman, and Fabian Bohnert. Collaborative inference of sentiments from texts. In Proceedings of the 18th International Conference on User Modeling, Adaptation, and Personalization, UMAP’10, pages 195–206, Berlin, Heidelberg. Springer-Verlag.
Siddiqua et al. (2019) Umme Aymun Siddiqua, Abu Nowshed Chy, and Masaki Aono. Tweet stance detection using an attention based neural ensemble model. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1868–1873, Minneapolis, Minnesota. Association for Computational Linguistics.
Simaki et al. (2017) Vasiliki Simaki, Maria Skeppstedt, Carita Paradis, Andreas Kerren, and Magnus Sahlgren. Annotating speaker stance in discourse: The Brexit blog corpus.
Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.
Speer et al. (2017) Robyn Speer, Joshua Chin, and Catherine Havasi. ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 4444–4451. AAAI Press.
Stamatatos (2008) Efstathios Stamatatos. Author identification: Using text sampling to handle the class imbalance problem. Inf. Process. Manage., 44(2):790–799.
Stamatatos (2009) Efstathios Stamatatos. A survey of modern authorship attribution methods. J. Assoc. Inf. Sci. Technol., 60(3):538–556.
Steen et al. (2010) Gerard J Steen, Aletta G Dorst, J Berenike Herrmann, Anna A Kaal, and Tina Krennmayr. Metaphor in usage. Cognitive Linguistics, 21(4):765–796.
Sun et al. (2019) Kai Sun, Richong Zhang, Samuel Mensah, Yongyi Mao, and Xudong Liu. Aspect-level sentiment analysis via convolution over dependency tree. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5679–5688, Hong Kong, China. Association for Computational Linguistics.
Tang et al. (2016) Duyu Tang, Bing Qin, and Ting Liu. Aspect level sentiment classification with deep memory network. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 214–224, Austin, Texas. Association for Computational Linguistics.
Tang et al. (2019) Jialong Tang, Ziyao Lu, Jinsong Su, Yubin Ge, Linfeng Song, Le Sun, and Jiebo Luo. Progressive self-supervised attention learning for aspect-level sentiment analysis. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 557–566, Florence, Italy. Association for Computational Linguistics.
Thomas et al. (2006) Matt Thomas, Bo Pang, and Lillian Lee. Get out the vote: Determining support or opposition from congressional floor-debate transcripts. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 327–335, Sydney, Australia. Association for Computational Linguistics.
Tsvetkov et al. (2014) Yulia Tsvetkov, Leonid Boytsov, Anatole Gershman, Eric Nyberg, and Chris Dyer. Metaphor detection with cross-lingual model transfer. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 248–258, Baltimore, Maryland. Association for Computational Linguistics.
Valencia et al. (2019) Alex I. Valencia Valencia, Helena Gómez-Adorno, Christopher Stephens Rhodes, and Gibran Fuentes Pineda. Bots and gender identification based on stylometry of tweet minimal structure and n-grams model. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings. CEUR-WS.org.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5998–6008.
Velickovic et al. (2018) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.
Volkova and Bachrach (2016) Svitlana Volkova and Yoram Bachrach. Inferring perceived demographics from user emotional tone and user-environment emotional contrast. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1567–1578, Berlin, Germany. Association for Computational Linguistics.
Vu et al. (2018) Thanh Vu, Dat Quoc Nguyen, Xuan-Son Vu, Dai Quoc Nguyen, Michael Catt, and Michael Trenell. NIHRIO at SemEval-2018 task 3: A simple and accurate neural network model for irony detection in Twitter. In Proceedings of the 12th International Workshop on Semantic Evaluation, pages 525–530, New Orleans, Louisiana. Association for Computational Linguistics.
Weller and Seppi (2019) Orion Weller and Kevin Seppi. Humor detection: A transformer gets the last laugh. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3621–3625, Hong Kong, China. Association for Computational Linguistics.
Weston et al. (2015) Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
Wiebe et al. (2005) Janyce Wiebe, Theresa Wilson, and Claire Cardie. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3):165–210.
Wiegmann et al. (2019) Matti Wiegmann, Benno Stein, and Martin Potthast. Celebrity profiling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2611–2618, Florence, Italy. Association for Computational Linguistics.
Wöllmer et al. (2013) Martin Wöllmer, Felix Weninger, Tobias Knaup, Björn Schuller, Congkai Sun, Kenji Sagae, and Louis-Philippe Morency. YouTube movie reviews: Sentiment analysis in an audio-visual context. IEEE Intelligent Systems, 28(3):46–53.
Xia et al. (2018) Congying Xia, Chenwei Zhang, Xiaohui Yan, Yi Chang, and Philip Yu. Zero-shot user intent detection via capsule neural networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3090–3099, Brussels, Belgium. Association for Computational Linguistics.
Xu et al. (2018) Chang Xu, Cécile Paris, Surya Nepal, and Ross Sparks. Cross-target stance classification with self-attention networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 778–783, Melbourne, Australia. Association for Computational Linguistics.
Xu et al. (2019) Jingjing Xu, Liang Zhao, Hanqi Yan, Qi Zeng, Yun Liang, and Xu Sun. LexicalAT: Lexical-based adversarial reinforcement training for robust sentiment classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5518–5527, Hong Kong, China. Association for Computational Linguistics.
Xue and Li (2018) Wei Xue and Tao Li. Aspect based sentiment analysis with gated convolutional networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2514–2523, Melbourne, Australia. Association for Computational Linguistics.
Yang et al. (2017) Min Yang, Wenting Tu, Jingxuan Wang, Fei Xu, and Xiaojun Chen. Attention based LSTM for target dependent sentiment classification.
Zadeh et al. (2016) Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages. IEEE Intelligent Systems, 31(6):82–88.
Zampieri et al. (2019) Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. Predicting the type and target of offensive posts in social media. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1415–1420, Minneapolis, Minnesota. Association for Computational Linguistics.
Zhang et al. (2019a) Chen Zhang, Qiuchi Li, and Dawei Song. Aspect-based sentiment classification with aspect-specific graph convolutional networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4568–4578, Hong Kong, China. Association for Computational Linguistics.
Zhang et al. (2019b) Dongyu Zhang, Heting Zhang, Xikai Liu, Hongfei Lin, and Feng Xia. Telling the whole story: A manually annotated Chinese dataset for the analysis of humor in jokes. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6401–6406, Hong Kong, China. Association for Computational Linguistics.
Zhang et al. (2018a) Lei Zhang, Shuai Wang, and Bing Liu. Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4):e1253.
Zhang et al. (2018b) Richong Zhang, Zhiyuan Hu, Hongyu Guo, and Yongyi Mao. Syntax encoding with application in authorship attribution. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2742–2753, Brussels, Belgium. Association for Computational Linguistics.
Zhang and Singh (2018) Zhe Zhang and Munindar Singh. Limbic: Author-based sentiment aspect modeling regularized with word embeddings and discourse relations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3412–3422, Brussels, Belgium. Association for Computational Linguistics.
Zhao et al. (2019) Zhenjie Zhao, Andrew Cattle, Evangelos Papalexakis, and Xiaojuan Ma. Embedding lexical features via tensor decomposition for small sample humor recognition. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6375–6380, Hong Kong, China. Association for Computational Linguistics.
Zhong et al. (2019) Peixiang Zhong, Di Wang, and Chunyan Miao. Knowledge-enriched transformer for emotion detection in textual conversations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 165–176, Hong Kong, China. Association for Computational Linguistics.
Zhou et al. (2018) Deyu Zhou, Yang Yang, and Yulan He. Relevant emotion ranking from text constrained with emotion relationships. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 561–571, New Orleans, Louisiana. Association for Computational Linguistics.
