
Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

IberLEF 2023, September 2023, Jaén, Spain


UPB at IberLEF-2023 AuTexTification: Detection of Machine-Generated Text using Transformer Ensembles

Andrei-Alexandru Preda, Dumitru-Clementin Cercel, Traian Rebedea, Costin-Gabriel Chiru
Computer Science and Engineering Department, Faculty of Automatic Control and Computers, University Politehnica of Bucharest, Bucharest 060042, Romania
Emails: andrei.preda3006@stud.acs.upb.ro, dumitru.cercel@upb.ro, traian.rebedea@upb.ro, costin.chiru@upb.ro
(2023)
Abstract

This paper describes the solutions submitted by the UPB team to the AuTexTification shared task, featured as part of IberLEF-2023. Our team participated in the first subtask, which requires identifying text documents produced by large language models instead of humans. The organizers provided a bilingual dataset for this subtask, comprising English and Spanish texts covering multiple domains, such as legal texts, social media posts, and how-to articles. We experimented mostly with deep learning models based on Transformers, as well as training techniques such as multi-task learning and virtual adversarial training, to obtain better results. We submitted three runs, two of which consisted of ensemble models. Our best-performing model achieved macro F1-scores of 66.63% on the English dataset and 67.10% on the Spanish dataset.

keywords:
Machine-Generated Text, Transformer, Multi-Task Learning, Virtual Adversarial Training

1 Introduction

Recently, computer-generated content has become increasingly present on the Internet. With the public release of powerful Large Language Models (LLMs) such as the Generative Pre-trained Transformer [1] and its derivative systems such as ChatGPT [2], generating texts is easier than ever, and such texts are probably harder to detect than ever. This phenomenon has already raised several ethical issues that society must address soon, an effort that can be helped by mechanisms to automatically and reliably detect computer-generated text.

The AuTexTification: Automated Text Identification shared task [3] is a natural language processing (NLP) competition at IberLEF-2023 [4]. Its main focus is detecting and understanding computer-generated text, especially that of LLMs. The competition presents two subtasks: (1) Subtask 1 is a binary classification problem in which participants have to detect whether a document was written by a human or by an artificial intelligence model, and (2) Subtask 2 is a multi-class classification problem in which participants have to identify which of several LLMs generated a given document. To address these subtasks, the organizers made available a bilingual dataset of documents produced by humans and computers in English and Spanish, covering several domains.

Our team participated only in the first subtask. We first experimented with more standard machine learning methods before moving to deep learning models, where we explored techniques such as multi-task learning (MTL) [5] and virtual adversarial training (VAT) [6]. Finally, we combined multiple independently trained models into ensembles, which performed best and were used to generate our submissions.

2 Related Work

While text classification is one of NLP’s fundamental and most well-established tasks, detecting computer-generated text is a relatively novel task. This is probably because, until recently, few systems could produce text realistic enough to fool humans. Creating such texts is commonly called natural language generation [7].

Currently, there seem to be various ways of addressing this problem, which can be classified into black-box and white-box methods [8]. White-box techniques require access to the target language model and can involve concepts such as watermarks, which models could embed into their outputs to make detection easier. Black-box methods are therefore more relevant to the task at hand, since we only have access to the model's output and do not even know which model produced it.

Black-box methods can involve both classical machine learning classification algorithms and ones based on deep learning [8]. To make predictions, traditional algorithms combine statistical features and linguistic patterns with classifiers such as Support Vector Machines (SVMs) [9]. On the other hand, deep learning methods usually involve fine-tuning pre-trained language models with supervised learning. These deep learning approaches often obtain state-of-the-art results but are harder to interpret, which also makes them harder to trust.

3 Methods

This section describes the different classification methods we tried and the final ensemble architectures we submitted.

3.1 Shallow Learning Models

3.1.1 Readability Scores

Similar to Stodden and Venugopal [10], we combined several linguistic features with pre-trained embeddings. Specifically, we computed the following readability scores: the Flesch reading ease score [11], the Gunning-Fog index [12], and the SMOG index [13]. The intuition behind this choice was that LLMs might not consider the ease of comprehension when generating texts. For example, the generated legal texts might be harder to understand than those written by humans. To compute the aforementioned scores, we used the Readability Python library [14], which offers 35 such features.

Then, we concatenated the readability scores with document-level pre-trained embeddings offered by the spaCy library [15], which are 300-dimensional and language-specific. The English embeddings are based on GloVe [16], while the Spanish ones are based on FastText [17]. Finally, all features were scaled to have zero mean and unit variance with scikit-learn's StandardScaler [18], before using them to train two classifiers, namely XGBoost [19] and k-Nearest Neighbors (kNN) [18].
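
As an illustration, the following Python sketch shows how such a feature pipeline can be assembled; the en_core_web_md model name, the pre-tokenized input expected by readability.getmeasures, and helper names such as build_features are assumptions rather than our exact implementation.

import numpy as np
import readability                      # andreasvc/readability [14]
import spacy
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

nlp = spacy.load("en_core_web_md")      # 300-dimensional English vectors

def build_features(doc_text):
    doc = nlp(doc_text)
    # getmeasures expects one sentence per line with space-separated tokens.
    tokenized = "\n".join(" ".join(t.text for t in sent) for sent in doc.sents)
    measures = readability.getmeasures(tokenized, lang="en")
    readability_feats = [v for category in measures.values() for v in category.values()]
    # Concatenate readability features with the document-level embedding.
    return np.concatenate([np.array(readability_feats, dtype=float), doc.vector])

def train_shallow_models(texts, labels):
    X = StandardScaler().fit_transform(np.vstack([build_features(t) for t in texts]))
    xgb = XGBClassifier().fit(X, labels)
    knn = KNeighborsClassifier(n_neighbors=10).fit(X, labels)
    return xgb, knn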

3.1.2 String Kernels

We also experimented with string kernels [20], which are kernel functions that measure the degree of similarity between two strings. An example of a simple string kernel counts the number of n-grams shared by the two strings without considering duplicates. Such a function can be computed for multiple sizes of n-grams and used as the kernel function of a classifier such as an SVM. We performed common natural language preprocessing operations on the input text: removing punctuation, removing stopwords, lowercasing all letters, and stemming the words. We used n-gram sizes between 3 and 5, and the SVM classifier implemented in scikit-learn [18]. Since custom kernels might need to be computed between each pair of input samples, using the entire training dataset would have taken a long time, so we tested the method only on a small slice of it, comprising several thousand samples.
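
A minimal sketch of such a kernel, here over word n-grams and plugged into scikit-learn's SVC as a precomputed Gram matrix, is given below; the helper names are illustrative, and the inputs are assumed to be already preprocessed as described above.

import numpy as np
from sklearn.svm import SVC

def ngram_set(text, n_low=3, n_high=5):
    # All word n-grams of sizes 3..5, stored as a set (duplicates ignored).
    tokens = text.split()
    return {tuple(tokens[i:i + n])
            for n in range(n_low, n_high + 1)
            for i in range(len(tokens) - n + 1)}

def gram_matrix(docs_a, docs_b):
    # Kernel value = number of n-grams shared by the two documents.
    sets_a = [ngram_set(d) for d in docs_a]
    sets_b = [ngram_set(d) for d in docs_b]
    return np.array([[len(a & b) for b in sets_b] for a in sets_a], dtype=float)

# train_texts / test_texts are already lowercased, stopword-free, and stemmed.
# svm = SVC(kernel="precomputed").fit(gram_matrix(train_texts, train_texts), train_labels)
# predictions = svm.predict(gram_matrix(test_texts, train_texts))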

3.2 Deep Learning Models

3.2.1 Transformers

The Transformer architecture was introduced in 2017 by Vaswani et al. [21] and currently powers numerous state-of-the-art solutions for many tasks. Transformers usually feature two main components, an encoder and a decoder, but these two parts can also be useful on their own. One example is the Bidirectional Encoder Representations from Transformers (BERT) [22] model family, which encodes input text into contextual embeddings. We experimented with several BERT versions: multilingual ones (i.e., XLM-RoBERTa [23] and multilingual BERT [22]) and one pre-trained on tweets (i.e., TwHIN-BERT [24]). Since BERT models can be large and typically require large amounts of data to train from scratch, we relied on transfer learning [25] instead, fine-tuning pre-trained models. Concretely, we used Transformer-based models to encode the raw input text into embeddings, which we then passed through a linear layer, followed by a dropout layer [26] and a final prediction head. The last layer produces the probability of the document being computer-generated, and the binary prediction is obtained by comparing it with a threshold.
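
The sketch below illustrates this architecture in PyTorch with the Hugging Face transformers library; the first-token pooling, the xlm-roberta-base default, and the layer sizes (taken from Section 4.2) are assumptions for illustration only.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TransformerClassifier(nn.Module):
    def __init__(self, model_name="xlm-roberta-base", hidden_size=64, dropout=0.2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.hidden = nn.Linear(self.encoder.config.hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]            # first-token representation
        logits = self.head(self.dropout(torch.relu(self.hidden(pooled))))
        return torch.sigmoid(logits).squeeze(-1)        # P(computer-generated)

# tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
# batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
# probabilities = TransformerClassifier()(batch["input_ids"], batch["attention_mask"])
# predictions = (probabilities > threshold).long()      # threshold chosen on the ROC curve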

3.2.2 Multi-Task Learning

As an additional method of preventing overfitting, we used multi-task learning. MTL refers to training a model to solve multiple tasks simultaneously. As such, these models typically feature a set of parameters shared across all tasks and separate prediction heads for each task. Intuitively, multi-task learning makes the overall problem harder, adding complexity that the model must adapt to and thus acting as a form of regularization. In our case, an easy-to-derive extra task is predicting the language of a given input document: apart from predicting the human/computer label of a document, the model has to detect whether it is written in English or Spanish. This means that, for training, we combined the two datasets supplied for Subtask 1; however, we did not use any of the data provided for Subtask 2. The MTL architecture is very similar to the one presented in the previous section, only adding an extra classification head, and is shown in Figure 2(a). Since both tasks involve binary classification, we compute a binary cross-entropy loss for each of them, namely \mathcal{L}_{\mathrm{bot}} for the human/computer classification task and \mathcal{L}_{\mathrm{lang}} for the language detection task. The final loss of the model is a combination of these two losses, as given by the following formula:

\mathcal{L} = \alpha \mathcal{L}_{\mathrm{bot}} + (1 - \alpha) \mathcal{L}_{\mathrm{lang}}    (1)

where the hyperparameter \alpha controls how much attention is paid to each task.
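
A possible implementation of this two-head setup and of the loss in Eq. (1) is sketched below; the shared-layer structure mirrors the single-task classifier above and is an assumption, not a verbatim copy of our code.

import torch
import torch.nn as nn
from transformers import AutoModel

class MTLClassifier(nn.Module):
    def __init__(self, model_name="xlm-roberta-base", hidden_size=64, dropout=0.2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.shared = nn.Sequential(
            nn.Linear(self.encoder.config.hidden_size, hidden_size),
            nn.ReLU(), nn.Dropout(dropout))
        self.bot_head = nn.Linear(hidden_size, 1)    # human vs. computer
        self.lang_head = nn.Linear(hidden_size, 1)   # English vs. Spanish

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state[:, 0]
        h = self.shared(h)
        return self.bot_head(h).squeeze(-1), self.lang_head(h).squeeze(-1)

def mtl_loss(bot_logits, lang_logits, bot_labels, lang_labels, alpha=0.5):
    # Eq. (1): weighted sum of the two binary cross-entropy losses.
    bce = nn.BCEWithLogitsLoss()
    return alpha * bce(bot_logits, bot_labels) + (1 - alpha) * bce(lang_logits, lang_labels)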

3.2.3 Virtual Adversarial Training

VAT [27] is another regularization technique for deep learning models. It aims to help models generalize better by perturbing the inputs in the direction that maximizes the loss function. For our models, the inputs refer to the embeddings of the raw documents, not to the token IDs. This method requires performing the forward and backward passes multiple times per batch in order to compute the necessary gradients. The VAT-specific loss is then added to the regular loss function, the final loss being the sum of the two. We added VAT to our models using the VAT-pytorch Python library [28], which implements the distributional smoothing technique described by Miyato et al. [6].
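
For reference, a conceptual sketch of the VAT loss on the input embeddings is given below; the forward_from_embeddings hook is hypothetical, and in practice we relied on the VAT-pytorch library rather than this hand-rolled version.

import torch
import torch.nn.functional as F

def vat_loss(model, embeddings, attention_mask, xi=10.0, eps=1.0):
    # model.forward_from_embeddings is a hypothetical hook that returns
    # P(computer-generated) directly from input embeddings.
    with torch.no_grad():
        p = model.forward_from_embeddings(embeddings, attention_mask)

    # Find the perturbation direction that most changes the prediction
    # (one power-iteration step), then apply it with radius eps.
    d = torch.randn_like(embeddings)
    d = (xi * F.normalize(d, dim=-1)).requires_grad_()
    p_hat = model.forward_from_embeddings(embeddings + d, attention_mask)
    divergence = F.binary_cross_entropy(p_hat, p)
    d = torch.autograd.grad(divergence, d)[0].detach()

    r_adv = eps * F.normalize(d, dim=-1)
    p_hat = model.forward_from_embeddings(embeddings + r_adv, attention_mask)
    return F.binary_cross_entropy(p_hat, p)   # added to the supervised loss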

3.3 Ensemble Learning

Ensemble techniques combine multiple different models to make better predictions. Intuitively, they reduce the impact of each model's weaknesses: if one model performs poorly on a certain edge case, the other models will likely give better results and compensate for the incorrect prediction. While there are multiple ways of combining models into ensembles (such as majority voting or bagging) [29], we decided to use the stacking technique, inspired by the work of Gaman [20]. Thus, we train an extra meta-learner model, which learns to make predictions based on the outputs of each model in the ensemble. We experimented with an XGBoost classifier that takes as input the probabilities produced by each model, as well as the binary predictions derived from them. The submitted final ensembles can be seen in Figure 2(b).
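
A sketch of this stacking setup is shown below; the shape conventions and the helper name stack_features are assumptions, while the XGBoost hyperparameters are those found by the grid search described in Section 4.2.

import numpy as np
from xgboost import XGBClassifier

def stack_features(base_probs, thresholds):
    # base_probs: (n_samples, n_models) probabilities from the base Transformers;
    # thresholds: (n_models,) per-model decision thresholds.
    binary_preds = (base_probs > thresholds).astype(float)
    return np.hstack([base_probs, binary_preds])

meta_learner = XGBClassifier(n_estimators=3, max_depth=5, learning_rate=1e-3)
# meta_learner.fit(stack_features(train_probs, thresholds), train_labels)
# ensemble_predictions = meta_learner.predict(stack_features(test_probs, thresholds))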

Figure 2: The architectures used to create the submissions. (a) The MTL architecture used for run1. (b) Summary of the ensemble architectures used for runs 2 and 3. Run1 featured an MTL model, while the other two runs used ensembles; the difference between run2 and run3 is that the latter omits the model pre-trained on tweets. The Multilayer Perceptron (MLP) architecture is described in Subsection 3.2.1.

4 Experiments

4.1 Dataset

The training dataset provided for Subtask 1 consists of approximately 33,845 English documents and 32,062 Spanish documents. For both languages, roughly 50% of the texts were computer-generated, so the dataset was fairly well balanced, and we did not attempt to use any techniques for dealing with imbalanced data. While the individual BERT classifier (MLP) was trained for the two languages separately, the multi-task learning experiments merged the two language subsets into a single dataset. In both cases, we used 70% of the labeled data to train the models and set aside the other 30% for validation. This split was fixed at the beginning of the project to allow us to compare the models' performance. However, when creating the final predictions, we retrained all models on the full labeled dataset, keeping only 2% of it (around 1,300 samples) as a validation set to monitor the training process. As seen in Figure 1, there is a relatively large number of documents with fewer than 200 characters. Since the task organizers described one of the domains in the dataset as social media, and Twitter usually limits user posts to approximately 300 characters, we assumed these samples to be tweets. For this reason, we experimented with a language model pre-trained specifically on tweets, namely TwHIN-BERT.

Figure 1: The length distribution of the documents in the training set, grouped by label and language.

We did not perform any data preprocessing for the deep learning models, using the raw documents as input instead.

4.2 Hyperparameters

For the deep learning models, we used a hidden layer of size 64 and a dropout rate of 0.2. For MTL, we assigned the same weight to the two loss values, i.e., \alpha = 0.5. We used the default values for VAT, namely \alpha = 1, \epsilon = 1, and \xi = 10. One of the parameters we did not fix was the threshold for turning probabilities into binary predictions. To choose it, we computed the true positive rate (TPR) and false positive rate (FPR) along the Receiver Operating Characteristic curve on the validation set. Since the dataset was balanced and we did not want to favor either precision or recall, we picked the threshold that brings the sum of TPR and FPR closest to 1. In our experiments, this threshold was often greater than 0.9, sometimes over 0.95. We used the AdamW optimizer [30] implemented in PyTorch [31] with a learning rate of 10^{-5} and 2-4 training epochs. To avoid overfitting, we used both a dropout layer and early stopping. We used batch sizes between 24 and 48, depending on the model size. For the kNN shallow model, we set the parameter k to 10. For the final ensembles, we performed a grid search over the hyperparameters of the XGBoost meta-learner, maximizing the macro F1-score on a small validation set for each of the two languages. We searched over 2 to 30 estimators, depths between 3 and 10, and learning rates between 10^{-5} and 10^{-1}. The best hyperparameters were n_estimators = 3, max_depth = 5, and learning_rate = 10^{-3}.
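
The threshold selection can be summarized by the short sketch below, which picks, on the validation ROC curve, the threshold whose TPR + FPR is closest to 1; the function name is illustrative.

import numpy as np
from sklearn.metrics import roc_curve

def pick_threshold(val_labels, val_probs):
    # Keep the threshold whose TPR + FPR is closest to 1 on the validation set.
    fpr, tpr, thresholds = roc_curve(val_labels, val_probs)
    return thresholds[np.argmin(np.abs(tpr + fpr - 1.0))]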

4.3 Results

The results obtained for Subtask 1 can be seen in Tables 1 and 2, alongside the baselines provided by the task organizers, the best results obtained by other participant teams, and our other experiments, which we did not submit.

Table 1: Macro-F1 (F1) scores obtained on the English dataset for Subtask 1. Our experiments are highlighted in bold. The validation set refers to the one we used, while the organizers provided the test set.
Model Rank Validation Set F1 Test Set F1
TALN-UPF Hybrid Plus 1 - 80.91
TALN-UPF Hybrid 2 - 74.16
Full ensemble (our run2) 18 - 66.63
Ensemble without TwHIN-BERT (our run3) 19 - 66.40
Logistic Regression (baseline) 23 - 65.78
MTL (xlm-roberta-base) (our run1) 25 93.30 65.53
Symanto Brain (Few-shot) (baseline) 37 - 59.44
DeBERTa V3 (baseline) 51 - 57.10
Random (baseline) 69 - 50.00
Symanto Brain (Zero-shot) (baseline) 73 - 43.47
MTL (…) - 92.70 -
MLP (bert-base-multilingual-cased) - 91.80 -
XGBoost + Readability + GloVe - 79.80 59.22
kNN + Readability + GloVe - 74.90 56.31
Table 2: Macro-F1 (F1) scores obtained on the Spanish dataset for Subtask 1. Our experiments are highlighted in bold. The validation set refers to the one we used, while the organizers provided the test set.
Model Rank Validation Set F1 Test Set F1
TALN-UPF Hybrid Plus 1 - 70.77
Linguistica_F-P_et_al 2 - 70.60
RoBERTa (BNE) (baseline) 3 - 68.52
Ensemble without TwHIN-BERT (our run3) 6 - 67.10
Full ensemble (our run2) 7 - 66.97
MTL (xlm-roberta-base) (our run1) 12 92.30 65.01
Logistic Regression (baseline) 25 - 62.40
Symanto Brain (Few-shot) (baseline) 39 - 56.05
Random (baseline) 46 - 50.00
Symanto Brain (Zero-shot) (baseline) 50 - 34.58
MTL (…) - 91.00 -
MLP (bert-base-multilingual-cased) - 90.90 -
XGBoost + Readability + FastText - 80.90 63.70
kNN + Readability + FastText - 72.20 59.59

As expected, the ensembles performed better than the MTL model (i.e., run1) on both datasets. However, the scores obtained on the test set are considerably lower than those obtained on our validation set. This could indicate that our validation set was too small or poorly chosen, or that the distribution of the test set differs from that of the training set. It would have been interesting to see whether methods such as cross-validation would have produced validation scores closer to the real performance.
Our experiments indicated that the choice of Transformer mattered as well, with multilingual models being generally better for this use case. Similarly, the fine-tuned embeddings produced during training were better than the pre-trained ones provided by spaCy, and hence we did not explore the shallow learning direction further. We did not perform a grid search or any other type of hyperparameter search for the deep learning models, so our approaches could obtain better results simply by choosing more appropriate hyperparameter values. Another thing to note is that while the full ensemble performed better on the English dataset, removing the TwHIN-BERT Transformer slightly improved the results on the Spanish dataset.
5 Conclusions

In this paper, we proposed multiple methods for addressing the task of detecting LLM-generated text. While we experimented briefly with more classical machine learning classifiers, we found that Transformer-based models performed better for this task. We described our experiments with several classification models powered by BERT and detailed some regularization techniques that slightly improved their performance. Finally, we stacked multiple such models to form ensembles, which led to even better performance, achieving our best macro F1-scores of 66.63% on the English dataset and 67.10% on the Spanish dataset of the AuTexTification shared task, Subtask 1. Regarding future work, we could improve the choice of hyperparameters, since they are often crucial for achieving good performance, and techniques such as grid search should find better values. Similarly, the training process could be improved by using better methods to avoid overfitting and by increasing the number of training epochs.

References

  • Radford et al. [2018] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., Improving language understanding by generative pre-training, OpenAI (2018).
  • OpenAI Team [2022] OpenAI Team, ChatGPT: Optimizing language models for dialogue, 2022.
  • Sarvazyan et al. [2023] A. M. Sarvazyan, J. Á. González, M. Franco Salvador, F. Rangel, B. Chulvi, P. Rosso, Overview of autextification at iberlef 2023: Detection and attribution of machine-generated text in multiple domains, in: Procesamiento del Lenguaje Natural, Jaén, Spain, 2023.
  • Jiménez-Zafra et al. [2023] S. M. Jiménez-Zafra, F. Rangel, M. Montes-y Gómez, Overview of IberLEF 2023: Natural Language Processing Challenges for Spanish and other Iberian Languages, Procesamiento del Lenguaje Natural 71 (2023).
  • Collobert and Weston [2008] R. Collobert, J. Weston, A unified architecture for natural language processing: Deep neural networks with multitask learning, in: Proceedings of the 25th international conference on Machine learning, 2008, pp. 160--167.
  • Miyato et al. [2015] T. Miyato, S.-i. Maeda, M. Koyama, K. Nakae, S. Ishii, Distributional smoothing with virtual adversarial training, arXiv preprint arXiv:1507.00677 (2015).
  • Paris et al. [2013] C. L. Paris, W. R. Swartout, W. C. Mann, Natural language generation in artificial intelligence and computational linguistics, volume 119, Springer Science & Business Media, 2013.
  • Tang et al. [2023] R. Tang, Y.-N. Chuang, X. Hu, The science of detecting llm-generated texts, arXiv preprint arXiv:2303.07205 (2023).
  • Hearst et al. [1998] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, B. Scholkopf, Support vector machines, IEEE Intelligent Systems and their applications 13 (1998) 18--28.
  • Stodden and Venugopal [2021] R. Stodden, G. Venugopal, Rs_gv at semeval-2021 task 1: Sense relative lexical complexity prediction, in: Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), 2021, pp. 640--649.
  • Flesch [1948] R. Flesch, A new readability yardstick., Journal of applied psychology 32 (1948) 221.
  • Gunning [1952] R. Gunning, Technique of clear writing, McGraw-Hill (1952).
  • Mc Laughlin [1969] G. H. Mc Laughlin, Smog grading-a new readability formula, Journal of reading 12 (1969) 639--646.
  • van Cranenburgh [2022] A. van Cranenburgh, Readability, https://github.com/andreasvc/readability, 2022. Accessed 15 Jun 2023.
  • Honnibal et al. [2020] M. Honnibal, I. Montani, S. Van Landeghem, A. Boyd, spaCy: Industrial-strength Natural Language Processing in Python (2020). doi:10.5281/zenodo.1212303.
  • Pennington et al. [2014] J. Pennington, R. Socher, C. D. Manning, Glove: Global vectors for word representation, in: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532--1543.
  • Mikolov et al. [2018] T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin, Advances in pre-training distributed word representations, in: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.
  • Pedregosa et al. [2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: Machine learning in python, Journal of Machine Learning Research 12 (2011) 2825--2830.
  • Chen and Guestrin [2016] T. Chen, C. Guestrin, Xgboost: A scalable tree boosting system, in: Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 2016, pp. 785--794.
  • Gaman [2023] M. Gaman, Using ensemble learning in language variety identification, in: Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), 2023, pp. 230--240.
  • Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).
  • Devlin et al. [2019] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171--4186.
  • Conneau et al. [2020] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, É. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 8440--8451.
  • Zhang et al. [2022] X. Zhang, Y. Malkov, O. Florez, S. Park, B. McWilliams, J. Han, A. El-Kishky, Twhin-bert: A socially-enriched pre-trained language model for multilingual tweet representations, arXiv preprint arXiv:2209.07562 (2022).
  • Weiss et al. [2016] K. Weiss, T. M. Khoshgoftaar, D. Wang, A survey of transfer learning, Journal of Big data 3 (2016) 1--40.
  • Srivastava et al. [2014] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, The journal of machine learning research 15 (2014) 1929--1958.
  • Miyato et al. [2018] T. Miyato, S.-i. Maeda, M. Koyama, S. Ishii, Virtual adversarial training: a regularization method for supervised and semi-supervised learning, IEEE transactions on pattern analysis and machine intelligence 41 (2018) 1979--1993.
  • Yokoo [2018] S. Yokoo, Vat-pytorch, https://github.com/lyakaap/VAT-pytorch, 2018. Accessed 15 Jun 2023.
  • Galar et al. [2011] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, F. Herrera, A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42 (2011) 463--484.
  • Kingma and Ba [2015] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
  • Paszke et al. [2017] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, A. Lerer, Automatic differentiation in pytorch, in: NIPS 2017 Workshop on Autodiff, 2017. URL: https://openreview.net/forum?id=BJJsrmfCZ.