
Mono vs Multilingual BERT for Hate Speech Detection and Text Classification: A Case Study in Marathi
Supported by L3Cube Pune

Abhishek Velankar Pune Institute of Computer Technology
L3Cube Pune
velankarabhishek@gmail.com
   Hrushikesh Patil Pune Institute of Computer Technology
L3Cube Pune
hrushi2900@gmail.com
   Raviraj Joshi Indian Institute of Technology Madras
L3Cube Pune
ravirajoshi@gmail.com
Abstract

Transformers are among the most prominent architectures used for a vast range of Natural Language Processing tasks. These models are pre-trained on a large text corpus and are intended to deliver state-of-the-art results on tasks like text classification. In this work, we conduct a comparative study between monolingual and multilingual BERT models. We focus on the Marathi language and evaluate the models on datasets for hate speech detection, sentiment analysis, and simple text classification in Marathi. We use standard multilingual models such as mBERT, indicBERT, and xlm-RoBERTa and compare them with MahaBERT, MahaAlBERT, and MahaRoBERTa, the monolingual models for Marathi. We further show that Marathi monolingual models outperform the multilingual BERT variants in five different downstream fine-tuning experiments. We also evaluate sentence embeddings from these models by freezing the BERT encoder layers. We show that monolingual MahaBERT-based models provide richer representations than sentence embeddings from their multilingual counterparts. However, we observe that these embeddings are not generic enough and do not work well on out-of-domain social media datasets. We consider two Marathi hate speech datasets (L3Cube-MahaHate and HASOC-2021), a Marathi sentiment classification dataset (L3Cube-MahaSent), and the Marathi Headlines and Articles classification datasets.

Index Terms:
Natural Language Processing, Text Classification, Hate Speech Detection, Sentiment Analysis, BERT, Marathi BERT

I Introduction

Language models like BERT, built on the transformer architecture, have gained considerable popularity due to their promising results on an extensive range of natural language processing tasks. These large models use the attention mechanism of transformers to capture deeper contextual information about language. They can be fine-tuned on domain-specific data to obtain state-of-the-art solutions.

More recently, there has been a significant amount of research on monolingual and multilingual language models, specifically the BERT variants. Because they are trained on text corpora spanning many languages, multilingual models offer notable benefits across multiple applications, especially for low-resource languages [1, 2, 3]. However, monolingual models, when used in the corresponding language, outperform the multilingual versions in tasks like text classification. Both categories of models are used in several problems like next sentence prediction, named entity recognition, sentiment analysis, etc. Recently, a substantial amount of work has applied these models to individual native languages. [4] propose monolingual BERT models for the Arabic language and show that these models achieve state-of-the-art performance. Additionally, [5], [6], [7] show that single-language models, when used for the corresponding language tasks, perform more efficiently than the multilingual variants. [8] analyze the effectiveness of multilingual models against their monolingual counterparts for 6 different languages including English and German. Our work focuses on hate speech detection, sentiment analysis, and simple text classification in Marathi [9, 10, 11, 12]. We evaluate monolingual and multilingual BERT models on Marathi corpora to compare their performance. A similar analysis for Hindi and Marathi named entity recognition has been performed in [13].

Marathi is a regional language in India. It is spoken predominantly by the people of Maharashtra [14]. Additionally, after Hindi and Bengali, it is considered the third most popular language in India [15, 16]. However, the Marathi language is greatly overlooked in terms of language resources, which suggests the need to widen research in this area.

In this work, we perform a comparative analysis of monolingual and multilingual BERT models for Marathi. We fine-tune these models on Marathi corpora comprising hate speech detection and simple text classification datasets. We consider standard multilingual models, i.e., mBERT, indicBERT, and xlm-RoBERTa, and compare them with their Marathi monolingual counterparts, i.e., MahaBERT, MahaAlBERT, and MahaRoBERTa. We further show that the monolingual models, when used in Marathi, outperform the multilingual equivalents. Moreover, we evaluate sentence representations from these models and show that the monolingual models provide superior sentence representations. The advantage of using monolingual models is more visible when the extracted sentence embeddings are used for classification. This research aims to help the community by providing insight into the appropriate use of single-language and multilingual models when applied to single-language tasks.

II Related work

BERT is currently one of the most effective language models in terms of performance on NLP tasks like text classification. Previous research has shown how BERT captures language context in an efficient way [17], [18], [19].

Recently, a lot of work can be seen in single- and multi-language NLP applications. Several efforts have been made to build monolingual variants of BERT, which have been shown to be effective on a number of single-language downstream tasks. In [20], the authors publish a German monolingual BERT model based on RoBERTa. Experiments were performed on tasks like named entity recognition (NER) and text classification to evaluate the model's performance. They further report that, with only a little hyperparameter tuning, the model outperformed all other tested German and multilingual BERT models. A monolingual RoBERTa language model trained on Czech data is presented in [21]. The authors show that the model significantly outperforms equally-sized multilingual and Czech language-oriented model variants. Other works on language-specific BERT models include models built for Vietnamese, Hindi, Bengali, etc. [22], [23]. In [24], the authors evaluate models on toxicity detection in Spanish comments. They show that transformers obtain better results than statistical models. Furthermore, they conclude that monolingual BERT models provide better results in their pre-training language than multilingual models.

III Datasets

  • HASOC’21 Marathi dataset [25]:
    A binary Marathi dataset provided in the HASOC'21 shared task, divided into hateful and non-hateful categories. It consists of a total of 1874 training and 625 testing samples.

  • L3Cube-MahaHate [12]:
    A Marathi hate speech detection dataset consisting of 25000 tweet samples divided into 4 major classes, namely hate, offensive, profane, and not. The dataset consists of 21500 train, 2000 test, and 1500 validation examples.

  • Articles:
    A text classification dataset containing Marathi news articles classified into sports, entertainment, and lifestyle with 3823 train, 479 test, and 477 validation samples.

  • Headlines:
    A Marathi news headlines dataset with three classes, viz. entertainment, sports, and state. It consists of 9672 train, 1210 test, and 1210 validation samples.

  • L3Cube-MahaSent [10]:
    A Marathi sentiment analysis dataset consisting of tweets classified as positive, negative, and neutral. It has 12114 train, 2250 test, and 1500 validation examples.

IV Experiments

IV-A Transformer models

BERT is a deep transformer model, pre-trained on a large text corpus in a self-supervised manner, that can be adapted quickly to a broad range of downstream tasks. Many variants of BERT are openly available; popular ones include ALBERT and RoBERTa. In this work, we focus on both multilingual and monolingual models for text classification and hate speech detection tasks. The following standard multilingual BERT models, which include Marathi among their training languages, are used:

  • Multilingual-BERT (mBERT) (https://huggingface.co/bert-base-multilingual-cased): A BERT-base model [26] pre-trained on Wikipedia text in 104 languages using the masked language modeling (MLM) and next sentence prediction (NSP) objectives.

  • IndicBERT (https://huggingface.co/ai4bharat/indic-bert): A multilingual ALBERT model released by AI4Bharat [27], trained on large-scale corpora. The training languages include 12 major Indian languages. The model has been shown to work well for tasks in Indic languages.

  • XLM-RoBERTa (https://huggingface.co/xlm-roberta-base): A multilingual version of the RoBERTa model [28]. It is pre-trained on 2.5TB of filtered CommonCrawl data covering 100 languages with the masked language modeling (MLM) objective and can be used for downstream tasks.

To compare with the above models, the following Marathi monolingual models are used [14] (a minimal loading example follows the list):

  • MahaBERT (https://huggingface.co/l3cube-pune/marathi-bert): A multilingual BERT model fine-tuned on L3Cube-MahaCorpus and other publicly available Marathi monolingual datasets containing a total of 752M tokens.

  • MahaAlBERT (https://huggingface.co/l3cube-pune/marathi-albert): A Marathi monolingual model extended from ALBERT, trained on L3Cube-MahaCorpus and other publicly available Marathi monolingual datasets.

  • MahaRoBERTa (https://huggingface.co/l3cube-pune/marathi-roberta): A Marathi RoBERTa model built upon a multilingual RoBERTa (xlm-roberta-base) model, fine-tuned on L3Cube-MahaCorpus and other publicly available Marathi monolingual datasets.
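For reference, all of the checkpoints listed above are hosted on the Hugging Face model hub and can be loaded with the transformers library. The snippet below is a minimal illustrative sketch, not the exact code used in our experiments; the model identifiers are taken from the links above, and the choice of num_labels=2 assumes the binary HASOC task.

```python
# Minimal sketch (not the exact experimental code): load a multilingual
# and a monolingual checkpoint with a fresh classification head.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MULTILINGUAL = "xlm-roberta-base"             # standard multilingual baseline
MONOLINGUAL = "l3cube-pune/marathi-roberta"   # MahaRoBERTa

def load_classifier(model_name: str, num_labels: int = 2):
    """Return a (tokenizer, model) pair; num_labels=2 assumes the binary HASOC task."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )
    return tokenizer, model

tokenizer, model = load_classifier(MONOLINGUAL)
inputs = tokenizer("उदाहरण वाक्य", return_tensors="pt", truncation=True)
logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2])
```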

IV-B Evaluation results

TABLE I: Classification accuracies for monolingual and multilingual models

Model | Training | HASOC | L3Cube-MahaHate | L3Cube-MahaSent | Articles | Headlines

Multilingual BERT Variants
mBERT | Freeze | 0.770 | 0.516 | 0.653 | 0.901 | 0.907
mBERT | Non-Freeze | 0.875 | 0.783 | 0.786 | 0.976 | 0.947
IndicBERT | Freeze | 0.710 | 0.436 | 0.656 | 0.828 | 0.877
IndicBERT | Non-Freeze | 0.870 | 0.711 | 0.833 | 0.987 | 0.937
xlm-RoBERTa | Freeze | 0.755 | 0.487 | 0.666 | 0.910 | 0.790
xlm-RoBERTa | Non-Freeze | 0.862 | 0.787 | 0.820 | 0.985 | 0.925

Monolingual BERT Variants
MahaBERT | Freeze | 0.824 | 0.580 | 0.666 | 0.939 | 0.907
MahaBERT | Non-Freeze | 0.883 | 0.802 | 0.828 | 0.987 | 0.944
MahaAlBERT | Freeze | 0.800 | 0.587 | 0.717 | 0.991 | 0.927
MahaAlBERT | Non-Freeze | 0.866 | 0.764 | 0.841 | 0.991 | 0.945
MahaRoBERTa | Freeze | 0.782 | 0.531 | 0.698 | 0.904 | 0.864
MahaRoBERTa | Non-Freeze | 0.890 | 0.803 | 0.834 | 0.985 | 0.942
Figure 1: BERT architectures with freeze and non-freeze training mode

The BERT transformer models have been evaluated on hate speech detection and text classification datasets. We used standard multilingual BERT variants, namely indicBERT, mBERT, and xlm-RoBERTa, to obtain baseline classification results. Additionally, monolingual Marathi models have been used for comparison. These single-language models, MahaBERT, MahaAlBERT, and MahaRoBERTa, are based on the BERT-base, ALBERT, and RoBERTa-base architectures, respectively.

The experiments were performed under two schemes. First, we obtained results by fine-tuning all the BERT layers, i.e., the pre-trained layers as well as the classification layer. Second, we froze the pre-trained embedding and encoder layers and trained only the classifier. With this setup, we aim to evaluate the sentence embeddings generated by these monolingual and multilingual models. All the classification results are displayed in Table I.
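As an illustration of the frozen scheme, the sketch below freezes the pre-trained embedding and encoder layers via requires_grad and leaves only the classification head trainable. This is an assumed implementation of the setup described above, not our exact training code; the model identifier and label count follow the L3Cube-MahaHate setting.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Assumed sketch of the "Freeze" scheme: keep the pre-trained embedding and
# encoder weights fixed and train only the classification head on top.
model = AutoModelForSequenceClassification.from_pretrained(
    "l3cube-pune/marathi-bert", num_labels=4  # 4 classes in L3Cube-MahaHate
)

for param in model.base_model.parameters():   # embeddings + encoder layers
    param.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)

# In the Non-Freeze scheme, all parameters are fine-tuned instead:
# optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
```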

For all the monolingual and multilingual models, the frozen setting, i.e., freezing the BERT embedding and encoder layers, underperforms its non-freeze counterpart. The difference in accuracy is particularly large for L3Cube-MahaSent and L3Cube-MahaHate. This indicates that the pre-trained models do not provide generic discriminative sentence embeddings for the classification task. However, the monolingual models do provide better sentence embeddings than their multilingual counterparts, which shows the importance of monolingual pre-training for obtaining rich sentence embeddings. Since the pre-training data mostly comprises Marathi news articles, the frozen setting works comparatively well on the Articles and Headlines datasets. In general, the monolingual models outperform the multilingual models on all the datasets. For the hate speech detection datasets in particular, the MahaRoBERTa model works best. For the other text classification datasets, the MahaAlBERT model gives the best accuracy.
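The frozen setting essentially probes the quality of the sentence embeddings each pre-trained model produces out of the box. A hedged sketch of extracting such embeddings is shown below; the [CLS]-token representation is used here as the sentence embedding, which is a common choice but an assumption on our part rather than the exact pooling used in these experiments.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed sketch: extract fixed sentence embeddings from a frozen encoder.
# The [CLS]-token pooling below is an illustrative choice, not necessarily
# the pooling used in the reported experiments.
model_name = "l3cube-pune/marathi-bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)
encoder.eval()

sentences = ["पहिले वाक्य", "दुसरे वाक्य"]  # placeholder Marathi sentences
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state   # (batch, seq_len, hidden)
    embeddings = hidden[:, 0, :]                  # [CLS] vector per sentence

print(embeddings.shape)  # e.g. torch.Size([2, 768]) for a BERT-base encoder
```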

V Conclusion

In this paper, we have presented a comparison between monolingual and multilingual transformer-based models, particularly the variants of BERT. We have evaluated these models on hate speech detection and text classification datasets. We have used standard multilingual models, namely mBERT, indicBERT, and xlm-RoBERTa, for evaluation. On the other hand, we have used Marathi monolingual models trained exclusively on a large Marathi corpus, i.e., MahaBERT, MahaAlBERT, and MahaRoBERTa, for comparison. The MahaAlBERT model performs best for simple text classification, whereas MahaRoBERTa gives the best results for hate speech detection tasks. On all the datasets, the monolingual models outperform the standard multilingual models when focused on single-language tasks. The monolingual models also provide better sentence representations. However, these sentence representations do not generalize well across tasks, thus highlighting the need for better sentence embedding models.

Acknowledgment

This work was done under the L3Cube Pune mentorship program. We would like to express our gratitude towards our mentors at L3Cube for their continuous support and encouragement.

References

  • [1] T. Pires, E. Schlinger, and D. Garrette, “How multilingual is multilingual bert?” in Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrquez, Eds.   Association for Computational Linguistics, 2019, pp. 4996–5001. [Online]. Available: https://doi.org/10.18653/v1/p19-1493
  • [2] ——, “How multilingual is multilingual BERT?” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.   Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 4996–5001. [Online]. Available: https://aclanthology.org/P19-1493
  • [3] R. Joshi, R. Karnavat, K. Jirapure, and R. Joshi, “Evaluation of deep learning models for hostility detection in hindi text,” in 2021 6th International Conference for Convergence in Technology (I2CT).   IEEE, 2021, pp. 1–5.
  • [4] A. Ghaddar, Y. Wu, A. Rashid, K. Bibi, M. Rezagholizadeh, C. Xing, Y. Wang, D. Xinyu, Z. Wang, B. Huai, X. Jiang, Q. Liu, and P. Langlais, “JABER: junior arabic bert,” CoRR, vol. abs/2112.04329, 2021. [Online]. Available: https://arxiv.org/abs/2112.04329
  • [5] H. Le, L. Vial, J. Frej, V. Segonne, M. Coavoux, B. Lecouteux, A. Allauzen, B. Crabbé, L. Besacier, and D. Schwab, “FlauBERT: Unsupervised language model pre-training for French,” in Proceedings of the 12th Language Resources and Evaluation Conference.   Marseille, France: European Language Resources Association, May 2020, pp. 2479–2490. [Online]. Available: https://aclanthology.org/2020.lrec-1.302
  • [6] H. Q. To, K. V. Nguyen, N. L. Nguyen, and A. G. Nguyen, “Monolingual versus multilingual bertology for vietnamese extractive multi-document summarization,” CoRR, vol. abs/2108.13741, 2021. [Online]. Available: https://arxiv.org/abs/2108.13741
  • [7] M. Ulcar and M. Robnik-Sikonja, “Training dataset and dictionary sizes matter in BERT models: the case of baltic languages,” CoRR, vol. abs/2112.10553, 2021. [Online]. Available: https://arxiv.org/abs/2112.10553
  • [8] S. Rönnqvist, J. Kanerva, T. Salakoski, and F. Ginter, “Is multilingual BERT fluent in language generation?” CoRR, vol. abs/1910.03806, 2019. [Online]. Available: http://arxiv.org/abs/1910.03806
  • [9] A. Velankar, H. Patil, A. Gore, S. Salunke, and R. Joshi, “Hate and offensive speech detection in hindi and marathi,” CoRR, vol. abs/2110.12200, 2021. [Online]. Available: https://arxiv.org/abs/2110.12200
  • [10] A. Kulkarni, M. Mandhane, M. Likhitkar, G. Kshirsagar, and R. Joshi, “L3cubemahasent: A marathi tweet-based sentiment analysis dataset,” in Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, WASSA@EACL 2021, Online, April 19, 2021, O. D. Clercq, A. Balahur, J. Sedoc, V. Barrière, S. Tafreshi, S. Buechel, and V. Hoste, Eds.   Association for Computational Linguistics, 2021, pp. 213–220. [Online]. Available: https://aclanthology.org/2021.wassa-1.23/
  • [11] A. Kulkarni, M. Mandhane, M. Likhitkar, G. Kshirsagar, J. Jagdale, and R. Joshi, “Experimental evaluation of deep learning models for marathi text classification,” in Proceedings of the 2nd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications.   Springer, 2022, pp. 605–613.
  • [12] A. Velankar, H. Patil, A. Gore, S. Salunke, and R. Joshi, “L3cube-mahahate: A tweet-based marathi hate speech detection dataset and BERT models,” CoRR, vol. abs/2203.13778, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2203.13778
  • [13] O. Litake, M. Sabane, P. Patil, A. Ranade, and R. Joshi, “Mono vs multilingual BERT: A case study in hindi and marathi named entity recognition,” CoRR, vol. abs/2203.12907, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2203.12907
  • [14] R. Joshi, “L3cube-mahacorpus and mahabert: Marathi monolingual corpus, marathi BERT language models, and resources,” CoRR, vol. abs/2202.01159, 2022. [Online]. Available: https://arxiv.org/abs/2202.01159
  • [15] R. Joshi, P. Goel, and R. Joshi, “Deep learning for hindi text classification: A comparison,” in Intelligent Human Computer Interaction - 11th International Conference, IHCI 2019, Allahabad, India, December 12-14, 2019, Proceedings, ser. Lecture Notes in Computer Science, U. S. Tiwary and S. Chaudhury, Eds., vol. 11886.   Springer, 2019, pp. 94–101. [Online]. Available: https://doi.org/10.1007/978-3-030-44689-5_9
  • [16] M. S. Islam, F. E. M. Jubayer, and S. I. Ahmed, “A comparative study on different types of approaches to bengali document categorization,” CoRR, vol. abs/1701.08694, 2017. [Online]. Available: http://arxiv.org/abs/1701.08694
  • [17] G. Jawahar, B. Sagot, and D. Seddah, “What does BERT learn about the structure of language?” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.   Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 3651–3657. [Online]. Available: https://aclanthology.org/P19-1356
  • [18] I. Tenney, D. Das, and E. Pavlick, “BERT rediscovers the classical NLP pipeline,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.   Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 4593–4601. [Online]. Available: https://aclanthology.org/P19-1452
  • [19] W. de Vries, A. van Cranenburgh, and M. Nissim, “What’s so special about BERT’s layers? a closer look at the NLP pipeline in monolingual and multilingual models,” in Findings of the Association for Computational Linguistics: EMNLP 2020.   Online: Association for Computational Linguistics, Nov. 2020, pp. 4339–4350. [Online]. Available: https://aclanthology.org/2020.findings-emnlp.389
  • [20] R. Scheible, F. Thomczyk, P. Tippmann, V. Jaravine, and M. Boeker, “Gottbert: a pure german language model,” CoRR, vol. abs/2012.02110, 2020. [Online]. Available: https://arxiv.org/abs/2012.02110
  • [21] M. Straka, J. Náplava, J. Straková, and D. Samuel, “Robeczech: Czech roberta, a monolingual contextualized language representation model,” CoRR, vol. abs/2105.11314, 2021. [Online]. Available: https://arxiv.org/abs/2105.11314
  • [22] D. Q. Nguyen and A. T. Nguyen, “Phobert: Pre-trained language models for vietnamese,” CoRR, vol. abs/2003.00744, 2020. [Online]. Available: https://arxiv.org/abs/2003.00744
  • [23] K. Jain, A. Deshpande, K. Shridhar, F. Laumann, and A. Dash, “Indic-transformers: An analysis of transformer language models for indian languages,” CoRR, vol. abs/2011.02323, 2020. [Online]. Available: https://arxiv.org/abs/2011.02323
  • [24] A. F. M. de Paula and I. B. Schlicht, “AI-UPV at iberlef-2021 DETOXIS task: Toxicity detection in immigration-related web news comments using transformers and statistical models,” CoRR, vol. abs/2111.04530, 2021. [Online]. Available: https://arxiv.org/abs/2111.04530
  • [25] S. Modha, T. Mandl, G. K. Shahi, H. Madhu, S. Satapara, T. Ranasinghe, and M. Zampieri, “Overview of the hasoc subtrack at fire 2021: Hate speech and offensive content identification in english and indo-aryan languages and conversational hate speech,” in Forum for Information Retrieval Evaluation, 2021, pp. 1–3.
  • [26] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” CoRR, vol. abs/1810.04805, 2018. [Online]. Available: http://arxiv.org/abs/1810.04805
  • [27] D. Kakwani, A. Kunchukuttan, S. Golla, G. N.C., A. Bhattacharyya, M. M. Khapra, and P. Kumar, “IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages,” in Findings of EMNLP, 2020.
  • [28] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, “Unsupervised cross-lingual representation learning at scale,” CoRR, vol. abs/1911.02116, 2019. [Online]. Available: http://arxiv.org/abs/1911.02116