Large Arabic Twitter Dataset on COVID-19

Sarah Alqurashi, Ahmad Alhindi Corresponding author: s43980127@st.uqu.edu.sa Eisa Alanazi
Center of Innovation and Development in Artificial Intelligence
Umm Al-Qura University
Makkah, Saudi Arabia

Abstract

The 2019 coronavirus disease (COVID-19), emerged late December 2019 in China, is now rapidly spreading across the globe. At the time of writing this paper, the number of global confirmed cases has passed two millions and half with over 180,000 fatalities. Many countries have enforced strict social distancing policies to contain the spread of the virus. This have changed the daily life of tens of millions of people, and urged people to turn their discussions online, e.g., via online social media sites like Twitter. In this work, we describe the first Arabic tweets dataset on COVID-19 that we have been collecting since January 1st, 2020. The dataset would help researchers and policy makers in studying different societal issues related to the pandemic. Many other tasks related to behavioral change, information sharing, misinformation and rumors spreading can also be analyzed.

Introduction

On December 31, 2019, Chinese public health authorities reported several cases of a respiratory syndrome caused by an unknown disease, which subsequently became known as COVID-19 in the city of Wuhan, China. This highly contagious disease continued to spread worldwide, leading the World Health Organization (WHO) to declare a global health emergency on January 30, 2020. On March 11, 2020 the disease has been identified as pandemic by WHO, and many countries around the world including Saudi Arabia, United States, United Kingdom, Italy, Canada, and Germany have continued reporting more cases of the disease (?). As the time of writing this paper, this pandemic is affecting more than 208 countries around the globe with more than one million and half confirmed cases (?).

Since the outbreak of COVID-19, many governments around the world enforced different measures to contain the spread of the virus. The measures include travel restrictions, curfews, ban of mass gatherings, social distancing, and probably cities lock-down. This has impacted the routine of people around the globe, and many of them have turned to social media platforms for both news and communication. Since the emergence of COVID-19, Twitter platform plays a significant role in crisis communications where millions of tweets related to the virus are posted daily. Arabic is the official language of more than 22 countries with nearly 300 million native speakers worldwide. Furthermore, there is a large daily Arabic content in Twitter as millions of Arabic users use the social media network to communicate. For instance, Saudi Arabia alone has nearly 15 million Twitter users as of January, 2020 (?). Hence, it is important to analyze the Arabic users’ behavior and sentiment during this pandemic. Other Twitter COVID-19 datasets have been recently proposed (?; ?) but with no significant content for the Arabic language.

In this work, we provide the first dataset dedicated to Arabic tweets related to COVID-19. The dataset is available at https://github.com/SarahAlqurashi/COVID-19-Arabic-Tweets-Dataset. We have been collecting data in real-time from Twitter API since January 1, 2020, by tracking COVID-19 related keywords which resulted in more than 3,934,610 Arabic tweets so far. The presented dataset is believed to be helpful for both researchers and policy makers in studying the pandemic from social perspective, as well as analyzing the human behaviour and information spreading during pandemics.

In what follows, we describe the dataset and the collection methods, present the initial data statistics, and provide information about how to use the dataset.

Dataset Description

We collected COVID-19 related Arabic tweets from January 1, 2020 until April 15, 2020, using Twitter streaming API and the Tweepy Python library. We have collected more than 3,934,610 million tweets so far. In our dataset, we store the full tweet object including the id of the tweet, username, hashtags, and geolocation of the tweet.

We created a list of the most common Arabic keywords associated with COVID-19. Using Twitter’s streaming API, we searched for any tweet containing the keyword(s) in the text of the tweet. Table 1 shows the list of keywords used along with the starting date of tracking each keyword. Furthermore, Table 2 shows the list of hashtags we have been tracking along with the number of tweets collected from each hashtag. Indeed, some tweets were irrelevant, and we kept only those that were relevant to the pandemic.

Keyword	English Translation	Tracing Date
\setcodeutf8 \<الفايروس التاجي¿	Coronavirus	2020-01-01
\<كورونا¿	Corona	2020-01-01
\<كوفيد 19¿	COVID 19	2020-03-01
\<ووهان¿	Wuhan	2020-01-01
\<الصين¿	China	2020-01-01
\<وباء¿	Epidemic	2020-01-22
\<تفشي¿	Outbreak	2020-01-01
\<جائحة¿	Pandemic	2020-01-22
\<كمامة ¿	Mask	2020-01-01
\<كمامات¿	Masks	2020-01-01
\<معقمات¿	Sterilizers	2020-01-01
\<تعقيم¿	Sterilization	2020-01-01
\<غسل اليدين¿	Washing hands	2020-01-01
\<العزل المنزلي¿	Home isolation	2020-02-01
\<الحجر المنزلي¿	Home quarantine	2020-02-01
\<حظر التجول¿	Curfew	2020-03-15
\<التباعد الاجتماعي¿	social distancing	2020-04-01
\< جهاز التنفس الصناعي¿	ventilator	2020-04-01
\< ضيق تنفس¿	Shortness of breath	2020-04-01
\< كحة¿	cough	2020-04-01
\< حراره¿	temperature	2020-04-01
\< متر ونص¿	One and a half meters	2020-04-01
\< فعاليات الحجر¿	quarantine activities	2020-04-01
\< الحجر الصحي¿	Quarantine	2020-04-01

Table 1: The list of keywords that we used to collect the tweets.

Hashtags English Translation Number of Tweets Tracing Date \setcodeutf8 \<الصين¿ China 287509 2020-01-01 \<الصحة¿ Health 53186 2020-01-01 \<كورونا¿ Corona 949531 2020-01-11 \<فيروس - كورونا¿ Coronavirus 188214 2020-01-11 \<الكورونا¿ Corona 52267 2020-01-17 \<كورونا - الجديد¿ New Corona 79302 2020-01-22 \<فايروس - كورونا¿ Corona virus 17723 2020-01-23 \<كوفيد - 19¿ COVID 19 54865 2020-02-01 \<فايروس - كورونا¿ Corona virus 17723 2020-02-01 \<فيروس - كورونا - المستجد¿ the emerging Corona virus 20377 2020-02-22 \<كلنا - مسؤول¿ We are all responsible 16768 2020-03-01 \<ابقوا - في - منازلكم¿ Stay at your homes 24580 2020-03-01 \<فعاليات - الحجر ¿ Quarantine activities 4396 2020-03-01 \<الحجر - المنزلي¿ Home quarantine 44444 2020-03-08 \<حظر - تجول¿ curfew 39778 2020-03-10 \<خلك - في - البيت¿ Stay at home 768196 2020-03-22

Table 2: Hashtags used in collecting this dataset.

A summary over the dataset is given in Table 3. While collecting data, we have observed that the number of retweets increased significantly in late March. This is likely due to the exponential increase in confirmed COVID-19 cases worldwide, including the Arabic speaking countries. A relatively small percentage of tweets were geotagged. Figure 1 presents the location of tweets observed as of 14 April 2020.

Number of Tweets :	3,934,610
Number of Tweets with geolocation :	219
Number of original tweets:	3,934,235
Number of retweets :	375
The average of daily collected tweets:	77471

Table 3: Summary statistics for the collected tweets

Dataset Access

The dataset is accessible on GitHub at this address: https://github.com/SarahAlqurashi/COVID-19-Arabic-Tweets-Dataset

However, to comply with Twitter’s content redistribution policy¹¹1https://developer.twitter.com/en/developer-terms/agreement-and-policy, we are distributing only the IDs of the collected tweets. There are several tools (such as Hydrator²²2https://github.com/DocNow/hydrator) that can be used to retrieve the full tweet object. We also plan to provide more details on the pre-processing phase in the GitHub page.

Refer to caption — Figure 1: The location of geotagged tweets

Future Work

We are continuously updating the dataset to maintain more aspects of COVID-19 Arabic conversations and discussions happening on Twitter. We also plan to study how different groups respond to the pandemic and analyze information sharing behavior among the users.

Acknowledgements

The authors wish to express their thanks to Batool Mohammed Hmawi for her help in data collection.

References

[Chen, Lerman, and Ferrara 2020] Chen, E.; Lerman, K.; and Ferrara, E. 2020. Covid-19: The first public coronavirus twitter dataset. arXiv preprint arXiv:2003.07372.
[Lopez, Vasu, and Gallemore 2020] Lopez, C. E.; Vasu, M.; and Gallemore, C. 2020. Understanding the perception of covid-19 policies by mining a multilanguage twitter dataset. arXiv preprint arXiv:2003.10359.
[Statista 2020] Statista. 2020. Leading countries based on number of twitter users as of january 2020. https://www.statista.com/statistics/242606/number-of-active-twitter-users-in-selected-countries/. (Accessed: 2020-04-06).
[World Health Organization and others 2020] World Health Organization, et al. 2020. Coronavirus disease 2019 (covid-19): situation report, 67.
[World Health Organization 2020] World Health Organization. 2020. Coronavirus disease (covid-19) pandemic. https://www.who.int/emergencies/diseases/novel-coronavirus-2019. (accessed: 2020-04-02).