¹ Dept of Computer Science & Engineering, Chittagong University of Engineering & Technology, Chittagong-4349, Bangladesh.
² McMaster University, Canada.
Email: ^{a,b,c}{zubairhossain773, asifiqbalsagor123, avijeetshil110}@gmail.com, ^d enamul.haque@uwaterloo.ca, ^e moshiul_240@cuet.ac.bd

* Correspondence: iqbal@cuet.ac.bd

An Efficient K-means Clustering Algorithm for Analysing COVID-19

Md. Zubair^{1,a}, Md. Asif Iqbal^{1,b}, Avijeet Shil^{1,c}, Enamul Haque^{2,d}, Mohammed Moshiul Hoque^{1,e}, Iqbal H. Sarker^{1,∗}

Abstract

COVID-19 hit the world like a storm, creating pandemic conditions in most countries. The whole world is trying to overcome this pandemic. Better health care quality may help a country tackle it, and clustering countries with similar health care quality gives insight into how that quality compares across countries. In machine learning and data science, the K-means clustering algorithm is typically used to create clusters based on similarity. In this paper, we propose an efficient K-means clustering method that determines the initial centroids of the clusters efficiently. Based on this method, we determine health care quality clusters of countries using COVID-19 datasets. Experimental results show that, compared with the traditional K-means clustering algorithm, our method reduces both the number of iterations and the execution time needed to analyze the COVID-19 data.

Keywords:

machine learning, K-means, principal component analysis, clustering, COVID-19, data analytics

1 Introduction

COVID-19 was first detected near the end of 2019 in the Chinese city of Wuhan [1]. The virus spread rapidly throughout the world within a short period of time, and the World Health Organization (WHO) declared COVID-19 a pandemic. To bring the pandemic under control, different countries have taken initiatives to reshape their health care quality. Clustering countries with similar health care quality helps us understand how a country's health care quality compares with that of the rest of the countries.

In machine learning and data science, both supervised and unsupervised learning are popular for solving various kinds of real-world problems [2] [3]. Clustering relies on unsupervised machine learning algorithms [4] [5]. An unsupervised algorithm typically identifies hidden structure in unlabelled data. A clustering algorithm divides the data points into groups according to their similarity with respect to that hidden structure [6]. K-means [5] is one of the most important and widely used unsupervised machine learning algorithms because it identifies such structure automatically.

The algorithm divides an unlabelled dataset into a given number of clusters by assigning each data point to the cluster whose centroid is nearest. The K-means problem is NP-hard [7], and efficiently estimating the initial centroids is difficult. The coordinates of the initial centroids must be fixed before the clusters can be found, and in the traditional K-means clustering algorithm they are usually selected randomly. The choice of initial centroid coordinates can therefore play a significant role in clustering, and it is the focus of this work. The key contributions of our work are as follows:

$\bullet$: Our proposed method efficiently selects the initial centroids of the k-means clustering algorithm, which yields a constant number of iterations across runs and a reduced execution time.
$\bullet$: We have clustered countries according to the similarity of their health care quality during the COVID-19 pandemic.
$\bullet$: We have tested our model on a real-world COVID-19 dataset to show its efficiency compared with existing models.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 presents our proposed methodology. Section 4 reports the experimental results and a comparative analysis. Section 5 highlights some key points and concludes the paper.

2 Related Work

Several approaches have been proposed to find the initial cluster centroids more efficiently; we review some of them in this section. M. S. Rahim et al. [8] provide a centroid selection method based on radial and angular coordinates. However, the number of iterations of their method is not constant across instances, and its execution time increases drastically as the number of clusters grows. A. Kumar et al. propose finding the initial centroids based on a dissimilarity tree [9]. Although this method improves k-means clustering, the execution time is not improved significantly, and the small datasets used in their experiments cannot provide insight into behaviour on a large dataset. In [10], M. S. Mahmud et al. propose an approach based on the weighted average, which finds the initial centroids by calculating the mean of each distance of the data points. They report execution times for only 3 clusters on three datasets, and the improvement in execution time is marginal. In [11], M. Goyal et al. find the centroids by dividing the sorted distances into k equal partitions; no execution time is reported for this method. S. Na et al. [12] propose using two simple data structures to store the cluster label of each data item and its distance in each iteration, so that the next iteration can reuse the data of the previous one. However, the execution time was trivial for the dataset the authors experimented with. M. A. Lakshmi et al. [13] propose a method that finds the initial centroids using the nearest-neighbor method. They compare their approach with random and kmeans++ initial selection using the SSE (sum of the squared differences), and their SSE is roughly similar to that of both. Moreover, they do not provide any comparison of execution time.

S. R. Vadyala et al. propose an algorithm that combines k-means with LSTM to predict the number of confirmed COVID-19 cases [14]. LSTM stands for long short-term memory, a recurrent neural network architecture used in deep learning. In [15], A. Poompaavai et al. use the k-means clustering algorithm to identify the areas of India affected by COVID-19. K-means clustering has been used in many other approaches to COVID-19 problems. In [16], S. K. Sonbhadra et al. propose a bottom-up approach for mining COVID-19 articles that uses k-means clustering along with DBSCAN and HAC.

3 Proposed Methodology

In this section, we present our k-means clustering-based COVID-19 analysis, which determines clusters of countries according to their health care quality. The result of k-means clustering depends on the selection or assignment of the initial centroids [17]. It is therefore necessary to select the centroids more systematically in order to improve both the performance and the execution time of the k-means clustering algorithm. In our approach, we use Principal Component Analysis (PCA) [18] while determining the centroids.

The flowchart in Figure 1 depicts the overall process of our proposed method. The method determines the coordinates of the initial centroids, and it aims to do so more precisely than the existing methods.

Figure 1: Flowchart of the proposed method

3.1 Input Dataset

Many machine learning algorithms, both supervised and unsupervised, have been applied to COVID-19 datasets [15]. The k-means algorithm requires every attribute of the dataset to contain numerical values [5]. If this is not the case, some pre-processing is required to handle missing and non-numeric values. Before proceeding to the next steps, we must therefore ensure that the values of every attribute are numeric.

To build our model, we used several datasets to select the features required for analyzing the health care quality of the countries. The selected datasets are owid-covid-data [19], covid-19-testing-policy [20], public-events-covid [20], covid-containment-and-health-index [20], and inform-covid-indicators [21]. We used the data up to 11th August 2020. Some of the attributes of owid-covid-data [19] are shown in Table 1.

Table 1: Sample data of COVID-19 dataset

Country	Date	total cases per million	new cases per million	total deaths per million	new deaths per million	cardiovasc death rate	hospital beds per thousand	life expectancy
Australia	11/08/20	839.102		12.275	0.706	107.791	3.84	83.44
Bangladesh	11/08/20	1581.808	17.651	20.876	0.237	298.003	0.8	72.59
China	11/08/20	61.769	0.079	3.258	0.003	261.899	4.34	76.91

The covid-19-testing-policy [20] dataset contains categorical values for the testing policy of each country, as shown in Table 2.

Table 2: Sample data of Covid-19-testing-policy

Entity	Code	Date	Testing Policy
Australia	AUS	Aug 11, 2020	3
Bangladesh	BGD	Aug 11, 2020	2
China	CHN	Aug 11, 2020	3

The other datasets also contain features of this kind, which are required for assessing the health care quality of a country. These real-world datasets help us to evaluate our proposed method in real-world scenarios.

3.2 Principal Component Analysis and Percentile

Principal component analysis (PCA) is a method that uses an orthogonal transformation to convert a set of observations of potentially correlated variables into a set of values of linearly uncorrelated variables called principal components. PCA is widely used in data analysis and in building predictive models that require dimensionality reduction [22] [18], and we use it for that purpose in our approach.
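To make the transformation concrete, the following is a minimal sketch of PCA computed through a singular value decomposition with NumPy. It is an illustration only, not the implementation used in our experiments; the function name `pca_project` and the synthetic data are assumptions introduced for the example.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)          # PCA operates on mean-centered data
    # Rows of Vt are orthonormal principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
# Synthetic data with two strongly correlated features (not the COVID-19 dataset).
base = rng.normal(size=(200, 1))
X = np.hstack([base,
               2 * base + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])

Z = pca_project(X, n_components=2)
cov = np.cov(Z, rowvar=False)
print(np.round(cov, 6))   # off-diagonal entries are ~0: the components are uncorrelated
```

The covariance matrix of the projected data is diagonal up to floating-point error, which is the "linearly uncorrelated" property described above, and the first component carries the largest variance.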

The percentile method [23] used in our model divides the ordered dataset into 100 parts, each containing 1 percent of the data. For example, the 25th percentile is the point below which 25 percent of the data lie. Using percentiles, we can therefore split our dataset into portions of any chosen size. The percentile formula is given below [23]:

$$R=\frac{\rho}{100}\times(\eta+1)$$

Here, $\rho$ is the percentile to be found, $\eta$ is the total number of values, and $R$ is the rank (position in the sorted data) of the $\rho$-th percentile.
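As a small worked example of the formula, the sketch below computes the rank $R$ and reads off the corresponding value from a sorted list. The helper names, the sample values, and the choice to interpolate linearly between neighbouring ranks (and to clamp the rank to the range 1 to $\eta$) are assumptions made for illustration.

```python
def percentile_rank(rho, eta):
    """Rank R of the rho-th percentile among eta sorted values: R = rho/100 * (eta + 1)."""
    return rho / 100 * (eta + 1)

def percentile_value(values, rho):
    """Value at the rho-th percentile, interpolating linearly between neighbouring ranks."""
    ordered = sorted(values)
    r = percentile_rank(rho, len(ordered))
    r = min(max(r, 1), len(ordered))      # assumption: clamp the rank to [1, eta]
    lower = int(r)                        # 1-based rank of the lower neighbour
    frac = r - lower
    if lower == len(ordered):
        return ordered[-1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

data = [15, 20, 35, 40, 50, 55, 60]       # hypothetical values, eta = 7
print(percentile_rank(25, len(data)))     # 2.0 -> the 2nd smallest value
print(percentile_value(data, 25))         # 20.0
print(percentile_value(data, 50))         # 40.0
```

With $\eta = 7$ values, the 25th percentile has rank $R = 0.25 \times 8 = 2$, so it is the second smallest value.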

3.3 Dataset Split and Mean Calculation

After splitting the dimension-reduced dataset by percentile, we use the row indices of each percentile group to extract the corresponding rows from the primary dataset. In this way we get back the original data for each group. We then calculate the mean of each attribute within each group. These means of the split dataset are the initial centroids of our proposed method.

3.4 Centroids Determination

After splitting the dataset and calculating the means as described in subsection 3.3, we select the mean of each split dataset as a centroid. These centroids serve as the initial centroids for the efficient k-means clustering algorithm.

3.5 Cluster Generation

In the last step, we run our modified k-means algorithm until the centroids converge. We generate the final clusters by passing our proposed centroids to the k-means algorithm instead of random or kmeans++ centroids [24]. The proposed method always produces the same centroids for each test. The pseudocode of the whole proposed methodology is given in Algorithm 1. Evaluation and experimental results are discussed in the next section.

Algorithm 1 Proposed Methodology

Input: A dataset D and the number of clusters K
Output: Efficient initial centroids for K clusters
procedure Proposed Method(D, K)
    Ensure that all n attributes {a1, a2, a3, ..., an} of D are numeric; convert any non-numeric attribute to a numeric value.
    Apply Principal Component Analysis (PCA) with 2 components to the dataset D.
    Apply percentiles to split the whole dataset into K equal parts based on the 1st component.
    Extract the split datasets from the primary data by index.
    Calculate the mean of each attribute of the split datasets.
    Take the mean of each split dataset as an initial cluster centroid, C = {c1, c2, ..., ck}, where c1, c2, ..., ck are the initial centroids of the 1st, 2nd, ..., k-th clusters, respectively.
    Assign the centroids to the k-means clustering algorithm.
    Apply the k-means algorithm with the proposed initial centroids.
end procedure
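The centroid-selection steps of Algorithm 1 can be sketched in a few lines of NumPy. This is a hedged illustration under stated assumptions rather than our exact implementation: the function name `percentile_pca_centroids` is hypothetical, the data are synthetic, the cut points are placed at multiples of 100/K, and `np.percentile` uses linear interpolation, which can differ slightly from the rank formula of subsection 3.2.

```python
import numpy as np

def percentile_pca_centroids(X, k):
    """Initial centroids: per-attribute means of k equal-sized groups ordered by the 1st principal component."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    pc1 = Xc @ Vt[0]                                   # score on the 1st principal component
    # Percentile cut points at 100/k, 200/k, ..., e.g. 25, 50, 75 for k = 4.
    cuts = np.percentile(pc1, [100 * i / k for i in range(1, k)])
    group = np.searchsorted(cuts, pc1, side="left")    # group index 0..k-1 for every row
    # Index back into the ORIGINAL data and average each attribute per group.
    return np.vstack([X[group == g].mean(axis=0) for g in range(k)])

rng = np.random.default_rng(42)
# Synthetic stand-in for the merged dataset: 200 rows, 5 numeric attributes.
X = rng.normal(size=(200, 5)) * [5.0, 1.0, 1.0, 1.0, 1.0]

centroids = percentile_pca_centroids(X, k=4)
print(centroids.shape)   # (4, 5)
```

Because no random numbers are involved in the selection, calling the function twice on the same data returns the same centroids. The resulting array can then be handed to any k-means implementation that accepts explicit starting centroids; in scikit-learn, for instance, `KMeans` takes such an array through its `init` parameter together with `n_init=1`.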

4 Implementation and Experimental Results Analysis

To measure the effectiveness of our proposed model for selecting the initial centroids of the k-means clustering algorithm and to validate it, we implemented it and report experimental results on a real-world dataset. We used five COVID-19 datasets and merged them to obtain a sufficient set of features for clustering the countries according to their health care quality during COVID-19.

4.1 Datasets Pre-processing

We merged the datasets described in subsection 3.1 by country name. While merging, we carried out some pre-processing and data cleaning with regular expressions, and we handled missing data carefully. Of the many attributes related to COVID-19, 25 were finally selected because they closely reflect the health care quality of a country. The attributes hold categorical and numerical values. They are: country name, cancellation of public events (due to public health awareness), stringency index (one of the metrics used by the Oxford COVID-19 Government Response Tracker [25]; it gives a picture of the strictest measures a country has enforced), testing policy (the category of testing facility available to the general public), total positive cases per million, new cases per million, total deaths per million, new deaths per million, cardiovascular death rate, hospital beds available per thousand, life expectancy, INFORM COVID-19 risk (rate), hazard and exposure dimension rate, people using at least basic sanitation services (rate), INFORM vulnerability (rate), INFORM health conditions (rate), INFORM epidemic vulnerability (rate), mortality rate, prevalence of undernourishment, lack of coping capacity, access to healthcare, physicians density, current health expenditure per capita, and maternal mortality ratio. All of the attributes were closely examined before being fed to the model.
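The merge by country name can be pictured with the following standard-library sketch. It is not our pre-processing code: the field names, the name variants, and the specific regular expressions are illustrative assumptions, and only the numeric values are taken from Tables 1 and 2.

```python
import re

def normalize_country(name):
    """Canonical key for a country name: lowercase, letters only, single spaces."""
    name = re.sub(r"\(.*?\)", " ", name)          # drop parenthetical notes
    name = re.sub(r"[^A-Za-z ]+", " ", name)      # drop punctuation and digits
    return re.sub(r"\s+", " ", name).strip().lower()

def merge_by_country(*tables):
    """Inner-join lists of dict records on the normalized 'country' field."""
    merged = {}
    for i, table in enumerate(tables):
        keyed = {normalize_country(row["country"]): row for row in table}
        if i == 0:
            merged = {key: dict(row) for key, row in keyed.items()}
        else:
            # Keep only countries present in every table and append the new fields.
            merged = {
                key: {**merged[key],
                      **{col: val for col, val in keyed[key].items() if col != "country"}}
                for key in merged if key in keyed
            }
    return list(merged.values())

# Hypothetical records with inconsistent spellings of the same countries.
cases = [{"country": "Australia", "hospital_beds_per_thousand": 3.84},
         {"country": "Bangladesh ", "hospital_beds_per_thousand": 0.8}]
policy = [{"country": "AUSTRALIA", "testing_policy": 3},
          {"country": "Bangladesh (BGD)", "testing_policy": 2}]

result = merge_by_country(cases, policy)
print(result)
```

Normalizing the key before joining means that differences in case, trailing spaces, or appended codes do not cause a country to be dropped from the merged table.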

Since our goal is to cluster countries with similar health care quality, we determined the number of clusters with the elbow method [26], which gives an optimum of 4.

4.2 Centroids of COVID-19 Dataset

We applied Principal Component Analysis (PCA) to reduce the high-dimensional dataset to two dimensions. The number of clusters for the COVID-19 dataset is 4. After applying PCA, we used percentiles to divide the whole dataset into K equal parts. For K=4, each part contains 25% of the dataset, so we calculated the percentiles for 25%, 50%, 75%, and 99.9% of the data. We divided the data horizontally because this gives a good intuition about the clusters. Figure 2 plots the data in the two PCA dimensions, with horizontal lines marking the percentile splits.

As described in subsection 3.3, splitting the dataset and calculating the means yields the proposed initial centroids for k-means. The initial centroids of the COVID-19 dataset for 4 clusters are given in Table 3; only 7 of the 25 attributes are shown for demonstration purposes. In Table 3, c1, c2, c3, and c4 denote the initial centroids of the four clusters in order.

Table 3: Initial Centroids of COVID-19 Dataset

Centroid	total cases per million	new cases per million	total deaths per million	new deaths per million	cardiovasc death rate	hospital beds per thousand	life expectancy
c1	5052.931	43.53	94.81	1.06	277.75	1.6	67.7
c2	1456.5	17.02	28.69	0.23	320.57	2.361	67.18
c3	2190.72	33.01	47.76	0.81	270.99	3.16	74.92
c4	3577.2	27.41	17.57	0.28	163.93	4.36	80.73

4.3 Evaluation Process

In machine learning and data science, computational cost is one of the main concerns, because a computer often has to process a large amount of data at a time. Reducing that cost is therefore important. We implemented our proposed method as described in Section 3. Although many researchers have proposed alternatives, as discussed in the related work section, we compared our method with the best existing one, kmeans++ [24]. We measured the effectiveness of the model by:

$\bullet$: Number of iterations needed to find the final clusters
$\bullet$: Execution time needed to reach the final clusters

These two measures are compared in the following subsection.
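The two measures can be collected with a small harness such as the sketch below. It is an assumption-laden illustration, not our benchmark code: it uses a plain NumPy implementation of Lloyd's k-means, synthetic data, a fixed set of starting points as a stand-in for the proposed centroids, and uniformly random starting points as a stand-in for a randomized seeding such as kmeans++.

```python
import time
import numpy as np

def lloyd_kmeans(X, init_centroids, max_iter=300, tol=1e-6):
    """Plain Lloyd's k-means; returns (centroids, labels, iterations used)."""
    centroids = np.array(init_centroids, dtype=float)
    for iteration in range(1, max_iter + 1):
        # Assignment step: index of the nearest centroid for every point.
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Update step: mean of each cluster (an empty cluster keeps its centroid).
        updated = np.vstack([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        shift = np.abs(updated - centroids).max()
        centroids = updated
        if shift < tol:
            break
    return centroids, labels, iteration

def benchmark(X, init_fn, runs=10):
    """Iterations and wall-clock seconds (initialisation included) for each run."""
    results = []
    for run in range(runs):
        start = time.perf_counter()
        _, _, n_iter = lloyd_kmeans(X, init_fn(run))
        results.append((n_iter, time.perf_counter() - start))
    return results

rng = np.random.default_rng(0)
# Synthetic data: four Gaussian blobs of 50 points each.
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 2))
               for c in ([0, 0], [8, 0], [0, 8], [8, 8])])

fixed_init = X[[0, 50, 100, 150]]      # deterministic start (stand-in for the proposed centroids)

def random_init(run):
    """Randomized start: 4 distinct data points drawn with a run-specific seed."""
    return X[np.random.default_rng(run).choice(len(X), size=4, replace=False)]

fixed = benchmark(X, lambda run: fixed_init)
rand = benchmark(X, random_init)
print("fixed init iterations :", [n for n, _ in fixed])
print("random init iterations:", [n for n, _ in rand])
```

With a deterministic initialisation the iteration count is identical in every run, whereas with a randomized initialisation it can change from run to run; execution times are read from the second element of each result tuple.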

4.4 Efficiency analysis

Figure 3 shows the experimental results on the COVID-19 dataset for 10 tests. In the graphs, the green and red lines represent the results of the proposed and kmeans++ algorithms, respectively. Because our centroids are pre-defined, the number of iterations is constant in every test case. In contrast, the number of iterations of the kmeans++ method varies, because its initial centroids are chosen randomly. To give insight into the clusters, Figure 4 plots a cluster map showing which countries are similar in terms of health care quality. The 2D plot also shows the cluster distributions in a two-dimensional space. The model was tested on an Intel® Core™ i7-8750H processor.

$\bullet$: Figure 3(a) compares the number of iterations of the existing kmeans++ method and our proposed method. With our proposed method, the number of iterations is constant in each test, whereas the number of iterations of kmeans++ is random and may vary between test cases.
$\bullet$: Figure 3(b) compares the execution time of the existing kmeans++ method and our proposed method. In each test case, our model ran the k-means clustering algorithm in the shortest time.

Based on the experimental results, we claim that our model performs better in this real-world application and reduces the computational cost of the k-means clustering algorithm.

5 Concluding Remarks

In this paper, we have presented an efficient clustering method that selects the initial centroids of the K-means clustering algorithm. Using the proposed method, we have efficiently created clusters of countries with similar health care quality during COVID-19. In experiments on COVID-19 datasets, our model needs a reduced and constant number of iterations, which in turn reduces the execution time. Although our proposed model performs well in real-world scenarios, its results may diverge somewhat if the number of instances in the dataset is high or extremely high. In the future, we plan to work with a much larger number of instances and to build an effective recommendation system.

References

[1] World Health Organization: Coronavirus disease 2019 (covid-19): situation report, 82 (2020)
[2] Sarker, I.H., Hoque, M.M., Uddin, M.K., Alsanoosy, T.: Mobile data science and intelligent apps: Concepts, ai-based modeling and research directions. Mobile Networks and Applications pp. 1–19 (2020)
[3] Sarker, I.H., Kayes, A., Badsha, S., Alqahtani, H., Watters, P., Ng, A.: Cybersecurity data science: an overview from machine learning perspective. Journal of Big Data 7(1), 1–29 (2020)
[4] Sarker, I.H.: Context-aware rule learning from smartphone data: survey, challenges and future directions. Journal of Big Data 6(1), 95 (2019)
[5] Han, J., Pei, J., Kamber, M.: Data mining: concepts and techniques. Elsevier (2011)
[6] Sarker, I.H., Colman, A., Kabir, M.A., Han, J.: Individualized time-series segmentation for mining mobile phone user behavior. The Computer Journal 61(3), 349–368 (2018)
[7] Vattani, A.: The hardness of k-means clustering in the plane. Manuscript, accessible at http://cseweb.ucsd.edu/avattani/papers/kmeans_hardness.pdf 617 (2009)
[8] Rahim, M.S., Ahmed, T.: An initial centroid selection method based on radial and angular coordinates for k-means algorithm. In: 2017 20th International Conference of Computer and Information Technology (ICCIT). pp. 1–6. IEEE (2017)
[9] Kumar, A., Gupta, S.C.: A new initial centroid finding method based on dissimilarity tree for k-means algorithm. arXiv preprint arXiv:1509.03200 (2015)
[10] Mahmud, M.S., Rahman, M.M., Akhtar, M.N.: Improvement of k-means clustering algorithm with better initial centroids based on weighted average. In: 2012 7th International Conference on Electrical and Computer Engineering. pp. 647–650. IEEE (2012)
[11] Goyal, M., Kumar, S.: Improving the initial centroids of k-means clustering algorithm to generalize its applicability. Journal of The Institution of Engineers (India): Series B 95(4), 345–350 (2014)
[12] Na, S., Xumin, L., Yong, G.: Research on k-means clustering algorithm: An improved k-means clustering algorithm. In: 2010 Third International Symposium on intelligent information technology and security informatics. pp. 63–67. IEEE (2010)
[13] Lakshmi, M.A., Daniel, G.V., Rao, D.S.: Initial centroids for k-means using nearest neighbors and feature means. In: Soft Computing and Signal Processing, pp. 27–34. Springer (2019)
[14] Vadyala, S.R., Betgeri, S.N., Sherer, E.A., Amritphale, A.: Prediction of the number of covid-19 confirmed cases based on k-means-lstm. arXiv preprint arXiv:2006.14752 (2020)
[15] Poompaavai, A., Manimannan, G.: Clustering study of indian states and union territories affected by coronavirus (covid-19) using k-means algorithm. International Journal of Data Mining And Emerging Technologies 9(2), 43–51 (2019)
[16] Sonbhadra, S.K., Agarwal, S., Nagabhushan, P.: Target specific mining of covid-19 scholarly articles using one-class approach. arXiv preprint arXiv:2004.11706 (2020)
[17] Berkhin, P.: A survey of clustering data mining techniques. In: Grouping multidimensional data, pp. 25–71. Springer (2006)
[18] Sarker, I.H., Abushark, Y.B., Khan, A.I.: Contextpca: Predicting context-aware smartphone apps usage based on machine learning techniques. Symmetry 12(4), 499 (2020)
[19] Total covid-19 tests performed by country - humanitarian data exchange. https://data.humdata.org/dataset/total-covid-19-tests-performed-by-country, (Accessed on 09/03/2020)
[20] Covid-19 testing policies, sep 3, 2020. https://ourworldindata.org/grapher/covid-19-testing-policy?region=Asia, (Accessed on 09/03/2020)
[21] Uncover covid-19 challenge — kaggle. https://www.kaggle.com/roche-data-science-coalition/uncover, (Accessed on 09/03/2020)
[22] Abdi, H., Williams, L.J.: Principal component analysis. Wiley interdisciplinary reviews: computational statistics 2(4), 433–459 (2010)
[23] Altman, D.G., Bland, J.M.: Statistics notes: quartiles, quintiles, centiles, and other quantiles. Bmj 309(6960), 996–996 (1994)
[24] Arthur, D., Vassilvitskii, S.: k-means++: The advantages of careful seeding. Tech. rep., Stanford (2006)
[25] Coronavirus government response tracker — blavatnik school of government. https://www.bsg.ox.ac.uk/research/research-projects/coronavirus-government-response-tracker, (Accessed on 09/06/2020)
[26] Kodinariya, T.M., Makwana, P.R.: Review on determining number of cluster in k-means clustering. International Journal 1(6), 90–95 (2013)