Investigating the Relationship Between World Development Indicators and the Occurrence of Disease Outbreaks in the 21st Century: A Case Study

Content Areas: Paraphrase Identification, Semantic Equivalence, Transformer Architecture

Aboli Marathe¹*, Harsh Sakhrani¹*, Saloni Parekh¹*
¹Pune Institute of Computer Technology, Maharashtra, India
*Equal contributors
{aboli.rajan.marathe, harshsakhrani26, saloniparekh1609}@gmail.com

Abstract

The timely identification of socio-economic sectors vulnerable to a disease outbreak presents an important challenge to the civic authorities and healthcare workers interested in outbreak mitigation measures. This problem was traditionally solved by studying aberrations in small-scale healthcare data. In this paper, we leverage data-driven models to determine the relationship between trends in the World Development Indicators and the occurrence of disease outbreaks, using worldwide historical data from 2000 to 2019, and treat it as a classic supervised classification problem. CART-based feature selection was employed in an unorthodox fashion to determine the covariates affected by disease outbreaks, thereby identifying the most vulnerable sectors. The results comprise a comprehensive analysis of different classification algorithms and indicate a relationship between disease outbreak occurrence and the magnitudes of various development indicators.

1 Introduction

Coronavirus (COVID-19) has become an unprecedented health crisis and has spread to over 150 countries, severely impacting the world economy and causing social disruption. A 2020 study observed that the COVID-19 outbreak had a significant impact on the Italian economy, eventually tipping it into recession. The impact of this recession fell on the financially weak, the elderly and the working population (Sanfelici, 2020). Learning from these experiences, we wish to move forward and create robust emergency preparedness measures. Providing the authorities with the sectors most likely to succumb first to a disease outbreak would be an invaluable resource for planning and policy-making. But finding these vulnerable sectors presents a challenge, as it requires big data analysis of imperfect data over multiple years of history for a particular region. Furthermore, the vulnerability of a sector cannot be quantified directly and must be estimated through indirect measures. We studied such cases of previous disease outbreaks and propose a method of accurately identifying these vulnerable sectors, including critical sectors like the economy, healthcare and safety. We came across the World Development Indicators (WDI), a set of indicators established by the World Bank and collected over time for every country through its government. The indicators cover most sectors of development, including trade and safety markers, and we use them to estimate the vulnerability of different sectors. Some world development indicators are affected more than others, and identifying them is a challenging problem for disease outbreak preparedness and planning.

Over the years, researchers have analysed disease outbreaks to determine risk factors Anno et al. (2019) and to aid disease outbreak surveillance Allard (1998). But the relationship between trends in socio-economic indicators and the occurrence of previous disease outbreaks remains poorly understood. While researchers tend to analyse socio-economic systems in the context of disease outbreaks, we tried to understand the relationship between socio-economic systems and disease outbreaks itself. Our primary research question was whether metadata could be used to analyze the anomalies and their causes or impacts on a country-by-country basis. We tried to answer this question using a different methodology. In this paper, we propose an approach that uses data-driven models for disease outbreak identification rather than disease outbreak forecasting; treating this identification as a classification problem is a novel approach that we would like to introduce to this field. We combine the World Development Indicators data provided by the World Bank (Bank, 2010; [3]) and the disease outbreaks data from the World Health Organization ([21]) to create a dataset. As there was a small degree of uncertainty in the dataset due to missing values, we also make use of statistical data imputation and predictive modelling for data treatment. Lastly, we apply benchmarked classification techniques for disease outbreak identification and CART-based feature importance to find the crucial indicators. After finding these indicators, we compare the results with former studies and surveys to validate the performance of this methodology. The verified indicators can be passed on to the authorities for emergency preparedness and planning assistance.

2 Background Work

Due to the recent rise of disease outbreaks, the research community is laying special emphasis on studying epidemiology. Researchers have found correlations between time-series data trends and the presence of disease outbreaks Richardson et al. (2016); Li et al. (2012) and have also found causal relationships between socio-cultural systems Davis et al. (2019). Farrington and Beale (1998) worked extensively to model these outbreaks and predict them. The case study of Salmonella agona in their paper highlighted both the potential and the shortcomings of automated detection procedures, emphasising both their time optimization and their less perceptible results. Heisterkamp et al. (2006) tried another approach, using a hierarchical time series model to detect outbreaks, and found the proposed model to be a reliable tool for Rubella notifications and Salmonella infections. Streftaris and Gibson (2004) considered continuous-time stochastic compartmental models that can be applied in veterinary epidemiology to model the within-herd dynamics of infectious diseases. Stroup et al. (1993) introduced a statistical method for the detection of specific types of aberrations in public health surveillance.

Rohwerder (2020) summarised the work of multiple authors in an attempt to identify the secondary impacts of these disease outbreaks in certain countries. The summary analyses the economic, political, social and other secondary impacts of the outbreaks, rather than the traditional healthcare impacts, and found some common features among the countries struck by outbreaks. We were inspired by these results and wondered whether one methodology, applied to single or multiple datasets, could reproduce these findings in a sufficiently timely fashion to allow interventions to take place. While most studies try to identify the impact of disease outbreaks using statistical modelling, few analyse this problem in the reverse manner, i.e. by identifying the socio-economic indicators that could be associated with the occurrence of a disease outbreak. The directionality of this indicator-outbreak network has been considered a problem too vast for any single study, something that we agree with but look forward to solving. Unkel et al. (2012), however, discussed a wide variety of techniques, from regression to ARIMA models to Markov models, along with their possible limitations and advantages, for the identification of unusual patterns in data which may result from infectious disease outbreaks. This study provided a base for our methodology.

3 Data Description and Creation

World Development Indicators (WDI) is the primary World Bank collection of 143 development indicators for more than 200 economies and 40 country groups. The part of the database that we considered spans the years 2000 to 2019 ([3]). The disease outbreak data from the WHO was extracted separately for individual countries ([21]). Years with and without a disease outbreak occurrence were labelled 1 and 0 respectively.

The basic preprocessing involved encoding categorical features, scaling, normalization and resampling. Robust Scaler was utilized for scaling, since it scales the data according to the quantile range and is therefore robust to outliers. A number of other scaling techniques, like Min-Max scaler, Standard scaler and our in-house Logarithmic Deviation scaler, were also tried but gave inferior results.
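The effect of quantile-based scaling can be illustrated with a minimal sketch. It assumes the common definition (subtract the median, divide by the interquartile range); the function name `robust_scale` and the sample values are hypothetical and are not taken from our pipeline.

```python
from statistics import median, quantiles

def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    # "inclusive" interpolates linearly between the sorted data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    med = median(values)
    if iqr == 0:
        # Degenerate column: the middle half of the values is constant.
        return [0.0 for _ in values]
    return [(v - med) / iqr for v in values]

# Hypothetical indicator column with one extreme outlier.
indicator = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled = robust_scale(indicator)
```

Because the median and the IQR ignore the tails, the outlier does not compress the remaining values towards zero, which is what happens with min-max scaling on the same column.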

The severely skewed class distribution observed in our dataset posed a challenge for the classification algorithms. Both undersampling and oversampling have known disadvantages. Undersampling can throw away potentially useful data, and oversampling can increase the likelihood of overfitting. Hence, a combination of undersampling and oversampling was used. SMOTE is an oversampling technique that synthesizes new plausible examples of the minority class by interpolating between several minority class examples that lie close together. Tomek Links is an undersampling technique that identifies cross-class nearest neighbors and removes the majority class occurrence (Batista et al., 2003).
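The two resampling ideas can be sketched in a few lines of plain Python. This is an illustrative simplification under stated assumptions, not the implementation used in our experiments: the helper names `smote_sample` and `tomek_links`, the neighbourhood size and the toy points are all hypothetical.

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point on the segment between a random minority
    point and one of its k nearest minority neighbours."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: euclidean(base, p))[:k]
    other = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return tuple(b + gap * (o - b) for b, o in zip(base, other))

def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different class labels."""
    def nearest(i):
        candidates = (j for j in range(len(points)) if j != i)
        return min(candidates, key=lambda j: euclidean(points[i], points[j]))
    nn = [nearest(i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]

# Toy data: label 0 is the majority class, label 1 the minority class.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 1.0), (3.0, 3.0)]
labels = [0, 0, 0, 1, 1]
links = tomek_links(points, labels)
# For each link, the majority-class member is the one to drop.
to_remove = [i if labels[i] == 0 else j for i, j in links]
synthetic = smote_sample([(1.05, 1.0), (3.0, 3.0), (2.0, 2.5)],
                         rng=random.Random(0))
```

In the toy data only the pair of points at (1.0, 1.0) and (1.05, 1.0) forms a Tomek link, so the majority point of that pair is marked for removal while the synthetic minority point is added.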

Figure 1: Feature importance top covariates

4 Methodology

4.1 Data Imputation

We employed a number of statistical and inferential data imputation techniques ranging from simple statistic substitution to complex deep learning based imputation techniques. The techniques that gave us noteworthy results are explained below.

4.1.1 KNN imputation

The K-Nearest Neighbors algorithm associates a point with its k closest neighbors in a multi-dimensional space. The intuition behind using KNN for data imputation is that a missing input variable value can be approximated by the values of the points that are closest to it, and this 'closeness' can be determined on the basis of the other non-missing variables. In our dataset, the missing World Development Indicator values are imputed using this 'closeness', which is usually seen in groups of countries having similar indicator values, or in countries that have had similar development curves in different time frames. After experimenting with a number of parameters, the best results were obtained with a combination of 5 neighbors and the Euclidean distance (1).

d\left(p,q\right)=\sqrt{\sum_{i=1}^{n}\left(q_{i}-p_{i}\right)^{2}}

(1)
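A minimal sketch of this imputation step follows. It assumes that the distance in (1) is computed only over the indicators that both rows have observed and that a gap is filled with the mean of the neighbours' values; these details, the function name `knn_impute` and the toy rows are assumptions for illustration, not a description of the exact implementation used.

```python
import math

def knn_impute(rows, k=5):
    """Fill None entries with the mean of that column over the k nearest
    rows, using Euclidean distance on the columns both rows have observed."""
    def distance(p, q):
        shared = [(a, b) for a, b in zip(p, q)
                  if a is not None and b is not None]
        if not shared:
            return math.inf  # nothing to compare on
        return math.sqrt(sum((b - a) ** 2 for a, b in shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            if value is not None:
                continue
            # Only rows that actually observed this column can donate a value.
            donors = [r for j, r in enumerate(rows)
                      if j != i and r[col] is not None]
            donors.sort(key=lambda r: distance(row, r))
            nearest = donors[:k]
            if nearest:
                filled[i][col] = sum(r[col] for r in nearest) / len(nearest)
    return filled

# Hypothetical country-year rows with two indicators; the last one has a gap.
rows = [(1.0, 10.0), (1.1, 12.0), (5.0, 50.0), (1.05, None)]
imputed = knn_impute(rows, k=2)
```

Here the incomplete row is closest to the first two rows on the observed indicator, so the gap is filled with the mean of their values and the distant third row is ignored.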

4.1.2 Stochastic Multiple Regression Imputation

The intuition behind the stochastic multiple regression (MSREG) algorithm is to leverage the correlation between the input variables by regressing the missing variable on all the other input variables. We employ a linear regression model to estimate the missing values. For example, in our dataset there is a strong positive correlation between the “Number of Community Health Workers” and the “Current health expenditure” columns. The MSREG algorithm is capable of utilizing such correlations in order to impute the missing variable values.

To counter the decrease in the inherent variability of the imputed variable, normally distributed noise with a mean of zero and variance equal to the standard error of the regression estimates was introduced. The MSREG method assigns values to each missing element of $x$ according to (2), where $k$ is the number of manifest variables used in a model, $N$ is the number of missing values in $x$, and $\text{Srandn()}$ is a function that returns a different element of a standardized normally distributed random column vector each time it is invoked (Kock, 2014).

\dot{x}_{ir}=\sum_{j=1}^{k}\hat{\beta}_{x_{i}x_{j}}+\sqrt{\left(1-\hat{\beta}_{x_{i}x_{j}}\hat{\Sigma}_{x_{i}x_{j}}\right)}\,\text{Srandn()}

(2)

where $j=1,\dots,k$, $j\neq i$, and $r=1,\dots,N$
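The idea of regression plus noise can be sketched with a deliberately simplified version that uses a single predictor and unstandardized data. It is not the formulation in (2): the function name `stochastic_regression_impute`, the use of the residual standard error as the noise scale and the toy values are assumptions made for illustration.

```python
import math
import random

def stochastic_regression_impute(x, y, rng=None):
    """Impute None entries of y from x with a least-squares line plus
    zero-mean Gaussian noise scaled by the residual standard error."""
    rng = rng or random.Random()
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((b - (intercept + slope * a)) ** 2 for a, b in pairs)
    # Residual standard error; zero when the observed points fit exactly.
    sigma = math.sqrt(residual_ss / (n - 2)) if n > 2 else 0.0
    return [b if b is not None
            else intercept + slope * a + rng.gauss(0.0, sigma)
            for a, b in zip(x, y)]

# Hypothetical columns: health expenditure predicts health worker counts.
expenditure = [1.0, 2.0, 3.0, 4.0, 5.0]
workers = [3.0, 5.0, 7.0, None, 11.0]
completed = stochastic_regression_impute(expenditure, workers,
                                         rng=random.Random(0))
```

With perfectly linear toy data the residual error is zero, so the imputed value falls exactly on the fitted line; with noisy data the added Gaussian term restores part of the variability that a plain regression prediction would remove.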

4.1.3 Random (Additive Noise) Imputation

This method imputes the missing data by picking random observed values of a particular variable. It was applied to all features with missing data: each missing entry is filled with a randomly drawn observed value, each of which is selected with probability $1/n$, where $n$ is the number of present values.
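A minimal sketch of this procedure is given here, assuming missing entries are represented as `None`; the function name `random_impute` and the sample column are hypothetical.

```python
import random

def random_impute(column, rng=None):
    """Replace each None with a uniformly drawn observed value, so every
    one of the n present values is chosen with probability 1/n."""
    rng = rng or random.Random()
    observed = [v for v in column if v is not None]
    return [v if v is not None else rng.choice(observed) for v in column]

# Hypothetical indicator column with two gaps.
column = [2.5, None, 3.1, 4.0, None]
filled_column = random_impute(column, rng=random.Random(0))
```

Because the draws come from the empirical distribution of the variable, its marginal distribution is roughly preserved, although relationships with other variables are ignored.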

4.2 Feature Importance

Conventionally, feature selection has been used to identify the set of relevant features that yields a significant increase in the performance of an algorithm. We instead utilize it in an unorthodox fashion. The promising classification scores (Table 1) show that there is a strong association between the World Development Indicators and disease outbreaks. The crucial question is then to discover the set of non-obvious indicators that are affected by disease outbreaks and to understand them better, for which we use feature selection.

RandomForestClassifier’s implicit feature selection was used to determine the subset of relevant features. For ensembles of randomized trees, the importance of a variable $X_{m}$ for predicting $Y$ is calculated by adding up the weighted impurity decreases $p(t)\Delta i(s_{t},t)$ over all nodes $t$ where $X_{m}$ is used, averaged over all $N_{T}$ trees in the forest (3).

\mathrm{Imp}(X_{m})=\frac{1}{N_{T}}\sum_{T}\sum_{t\in T:\,v(s_{t})=X_{m}}p(t)\Delta i(s_{t},t)

(3)

where $p(t)$ is the proportion $N_{t}/N$ of samples reaching $t$ and $v(s_{t})$ is the variable used in split $s_{t}$. When using the Gini index as an impurity function, this measure is known as the Gini importance or Mean Decrease Gini (Louppe et al., 2013).
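As a worked illustration of (3), the sketch here computes the weighted Gini decrease of a single binary split and then averages per-feature decreases over a toy forest. The helper names, the feature labels and all numbers are hypothetical and serve only to make the formula concrete.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 2.0 * p1 * (1.0 - p1)  # equals 1 - p0**2 - p1**2 for two classes

def weighted_impurity_decrease(parent, left, right, n_total):
    """p(t) * delta_i(s_t, t) for one node t split into left and right."""
    n_t = len(parent)
    delta = (gini(parent)
             - (len(left) / n_t) * gini(left)
             - (len(right) / n_t) * gini(right))
    return (n_t / n_total) * delta

def importance(trees, feature):
    """Sum the decreases of all nodes splitting on `feature`, then average
    over the trees. Each tree is a list of (feature, decrease) pairs."""
    total = sum(d for tree in trees for f, d in tree if f == feature)
    return total / len(trees)

# Root node of a toy tree: 8 samples, split into two groups of 4.
root = [0, 0, 0, 0, 1, 1, 1, 1]
decrease = weighted_impurity_decrease(root, [0, 0, 0, 1], [0, 1, 1, 1],
                                      n_total=8)

# Toy forest of two trees with made-up per-node decreases.
forest = [[("tourism", decrease), ("health", 0.05)], [("health", 0.1)]]
tourism_importance = importance(forest, "tourism")
```

The root has impurity 0.5 and each child 0.375, so the split contributes a decrease of 0.125; averaged over the two toy trees, the feature used at that root receives an importance of 0.0625.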

Table 1: Results of classification algorithms on the three imputed datasets

Algorithm	KNN F1	KNN Accuracy	Random F1	Random Accuracy	MSREG F1	MSREG Accuracy
LGBM Classifier	0.934	0.934	0.918	0.918	0.935	0.935
RandomForest	0.930	0.930	0.940	0.940	0.942	0.942
BaggingClassifier	0.906	0.906	0.906	0.906	0.913	0.914
DecisionTreeClassifier	0.832	0.832	0.808	0.808	0.816	0.817
ANN (1 ReLU, 1 Sigmoid layer)	0.810	0.743	0.780	0.808	0.780	0.799
AdaBoostClassifier	0.799	0.799	0.835	0.835	0.812	0.812
VotingClassifier	0.789	0.790	0.780	0.780	0.786	0.786
ANN (3 ReLU, 1 Sigmoid layer)	0.780	0.824	0.780	0.800	0.770	0.808
Autoencoder + Logistic Reg.	0.760	0.832	0.780	0.811	0.760	0.829
Autoencoder + Logistic Reg.	0.760	0.834	0.790	0.817	0.760	0.831
GaussianNB	0.644	0.652	0.654	0.662	0.575	0.609
SVC	0.580	0.570	0.570	0.523	0.490	0.540
SGDClassifier	0.546	0.608	0.586	0.617	0.786	0.851

5 Results

The identification of important features from the imputed dataset was accomplished through the use of benchmarked classification techniques to predict the target variable $y$, which in our case is the disease outbreak occurrence in a particular year. We applied these techniques to our dataset with a 0.2 train-test split and compared three different methods of imputation: KNN, Random Imputation and MSREG. After analysing the F1-score and the accuracy, hyperparameter optimization was performed to boost our results. This was followed by the CART-based feature selection technique to determine the important features.
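The evaluation protocol can be sketched as follows. The sketch assumes that 0.2 denotes the held-out test fraction and reports the F1-score of the positive (outbreak) class; the averaging scheme behind the reported F1 values is not specified here, so this choice, the helper names and the toy labels are assumptions.

```python
import random

def train_test_split(rows, test_size=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def f1_score(y_true, y_pred):
    """F1 of the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Toy labels: 1 marks a year with an outbreak, 0 a year without one.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
train, test = train_test_split(list(range(10)), test_size=0.2)
```

On skewed data the two metrics can diverge: in the toy labels the accuracy is 0.8 while the positive-class F1 is only about 0.67, which is why both are reported for every classifier.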

Classification:

A wide range of state-of-the-art classification techniques were employed, including Bayesian, tree-based, ensemble and deep learning algorithms (Table 1).

Feature selection:

We applied feature selection to filter out the input variables strongly correlated with disease outbreak occurrence. By plotting the relative importance of these covariates, we can increase the interpretability of this pipeline, and thus deliver the vulnerable indicators as our final result (Figure 1).

6 Synopsis

The results on the imperfect data classification were very promising: we achieved a top accuracy of 94.2% and an F1-score of 0.94 on the dataset using the Random Forest algorithm, MSREG imputation and SMOTE sampling after hyperparameter tuning. The ensemble techniques performed better than both the regression and the deep learning models. We were able to extract the features that the algorithm ranked as most important (Figure 1).

To interpret our results better, we visualised the frequency of disease outbreaks per country against one of the more important predicted features, “International tourism, expenditures (current US$)”, and observed that the two variables were indeed correlated (Figure 2). It is interesting to note how our observations match the results put forward by Rohwerder (2020), which found the above features to be strongly affected by disease outbreaks in low and middle income countries through a different methodology and dataset.

7 Conclusion and Future Work

The dire impact of disease outbreaks is unequivocally faced by the most vulnerable populations, the healthcare workers and the financially disadvantaged, and our insights could help the authorities increase the accessibility of social services. Our proposed method leverages data-driven models and feature selection for the quick identification of the affected indicators, and thereby of the vulnerable sectors. The results on the imputed datasets, while indicative of potential relationships, cannot tell the whole story on their own. Many critical variables (e.g. competing political priorities, cultural narratives etc.) cannot be completely captured in a large-scale analysis, and may instead be found by comparing public opinion and conflict-related casualties. In future work, these insights can contribute to forming prior knowledge for a knowledge-driven model, providing concrete parameters to help assess and validate the theoretical framework. Our team has also conducted a study parallel to this work that builds on the dataset and analyzes causal relationships between the features Marathe et al. (2021). This research will be useful in emergency preparedness planning for the developing world.

References

Allard [1998] Rbc Allard. Use of time-series analysis in infectious disease surveillance. Bulletin of the World Health Organization, 76(4):327, 1998.
Anno et al. [2019] Sumiko Anno, Takeshi Hara, Hiroki Kai, Ming-An Lee, Yi Chang, Kei Oyoshi, Yousei Mizukami, and Takeo Tadono. Spatiotemporal dengue fever hotspots associated with climatic factors in taiwan including outbreak predictions based on machine-learning. Geospatial Health, 14(2), 2019.
[3] World Bank. World development indicators. Accessed Feb. 14, 2021.
Bank [2010] World Bank. World development indicators 2010. The World Bank, 2010.
Batista et al. [2003] Gustavo EAPA Batista, Ana LC Bazzan, Maria Carolina Monard, et al. Balancing training data for automated annotation of keywords: a case study. In WOB, pages 10–18, 2003.
Dauda [2019] Rasaki Stephen Dauda. Hiv/aids and economic growth: Evidence from west africa. The International journal of health planning and management, 34(1):324–337, 2019.
Davis et al. [2019] Paul K Davis, Angela O’Mahony, and Jonathan Pfautz. Social-Behavioral Modeling for Complex Systems. John Wiley & Sons, 2019.
Farrington and Beale [1998] CP Farrington and AD Beale. The detection of outbreaks of infectious disease. In Geomed’97, pages 97–117. Springer, 1998.
Farrington et al. [1996] CP Farrington, Nick J Andrews, AD Beale, and MA Catchpole. A statistical algorithm for the early detection of outbreaks of infectious disease. Journal of the Royal Statistical Society: Series A (Statistics in Society), 159(3):547–563, 1996.
Heisterkamp et al. [2006] Simon H Heisterkamp, Arnold LM Dekkers, and Janneke CM Heijne. Automated detection of infectious disease outbreaks: hierarchical time series models. Statistics in Medicine, 25(24):4179–4196, 2006.
Kalibala et al. [2012] Samuel Kalibala, Katie D Schenk, Deborah C Weiss, and Lynne Elson. Examining dimensions of vulnerability among children in uganda. Psychology, health & medicine, 17(3):295–310, 2012.
Kock [2014] Ned Kock. Single missing data imputation in pls-sem. Lar. Tex. Scr. Syst, 2014.
Kuang et al. [2012] Jie Kuang, Wei Zhong Yang, Ding Lun Zhou, Zhong Jie Li, and Ya Jia Lan. Epidemic features affecting the performance of outbreak detection algorithms. BMC Public Health, 12(1):1–9, 2012.
Li et al. [2012] Zhongjie Li, Shengjie Lai, David L Buckeridge, Honglong Zhang, Yajia Lan, and Weizhong Yang. Adjusting outbreak detection algorithms for surveillance during epidemic and non-epidemic periods. Journal of the American Medical Informatics Association, 19(e1):e51–e53, 2012.
Li [2018] Marie Li. Missing value estimation algorithms on cluster and representativeness preservation of gene expression microarray data. arXiv preprint arXiv:1809.05969, 2018.
Lieberman [2007] Evan S Lieberman. Ethnic politics, risk, and policy-making: A cross-national statistical analysis of government responses to hiv/aids. Comparative political studies, 40(12):1407–1432, 2007.
Louppe et al. [2013] Gilles Louppe, Louis Wehenkel, Antonio Sutera, and Pierre Geurts. Understanding variable importances in forests of randomized trees. Advances in neural information processing systems 26, 2013.
Lu et al. [2009] Hsin-Min Lu, Daniel Zeng, and Hsinchun Chen. Prospective infectious disease outbreak detection using markov switching models. IEEE Transactions on Knowledge and Data Engineering, 22(4):565–577, 2009.
Marathe et al. [2021] Aboli Marathe, Saloni Parekh, and Harsh Sakhrani. Modelling major disease outbreaks in the 21st century: A causal approach. 2021.
Novakovic and Veljovic [2011] J Novakovic and A Veljovic. C-support vector classification: Selection of kernel and parameters in medical diagnosis. In 2011 IEEE 9th international symposium on intelligent systems and informatics, pages 465–470. IEEE, 2011.
[21] World Health Organization. Disease outbreaks by countries, territories and areas. Accessed Feb. 14, 2021.
Richardson et al. [2016] Eugene T Richardson, Mohamed Bailor Barrie, J Daniel Kelly, Yusupha Dibba, Songor Koedoyoma, and Paul E Farmer. Biosocial approaches to the 2013-2016 ebola pandemic. Health and human rights, 18(1):115, 2016.
Rohwerder [2020] Brigitte Rohwerder. Secondary impacts of major disease outbreaks in low-and middle income countries. 2020.
Sahri et al. [2014] Zahriah Binti Sahri, Universiti Tekonologi Malaysia, et al. Support vector machine-based fault diagnosis of power transformer using k nearest-neighbor imputed dga dataset. Journal of Computer and Communications, 2(09):22, 2014.
Saiya and Scime [2019] Nilay Saiya and Anthony Scime. Comparing classification trees to discern patterns of terrorism. Social Science Quarterly, 100(4):1420–1444, 2019.
Sanfelici [2020] Mara Sanfelici. The italian response to the covid-19 crisis: Lessons learned and future direction in social development. The International Journal of Community and Social Development, 2(2):191–210, 2020.
Sawers et al. [2008] Larry Sawers, Eileen Stillwaggon, and Tom Hertz. Cofactor infections and hiv epidemics in developing countries: implications for treatment. AIDS care, 20(4):488–494, 2008.
Streftaris and Gibson [2004] George Streftaris and Gavin J Gibson. Bayesian inference for stochastic epidemics in closed populations. Statistical Modelling, 4(1):63–75, 2004.
Stroup et al. [1993] Donna F Stroup, Melinda Wharton, Karen Kafadar, and Andrew G Dean. Evaluation of a method for detecting aberrations in public health surveillance data. American journal of epidemiology, 137(3):373–380, 1993.
Ukpolo [2004] Victor Ukpolo. Aids epidemic and economic growth: testing for causality. Journal of Asian and African Studies, 39(3):169–178, 2004.
Unkel et al. [2012] Steffen Unkel, C Paddy Farrington, Paul H Garthwaite, Chris Robertson, and Nick Andrews. Statistical methods for the prospective detection of infectious disease outbreaks: a review. Journal of the Royal Statistical Society: Series A (Statistics in Society), 175(1):49–82, 2012.
Vogli et al. [2014] Roberto De Vogli, Anne Kouvonen, Marko Elovainio, and Michael Marmot. Economic globalization, inequality and body mass index: a cross-national analysis of 127 countries. Critical Public Health, 24(1):7–21, 2014.
