Multiwave COVID-19 Prediction from Social Awareness
using Web Search and Mobility Data

Jiawei Xue (xue120@purdue.edu, ORCID 0000-0001-7519-6130), Purdue University, 550 Stadium Mall Drive, West Lafayette, IN, USA, 47907-2051
Takahiro Yabe (tyabe@mit.edu, ORCID 0000-0001-8967-1967), Massachusetts Institute of Technology, 550 Stadium Mall Drive, Cambridge, MA, USA
Kota Tsubouchi (ktsubouc@yahoo-corp.jp), Yahoo Japan Corporation, Tokyo, Japan
Jianzhu Ma (majianzhu@pku.edu.cn, ORCID 0000-0002-8236-6609), Peking University; Beijing Institute for General Artificial Intelligence, Beijing, China
Satish V. Ukkusuri (sukkusur@purdue.edu, ORCID 0000-0001-8754-9925), Purdue University, West Lafayette, IN, USA

(2022)

Abstract.

Recurring outbreaks of COVID-19 have posed enduring effects on global society, which calls for a predictor of pandemic waves using various data with early availability. Existing prediction models that forecast the first outbreak wave using mobility data may not be applicable to the multiwave prediction, because the evidence in the USA and Japan has shown that mobility patterns across different waves exhibit varying relationships with fluctuations in infection cases. Therefore, to predict the multiwave pandemic, we propose a Social Awareness-Based Graph Neural Network (SAB-GNN) that considers the decay of symptom-related web search frequency to capture the changes in public awareness across multiple waves. Our model combines GNN and LSTM to model the complex relationships among urban districts, inter-district mobility patterns, web search history, and future COVID-19 infections. We train our model to predict future pandemic outbreaks in the Tokyo area using its mobility and web search data from April 2020 to May 2021 across four pandemic waves collected by Yahoo Japan Corporation under strict privacy protection rules. Results demonstrate our model outperforms state-of-the-art baselines such as ST-GNN, MPNN, and GraphLSTM. Though our model is not computationally expensive (only 3 layers and 10 hidden neurons), the proposed model enables public agencies to anticipate and prepare for future pandemic outbreaks.

COVID-19 Forecasting, Web Search Data, Human Mobility Data, Graph Neural Networks, Social Awareness Decay

Copyright: rights retained. Conference: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’22), August 14–18, 2022, Washington, DC, USA. DOI: 10.1145/3534678.3539172. ISBN: 978-1-4503-9385-0/22/08. CCS concepts: Information systems, Spatial-temporal systems; Applied computing, Sociology.

1. Introduction

The spreading mechanism of COVID-19 is complicated because it depends on disease features and social factors such as human mobility (Yabe et al., 2020; Qian et al., 2021), public awareness (Qian et al., 2020), and intervention policies. One prominent phenomenon of this complex spreading process is the occurrence of multiple outbreak waves, that is, the periodic rebound to a large number of infection cases (Leung et al., 2020), which is evident in many countries such as the USA, UK, France, and Japan (Figure 1). Abrupt and uncertain disease outbreaks disturb individuals’ daily lives, governments’ reopening policies (Chang et al., 2020), medical resource management (Moghadas et al., 2020), and risk assessment (Ye et al., 2020). Constructing an accurate model that predicts multiple waves by fully utilizing different types of data therefore has an enormous social impact (Xiao et al., 2021).

Many prediction models for the first outbreak wave have been proposed to anticipate the infection and death cases (Shahid et al., 2020; Chimmula and Zhang, 2020; Kufel et al., 2020; Schwabe et al., 2021). One critical input for these models is mobility data (Kang et al., 2020; Huang et al., 2020; Chang et al., 2021), which describes population movements and is positively related to disease infections (Xiong et al., 2020). Nevertheless, continuous tracking of human mobility dynamics shows that mobility strength did not exhibit a consistent relationship with infection cases: (1) the USA: human mobility fluctuated around 95% of the normal-period level from July 1 to Dec. 1, 2020 (Chinazzi, 2021), a period that witnessed the second wave in July and the third wave in November (Figure 1); (2) Tokyo: the social contact index returned to the normal level and decreased slightly after July 2020, yet Tokyo experienced the second wave from then on (Yabe et al., 2021). The research community has also noticed and discussed the limitation of mobility data in long-term multiwave infection prediction (Levin et al., 2021; Badr and Gardner, 2021). The inconsistency between mobility and infection calls for other data that are more representative of disease waves.

During COVID-19, many text-based methods have been proposed to aid communities, for example to understand people’s emotional states (Ju et al., 2021) and to answer people’s questions (Yan et al., 2021). Web search records collected by web service providers have extensive applications such as customer behavior analysis (Goel et al., 2010), disease outbreak monitoring (Hisada et al., 2020), and evacuation prediction (Yabe et al., 2019). For COVID-19, symptom-related web search records (e.g., fever, cough, and headache) reflect the public’s virus-induced symptoms, which cannot be mined from mobility data. Evidence in Tokyo showed that the Pearson correlation between High-Risk Users (which are defined from web search records) and infection cases was 0.719 with a 16-day lag for the second wave (Yabe et al., 2021). The same study also found that people’s symptom-related web search frequency was smaller during the second wave than during the first wave, even though the number of patients was significantly higher. These results inspire us to leverage web search data and recover human awareness decay to predict the multiwave pandemic.

Figure 1. The daily new infection cases of COVID-19 for four countries (Google Inc., 2021). We observe multiple pandemic waves with varying starting days and magnitudes.

In this study, we propose a multiwave infection prediction approach (the code and data of this study are open to the public and can be found at https://github.com/JiaweiXue/MultiwaveCovidPrediction), with urban district-level disease outbreak early warning as its direct application. District-level disease prediction has the following three requirements: (1) comprehensive data sources, such as people movement and social responses, should be included to provide various hints that are closely related to the disease spreading; (2) spatial and temporal disease transmission patterns of COVID-19 should be taken into consideration; (3) the complex dependency between infection cases and other factors should be captured. To deal with these challenges, we first define the Web search Mobility Network (WMN), whose nodes and edges maintain web search frequency and inter-district human flow information, respectively. Afterward, we propose a Social Awareness-Based Graph Neural Network (SAB-GNN) architecture upon the WMN to capture the spatio-temporal infection case dynamics in different urban districts. We train and test the model using real-world infection, human mobility, and web search data in Tokyo from April 2020 to May 2021, and obtain better prediction performance than state-of-the-art baseline models. Our method has three contributions:

• We focus on predicting multiwave disease outbreaks that are globally prevalent but are seldom investigated. Different from a single wave prediction, multiwave prediction enables the public agencies to evaluate the long-term risk and take appropriate actions at different pandemic stages.
• We propose the SAB-GNN by fusing historical infection, mobility, and web search data that provide sufficient evidence of potential disease outbreaks. The spatial module, temporal module, and social awareness module take separate responsibilities and jointly contribute to the prediction.
• The proposed method is implemented on a mega-city, Tokyo, with a period spanning more than one year across four pandemic waves. We conduct a comprehensive analysis of disease outbreaks and prediction results from different models at different time intervals, which promotes a more nuanced understanding of the disease waves.

2. Related Work

2.1. Time Series Models

Existing COVID-19 time series models cover multiple types such as auto-regressive integrated moving average (ARIMA) (Kufel et al., 2020) and long short-term memory (LSTM) (Chimmula and Zhang, 2020; Jo et al., 2020). Moreover, biologists and engineering scientists focus on the relationship between fatality rate and biochemical indicators (Zhou et al., 2020a) or human mobility (Jo et al., 2020). For urban district-level infection prediction, inter-district connections provide crucial pathways for both human movement and disease dissemination, yet these models are insufficient to capture such spatial dependency between different urban districts. In fact, the spatial dependency information enables us to deal with the data scarcity issue that may occur during the pandemic season. For instance, assume that we have sufficient mobility and social media data for district $\mathcal{A}$ and deficient data for district $\mathcal{B}$ , and recognize strong mobility connections between districts $\mathcal{A}$ and $\mathcal{B}$ . Considering the connections between the two districts helps to fully utilize the infection information and predict the infection cases for both districts $\mathcal{A}$ and $\mathcal{B}$ .

2.2. Graph Neural Networks

The graph neural network (GNN) is a neural network that captures the relationship between multi-hop neighborhood nodes via the message passing mechanism (Zhou et al., 2020b). In the past years, various GNN models such as the graph convolutional network (GCN) (Kipf and Welling, 2016), GraphSage (Hamilton et al., 2017), and the graph attention network (GAT) (Veličković et al., 2017) were developed and applied to many fields such as neural machine translation (Bastings et al., 2017), visual question answering (Narasimhan et al., 2018), traffic prediction (Yao et al., 2019; Zhang et al., 2021; Chen et al., 2019; Peng et al., 2020), and network metric generation (Xue et al., 2022).

Researchers have harnessed the strength of the GNN in modeling spatial dependency to perform disease infection case prediction. Most published GNN approaches (Kapoor et al., 2020; Panagopoulos et al., 2021; Gao et al., 2021) used historical infection and mobility data and focused on infection prediction before July 2020, when the first global outbreak occurred. The mobility data was sufficient to reflect the disease spreading patterns during the first wave thanks to its simple relationship with the infection. First, areas attracting more passengers had higher risks of experiencing rapidly increasing cases than less-visited areas at the beginning of the first wave. Second, the travel restriction policies during the first wave suppressed the mobility strength and thus decelerated the disease outbreak (Xiong et al., 2020).

Nevertheless, the interaction between infection and mobility became much more complicated during the later waves because the infection cases were affected by a large variety of factors such as mask policies and vaccination, which hindered the ability of raw mobility data to reflect the infection tendency. Besides, many studies have recognized that the available mobility data for COVID-19 infection case prediction was limited by the period length (Panagopoulos et al., 2021; Rodriguez et al., 2021), which resulted in unstable learned models. In summary, long-term multiwave infection prediction requires alternative data sources that remain sufficiently connected with the infection cases in a dynamic environment. In this study, we turn to web search data, which directly reveals people’s awareness of the disease and potential symptoms (Yom-Tov et al., 2022), to perform the multiwave disease prediction.

3. Preliminaries

We first describe the mobility and web search data used as features in the prediction, and then formally define our prediction task. Assume an urban area is divided into $n$ pre-defined urban districts.

Mobility feature: for mobile phone users within the urban area, we collect location records (longitude, latitude, and time) with a temporal resolution of around 30 minutes and a spatial error of at most 100 meters. We extract each individual’s trajectory points as a sequence. Next, we project the sequence points of each mobile phone user onto the urban districts to obtain the number of daily inter-district trips. Note that inter-district trips build natural pathways for transmitting the virus in the city, so that these districts have correlated exposure to disease outbreak risks.

Web search feature: we select $n_{w}$ COVID-19 symptom-related web search words from multiple official sources (Centers for Disease Control and Prevention, 2021; Sarker et al., 2020; Wang et al., 2021). We assign each mobile phone user to the urban district where their home is located and then aggregate the daily web search frequency for all users living in this district. In summary, we obtain the number of inter-district trips between every two districts, and the symptom-related web search frequency in each district.
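As a minimal sketch of the two aggregation steps above, the following Python snippet counts inter-district trips from per-user district sequences and sums symptom-related searches by home district. The input layout (`district_sequences`, `search_log`, `home_district`) and the rule that a trip is any change of district between two consecutive location records are illustrative assumptions, not the preprocessing actually used in this study.

```python
def count_trips(district_sequences, n):
    """Build an n x n trip matrix for one day.

    district_sequences: one time-ordered list of district indices (0..n-1)
    per user. Assumption: a trip is any change of district between two
    consecutive location records.
    """
    trips = [[0] * n for _ in range(n)]
    for seq in district_sequences:
        for origin, dest in zip(seq, seq[1:]):
            if origin != dest:
                trips[origin][dest] += 1
    return trips


def count_searches(search_log, home_district, symptom_words, n):
    """Build an n x n_w matrix of daily search counts per home district.

    search_log: list of (user_id, query_word) pairs for one day.
    home_district: dict mapping user_id -> district index (hypothetical).
    """
    column = {word: j for j, word in enumerate(symptom_words)}
    counts = [[0] * len(symptom_words) for _ in range(n)]
    for user_id, word in search_log:
        if word in column and user_id in home_district:
            counts[home_district[user_id]][column[word]] += 1
    return counts


# Toy data (made up): two users, three districts, two symptom words.
E = count_trips([[0, 0, 1, 2], [2, 1, 1, 0]], n=3)
H = count_searches(
    [("u1", "fever"), ("u2", "cough"), ("u1", "fever"), ("u2", "weather")],
    {"u1": 0, "u2": 2},
    ["fever", "cough"],
    n=3,
)
```

In this toy case `E` plays the role of the trip matrix for one day and `H` the role of the per-district web search matrix; queries that are not symptom words are ignored.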

Network definition: We define a sequence of WMNs: $G_{t}=(V,\mathbf{E}_{t},\mathbf{H}_{t},\mathbf{I}_{t})$ . Here, the subscript $t$ implies a day. $V=\{v_{1},v_{2},...,v_{n}\}$ represents $n$ urban districts. $\mathbf{E}_{t}\in\mathbb{R}^{n\times n}$ is a matrix where the ( $i,j$ )-entry represents the number of trips from the district $v_{i}$ to $v_{j}$ on the day $t$ . We denote the daily web search frequency of residents living in the district $v_{i}$ as a vector $\mathbf{h}_{t}^{v_{i}}\in\mathbb{R}^{n_{w}},1\leq i\leq n$ , so that $\mathbf{H}_{t}=[\mathbf{h}_{t}^{v_{1}},\mathbf{h}_{t}^{v_{2}},...,\mathbf{h}_{t}^{v_{n}}]^{T}\in\mathbb{R}^{n\times n_{w}}$ is the node feature matrix encoding the web search frequency for all districts. $\mathbf{I}_{t}\in\mathbb{R}^{n\times 1}$ is the collection of daily new infection cases in the $n$ different districts.

Prediction task: Our goal is to utilize the infection, mobility, and web search data for the past $D_{1}$ days to predict new infection cases for the next $D_{2}$ days at the urban district level. On the day $T$ , given $G_{t}$ for all $t\in[T-D_{1}+1,T]$ , predict $\mathbf{I}_{t}$ for all $t\in[T+1,T+D_{2}]$ .
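The windowing in the prediction task can be illustrated with a short Python sketch. The function name `make_samples` and the 0-indexed day convention are assumptions made for illustration; the sketch only enumerates which days serve as inputs and which as targets for each reference day $T$.

```python
def make_samples(num_days, d1, d2):
    """Return (input_days, target_days) index pairs for every valid day T.

    Days are 0-indexed. For a reference day T, the inputs cover
    [T - d1 + 1, T] and the targets cover [T + 1, T + d2].
    """
    samples = []
    for T in range(d1 - 1, num_days - d2):
        input_days = list(range(T - d1 + 1, T + 1))
        target_days = list(range(T + 1, T + d2 + 1))
        samples.append((input_days, target_days))
    return samples


# Toy setting: 10 days of data, 3 input days, 2 target days.
samples = make_samples(num_days=10, d1=3, d2=2)
```

With these toy values there are `num_days - d1 - d2 + 1 = 6` samples, which matches the remark later in the paper that each day is associated with one training sample.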

4. Social Awareness-Based Graph Neural Networks

In this section, we first present the intuition of the SAB-GNN model, and then sequentially introduce its three modules: the spatial information propagation module, the social awareness recovery module, and the temporal information passing module, and finally declare the loss function.

We deduce that the future reported district-level infection cases are jointly influenced by the existing infection cases, the number of susceptible individuals that may have been infected (which can be mined from the public’s symptom-related web search data), and in-person contact patterns across the city (which are reflected by the inter-district trip numbers), and we use them as features. Given that pandemic spreading is a complicated temporal process embedded in space, we propose to build temporal and spatial modules (Yao et al., 2018) to track the infection dynamics. Lastly, an existing study finds that the public’s symptom-related web search frequency decreases from the first wave to the second wave, which reveals a prevalent social phenomenon that people’s awareness of a hot topic gradually declines (referred to as social awareness decay in this study) (Yabe et al., 2021). Because people adapt to mask policies, travel restrictions, and routine testing and pay less attention to COVID-19, we propose a social awareness recovery module in the SAB-GNN to estimate the actual occurrence of COVID-19-related symptoms. In summary, we build an integrated future infection case prediction model with three modules (i.e., spatial module, awareness recovery module, and temporal module) by fusing the historical infection, mobility, and web search data.

4.1. Spatial Module: Graph Neural Networks

Recall that the web search frequency vector $\mathbf{h}_{t}^{v_{i}}$ reflects the number of potential infected individuals in the urban district $v_{i}$ and $\mathbf{E}_{t}$ records the daily inter-district trips. We therefore perform the convolution operation on $\mathbf{h}_{t}^{v_{i}}$ using the $\mathbf{E}_{t}$ information under the graph neural network framework to capture the disease risk propagation properties. As shown in Figure 2, using the symptom-related web search frequency in each urban district, we employ the one-hot encoding to initialize the representation for each urban district (i.e., each node in $G_{t}$ ) as the input matrix: $\mathbf{X}_{t}^{(0)}=\mathbf{H}_{t}$ . Following the GCN model (Kipf and Welling, 2016), we define the node representation propagation rule between the layers $k$ and $(k+1)$ as:

$$\mathbf{X}_{t}^{(k+1)}=\sigma\left(\tilde{\mathbf{D}}_{t}^{-\frac{1}{2}}\tilde{\mathbf{E}}_{t}\tilde{\mathbf{D}}_{t}^{-\frac{1}{2}}\mathbf{X}_{t}^{(k)}\mathbf{W}^{(k)}\right), \tag{1}$$

where

$$\tilde{\mathbf{E}}_{t}=\mathbf{E}_{t}+\mathbf{I}_{n\times n},\quad\tilde{\mathbf{D}}_{t,ii}=\sum_{j=1}^{n}\tilde{\mathbf{E}}_{t,ij}, \tag{2}$$

and $\mathbf{W}^{(k)}$ is a learnable weight matrix, $\mathbf{I}_{n\times n}$ is the $n$ by $n$ identity matrix, $\sigma(\cdot)$ is the activation function ReLU.
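To make the propagation rule concrete, here is a minimal pure-Python sketch of a single layer for one day $t$. It is an illustration only: the function names are hypothetical, the matrices are plain nested lists, the weight matrix is fixed rather than learned, and the trip matrix is used exactly as passed in.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(E, X, W):
    """One propagation step: ReLU(D^-1/2 (E + I) D^-1/2 X W)."""
    n = len(E)
    # Add self-loops: E_tilde = E + I.
    E_tilde = [[E[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Diagonal degree entries are the row sums of E_tilde (always >= 1 here,
    # because E is non-negative and the self-loop adds 1).
    d_inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in E_tilde]
    A_hat = [
        [d_inv_sqrt[i] * E_tilde[i][j] * d_inv_sqrt[j] for j in range(n)]
        for i in range(n)
    ]
    Z = matmul(matmul(A_hat, X), W)
    # ReLU activation, applied element-wise.
    return [[max(0.0, z) for z in row] for row in Z]


# Toy example: two districts, one web search feature, identity weight.
E = [[0.0, 1.0], [1.0, 0.0]]
X = [[1.0], [3.0]]
out = gcn_layer(E, X, [[1.0]])
out_negative = gcn_layer(E, X, [[-1.0]])
```

In the toy example every entry of the normalized matrix is 0.5, so both districts receive the average of the two input features (about 2.0); with a negative weight the ReLU clips both outputs to zero.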

Note that we normalize the matrix $\mathbf{E}_{t}$ such that the sum of each column is equal to 1 (i.e., the weights of the incoming edges of each node sum to 1), as in an existing study (Panagopoulos et al., 2021). In practice, it is possible to replace the spectral convolution with other GNN variants such as GAT (Veličković et al., 2017) and GraphSage (Hamilton et al., 2017). We also implement them and find that their prediction performance is close to that of the GCN. The outcome of the spatial module is a matrix

$$\mathbf{H}_{t}^{S}=\mathbf{X}_{t}^{(K)}=[\mathbf{x}_{t}^{v_{1}},\mathbf{x}_{t}^{v_{2}},\ldots,\mathbf{x}_{t}^{v_{n}}]^{T}, \tag{3}$$

with $n$ rows that encode the web search frequency and mobility, where $K$ is the number of layers.

4.2. Social Awareness Recovery Module

The symptom-related web search frequency is positively related to the number of actual symptom occurrences among the population (Yabe et al., 2021). As mentioned earlier, the social awareness decay effect implies that the probability of searching for symptom-related words gradually decreases with time after the first COVID-19 wave. To estimate the actual symptom occurrences, we propose to multiply the observed web search representation by a monotonically increasing function of time.

Specifically, we first linearly normalize each entry of the web search record vector $\mathbf{h}_{t}^{v_{i}}$ to the range between 0 and 1 using the maximal and minimal web search frequency of each word across all days in the urban district $v_{i}$ , feed the normalized vectors into the spatial module, and obtain $\mathbf{H}_{t}^{S}$ . Next, we define an increasing function $r(t|i,t_{0})=e^{\lambda_{i}^{2}(t-t_{0})}$ of $t$ to recover the social awareness:

$$\tilde{\mathbf{x}}_{t}^{v_{i}}=\mathbf{x}_{t}^{v_{i}}r(t|i,t_{0})=\mathbf{x}_{t}^{v_{i}}e^{\lambda_{i}^{2}(t-t_{0})}, \tag{4}$$

where $\lambda_{i}^{2}$ is learnable and measures the social awareness recovery rate in $v_{i}$ , and $t$ and $t_{0}$ represent the current day and the first day of the study period, respectively. Given that discrepancies in land use, economy type, and demographic characteristics may lead to spatially varying social awareness decay rates, we specify a district-dependent $\lambda_{i}^{2}$ to encode each district’s unique awareness decay behavior. A large value of $\lambda_{i}^{2}$ implies that social awareness of COVID-19 among residents living in $v_{i}$ declines rapidly, and we therefore adopt this large value to recover the social awareness. Note that we introduce the square in $\lambda_{i}^{2}$ to ensure that it is non-negative. Collectively, the social awareness recovery module transforms $\mathbf{H}_{t}^{S}$ to $\tilde{\mathbf{H}}_{t}^{S}$ by

$$\tilde{\mathbf{H}}_{t}^{S}=\mathbf{H}_{t}^{S}\circ\mathbf{M}_{t,t_{0}}, \tag{5}$$

where $\mathbf{M}_{t,t_{0}}$ is the awareness recovery matrix (ARM):

$$\mathbf{M}_{t,t_{0}}=\begin{bmatrix}e^{\lambda_{1}^{2}(t-t_{0})}&e^{\lambda_{1}^{2}(t-t_{0})}&\cdots&e^{\lambda_{1}^{2}(t-t_{0})}\\ e^{\lambda_{2}^{2}(t-t_{0})}&e^{\lambda_{2}^{2}(t-t_{0})}&\cdots&e^{\lambda_{2}^{2}(t-t_{0})}\\ \vdots&\vdots&\ddots&\vdots\\ e^{\lambda_{n}^{2}(t-t_{0})}&e^{\lambda_{n}^{2}(t-t_{0})}&\cdots&e^{\lambda_{n}^{2}(t-t_{0})}\end{bmatrix}, \tag{6}$$

and $\circ$ is the Hadamard product. This transform implies that the entries in $\mathbf{H}_{t}^{S}$ are amplified to capture the actual disease risk which is underestimated due to the social awareness decay effect when $t$ is large. In the end, we perform 0-1 normalization on the infection matrix $\mathbf{I}_{t}$ to obtain $\tilde{\mathbf{I}}_{t}$ , and arrive at the output of the social awareness recovery module by concatenating the web search and infection representations, which is

$$\mathbf{H}_{t}^{A}=[\tilde{\mathbf{H}}_{t}^{S},\tilde{\mathbf{I}}_{t}]. \tag{7}$$
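The recovery and concatenation steps can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the values of $\lambda_{i}$ are fixed by hand rather than learned, and the spatial-module output and normalized infection values are made-up numbers.

```python
import math


def recover_awareness(H_S, lambdas, t, t0):
    """Scale row i of H_S by exp(lambda_i^2 * (t - t0)).

    Squaring lambda_i keeps the exponent non-negative, so the factor is
    at least 1 and grows with the number of days since t0.
    """
    return [
        [x * math.exp(lam ** 2 * (t - t0)) for x in row]
        for row, lam in zip(H_S, lambdas)
    ]


def concat_infection(H_S_tilde, I_tilde):
    """Append each district's normalized infection value as the last column."""
    return [row + [inf] for row, inf in zip(H_S_tilde, I_tilde)]


# Toy values (made up): two districts, two features, day 100 of the study.
H_S = [[1.0, 2.0], [0.5, 0.5]]
lambdas = [0.0, 0.1]  # hand-picked, not learned
H_S_tilde = recover_awareness(H_S, lambdas, t=100, t0=0)
H_A = concat_infection(H_S_tilde, [0.2, 0.4])
```

Here the first district has $\lambda_{1}=0$ and is left unchanged, while the second has $\lambda_{2}^{2}(t-t_{0})=0.01\times 100=1$ , so its row is amplified by a factor of about $e\approx 2.718$ .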

4.3. Temporal Module: LSTM

To capture the temporal dependency of district-level features and infection cases, we adopt the existing LSTM model (Hochreiter and Schmidhuber, 1997). For $i\in\{1,2,...,n\}$ , we extract the $i$ -th row of matrices $\mathbf{H}_{t}^{A},t\in[T-D_{1}+1,T]$ and feed them into an LSTM sequence (Figure 2). Since we have already modelled the spatial dependency in the spatial module, in the temporal module we pass the node representations from $\mathbf{H}_{t}^{A}$ separately for different nodes, and these LSTM sequences share the identical structures and parameters.
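The regrouping described above, from one matrix per day to one sequence per district, can be sketched as follows. The helper name `node_sequences` and the toy values are assumptions for illustration; the LSTM itself is not implemented here.

```python
def node_sequences(H_A_by_day):
    """Regroup daily n x f matrices into one length-D1 sequence per district.

    H_A_by_day: list over the D1 input days; each item is a list of n rows
    (one feature vector per district).
    Returns: list over the n districts; each item is the list of that
    district's D1 feature vectors ordered by day.
    """
    n = len(H_A_by_day[0])
    return [[day[i] for day in H_A_by_day] for i in range(n)]


# Toy input: D1 = 3 days, n = 2 districts, 2 features per district.
H_A_by_day = [
    [[0.1, 0.0], [0.5, 0.2]],
    [[0.2, 0.1], [0.6, 0.3]],
    [[0.3, 0.2], [0.7, 0.4]],
]
sequences = node_sequences(H_A_by_day)
```

Because every district's sequence is processed by the same recurrent network, the $n$ sequences could be treated as a batch of independent inputs in a typical deep learning framework.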

4.4. Loss Function

Recall that our objective is to predict infection cases for the next $D_{2}$ days. We therefore define the loss function as the mean squared error:

$$\mathcal{L}_{T}=\frac{1}{D_{2}n}\sum_{t=T+1}^{T+D_{2}}\sum_{i=1}^{n}(I_{t,i}-\hat{I}_{t,i})^{2}, \tag{8}$$

where $I_{t,i}$ and $\hat{I}_{t,i}$ denote the actual and predicted infection cases for the district $i$ on the day $t$ . We present the training process as an algorithm in the Supplement.

5. Data

5.1. Mobility, Web Search, and Symptom Data

We utilize four datasets: infection cases, mobile phone location data, web search data, and symptom data in Tokyo from Jan. 6, 2020, to May 15, 2021. The mobility and web search data were owned by Yahoo Japan Corporation with strict privacy protection regulations. The descriptions of these data are as follows:

• Infection data. We access the daily new infection cases for the 23 wards of Tokyo from the Tokyo COVID-19 Task Force website (Tokyo COVID-19 Task Force Website, 2021), which is maintained by the Tokyo Metropolitan Government and open to the public.
• Mobility data. Table 2 in Supplement presents a sample piece of raw mobility data. The mobility data consists of an anonymized ID (after privacy protection operations) and the user’s geolocation in terms of longitude and latitude at a specific time. The data collection interval was around 30 minutes on average and depends on the movements of the user: records are collected more frequently if the user moves faster. In this study, the mobility data is utilized to identify Tokyo residents’ home locations and mine the Mobility Matrix $\mathbf{E}_{t}$ .
• Web search data. Table 2 in Supplement presents a sample of a raw web search record. Each record contains the user ID, web query words, and web search time. The web search data is used to obtain COVID-19 symptom-related web search counts. Note that Yahoo Japan Corporation maintains these data to improve user experience and does not share any individual-level data with other agencies.
• Symptom data. Based on two existing COVID-19 symptom studies (Sarker et al., 2020; Wang et al., 2021) and the COVID-19 symptom list released by the Centers for Disease Control and Prevention (Centers for Disease Control and Prevention, 2021), we specify 44 symptoms, which are used to measure the counts of symptom-related web search records. These symptoms are shown in Table 3 in Supplement. We translate these symptoms from English to Japanese, and then count the number of corresponding web searches in Japanese. Note that a few symptoms account for a large number of web search counts, and we therefore utilize the most frequent $1\leq n_{w}\leq 44$ symptoms in our prediction framework.
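Selecting the most frequent $n_{w}$ symptoms amounts to a simple ranking by total search count, as the sketch below shows. The function name `top_symptoms`, the tie-breaking rule (alphabetical), and the counts are assumptions for illustration and are not values from the dataset.

```python
def top_symptoms(total_counts, n_w):
    """Return the n_w symptoms with the largest total search counts.

    total_counts: dict mapping symptom word -> total number of searches.
    Ties are broken alphabetically so that the result is deterministic.
    """
    ranked = sorted(total_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:n_w]]


# Made-up counts for four symptom words.
counts = {"fever": 120, "cough": 80, "headache": 95, "chills": 10}
selected = top_symptoms(counts, n_w=2)
```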

5.2. Data Preprocessing

Note that the raw mobility and web search data are at the individual level. To prepare the features $\mathbf{E}_{t}$ and $\mathbf{H}_{t}$ used in the machine-learning framework, we preprocess the raw mobility and web search data by aggregating them to the urban district level. Please find the complete description of data preprocessing in the Appendix.

The statistics of the aggregate daily new infection cases, web search counts, and inter-district trip counts are shown in Figure 3. The vertical dashed lines in Figures 3a and 3c mark March 1, 2020, May 1, 2020, and January 1, 2021, whose web search distribution and mobility flow are visualized in Figures 3b and 3d, respectively. We find from Figure 3a that the symptom-related web search dynamics exhibit peaks consistent with those of daily infection cases, especially near April 2020, January 2021, and May 2021, which provides direct evidence that motivates us to utilize the web search data in the multiwave pandemic prediction.

6. Results

6.1. Setting, Evaluation Metrics, Baselines

We conduct experiments to predict two outbreak waves, i.e., the third wave (from Dec. 10, 2020 to Feb. 7, 2021) and the fourth wave (from March 17 to May 15, 2021). The period for each experiment covers 10 months of observations where the train/validation/test ratio is 70%/10%/20%. We use the first 8 months (months 1 to 8) as training and validation data, and the last 2 months (months 9 and 10) as testing data. Note that the validation data has the size of 1 month and is evenly distributed from months 7 to 8. Since most existing disease prediction studies (Panagopoulos et al., 2021; Rodriguez et al., 2021; Wang et al., 2020) predict the infection at the week-level, we design three scenarios: $(D_{1},D_{2})=(21,7),(21,14),(21,21)$ (recall that we use the past $D_{1}$ days’ features to predict the future $D_{2}$ days’ infection cases).

We train the model using PyTorch and Adam optimizer (Kingma and Ba, 2014). We use the validation set to determine the learning rate as 1e-4, the epoch number, the batch size, and the dropout rate as 100, 8, and 0.50, respectively. The experiments run on an Intel Xeon w-2155 3.3 GHz CPU and 32 GB of RAM. Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) are utilized to evaluate the prediction performance: RMSE = $\sqrt{\frac{1}{D_{2}n}\sum_{t=T+1}^{T+D_{2}}\sum_{i=1}^{n}(I_{t,i}-\hat{I}_{t,i})^{2}}$ , MAE = $\frac{1}{D_{2}n}\sum_{t=T+1}^{T+D_{2}}\sum_{i=1}^{n}|I_{t,i}-\hat{I}_{t,i}|$ .
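For reference, the two evaluation metrics can be computed as in the following sketch, which takes the actual and predicted cases as nested lists indexed by day and district. The function names and the toy numbers are illustrative assumptions.

```python
import math


def rmse(actual, predicted):
    """Root mean square error over all (day, district) pairs."""
    squared = [
        (a - p) ** 2
        for row_a, row_p in zip(actual, predicted)
        for a, p in zip(row_a, row_p)
    ]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error over all (day, district) pairs."""
    absolute = [
        abs(a - p)
        for row_a, row_p in zip(actual, predicted)
        for a, p in zip(row_a, row_p)
    ]
    return sum(absolute) / len(absolute)


# Toy example: D2 = 2 days, n = 2 districts (made-up numbers).
actual = [[10, 20], [30, 40]]
predicted = [[12, 18], [33, 36]]
rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
```

In the toy example the absolute errors are 2, 2, 3, and 4, so the MAE is 2.75 and the RMSE is $\sqrt{33/4}\approx 2.87$ .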

To benchmark our model, we also implement 9 existing prediction models: (1) Historical average (HA) of infection cases until the day $T$ ; (2) Historical average of the last $D_{1}$ days; (3,4) LSTM; (5,6) Seq2seq: encode the input infection cases and decode the sequence using two separate LSTMs (Cho et al., 2014); (7) ST-GNN: a Spatio-Temporal GNN model whose inputs are past infection cases and mobility patterns (Kapoor et al., 2020); (8) MPNN+LSTM: a message passing neural network with the LSTM (Panagopoulos et al., 2021); (9) GraphLSTM: a prediction framework integrating GraphSage and LSTM (Sesti et al., 2021).
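The two historical average baselines are simple enough to sketch directly. The function below is an illustration under assumptions: the name `historical_average` is hypothetical, days are 0-indexed, and the same per-district average is assumed to be issued for every day of the forecast horizon.

```python
def historical_average(cases, T, window=None):
    """Per-district average of daily cases up to and including day T.

    cases: list over days of per-district case counts.
    window: if None, average over all days 0..T (baseline 1);
            otherwise average over the last `window` days (baseline 2).
    """
    start = 0 if window is None else max(0, T - window + 1)
    history = cases[start:T + 1]
    n = len(history[0])
    return [sum(day[i] for day in history) / len(history) for i in range(n)]


# Toy data (made up): four days, two districts.
cases = [[1, 10], [2, 20], [3, 30], [4, 40]]
ha_all = historical_average(cases, T=3)
ha_recent = historical_average(cases, T=3, window=2)
```

When cases are rising, as in the toy data, the windowed average tracks the recent level more closely than the all-history average.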

6.2. Comparisons with Baseline Models

We present the prediction performance of the SAB-GNN and other baseline models for the third wave and the fourth wave (Table 1). Our proposed SAB-GNN model yields the smallest MAE and RMSE for most scenarios (5 out of 6), which demonstrates that our model captures the relationships between past infection, web search, mobility, and future infection. Comparing SAB-GNN and SAB-GNN-wsa (without social awareness recovery), the social awareness recovery mechanism decreases the prediction error for most cases (5 out of 6), and the web search frequency has higher predictability after the awareness recovery operation.

		( $D_{1},D_{2}$ ) for the third wave						( $D_{1},D_{2}$ ) for the fourth wave
Model	Features	(21,7)		(21,14)		(21,21)		(21,7)		(21,14)		(21,21)
		MAE	RMSE	MAE	RMSE	MAE	RMSE	MAE	RMSE	MAE	RMSE	MAE	RMSE
HA (all)	I	22.63	26.45	22.53	27.22	22.16	27.65	8.76	10.04	9.14	10.59	9.41	11.08
HA (X days)	I	14.15	17.02	16.92	20.72	19.14	23.81	4.52	5.76	5.52	7.02	6.47	8.23
LSTM	I	19.67	24.07	17.23	21.93	20.61	26.61	5.94	7.67	8.20	10.64	8.32	10.95
LSTM	I/W	20.61	24.89	20.72	25.91	19.93	25.94	7.43	9.79	8.28	10.55	8.11	10.63
Seq2seq	I	17.27	22.24	23.04	28.73	16.79	22.77	6.37	8.38	8.84	11.42	8.45	11.08
Seq2seq	I/W	15.69	19.83	21.76	27.46	19.66	25.69	5.96	7.74	7.93	9.86	10.43	12.94
ST-GNN (Kapoor et al., 2020)	I/W/M	20.20	25.86	21.53	27.32	20.24	26.36	5.46	6.95	7.63	9.97	10.54	13.96
MPNN+LSTM (Panagopoulos et al., 2021)	I/W/M	12.40	17.31	16.30	21.73	19.78	25.59	3.34	4.49	5.27	7.97	4.80	6.56
GraphLSTM (Sesti et al., 2021)	I/W/M	10.22	12.82	13.27	16.91	15.56	20.07	3.44	4.61	4.38	5.88	5.32	7.13
SAB-GNN-wsa	I/W/M	10.75	13.43	12.72	16.33	15.61	20.10	3.32	4.46	4.63	6.18	5.03	6.77
SAB-GNN	I/W/M	8.03	10.43	11.23	14.78	13.76	18.24	3.25	4.24	4.28	5.57	5.25	6.82

Table 1. Performance evaluation using past $D_{1}$ days’ features to predict next $D_{2}$ days’ infection ( $I$ : infection data; $W$ : web search data; $M$ : inter-district mobility data; SAB-GNN-wsa: the SAB-GNN model without the social awareness recovery).

Besides, results from all models reveal a general tendency that the web search and mobility features result in prediction performance improvement, even if the past infection case is already a strong feature to connect with the future infection case. Finally, we observe from the LSTM and Seq2seq that simply concatenating the web search frequency embedding with the past infection cases is unable to consistently boost the prediction, which affirms the necessity of designing a suitable model architecture to utilize the web search data. For each model in Table 1, the running time for each scenario is within 20 minutes on the Ubuntu system with 32 GB RAM and 3.3 GHz Xeon w-2155 CPU.

6.3. Prediction Results for Different Urban Districts

We visualize the prediction errors and compare the predicted cases with the actual cases for each urban district in Figures 4 and 5. As shown in Figures 4a and 4b, the relative prediction errors for most districts are uniformly lower than 0.50, except for the central district (i.e., Chiyoda). This reveals that SAB-GNN mines the relationships between features and future infection cases in a global manner thanks to the message passing mechanism among neighborhood nodes. The relatively large prediction error in Chiyoda (i.e., District 1) arises because Chiyoda, where the Imperial Palace is located, has only a small number of actual infection cases (Figure 5, the top-left panel). Figure 4c shows that the predicted daily case curve captures the increasing tendency of actual cases from day 240 to day 280 while maintaining a small prediction error, which further demonstrates the power of our proposed SAB-GNN.

Figure 5 shows that for most urban districts, the model is able to anticipate the general increasing tendency of infection cases during the fourth wave. This information is especially valuable for local public agencies to make timely preparations at the beginning of a wave. Note that the prediction for District 4 (i.e., Shinjuku) is not as accurate as for other districts. One potential interpretation is that Shinjuku is a commercial area with many entertainment businesses and thus does not have infection patterns similar to those of other districts.

6.4. Parameter Sensitivity

The influence of the SAB-GNN model parameters on the prediction performance is shown in Figure 6. Figure 6a displays the change in RMSE for the fourth peak prediction with respect to $n_{w}$ (i.e., the number of symptom-related web search features used in the model). Recall that search frequencies differ greatly across symptoms and that we only feed the $n_{w}$ most frequent symptoms into our model. The result suggests that $n_{w}=8$ provides the best prediction performance. The underlying reason is as follows: when $n_{w}$ is small, adding more symptoms strengthens the connection with future infection cases; when $n_{w}$ is large, many weakly related symptoms introduce noise into the training and thus harm the prediction performance.
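The selection of the $n_{w}$ most frequent symptoms can be illustrated with a minimal Python sketch. The function name `top_symptoms` and the toy counts are assumptions made for illustration only; they are not taken from the released code or the actual search logs.

```python
from typing import Dict, List


def top_symptoms(search_counts: Dict[str, int], n_w: int) -> List[str]:
    """Return the n_w most frequently searched symptoms, most frequent first.

    `search_counts` maps a symptom name to its total search count
    (hypothetical input format, assumed for this sketch).
    """
    # Sort by descending count; break ties alphabetically for determinism.
    ranked = sorted(search_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:n_w]]


# Toy counts, invented purely for illustration.
toy_counts = {"Cough": 900, "Pyrexia": 1200, "Headache": 700, "Rash": 15, "Wheezing": 4}
selected = top_symptoms(toy_counts, n_w=3)
print(selected)
```

Only the columns of the web search matrix that correspond to the selected symptoms would then be kept as model input.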

Next, we tune the numbers of layers $L_{1},L_{2}$ in the GNN and LSTM modules (Figures 6b and 6c) and conclude that $(L_{1},L_{2})=(1,2)$ yields the best prediction. It is reasonable that a simple model with a small number of parameters is preferred in this infection case prediction task, whose training set contains only hundreds of samples (Qiu et al., 2021) (each day is associated with one training sample). Finally, we test the model in Figure 6d with varying $D_{1}$ and $D_{2}$ and observe the overall tendency that using more past days (i.e., large $D_{1}$) to predict fewer future days (i.e., small $D_{2}$) results in higher accuracy, which is consistent with the finding of an existing study (Panagopoulos et al., 2021).
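To make the roles of $D_{1}$ and $D_{2}$ concrete, the following Python sketch builds one training sample per day $T$ from a plain series of daily case counts: the history covers days $T-D_{1}+1,\dots,T$ and the target covers days $T+1,\dots,T+D_{2}$. It is a simplification under stated assumptions: the actual model input is the graph $G_{t}$ rather than a scalar series, and the name `make_samples` is hypothetical.

```python
from typing import List, Sequence, Tuple


def make_samples(
    cases: Sequence[float], d1: int, d2: int
) -> List[Tuple[List[float], List[float]]]:
    """Build (history, target) pairs: D1 past days predict the next D2 days."""
    samples = []
    # T is the 0-based index of the last observed day in the history window.
    for T in range(d1 - 1, len(cases) - d2):
        history = list(cases[T - d1 + 1 : T + 1])  # days T-D1+1 .. T
        target = list(cases[T + 1 : T + d2 + 1])   # days T+1 .. T+D2
        samples.append((history, target))
    return samples


# Toy daily case counts, invented for illustration.
daily_cases = [3, 5, 8, 13, 21, 34, 55, 89]
pairs = make_samples(daily_cases, d1=3, d2=2)
print(len(pairs), pairs[0], pairs[-1])
```

A series of $n$ days yields $n-D_{1}-D_{2}+1$ samples, which is why roughly one year of data gives a training set of only a few hundred samples.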

6.5. Ablation Study

To confirm the effect of each module in the SAB-GNN on the prediction results, we perform ablation experiments for both the third wave and the fourth wave (Figure 7). We remove one of the three modules in the SAB-GNN at a time and perform the prediction for the same temporal horizons as the full SAB-GNN model. We show quantitatively that the full SAB-GNN model obtains a lower prediction RMSE than the incomplete models, especially for the third wave, which reveals that the spatial module, the temporal module, and the social awareness module all contribute positively to the final prediction results. While existing studies have paid much attention to the spatial and temporal relationships between features and infection cases, we recommend introducing suitable social knowledge into the prediction, given the positive effect of the social awareness recovery mechanism.
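For reference, the RMSE used to compare model variants can be computed as in the sketch below. This is the standard definition applied to a flat list of predicted and actual values; how the values are pooled over districts and days in the experiments is not restated here, so that aggregation is an assumption.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root mean squared error between two equally long sequences."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy values, invented for illustration.
print(rmse([10.0, 12.0, 15.0], [11.0, 12.0, 13.0]))
```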

7. DISCUSSION

Model performance analysis: The superior prediction of SAB-GNN has three underlying reasons: (1) SAB-GNN leverages the power of GNN and LSTM to capture the spatio-temporal dynamics of disease spreading; (2) web search features provide extra information as a signal of unconfirmed symptoms; (3) the social memory decay module properly reflects the decay of social awareness as people get accustomed to COVID-19 during the pandemic. Note that we do not quantitatively test epidemiological models (e.g., SIR, SEIR) for two reasons: (1) vanilla epidemiological models provide a good fit to a single wave or the first wave, but are insufficient to describe multiwave infection outbreaks; (2) quantitative experiments have demonstrated that the predictive power of the SEIR model is considerably weaker than that of neural network models (Schwabe et al., 2021).
Policy implications: This study demonstrates that web search data serves as a qualified pandemic surveillance indicator. It can complement existing mobility-based prediction models in various prediction platforms such as the COVID-19 Forecast Hub (Cramer et al., 2021) and the COVID-19 Mobility Impact Dashboard (Chang et al., 2021). Besides, policymakers can apply our tool to regional user-generated text data collected from social media platforms (such as Twitter, Baidu, and Weibo) to forecast the next disease outbreak and make timely medical and social decisions, as part of the long-term battle against COVID-19.
Limitations: We acknowledge two limitations in this study. First, we cannot ascertain that the mobility and web search data from Yahoo! Japan evenly cover the Tokyo population from young children to old people. Extra work on the representativeness of the mobility and web search data with respect to the whole population is needed before real-world deployment. Second, we assume an exponential decay of social awareness, which lacks rigorous empirical and theoretical evidence. Future work can build a more solid decay function starting from the classical Collective Memory Theory from the field of cognitive science, which describes the relationship between information (e.g., symptom-related web search) and human identity (e.g., infected, uninfected) (Roediger III and Abel, 2015).

8. CONCLUSION

Motivated by the multiwave outbreak of COVID-19 across the globe, we establish the SAB-GNN to predict future infection cases. In addition to the historical infection and mobility data, our approach utilizes novel symptom-related web search data, which provide alternative evidence of future waves. More importantly, we consider the social awareness decay effect and propose the social awareness recovery module to estimate the actual infection risks. Experiments on the third and fourth peaks in Tokyo affirm that the SAB-GNN outperforms the baseline models and captures the increasing trend of pandemic waves. Our method is applicable to many countries given the wide coverage of web search data.

ACKNOWLEDGEMENTS

We thank Yahoo Japan Corporation for the collaboration. Jiawei Xue and Satish Ukkusuri are partly supported by the National Science Foundation (Award Number: 1638311) grant for which the authors are grateful.

References

Adiga et al. (2021) Aniruddha Adiga, Lijing Wang, Benjamin Hurt, Akhil Peddireddy, Przemyslaw Porebski, Srinivasan Venkatramanan, Bryan Leroy Lewis, and Madhav Marathe. 2021. All models are useful: Bayesian ensembling for robust high resolution covid-19 forecasting. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2505–2513.
Badr and Gardner (2021) Hamada S Badr and Lauren M Gardner. 2021. Limitations of using mobile phone data to model COVID-19 transmission in the USA. The Lancet Infectious Diseases 21, 5 (2021), e113.
Bastings et al. (2017) Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Sima’an. 2017. Graph Convolutional Encoders for Syntax-aware Neural Machine Translation. In EMNLP.
Centers for Disease Control and Prevention (2021) Centers for Disease Control and Prevention. 2021. Symptoms of COVID-19. https://www.cdc.gov/coronavirus/2019-ncov/symptoms-testing/symptoms.html.
Chang et al. (2020) Serina Chang, Emma Pierson, Pang Wei Koh, Jaline Gerardin, Beth Redbird, David Grusky, and Jure Leskovec. 2020. Mobility network models of COVID-19 explain inequities and inform reopening. Nature (2020), 1–6.
Chang et al. (2021) Serina Chang, Mandy L Wilson, Bryan Lewis, Zakaria Mehrab, Komal K Dudakiya, Emma Pierson, Pang Wei Koh, Jaline Gerardin, Beth Redbird, David Grusky, et al. 2021. Supporting covid-19 policy response with large-scale mobility-based modeling. In SIGKDD. 2632–2642.
Chen et al. (2019) Cen Chen, Kenli Li, Sin G Teo, Xiaofeng Zou, Kang Wang, Jie Wang, and Zeng Zeng. 2019. Gated residual recurrent graph neural networks for traffic prediction. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 485–492.
Cheng (1995) Yizong Cheng. 1995. Mean shift, mode seeking, and clustering. IEEE transactions on pattern analysis and machine intelligence 17, 8 (1995), 790–799.
Chimmula and Zhang (2020) Vinay Kumar Reddy Chimmula and Lei Zhang. 2020. Time series forecasting of COVID-19 transmission in Canada using LSTM networks. Chaos, Solitons & Fractals (2020), 109864.
Chinazzi (2021) Matteo Chinazzi. accessed August 2021. Mobility, commuting, and contact patterns across the United States during the COVID-19 outbreak. Available online at https://covid19.gleamproject.org/mobility.
Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
Cramer et al. (2021) Estee Y Cramer et al. 2021. The United States COVID-19 Forecast Hub dataset. medRxiv (2021). https://doi.org/10.1101/2021.11.04.21265886
Gao et al. (2021) Junyi Gao, Rakshith Sharma, Cheng Qian, Lucas M Glass, Jeffrey Spaeder, Justin Romberg, Jimeng Sun, and Cao Xiao. 2021. STAN: spatio-temporal attention network for pandemic prediction using real-world evidence. Journal of the American Medical Informatics Association 28, 4 (2021), 733–743.
Goel et al. (2010) Sharad Goel, Jake M Hofman, Sébastien Lahaie, David M Pennock, and Duncan J Watts. 2010. Predicting consumer behavior with Web search. Proceedings of the National academy of sciences 107, 41 (2010), 17486–17490.
Google Inc. (2021) Google Inc. 2021. Covid-19 infection cases. https://news.google.com/covid19/map?hl=en-US&gl=US&ceid=US%3Aen.
Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in neural information processing systems. 1024–1034.
Hisada et al. (2020) Shohei Hisada, Taichi Murayama, Kota Tsubouchi, Sumio Fujita, Shuntaro Yada, Shoko Wakamiya, and Eiji Aramaki. 2020. Surveillance of early stage COVID-19 clusters using search query logs and mobile device-based location information. Scientific Reports 10, 1 (2020), 1–8.
Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
Huang et al. (2020) Xiao Huang, Zhenlong Li, Yuqin Jiang, Xiaoming Li, and Dwayne Porter. 2020. Twitter reveals human mobility dynamics during the COVID-19 pandemic. PloS one 15, 11 (2020), e0241957.
Jo et al. (2020) HyeongChan Jo, Juhyun Kim, Tzu-Chen Huang, and Yu-Li Ni. 2020. condLSTM-Q: A novel deep learning model for predicting Covid-19 mortality in fine geographical Scale. arXiv preprint arXiv:2011.11507 (2020).
Ju et al. (2021) Mingxuan Ju, Wei Song, Shiyu Sun, Yanfang Ye, Yujie Fan, Shifu Hou, Kenneth Loparo, and Liang Zhao. 2021. Dr. Emotion: Disentangled Representation Learning for Emotion Analysis on Social Media to Improve Community Resilience in the COVID-19 Era and Beyond. In Proceedings of the Web Conference 2021. 518–528.
Kang et al. (2020) Yuhao Kang, Song Gao, Yunlei Liang, Mingxiao Li, Jinmeng Rao, and Jake Kruse. 2020. Multiscale dynamic human mobility flow dataset in the US during the COVID-19 epidemic. Scientific data 7, 1 (2020), 1–13.
Kapoor et al. (2020) Amol Kapoor, Xue Ben, Luyang Liu, Bryan Perozzi, Matt Barnes, Martin Blais, and Shawn O’Banion. 2020. Examining covid-19 forecasting using spatio-temporal graph neural networks. arXiv preprint arXiv:2007.03113 (2020).
Kargas et al. (2021) Nikos Kargas, Cheng Qian, Nicholas D Sidiropoulos, Cao Xiao, Lucas M Glass, Jimeng Sun, et al. 2021. STELAR: Spatio-temporal tensor factorization with latent epidemiological regularization. In 35th AAAI Conference on Artificial Intelligence (AAAI).
Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
Kufel et al. (2020) Tadeusz Kufel et al. 2020. ARIMA-based forecasting of the dynamics of confirmed Covid-19 cases for selected European countries. Equilibrium. Quarterly Journal of Economics and Economic Policy 15, 2 (2020), 181–204.
Leung et al. (2020) Kathy Leung, Joseph T Wu, Di Liu, and Gabriel M Leung. 2020. First-wave COVID-19 transmissibility and severity in China outside Hubei after control measures, and second-wave scenario planning: a modelling impact assessment. The Lancet 395, 10233 (2020), 1382–1393.
Levin et al. (2021) Roman Levin, Dennis L Chao, Edward A Wenger, and Joshua L Proctor. 2021. Insights into population behavior during the COVID-19 pandemic from cell phone mobility data and manifold learning. Nature Computational Science 1, 9 (2021), 588–597.
Moghadas et al. (2020) Seyed M Moghadas, Affan Shoukat, Meagan C Fitzpatrick, Chad R Wells, Pratha Sah, Abhishek Pandey, Jeffrey D Sachs, Zheng Wang, Lauren A Meyers, Burton H Singer, et al. 2020. Projecting hospital utilization during the COVID-19 outbreaks in the United States. Proceedings of the National Academy of Sciences 117, 16 (2020), 9122–9126.
Narasimhan et al. (2018) Medhini Narasimhan, Svetlana Lazebnik, and Alexander G Schwing. 2018. Out of the box: Reasoning with graph convolution nets for factual visual question answering. Advances in Neural Information Processing Systems 2018 (2018), 2654–2665.
Organization (2022) World Health Organization. accessed June 2022. Coronavirus disease (COVID-19) pandemic. Available online at https://covid19.who.int/table.
Panagopoulos et al. (2021) George Panagopoulos, Giannis Nikolentzos, and Michalis Vazirgiannis. 2021. Transfer Graph Neural Networks for Pandemic Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 4838–4845.
Peng et al. (2020) Hao Peng, Hongfei Wang, Bowen Du, Md Zakirul Alam Bhuiyan, Hongyuan Ma, Jianwei Liu, Lihong Wang, Zeyu Yang, Linfeng Du, Senzhang Wang, et al. 2020. Spatial temporal incidence dynamic graph neural networks for traffic flow forecasting. Information Sciences 521 (2020), 277–290.
Qian et al. (2021) Xinwu Qian, Lijun Sun, and Satish V Ukkusuri. 2021. Scaling of contact networks for epidemic spreading in urban transit systems. Scientific reports 11, 1 (2021), 1–12.
Qian et al. (2020) Xinwu Qian, Jiawei Xue, and Satish V Ukkusuri. 2020. Modeling disease spreading with adaptive behavior considering local and global information dissemination. arXiv preprint arXiv:2008.10853 (2020).
Qiu et al. (2021) Yu Qiu, Yun Liu, Shijie Li, and Jing Xu. 2021. MiniSeg: An Extremely Minimum Network for Efficient COVID-19 Segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 4846–4854.
Rodriguez et al. (2021) Alexander Rodriguez, Nikhil Muralidhar, Bijaya Adhikari, Anika Tabassum, Naren Ramakrishnan, and B Aditya Prakash. 2021. Steering a Historical Disease Forecasting Model Under a Pandemic: Case of Flu and COVID-19. In Proceedings of AAAI.
Roediger III and Abel (2015) Henry L Roediger III and Magdalena Abel. 2015. Collective memory: a new arena of cognitive study. Trends in cognitive sciences 19, 7 (2015), 359–361.
Sarker et al. (2020) Abeed Sarker, Sahithi Lakamana, Whitney Hogg-Bremer, Angel Xie, Mohammed Ali Al-Garadi, and Yuan-Chi Yang. 2020. Self-reported COVID-19 symptoms on Twitter: an analysis and a research resource. Journal of the American Medical Informatics Association 27, 8 (2020), 1310–1315.
Schwabe et al. (2021) Amray Schwabe, Joel Persson, and Stefan Feuerriegel. 2021. Predicting COVID-19 Spread from Large-Scale Mobility Data. In KDD 2021.
Sesti et al. (2021) Nathan Sesti, Juan Jose Garau-Luis, Edward Crawley, and Bruce Cameron. 2021. Integrating LSTMs and GNNs for COVID-19 Forecasting. arXiv preprint arXiv:2108.10052 (2021).
Shahid et al. (2020) Farah Shahid, Aneela Zameer, and Muhammad Muneeb. 2020. Predictions for COVID-19 with deep learning models of LSTM, GRU and Bi-LSTM. Chaos, Solitons & Fractals 140 (2020), 110212.
Tokyo COVID-19 Task Force Website (2021) Tokyo COVID-19 Task Force Website. 2021. Covid-19 infection cases. https://github.com/codeforshinjuku/covid19/blob/master/dist/patient.json.
Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
Wang et al. (2021) Jingqi Wang, Noor Abu-el Rub, Josh Gray, Huy Anh Pham, Yujia Zhou, Frank J Manion, Mei Liu, Xing Song, Hua Xu, Masoud Rouhizadeh, et al. 2021. COVID-19 SignSym: a fast adaptation of a general clinical NLP tool to identify and normalize COVID-19 signs and symptoms to OMOP common data model. Journal of the American Medical Informatics Association 28, 6 (2021), 1275–1283.
Wang et al. (2020) Lijing Wang, Aniruddha Adiga, Srinivasan Venkatramanan, Jiangzhuo Chen, Bryan Lewis, and Madhav Marathe. 2020. Examining Deep Learning Models with Multiple Data Sources for COVID-19 Forecasting. In 2020 IEEE International Conference on Big Data (Big Data). IEEE, 3846–3855.
Xiao et al. (2021) Congxi Xiao, Jingbo Zhou, Jizhou Huang, An Zhuo, Ji Liu, Haoyi Xiong, and Dejing Dou. 2021. C-Watcher: A Framework for Early Detection of High-Risk Neighborhoods Ahead of COVID-19 Outbreak. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 4892–4900.
Xiong et al. (2020) Chenfeng Xiong, Songhua Hu, Mofeng Yang, Weiyu Luo, and Lei Zhang. 2020. Mobile device data reveal the dynamics in a positive relationship between human mobility and COVID-19 infections. Proceedings of the National Academy of Sciences 117, 44 (2020), 27087–27089.
Xue et al. (2022) Jiawei Xue, Nan Jiang, Senwei Liang, Qiyuan Pang, Takahiro Yabe, Satish V Ukkusuri, and Jianzhu Ma. 2022. Quantifying the spatial homogeneity of urban road networks via graph neural networks. Nature Machine Intelligence 4, 3 (2022), 246–257.
Yabe et al. (2020) Takahiro Yabe, Kota Tsubouchi, Naoya Fujiwara, Takayuki Wada, Yoshihide Sekimoto, and Satish V Ukkusuri. 2020. Non-compulsory measures sufficiently reduced human mobility in Tokyo during the COVID-19 epidemic. Scientific reports 10, 1 (2020), 1–9.
Yabe et al. (2021) Takahiro Yabe, Kota Tsubouchi, Yoshihide Sekimoto, and Satish V Ukkusuri. 2021. Early warning of COVID-19 hotspots using human mobility and web search query data. Computers, Environment and Urban Systems (2021), 101747.
Yabe et al. (2019) Takahiro Yabe, Kota Tsubouchi, Toru Shimizu, Yoshihide Sekimoto, and Satish V Ukkusuri. 2019. Predicting Evacuation Decisions using Representations of Individuals’ Pre-Disaster Web Search Behavior. In SIGKDD. 2707–2717.
Yan et al. (2021) Rui Yan, Weiheng Liao, Jianwei Cui, Hailei Zhang, Yichuan Hu, and Dongyan Zhao. 2021. Multilingual COVID-QA: Learning towards Global Information Sharing via Web Question Answering in Multiple Languages. In Proceedings of the Web Conference 2021. 2590–2600.
Yao et al. (2019) Huaxiu Yao, Xianfeng Tang, Hua Wei, Guanjie Zheng, and Zhenhui Li. 2019. Revisiting spatial-temporal similarity: A deep learning framework for traffic prediction. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 5668–5675.
Yao et al. (2018) Huaxiu Yao, Fei Wu, Jintao Ke, Xianfeng Tang, Yitian Jia, Siyu Lu, Pinghua Gong, Jieping Ye, and Zhenhui Li. 2018. Deep multi-view spatial-temporal network for taxi demand prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
Ye et al. (2020) Yanfang Ye, Shifu Hou, Yujie Fan, Yiming Zhang, Yiyue Qian, Shiyu Sun, Qian Peng, Mingxuan Ju, Wei Song, and Kenneth Loparo. 2020. $\alpha$-Satellite: An AI-Driven System and Benchmark Datasets for Dynamic COVID-19 Risk Assessment in the United States. IEEE Journal of Biomedical and Health Informatics 24, 10 (2020), 2755–2764.
Yom-Tov et al. (2022) Elad Yom-Tov, Vasileios Lampos, Thomas Inns, Ingemar J Cox, and Michael Edelstein. 2022. Providing early indication of regional anomalies in COVID-19 case counts in England using search engine queries. Scientific reports 12, 1 (2022), 1–10.
Zhang et al. (2021) Xiyue Zhang, Chao Huang, Yong Xu, Lianghao Xia, Peng Dai, Liefeng Bo, Junbo Zhang, and Yu Zheng. 2021. Traffic Flow Forecasting with Spatial-Temporal Graph Diffusion Network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 15008–15015.
Zhou et al. (2020a) Feng Zhou, Tao Chen, and Baiying Lei. 2020a. Do not forget interaction: Predicting fatality of COVID-19 patients using logistic regression. arXiv preprint arXiv:2006.16942 (2020).
Zhou et al. (2020b) Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. 2020b. Graph neural networks: A review of methods and applications. AI Open 1 (2020), 57–81.

Appendix A Supplement: Reproducibility

A.1. Dataset

Location data		Web search data
ID1	thisisfakeid	ID1	thisisfakeid
Latitude	35.683	Time	2020-03-17 08:32:14
Longitude	139.763	Query	”Is loss taste covid symptom?”
Unix time	1584390800
Date	20200317

Table 2. Samples of mobility data and web search data.

	Symptom		Symptom
1	Abdominal pain	23	Hot flash
2	Ageusia	24	Hyperhidrosis
3	Anosmia	25	Insomnia
4	Anxiety	26	Lethargic
5	Arthralgia	27	Loss of appetite
6	Body ache	28	Myalgia
7	Chest pain	29	Nasal dryness
8	Chest tightness	30	Nausea
9	Chills	31	Oropharyngeal pain
10	Confusion	32	Pain
11	Cough	33	Palpitation
12	Dehydration	34	Pyrexia
13	Diarrhea	35	Rash
14	Disorientation	36	Rhinorrhea
15	Dizziness	37	Sinusitis
16	Dyspnea	38	Sleep disturbance
17	Ear infection	39	Sneezing
18	Ear pain	40	Stress
19	Eye infection	41	Sweating
20	Eye pain	42	Upper Respiratory Tract Infection
21	Fatigue	43	Vomiting
22	Headache	44	Wheezing

Table 3. Specified COVID-19 related symptoms.

This study uses four datasets: $[1]$ mobility data; $[2]$ web search data; $[3]$ infection data; $[4]$ symptom data.

• We aggregate mobility data and web search data for individuals from the mobile phone data owned by Yahoo Japan Corporation (https://about.yahoo.co.jp/en/info/company/), and feed the aggregated mobility data (i.e., $\mathbf{E}_{t}$) and web search data (i.e., $\mathbf{H}_{t}$) into our machine learning model. $[1]$ $\mathbf{E}_{t}$ and $[2]$ $\mathbf{H}_{t}$ are accessible at Yahoo! JAPAN R&D (https://randd.yahoo.co.jp/en/softwaredata) under "YJ Covid-19 Prediction Data".

• $[3]$ infection data and $[4]$ symptom data originate from public resources and can be found under the data-collection directory in our GitHub repository (https://github.com/JiaweiXue/MultiwaveCovidPrediction).

• Machine learning code and implementation results are under the SAB-GNN directory in our GitHub repository.

Using these data and code, readers can fully reproduce the disease prediction results for Tokyo. Besides, readers may use our code to conduct infection prediction for other cities if their mobility and web search data are available.

A.2. Data Preprocessing

Recall that our prediction task requires the standard mobility matrix $\mathbf{E}_{t}$ as well as the web search matrix $\mathbf{H}_{t}$ on day $t$. We design a data preprocessing framework to convert the raw mobility and web search data to $\mathbf{E}_{t}$ and $\mathbf{H}_{t}$ (Figure 8). We notice from the raw mobility data that a proportion of user IDs only appear a few times, which indicates that they might be temporary visitors to Tokyo and may not be strongly related to the inter-city disease spreading over several months; we therefore exclude their mobility data from this study. We finally identify 551,745 permanent users in the 23 special wards of Tokyo, who account for around 6% of the total population in these wards. Our mobility and web search data preprocessing procedure consists of the following steps:

• Identify Tokyo residents. We apply the Mean Shift Algorithm (Cheng, 1995) to each permanent user's location points during the night hours (from 6:00 PM to 9:00 AM) for 2 weeks starting from Jan. 6, 2020 to estimate the longitude and latitude of their homes using Java (step (1)).

• Generate mobility feature. We extract the daily trajectories of these permanent residents using Java (step (2)), project the location points along the trajectories to the urban districts using the spatial join function in the geopandas package in Python, and obtain the mobility matrix $\mathbf{E}_{t}$ (step (3)). Note that we mine the daily location trajectory of each user ID and identify cross-district trips with a duration of at least 10 minutes (these trips are referred to as valid trips).

• Generate web search feature. Based on the specified 44 COVID-19 symptoms, we count the number of symptom searches for each permanent resident using Java (step (4)), aggregate them by urban districts (step (5)), and arrive at the web search matrix $\mathbf{H}_{t}$.
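Two of the steps above, home estimation (step (1)) and district-level aggregation of symptom searches (steps (4) and (5)), can be sketched in plain Python as follows. This is an illustrative sketch and not the released Java implementation. It makes several assumptions: a flat-kernel mean shift with Euclidean distance in degrees, a hypothetical `bandwidth` value, case-insensitive substring matching of English symptom names, and aggregation by the user's home district. All function names and toy data are invented for illustration.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (latitude, longitude)


def _shift(start: Point, points: List[Point], bandwidth: float,
           max_iter: int) -> Tuple[Point, int]:
    """Move `start` to the mean of its neighbors until it stops moving."""
    center, support = start, 0
    for _ in range(max_iter):
        near = [p for p in points
                if (p[0] - center[0]) ** 2 + (p[1] - center[1]) ** 2 <= bandwidth ** 2]
        if not near:
            break
        support = len(near)
        new_center = (sum(p[0] for p in near) / support,
                      sum(p[1] for p in near) / support)
        if new_center == center:  # converged: neighbor set no longer changes
            break
        center = new_center
    return center, support


def estimate_home(points: List[Point], bandwidth: float, max_iter: int = 100) -> Point:
    """Return the mode supported by the most night-time points (assumed home)."""
    if not points:
        raise ValueError("need at least one location point")
    modes = [_shift(p, points, bandwidth, max_iter) for p in points]
    return max(modes, key=lambda mode: mode[1])[0]


def web_search_matrix(searches: List[Tuple[str, str]], home_district: Dict[str, int],
                      symptoms: List[str], n_districts: int) -> List[List[int]]:
    """Count symptom mentions per district for one day of (user_id, query) pairs."""
    H = [[0] * len(symptoms) for _ in range(n_districts)]
    for user_id, query in searches:
        district = home_district.get(user_id)
        if district is None:  # skip users without an estimated home district
            continue
        text = query.lower()
        for j, symptom in enumerate(symptoms):
            if symptom.lower() in text:
                H[district][j] += 1
    return H


# Toy night-time points: four near one location, two near another.
night_points = [(35.6830, 139.7630), (35.6832, 139.7628), (35.6828, 139.7632),
                (35.6831, 139.7631), (35.6600, 139.7000), (35.6602, 139.7001)]
print(estimate_home(night_points, bandwidth=0.01))
```

In practice a library implementation of mean shift and a geodesic distance would likely be preferable; the pure-Python version is used here only to keep the sketch self-contained.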

Input: training data $\{G_{t},T-D_{1}+1\leq t\leq T;I_{t},T+1\leq t\leq T+D_{2}\}_{T\in T_{train}}$, where $G_{t}=(V,\mathbf{E}_{t},\mathbf{H}_{t},\mathbf{I}_{t})$.
Output: the SAB-GNN model.
for each epoch do
    for each batch do
        $\mathcal{L}_{batch}\leftarrow 0$, $m=0$;
        for $T$ in this batch do
            $m\leftarrow m+1$;
            /* Spatial module */
            Evaluate $\mathbf{X}_{t}^{(k+1)}$ using Eq. (1), $T-D_{1}+1\leq t\leq T$;
            Generate representations $\mathbf{H}_{t}^{S}=\mathbf{X}_{t}^{(K)}$, $T-D_{1}+1\leq t\leq T$;
            /* Social awareness recovery module */
            Update $\mathbf{H}_{t}^{S}$ with $\mathbf{M}_{t,t_{0}}$;
            Compute $\mathbf{H}_{t}^{A}$ using Eq. (7);
            /* Temporal module */
            Feed each row $i$ of $\mathbf{H}_{t}^{A}$ into an LSTM model and obtain the hidden states for $t=T+1,T+2,...,T+D_{2}$;
            Use a one-layer perceptron to transform these hidden states to $D_{2}$ predicted cases $\hat{I}_{t,i}$;
            Compute the loss $\mathcal{L}_{T}$ using Eq. (8);
            $\mathcal{L}_{batch}\leftarrow\frac{m-1}{m}\mathcal{L}_{batch}+\frac{1}{m}\mathcal{L}_{T}$;
        end for
        Use Adam optimizer to update the model parameter.
    end for
end for

Algorithm 1: Training algorithm of SAB-GNN

Study	Period	Used features	Model	Publication
Chimmula et al. (Chimmula and Zhang, 2020)	Jan. 2020 - Mar. 2020	Infection	LSTM	Chaos, Solitons & Fractals
Xiao et al. (Xiao et al., 2021)	Jan. 2020 - Mar. 2020	Baidu mobility, etc.	TL	AAAI 2021
Schwabe et al. (Schwabe et al., 2021)	Feb. 2020 - Apr. 2020	Swisscom mobility, etc.	Hawkes model	KDD 2021
Panagopoulos et al. (Panagopoulos et al., 2021)	Feb. 2020 - May 2020	Facebook mobility	MPNN + TL	AAAI 2021
Kapoor et al. (Kapoor et al., 2020)	Mar. 2020 - May 2020	Google mobility	GCN + Skip-connection	ArXiv
Kargas et al. (Kargas et al., 2021)	Mar. 2020 - Jun. 2020	Medical data	Epidemiological model	AAAI 2021
Gao et al. (Gao et al., 2021)	Mar. 2020 - Jun. 2020	Medical data	GAT + GRU	JAMIA
Wang et al. (Wang et al., 2020)	Mar. 2020 - Aug. 2020	Medical data	Ensemble	IEEE Big Data 2020
Adiga et al. (Adiga et al., 2021)	Aug. 2020 - Jan. 2021	Infection	Ensemble	KDD 2021
Chang et al. (Chang et al., 2021)	Mar. 2020 - Feb. 2021	SafeGraph mobility, etc.	Epidemiological model	KDD 2021
Sesti et al. (Sesti et al., 2021)	Jan. 2020 - May 2021	Infection	GraphLSTM	ICML 2021 Workshop
This study	Apr. 2020 - May 2021	Yahoo mobility, web search	Social Awareness GNN	KDD 2022

Table 4. Comparison between existing studies and this study.

A.3. Training Algorithm

We present the training algorithm as Algorithm 1. We calculate the gradient and update model parameters for each batch of data.
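The batch loss update in Algorithm 1, $\mathcal{L}_{batch}\leftarrow\frac{m-1}{m}\mathcal{L}_{batch}+\frac{1}{m}\mathcal{L}_{T}$, is an incremental form of the arithmetic mean of the per-sample losses $\mathcal{L}_{T}$ in the batch. The short Python sketch below checks this on toy numbers; the function name `running_mean_loss` is hypothetical and the loss values are invented for illustration.

```python
from typing import Iterable


def running_mean_loss(sample_losses: Iterable[float]) -> float:
    """Accumulate the batch loss with the incremental update of Algorithm 1."""
    batch_loss, m = 0.0, 0
    for loss_T in sample_losses:
        m += 1
        # After this step, batch_loss equals the mean of the first m losses.
        batch_loss = (m - 1) / m * batch_loss + loss_T / m
    return batch_loss


toy_losses = [2.0, 4.0, 9.0]  # invented per-sample losses
print(running_mean_loss(toy_losses), sum(toy_losses) / len(toy_losses))
```

Because the update yields the plain mean, the gradient step taken once per batch weights every sample in that batch equally.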

A.4. Hyperparameter Selection

For SAB-GNN-wsa and SAB-GNN, we search over the following hyperparameter values; the selected values are underlined.

• Learning rate $\in$ {1e-1, 1e-2, 1e-3, 1e-4, 1e-5}.

• Batch size $\in$ {2, 4, 8, 16}.

• Dropout rate $\in$ {0.3, 0.5, 0.7}.

• $L_{1}\in\{\underline{1},2,3\}$, where $L_{1}$ is the number of layers in the GNN.

• $L_{2}\in\{1,\underline{2},3\}$, where $L_{2}$ is the number of layers in the LSTM.

• $n_{w}\in\{1,3,\underline{8},12,44\}$, where $n_{w}$ is the number of symptom features used (recall that $\mathbf{h}_{t}^{v_{i}}\in\mathbb{R}^{n_{w}}$).
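The candidate values listed above can be enumerated programmatically, as in the following Python sketch. Whether the search covered the full Cartesian grid or tuned one hyperparameter at a time is not stated, so the exhaustive enumeration is an assumption; the dictionary keys and the `iter_configs` helper are hypothetical names.

```python
from itertools import product
from typing import Dict, Iterator, List

# Candidate values as listed in this section.
search_space: Dict[str, List] = {
    "learning_rate": [1e-1, 1e-2, 1e-3, 1e-4, 1e-5],
    "batch_size": [2, 4, 8, 16],
    "dropout": [0.3, 0.5, 0.7],
    "L1": [1, 2, 3],          # number of GNN layers
    "L2": [1, 2, 3],          # number of LSTM layers
    "n_w": [1, 3, 8, 12, 44], # number of symptom features
}


def iter_configs(space: Dict[str, List]) -> Iterator[Dict[str, object]]:
    """Yield every combination of candidate values as a config dictionary."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


configs = list(iter_configs(search_space))
print(len(configs))
```

The full grid contains 5 x 4 x 3 x 3 x 3 x 5 = 2700 configurations, which may motivate tuning a subset of the hyperparameters at a time when each run takes several minutes.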

A.5. Comparison with Existing Studies

We present existing COVID-19 infection studies in Table 4 based on prediction period, used feature data, and model architecture. We find that mobility data owned by different companies such as Baidu, Facebook, Google, and Swisscom served as the primary data sources for COVID-19 infection prediction before Jun. 2020, when the first COVID-19 outbreaks occurred in many countries. As the pandemic propagated and evolved worldwide, with the total infection cases exceeding 530 million as of June 7, 2022 (Organization, 2022), the infection dynamics involved more factors such as virus mutation, mask policy, vaccination, and travel policy changes.

Our study makes the first attempt to introduce web search data, which is a representative signal of unconfirmed positive cases within the population, into the multiwave infection prediction. Besides, we consider the temporal decay of social awareness of COVID-19 symptoms in our proposed social awareness GNN, to fully leverage the power of web search data. The reason is that people’s cognition changed during the two-year pandemic.

Our goal is not to beat all other models. COVID-19 spreading is a highly uncertain process affected by varying transmission discrepancy, virus mutation, human behavior, and vaccination penetrations in different periods and locations worldwide, which makes a universally best model impossible. The Centers for Disease Control and Prevention announced that even the ensemble model was unable to provide reliable predictions all the time (https://www.cdc.gov/coronavirus/2019-ncov/science/forecasting/forecasts-cases.html). Instead, the main takeaway is that we can leverage web search data to complement existing mobility-based infection prediction to confront the challenges of the multiwave infection uncertainty.

A.6. Application on Social Media Data Mining

One contribution of this study is the social awareness recovery module (Equations 4, 5, 6). Social awareness depicts the general phenomenon that human perception and attention on events / products / news becomes weaker as time proceeds. Beyond disease prediction, it is promising to apply the social awareness recovery module to social media data from other sources (e.g., Facebook, Twitter, Weibo) in other applications such as predicting the visits to a new location, and the shopping demand of a new product.
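As a rough illustration of how an exponential awareness decay could be applied to another data source, the Python sketch below down-weights attention with elapsed time and rescales an observed signal accordingly. It is not a restatement of Equations 4, 5, 6: the functional form $e^{-\lambda(t-t_{0})}$, the parameter name `decay_rate`, and the division-based recovery are assumptions chosen for this sketch, and the numbers are invented.

```python
import math


def awareness_weight(t: float, t0: float, decay_rate: float) -> float:
    """Assumed awareness level at time t, equal to 1 at the reference time t0."""
    if t < t0:
        raise ValueError("t must not precede the reference time t0")
    return math.exp(-decay_rate * (t - t0))


def recover_signal(observed: float, t: float, t0: float, decay_rate: float) -> float:
    """Rescale an observed count to the level expected under full awareness."""
    return observed / awareness_weight(t, t0, decay_rate)


# With a half-life of 10 days, 50 observed searches on day 10 correspond
# to about 100 searches at the awareness level of day 0.
half_life_rate = math.log(2) / 10
print(recover_signal(50.0, t=10, t0=0, decay_rate=half_life_rate))
```

In a learned model the decay rate would be fitted to data rather than fixed by hand, and a different decay family could be substituted without changing the surrounding pipeline.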
