STCGAT: A Spatio-temporal Causal Graph Attention Network for traffic flow prediction in Intelligent Transportation Systems

Wei Zhao, Shiqi Zhang, Bing Zhou, Bei Wang

W. Zhao is with the School of Artificial Intelligence and Computer Science, School of Cyber Science and Engineering and Cooperative Innovation Center of Internet Healthcare, Zhengzhou University, Zhengzhou, 450001, China. S. Zhang and B. Wang are with the School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou, 450001, China (e-mail: zhangshiqi@gs.zzu.edu.cn). B. Zhou is with the School of Artificial Intelligence and Computer Science and Cooperative Innovation Center of Internet Healthcare, Zhengzhou University, Zhengzhou, 450001, China.

Abstract

Air pollution and carbon emissions caused by modern transportation are closely related to global climate change. With the help of next-generation information technology such as the Internet of Things (IoT) and Artificial Intelligence (AI), accurate traffic flow prediction can help alleviate problems such as traffic congestion and mitigate environmental pollution and climate change. It further promotes the development of Intelligent Transportation Systems (ITS) and smart cities. However, the strong spatial and temporal correlation of traffic data makes accurate traffic forecasting a significant challenge. Existing methods are usually based on graph neural networks that use predefined spatial adjacency graphs of traffic networks to model spatial dependencies, ignoring the dynamic correlation between road nodes. In addition, they usually use independent Spatio-temporal components to capture Spatio-temporal dependencies and do not effectively model global Spatio-temporal dependencies. This paper proposes a new Spatio-temporal Causal Graph Attention Network (STCGAT) for traffic prediction to address the above challenges. The source code of STCGAT is available at https://github.com/zsqZZU/STCGAT. In STCGAT, we use a node embedding approach that adaptively generates a spatial adjacency subgraph at each time step without a priori geographic knowledge and models the topology of the dynamically generated graphs for different time steps in a fine-grained manner. Meanwhile, we propose an efficient causal temporal correlation component that contains node adaptive learning, graph convolution, and local and global causal temporal convolution modules to learn local and global Spatio-temporal dependencies jointly. Extensive experiments on four real, large traffic datasets show that our model consistently outperforms all baseline models.

Index Terms:

traffic flow prediction, spatial dependencies, Spatio-temporal dependencies, Spatio-temporal Causal Graph Attention Network, Intelligent Transportation Systems (ITS).

I Introduction

With the rapid development of Industrial Internet of Things (IIoT) 4.0[1], Healthcare 2.0, 5G, and even future 6G-based high-traffic communications[2], as well as Artificial Intelligence (AI), a range of smart-city sub-sectors are beginning to fully serve service providers and citizens. Among these services, Intelligent Transportation Systems (ITS)[3] are the critical platform for intelligent transportation. For example, ITS can provide real-time and accurate road traffic status information, location navigation services, personalized travel route planning, and other services. As an essential part of ITS, traffic flow prediction can effectively mine potential Spatio-temporal patterns from traffic data, which not only helps to relieve traffic congestion and schedule traffic flow but also reduces people’s travel time and cost and reduces environmental pollution, thereby promoting the development of smart cities[4].

A prediction model based on Spatio-temporal traffic data should consider both the temporal dependencies of historical time series and the spatial dependencies between traffic highways. Early traffic flow prediction methods usually treated the task as multivariate time series analysis, modeling traffic time series data with queuing theory[5], traffic behavior theory[6], and machine learning approaches[7]. However, these methods only considered dependence in the temporal dimension and disregarded dependence in the spatial dimension. Consequently, an increasing number of researchers are focusing on Spatio-temporal prediction models based on Graph Neural Networks (GNN)[8], which produce impressive results. However, these models still have some limitations.

Figure 1: Adjacency Matrix of Traffic Road Network.

The first limitation is that the dynamic correlation information between nodes on the graph is ignored. GNN-based spatial dependency modeling can be thought of as transforming and aggregating the information of nodes along the edges of the traffic network[8]. As shown in Fig. 1, most approaches[9][10][11] employ a predefined static adjacency matrix to characterize the spatial interactions between traffic road nodes[12]. However, the relationships between road nodes in a traffic network are dynamic and interacting, depending on various complex factors such as traffic flow, number of lanes, and population density. Therefore, the spatial relationship of roads cannot be modeled simply by the static spatial connections between them.

Second, traffic information on the transportation network exhibits a high degree of nonlinear correlation and uncertainty, caused for example by regular road maintenance and sudden accidents. Existing models usually use Recurrent Neural Network (RNN)[13] based models (e.g., Long Short-Term Memory (LSTM)[14] or Gated Recurrent Unit (GRU)[15]) to capture the temporal dependence. When dealing with long sequences, the signals must traverse the long recurrent path of the network, making it challenging to effectively model the global temporal dependence of long time series data. In addition, the sequential execution process of RNNs prevents them from capturing causal correlation information about traffic events[16]. To address the above issues, some Convolutional Neural Network (CNN)-based approaches stack multiple convolutional layers to model global temporal dependencies[17][18]. However, the local time-dependent information may be lost as the dilation rate increases[19]. Meanwhile, as the network becomes deeper, the model becomes more difficult to optimize, which leads to the problem of network degradation[20].

To address the aforementioned challenge, we offer a new Spatio-temporal prediction framework based on Graph Attention Network (GAT)[21] called Spatio-temporal Causal Graph Attention Network (STCGAT). We propose a data-driven graph structure learning method that can autonomously learn the relationship information between road nodes and model the spatial correlation of traffic networks without relying on the traffic network graph structure information. In addition, we propose a bidirectional Spatio-temporal component to simultaneously capture local and global spatial-temporal dependency information. Moreover, a residual module[20] is introduced in the component to use a deep network to capture more abstract Spatio-temporal dependence information. The following are the principal contributions of this work:

1. We construct a data-driven Node Adaptive Learning Graph Attention Network (NAL-GAT) that dynamically captures spatial dependencies by generating a weighted adjacency matrix of the graph based on the road traffic state information at different moments.

2. We propose a new Spatio-temporal component that replaces the GRU gating unit with NAL-GAT and is further constructed as a recursive bidirectional network to capture the local causal Spatio-temporal dependence.

3. We process the output sequence data in parallel with a stacked Temporal Convolutional Network to capture global and long-range temporal dependencies.

The rest of this paper is organized as follows: Section II briefly reviews related work in the field of Spatio-temporal prediction. Section III explains the structure and methodology of STCGAT. Section IV evaluates the performance of STCGAT through a series of experiments. Section V concludes the paper.

II Related Works

II-A Graph Convolutional Network

Graph Convolutional Networks (GCNs)[22] have received much interest in recent years. They are now frequently used for various tasks involving graph-structured data, including graph classification, node classification, and link prediction. GCNs fall into two basic categories: spectral-based and spatial-based[8]. Spectral-based techniques define the graph convolution with filters from the perspective of graph signal processing and extend the graph convolution to the spectral domain by locating the corresponding Fourier bases. The primary approaches are Chebyshev Spectral CNN (ChebNet)[23] and Adaptive Graph Convolution Network (AGCN)[24]. Spatial-based approaches define the graph convolution through the spatial relationships of the nodes, representing it as the aggregation of feature information from each node's neighborhood. The most prominent methods are GAT and Gated Attention Network (GAAN)[25].

II-B Time-dependent modeling

Earlier traffic forecasting work often treated the task as a multivariate time series analysis problem and used time series modeling techniques such as the History Average (HA) model[26], Vector Autoregression (VAR)[27], and Support Vector Regressor (SVR)[28]. However, these algorithms rely too heavily on the assumption of ideal smoothness, which is incompatible with the nonlinear correlation of traffic data. With its well-established data modeling and autonomous learning capabilities, deep learning has steadily taken over time series forecasting tasks in recent years. Most studies have used RNN-based models to capture temporal dependencies, but the long recurrent structure is time-consuming and may suffer from vanishing and exploding gradients as the recurrent path grows. To address this problem, some work uses Temporal Convolutional Networks (TCN)[29][30] to process time series in parallel, learning more useful information through a larger receptive field. Recently, several researchers have concentrated on the strong modeling capabilities of the Transformer[31] for time series data and have presented many Transformer-based variants[32][33][34] with excellent performance on long time series prediction tasks.

II-C Spatio-temporal dependence modeling

To model the spatial and temporal dependencies of traffic data, some studies[16][17][18] model traffic networks as regular two-dimensional grids, process the two-dimensional data using CNNs to capture the spatial dependence, and then model the temporal dependence of traffic time series using RNNs or CNNs. However, CNNs are not always applicable to traffic road networks, which are non-Euclidean in nature. To solve this problem, more and more researchers are turning to GCN-based models for Spatio-temporal data prediction. For example, DCRNN[9] models spatial dependencies through bidirectional random walks on the traffic road topology graph and then uses GRU to capture temporal dependencies. ASTGCN[10] models the spatial and temporal dependence of traffic data with spatial attention and temporal attention, respectively. STSGCN[35] captures the local Spatio-temporal correlation of traffic data by combining Spatio-temporal blocks through a local Spatio-temporal synchronized graph convolution module. STFGNN[19] learns both local and global Spatio-temporal dependencies by processing data-driven spatial and temporal graphs at different moments in parallel. Following STFGNN, STGODE[36] adopts the Dynamic Time Warping (DTW)[37] used by STFGNN to generate semantic adjacency matrices for traffic road topology graphs and capture deeper Spatio-temporal dependencies. More recently, Z-GCNETs[38] introduces the concept of zigzag persistence into the traffic network graph structure and integrates it into GCNs to enhance the stability of the model. STG-NCDE[39] designs two neural controlled differential equations dealing with temporal and spatial dependencies, respectively, and integrates both to capture Spatio-temporal dependencies simultaneously.

III Methods

III-A Problem formulation

Traffic flow forecasting uses historical traffic information on the road to predict traffic condition information in the future period. In our model, traffic speed is chosen as the traffic condition information of the road.

Definition 1: We use the undirected graph $G=(V,E)$ to represent the topological information of the traffic roads, where $V=\{v_{1},v_{2},\cdots,v_{N}\}$ is the set of all nodes of graph $G$, $N$ denotes the number of road nodes, and $E$ denotes the set of edges connecting the nodes of graph $G$.

Definition 2: We represent the historical traffic speed information of time length $T$ by a feature matrix $X=\{X_{1},X_{2},\cdots,X_{T}\}\in R^{N\times T\times F}$, where $F$ denotes the feature dimension, $X_{t}=\{\vec{x}_{t:1},\vec{x}_{t:2},\cdots,\vec{x}_{t:N}\}\in R^{N\times F}$ denotes the traffic information of all road nodes at time $t$, and $\vec{x}_{t:i}\in\mathbb{R}^{F}$ denotes the feature vector of node $v_{i}$ at time $t$.

Traffic prediction has a substantial spatial and temporal correlation, considering the spatial correlation and temporal dependence of individual road nodes in the road network. Therefore, we aim to use the road network topology $G$ and the traffic information feature matrix $X$ to predict the traffic speed information $Y^{\prime}=[X_{t+1}^{\prime},X_{t+2}^{\prime},\cdots,X_{t+T}^{\prime}]$ in the next $T$ moments by learning a function $f(\cdot)$ .

Y^{\prime}=f(G;(X_{t-T},X_{t-(T-1)},\cdots,X_{t}))

(1)

III-B Spatial Dependency Modeling

The traffic condition data of distinct road nodes are strongly and dynamically connected in the spatial dimension. However, traditional graph neural networks construct predefined adjacency matrices to perform graph convolution operations based on information such as connectivity or distance between graph nodes. Although the predefined adjacency matrix intuitively represents the position relationship between nodes, it cannot reflect how the spatial dependence between road nodes changes over time. Therefore, as shown in (b) in Fig. 2, we use a node-adaptive learning mechanism to learn the dynamic correlation information among road nodes at different moments in a fine-grained manner and generate the traffic subgraph $G_{ad}$ for the corresponding moment. As shown in Eq. 2, at any moment $t$ this mechanism generates the adjacency matrix $\tilde{A}^{t}\in R^{N\times N}$ corresponding to that moment.

\tilde{A}^{t}=softmax(ReLU(E_{At}\cdot E_{At}^{T}))

(2)

Where $E_{At}\in R^{N\times d}$ is the embedding dictionary encoding each traffic road node, $d$ is the embedding dimension, and $E_{At}^{T}$ is the transpose matrix of $E_{At}$ . $softmax$ is the normalization function, and $ReLU$ is the nonlinear activation function.
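As a minimal sketch of Eq. 2, the snippet below computes a row-normalized adjacency matrix from a small embedding table using only the Python standard library. The function name `adaptive_adjacency` and the embedding values are hypothetical and chosen for illustration; in the model the embeddings would be learned parameters.

```python
import math

def adaptive_adjacency(E):
    """Row-wise softmax(ReLU(E @ E^T)) for an N x d embedding table E."""
    n = len(E)
    A = []
    for i in range(n):
        # Inner product of node i's embedding with every embedding, then ReLU.
        scores = [max(0.0, sum(a * b for a, b in zip(E[i], E[j]))) for j in range(n)]
        # Numerically stable softmax over the row.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        A.append([e / total for e in exps])
    return A

# Hypothetical 3-node, 2-dimensional embedding table (made-up values).
E = [[1.0, 0.0], [0.8, 0.2], [-1.0, 0.5]]
A = adaptive_adjacency(E)
for row in A:
    print([round(v, 3) for v in row])
```

Each row sums to one, and nodes whose embeddings point in similar directions receive larger weights. Because softmax never returns exactly zero, the matrix produced this way is dense.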

Meanwhile, to capture spatial dependencies among nodes adaptively and dynamically, we propose NAL-GAT, which extracts the spatial features of traffic roads by integrating the node-adaptive learning mechanism with GAT. Specifically, at any moment $t$, GAT computes, node by node, the attention coefficients between each node and its neighbors from the node correlation information generated by the node-adaptive learning mechanism. It then aggregates the spatial dependencies among the nodes on the graph at moment $t$. Eq. 3 shows how the attention coefficient $e_{ij}^{t}$ between node $v_{i}$ and an arbitrary neighbor node $v_{j}$ is computed.

e_{ij}^{t}=a(W\vec{x}_{t:i},W\vec{x}_{t:j})

(3)

Where $a$ is the scoring function of the attention mechanism, and $W\in R^{F\times F^{\prime}}$ is the weight matrix shared by all nodes of the graph. The attention coefficient $\alpha_{ij}^{t}$ of the graph attention layer is then obtained by normalizing the coefficients $e_{ij}^{t}$ over the neighbors of node $v_{i}$.

\begin{split}\alpha_{ij}^{t}&=softmax_{j}(e_{ij}^{t})\\ &=\frac{exp(e_{ij}^{t})}{\sum_{k\in\tilde{A}_{i}^{t}}exp(e_{ik}^{t})}\end{split}

(4)

Where $\tilde{A}_{i}^{t}$ denotes all the neighbor nodes of $v_{i}$ .

In addition, we note that all nodes in the GAT share the same parameter space $W\in R^{F\times F^{\prime}}$, which, for graphs with many nodes, may make $W$ too large and the model difficult to optimize. To solve this problem, we construct a shared weight pool $W_{p}\in R^{d\times F\times F^{\prime}}$, from which the weight matrix $W^{\prime}=E_{At}\cdot W_{p}\in R^{F\times F^{\prime}}$ of each node is obtained according to the node’s embedding in $E_{At}$. The complete attention coefficient calculation process is then given in Eq. 5.

\begin{split}e_{ij}^{t}&=LeakyReLU(\vec{a}^{T}[E_{At}\cdot W_{p}\vec{x}_{t:i}\parallel E_{At}\cdot W_{p}\vec{x}_{t:j}])\\ e_{ik}^{t}&=LeakyReLU(\vec{a}^{T}[E_{At}\cdot W_{p}\vec{x}_{t:i}\parallel E_{At}\cdot W_{p}\vec{x}_{t:k}])\\ \alpha_{ij}^{t}&=\frac{exp(e_{ij}^{t})}{\sum_{k\in\tilde{A}_{i}^{t}}exp(e_{ik}^{t})}\end{split}

(5)

Where $\vec{a}\in\mathbb{R}^{2F^{\prime}}$ is the weight vector of the attention mechanism, $\parallel$ denotes the concatenation operation, and LeakyReLU is the nonlinear activation function.
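The attention computation of Eq. 5 can be sketched in plain Python as follows. This is an illustrative reading, not the authors' implementation: it assumes that each node's weight matrix is the embedding-weighted sum of the slices of the pool $W_{p}$, that every node is projected with its own matrix, and that the LeakyReLU slope is 0.2. The helper names and all numeric values are hypothetical.

```python
import math

def node_weight(e_i, W_pool):
    """W'_i = sum_k e_i[k] * W_pool[k], an F x F' matrix for one node."""
    F, F_out = len(W_pool[0]), len(W_pool[0][0])
    return [[sum(e_i[k] * W_pool[k][f][g] for k in range(len(e_i)))
             for g in range(F_out)] for f in range(F)]

def project(x, W):
    """Feature vector x (length F) times W (F x F') gives a length-F' vector."""
    return [sum(x[f] * W[f][g] for f in range(len(x))) for g in range(len(W[0]))]

def leaky_relu(v, slope=0.2):  # slope is an assumed value
    return v if v > 0 else slope * v

def attention_row(i, X, E, W_pool, a, neighbors):
    """Normalized coefficients alpha_ij over the given neighbors of node i."""
    h = [project(X[n], node_weight(E[n], W_pool)) for n in range(len(X))]
    # h[i] + h[j] is list concatenation, i.e. the "||" operation in Eq. 5.
    e = {j: leaky_relu(sum(w * v for w, v in zip(a, h[i] + h[j]))) for j in neighbors}
    m = max(e.values())
    exps = {j: math.exp(s - m) for j, s in e.items()}
    z = sum(exps.values())
    return {j: exps[j] / z for j in neighbors}

# Made-up toy problem: N = 3 nodes, d = 2, F = 2, F' = 2.
W_pool = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
E = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
X = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
a = [0.1, 0.2, 0.3, 0.4]
alpha = attention_row(0, X, E, W_pool, a, neighbors=[0, 1, 2])
print({j: round(v, 3) for j, v in alpha.items()})
```

The pool holds $d\times F\times F^{\prime}$ parameters regardless of the number of nodes, whereas storing a separate matrix per node would require $N\times F\times F^{\prime}$.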

To capture deeper feature information, this paper further uses the multi-head attention mechanism to model spatial dependence. As shown in Eq. 6, $Q$ sets of mutually independent attention mechanisms are invoked.

\vec{x}_{t:i}^{\prime}=\parallel_{q=1}^{Q}LeakyReLU(\sum_{k\in\tilde{A}_{i}^{t}}\alpha_{ik}^{t,q}(E_{At}^{q}\cdot W_{p}^{q})\vec{x}_{t:k})

(6)

Where $\alpha_{ik}^{t,q}$ is the weight coefficient computed by the $q$-th attention head at time $t$, $E_{At}^{q}\cdot W_{p}^{q}$ is the weight matrix of the corresponding head, and $\vec{x}_{t:i}^{\prime}\in\mathbb{R}^{QF^{\prime}}$ is the new feature representation obtained by passing node $v_{i}$ through the multi-head graph attention layer.

When using multi-headed attention, the network’s last layer should not output too many features, so we use a separate self-attentive mechanism to limit the node’s output feature length. Specifically, we use a new weight pool $W_{p}^{\prime}\in R^{d\times QF^{\prime}\times F^{\prime\prime}}$ to map the output dimension of nodes from $\mathbb{R}^{QF^{\prime}}$ to $\mathbb{R}^{F^{\prime\prime}}$ to obtain the final output result $\vec{x}_{t:i}^{\prime\prime}\in\mathbb{R}^{F^{\prime\prime}}$ .

\vec{x}_{t:i}^{\prime\prime}=LeakyReLU(\sum_{k\in\tilde{A}_{i}^{t}}\alpha_{ik}^{t\prime}(E_{At}^{\prime}\cdot W_{p}^{\prime})\vec{x}_{t:k}^{\prime})

(7)

When all nodes in the graph have completed the above graph attention layer operation, we can obtain the output features $X_{t}^{\prime\prime}=\{\vec{x}_{t:1}^{\prime\prime},\vec{x}_{t:2}^{\prime\prime},\cdots,\vec{x}_{t:N}^{\prime\prime}\}\in R^{N\times F^{\prime\prime}}$ . For the convenience of presentation, we express this process in Eq. 8.

X_{t}^{\prime\prime}=\sigma(\tilde{A}^{t}X_{t}(E_{At}\cdot W_{p}))

(8)

where $\sigma(\cdot)$ is the computation function for the graph attention layer.

III-C Local Causal Spatial-temporal Dependency Modeling

Traffic conditions at different times are correlated in the time dimension. As shown in (c) in Fig. 2, we replace the gating unit of GRU with NAL-GAT to further capture the Spatio-temporal dependence of the traffic time series data. Specifically, the spatially dependent time series data $X_{t}^{\prime\prime}$ at any moment $t$ is used as the input data of the GRU.

\begin{split}z_{t}&=\sigma(\tilde{A}^{t}[X_{t},\overrightarrow{h}_{t-1}](E_{At}^{z}\cdot W_{p}^{z}))\\ r_{t}&=\sigma(\tilde{A}^{t}[X_{t},\overrightarrow{h}_{t-1}](E_{At}^{r}\cdot W_{p}^{r}))\\ \widetilde{h_{t}}&=tanh(\tilde{A}^{t}[X_{t},r_{t}\odot\overrightarrow{h}_{t-1}](E_{At}^{\widetilde{h}_{t}}\cdot W_{p}^{\widetilde{h}_{t}}))\\ \overrightarrow{h_{t}}&=z_{t}\odot\overrightarrow{h}_{t-1}+(1-z_{t})\odot\widetilde{h_{t}}\end{split}

(9)

Where $\overrightarrow{h}_{t-1}$ is the output at the previous moment, $[\cdot]$ denotes the concat operation in the feature dimension, $\widetilde{h_{t}}$ is the candidate hidden layer state, $\odot$ denotes the multiplication by elements, and $\overrightarrow{h_{t}}\in R^{N\times F^{\prime\prime}}$ is the output at the current moment.
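The recurrence in Eq. 9 can be illustrated with a deliberately simplified sketch. It makes several assumptions that are not stated in the text: $\sigma$ in the gates is read as the logistic sigmoid of a standard GRU, the graph attention operation is replaced by the plain product $\tilde{A}^{t}[X_{t},h_{t-1}]W$, and a single weight matrix per gate is shared by all nodes instead of being drawn from a weight pool. All function names are hypothetical.

```python
import math

def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def apply(fn, M):
    return [[fn(v) for v in row] for row in M]

def concat(X, H):
    """Concatenate two N x * matrices along the feature dimension."""
    return [x + h for x, h in zip(X, H)]

def gru_step(A, X, H_prev, Wz, Wr, Wh):
    """One simplified step of Eq. 9 for all N nodes at once."""
    XH = concat(X, H_prev)
    z = apply(sigmoid, matmul(matmul(A, XH), Wz))   # update gate
    r = apply(sigmoid, matmul(matmul(A, XH), Wr))   # reset gate
    RH = [[ri * hi for ri, hi in zip(rr, hr)] for rr, hr in zip(r, H_prev)]
    cand = apply(math.tanh, matmul(matmul(A, concat(X, RH)), Wh))  # candidate state
    # h_t = z * h_{t-1} + (1 - z) * candidate, element by element.
    return [[zi * hp + (1.0 - zi) * ci for zi, hp, ci in zip(zr, hr, cr)]
            for zr, hr, cr in zip(z, H_prev, cand)]

# Made-up toy problem: N = 2 nodes, F = 1 input feature, hidden size 2.
A = [[0.5, 0.5], [0.5, 0.5]]
X = [[1.0], [-1.0]]
H_prev = [[0.2, -0.4], [0.6, 0.0]]
W_zero = [[0.0, 0.0] for _ in range(3)]  # (F + hidden) x hidden
H = gru_step(A, X, H_prev, W_zero, W_zero, W_zero)
print(H)
```

With all-zero weights the update gate equals 0.5 and the candidate state is zero, so the step simply halves the previous hidden state, which makes the blending in the last line of Eq. 9 easy to check by hand.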

It is important to note that the network depth of the model grows with the input time length. However, a deep network may lead to issues such as vanishing gradients and overfitting. Therefore, we use a residual module to connect the network’s layers and improve the model’s capacity to capture long-term dependencies.

\begin{split}z_{t}&=\sigma(\tilde{A}^{t}[X_{t},\overrightarrow{h}_{t-1}^{\prime}](E_{At}^{z}\cdot W_{p}^{z}))\\ r_{t}&=\sigma(\tilde{A}^{t}[X_{t},\overrightarrow{h}_{t-1}^{\prime}](E_{At}^{r}\cdot W_{p}^{r}))\\ \widetilde{h_{t}}&=tanh(\tilde{A}^{t}[X_{t},r_{t}\odot\overrightarrow{h}_{t-1}^{\prime}](E_{At}^{\widetilde{h}_{t}}\cdot W_{p}^{\widetilde{h}_{t}}))\\ \overrightarrow{h_{t}}&=z_{t}\odot\overrightarrow{h}_{t-1}^{\prime}+(1-z_{t})\odot\widetilde{h_{t}}\\ \overrightarrow{h}_{t}^{\prime}&=\varepsilon(\omega_{1}\otimes X_{t}+\omega_{2}\otimes\overrightarrow{h_{t}})\end{split}

(10)

Where $\omega_{1}$ and $\omega_{2}$ are both one-dimensional convolution kernels, $\varepsilon$ is the nonlinear activation function, $\otimes$ denotes the convolution operation, and $\overrightarrow{h}_{t}^{\prime}\in R^{N\times F^{\prime\prime}}$ is the output of the residual connection. After the above operations have been completed for all $T$ time steps, we obtain the sequence data containing the Spatio-temporal dependence $\overrightarrow{H}^{\prime}=(\overrightarrow{h}_{t-T}^{\prime},\overrightarrow{h}_{t-(T-1)}^{\prime},\cdots,\overrightarrow{h}_{t}^{\prime})$.

In addition, traffic data are not always sequentially correlated, and there are complex causal correlations between traffic events. Therefore, we use a bidirectional GRU to capture the local causal temporal relationships. The reverse pass mirrors the forward operation described above, and the outputs of the two directions are concatenated to obtain the output $H\in R^{N\times T\times 2F^{\prime\prime}}$.

H=\overrightarrow{H}^{\prime}\parallel\overleftarrow{H^{\prime}}

(11)

III-D Global Spatial-temporal Dependency Modeling

From the above procedure, it is evident that the GRU processes the sequence by progressively unfolding along the timeline, so the output at the present time depends on the state at the previous time and the GRU has limited capacity to capture long-term temporal dependence. To improve the extraction of long-term temporal dependencies, we deploy a parallel temporal convolutional network along the time axis. The output of Eq. 11 is reshaped into $H^{\prime}\in R^{N\times(T*2F^{\prime\prime})}$ and used as the TCN’s input data. For the time series data $H_{i:}^{\prime}\in\mathbb{R}^{(T*2F^{\prime\prime})}$ of any node $v_{i}$ and a filter $f:\{0,\cdots,l-1\}\rightarrow\mathbb{R}$, the dilated convolution at element $s$ is defined as follows.

F(s)=(H_{i:}^{\prime}*_{d}f)(s)=\sum_{i=0}^{l-1}f(i)\cdot H_{i:}^{\prime}(s-d\cdot i)

(12)

Where $d$ is the dilation factor, $l$ is the filter size, and $s-d\cdot i$ indexes elements in the past direction of the timeline. When $d=1$, the dilated convolution reduces to a regular convolution. In addition, as shown in (d) in Fig. 2, the TCN applies a series of transformations such as Weight Norm and Dropout and uses a residual connection to obtain the output $o\in R^{N\times(T*2F^{\prime\prime})}$ as in Eq. 13.

o=Activation(x+F(x))

(13)
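The dilated causal convolution of Eq. 12 can be sketched for a single one-dimensional sequence as follows. The function name, the zero treatment of positions before the start of the sequence, and the filter values are assumptions made for illustration; weight normalization, dropout, and the residual connection of Eq. 13 are omitted.

```python
def dilated_causal_conv(x, f, d):
    """F(s) = sum_i f[i] * x[s - d*i], treating x[t] as 0 for t < 0."""
    out = []
    for s in range(len(x)):
        acc = 0.0
        for i, w in enumerate(f):
            t = s - d * i  # look d*i steps into the past, never into the future
            if t >= 0:
                acc += w * x[t]
        out.append(acc)
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
f = [1.0, -1.0]  # hypothetical filter of size l = 2
print(dilated_causal_conv(x, f, d=1))  # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(dilated_causal_conv(x, f, d=2))  # [1.0, 2.0, 2.0, 2.0, 2.0, 2.0]
```

With $d=1$ the filter compares adjacent elements, and with $d=2$ it compares elements two steps apart, so stacking layers with growing dilation lets a small filter cover a long history.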

III-E The prediction layer

Finally, we perform a dimension-specific linear transformation of the output sequence by a two-layer fully connected neural network.

Y^{\prime}=W_{2}\cdot\varphi(W_{1}\cdot o+b_{1})+b_{2}

(14)

Where $W_{1}\in R^{F^{\prime\prime\prime}\times(T*2F^{\prime\prime})}$ and $W_{2}\in R^{(T\times F)\times F^{\prime\prime\prime}}$ are the weight matrices, $b_{1}$ and $b_{2}$ are the corresponding bias terms, and $Y^{\prime}\in R^{N\times T\times F}$ is the final prediction result.

During model training, we optimize the model using the $L1$ loss function and the Adam optimizer to make the error between the predicted $Y^{\prime}=[X_{t+1}^{\prime},X_{t+2}^{\prime},\cdots,X_{t+T}^{\prime}]$ and labeled values $Y=[X_{t+1},X_{t+2},\cdots,X_{t+T}]$ as small as possible.

loss=\frac{1}{T}\sum_{i=1}^{T}\lvert Y_{t+i}-Y_{t+i}^{\prime}\rvert

(15)

IV Experiment

IV-A Datasets

We have done extensive experiments on the following four real public transportation datasets, PeMS03, PeMS04, PeMS07, and PeMS08[40], to illustrate the effectiveness of the proposed Spatio-temporal prediction framework. These datasets were obtained on Caltrans’ Performance Measurement System (PeMS). The system has over 39,000 traffic detectors deployed on California freeways to collect real-time traffic flow data and geographic information on the roadways. The collected data are aggregated every five minutes. Table I summarizes the critical statistics for the four datasets.

TABLE I: Datasets information statistics.

Datasets	Sensors	Edges	Unit	Time Steps
PeMS03	358	547	5 min	26208
PeMS04	307	340	5 min	16992
PeMS07	883	866	5 min	28224
PeMS08	170	295	5 min	17856
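As a quick sanity check on Table I, the number of time steps can be converted into days of recording, given that the data are aggregated every five minutes. The variable names below are hypothetical; the step counts are taken from the table.

```python
STEPS_PER_DAY = 24 * 60 // 5  # number of five-minute intervals in one day

time_steps = {"PeMS03": 26208, "PeMS04": 16992, "PeMS07": 28224, "PeMS08": 17856}
days = {name: steps / STEPS_PER_DAY for name, steps in time_steps.items()}

print(STEPS_PER_DAY)  # 288
print(days)  # {'PeMS03': 91.0, 'PeMS04': 59.0, 'PeMS07': 98.0, 'PeMS08': 62.0}
```

Every dataset covers a whole number of days, between 59 and 98.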

IV-B Baseline Methods

STCGAT was compared to some of the most advanced baseline models. The following is a summary of these baselines.

• HA[26]: The model uses the average of historical traffic data as the forecast value.

• VAR[27]: The model captures the dependencies between temporal data.

• FC-LSTM[14]: The model uses a traditional RNN to model the time dependence of historical traffic data.

• TCN[29]: The model allows for processing long time series data at a fraction of the cost.

• DCRNN[9]: The model is based on GCN to capture spatial dependencies and uses an encoder-decoder architecture to capture temporal correlations.

• ASTGCN[10]: The model uses spatial and temporal attention mechanisms to capture underlying Spatio-temporal patterns.

• STSGCN[35]: The model captures local Spatio-temporal dependencies using spatial graph convolution and one-dimensional temporal convolution, respectively.

• AGCRN[11]: The model proposes an adaptive graph convolution module that can autonomously learn the adjacency matrix of a traffic road network.

• STFGNN[19]: The model effectively fuses multiple spatial and temporal graphs to learn the Spatio-temporal relationships hidden in traffic data.

• STGODE[36]: The model captures deeper Spatio-temporal dependencies by expanding the receptive field of the GCN.

• Z-GCNETs[38]: The model is designed with a time-aware GCN to capture the complex Spatio-temporal dependencies in traffic data.

• STG-NCDE[39]: The model is designed with two independent neural controlled differential equations for modeling spatial and temporal dependence.

IV-C Experimental Settings and Evaluation Metrics

We divide each dataset into a 60% training set, a 20% validation set, and a 20% test set, and apply Z-score standardization. The partitioned data are then processed through a sliding window of length $2T$, where the first $T$ time steps are used as historical data and the last $T$ time steps are used as labeled values. Here, we set $T$ to $12$. That is, one hour of historical traffic data is used to predict traffic data for the next hour.
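The preprocessing described above can be sketched for a single sensor as follows. The stride of one step and the use of training-set statistics for the Z-score are assumptions that the text does not spell out, and the readings are made up.

```python
import statistics

def split_60_20_20(series):
    """Chronological 60/20/20 split using integer arithmetic."""
    n = len(series)
    a, b = n * 6 // 10, n * 8 // 10
    return series[:a], series[a:b], series[b:]

def zscore(series, mean, std):
    return [(v - mean) / std for v in series]

def sliding_windows(series, T):
    """(history, label) pairs cut from windows of length 2T with stride 1."""
    return [(series[s:s + T], series[s + T:s + 2 * T])
            for s in range(len(series) - 2 * T + 1)]

T = 12
raw = [float(v % 30) for v in range(100)]  # made-up single-sensor readings
train, val, test = split_60_20_20(raw)
mean, std = statistics.mean(train), statistics.pstdev(train)  # training statistics only
train_pairs = sliding_windows(zscore(train, mean, std), T)
print(len(train), len(val), len(test), len(train_pairs))  # 60 20 20 37
```

A split of length $n$ yields $n-2T+1$ windows, so the 60 training steps of this toy series give 37 training samples.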

Meanwhile, STCGAT is implemented using the PyTorch deep learning framework. In the hyperparameter settings of our model, the embedding feature dimension of the nodes is set to 10, the hidden layer size is 64, the number of multi-head attention mechanisms is set to 3, and the convolutional kernel size is set to 2. During the training process, the learning rate is set to 0.001, the batch size is set to 64, and the model is optimized using the Adam optimizer with a maximum number of iterations of $300$ . The training environment of the model is shown in Table II.

TABLE II: Experimental environments.

System	Ubuntu 18.04.6
CPU	Intel Core i5-10500 @ 3.10GHz
GPU	NVIDIA GeForce 2080Ti

We use the following three performance metrics to measure the model’s predictive power.

• Mean Absolute Error (MAE):

\text{MAE}=\frac{1}{L}\sum_{i=1}^{L}\lvert Y_{i}-Y^{\prime}_{i}\rvert

• Root Mean Squared Error (RMSE):

\text{RMSE}=\sqrt{\frac{1}{L}\sum_{i=1}^{L}(Y_{i}-Y^{\prime}_{i})^{2}}

• Mean Absolute Percentage Error (MAPE):

\text{MAPE}=\frac{100\%}{L}\sum_{i=1}^{L}\left\lvert\frac{Y_{i}-Y^{\prime}_{i}}{Y_{i}}\right\rvert

Where $L$ denotes the total number of samples. The lower the values of the three indicators above, the greater the model’s predictive accuracy. We conduct each experiment five times and then calculate the mean value as the test result.
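For reference, the three metrics can be computed as in the following sketch, which uses the standard definitions of MAE, RMSE, and MAPE. The sample values are made up. How zero ground-truth values are handled in MAPE is not stated in the text; this sketch simply does not guard against them.

```python
import math

def mae(y, y_hat):
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def mape(y, y_hat):
    """Mean absolute percentage error in percent; undefined if any y is 0."""
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(y, y_hat)) / len(y)

y = [10.0, 20.0, 40.0]       # made-up ground truth
y_hat = [12.0, 18.0, 44.0]   # made-up predictions
print(round(mae(y, y_hat), 4))   # 2.6667
print(round(rmse(y, y_hat), 4))  # 2.8284
print(round(mape(y, y_hat), 4))  # 13.3333
```

RMSE is never smaller than MAE because squaring weights large errors more heavily, which is why both are reported together.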

TABLE III: Performance comparison of STCGAT and baseline models on the PeMS03, PeMS04, PeMS07, and PeMS08 datasets.

Model	Dataset	PeMS03			PeMS04			PeMS07			PeMS08
Model	Metrics	MAE	RMSE	MAPE	MAE	RMSE	MAPE	MAE	RMSE	MAPE	MAE	RMSE	MAPE
HA		31.74	51.79	33.49%	39.87	59.04	27.59%	45.32	65.74	23.92%	35.16	59.74	28.35%
VAR		23.75	37.97	24.53%	24.61	38.61	17.54%	49.89	75.45	32.13%	19.21	29.84	13.13%
FC-LSTM		20.96	36.01	20.76%	25.01	41.42	16.18%	33.26	59.92	14.32%	23.49	38.89	14.55%
TCN		19.32	33.55	19.93%	23.22	37.26	15.59%	32.27	42.23	14.26%	22.72	35.79	14.03%
DCRNN		17.48	29.19	16.83%	21.22	33.44	14.17%	24.69	37.88	10.80%	16.82	26.32	10.92%
ASTGCN		17.65	29.63	16.94%	22.03	34.99	14.59%	24.01	37.87	10.73%	18.36	28.31	11.25%
STSGCN		17.48	29.21	16.78%	21.19	33.65	13.90%	24.26	39.03	10.21%	17.13	26.80	10.96%
AGCRN		15.97	28.11	15.23%	19.83	32.26	12.97%	21.13	35.20	8.96%	15.95	25.22	10.09%
STFGNN		16.77	28.34	16.30%	20.18	32.41	13.94%	22.07	35.80	9.21%	16.64	26.25	10.60%
STGODE		16.32	27.23	16.25%	20.95	32.66	14.95%	22.90	37.54	10.14%	16.81	25.97	10.62%
Z-GCNETs		16.64	28.15	16.39%	19.50	31.61	12.78%	21.77	35.17	9.25%	15.76	25.11	10.01%
STG-NCDE		15.58	27.08	15.32%	19.81	31.57	14.04%	21.78	34.75	9.42%	15.62	24.87	10.14%
$\mathbf{STCGAT}$		$\mathbf{15.31}$	$\mathbf{27.03}$	$\mathbf{14.90\%}$	$\mathbf{19.21}$	$\mathbf{31.12}$	$\mathbf{12.36\%}$	$\mathbf{20.89}$	$\mathbf{34.00}$	$\mathbf{8.79\%}$	$\mathbf{15.41}$	$\mathbf{24.61}$	$\mathbf{9.78\%}$

IV-D Experiment Results and Analysis

On PeMS03, PeMS04, PeMS07, and PeMS08, we compared our model with the twelve representative baselines described above. Table III reports the prediction performance of STCGAT and the baseline models over one hour (12 prediction steps). We make the following observations: 1) The GCN-based methods outperform the HA, VAR, and FC-LSTM baselines, demonstrating the significance of explicitly modeling spatial correlation and the efficacy of GCN in traffic flow prediction tasks; 2) Our improved GAT-based method achieves the best value of every metric on every dataset in Table III; for example, on PeMS04 it reduces the MAE from 19.50 for the best baseline (Z-GCNETs) to 19.21; 3) Our proposed method has stable short- and long-term Spatio-temporal prediction capabilities. Fig. 3 compares the prediction performance of STCGAT with other Spatio-temporal prediction models at different horizons. The oscillations of the performance curve of STCGAT on each dataset are relatively small, indicating that our proposed method is insensitive to the prediction horizon and that its prediction performance is stable; 4) As shown in Fig. 4, we use STCGAT and a representative baseline model to predict the traffic conditions at a road node for 288 consecutive time steps (24 hours). The predictions of STCGAT match the actual values more closely, indicating that STCGAT captures the temporal and spatial correlations in the traffic flow sequence more accurately.

TABLE IV: Results of STCGAT ablation experiments on PeMS04 and PeMS08 datasets.

	PeMS04			PeMS08
Models	MAE	RMSE	MAPE	MAE	RMSE	MAPE
w/o node embedding	22.03	34.65	$15.29\%$	17.68	27.65	$11.04\%$
w/o ResNet	19.72	32.45	$13.57\%$	16.01	25.37	$10.31\%$
w/o reverse GRU	19.80	31.64	$13.29\%$	15.98	25.05	$10.33\%$
w/o TCN	23.74	39.05	$16.31\%$	21.91	34.76	$13.84\%$
$\mathbf{STCGAT}$	$\mathbf{19.21}$	$\mathbf{31.12}$	$\mathbf{12.36\%}$	$\mathbf{15.41}$	$\mathbf{24.61}$	$\mathbf{9.78\%}$

IV-E Ablation Study on Model Architecture

We created four variants of STCGAT and compared STCGAT with them on the PeMS04 and PeMS08 datasets to understand the contribution of each of its modules. The four variants differ as follows.

1. w/o node embedding: The model removes the node embedding operation and uses only the traditional predefined adjacency matrix.
2. w/o ResNet: The model removes the residual connection structure in STCGAT.
3. w/o reverse GRU: The model uses only a one-way GRU to capture the Spatio-temporal correlation of the traffic network.
4. w/o TCN: The model removes the temporal convolutional network.

Table IV reports the results of these four variants and STCGAT on PeMS04 and PeMS08. Combined with the histogram of evaluation metrics for each variant at different time steps in Fig. 5, we make the following observations: 1) Every evaluation metric of w/o TCN is the largest among all models (for example, the MAE rises from 19.21 to 23.74 on PeMS04 and from 15.41 to 21.91 on PeMS08), and its metrics are similar in the short term (15 min) and the long term (60 min). This illustrates the necessity of using temporal convolutional networks to extract the global temporal correlation of traffic flow; 2) The errors of w/o node embedding increase markedly once the node self-learning module is removed. This indicates that learning the dynamic spatial correlation among nodes from the state of the traffic flow at different moments better reflects how traffic flow changes in reality; 3) The metrics of w/o ResNet are larger than those of STCGAT on both datasets (the MAE is 19.72 versus 19.21 on PeMS04 and 16.01 versus 15.41 on PeMS08). This indicates that the residual module helps STCGAT mitigate, to a certain extent, the overfitting or vanishing-gradient problems caused by stacking network layers; 4) The metrics of w/o reverse GRU also increase on both PeMS04 and PeMS08, indicating that capturing the causality of traffic flow data helps to analyze the Spatio-temporal correlation of traffic flow more comprehensively; 5) STCGAT performs best among all five models. On the one hand, this shows the importance of each module in STCGAT; on the other hand, it shows that the full model extracts the Spatio-temporal correlations in traffic flow series more accurately.
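To make the size of each module's contribution easier to compare, the relative error increase of every variant over the full model can be computed directly from Table IV. The sketch below does this arithmetic; the values are copied from Table IV, while the helper names `relative_increase` and `ablation_report` are hypothetical and introduced only for illustration.

```python
# (MAE, RMSE, MAPE in %) per dataset, copied from Table IV.
TABLE_IV = {
    "w/o node embedding": {"PeMS04": (22.03, 34.65, 15.29), "PeMS08": (17.68, 27.65, 11.04)},
    "w/o ResNet":         {"PeMS04": (19.72, 32.45, 13.57), "PeMS08": (16.01, 25.37, 10.31)},
    "w/o reverse GRU":    {"PeMS04": (19.80, 31.64, 13.29), "PeMS08": (15.98, 25.05, 10.33)},
    "w/o TCN":            {"PeMS04": (23.74, 39.05, 16.31), "PeMS08": (21.91, 34.76, 13.84)},
    "STCGAT":             {"PeMS04": (19.21, 31.12, 12.36), "PeMS08": (15.41, 24.61, 9.78)},
}


def relative_increase(variant, full):
    """Percentage increase of each metric of a variant over the full model."""
    return tuple(round(100.0 * (v - f) / f, 1) for v, f in zip(variant, full))


def ablation_report(table, full_name="STCGAT"):
    """Map each ablated variant to its per-dataset relative error increase."""
    report = {}
    for name, per_dataset in table.items():
        if name == full_name:
            continue
        report[name] = {
            dataset: relative_increase(values, table[full_name][dataset])
            for dataset, values in per_dataset.items()
        }
    return report


report = ablation_report(TABLE_IV)
# Variant whose removal hurts MAE the most on PeMS04.
worst = max(report, key=lambda name: report[name]["PeMS04"][0])

for name, per_dataset in report.items():
    print(name, per_dataset)
print("largest MAE increase on PeMS04:", worst)
```

With the Table IV numbers, removing the TCN raises the MAE by roughly 24% on PeMS04 and roughly 42% on PeMS08, removing node embedding raises it by about 15% on both datasets, and removing the residual connection or the reverse GRU raises it by about 3% to 4%.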

V CONCLUSION

This paper proposes a new Spatio-temporal causal prediction model, STCGAT. First, STCGAT encodes all road nodes by node embedding and adaptively learns the relationships between road nodes according to the road traffic conditions at different moments, which frees the model from the constraint of a predefined adjacency matrix. This mechanism is integrated into GAT to form NAL-GAT, which dynamically models the spatial dependence of traffic road networks. Second, STCGAT embeds NAL-GAT into a GRU to capture local Spatio-temporal dependencies, and it uses a bi-directional GRU to capture the Spatio-temporal causality of traffic data in a fine-grained manner. In addition, STCGAT reduces the network degradation problem caused by deep networks by introducing a residual module. Finally, STCGAT captures the global Spatio-temporal dependence of traffic data by processing the time series in parallel with a TCN. Extensive comparative experiments on multiple datasets demonstrate that STCGAT has strong Spatio-temporal modeling capability for highly nonlinear traffic data and consistently achieves the best prediction performance among the advanced Spatio-temporal prediction models compared. In future work, we will investigate the proposed model in other Spatio-temporal data mining problems, such as weather prediction.

Acknowledgement

This work was supported in part by the National Key Technologies R&D Program (2020YFB1712401, 2018YFB1701400), the 2020 Key Project of Public Benefit in Henan Province of China (201300210500), and the Natural Science Foundation of China (62006210).

References

[1] M. Majid, S. Habib, A. R. Javed, M. Rizwan, G. Srivastava, T. R. Gadekallu, and J. C.-W. Lin, “Applications of wireless sensor networks and internet of things frameworks in the industry revolution 4.0: A systematic literature review,” Sensors, vol. 22, no. 6, p. 2087, 2022.
[2] X. Zhao, H. Askari, and J. Chen, “Nanogenerators for smart cities in the era of 5G and Internet of Things,” Joule, vol. 5, no. 6, pp. 1391–1431, 2021.
[3] L. Guevara and F. Auat Cheein, “The role of 5G technologies: Challenges in smart cities and intelligent transportation systems,” Sustainability, vol. 12, no. 16, p. 6469, 2020.
[4] A. Richter, M.-O. Löwner, R. Ebendt, and M. Scholz, “Towards an integrated urban development considering novel intelligent transportation systems: Urban development considering novel transport,” Technological Forecasting and Social Change, vol. 155, p. 119970, 2020.
[5] X.-y. Xu, J. Liu, H.-y. Li, and J.-Q. Hu, “Analysis of subway station capacity with the use of queueing theory,” Transportation research part C: emerging technologies, vol. 38, pp. 28–43, 2014.
[6] E. Cascetta, Transportation systems engineering: theory and methods. Springer Science & Business Media, 2013, vol. 49.
[7] C. Li and P. Xu, “Application on traffic flow prediction of machine learning in intelligent transportation,” Neural Computing and Applications, vol. 33, no. 2, pp. 613–624, 2021.
[8] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, “A comprehensive survey on graph neural networks,” IEEE transactions on neural networks and learning systems, vol. 32, no. 1, pp. 4–24, 2020.
[9] Y. Li, R. Yu, C. Shahabi, and Y. Liu, “Diffusion convolutional recurrent neural network: Data-driven traffic forecasting,” arXiv preprint arXiv:1707.01926, 2017.
[10] S. Guo, Y. Lin, N. Feng, C. Song, and H. Wan, “Attention based spatial-temporal graph convolutional networks for traffic flow forecasting,” in Proceedings of the AAAI conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 922–929.
[11] L. Bai, L. Yao, C. Li, X. Wang, and C. Wang, “Adaptive graph convolutional recurrent network for traffic forecasting,” Advances in neural information processing systems, vol. 33, pp. 17804–17815, 2020.
[12] L. Waikhom and R. Patgiri, “Graph neural networks: Methods, applications, and opportunities,” arXiv preprint arXiv:2108.10733, 2021.
[13] S. Sha, J. Li, K. Zhang, Z. Yang, Z. Wei, X. Li, and X. Zhu, “RNN-based subway passenger flow rolling prediction,” IEEE Access, vol. 8, pp. 15232–15240, 2020.
[14] M. Karimzadeh, R. Aebi, A. M. de Souza, Z. Zhao, T. Braun, S. Sargento, and L. Villas, “Reinforcement learning-designed LSTM for trajectory and traffic flow prediction,” in 2021 IEEE Wireless Communications and Networking Conference (WCNC). IEEE, 2021, pp. 1–6.
[15] P. Sun, A. Boukerche, and Y. Tao, “SSGRU: A novel hybrid stacked GRU-based traffic volume prediction approach in a road network,” Computer Communications, vol. 160, pp. 502–511, 2020.
[16] T. Li, A. Ni, C. Zhang, G. Xiao, and L. Gao, “Short-term traffic congestion prediction with Conv-BiLSTM considering spatio-temporal features,” IET Intelligent Transport Systems, vol. 14, no. 14, pp. 1978–1986, 2020.
[17] H. Zheng, F. Lin, X. Feng, and Y. Chen, “A hybrid deep learning model with attention-based Conv-LSTM networks for short-term traffic flow prediction,” IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 11, pp. 6910–6920, 2020.
[18] D. Ma, X. Song, and P. Li, “Daily traffic flow forecasting through a contextual convolutional recurrent neural network modeling inter-and intra-day traffic patterns,” IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 5, pp. 2627–2636, 2020.
[19] M. Li and Z. Zhu, “Spatial-temporal fusion graph neural networks for traffic flow forecasting,” in Proceedings of the AAAI conference on artificial intelligence, vol. 35, no. 5, 2021, pp. 4189–4196.
[20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[21] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
[22] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in International Conference on Learning Representations (ICLR), 2017.
[23] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein, “Geometric deep learning on graphs and manifolds using mixture model CNNs,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5115–5124.
[24] R. Li, S. Wang, F. Zhu, and J. Huang, “Adaptive graph convolutional neural networks,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018.
[25] J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D.-Y. Yeung, “GaAN: Gated attention networks for learning on large and spatiotemporal graphs,” arXiv preprint arXiv:1803.07294, 2018.
[26] Y. Duan, Y. Lv, Y.-L. Liu, and F.-Y. Wang, “An efficient realization of deep learning for traffic data imputation,” Transportation research part C: emerging technologies, vol. 72, pp. 168–181, 2016.
[27] B. Dissanayake, O. Hemachandra, N. Lakshitha, D. Haputhanthri, and A. Wijayasiri, “A comparison of ARIMAX, VAR and LSTM on multivariate short-term traffic volume forecasting,” in Conference of Open Innovations Association, FRUCT, no. 28. FRUCT Oy, 2021, pp. 564–570.
[28] L. Hao, L. Leixiao, and W. Hui, “Survey on research and application of support vector machines in intelligent transportation system,” Journal of Frontiers of Computer Science & Technology, vol. 14, no. 6, p. 901, 2020.
[29] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
[30] K. Zhang, Z. Liu, and L. Zheng, “Short-term prediction of passenger demand in multi-zone level: Temporal convolutional neural network with multi-task learning,” IEEE transactions on intelligent transportation systems, vol. 21, no. 4, pp. 1480–1490, 2019.
[31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[32] M. Xu, W. Dai, C. Liu, X. Gao, W. Lin, G.-J. Qi, and H. Xiong, “Spatial-temporal transformer networks for traffic flow forecasting,” arXiv preprint arXiv:2001.02908, 2020.
[33] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, “Informer: Beyond efficient transformer for long sequence time-series forecasting,” in Proceedings of AAAI, 2021.
[34] H. Wu, J. Xu, J. Wang, and M. Long, “Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting,” Advances in Neural Information Processing Systems, vol. 34, 2021.
[35] C. Song, Y. Lin, S. Guo, and H. Wan, “Spatial-temporal synchronous graph convolutional networks: A new framework for spatial-temporal network data forecasting,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 01, 2020, pp. 914–921.
[36] Z. Fang, Q. Long, G. Song, and K. Xie, “Spatial-temporal graph ODE networks for traffic flow forecasting,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 364–373.
[37] H.-L. Li, Y. Liang, and S.-C. Wang, “Review on dynamic time warping in time series data mining,” Control and Decision, vol. 33, no. 8, pp. 1345–1353, 2018.
[38] Y. Chen, I. Segovia, and Y. R. Gel, “Z-GCNETs: Time zigzags at graph convolutional networks for time series forecasting,” in International Conference on Machine Learning. PMLR, 2021, pp. 1684–1694.
[39] J. Choi, H. Choi, J. Hwang, and N. Park, “Graph neural controlled differential equations for traffic forecasting,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 6, 2022, pp. 6367–6374.
[40] S. Guo, Y. Lin, H. Wan, X. Li, and G. Cong, “Learning dynamics and heterogeneity of spatial-temporal graph data for traffic forecasting,” IEEE Transactions on Knowledge and Data Engineering, 2021.