
DQN Control Solution for KDD Cup 2021 City Brain Challenge

Yitian Chen yitianartsky@gmail.com BIGO Technology Kunlong Chen chenkunlong@bigo.sg BIGO Technology Kunjin Chen kunjin.ckj@alibaba-inc.com Alibaba Group  and  Lin Wang marvin.wl@alibaba-inc.com Alibaba Group
(2021)
Abstract.

We took part in the city brain challenge competition (http://www.yunqiacademy.org/poster) and achieved the 8th place. In this competition, the players are provided with a real-world city-scale road network and its traffic demand derived from real traffic data. The players are asked to coordinate the traffic signals with a self-designed agent to maximize the number of vehicles served while maintaining an acceptable delay. In this abstract paper, we present an overall analysis and our detailed solution to this competition. Our approach is mainly based on the adaptation of the deep Q-network (DQN) for real-time traffic signal control. From our perspective, the major challenge of this competition is how to extend the classical DQN framework to traffic signal control in a complex real-world road network and traffic flow situation. After trying and implementing several classical reward functions (Abdoos et al., 2011; Wei et al., 2019a), we finally chose to apply our newly-designed reward in our agent. By applying our newly-proposed reward function and carefully tuning the control scheme, an agent based on a single DQN model can rank among the top 15 teams. We hope this paper can serve, to some extent, as a baseline solution to traffic signal control of real-world road networks and inspire further attempts and research.

Reinforcement learning, DQN, Traffic signal control
copyright: acmcopyright; journal year: 2021; doi: 10.1145/1122445.1122456; conference: City Brain Challenge @ KDD 2021 Workshop, August 14-18, 2021, Singapore; price: 15.00; isbn: 978-1-4503-XXXX-X/18/06

1. Introduction

Traffic signals coordinate the traffic movements at intersections, and a smart traffic signal coordination algorithm is the key to transportation efficiency. In this competition, the players are provided with a real-world city-scale road network and its traffic demand derived from real traffic data. The players are asked to coordinate the traffic signals with a self-designed agent. At each time step, for each intersection, the agent can select one of 8 types of signal phases, each serving a pair of non-conflicting traffic movements (e.g., phase 1 gives right-of-way to left-turn traffic from the northern and southern approaches), to maximize the number of vehicles served while maintaining an acceptable delay.

Figure 1. An example of a four-leg intersection and its phase types.

Figure 1 shows an example of a four-leg intersection and the 8 types of signal phases. The evaluation metric used in this competition is the total number of vehicles served. In the final phase, the delay index is computed every 20 seconds to evaluate the players' final score. The evaluation process is terminated once the delay index reaches the predefined threshold, which is set to 1.40.

As introduced by the competition organizing committee, many reinforcement-learning-based (RL-based) approaches have been proposed and have achieved state-of-the-art results in simulation environments (Wei et al., 2019a; Abdoos et al., 2011; Wei et al., 2019b; Zheng et al., 2019). However, most of this research is based on simulation environments, and there is little work derived from real-world city-scale road networks and traffic data. This competition provides a good opportunity to validate the performance of different RL-based approaches in real-world situations. In this paper, we present an overall analysis of the competition and our DQN-based solution framework for the City Brain Challenge.

2. Preliminaries

Table 1. Summary of notations
Notation Meaning
$v$ A vehicle in the traffic flow.
$d(v)$ Delay index of vehicle $v$, $d(v)=1-speed(v)/speedLimit(v)$.
$q(v)$ Queue status of vehicle $v$, $q(v)=\mathrm{Bool}(speed(v)<0.3)$.
$I$ Set of intersections.
$J$ Lane index set of an intersection, ranging from 0 to 23.
$A_{i}^{k}$ Zone of influence of intersection $i$ with distance $k$.
$x^{t}_{j}(A_{i}^{k})$ Vehicle number on lane $j$ of the zone of influence $A_{i}^{k}$ of intersection $i$ at step $t$.
$d^{t}_{j}(A_{i}^{k})$ Delay index on lane $j$ of the zone of influence $A_{i}^{k}$ of intersection $i$ at step $t$.
$q^{t}_{j}(A_{i}^{k})$ Queue length on lane $j$ of the zone of influence $A_{i}^{k}$ of intersection $i$ at step $t$.

2.1. Environment and Data Description

Figure 2 shows the distribution of road lengths in this competition, which range from 37 meters to more than 4,000 meters. Meanwhile, a selected traffic signal phase lasts for 10 seconds in the default setting. As a result, it may be ineffective to take all vehicles in the corresponding lanes of an intersection into consideration.

Figure 2. The distribution of road lengths in the competition.

Our first task is to define the zone of influence. Specifically, for each intersection $i$, we define the segment of each lane whose distance to the intersection center is less than $k$ meters as the zone of influence and denote it as $A_{i}^{k}$. Meanwhile, for each lane, the lane length caps the distance $k$, which can be expressed as:

$k:=\min(k,\text{lane length})$

The design and calculation of the states and rewards in our solution framework hereinafter are based on this definition.
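The zone-of-influence filter can be sketched as follows (a minimal illustration, not the exact competition API); we assume each vehicle record carries its lane index, its lane length, and its distance along the lane to the intersection center:

    def in_zone_of_influence(dist_to_center, lane_length, k):
        # Cap the distance by the lane length, i.e., k := min(k, lane_length).
        k_capped = min(k, lane_length)
        return dist_to_center <= k_capped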

2.2. Notations and Definition

In Table 1, we summarize some important notations used in our framework. For a vehicle $v$ extracted from the object "info" in the simulation environment, we define the delay of the vehicle as 1 minus the speed of $v$ divided by the speed limit of the current lane, $speedLimit(v)$:

(1) $d(v)=1-\frac{speed(v)}{speedLimit(v)}$

Meanwhile, we define the queue status $q(v)$ to indicate whether a vehicle has a speed smaller than $0.3$ m/s. $J$ is the index set of the lanes of an intersection, ranging from 0 to 23. The index of a nonexistent lane is marked as $-1$. We further define $x^{t}_{j}(A_{i}^{k})$, $d^{t}_{j}(A_{i}^{k})$, and $q^{t}_{j}(A_{i}^{k})$ as the number of vehicles, the delay index, and the queue length within the zone of influence $A_{i}^{k}$ on lane $j$ at step $t$, respectively. More specifically, the delay index of a lane is the average delay of all vehicles being considered. For a given intersection $i$ and distance $k$, one can easily obtain the statistics for each lane based on the information of vehicles within the zone of influence at time step $t$.
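These per-lane statistics can be computed directly from the vehicle records, for instance as in the following sketch (using the in_zone_of_influence helper sketched above; the field names speed, speed_limit, lane, dist, and lane_length are our assumptions about the simulator's "info" object rather than its official interface):

    def lane_statistics(vehicles, lane_j, k):
        # Keep only vehicles on lane j that fall inside the zone of influence A_i^k.
        in_zone = [v for v in vehicles
                   if v["lane"] == lane_j
                   and in_zone_of_influence(v["dist"], v["lane_length"], k)]
        x = len(in_zone)                                     # vehicle number x_j^t(A_i^k)
        delays = [1.0 - v["speed"] / v["speed_limit"] for v in in_zone]
        d = sum(delays) / x if x > 0 else 0.0                # lane delay index d_j^t(A_i^k)
        q = sum(1 for v in in_zone if v["speed"] < 0.3)      # queue length q_j^t(A_i^k)
        return x, d, q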

In addition, the definition of the actions is clearly described in the official document: at step $t$, for each intersection $i$, the agent chooses a phase $p_{it}$, indicating that the traffic signal should be set to phase $p_{it}$, which is an element in $\mathcal{S}=\{1,2,3,4,5,6,7,8\}$.

3. Method

In this section, we first present the state and reward design for the agent. Then we illustrate the control schemes, model design, and model training schemes of the proposed DQN framework. Finally, we describe our rule-based agent. As we will show in Section 4, an ensemble of the DQN-based agent and the rule-based agent achieves our best evaluation score on the leader-board.

3.1. State Design

For each intersection, the state includes the current phase $p$, the current time step, the time duration of the current phase, and the statistical features of vehicle number, delay index, queue length, and pressure for each phase, based on the definition of the zone of influence.

Table 2 shows the statistical features we use in the state design. To calculate the aggregate feature of each traffic signal phase, we extract the feature for the corresponding lanes given a specific value of $k$ and add them together.

For instance, for the feature pair "vehicle number" and $k=60$ of signal phase 1, we sum the vehicle numbers in the segments of lanes 0 and 6 within the zone of influence with $k=60$, which can be expressed as:

$x^{t}_{0}(A_{i}^{60})+x^{t}_{6}(A_{i}^{60})$

The same state for the other phases can be obtained likewise. As the total number of signal phases is 8, the corresponding feature dimension is 8.

Further, for the feature pair "pressure of vehicle number" and $k=60$ of signal phase 1, in addition to the upstream vehicle counts, we sum the downstream vehicle numbers on the lanes $[15,16,17,21,22,23]$ within the zone of influence with $k=60$ and divide the sum by 3. To calculate the pressure, we subtract this scaled downstream vehicle number from the upstream vehicle number, which can be expressed as:

(2) $x^{t}_{0}(A_{i}^{60})+x^{t}_{6}(A_{i}^{60})-\frac{1}{3}\left[x^{t}_{15}(A_{i}^{60})+x^{t}_{16}(A_{i}^{60})+x^{t}_{17}(A_{i}^{60})+x^{t}_{21}(A_{i}^{60})+x^{t}_{22}(A_{i}^{60})+x^{t}_{23}(A_{i}^{60})\right]$
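As an illustration, the phase-level aggregation for phase 1 can be sketched as below; the upstream lanes $[0,6]$ and downstream lanes $[15,16,17,21,22,23]$ are taken from the example above, and the mappings for the other phases are defined analogously:

    UP_LANES_PHASE1 = [0, 6]
    DOWN_LANES_PHASE1 = [15, 16, 17, 21, 22, 23]

    def phase1_pressure(x):
        # x[j] is the vehicle number on lane j within the zone of influence A_i^60,
        # as in Eq. (2).
        upstream = sum(x[j] for j in UP_LANES_PHASE1)
        downstream = sum(x[j] for j in DOWN_LANES_PHASE1) / 3.0
        return upstream - downstream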
Table 2. Statistical features used in the state design. For each feature-distance pair, e.g., the feature vehicle number and the zone of influence with distance $k=60$, the feature dimension is 8, corresponding to the 8 signal phases.
Feature dimension
Feature group $k=60$ $k=100$ $k=200$
Vehicle number 8 8 8
Delay index 8 8 8
Queue length 8 8 8
Pressure of vehicle number 8 8 8
Pressure of delay index 8 8 8
Pressure of queue length 8 8 8

3.2. Reward Design

Various reward designs have been proposed in the literature. Our first attempt was to use the queue length and delay rewards (Wei et al., 2019b). Specifically, given an intersection $i\in I$ at time $t$ and its zone of influence $A_{i}^{k}$, the delay (or queue length) reward at time $t$ is the negative sum of the delay indices (or queue lengths) over all lanes at the next time step $t+1$:

(3) $R^{\rm{delay}}(t)=-\sum\limits_{j=0}^{23}d_{j}^{t+1}(A_{i}^{k}),\qquad R^{\rm{queue}}(t)=-\sum\limits_{j=0}^{23}q_{j}^{t+1}(A_{i}^{k})$

A combination of delay index and queue length (we denote it as “DQ”) can empirically achieve better results:

(4) $R^{\rm{DQ}}(t)=-\sum\limits_{j=0}^{23}\left(d_{j}^{t+1}(A_{i}^{k})+q_{j}^{t+1}(A_{i}^{k})\right)$

Another reward recommended by the official competition organizers is max pressure (MP) (Wei et al., 2019a). For an intersection $i$ at time $t$, the MP reward is defined based on the pressure of the intersection, i.e., the absolute difference between the total upstream and downstream vehicle numbers:

(5) $R^{\rm{MP}}(t)=\left|\sum\limits_{j=0}^{11}x_{j}^{t+1}(A_{i}^{k})-\sum\limits_{j=12}^{23}x_{j}^{t+1}(A_{i}^{k})\right|$

However, the performance of the original reward function is relatively poor. In our implementation, combining the MP reward with the DQ reward can achieve better results:

(6) $R^{\rm{MP\text{-}DQ}}(t)=-\sum\limits_{j=0}^{11}\left(d_{j}^{t+1}(A_{i}^{k})+q_{j}^{t+1}(A_{i}^{k})\right)+\frac{1}{2}\sum\limits_{j=12}^{23}\left(d_{j}^{t+1}(A_{i}^{k})+q_{j}^{t+1}(A_{i}^{k})\right)$

Further, the reward function that achieves the best single-model result is a new one we propose based on diagnostic analysis of our experiment results, which we name "Twin-DQ":

(7) $R^{\rm{Twin\text{-}DQ}}(t)=-\sum\limits_{j=0}^{11}\left(d_{j}^{t+1}(A_{i}^{k})+q_{j}^{t+1}(A_{i}^{k})\right)-\sum\limits_{j=12}^{23}\left[\left(d_{j}^{t+1}(A_{i}^{k})+q_{j}^{t+1}(A_{i}^{k})\right)-\left(d_{j}^{t}(A_{i}^{k})+q_{j}^{t}(A_{i}^{k})\right)\right]$

The aim of the first term of the equation is to minimize the delay and queue length of the upstream lanes, while the purpose of the second term is to minimize the difference of DQ between time steps $t$ and $t+1$. The basic idea of "Twin-DQ" is to consider the rewards of upstream lanes and downstream lanes separately. An action of the traffic signal affects the queue length and delay in the upstream lanes directly. However, the influence on the downstream lanes is more complicated. Thus, the "Twin-DQ" reward tries to maintain the congestion level of the downstream lanes, which may facilitate the coordination between upstream and downstream intersections. In our experiments, setting $k$ to $100$ achieves the best results on the local evaluation dataset.
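A minimal sketch of the "Twin-DQ" reward of Eq. (7) is given below; d_next/q_next denote the per-lane delay indices and queue lengths at step t+1, d_cur/q_cur those at step t, and lanes 0-11 (12-23) are treated as upstream (downstream), following the lane indexing above:

    def twin_dq_reward(d_next, q_next, d_cur, q_cur):
        # First term: penalize delay and queue length on the upstream lanes.
        upstream = -sum(d_next[j] + q_next[j] for j in range(0, 12))
        # Second term: penalize the increase of delay plus queue length
        # on the downstream lanes between steps t and t+1.
        downstream = -sum((d_next[j] + q_next[j]) - (d_cur[j] + q_cur[j])
                          for j in range(12, 24))
        return upstream + downstream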

3.3. Control Scheme

By default, the decision making process of the agent for a given intersection is triggered if the green time exceeds 30 (or 20) seconds. A major reason for this setting is that switching to another phase results in a 5-second "all red" period, during which no vehicles may pass the intersection. Thus, a default setting of a relatively long green time (e.g., 20 or 30 seconds) can avoid such a situation and guarantee higher returns in the long run. However, it is observed that there are situations in which the current phase goes on with no vehicles in the upstream lanes. A more sophisticated approach is to consider not only the green time of the current phase, but also alternative conditions that may trigger the agent to make better decisions. As a result, the set of conditions we use to trigger the DQN agent to perform the phase selection is as follows:

Condition 1: the green time of the current signal phase exceeds 30 seconds.

Condition 2: the queue length of the current signal phase in the upstream zone of influence with $k=60$ equals zero.

Condition 3: the queue length of the current signal phase in the downstream zone of influence with $k=60$ equals 8.

Condition 4: the queue pressure of the current signal phase in the zone of influence with $k=60$ is less than -5.

During the training process of the DQNs, if any of conditions above is met, the DQN-agent is triggered to begin the phase selection process.
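The combined triggering check can be sketched as follows (an illustration only; the inputs are assumed to be the pre-computed phase-level features of the currently active phase):

    def should_trigger(green_time, up_queue_k60, down_queue_k60, queue_pressure_k60):
        return (green_time > 30               # Condition 1
                or up_queue_k60 == 0          # Condition 2
                or down_queue_k60 == 8        # Condition 3
                or queue_pressure_k60 < -5)   # Condition 4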

3.4. DQN Framework and Training Scheme

Table 3. Training parameters of DQN
Parameter Value
greenSec 20 seconds
Gamma 0.8
Model update frequency 1
Target model update frequency 17
Learning rate 5e-5
epsilon 0.2
epsilonMin 0.01
Loss function Huber loss (smooth-L1)

Our training framework follows the classical paradigm of the double deep Q-network (double DQN) (Van Hasselt et al., 2016; Mnih et al., 2015), which includes an online network and a target network. The output dimensions of both networks are set to 8, corresponding to the 8 candidate signal phases to be selected. Meanwhile, we set the discount parameter $\gamma$ to 0.8, which roughly aims at maximizing the total rewards of the next 5 rounds (the effective horizon is about $1/(1-\gamma)=5$ decision rounds). The $\epsilon$-greedy policy is adopted to train the model. A detailed description of the training parameters is presented in Table 3.

To stabilize the training process of the DQN framework, we design a two-objective neural network. Specifically, one objective is used for predicting the Q-values, and the other one is used for predicting the rewards. The loss function consists of two terms, namely, the smooth-L1 loss of the predicted Q-values and the smooth-L1 loss of the predicted rewards.

As the reward signals are straightforward, it is observed that this approach can greatly accelerate and stabilize the convergence of the training process in our experiments.
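The two-objective network and its loss can be sketched as follows (a minimal PyTorch illustration under our own assumptions about the hidden sizes and a per-phase reward head, not the exact architecture used in the competition):

    import torch
    import torch.nn as nn

    class TwoHeadDQN(nn.Module):
        def __init__(self, state_dim, hidden_dim=128, n_phases=8):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Linear(state_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            )
            self.q_head = nn.Linear(hidden_dim, n_phases)  # predicts Q-values
            self.r_head = nn.Linear(hidden_dim, n_phases)  # predicts immediate rewards

        def forward(self, state):
            h = self.backbone(state)
            return self.q_head(h), self.r_head(h)

    def two_objective_loss(q_pred, q_target, r_pred, r_target):
        # Smooth-L1 (Huber) loss on both the Q-values and the predicted rewards.
        huber = nn.SmoothL1Loss()
        return huber(q_pred, q_target) + huber(r_pred, r_target)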

3.5. Rule-based Agent

In this subsection, we describe the design of a rule-based agent, in addition to the DQN agent. The core of the rule-based agent is the calculation of the vehicle density near the center of the intersections. In order to alleviate traffic congestion in multiple directions, phases with high vehicle density on both of their upstream lanes are given higher priority. For each upstream lane, the density is calculated as

(8) $\alpha^{\rm{up}}_{j}=\frac{x_{j}(A^{k_{\rm{up}}}_{i})}{k_{\rm{up}}}$

where $k_{\rm{up}}$ is a global value for the zone of influence in the upstream lanes. The density of downstream lanes is calculated as

(9) $\alpha^{\rm{down}}_{j}=\frac{x_{j}(A^{l_{j}-k_{\rm{up}}}_{i})}{l_{j}-k_{\rm{up}}}$

where $l_{j}$ is the length of the downstream lane with index $j$, such that the range $l_{j}-k_{\rm{up}}$ is complementary to the upstream zone of influence. The order of priority for each upstream lane, $o_{j_{1}}$, is decided by the relative density $\alpha_{j_{1}}=\alpha^{\rm{up}}_{j_{1}}-\mu\alpha^{\rm{down}}_{j_{2}}$, where $j_{2}$ is the index of the downstream lane corresponding to the upstream lane $j_{1}$ and $\mu$ is a coefficient to be tuned. Specifically, the order values range from 1 to 12 (1 for the highest priority). The difficulty of designing the rules mainly comes from the fact that each phase generally corresponds to two upstream-downstream lane pairs. An empirical observation is that a phase with two high-priority lane pairs is more favorable than a phase with one top-priority lane and a low-priority lane. Although designing a perfect chain of rules is unlikely, carefully tuning the requirement that both lanes of a phase be balanced is crucial.
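The relative-density ordering can be sketched as follows (an illustration under our own naming; x_up and x_down are the vehicle counts in the upstream and downstream zones, l is the downstream lane length, and k_up and mu are the tunable constants above):

    def relative_density(x_up, x_down, l, k_up, mu):
        alpha_up = x_up / k_up
        alpha_down = x_down / max(l - k_up, 1.0)  # guard against very short lanes
        return alpha_up - mu * alpha_down

    def priority_order(relative_densities):
        # Rank the 12 upstream lanes by descending relative density: order 1 = highest.
        ranked = sorted(range(len(relative_densities)),
                        key=lambda j: relative_densities[j], reverse=True)
        return {lane: rank + 1 for rank, lane in enumerate(ranked)}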

More specifically, the rule has 4 layers of logic. If none of the phases satisfies the condition of a layer, the logic flow moves on to the next layer. For a given phase being considered, we link the dominant upstream-downstream lane pair with index $j_{\mathbf{I}}$ (same as its upstream lane index), while the other lane pair is linked with index $j_{\mathbf{II}}$. The layers of logic are given as follows:

  • Layer 1: the first layer aims at picking a phase that corresponds to a lane blocked for too long. The number of blocked rounds for each lane is calculated and ordered. At the first round, a lane pair $j_{\mathbf{I}}$ is selected if its upstream lane is blocked for more than $C_{\rm{block}}$ seconds and its blocked time is the largest. A phase will be selected if $o_{j_{\mathbf{I}}}\le 5$ and $o_{j_{\mathbf{II}}}\le 4$. If more than one phase satisfies the conditions, the phase serving two roads is picked. At the second round, a lane pair $j_{\mathbf{I}}$ is chosen if its upstream lane is blocked for more than $C_{\rm{block}}+50$ seconds and its blocked time is the second largest. A phase will be selected if $o_{j_{\mathbf{I}}}\le 2$ and $o_{j_{\mathbf{II}}}=1$. The value of $C_{\rm{block}}$ starts with 200 and increases over time until it reaches 300.

  • Layer 2: the second layer aims at picking a phase that corresponds to two high-density lanes. At this layer, the vehicle densities of both lane pairs of each phase are checked, and a phase will be selected if it has two balanced, high-relative-density lane pairs. Given the balance coefficient $C_{\rm{balance}}$, the two lane pairs are balanced if $\alpha_{j_{\mathbf{I}}}<\alpha_{j_{\mathbf{II}}}/C_{\rm{balance}}$ (supposing $\alpha_{j_{\mathbf{I}}}\ge\alpha_{j_{\mathbf{II}}}$). The selection of the second layer has a total of 5 rounds. For the first 4 rounds, the lane pair with $o_{j_{\mathbf{I}}}=1$ is picked in advance, and the lane pair $j_{\mathbf{II}}$ can be selected if the two lane pairs are balanced and $o_{j_{\mathbf{II}}}$ is less than or equal to 2, 3, 4, or 5, respectively. For the 5th round, a phase is selected if its two lane pairs satisfy $o_{j_{\mathbf{I}}}\le 2$ and $o_{j_{\mathbf{II}}}\le 3$ and the two pairs are balanced. The values of $C_{\rm{balance}}$ for the 5 rounds start with 0.15, 0.2, 0.2, 0.2, and 0.25, and they decrease over time until they reach 0.13, 0.18, 0.18, 0.18, and 0.23, respectively.

  • Layer 3: the third layer aims at picking a phase that corresponds to two low-speed lanes. Here, the threshold for low speed, $C_{\rm{speed}}$, is set to 1 m/s. The average speed of each upstream lane is calculated as $s_{j}$, and the speeds are sorted in ascending order (denoted as $o^{\rm{s}}_{j}$). The prerequisite for a phase to be selected is that the average speeds of the two upstream lanes corresponding to the phase are lower than $C_{\rm{speed}}$. In addition, at the first round, a phase is selected if $o^{\rm{s}}_{j_{\mathbf{I}}}\le 2$, $o^{\rm{s}}_{j_{\mathbf{II}}}\le 2$, $o_{j_{\mathbf{I}}}\le 4$, and $o_{j_{\mathbf{II}}}\le 4$. Further, at the second round, a phase will be selected if $o^{\rm{s}}_{j_{\mathbf{I}}}\le 3$, $o^{\rm{s}}_{j_{\mathbf{II}}}\le 3$, $o_{j_{\mathbf{I}}}\le 5$, and $o_{j_{\mathbf{II}}}\le 5$.

  • Layer 4: the final layer picks the phase corresponding to the lane with the highest density if none of the previous conditions can be satisfied. At the same time, a phase that serves two roads is given higher priority (unless one of the roads does not exist). A second round of phase selection is added if the lane with the highest density is a right-turning lane.

4. Experiments

In this section, we first show the performance of three different rewards, namely, "DQ", "Pressure", and "Twin-DQ", under different control schemes on the default round3_flow0 traffic data. We then present our final experimental results on the leader-board.

The evaluation metrics we use here are the total number of vehicles served and the delay index. The performance of the agent is evaluated every 20 seconds, and the evaluation process is terminated when the average delay index reaches the predefined threshold 1.40.
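For clarity, the stop rule of the evaluation loop can be sketched as follows (our reading of the competition description, not the official evaluation code):

    def evaluation_finished(avg_delay_index, threshold=1.40):
        # The delay index is checked every 20 simulated seconds; evaluation
        # terminates once the average delay index reaches the threshold.
        return avg_delay_index >= threshold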

To show the effectiveness of the control scheme, we test several different triggering policies (TP) during the evaluation stage:

  (1) TP1: the same as Condition 1 described in subsection 3.3: the green time of the current signal phase exceeds 30 seconds.

  (2) TP2: includes Conditions 1 and 2 described in subsection 3.3. If either of these two conditions is met, the DQN agent is triggered to begin the phase selection process.

  (3) TP3: includes all conditions described in subsection 3.3. If any of these conditions is met, the DQN agent is triggered to begin the phase selection process.

Table 4. Evaluation results of number of vehicles served and delay index of different reward function-control scheme pairs on the default round3_flow0 traffic data.
Rewards TP1 TP2 TP3
DQ 46,619/1.400 48,201/1.409 48,201/1.403
Pressure 47,747/1.403 48,201/1.402 48,201/1.400
Twin-DQ 47,747/1.400 48,201/1.400 48,590/1.418

The evaluation results of the reward function-triggering policy pairs are presented in Table 4. Briefly speaking, the "Twin-DQ" reward together with TP3 achieves the best evaluation score. Note that TP3 corresponds to the entire set of triggering conditions used for training the DQNs.

Further, Table 5 illustrates our experiment results from the leader-board. Specifically, a single DQN model can serve about 313,035 vehicles. The rule-based agent described in subsection 3.5 also serves 313,035 vehicles with a delay of 1.402. A straightforward ensemble approach, which averages the predicted Q-values of multiple DQN models, can improve the number of vehicles served in the leader-board to 314,467 with an ensemble of 10 models. Our best score is based on the DQN with rule revising (adopting the decisions of the rule-based agent in a small set of cases), which can serve 317,954 cars with a delay of 1.401.
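The ensemble strategy simply averages the Q-values predicted by several independently trained DQN models and selects the arg-max phase, which can be sketched as follows (models are assumed to be instances of the two-head network sketched in Section 3.4; the averaging detail is illustrative):

    def ensemble_phase(models, state):
        q_values = [model(state)[0] for model in models]   # Q-value head of each model
        q_avg = sum(q_values) / len(models)
        return int(q_avg.argmax())                          # index of the selected phase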

Table 5. Experiment results in the leader-board. The “Twin-DQ” reward function is used to train the model.
Strategy Number of vehicles served Delay index
Rule-based 313,035 1.402
DQN-single 313,035 1.403
DQN-ensemble 314,467 1.404
DQN+Rule-based 317,954 1.401

5. Conclusion

In this paper, we present a DQN-based solution for the "City Brain Challenge" competition. We describe our overall analysis and the details of the DQN-based framework for real-time traffic signal control. Our main improvements on the leader-board come from two points: a newly-proposed reward function, namely, "Twin-DQ", and a suite of control schemes. Meanwhile, an ensemble of multiple DQN models can further improve the performance. In addition, a drawback of the DQN-based control framework in practice is that it may fail to make good decisions in certain cases due to the complexity of real-world road networks and traffic flow patterns. Therefore, applying heuristic rules to revise the DQN control actions in these cases can improve the performance of the DQN-based control framework. The code of our solution is available from https://github.com/oneday88/kddcup2021CBCBingo. Our work could serve, to some extent, as a baseline solution to traffic signal control of real-world road networks and inspire further attempts and research.

References

  • (1)
  • Abdoos et al. (2011) Monireh Abdoos, Nasser Mozayani, and Ana LC Bazzan. 2011. Traffic light control in non-stationary environments based on multi agent Q-learning. In 2011 14th International IEEE conference on intelligent transportation systems (ITSC). IEEE, 1580–1585.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. nature 518, 7540 (2015), 529–533.
  • Van Hasselt et al. (2016) Hado Van Hasselt, Arthur Guez, and David Silver. 2016. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI conference on artificial intelligence, Vol. 30.
  • Wei et al. (2019a) Hua Wei, Chacha Chen, Guanjie Zheng, Kan Wu, Vikash Gayah, Kai Xu, and Zhenhui Li. 2019a. Presslight: Learning max pressure control to coordinate traffic signals in arterial network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1290–1298.
  • Wei et al. (2019b) Hua Wei, Guanjie Zheng, Vikash Gayah, and Zhenhui Li. 2019b. A survey on traffic signal control methods. arXiv preprint arXiv:1904.08117 (2019).
  • Zheng et al. (2019) Guanjie Zheng, Yuanhao Xiong, Xinshi Zang, Jie Feng, Hua Wei, Huichu Zhang, Yong Li, Kai Xu, and Zhenhui Li. 2019. Learning phase competition for traffic signal control. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 1963–1972.