Jointly Learning Agent and Lane Information for Multimodal Trajectory Prediction

Jie Wang, Caili Guo*, Minan Guo and Jiujiu Chen. *Corresponding author: Caili Guo. J. Wang, C. Guo, M. Guo and J. Chen are with the Beijing Laboratory of Advanced Information Networks, Beijing University of Posts and Telecommunications, Beijing, 100876, China. {wj1e, guocaili, guominan, chenjiujiu}@bupt.edu.cn.

Abstract

Predicting the plausible future trajectories of nearby agents is a core challenge for the safety of autonomous vehicles, and it mainly depends on two external cues: the dynamic neighboring agents and the static scene context. Recent approaches have made great progress in characterizing the two cues separately. However, they ignore the correlation between the two cues, and most of them struggle to achieve map-adaptive prediction. In this paper, we use lanes as scene data and propose a staged network that Jointly learns Agent and Lane information for Multimodal Trajectory Prediction (JAL-MTP). JAL-MTP uses a Social to Lane (S2L) module to jointly represent the static lanes and the dynamic motion of the neighboring agents as instance-level lanes, a Recurrent Lane Attention (RLA) mechanism that uses the instance-level lanes to predict map-adaptive future trajectories, and two selectors to identify the typical and reasonable trajectories. Experiments on the public Argoverse dataset demonstrate that JAL-MTP significantly outperforms existing models both quantitatively and qualitatively.

Index Terms:

Autonomous driving, multimodal trajectory prediction, deep learning methods.

I INTRODUCTION

Self-Driving Vehicles (SDVs) are going to change the way we live by providing safe, reliable and effective transportation for everyone everywhere. Predicting the trajectories of nearby agents is crucial for SDVs to understand the surrounding environment and make high-level decisions, which ensures the safety and comfort of SDVs [1]. The biggest challenge in trajectory prediction is that the future is inherently ambiguous. To predict the target’s trajectory, we must fully consider the target agent’s intention, the current static surroundings and the interactions with nearby agents, and generate all plausible future trajectories for the current scenario. This task is known as multimodal trajectory prediction.

Although existing multimodal trajectory prediction methods have shown competitive performance on accuracy metrics, two major deficiencies remain: 1) They consider the dynamic social interaction and the static scene information separately. However, the distribution of surrounding dynamic agents also affects the drivable state of lanes, such as occupancy and congestion. Ignoring this can lead to unreasonable trajectories and even collisions. 2) The patterns among the generated trajectories are not explicitly decoupled from the map, which leads to mediocre map generalization. Typically, the regression-based [3, 15] and fixed anchor-based methods [5, 6] either concentrate on the main mode or are restricted to a certain scene. The existing lane-based methods [7, 8, 9, 10] either consider a single lane or fuse all lanes. They do not explicitly take lanes as goals and therefore struggle to generate map-adaptive trajectories.

In this paper, we propose a novel network for multimodal trajectory prediction, namely JAL-MTP. To address the two aforementioned issues, we use the proposed Social to Lane (S2L) module to jointly represent the dynamic agents and the static lane context as a dynamic map consisting of instance-level lanes. We take the instance-level lanes as goals, which ensures that the goals are map-adaptive. We then apply the proposed Recurrent Lane Attention (RLA) mechanism to learn the features of each instance-level lane and achieve lane-based prediction. Further, two selectors are used to score and identify the final trajectories. Finally, we evaluate the qualitative and quantitative performance of JAL-MTP on the public Argoverse dataset [11].

Our contributions are summarized as follows:

• We propose a staged multimodal trajectory prediction network that first jointly represents the context cues as instance-level lanes, and then models the evolution of the current motion on the instance-level lanes to generate map-adaptive trajectories.

• We propose a map representation method that fuses the information of dynamic nearby agents and static lanes into a joint representation, which reflects the map constraints well.

• Our approach achieves SOTA quantitative performance in some categories of the Argoverse benchmark and more reasonable multimodal qualitative results than existing models.

II Related works

In this section, we review works on modeling the two important cues for trajectory prediction, namely dynamic agent interaction and static scene context, and summarize existing methods for multimodal trajectory prediction.

II-A Interaction among Dynamic Agents

Agents that share the same scenario cooperate with each other to perform safe actions. Accounting for the social interaction among neighboring agents is critical for safe and reasonable trajectory prediction. Grid-based methods [12, 20] divide the space around agents into 2D grids and apply pooling to aggregate the interaction features. These methods struggle to represent irregular real-world interactions. Recently, graph-based methods [22] have generalized deep learning on regular grids to arbitrary graphs with irregular topologies. They apply graph convolution [13] or attention [9] to achieve message passing among vertices composed of agents. These methods have achieved competitive accuracy; however, they consider the social interactions independently, without utilizing the link between the agents and the scene.

II-B Static Scene Context Modeling

Modeling the static scene context is key to understanding the intention and executable actions of the target agent. Convolutional neural network (CNN) based works [5, 6] use a CNN to process the rasterized map image to obtain the scene context. These methods neglect road topology, such as neighboring but opposite lanes. Recent methods use connected lanes sampled from the map as scene data. Graph neural network (GNN) based methods [3, 15, 2] take the points sampled from the lane centerlines of the semantic map as vertices of a graph and obtain the scene context through a GNN. They achieve competitive distance-based metrics but poor multimodality. Lane-based methods [7, 10, 9] each propose a lane attention mechanism that uses lanes as scene data. MTPLA [7] utilizes the most correlated lane. LaPred [10] integrates all of the lane features with learnable weights. WIMP [9] focuses on the effective segments of the most correlated lane. We share a similar idea with lane-based methods. The difference is that we jointly represent each lane with its associated agents and treat each lane separately with the proposed RLA.

II-C Multimodal Trajectory Prediction

The future motion of agents is inherently multimodal, which requires generating multiple trajectories with likelihoods. Some works are built upon stochastic models such as generative adversarial networks (GANs) [16] or variational autoencoders (VAEs) [17]. Despite their competitive performance, the need to sample the latent variables multiple times during inference hinders their practical application. To avoid mode collapse [4], recent frameworks decompose the task into classification over anchors [14] or goals [18], followed by conditional regression. TNT [14] takes the lane centerline nodes as anchors, which are accurate but too numerous. GoalNet [18] proposes multimodal generators to select goals among lanes, but it mainly uses a rasterized map. Generally, lane information offers a strong prior on the semantic behavior of drivers. Driven by this, we take all possible lanes as goals and predict based on them.

III Approach

In this section, we first introduce our problem formulation and notation in III-A. Then we present the details of JAL-MTP in III-B, which consists of three main modules: S2L, the RLA encoder-decoder (Enc-Dec) and the selectors. Finally, we describe our multimodal training strategy in III-C.

III-A Problem Formulation

The purpose of trajectory prediction is to predict the future motion of all agents in a scene, given their past motion and the current scene data. Since we focus on the multimodal problem, we only consider one agent in a scene, which we call the target. Suppose that the target agent has $N$ possible goals $G=\{{g^{i}},i\in[1,N]\}$ in its situation. Considering that drivers in reality usually choose one potential goal $g^{i}$ to drive toward and that different goals can be treated independently, the task can be formulated as finding the posterior distribution of the future trajectory ${\rm{Y}}$ given the possible goals $G$:

p({\rm{Y}}|G)=\sum\limits_{i=1}^{N}{p({\rm{Y}}|{g^{i}})},

(1)

where the distribution of ${\rm{Y}}$ is approximated by $K$ representative future trajectories ${\rm{Y}}=\left\{{{Y^{k}},k\in[1,K]}\right\}$, in which ${Y^{k}}=\{y_{t}^{k},t\in[1,{T_{f}}]\}$ denotes the $k$-th future trajectory over the future time horizon ${T_{f}}$ and $y_{t}^{k}\in\mathbb{R}^{2}$ denotes the agent’s 2D coordinates in bird’s eye view (BEV) at time $t$.

The possible goals $G$ depend on the past motion state of the target agent, the distribution of nearby dynamic agents and the static scene, which essentially consists of drivable paths. The motion states of the target agent and of the $M$ surrounding agents, which we call the social agents, over the past ${T_{p}}$ time steps can be formulated as ${\rm{X}}=\left\{{{x_{t}},t\in[-{T_{p}}+1,0]}\right\}$ and ${\rm{S}}=\left\{{{S^{j}},j\in[1,M]}\right\}$ respectively, where ${S^{j}}=\left\{{s_{t}^{j},t\in[-{T_{p}}+1,0]}\right\}$ denotes the past trajectory of the $j$-th social agent and $x_{t},s_{t}^{j}\in\mathbb{R}^{2}$ denote the agents’ BEV coordinates at time $t$. Similar to [9, 10], we describe the drivable paths as the concatenated centerlines of lane segments, which we call lane proposals. They correspond to the $N$ goals and are denoted as ${\rm{L}}=\left\{{{L^{i}},i\in[1,N]}\right\}$, with ${L^{i}}=\left\{{l_{h}^{i},h\in[1,H]}\right\}$, where $l_{h}^{i}\in\mathbb{R}^{2}$ represents the $h$-th lane node of the $i$-th lane, since each path is composed of discrete samples, and $H$ is the number of nodes in a lane.
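The notation above maps naturally onto three arrays. The following is a minimal sketch of such a container, not the authors’ data pipeline: the class name `Scene`, its field names and the shape checks are illustrative assumptions, and the sizes in the usage note ($T_p=20$ from 2 s of history at 10 Hz as described in Section IV-A1, $H=50$ chosen arbitrarily) are only examples.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Scene:
    """Inputs for one target agent; class and field names are illustrative."""

    target_past: np.ndarray     # X, shape (T_p, 2): BEV positions of the target
    social_past: np.ndarray     # S, shape (M, T_p, 2): BEV positions of social agents
    lane_proposals: np.ndarray  # L, shape (N, H, 2): centerline nodes per proposal

    def __post_init__(self) -> None:
        t_p = self.target_past.shape[0]
        if self.target_past.shape != (t_p, 2):
            raise ValueError("target_past must have shape (T_p, 2)")
        if self.social_past.ndim != 3 or self.social_past.shape[1:] != (t_p, 2):
            raise ValueError("social_past must have shape (M, T_p, 2)")
        if self.lane_proposals.ndim != 3 or self.lane_proposals.shape[2] != 2:
            raise ValueError("lane_proposals must have shape (N, H, 2)")

    @property
    def num_goals(self) -> int:
        """N: one goal per lane proposal."""
        return self.lane_proposals.shape[0]

    @property
    def num_social(self) -> int:
        """M: number of social agents."""
        return self.social_past.shape[0]
```

For example, `Scene(np.zeros((20, 2)), np.zeros((3, 20, 2)), np.zeros((4, 50, 2)))` describes a target with three social agents and four lane proposals, so four goals.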

As mentioned above, $G$ boils down to the combination of ${\rm{X}}$, ${\rm{S}}$ and ${\rm{L}}$. In most cases, drivers tend to choose a lane proposal as their goal and drive along its centerline. In addition, ${\rm{S}}$ and ${\rm{L}}$ are interrelated and mutually influential: the distribution of the surrounding dynamic agents affects the drivability of the lane proposals. Therefore, we aggregate the social agents’ situation ${\rm{S}}$ into the lane proposals ${\rm{L}}$ to obtain instance-level lanes ${\rm{\tilde{L}}}=\left\{{{{\tilde{L}}^{i}},i\in[1,N]}\right\}$. We can then regard the $N$ goals as the results of evolving the current motion state ${\rm{X}}$ on the $N$ instance-level lanes, so (1) can be expressed as

p({\rm{Y}}|G)=\sum\limits_{i=1}^{N}{p({\rm{Y}}|{\rm{X}},{\rm{S}},{L^{i}})}=\sum\limits_{i=1}^{N}{p({\rm{Y}}|{\rm{X}},{{\tilde{L}}^{i}})},

(2)

where ${\tilde{L}^{i}}$ represents the $i$-th instance-level lane under ${g^{i}}$. We can adequately characterize the distribution of future trajectories by aggregating the trajectories predicted under the $N$ instance-level lanes. We build our model for multimodal trajectory prediction on Eq. (2).

III-B Model Structure

Figure 1: Overall architecture. The model consists of three major modules: the RLA encoder-decoder (Enc-Dec), S2L and two selectors (rendered in blue, green and yellow, respectively). The inputs to the network are the past trajectories of the target and social agents (marked as blue and green lines, respectively) and the sampled lane nodes (marked as yellow circles). For the sampled lane nodes, we employ a heuristic lane proposal algorithm based on the map preprocessing methods proposed in [11] to find the lane proposals.

The structure of the proposed network corresponds to the formulation in III-A, and Fig. 1 depicts it. The lane proposals obtained by preprocessing are fed into a multi-scale one-dimensional convolutional neural network (1D-CNN), and the social agents’ past trajectories are fed into a 1D-CNN followed by a Gated Recurrent Unit (GRU), to produce the lane proposal features ${\rm{L}}$ and the social motion features ${\rm{S}}$, respectively. To represent the road information as instance-level lanes ${\tilde{\rm{L}}}$, we apply spatial attention in S2L to fuse ${\rm{L}}$ and ${\rm{S}}$. The instance-level lanes serve as the extra input to the RLA modules. For the target agent, by applying a 1D-CNN followed by the RLA encoder, we encode the past motion feature ${\rm{X}}$ and the current lane condition ${\tilde{L}^{i}}$ into a hidden vector $h^{i}$. Feeding this vector to the lane selector yields the lane score, which is visualized as transparency. Finally, based on the hidden vector and the lane score, the RLA decoder and the trajectory selector produce and select $K$ typical trajectories with probabilities.

III-B1 S2L: Jointly representing the agent and lane information

S2L fuses ${\rm{S}}$ and ${\rm{L}}$ to extract the joint agent-lane representation ${\tilde{\rm{L}}}$ using the proposed spatial attention mechanism. The process is shown in Fig. 2. First, we apply the constant-acceleration kinematic equation to compute a rollout of each social agent’s future trajectory from its current motion state. Given the trajectory rollouts, we find the related lane nodes and apply the spatial attention:

\hat{l}_{h}^{i}=\varphi_{res}\left(l_{h}^{i}+\sum_{j=1}^{M}\varphi_{agg}\left(l_{h}^{i}\|\operatorname{dist}\|S^{j}\right)\right),

(3)

where $h\in[1,H]$, $l_{h}^{i}$ and $\hat{l}_{h}^{i}$ denote the $h$-th lane node of the $i$-th lane proposal and of the $i$-th instance-level lane, respectively, ${S^{j}}$ denotes the feature of a related social agent, i.e., one whose trajectory rollout lies within an ${l_{2}}$ distance threshold (e.g., 7.5 m) of the lane node, ${\varphi_{res}}(\cdot)$ and ${\varphi_{agg}}(\cdot)$ are multi-layer perceptrons (MLPs) whose weights are shared over all nodes, $\parallel$ denotes concatenation, and $\operatorname{dist}$ represents the spatial relationship between the lane node and the related social agent, for which we use $MLP({V_{l_{h}^{i}}}-{V_{{S^{j}}}})$ with $V$ representing the coordinates.

After aggregating the related social features into the lane nodes by Eq. (3), we pass the updated lane nodes into another 1D-CNN to perform message passing among the lane nodes, similar to a GNN, and obtain the final instance-level lane nodes $\tilde{l}_{h}^{i}$. These are then used by RLA to help the network understand the current scene context.
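The S2L steps described above (rollout, selection of related agents, aggregation by Eq. (3)) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors’ implementation: velocity and acceleration are estimated by finite differences over the last three past positions, the time step `dt=0.1` follows the 10 Hz sampling rate, the agent coordinates $V_{S^j}$ are taken to be the current position, and the three MLPs are passed in as plain callables (`phi_dist`, `phi_agg`, `phi_res`). All function and parameter names are hypothetical, and the subsequent 1D-CNN is omitted.

```python
import numpy as np


def rollout_constant_accel(past, horizon, dt=0.1):
    """Extrapolate future positions with x(t) = x0 + v*t + 0.5*a*t^2.

    past: (T_p, 2) positions with T_p >= 3; returns (horizon, 2).
    Velocity and acceleration come from finite differences (an assumption).
    """
    v = (past[-1] - past[-2]) / dt
    v_prev = (past[-2] - past[-3]) / dt
    a = (v - v_prev) / dt
    t = np.arange(1, horizon + 1)[:, None] * dt
    return past[-1] + v * t + 0.5 * a * t ** 2


def related_mask(lane_xy, rollouts, threshold=7.5):
    """Mark agent j as related to node h if its rollout passes within threshold.

    lane_xy: (H, 2); rollouts: (M, T, 2); returns a boolean array (H, M).
    """
    diff = lane_xy[:, None, None, :] - rollouts[None, :, :, :]  # (H, M, T, 2)
    dist = np.linalg.norm(diff, axis=-1)                         # (H, M, T)
    return dist.min(axis=-1) < threshold


def s2l_aggregate(lane_feat, lane_xy, agent_feat, agent_xy, mask,
                  phi_dist, phi_agg, phi_res):
    """Eq. (3): add messages from related agents to each lane node feature.

    lane_feat: (H, C); agent_feat: (M, C_s); agent_xy: (M, 2); mask: (H, M).
    phi_agg must return a vector of length C so the residual sum is defined.
    """
    lane_feat = np.asarray(lane_feat, dtype=float)
    out = []
    for h in range(mask.shape[0]):
        msg = np.zeros_like(lane_feat[h])
        for j in np.flatnonzero(mask[h]):
            dist = phi_dist(lane_xy[h] - agent_xy[j])
            msg = msg + phi_agg(np.concatenate([lane_feat[h], dist, agent_feat[j]]))
        out.append(phi_res(lane_feat[h] + msg))
    return np.stack(out)
```

Lane nodes with no related agent receive a zero message, so they pass through `phi_res` unchanged apart from that mapping.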

III-B2 RLA Enc-Dec: Learning from the instance-level lane representation

Taking the motion feature $\rm{X}$ from the 1D-CNN and the instance-level lane $\tilde{l}_{h}^{i}$ from S2L as input, the RLA module aims to generate future trajectories by taking the input lane as the goal. Intuitively, drivers usually pay more attention to the area they are currently close to and the area they are about to reach. As shown in Fig. 3, the waypoint GRU first takes the target agent’s motion history ${h_{t-1}}$ together with the current location ${a_{t-1}}$ (blue triangle) as input to update the hidden state ${h_{t}}$ and predict the waypoint ${b_{t}}$ (red triangle) according to

b_{t}=\varphi_{\rm{waypoint}}(h_{t}),\quad h_{t}=\mathrm{GRU}_{\rm{waypoint}}(a_{t-1},h_{t-1}),

(4)

where ${\varphi_{waypoint}}(\cdot)$ is an MLP. Based on ${a_{t-1}}$ and ${b_{t}}$, we find the nodes A (marked in blue) and B (marked in red) on the $i$-th lane that are nearest to them in Euclidean distance. We then perform goal-based prediction conditioned on the lane nodes of interest from A to B via lane attention:

\begin{array}[]{l}{Q^{i}}=h_{t-1}^{i}{W^{Q}},\\ K_{A:B}^{i}=E_{A:B}^{i}{W^{K}},\\ V_{A:B}^{i}=E_{A:B}^{i}{W^{V}},\end{array}

(5)

where ${W^{Q}}$, ${W^{K}}$ and ${W^{V}}$ are learned weights, ${Q^{i}}$ relates to the motion history, and $K_{A:B}^{i}$ and $V_{A:B}^{i}$ are calculated from the states of the lane nodes from A to B together with their spatial information relative to the agent’s current motion, as follows:

\begin{array}{l}R_{A:B}^{i}=MLP(concat(D_{A:B}^{i},\Delta\theta_{A:B}^{i})),\\ E_{A:B}^{i}=MLP(concat(\tilde{l}_{A:B}^{i},R_{A:B}^{i})),\end{array}

(6)

where $D_{A:B}^{i}$ is the Euclidean distance between ${a_{t-1}}$ and the related lane nodes $l_{A:B}^{i}$, and $\Delta\theta_{A:B}^{i}$ is the angle difference between the agent’s current orientation and the curvature of the lane segment from A to B. Thus $R_{A:B}^{i}$ represents how the relative spatial information changes along A to B, and $E_{A:B}^{i}$ covers the joint interaction features and the spatial variation from A to B on lane $i$, which corresponds to the goal-based intuition. The updated hidden state is then obtained by adding the attention result ${\alpha^{i}}$:

\begin{array}{l}\alpha^{i}=\sum\limits_{j=A}^{B}\operatorname{softmax}\left(\frac{Q^{i}{K_{j}^{i}}^{T}}{\sqrt{d_{k}}}\right)V_{j}^{i},\\ \tilde{h}_{t-1}^{i}=h_{t-1}^{i}+\alpha^{i},\end{array}

(7)
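Eqs. (5) and (7) amount to single-query scaled dot-product attention over the lane nodes from A to B, followed by a residual update of the hidden state. A minimal NumPy sketch is given below. The function name `lane_attention` and the shape conventions are assumptions made for illustration; in particular, $W^{V}$ is assumed to project to the hidden size so that $\alpha^{i}$ can be added to $h_{t-1}^{i}$.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def lane_attention(h_prev, e_seg, w_q, w_k, w_v):
    """Attend from the hidden state to the lane nodes A..B (Eqs. (5), (7)).

    h_prev: (d,) hidden state h_{t-1}
    e_seg:  (n, d_e) node embeddings E_{A:B}
    w_q: (d, d_k), w_k: (d_e, d_k), w_v: (d_e, d)
    Returns the updated hidden state (d,) and the attention weights (n,).
    """
    q = h_prev @ w_q                                  # (d_k,)
    k = e_seg @ w_k                                   # (n, d_k)
    v = e_seg @ w_v                                   # (n, d)
    weights = softmax(k @ q / np.sqrt(q.shape[0]))    # (n,)
    alpha = weights @ v                               # (d,)
    return h_prev + alpha, weights
```

When all nodes in the segment have identical embeddings, the weights are uniform and the update reduces to adding that single projected embedding.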

Finally, the updated hidden state is fed into the GRU followed by the predictor to predict the next trajectory point $\hat{a}_{t}^{i}$ (purple triangle):

\hat{a}_{t}^{i}=\varphi_{predictor}(h_{t}^{i}),\quad h_{t}^{i}=\mathrm{GRU}_{\rm{next}}(a_{t-1}^{i},\tilde{h}_{t-1}^{i}),

(8)

where ${\varphi_{predictor}}(\cdot)$ is an MLP. Note that the RLA encoder runs over $t\in[-{T_{p}}+1,0]$ and learns to encode the motion information and scene states over the past into the hidden vector, whereas the RLA decoder runs over $t\in[1,{T_{f}}]$ and learns to adjust to changes in the real-time scene to make multimodal and map-adaptive predictions.
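The geometric inputs to Eq. (6) can be computed directly from the lane nodes. The sketch below selects the segment from A to B by nearest-node search and derives $D_{A:B}$ and $\Delta\theta_{A:B}$. It rests on several assumptions that the text does not fix: the lane direction at each node is estimated by finite differences and used in place of the "curvature" term, angle differences are wrapped to $[-\pi,\pi)$, and the segment is taken in index order regardless of which of A or B comes first. The names `nearest_index` and `segment_features` are hypothetical.

```python
import numpy as np


def nearest_index(lane_xy, point):
    """Index of the lane node closest to point in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(lane_xy - point, axis=1)))


def segment_features(lane_xy, a_prev, b_next, heading):
    """Relative spatial inputs of Eq. (6) for the nodes between A and B.

    lane_xy: (H, 2) lane nodes with H >= 2
    a_prev:  (2,) current location a_{t-1}
    b_next:  (2,) predicted waypoint b_t
    heading: agent orientation in radians
    Returns (lo, hi, D, dtheta) with D and dtheta of length hi - lo + 1.
    """
    idx_a = nearest_index(lane_xy, a_prev)
    idx_b = nearest_index(lane_xy, b_next)
    lo, hi = min(idx_a, idx_b), max(idx_a, idx_b)
    seg = lane_xy[lo:hi + 1]
    dist = np.linalg.norm(seg - a_prev, axis=1)
    # Lane direction per node via finite differences (an assumption).
    tangents = np.gradient(lane_xy, axis=0)[lo:hi + 1]
    lane_angle = np.arctan2(tangents[:, 1], tangents[:, 0])
    # Wrap the angle difference to [-pi, pi).
    dtheta = (heading - lane_angle + np.pi) % (2.0 * np.pi) - np.pi
    return lo, hi, dist, dtheta
```

In a full model, `dist` and `dtheta` would be concatenated and passed through the MLP that produces $R_{A:B}^{i}$.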

III-B3 Selectors: Identifying typical trajectories

As shown in Fig. 4, the lane selector takes the $N$ hidden vectors from the RLA encoder as input and outputs a score for each of the $N$ lanes. The trajectory selector scores the $N\times K$ trajectories from the RLA decoder and multiplies each trajectory score by the corresponding lane score to obtain the final $N\times K$ trajectory scores. Finally, we keep the $K$ highest-scoring trajectories as the final trajectories used to calculate the evaluation metrics, and normalize their $K$ scores to obtain the corresponding probabilities.

Note that both the lane selector and the trajectory selector are two-layer MLPs.
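The score combination and top-$K$ selection can be sketched as follows. This is an illustration only: the function name `select_top_k` is hypothetical, the scores are assumed to be non-negative (for example softmax outputs), and normalization is assumed to mean dividing the kept scores by their sum, which the text does not specify.

```python
import numpy as np


def select_top_k(lane_scores, traj_scores, trajs, k):
    """Combine lane and trajectory scores and keep the k best trajectories.

    lane_scores: (N,) non-negative lane scores
    traj_scores: (N, K) non-negative trajectory scores per lane
    trajs:       (N, K, T_f, 2) candidate trajectories
    Returns (k, T_f, 2) trajectories and (k,) probabilities summing to 1.
    """
    joint = (lane_scores[:, None] * traj_scores).reshape(-1)  # (N*K,)
    order = np.argsort(-joint)[:k]                            # best first
    probs = joint[order] / joint[order].sum()
    flat = trajs.reshape(-1, *trajs.shape[2:])                # (N*K, T_f, 2)
    return flat[order], probs
```

Because the trajectory score is multiplied by its lane score, a well-scored trajectory on an unlikely lane can still be dropped in favor of a moderately scored one on a likely lane.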

III-C Training

The regression and two classification tasks are independent and the whole model can be trained end-to-end with a total loss function:

{L_{total}}={L_{reg}}+{\lambda_{1}}{L_{lanecls}}+{\lambda_{2}}{L_{trajcls}},

(9)

where ${\lambda_{1}}$ and ${\lambda_{2}}$ are hyperparameters to balance the tasks.

For regression, we apply the ${l_{1}}$ loss on all predicted time steps, which proved effective in previous work [9]:

{L_{reg}}=\frac{1}{{{T_{f}}}}\mathop{\min}\limits_{k\in\{1...K\}}\sum\limits_{t=1}^{{T_{f}}}{\left\|{{{\rm{y}}_{t}}-{\rm{\hat{y}}}_{t}^{k}}\right\|},

(10)

where $\hat{y}_{t}^{k}$ denotes the future position of the k-th trajectory at time t and ${y_{t}}$ is the corresponding ground truth.
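Eq. (10) only penalizes the best of the $K$ predictions. A minimal NumPy sketch of this min-over-$K$ regression loss follows. The function name is hypothetical, and the per-step norm is taken to be the $l_{1}$ norm as stated in the text; a training implementation would use a differentiable framework instead of NumPy.

```python
import numpy as np


def min_over_k_l1_loss(pred, gt):
    """Eq. (10): time-averaged l1 error of the best of K trajectories.

    pred: (K, T_f, 2) predicted trajectories
    gt:   (T_f, 2) ground-truth trajectory
    Returns the loss value and the index of the best trajectory.
    """
    per_step = np.abs(pred - gt[None]).sum(axis=-1)   # (K, T_f) l1 norm per step
    per_traj = per_step.sum(axis=1) / gt.shape[0]     # (K,) average over time
    best = int(per_traj.argmin())
    return float(per_traj[best]), best
```

Only the trajectory returned as `best` would receive a regression gradient, which leaves the other modes free to cover different outcomes.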

For classification, we use the cross-entropy loss between the output scores and the labels:

\begin{array}{l}L_{lanecls}=L_{CE}(\mathrm{score}_{lane},\mathrm{label}_{lane}),\\ L_{trajcls}=L_{CE}(\mathrm{score}_{traj},\mathrm{label}_{traj}),\end{array}

(11)

where the labels are generated during training in a self-supervised manner by

\begin{array}{l}\mathrm{label}_{lane}^{i}=\frac{\exp(-D_{1}(L^{i},{\rm{Y}}_{GT}))}{\sum\limits_{n=1}^{N}\exp(-D_{1}(L^{n},{\rm{Y}}_{GT}))},\\ \mathrm{label}_{traj}^{k}=\frac{\exp(-D_{2}(Y^{k},{\rm{Y}}_{GT}))}{\sum\limits_{m=1}^{K}\exp(-D_{2}(Y^{m},{\rm{Y}}_{GT}))},\end{array}

(12)

with $D_{1}\left(L,Y_{GT}\right)=\sum_{t=1}^{T_{f}}\beta(t)\min_{h\in\{1,\ldots,H\}}\left\|y_{t}-l_{h}\right\|$ and $D_{2}\left(\mathrm{Y},\mathrm{Y}_{GT}\right)=\sum_{t=1}^{T_{f}}\left\|y_{t}-\hat{y}_{t}\right\|$ denoting the distance from the ground-truth trajectory to the lane and to the prediction, respectively, where $\beta(t)=t$ is a scaling weight that indicates the importance of different time steps. Note that there is only one ground-truth trajectory. To help the model learn to follow a given lane, we only predict the trajectory under the most likely lane during training; at inference, we input all of the lane proposals and apply the selectors to obtain the final trajectories.
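The soft labels of Eq. (12) are a softmax over negative distances. The sketch below computes $D_{1}$, $D_{2}$ and the resulting labels in NumPy. The function names are hypothetical, and the norms in $D_{1}$ and $D_{2}$ are assumed to be Euclidean since the text does not state which norm is used.

```python
import numpy as np


def lane_distance(lane_xy, gt):
    """D_1: time-weighted distance from each ground-truth point to the lane.

    lane_xy: (H, 2) lane nodes; gt: (T_f, 2) ground-truth trajectory.
    Uses beta(t) = t with t = 1..T_f.
    """
    d = np.linalg.norm(gt[:, None, :] - lane_xy[None, :, :], axis=-1)  # (T_f, H)
    t = np.arange(1, gt.shape[0] + 1)
    return float((t * d.min(axis=1)).sum())


def traj_distance(pred, gt):
    """D_2: summed point-wise distance between a prediction and the ground truth."""
    return float(np.linalg.norm(pred - gt, axis=-1).sum())


def soft_labels(distances):
    """Eq. (12): softmax over negative distances, so closer means higher label."""
    z = -np.asarray(distances, dtype=float)
    z -= z.max()  # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()
```

With `beta(t) = t`, a lane that diverges from the ground truth late in the horizon is penalized more than one that diverges early by the same amount, so the label favors lanes that match the endpoint.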

IV Experiments

In this section, we evaluate the quantitative and qualitative performance of the proposed model on the Argoverse motion forecasting benchmark.

IV-A Experimental Setup

IV-A1 Dataset

Argoverse is a motion forecasting dataset with 325K 5-second trajectory sequences collected in Pittsburgh and Miami. The sequences are split into training, validation and test sets, which have 206K, 39K and 78K sequences, respectively. The trajectories consist of 2D coordinate sequences sampled at 10 Hz. Each sequence has one object of interest called the “agent”, and the task is to use the agent’s past 2 seconds of trajectory data to predict its locations over the next 3 seconds. In addition to trajectories, each sequence is associated with a high-definition map composed of lane centerlines and their connectivity. Importantly, the Argoverse dataset was collected to contain interesting and diverse behaviors, which makes it suitable for the study of multimodal trajectory prediction. Because the test set has no labels, we conduct our experiments mainly on the validation set.

IV-A2 Metrics

To evaluate a multimodal set of predicted trajectories, we take the best trajectory among the $K$ generated trajectories to calculate the minimum average displacement error ( $\operatorname{minADE}_{K}$ ) and final displacement error ( $\operatorname{minFDE}_{K}$ ) :

\begin{array}{l}\operatorname{minADE}_{K}=\frac{1}{T_{f}}\min\limits_{k\in\{1,\ldots,K\}}\sum\limits_{t=1}^{T_{f}}\left\|y_{t}-\hat{y}_{t}^{k}\right\|_{2},\\ \operatorname{minFDE}_{K}=\min\limits_{k\in\{1,\ldots,K\}}\left\|y_{T_{f}}-\hat{y}_{T_{f}}^{k}\right\|_{2},\end{array}

(13)

where $\hat{y}_{t}^{k}$ denotes the future position of the $k$-th trajectory at time $t$ and ${y_{t}}$ is the corresponding ground truth. We also report the miss rate (MR), the ratio of scenarios in which none of the predictions ends within 2 meters of the ground truth according to $\operatorname{minFDE}_{K}$. These metrics indicate the accuracy of the predicted trajectories. To evaluate the performance of our selectors, we also calculate probability-based metrics (p-FDE, p-ADE, p-MR, brier-FDE, brier-ADE) [10], which measure how reasonable the predicted probabilities are; we do not elaborate on them because they are only used in the ablation study. For all metrics, smaller is better.
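The three accuracy metrics can be sketched in a few lines of NumPy. This is an illustrative sketch and not the official Argoverse evaluation code: the function names are hypothetical, and a scenario is counted as a miss when its $\operatorname{minFDE}_{K}$ exceeds the 2 m threshold.

```python
import numpy as np


def min_ade_fde(pred, gt):
    """Eq. (13) for one scenario.

    pred: (K, T_f, 2) predicted trajectories; gt: (T_f, 2) ground truth.
    Returns (minADE_K, minFDE_K).
    """
    dist = np.linalg.norm(pred - gt[None], axis=-1)   # (K, T_f) l2 error per step
    min_ade = float(dist.mean(axis=1).min())
    min_fde = float(dist[:, -1].min())
    return min_ade, min_fde


def miss_rate(min_fdes, threshold=2.0):
    """Fraction of scenarios whose minFDE_K exceeds the threshold in meters."""
    return float(np.mean(np.asarray(min_fdes, dtype=float) > threshold))
```

Note that the trajectory achieving the minimum ADE need not be the one achieving the minimum FDE, since the two minima are taken independently.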

IV-B Quantitative Results

We compare JAL-MTP with existing methods that have a similar configuration and SOTA performance. Table I presents the results on the validation set with $K=1,6$. GRU ED is the baseline, a plain GRU encoder-decoder. MTPLA [7], WIMP [9] and LaPred [10] are lane-based methods that also use instance-level lanes similar to ours; the difference is that MTPLA and LaPred map the entire lane to a vector, and WIMP does not explicitly consider the changes in position and curvature along the lane segment. LaneGCN [3] and TPCN [15] are GNN-based models that achieve better accuracy metrics through the GNN but perform poorly in terms of multimodality. The quantitative results show that our method outperforms all related lane-based works on almost all metrics and achieves performance similar to the GNN-based models. Notably, our ${\rm{minFD}}{{\rm{E}}_{6}}$ and MR are the best among all methods, which indicates that our method has a competitive ability to capture the main mode while accounting for the multimodality of the predicted trajectories.

TABLE I: Results on Argoverse motion forecasting validation set.

Model	${\rm{minFD}}{{\rm{E}}_{1}}$	${\rm{minAD}}{{\rm{E}}_{1}}$	${\rm{minFD}}{{\rm{E}}_{6}}$	${\rm{minAD}}{{\rm{E}}_{6}}$	$\operatorname{MR}$
GRU ED	3.75	1.64	1.71	0.97	0.23
MTPLA	3.27	1.46	2.06	1.05	-
LaPred	3.29	1.48	1.44	0.71	-
WIMP	3.19	1.45	1.14	0.75	0.12
TPCN	2.95	1.34	1.15	0.73*	0.11*
LaneGCN	2.97*	1.35*	1.08*	0.71	0.11*
Ours	3.00	1.39	1.07	0.73*	0.10

TABLE II: Ablation study results of modules

Modules				K=6
LA	Position	S2L	Selectors	$\operatorname{minFDE}_{K}$	$\operatorname{minADE}_{K}$	MR	p-FDE	p-ADE	p-MR	brier-FDE	brier-ADE
				1.71	0.97	0.233	3.50	2.76	0.872	2.40	1.66
$\surd$				1.27_↓0.44	0.80_↓0.17	0.138_↓0.095	3.06_↓0.44	2.59_↓0.17	0.856_↓0.016	1.96_↓0.44	1.49_↓0.17
$\surd$	$\surd$			1.20_↓0.07	0.77_↓0.03	0.122_↓0.016	2.99_↓0.07	2.56_↓0.03	0.853_↓0.003	1.89_↓0.07	1.47_↓0.02
$\surd$	$\surd$	$\surd$		1.07_↓0.13	0.73_↓0.04	0.098_↓0.024	2.87_↓0.12	2.53_↓0.03	0.849_↓0.004	1.77_↓0.12	1.43_↓0.04
$\surd$	$\surd$	$\surd$	$\surd$	1.07_↓0.00	0.73_↓0.00	0.098_↓0.000	2.73_↓0.14	2.38_↓0.15	0.817_↓0.032	1.71_↓0.06	1.37_↓0.06

IV-C Ablation study

IV-C1 Importance of each module

In Table II, we show the results of using the GRU encoder-decoder as the baseline (first row) and progressively adding the remaining modules to the network.

The RLA module is divided into LA and Position. The Selectors column denotes using the two selectors to score and identify typical trajectories rather than treating them equally. The results show that all modules contribute to the performance improvement, which illustrates the effectiveness of our modules for predicting accurate trajectories. Moreover, the inclusion of LA gives the largest boost, which suggests that lane information is the most critical for prediction. Notably, the improvement in the probability-based metrics from the Selectors demonstrates their ability to select reasonable trajectories.

IV-C2 Superiority of our methods

For a fair comparison, we only change the lane attention or social interaction extraction method within our full model in the Lane and Social sections of Table III. In the Lane section, LA-MTPLA and LA-LaPred map the whole lane node sequence into a hidden vector by pooling and by LSTM, respectively. LA-overall applies attention to all lane nodes. LA-WIMP executes attention on the local lane nodes similar to ours, but without the Position information.

TABLE III: Comparative study on lane and social modeling methods

	Methods	${\rm{minFD}}{{\rm{E}}_{6}}$	${\rm{minAD}}{{\rm{E}}_{6}}$	MR
Lane	LA-MTPLA	1.24	0.81	0.121
	LA-LaPred	1.21	0.78	0.122
	LA-overall	1.15	0.75	0.112
	LA-WIMP	1.12	0.75	0.104
	RLA	1.07	0.73	0.098
Social	Separately by GAT	1.20	0.78	0.123
Social	Jointly by S2L	1.07	0.73	0.098

Comparing LA-overall, LA-WIMP and RLA with LA-MTPLA and LA-LaPred shows that treating the lane as a set of lane nodes is more effective than simply mapping it to a vector. The best performance, achieved by RLA, reveals the superiority of our lane attention method. In the Social section, Separately by GAT means treating the agents as vertices of a fully-connected graph and performing attention to model the interaction independently. The significant improvement on all metrics by S2L demonstrates its effectiveness in jointly representing the social and lane information.

IV-C3 Impact of forecast horizon

In Fig. 5, our model outperforms all baselines across all horizons. Furthermore, the gap between our model and the baselines gradually widens as the horizon increases, which underscores our model’s ability to make longer-term predictions. The GRU baseline achieves a very low minADE at time step 5, but its error quickly grows beyond that of the lane-based methods, including ours, at longer horizons. This suggests that the short-term behavior of an agent is easier to model with fewer constraints and simpler models, whereas modeling the long-term behavior requires deeper attributes such as the goal the agent aims to achieve.

IV-D Qualitative analysis

Fig. 6 compares the qualitative results of our model with those of other competitive methods on particularly hard scenarios selected from the Argoverse dataset. ${\rm{Ours\_all}}$ visualizes all trajectories generated by our model for all reachable lanes, and ${\rm{Ours\_6}}$ shows the trajectories that remain after filtering by the two selectors. Our model can generate trajectories covering all plausible modes in each challenging scenario. The trajectories with reasonable probabilities produced by ${\rm{Ours\_6}}$ show the importance of the two selectors in identifying typical trajectories.

Specifically, in the turning and forking cases in Fig. 6, it is hard to tell whether the agent will go straight or turn onto another path. The baselines only maintain the current motion, but our model suggests both paths even when one has only a small probability. In the intersection scenario, although there are more possible paths the agent can choose, our model still explicitly predicts map-adaptive trajectories, unlike the others. Moreover, when there is congestion ahead, only our model correctly predicts the deceleration and the tendency to bypass. This reveals the effectiveness of jointly learning the social agent and lane information in making predictions more reasonable and safe.

V Conclusion

In this paper, we introduced JAL-MTP, a novel method for multimodal trajectory prediction in complex driving scenarios. JAL-MTP captures the relation between the lanes and the motion of agents through the joint representation obtained for each instance-level lane. JAL-MTP then models the evolution of the current motion on these instance-level lanes to perform lane-based prediction, which helps the trajectories to be map-adaptive. In addition, we propose two self-supervised classification subtasks to guide our model in selecting reasonable trajectories. Experiments on the Argoverse dataset demonstrate that JAL-MTP produces reasonable and accurate predictions in challenging scenarios, which is crucial for the safety and comfort of SDVs. In future work, the heuristic process for finding reachable lanes should be integrated into the neural network to achieve an end-to-end solution without preprocessing.

References

[1] B. Paden, M. Čáp, S. Z. Yong, et al., “A survey of motion planning and control techniques for self-driving urban vehicles,” IEEE Transactions on intelligent vehicles, vol. 1, no. 1, pp. 33–55, 2016.
[2] J. Gao, C. Sun, H. Zhao, et al., “Vectornet: Encoding hd maps and agent dynamics from vectorized representation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11525–11533.
[3] M. Liang, B. Yang, R. Hu, et al., “Learning lane graph representations for motion forecasting,” in European Conference on Computer Vision. Springer, 2020, pp. 541–556.
[4] O. Makansi, E. Ilg, O. Cicek, et al., “Overcoming limitations of mixture density networks: A sampling and fitting framework for multimodal future prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7144–7153.
[5] T. Phan-Minh, E. C. Grigore, F. A. Boulton, et al., “Covernet: Multimodal behavior prediction using trajectory sets,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 14074–14083.
[6] Y. Chai, B. Sapp, M. Bansal, et al., “Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction,” arXiv preprint arXiv:1910.05449, 2019.
[7] C. Luo, et al., “Probabilistic multi-modal trajectory prediction with lane attention for autonomous vehicles,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 2370–2376.
[8] J. Pan, et al., “Lane-attention: Predicting vehicles’ moving trajectories by learning their attention over lanes,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 7949–7956.
[9] S. Khandelwal, W. Qi, J. Singh, et al., “What-if motion prediction for autonomous driving,” arXiv preprint arXiv:2008.10587, 2020.
[10] B. Kim, et al., “Lapred: Lane-aware prediction of multi-modal future trajectories of dynamic agents,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14636–14645.
[11] M.-F. Chang, J. Lambert, P. Sangkloy, et al., “Argoverse: 3d tracking and forecasting with rich maps,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8748–8757.
[12] X. Li, X. Ying, and M. C. Chuah, “Grip: Graph-based interaction-aware trajectory prediction,” in 2019 IEEE Intelligent Transportation Systems Conference (ITSC). IEEE, 2019, pp. 3960–3966.
[13] H. Jeon, J. Choi, and D. Kum, “Scale-net: Scalable vehicle trajectory prediction network under random number of interacting vehicles via edge-enhanced graph convolutional neural network,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 2095–2102.
[14] H. Zhao, J. Gao, T. Lan, et al., “Tnt: Target-driven trajectory prediction,” arXiv preprint arXiv:2008.08294, 2020.
[15] M. Ye, T. Cao, and Q. Chen, “Tpcn: Temporal point cloud networks for motion forecasting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11318–11327.
[16] A. Sadeghian, V. Kosaraju, A. Sadeghian, et al., “Sophie: An attentive gan for predicting paths compliant to social and physical constraints,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1349–1358.
[17] S. Casas, C. Gulino, S. Suo, et al., “Implicit latent variable model for scene-consistent motion forecasting,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII 16. Springer, 2020, pp. 624–641.
[18] L. Zhang, P.-H. Su, J. Hoang, et al., “Map-adaptive goal-based trajectory prediction,” arXiv preprint arXiv:2009.04450, 2020.
[19] A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
[20] N. Deo and M. M. Trivedi, “Trajectory forecasts in unknown environments conditioned on grid-based plans,” arXiv preprint arXiv:2001.00735, 2020.
[21] S. Carrasco, D. F. Llorca, and M. Á. Sotelo, “Scout: Socially-consistent and understandable graph attention network for trajectory prediction of vehicles and vrus,” arXiv preprint arXiv:2102.06361, 2021.
[22] X. Mo, Y. Xing, and C. Lv, “Recog: A deep learning framework with heterogeneous graph for interaction-aware trajectory prediction,” arXiv preprint arXiv:2012.05032, 2020.