The Importance of Prior Knowledge in Precise Multimodal Prediction

Sergio Casas^∗1,2, Cole Gulino^∗1, Simon Suo^∗1,2 and Raquel Urtasun^1,2
Uber Advanced Technologies Group¹, University of Toronto²
{sergio.casas, cgulino, suo, urtasun}@uber.com *Denotes equal contribution.

Abstract

Roads have well defined geometries, topologies, and traffic rules. While this has been widely exploited in motion planning methods to produce maneuvers that obey the law, little work has been devoted to utilize these priors in perception and motion forecasting methods. In this paper we propose to incorporate these structured priors as a loss function. In contrast to imposing hard constraints, this approach allows the model to handle non-compliant maneuvers when those happen in the real world. Safe motion planning is the end goal, and thus a probabilistic characterization of the possible future developments of the scene is key to choose the plan with the lowest expected cost. Towards this goal, we design a framework that leverages REINFORCE to incorporate non-differentiable priors over sample trajectories from a probabilistic model, thus optimizing the whole distribution. We demonstrate the effectiveness of our approach on real-world self-driving datasets containing complex road topologies and multi-agent interactions. Our motion forecasts not only exhibit better precision and map understanding, but most importantly result in safer motion plans taken by our self-driving vehicle. We emphasize that despite the importance of this evaluation, it has been often overlooked by previous perception and motion forecasting works.

I Introduction

Self-driving vehicles (SDV) have the potential to make a large impact on our society, making transportation safer, cheaper and more efficient. A core component of every self-driving vehicle is its ability to perceive the world (including dynamic objects), and forecast how the future might unroll. In recent years there has been incredible progress in perception systems [1, 2, 3]. Many challenges still remain, however, in providing motion forecasts that are simultaneously diverse and precise [4]. That is, having the ability to cover all the modes of the data distribution while generating few highly unrealistic trajectories.

Roads in modern cities have well defined geometries, topologies, as well as traffic rules. The vast majority of actors in the scene will adhere to this structure, for example driving close to the middle of their lane, respecting stop signs, or obeying yielding laws. These agents will also most likely act in a socially acceptable manner, avoiding collisions with other traffic participants. Despite this fact, most perception and motion forecasting systems are trained to be as close as possible to the ground truth employing symmetric loss functions that do not take this structure into account. For example, euclidean distance at the waypoint level between the predicted and ground-truth future trajectories is a common choice for motion forecasting. This can cause uncomfortable rides for the self-driving vehicle with plenty of sudden brakes and steering changes due to false positive motion forecasts intruding into the ego-car’s lane, such as the red trajectory illustrated in Fig. 1-Left. Even worse, these sudden reactions to avoid an imminent collision with a motion forecast can cause an actual collision (with the ground-truth) as a by-product. It is also critical to recall all actors (and their future motion) in the SDV lane, since otherwise it would look like free space ahead and the SDV could dangerously accelerate, as depicted by the red motion forecast in Fig. 1-Right.

One possible solution is to add the aforementioned intuitions into the motion forecasting model as hard constraints. Unfortunately, this may not be resilient to non-compliant behavior from other actors as well as map failures, predicting possibly unrealistic and dangerous situations. In this paper we take an alternative approach and design loss functions that encourage our perception and prediction system to only violate these constraints when they happen in reality.

Refer to caption — Figure 1: Importance of incorporating prior knowledge depicted in two scenarios. We show the ground-truth trajectory, and two predictions with the same L2 error, one that is dangerous from the SDV perspective, and another one that is harmless.

Incorporating prior knowledge via loss functions is easy when the perception and prediction modules are deterministic. However, deterministic systems fail to capture the inherent uncertainty of the future. This can be catastrophic when it fails to capture the true actor intention (e.g., crossing the street vs waiting, yielding vs not). In order to plan a safe maneuver, coverage of the possible future scenarios is required, along with information about the likelihood of each possible future such that the motion planner can choose the plan with the lowest expected cost. The Gaussian distribution and mixtures thereof have been widely used to represent uncertainty over spatial locations [5, 6, 7]. However, as shown in [4], maximizing the log likelihood of data encourages the model to produce distributions with high recall in order to avoid the big penalty associated with low-density areas. As a consequence many unrealistic samples are generated, sacrificing the precision of the model.

In this paper, we show that making explicit use of our prior knowledge about the geometry and topology of the roads as well as the traffic rules can provide more precise distributions over future outcomes while preserving their recall. However, this is challenging as these priors are typically non-differentiable and thus not directly amenable to gradient based optimization. For instance, the fact that humans tend to follow traffic rules can be better described as a discrete (follow / not follow) action. To this end, we propose a flexible framework to incorporate non-differentiable prior knowledge as a loss and exploit the popular REINFORCE [8] gradient estimator. Our formulation allows us to optimize for any prior knowledge on future trajectories, as long as drawing samples from the perception and prediction model, and obtaining their likelihood can be done efficiently. In particular, we apply our formulation to model how the vehicles interact with the map, encouraging the predictions to respect lane dividers and traffic lights. We also exploit our framework to make the motion forecasting module more planning aware by emphasizing the importance of high recall and high precision near the SDV route.

Our experiments show that our proposed framework can improve the map understanding of state-of-the-art motion forecasting methods in very complex, partially observable urban environments in two challenging real-world datasets: ATG4D [2] and nuScenes [9]. Importantly, our approach achieves significant improvements in the precision of the trajectory distribution, while maintaining the recall. Unlike previous works, in this paper we advocate for measuring the system-level impact of the motion forecasts via motion planning metrics. We demonstrate that including prior knowledge not only results in more comfortable rides, but also in major safety improvements over the decisions taken by a state-of-the-art motion planner [10]. Moreover, we show that achieving a lower minADE alone, the most frequently used metric to benchmark multi-modal motion forecasts, may not translate into safer motion plans.

II Related Work

In order to plan a safe maneuver, we must effectively deal with noisy and partial observations from sensor data, and provide an accurate characterization of the distribution over future trajectories of all actors in the scene. We thus want to model $p(Y|X)$ , with $X=\begin{Bmatrix}x_{1},x_{2},\cdots,x_{N}\end{Bmatrix}$ the observations (i.e. a local context for each actor) and $Y=\begin{Bmatrix}y_{1},y_{2},\cdots,y_{N}\end{Bmatrix}$ the future trajectories of all $N$ actors. This is particularly challenging since the future is inherently uncertain, and actors’ discrete decisions induce highly multimodal distributions (e.g., turning right vs. going straight at an intersection, going vs. stopping at a yield sign).

In traditional self-driving stacks, there is an object detection module responsible for recognizing traffic participants in the scene, followed by a motion forecasting module that predicts how the scene might unroll, given the current state of each actor. However, the actor state is typically a very compact representation including pose, velocity, and acceleration. As a consequence, it is hard to incorporate uncertainty coming from sources such as sensor noise and occlusion.

FAF [11] unified these two tasks by having a single, fully convolutional backbone network predict both the current and future states for each pixel in a bird’s eye view grid, directly from a voxelized LiDAR pointcloud. This naturally propagates uncertainty between the two tasks in the feature space, without any need for handcrafted, intermediate representations. [12] extended this framework to include the map as an input by adding a parallel fully convolutional backbone network to process a semantic raster map of the scene, thus making fusion trivial by concatenation. Recently, this framework was further extended to model agent-agent interactions via graph neural networks [5], learn a cost map for motion planning [13], and add differentiable tracking in-the-loop [14]. While these works are great at dealing with uncertainties at the sensor level and mitigating failures downstream from object detection uncertainty by learning joint features, they do not focus on the output parameterization, and all of them produce uni-modal predictions. Uni-modal predictions can result in unsafe behaviors if, for example, the predicted intention is not accurate (e.g., a pedestrian crossing vs waiting) or the predictions lie inbetween two modes (e.g. at branching roads). In this work, we extend this framework to predict multi-modal behaviors that are aligned with human prior knowledge.

The motion planning algorithms in an SDV need to take into account all possibilities to make safe decisions. Thus, motion forecasting models that can characterize complex multi-modal distributions are a must, and efficient sampling is desired such that motion planning can timely find the plan with the lowest expected cost. Approaches that directly output the parameters of the marginal distribution at each timestep with a closed-form likelihood such as a sequence of Gaussians over time [15, 5] meet the efficient sampling requirement, but can suffer from low expressivity. [6] proposed a more stable way to train mixtures of future trajectories than directly optimizing the likelihood, by only training the Gaussian likelihood of the closest mode to the ground-truth trajectory and applying cross-entropy to the mode probabilities. [7] followed up on this idea by anchoring the predictions to clusters produced offline by running k-means on the training set. These two models have a good tradeoff between sampling efficiency and expressivity. Predicting discrete occupancy maps into the future [16] has very good expressivity and naturally captures multi-modality, but is very memory intensive and requires adaptations in order to use traditional motion planners designed for trajectory representations.

Autoregressive models [4, 17, 18] require sequential sampling and thus are not amenable to real-time inference, which is necessary in safety critical applications such as self-driving. Furthermore, there is a mismatch between training and inference: the distributions at training are conditioned on ground truth, but during inference they are conditioned on model samples. This makes the model less robust to noise, particularly in the joint detection and prediction setting where perfect detection is not possible. Finally, modeling interactions between the traffic participants [15, 19, 5, 17, 18] has been shown to help reduce the prediction uncertainty, and generate more socially consistent predictions.

There has been incredible progress in learning multimodal futures with a likelihood objective without leveraging prior knowledge or known structure, thanks to the availability of large and diverse driving datasets. However, datasets collected from real-world driving can only cover one possible mode out of many possibilities of how the future might have unrolled. As shown in [4], this partial description of the underlying distribution along with a Maximum Likelihood Estimation (MLE) objective results in high entropy distributions that favors mode-covering over mode-seeking behaviors [20]. Empirically, although the learned distribution recovers the ground truth future, it also generates highly implausible samples (e.g., out-of-map predictions). In this paper we show that coverage at the expense of precision severely harms motion planning.

A solution to avoid mode-covering predictions is to directly leverage prior knowledge to characterize the true distribution. In particular, the actor’s motion history, road geometry, traffic rules, and nearby traffic participants all constrain the space of plausible futures. Past approaches have (i) devised neural networks with inductive priors at the model level [21], and (ii) use a map-based reconstruction loss as an auxiliary task (e.g. recovering the road mask shape [22] to encourage prediction to fall on drivable areas, which they show helps with generalization to novel driving scenarios). However, their approach is limited since the policy is non probabilistic, and the loss is applied to only the maximum a posteriori (MAP) estimate. In this paper, we propose a more general and powerful approach by leveraging the REINFORCE gradient estimator to incorporate any prior knowledge (including non-differentiable functions) over probabilistic motion forecasts, still allowing the model to recover non-compliant behavior.

III Exploiting Prior Knowledge in Driving

In this section, we describe a novel framework to incorporate prior knowledge explicitly into probabilistic motion forecasts. Importantly, our approach still permits the model to predict non-compliant behavior that does not follow traffic rules in the rare event that this occurs. Our method is general and can be applied to any model that can generate actor trajectory samples $y_{n}$ and evaluate their marginal likelihood $p(y_{n}|x_{n})$ efficiently. Since for the remainder of the paper we only refer to per-actor marginal likelihoods, we simplify the notation and refer to an actor’s trajectory as $y$ , and its local context as $x$ (i.e. the detected bounding box and local LiDAR/map features). We defer the explanation of the particular state-of-the-art perception and prediction model we use to Section IV, as this is not the main contribution of our work.

Given a traffic scene, humans have rich prior knowledge over how the traffic participants might behave. In this paper, we propose to directly use this prior knowledge as supervision when learning an actor’s distribution over future trajectories. Towards this goal, we encode the prior knowledge as a deterministic reward function $r(y,x)$ . We then define the prior knowledge objective as the negative expected reward over samples from the future trajectory distribution. Note that applying the loss directly to the point estimate of the means is not sufficient, since our goal is to learn an accurate characterization of the full distribution for safe motion planning. The goal is then to learn a stochastic policy (parameterized by $\theta$ ) that maximizes the expected reward:

	$\displaystyle\mathcal{L}_{\text{prior}}(x;\theta)$	$\displaystyle=\mathop{\mathbb{E}_{y\sim p_{\theta}(y\|x)}}\left[-r\left(y,x)\right)\right]$
		$\displaystyle=\int-p_{\theta}\left(y\|x\right)r\left(y,x\right)dy$

Most priors are non-differentiable and cannot be easily relaxed (e.g., motion forecast following the traffic rules or not) Thus, we leverage policy gradient algorithms, which do not assume differentiability of the reward function $r$ and allow direct optimization without making any approximations. In particular, we use the popular REINFORCE algorithm [8], which only requires the policy to be differentiable, provide efficient sampling, and allow likelihood evaluation. In this case, the gradient of the expected loss can be computed as:

\displaystyle\nabla\mathcal{L}_{\text{prior}}(x;\theta)=\mathop{\mathbb{E}_{y\sim p_{\theta}\left(y|x\right)}}\left[-\nabla\log p_{\theta}\left(y|x\right)r\left(y,x\right)\right]

The expectation can then be approximated by drawing samples from the predicted distribution as follows:

\nabla\mathcal{L}_{\text{prior}}(x;\theta)\approx\frac{1}{S}\sum_{i}^{S}-\nabla\log p_{\theta}\left(y^{i}|x\right)r(y^{i},x)

where $y^{i}$ is the $i$ th trajectory sample, over $S$ samples. Although this Monte Carlo estimation is unbiased, it typically has high variance. Our experiments show that this does not pose a problem when using a policy that has an efficient sampling mechanism, since we can draw a large number of samples.

In practice, our reward explicitly incorporates the prior knowledge that drivers generally follow the reachable lanes defined by lane markers, traffic signs and traffic lights. Furthermore, we leverage our knowledge about the self driving vehicle’s planned route to place emphasis on the most relevant actors. Intuitively, missing the prediction of an actor coming in conflict with the SDV route (false negatives) or predicting that an actor will cross in front of the SDV when in reality it stops (false positives) are of greater importance than accurately characterizing the behavior of an actor 50 meters behind the SDV. We express the final reward as a simple linear combination of rewards

r(y,x)=\sum_{t}^{T}r_{\text{reach}}(y_{t},x)+r_{\text{route}}(y_{t},x)

Next, we explain both terms in detail.

III-A Reachable Lanes

Human driving behavior is highly structured: in the majority of scenarios, drivers will follow the road topology and traffic rules. To leverage this informative prior, but not overly penalize non-compliant behavior, we define a flexible traffic-rule informed loss that is conditioned on ground-truth behavior. To this end, we leverage a lane-graph representation where the nodes encode lane segments and the edges represent relationships between lane segments such as adjacency, predecessor, and successor (taking into account direction of traffic flow). This allow us to define the reachable lanes loss per timestep as:

r_{\text{reach}}\left(y_{t},x\right)=\begin{cases}r_{d},&\text{if }y_{t}\in\text{reach}(x)\land y_{t}^{gt}\in\text{reach}(x)\\ -r_{d}&\text{if }y_{t}\notin\text{reach}(x)\land y_{t}^{gt}\in\text{reach}(x)\\ 0,&\text{otherwise}\end{cases}\\

(1)

where $\text{reach}(x)$ represent the lanes that are reachable from the detected vehicle bounding box by obeying the traffic rules. Note that to be robust to noise in the lane graph and avoid penalizing non-compliant behaviors, we only apply the loss if the ground truth waypoint $y_{t}^{gt}$ falls within the set of reachable lanes as well. This loss is summarized in Fig. 2. To define this set of reachable lanes, we capture lane divider infractions as well as traffic light violations on the lane-graph.

Model	Final Lane Error (%)			meanADE (m)			minADE (m)
	Straight	Left	Right	Straight	Left	Right	Straight	Left	Right
SpAGNN [5]	18.79	51.27	51.52	3.36	5.70	6.22	0.68	1.63	1.68
MultiPath [7]	12.76	49.27	39.72	2.40	5.09	5.08	0.58	1.59	1.30
R2P2-MA [17]	6.36	42.26	28.58	2.57	4.74	5.09	0.75	1.85	1.63
SpAGNN+	10.07	47.92	39.60	2.35	4.53	4.89	0.53	1.39	1.26
Ours	6.28	39.13	28.07	2.17	4.16	4.57	0.54	1.60	1.51

TABLE I: [ATG4D] Motion forecasting

Model	Final Lane Error (%)			meanADE	minADE
	Straight	Left	Right	(m)	(m)
SpAGNN+	15.24	24.24	28.48	1.72	0.48
Ours	10.69	17.44	20.31	1.65	0.50

TABLE II: [nuScenes] Motion Forecasting

III-A1 Lane infraction

Lane dividers limit the set of legal high-level actions a vehicle can take in the road. For instance, lane changing over a solid line or taking over another vehicle by crossing a yellow double solid line into opposite traffic are not allowed. We incorporate this prior by removing the edges corresponding to illegal maneuvers from the lane-graph. Encoding this prior helps the model predict less entropic distributions.

III-A2 Traffic light violation

Many interactions occur at intersections, some of which are safety critical. Thus it is important to have accurate actor predictions at intersections, particularly differentiating stopping and going behaviors. To this end, we leverage the traffic control states (i.e., green, red, yellow) to remove edges connecting lane segments that are currently governed by a red traffic light.

Once we have processed the lane-graph by applying the aforementioned rules, we perform lane association to match each vehicle bounding box to a lane (or set of lanes when the vehicle overlaps with multiple ones, for example during a lane change). Subsequently, we run a depth first search starting from the current lane segment, obtaining a set of reachable lanes.

III-B SDV Route

It is more important to precisely characterize the motion of vehicles that might interact with the SDV [23], rather than other traffic participants that do not influence the SDV behavior. We can approximate the area of interest with the SDV’s route (i.e. high-level command), which is defined as the union of all lane segments that the SDV can travel on to reach a preset goal, given the lane-graph. More concretely, the horizon is set to be equal to the prediction horizon (5s), and the target lane is generated by a given high level route planner (out of the scope of this paper). This gives a safe approximation over its future possible locations.

We now define positive trajectories as those with at least one waypoint falling within the SDV route, and negative otherwise. We would like our trajectory predictions to achieve high precision and high recall under this definition, taking into account if the ground-truth trajectory intersects the route (positive) or not (negative).

More concretely, we define the route loss as:

r_{\text{route}}\left(y_{t},x\right)=\begin{cases}r_{tp},&\text{if }y_{t}\in\text{route}\land y_{t}^{gt}\in\text{route}\\ r_{fp},&\text{if }y_{t}\in\text{route}\land y_{t}^{gt}\notin\text{route}\\ r_{tn},&\text{if }y_{t}\notin\text{route}\land y_{t}^{gt}\notin\text{route}\\ r_{fn},&\text{if }y_{t}\notin\text{route}\land y_{t}^{gt}\in\text{route}\\ \end{cases}

(2)

where we have different rewards for true positive, false positive, true negative and false negative waypoint predictions, since there is high imbalance in the data and they have different impact on the safety of our motion planner. This loss is illustrated in Fig. 3.

IV Perception and Prediction Model

So far we have presented our general framework, but we have not specified the particular perception and motion forecasting model we apply our prior knowledge loss to. We exploit a combination of the backbone feature extraction, object detection network, and graph propagation from SpAGNN [5], together with the mixture of Gaussians output parameterization of MTP [6]. This perception and prediction model takes a voxelized LiDAR point cloud and a raster map as input, extracts scene features using a backbone CNN, and applies Rotated Region of Interest Align (RRoI) [24] to extract per-actor features. After that, we define a fully-connected graph where the nodes correspond to traffic participants, and perform a series of graph-propagations to refine each actor’s representation by aggregating features from their neighbors, as proposed by [5]. Finally, an MLP header predicts the parameters of a multimodal distribution over future trajectories by using a mixture of Gaussians with full covariance for each actor:

\displaystyle p(y|x)=\sum_{k}p(a^{k}|x)\prod_{t=1}^{T}p(y_{t}|\mu_{t}^{k}(x),\Sigma_{t}^{k}(x))

Our method can be trained end-to-end using back-propagation and stochastic gradient descent. In particular, we minimize a multi-objective loss containing a classification and regression terms for object detection, a symmetric motion forecasting loss agnostic to the map or SDV, as well as the prior informed non-differentiable loss we described in Section III. The loss of each actor is a weighted sum of the multiple objectives.

\displaystyle\mathcal{L}=\alpha\cdot\mathcal{L}_{\text{det}}+\beta\cdot\mathcal{L}_{\text{symmetric}}+\gamma\cdot\mathcal{L}_{\text{prior}}

For the classification branch ( $\mathcal{L}_{\text{cla}}$ ) of the detection header (background vs vehicle), we employ a binary cross entropy loss with hard negative mining. In particular, we select all positive examples from the ground-truth and 3 times as many negative examples from the rest of spatial locations. For box fitting ( $\mathcal{L}_{\text{reg}}$ ), we apply a smooth L1 loss to each of the 5 parameters $(x,y,w,h,\phi)$ of the bounding boxes anchored to a positive example .

\displaystyle\mathcal{L}_{\text{det}}=\mathcal{L}_{\text{cla}}+\lambda\cdot\mathcal{L}_{\text{reg}}

Instead of directly optimizing the likelihood of the mixture model, we follow [6] in heuristically matching the closest mode to the ground-truth and only taking the negative log likelihood of that mode, while training the mode scores with a cross-entropy loss. This has been shown empirically [7] to be a more stable training objective than optimizing the mixture likelihood directly as in [25]. Thus we define

\displaystyle\mathcal{L}_{\text{symmetric}}=-\sum_{k}\text{1}(k=\hat{k})[\log p(a^{k}|x)+\sum^{T}_{t}\log p(y^{k}_{t}|x)]

where $\hat{k}=\operatorname*{arg\,min}_{k}\text{dist}(y^{k},y^{gt})$ is the mode whose mean is closest to the ground truth trajectory in euclidean distance.

Since the gaussian waypoints in this model are independent across time, we use a heuristic trajectory sampler at inference to draw smooth samples from this model (see Appendix -A for details). Note that during training we do not need this sampler since all the losses are formulated at the waypoint level. However, it is important to have temporally consistent samples for measuring the impact of our predictions in the downstream task of motion planning, since we approximate the heading into the future by finite differences between waypoints of the sample trajectories.

Model	Collision	L2 human	Lat. acc.	Jerk	Progress
	(% up to 5s)	(m @ 5s)	(m/ $s^{2}$ )	(m/ $s^{3}$ )	(m @ 5s)
SpAGNN [5]	4.19	5.98	2.94	2.90	32.37
R2P2-MA [17]	3.71	5.65	2.84	2.53	33.90
MultiPath [7]	3.30	5.58	2.73	2.57	32.99
SpAGNN+	3.33	5.52	2.77	2.56	33.11
Ours	2.75	5.43	2.67	2.47	33.09

TABLE III: [ATG4D] System level performance

V Experimental Evaluation

In this section, we first describe how we measure the motion forecasting ability of our method, namely how well it predicts the future behavior of all the traffic participants. We then introduce the comprehensive set of metrics that we use for evaluation in the downstream task of ego-motion planning. Despite being the final goal of the system, this task has generally been ignored in previous motion forecasting works. Next, we discuss the state-of-the-art baselines we compare against, report extensive quantitative results on two challenging, real-world datasets: ATG4D [2] and nuScenes [9], and showcase the differences of adding prior knowledge from a qualitative standpoint. Finally, we perform a thorough ablation study to justify our choices on how to incorporate prior knowledge. We defer our implementation details to Appendix -A.

V-A Metrics

Motion forecasting results neglect the fact that actors are of different importance in the overall system, and that motion forecasting metrics are not necessarily aligned with system level metrics. A model with the best aggregate metrics in perception and motion forecasting may excel at the unimportant cases, and miss safety-critical cases resulting in unsafe driving. Since our method is a perception and motion forecasting system, we lead the discussion with motion forecasting metrics for clarity, as well as to gain intuitions on what are the differences with the baselines. However, what we really care about is how well we drive, and thus provide an extensive analysis of system level metrics to conclude this section.

V-A1 Motion Forecasting

We follow previous works [12, 2, 5] in joint perception and prediction and perform IoU-based matching between object detections and ground-truth bounding boxes, ignoring fully occluded vehicles without any LiDAR point.

In order to measure the performance of the motion forecasting system we use sample quality measures, following [17]. In particular, we use (i) the final lane error (trajectory waypoint inside vs. outside the reachable lanes at 5 seconds into the future) to measure map understanding, (ii) minimum average displacement error (minADE) to show the recall of our motion forecasts at different time horizons, and (iii) mean average displacement error (meanADE), which gives us an idea of the precision of our predictions since unrealistic samples severely harm this metric. Furthermore, we benchmark the performance in a diverse set of driving behaviors, by breaking down all metrics by the ground-truth high-level action of the vehicle: going straight, left turning and right turning. We omit stationary and near stationary vehicles, since all models do well in detecting and predicting the future states of those. We use 50 samples for all evaluations.

Note that, in contrast to [17], we emphasize the need of metrics that show the precision of the predictions, as opposed to using only recall-oriented metrics such as minADE, for 2 reasons. First, recall is easily achievable at the expense of precision by simply predicting very fan-out distributions. Second, precision in the motion forecasts is critical for safe motion planning, as we show in Sec. V-B.

V-A2 Ego-Motion Planning

To evaluate how our approach impacts the full system, we use the learnable motion planner proposed in [10]. We feed the planner with 50 trajectory samples for each vehicle, as a Monte Carlo approximation of the marginal distribution. Thus, we assign equal weight to each sample to avoid overweighing the high likelihood region of the distribution, since the samples already come from our model. This way we can keep the motion planner as proposed. Because in this paper we do not consider perception and prediction of pedestrian and bicyclists, we feed the ground-truth trajectories of these traffic participants to the motion planner, for all experiments.

We focus on the safety-related metric of collision rate ( $\%$ of time the SDV plan collides with any other traffic participant in the ground-truth, for a future horizon of 5 seconds). We also provide results on comfort-related metrics such as lateral acceleration and jerk, to reveal any potential tradeoff. Finally, we also include the progress of the SDV to show that the methods are not just trivially avoiding collisions and uncomfortable situations by staying still. Note that these metrics are computed in open-loop, by unrolling the motion plan for the duration of the prediction horizon.

[Uncaptioned image] — TABLE IV: Motion forecasting visualizations of 50 smamples. Time in the motion forecasts is encoded in the rainbow colormap ranging from red (0s) to pink (5s). Ground-truth is shown by the white trajectory waypoints.

V-B Comparison Against State of the Art

We compare our approach to three previously proposed state-of-the-art motion forecasting approaches: SpAGNN [5], R2P2-MA [17], and MultiPath [7]. We also consider the model outlined in Section IV but without applying our proposed prior knowledge loss ( $\mathcal{L}_{\text{prior}}$ ) as a baseline, which we call SpAGNN+. We implemented all models in the context of joint perception and prediction, and employ the same detection backbone in the baselines as in our approach for a fair comparison. We provide more details about the adaptations of R2P2-MA [17], and MultiPath [7] to the joint perception and prediction setting in Appendix -A.

Motion Forecasting: Table I shows motion forecasting results in ATG4D, operating at a 90 $\%$ common recall point across all models for fair comparison. We first show that adding the mixture of gaussians as output parameterization to SpAGNN improves all sample quality metrics (SpAGNN vs. SpAGNN+). This makes sense since unimodal distributions cannot capture multi-modal behaviors such as breaking vs. accelerating, or turning right vs. going straight. Incorporating prior knowledge via our proposed method delivers much better map understanding and precision, as shown by the final lane error and meanADE. While the distance between the ground-truth and the closest sample (minADE) suffers a minor regression when incorporating prior knowledge, we have shown that this does not impact the downstream task of motion planning in Section V-B. Indeed, it is a metric that does not provide a description of the full distribution and overly favors those with higher entropy.

We show qualitative results for 4 scenarios in ATG4D with diverse road topologies in Fig. IV. We can clearly see how adding prior knowledge makes the distributions less entropic while preserving multi-modality, and substantially improves the map understanding of our predictions. We highlight that despite incorporating prior knowledge about the fact that vehicles tend to follow the map, we can see in Example 2 how our model can predict vehicles going out of the map (in this case an unmapped driveway).

In Table II, we validate our method on nuScenes to show that it is robust to variations in the high definition map specification. We use the same evaluation setup, with the exception that we operate our object detector at a 60 $\%$ common recall point. As in our dataset, our approach shows improvements in precision and map understanding metrics.

Ego-Motion Planning: In Table III, we show that our method yields much safer motion plans, indicated by a 17% collision reduction over the strongest baseline in this metric. Remarkably, the increase in safety is not at the expense of comfort, where our method achieves marginally less jerk and lateral acceleration than other approaches. Note that while progress in general is desired, it cannot be at the expense of safety and comfort. We notice that the ego-motion plans make similar progress across models, but our approach produces the closest trajectories to the ground truth executed by an expert human driver (lowest L2 distance at 5 seconds into the future), while yielding much fewer collisions. We observe that despite the popularity of the minADE metric across previous works in motion forecasting, the model that achieves the lowest minADE in Table I does not yield the safest or most comfortable ego-motion plans.

Fig. V showcases an example that illustrates a repeated behavior throughout the dataset: the baseline method produces much more fan-out distributions that cause the ego-vehicle to verge into dangerous situations. In this particular example, there is a vehicle in the oncoming lane for which the baseline predicts with significant probability that will drive into the SDV’s lane, making the motion planner to drive into opposite traffic and finally causing a collision. In contrast, our model predicts a more precise and plausible distribution over possible futures where the relevant vehicle is predicted to follow its original lane.

V-C Ablations

To show that our design choices to incorporate prior knowledge are sound, we explore different approaches:

Prior Repr.	Approach	Final Lane Error (%)			meanADE (m)			minADE (m)
		Straight	Left	Right	Straight	Left	Right	Straight	Left	Right
-	-	10.07	47.92	39.60	2.35	4.53	4.89	0.53	1.39	1.26
Reach.	Reconstruction	10.08	48.02	38.21	2.37	4.48	4.85	0.53	1.32	1.24
Centerline Dist	Mean	11.95	53.78	46.11	2.80	11.07	9.03	0.63	2.07	1.95
Reach. Dist	Mean	9.68	47.65	39.73	2.24	4.98	5.12	0.55	1.58	1.53
Centerline Dist	Reparam	7.22	40.21	30.66	2.36	4.81	5.24	0.57	1.80	1.90
Reach. Dist	Reparam	6.84	37.75	27.55	2.29	4.46	4.91	0.56	1.68	1.86
Reach.	REINFORCE	6.74	41.57	29.80	2.25	4.47	4.82	0.55	1.68	1.65
Reach. & Route	REINFORCE	6.28	39.13	28.07	2.17	4.16	4.57	0.54	1.60	1.51

TABLE VI: [ATG4D] Motion forecasting ablations

Prior Repr.	Approach	Collision	L2 human	Lat. acc.	Jerk	Progress
		(% up to 5s)	(m @ 5s)	(m/ $s^{2}$ )	(m/ $s^{3}$ )	(m @ 5s)
-	-	3.33	5.52	2.77	2.56	33.11
Reach.	Reconstruction	3.58	5.54	2.80	2.56	33.10
Centerline Dist	Mean	3.94	5.72	2.81	2.75	32.40
Reach. Dist	Mean	3.30	5.46	2.75	2.55	32.99
Centerline Dist	Reparam	3.61	5.52	2.70	2.56	33.00
Reach. Dist	Reparam	3.25	5.49	2.70	2.52	33.04
Reach.	REINFORCE	3.00	5.45	2.70	2.51	33.02
Reach. & Route	REINFORCE	2.75	5.43	2.67	2.47	33.09

TABLE VII: [ATG4D] System level performance ablations

Reachable Lanes Reconstruction

Similar to the road loss proposed in [22], this approach uses an auxiliary convolutional head at the RRoI level that predicts the reachable lanes for each vehicle, represented as a spatial, binary mask. For this ablation, we replace our prior loss by a cross-entropy per pixel reconstruction loss, hoping to make the backbone features more map-aware.

Differentiable relaxations

If the prior knowledge is a differentiable reward function $r$ over the support of the distribution, we can directly apply the loss to the model samples and backpropagate the gradients through the gaussian output distribution using the reparameterization trick [26].

More concretely, we consider two differentiable relaxations of our non-differentiable reachable lanes loss as baselines, shown in Fig. 4:

1.

Distance to centerline: we define the loss at each spatial location as the closest distance to a centerline of any lane in the set of reachable lanes.
2.

Distance to boundary: we define the loss at each spatial location as the closest distance to the boundary of one of the reachable lanes when outside the reachable lanes surface, and zero when inside.

We test the aforementioned relaxed losses by applying them both to the mean of the distribution only, as well as to samples drawn from $p(y|x)$ .

Next, we discuss the motion forecasting results and finally the impact on motion planning.

Motion Forecasting As shown in Table VI, the implicit feature learning approach inspired by [22] barely changes from the baseline. Applying the relaxed loss only to the mean of the distribution can have conflicting effects with the data likelihood term in the loss, since the latter optimizes the full distribution but the former only the mean. This yields to results that are worse than the base model over all metrics. Finally, applying the relaxed loss to the samples via reparameterization trick achieves the best lane metrics, but harms the meanADE. This is caused by the approximation of the reachable lanes loss to a continuous function when applying the distance transform. For instance, in curvy lanes or turns, the relaxed loss will push the predictions to the closest point in the reachable lanes, causing the trajectory to shorten (i.e. reduce the speed). Another drawback is that in branch topologies, the prediction will get pushed to the closest branch, which could differ from the ground-truth one.

Ego-Motion Planning In Table VII, we show that not all approaches to incorporating prior knowledge improve the safety and comfort of ego-motion planning. In particular, the continuous relaxations of the reachable lanes loss do not reduce the number of collisions despite improving the final lane error as shown in the previous paragraph. We conjecture that there is a fine balance between map understanding, precision, and recall that is adequate for motion planning. The baselines sacrifice too much precision and recall in exchange of map understanding, most likely due to the approximations in the relaxation. Utilizing the REINFORCE gradient estimator to optimize the exact prior translates into much safer plans, particularly when incorporating the SDV route loss, showing that it is important to focus on the actors that can interact with the SDV.

VI Conclusion and Future Work

In this paper we have proposed a novel framework to explicitly incorporate prior knowledge into probabilistic motion forecasts, while allowing to predict non-compliant behavior when there is evidence. Our method is general, and can be applied to any model that can generate trajectory samples and evaluate their marginal likelihood per actor. We have demonstrated the effectiveness of our approach in two challenging real-world datasets, significantly outperforming other state-of-the-art methods in both motion forecasting as well as in the downstream task of motion planning. Though we have chosen SpAGNN [5] as the base model here, our method is general. We plan on integrating our approach with more base models, with a particular interest in joint distributions over actors where we can also apply our prior knowledge about interactions, such as the fact that vehicles do not generally collide.

References

[1] B. Li, T. Zhang, and T. Xia, “Vehicle detection from 3d lidar using fully convolutional network,” arXiv preprint arXiv:1608.07916, 2016.
[2] B. Yang, W. Luo, and R. Urtasun, “Pixor: Real-time 3d object detection from point clouds,” in Proceedings of the IEEE CVPR, 2018.
[3] S. Shi, Z. Wang, X. Wang, and H. Li, “Part-a^ 2 net: 3d part-aware and aggregation neural network for object detection from point cloud,” arXiv preprint arXiv:1907.03670, 2019.
[4] N. Rhinehart, K. M. Kitani, and P. Vernaza, “R2p2: A reparameterized pushforward policy for diverse, precise generative path forecasting,” in ECCV, 2018.
[5] S. Casas, C. Gulino, R. Liao, and R. Urtasun, “Spatially-aware graph neural networks for relational behavior forecasting from sensor data,” arXiv preprint arXiv:1910.08233, 2019.
[6] H. Cui, V. Radosavljevic, F.-C. Chou, T.-H. Lin, T. Nguyen, T.-K. Huang, J. Schneider, and N. Djuric, “Multimodal trajectory predictions for autonomous driving using deep convolutional networks,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 2090–2096.
[7] Y. Chai, B. Sapp, M. Bansal, and D. Anguelov, “Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction,” arXiv preprint arXiv:1910.05449, 2019.
[8] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992.
[9] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” arXiv preprint arXiv:1903.11027, 2019.
[10] A. Sadat, M. Ren, A. Pokrovsky, Y.-C. Lin, E. Yumer, and R. Urtasun, “Jointly learnable behavior and trajectory planning for self-driving vehicles,” arXiv preprint arXiv:1910.04586, 2019.
[11] W. Luo, B. Yang, and R. Urtasun, “Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net,” in Proceedings of the IEEE CVPR, 2018.
[12] S. Casas, W. Luo, and R. Urtasun, “Intentnet: Learning to predict intention from raw sensor data,” in Conference on Robot Learning, 2018.
[13] W. Zeng, W. Luo, S. Suo, A. Sadat, B. Yang, S. Casas, and R. Urtasun, “End-to-end interpretable neural motion planner,” in Proceedings of the IEEE CVPR, 2019.
[14] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun, “Multi-task multi-sensor fusion for 3d object detection,” in Proceedings of the IEEE CVPR, 2019.
[15] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social lstm: Human trajectory prediction in crowded spaces,” in Proceedings of the IEEE CVPR, 2016.
[16] A. Jain, S. Casas, R. Liao, Y. Xiong, S. Feng, S. Segal, and R. Urtasun, “Discrete residual flow for probabilistic pedestrian behavior prediction,” 2019.
[17] N. Rhinehart, R. McAllister, K. Kitani, and S. Levine, “PRECOG: PREdiction Conditioned On Goals in Visual Multi-Agent Settings,” arXiv e-prints, p. arXiv:1905.01296, May 2019.
[18] C. Tang and R. R. Salakhutdinov, “Multiple futures prediction,” in Advances in Neural Information Processing Systems, 2019.
[19] N. Deo and M. M. Trivedi, “Convolutional social pooling for vehicle trajectory prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 1468–1476.
[20] C. M. Bishop, Pattern recognition and machine learning, 2006.
[21] M.-F. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan, et al., “Argoverse: 3d tracking and forecasting with rich maps,” 2019.
[22] M. Bansal, A. Krizhevsky, and A. Ogale, “Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst,” arXiv preprint arXiv:1812.03079, 2018.
[23] K. S. Refaat, K. Ding, N. Ponomareva, and S. Ross, “Agent prioritization for autonomous navigation,” arXiv preprint arXiv:1909.08792, 2019.
[24] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue, “Arbitrary-oriented scene text detection via rotation proposals,” IEEE Transactions on Multimedia, vol. 20, no. 11, 2018.
[25] C. M. Bishop, “Mixture density networks,” 1994.
[26] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” 2013.

-A Implementation Details

Reachable lanes and Route

We represent the reachable lanes and the route as a spatial binary mask (25 meters by 70 meters at 4 meter/pixel resolution) in each vehicle’s coordinate frame defined by its heading. This allows us to calculate the loss over samples efficiently. To evaluate our prior loss over a trajectory, we simply index into the raster at different waypoints.

Training

The hyperparameters we use for the loss function are: $\alpha=1.0$ , $\beta=0.1$ , $\gamma=0.1$ , and $\lambda=0.5$ . The rewards that we used are $r_{d}=1.0$ , $r_{tp}=r_{fp}=r_{fn}=1.0$ , and $r_{tn}=0.1$ . We use Adam optimizer with a base learning rate $5e-6$ , which we increase linearly with the batch size. All models are trained from scratch.

Network Architecture

We use the architecture proposed in [5], where we modify the final regression layer to predict 16 modes. Each mode is composed by a mode score, and 11 Gaussian waypoints into the future (one every 0.5 seconds). Each waypoint is composed of 5 predictions: its centroid ( $\mu_{x},\mu_{y}$ ), its standard deviation ( $\sigma_{x},\sigma_{y}$ ), and its correlation $\rho$ . The prediction coordinate frame is defined by the detected bounding box, where the x axis corresponds to the heading of the vehicle.

Heuristic Trajectory Sampler

We can heuristically sample smooth trajectories over time from a model that predicts independent gaussian waypoints over time, such as the one described in Section IV. More precisely, we extract temporally smooth trajectories by applying the re-parameterization trick for a bi-variate normal $y_{t}^{s}=\mu_{t}+A_{t}\cdot\varepsilon^{s}$ , where the model predicts a normal distribution $\mathcal{N}\left(y^{s}_{t}|\mu_{t},\Sigma_{t}\right)$ per waypoint $t$ , $(A_{t})^{T}\cdot A_{t}=\Sigma_{t}$ is the cholesky decomposition of the covariance matrix, and $\varepsilon_{s}\sim\mathcal{N}(0,I)$ is the noise sampled from a standard bi-variate normal distribution. Note that the noise $\varepsilon^{s}$ is constant across time $t$ for a given sample $s$ . Intuitively, having a constant noise across time steps allows us to sample waypoints whose relative location with respect to its predicted mean and covariance is constant across time (i.e. translated by the predicted mean and scaled by the predicted covariance per time).

Reachable lanes in nuScenes

Recently, high definition maps were provided as an extension to the original nuScenes dataset. Unfortunately, some of the road topology information is not provided. In order to make use of the road topology to obtain the reachable lanes, we programmatically inferred such information by defining neighbors to be lane polygons that share vertices, share a direction of travel and are not separated by a road marker barring crossing. Successors are lanes or intersections that share terminating vertices and share traffic directions.

Baseline adaptations

SpAGNN [5] did not require any adaptation because it was already proposed to work in the joint perception and prediction setting we operate. However, the formulation in MultiPath [7] and R2P2-MA [17] assumes object detections and past tracked trajectories for every actor are given. Because our approach is tracking-free, and SpAGNN [5] showed that the joint perception and prediction setting yields better results than using an off-the-shelf tracker to obtain past trajectories, we perform adaptations to these baselines. For MultiPath [7], the single difference is that our RRoI pooled features come from sensor data and raster map, while in their case the actor’s past tracked trajectories are rasterized. For R2P2-MA [17], we replace the RNN-based past trajectory encoder with the same per-actor feature extractor that is used in our method.

	t = 0 s	t = 1.5 s	t = 3.5 s	t = 5 s
SpAGNN+
Ours