
1 NEC-Labs   2 Computer Science, University of Kentucky   3 UC San Diego

Supplemental Material:
Learning to Look around Objects for Top-View Representations of Outdoor Scenes

Samuel Schulter^1,†   Menghua Zhai^2,†   Nathan Jacobs^2   Manmohan Chandraker^1,3

The supplemental material contains the following items:

  • Section 1: Additional experiments and details of the proposed hallucination models for semantic segmentation and depth prediction.

  • Section 2: Extended ablation study of the proposed refinement module.

  • Section 3: Additional qualitative results of the semantic bird’s eye view representation.

  • Section 4: Evaluation of semantic segmentation (and depth prediction) of visible pixels.

1 Semantic and depth hallucination

Hallucinating the semantics and depth behind foreground objects is an important part of the proposed BEV mapping. The main paper shows two experiments for hallucination on the KITTI-Anon data set. Here, we provide an ablation study of several other aspects of the hallucination-CNN on two data sets.

First, we investigate the random box sampling strategy used for training the hallucination networks, see Section 3.1 and Figure 2(b) in the main paper. We name each sampling strategy based on four properties as “geometry - background class - object size - object count”, where each property can take the following values (a schematic sketch of the sampling procedure is given after this list):

  • “geometry” is either “none” or “perspective (persp.)”. Perspective means that we apply a transformation to the bounding boxes to mimic depth, e.g., objects further away become smaller. “None” means that we do not change box sizes based on any prior.

  • “background class” is also a prior about where bounding boxes are placed. For “road”, we only put boxes at positions where they significantly overlap with road pixels. For “bg”, significant overlap with any background class is required.

  • “object size” is the typical object height (in pixels) for objects closest to the camera, i.e., at the bottom of the 2D image. Note that the object size is scaled based on the y-axis location when the “perspective” option is used.

  • “object count” is the number of bounding boxes sampled per image.
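
To make the sampling procedure concrete, the following is a minimal sketch of how such boxes could be drawn. All function and parameter names, the linear perspective model, the rough horizon location, and the overlap threshold are our own illustrative assumptions, not the exact implementation used in the paper:

```python
import numpy as np

def sample_boxes(img_h, img_w, valid_mask, geometry="persp",
                 obj_size=150, obj_count=3, min_overlap=0.7, rng=None):
    """Sample random boxes to mask out during hallucination training.

    valid_mask is a boolean map of the allowed background class
    ("road" only, or any background class for "bg"). The linear
    perspective model and the overlap threshold are assumptions.
    """
    rng = rng or np.random.default_rng()
    horizon = img_h // 3  # assume boxes are only placed below a rough horizon line
    boxes, tries = [], 0
    while len(boxes) < obj_count and tries < 1000:
        tries += 1
        x = int(rng.integers(0, img_w))
        y = int(rng.integers(horizon, img_h))  # bottom edge of the box
        if geometry == "persp":
            # boxes further up in the image (further away) become smaller
            scale = (y - horizon) / (img_h - horizon)
            size = int(obj_size * max(scale, 0.2))
        else:
            size = obj_size
        x0, y0 = max(0, x - size // 2), max(0, y - size)
        x1, y1 = min(img_w, x + size // 2), y
        region = valid_mask[y0:y1, x0:x1]
        if region.size > 0 and region.mean() >= min_overlap:
            boxes.append((x0, y0, x1, y1))
    return boxes
```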

In general, we can see from Table 1 that placing more or bigger bounding boxes on images during training is beneficial, particularly for the hidden pixels, which we are most interested in. For instance, putting artificial boxes only on road pixels (“persp-road-150-3”), making the boxes small (“persp-bg-50-5”), or sampling only a single box (“persp-bg-100-1”) clearly deteriorates both semantic segmentation and depth prediction for hidden pixels.

Since the hallucination CNN is jointly trained for semantic segmentation and depth prediction, we also investigate the impact of joint training of the two related tasks. We balance the loss functions for segmentation and depth prediction via $\mathcal{L} = \lambda \cdot \mathcal{L}_{\textrm{Depth}} + (1-\lambda) \cdot \mathcal{L}_{\textrm{Seg}}$, where $\lambda \in [0,1]$. We can see from Table 2 that the two tasks typically help each other, i.e., a value of $\lambda$ away from the two ends of the spectrum gives the best results. The only exception is semantic segmentation of visible pixels on the Cityscapes data set. For hidden pixels, the benefit of jointly training for both tasks can be seen on all data sets and is more pronounced than for visible pixels.
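
As a minimal sketch of this balancing, assuming a cross-entropy segmentation loss and an L1 depth loss on valid pixels (the individual loss terms are our assumptions; only the convex combination follows the formula above):

```python
import torch.nn.functional as F

def joint_loss(seg_logits, seg_labels, depth_pred, depth_gt, lam=0.5):
    """L = lam * L_Depth + (1 - lam) * L_Seg, with lam in [0, 1].

    The cross-entropy and masked L1 terms are illustrative assumptions.
    """
    loss_seg = F.cross_entropy(seg_logits, seg_labels, ignore_index=255)
    valid = depth_gt > 0  # ignore pixels without ground-truth depth
    loss_depth = F.l1_loss(depth_pred[valid], depth_gt[valid])
    return lam * loss_depth + (1.0 - lam) * loss_seg
```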

Dataset      Method             random-boxes (visible)        random-boxes (hidden)         human-gt
                                IoU    RMSE   acc    ARD      IoU    RMSE   acc    ARD      vis. IoU  hid. IoU
KITTI-Anon   none-bg-150-3      76.68  3.846  89.17  .0923    64.63  5.360  74.43  .1413    81.12     60.06
             persp-bg-150-3     75.37  4.510  88.39  .0938    61.06  7.544  63.54  .1748    80.21     60.19
             persp-road-150-3   75.03  4.192  87.89  .0964    49.18  7.848  61.09  .1890    80.19     53.18
             persp-bg-50-5      75.94  4.081  88.34  .0943    53.01  8.375  57.83  .1979    80.20     57.50
             persp-bg-100-1     76.09  4.066  88.19  .0943    57.27  8.658  58.17  .1959    80.41     58.11
             persp-bg-100-10    75.80  4.127  87.63  .0963    59.22  8.177  60.72  .1864    79.91     61.92
KITTI-Ros    none-bg-150-3      70.95  2.411  91.65  .0897    59.38  1.886  96.76  .063     88.70     65.36
             persp-bg-150-3     70.82  2.298  92.44  .0843    61.90  1.715  97.15  .060     88.22     61.68
             persp-road-150-3   69.14  2.394  91.65  .0866    47.13  2.138  93.61  .079     86.39     37.55
             persp-bg-50-5      70.68  2.356  91.95  .0858    51.92  2.389  92.23  .079     88.08     53.66
             persp-bg-100-1     70.78  2.295  92.32  .0841    54.80  1.999  95.67  .065     88.00     53.38
             persp-bg-100-10    70.39  2.283  92.41  .0837    61.64  1.698  97.26  .056     87.84     57.69
Table 1: Impact of different masking strategies for training the proposed hallucination CNN for both semantic segmentation and depth prediction.
Dataset      λ      random-boxes (visible)         random-boxes (hidden)          human-gt
                    IoU    RMSE    acc    ARD      IoU    RMSE   acc    ARD      vis. IoU  hid. IoU
KITTI-Anon   0.00   72.92  -       -      -        62.48  -      -      -        79.45     59.78
             0.25   75.17  4.105   87.64  .1000    63.70  5.905  70.79  .1577    80.24     60.26
             0.50   76.20  3.832   89.44  .0909    64.41  5.503  74.31  .1446    80.85     59.39
             0.75   75.29  3.778   90.17  .0873    63.33  5.334  76.02  .1380    80.72     59.94
             1.00   -      3.921   88.79  .0954    -      5.428  74.73  .1452    -         -
KITTI-Ros    0.00   69.29  -       -      -        57.25  -      -      -        87.93     53.29
             0.25   70.05  2.496   91.05  .0918    60.95  1.940  95.78  .0670    87.59     56.92
             0.50   71.26  2.278   92.84  .0816    59.54  1.906  96.58  .0606    88.05     54.90
             0.75   71.39  2.179   93.58  .0769    59.89  1.729  97.24  .0570    88.71     53.77
             1.00   -      2.649   91.16  .0896    -      2.166  94.65  .0732    -         -
Cityscapes   0.00   71.14  -       -      -        59.80  -      -      -        74.00     60.27
             0.25   70.62  12.810  84.57  .1325    60.26  8.290  85.39  .123     73.72     60.57
             0.50   69.75  12.813  86.27  .1234    58.80  8.226  86.83  .116     73.49     61.24
             0.75   68.19  12.787  87.03  .1206    55.69  8.187  87.24  .112     72.85     60.89
             1.00   -      12.796  84.63  .1333    -      7.848  85.18  .127     -         -
Table 2: The impact of the trade-off parameter λ (the weight on the depth loss) when training the hallucination CNN jointly for semantic segmentation and depth prediction.

In Figure 1 we present additional qualitative results of hallucinating semantics and depth, contrasted with the corresponding standard semantic segmentation and depth prediction of all visible pixels. One can clearly see the prior knowledge learned by the hallucination CNN.

Figure 1: Ten qualitative results of the hallucination CNN (two per row). The first column in each example shows the input RGB image, the semantic segmentation, and the depth prediction of all visible pixels, i.e., including foreground objects. The second column shows the masked RGB image and the hallucinated semantics and depth. Note how foreground objects like cars or pedestrians are replaced with learned priors about the environment. The first two rows are examples from the KITTI-Ros data set, rows 3 and 4 are from the KITTI-Anon data set, and the last row is from Cityscapes.

2 Additional results of the refinement module

In this section, we show additional results on the impact of the trade-off between the adversarial loss (on simulated data) and the reconstruction loss (with the initial BEV map, or OSM data if available). The main paper contains one example (Figure 7); here, we provide more examples in Figure 2. The figure clearly shows that the reconstruction loss needs to be properly balanced with the adversarial loss. Without a reconstruction loss, the module generates scene layouts that do not match the actual image evidence. On the other hand, putting too much weight on the reconstruction loss leads the refinement module to learn the identity function without improving upon its input.
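
A minimal sketch of how the two terms could be combined on the generator side of the refinement module; the non-saturating GAN loss, the L1 reconstruction term, and all names below are our assumptions:

```python
import torch
import torch.nn.functional as F

def refinement_loss(refined_bev, target_bev, disc_scores, lam=100.0):
    """Generator objective: adversarial term + lam-weighted reconstruction.

    target_bev is the initial BEV map (or OSM data, if available), and
    disc_scores are the discriminator logits for the refined map.
    """
    # adversarial term: push the discriminator to label refined maps as real
    loss_adv = F.binary_cross_entropy_with_logits(
        disc_scores, torch.ones_like(disc_scores))
    # reconstruction term: keep the refined map close to the image evidence
    loss_rec = F.l1_loss(refined_bev, target_bev)
    return loss_adv + lam * loss_rec
```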

Figure 2: Five examples of the influence of the trade-off between adversarial and reconstruction loss in the proposed refinement module. Each example shows the input RGB image and the initial BEV map (i.e., after hallucinating depth and semantics and mapping into the top-view) on the left. The seven BEV maps on the right show the output of the refinement module trained with varying $\lambda \in \{0, 1, 5, 100, 500, 1000, 16^{6}\}$ (from left to right). One can clearly observe that setting $\lambda$ too low or too high results in unfavorable representations.

3 Additional qualitative results

The main paper already contains qualitative results of the final semantic BEV representation for the KITTI-Ros data set in Figure 8. Here, we provide additional examples for the KITTI-Anon and the Cityscapes data sets in Figures 3 and 4, respectively.

Moreover, we show a few additional examples of our representation including dynamic foreground objects like cars in Figure 5.

Figure 3: Qualitative results of the final semantic BEV representation for the KITTI-Anon data set (four examples per row). Each example shows the masked RGB input, the hallucinated semantics and depth, as well as three BEV maps, which are (from left to right): The BEV map without hallucination, with hallucination, and after refinement.
Figure 4: Qualitative results of the final semantic BEV representation for the Cityscapes data set (four examples per row). Each example shows the masked RGB input, the hallucinated semantics and depth, as well as three BEV maps, which are (from left to right): The BEV map without hallucination, with hallucination, and after refinement.
Figure 5: Qualitative examples including foreground objects. Each example shows the RGB input image, the semantic segmentation, the depth map, and the estimated BEV representation.

4 Semantic segmentation and depth prediction of visible pixels

The proposed semantic bird’s eye view representation takes a semantic segmentation of the visible pixels as input in order to identify the foreground objects that occlude the scene. Any semantic segmentation module can be used; we picked a CNN architecture inspired by the PSP module [sam:Zhao17a]. Besides standard semantic segmentation, this CNN also predicts the depth of all visible pixels with a second decoder similar in structure to [sam:Laina16a], which is required for the 3D localization of dynamic foreground objects (or traffic participants) like cars and pedestrians.
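
The following is a rough sketch of such a shared-encoder, two-decoder layout; the module names, channel sizes, and the simple convolutional heads are illustrative assumptions and not the exact architecture:

```python
import torch.nn as nn
import torch.nn.functional as F

class SegDepthNet(nn.Module):
    """Shared backbone + PSP-style context module, followed by two decoders:
    one head for semantic segmentation and one for depth. Illustrative only."""

    def __init__(self, backbone, psp_module, num_classes, feat_ch=512):
        super().__init__()
        self.backbone = backbone  # e.g., a ResNet feature extractor
        self.psp = psp_module     # pyramid pooling over backbone features
        self.seg_head = nn.Sequential(
            nn.Conv2d(feat_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1))
        self.depth_head = nn.Sequential(
            nn.Conv2d(feat_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, 1))

    def forward(self, x):
        feats = self.psp(self.backbone(x))
        seg = self.seg_head(feats)              # per-class logits
        depth = F.relu(self.depth_head(feats))  # non-negative depth
        # upsample both predictions back to the input resolution
        size = x.shape[-2:]
        seg = F.interpolate(seg, size=size, mode="bilinear", align_corners=False)
        depth = F.interpolate(depth, size=size, mode="bilinear", align_corners=False)
        return seg, depth
```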

For completeness, this section provides a quantitative evaluation of this CNN. Table 3 shows our results for semantic segmentation and depth prediction, and Figure 6 provides some qualitative examples. For evaluating semantic segmentation, we use mean IoU as in the main paper. For evaluating depth prediction, we report additional metrics that are typically used and defined in [sam:Eigen14a]: RMSE, RMSE-log, accuracy (with a threshold of $\delta = 1.25$), and absolute relative difference (ARD). Note that we use a down-scaled version of Cityscapes (by a factor of 0.625) for this experiment because it significantly decreases runtime and memory consumption during training and evaluation.
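
For reference, a minimal sketch of how these standard depth metrics can be computed, following the definitions in [sam:Eigen14a]; the helper name and the NumPy interface are our assumptions:

```python
import numpy as np

def depth_metrics(pred, gt):
    """RMSE, RMSE-log, threshold accuracy (delta < 1.25), and absolute
    relative difference (ARD), evaluated on pixels with valid ground truth."""
    valid = gt > 0
    pred, gt = np.maximum(pred[valid], 1e-6), gt[valid]  # guard against log(0)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    acc = np.mean(ratio < 1.25)            # fraction of pixels with delta < 1.25
    ard = np.mean(np.abs(pred - gt) / gt)
    return rmse, rmse_log, acc, ard
```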

Dataset                       mIoU    RMSE   RMSE-log  ACC    ARD
KITTI-Anon                    69.63   4.129  0.158     89.56  .0928
KITTI-Ros [sam:Ros15a]        59.02   3.976  0.172     86.73  .1076
Cityscapes [sam:Cordts16a]    63.66   8.352  0.220     89.65  .1113
Table 3: Quantitative evaluation of standard semantic segmentation (mIoU) and depth prediction (RMSE, RMSE-log, ACC, ARD) for all visible pixels, i.e., the first step in the proposed pipeline for BEV mapping. We show results for all three data sets used in the main paper.
Figure 6: Qualitative examples for semantic segmentation and depth prediction. First two rows: KITTI; Last row: Cityscapes.