
Co-SemDepth: Fast Joint Semantic Segmentation and Depth Estimation on Aerial Images

Yara AlaaEldin University of Genova
Genova, Italy
yara.ala96@gmail.com
   Francesca Odone University of Genova
Genova, Italy
francesca.odone@unige.it
Abstract

Understanding the geometric and semantic properties of the scene is crucial in autonomous navigation and particularly challenging in the case of Unmanned Aerial Vehicle (UAV) navigation. Such information may be obtained by estimating depth and semantic segmentation maps of the surrounding environment, and for their practical use in autonomous navigation, the procedure must be performed as close to real-time as possible. In this paper, we leverage monocular cameras on aerial robots to predict depth and semantic maps in low-altitude unstructured environments. We propose a joint deep-learning architecture that can perform the two tasks accurately and rapidly, and validate its effectiveness on the MidAir and Aeroscapes benchmark datasets. Our joint architecture proves to be competitive with or superior to other single and joint architecture methods while running fast, predicting at 20.2 FPS on a single NVIDIA Quadro P5000 GPU, and it has a low memory footprint. All code for training and prediction can be found at: https://github.com/Malga-Vision/Co-SemDepth.

Index Terms:
UAV, Joint Learning, Semantic Segmentation, Depth Estimation, Real-time, Low-Altitude

I Introduction

The applications of aerial robotics, also known as Unmanned Aerial Vehicles (UAVs), are rapidly expanding across various fields, including environmental exploration, national security, package delivery, firefighting, and more. As is common in autonomous navigation, sensors are adopted to estimate scene depth and semantic information. Unlike ground autonomous vehicles, many types of UAVs, including drones, have limited computational capability and payload capacity. Thus, not all types of sensors can be mounted on the drone: for instance, LiDAR and RADAR cannot be adopted for depth estimation as they are heavy and power consuming. Also, while LiDAR point clouds contain accurate depth information, they lack semantic meaning. To associate such depth points with their semantic meaning, an additional calibration step between the LiDAR and an RGB camera has to be performed to estimate their relative transformation [1] and associate the points in the point cloud with their corresponding pixels in the image frame. However, this calibration is never fully accurate, which leads to errors in the semantic association. Other depth sensors like stereo cameras are not common and may not be appropriate for UAVs, since the small baseline distance between the two internal cameras, compared to the large distance between the stereo camera and the scene, produces inaccurate depth estimates [2]. Therefore, UAV applications often rely on monocular cameras, as they are cheap, light, small in size, and can implicitly produce image pairs with large baselines by considering non-adjacent frames [3, 4, 5]. Video cameras also have the added value of associating a semantic meaning to each point, thanks to the availability of semantic segmentation approaches [6, 7, 8, 9, 10].

Figure 1: Our proposed Co-SemDepth Architecture. It is composed of a shared encoder and two decoders. The encoder and the depth decoder are the same as those presented in [11]. The semantic decoder makes use of the encoded feature maps to produce an estimate of the semantic segmentation map. The depth and semantic maps get scaled up as they go forward through the successive levels of the decoders. The number of levels shown here is 3, while in our experiments we use 5 levels.

In this paper, we leverage monocular cameras on aerial vehicles to obtain two types of information necessary for scene understanding, namely depth estimation and semantic segmentation:

  • In monocular depth estimation (MDE), the goal is to predict the depth of each pixel in each RGB frame captured by a video camera. Such depth expresses the distance (in meters), with respect to the camera frame, of the world points appearing in the pixels.

  • In semantic segmentation, the goal is instead to predict the semantic class of each pixel in the input RGB frames. This semantic class, expressed with a unique integer, belongs to a set of predefined semantic classes of interest that the neural network was trained on.

The two modules are complementary since depth estimation expresses the geometric properties of the scene while semantic segmentation expresses the semantic properties. We propose a joint deep architecture for achieving the two tasks accurately and in real-time; see Figure 1 for an overview of our architecture. Using a joint architecture helps in saving computational time compared to performing each task separately, as well as saving GPU memory by having fewer model parameters. Also, it allows sharing learnt features between the two modules, which can in turn benefit both of them.

Our main contributions can be summarized as follows:

  • We first develop a fast single architecture for semantic segmentation inspired by the MDE architecture developed in [11]. We call our architecture M4Semantic and validate its effectiveness on aerial data.

  • We merge the two architectures (M4Depth and M4Semantic) into a joint one by sharing the feature extraction part and separating the decoders. We call our joint architecture Co-SemDepth and validate its accuracy and real-time performance on aerial data.

  • We provide a benchmark on MidAir dataset [12] for our method and other state-of-the-art methods in semantic segmentation and depth estimation and highlight the advantages of using our joint architecture compared to the others.

II Related Works

Much of the developed work in outdoor scene analysis has been driven by advancements in the automotive field, leaving aerial scene understanding comparatively under-investigated. In this section we discuss relevant state of the art, mostly referring to methods related to the automotive field.

II-A Monocular Depth Estimation

MDE can be addressed by means of classical methods like Structure from Motion (SfM) [13] or through deep learning methods. SfM methods generally suffer from limited accuracy and the scarce availability of feature correspondences [14]. Deep learning methods have been a way out of these low-level vision challenges, where the input to the DL architecture can be either a whole (single) image or an image sequence. Estimating depth from single images causes the scale ambiguity problem [15]. Some state-of-the-art methods in this family are MonoDepth [5], MonoDepth2 [16], Adabins [4], and DPT [17]. Estimating depth from image sequences improves the depth estimates compared to single-image methods by making use of the temporal relation between input video frames [18, 3, 11, 19] to better understand the scale of objects.

The training of MDE networks can be realized in a supervised or self-supervised fashion. In supervised methods [20, 21, 22, 23, 11], ground-truth depth maps must be provided to the network. The main challenge for these methods is the scarcity of available annotated datasets, especially in the aerial domain [12, 24, 25], that should cover different scenarios and environments. Some of the most used datasets in the literature are KITTI [26] for autonomous driving applications and NYU [27] for indoor scenes. Self-supervised methods can be trained to predict depth without requiring ground-truth depth maps. They use nearby video frames [28, 3, 29] or synchronized stereo pairs [30, 5] as self-supervision.

Open problems: Few MDE methods report their inference time, and no benchmark was found in the literature comparing the inference time of different MDE methods. Very few methods were benchmarked on low-altitude aerial datasets [11, 31]. We address these challenges by benchmarking our method on the aerial dataset MidAir [12]. In this benchmark we use both depth evaluation metrics and inference time to compare our performance with other state-of-the-art methods.

II-B Semantic Segmentation

Semantic segmentation is widely addressed in the literature, in particular in the autonomous driving field [6, 7, 8, 9, 10] using CityScapes [32] dataset for benchmarking.

Unlike depth estimation, semantic segmentation networks are normally trained in a supervised fashion. This can be realized using image-based or video-based methods. Some state-of-the-art examples of image-based methods are PSPNet [33], ICNet [6], ERFNet [7], BiSeNet [9], and SegFormer [8]. Many video-based methods [34, 35, 36, 37, 38, 39, 40, 41] benefit from the large overlap between successive frames by avoiding repeating the semantic segmentation computation on each frame. They extract features from key frames and propagate them to subsequent video frames using light optical flow networks or interpolation techniques. This is done to speed up inference. However, the mIoU achieved by these methods is usually lower than that of image-based methods.

Open problems: Few semantic segmentation methods [42, 43, 44] were benchmarked on low-altitude aerial datasets. We benchmark our method on the MidAir dataset for semantic segmentation. We also adapt the code of other state-of-the-art methods and benchmark them on MidAir. In addition, we benchmark our method on the Aeroscapes [43] dataset.

II-C Joint Architectures

The idea of joint or multi-task deep learning architectures has been explored in the literature for various vision tasks. In [45], a joint architecture is implemented for image segmentation and classification, while [46] developed a joint network for motion estimation and segmentation. Some works [47, 48, 49, 50, 51, 52] addressed joint depth estimation and semantic segmentation. Most of the architectures used in these works have a common feature extraction part (in the form of a deep convolutional network) followed by two separate branches dedicated to the prediction of the depth and semantic maps. In [47], it was found that the shared feature extraction part lets the semantic segmentation branch benefit from the depth features learnt in the shared encoder and leads to slightly better semantic segmentation results.

Open problems: None of the mentioned methods was benchmarked on aerial data, which incorporates very specific challenges in terms of the variability of depth and semantic maps. We benchmark our method and the joint RefineNet [49] on the MidAir dataset and we make the code publicly available.

III Problem Statement

We present the specifications of the problem we address. A monocular camera is rigidly attached to a UAV. The camera intrinsic parameters are assumed to be known and constant. The camera frame rate is relatively high, such that there are overlapping regions between every two consecutive frames. The UAV moves freely (6 DoF) in an outdoor unstructured environment and records the video frames as well as the camera position (using an IMU sensor) at each time step. Using the position at each time step, we can compute the motion transformation matrix $T$ from one frame to the next. Our objective is to design a network, denoted by a function $F$, that takes at each time step the current frame $I_t$, the previous $n$ frames $I_{seq}=[I_{t-1},I_{t-2},\dots,I_{t-n}]$, and the camera motion transformations $T_{seq}=[T_{t-1},T_{t-2},\dots,T_{t-n}]$, and outputs an estimated depth map $\hat{D}_t$ and semantic segmentation map $\hat{S}_t$ corresponding to the current frame:

(\hat{D}_{t},\hat{S}_{t})=F(I_{t},I_{seq},T_{seq}). \qquad (1)
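To make the interface of Eq. (1) concrete, the following minimal sketch (our own illustration, not the released code; all tensor shapes, including the 4x4 homogeneous form of the camera transformations, are assumptions) shows the expected inputs and outputs in TensorFlow.

```python
import tensorflow as tf

# Hypothetical shapes, for illustration only: B = batch size, n = number of
# previous frames, H x W = image resolution, Nc = number of semantic classes.
B, n, H, W, Nc = 1, 2, 384, 384, 7

I_t   = tf.zeros([B, H, W, 3])      # current RGB frame
I_seq = tf.zeros([B, n, H, W, 3])   # previous n RGB frames
T_seq = tf.zeros([B, n, 4, 4])      # camera motions as 4x4 homogeneous transforms

def F(I_t, I_seq, T_seq):
    """Stand-in for the joint network: maps the inputs of Eq. (1) to a depth
    map D_hat (meters) and a per-pixel class-probability map S_hat."""
    D_hat = tf.zeros([B, H, W, 1])
    S_hat = tf.zeros([B, H, W, Nc])
    return D_hat, S_hat

D_hat, S_hat = F(I_t, I_seq, T_seq)
```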

The starting point of our work is the M4Depth network [11]. To the best of our knowledge, it produces the current top results in monocular depth estimation on the synthetic MidAir [12] aerial dataset, and its model weights and code are publicly available. In addition, the encoder-decoder modularity of its architecture makes it a convenient candidate to be transformed into a joint architecture.

To this purpose, we first re-design the architecture to perform semantic segmentation instead of depth estimation. Then, we merge the two architectures into a single joint one, which we call Co-SemDepth, that performs both depth estimation and semantic segmentation in real-time.

IV Methodology

In this section, we first give a brief overview of M4Depth for depth estimation, then describe our M4Semantic semantic segmentation network. Finally, we merge the two networks and shed light on our proposed joint Co-SemDepth architecture.

IV-A Backbone Depth Network

The architecture of M4Depth [11] is an adaptation of the standard U-Net encoder-decoder network, trained to predict parallax maps that are then transformed into depth maps. The authors define parallax as a function of perceived motion, so it can be seen as a general form of stereo disparity for an unconstrained camera baseline.

The network takes as input a sequence of $n$ video frames (we choose $n=3$) and the camera transformation $T$ between every two consecutive frames. At each time step $t$, the encoder, the same as in Figure 1, takes the frame $I_t$ and extracts image features at different scales using its pyramidal structure. Each encoder level is composed of two convolutional layers and a domain-invariant normalization layer (DINL) [53] to increase the network robustness to varied colors and luminosity conditions.

Then, the decoder, see the top part of Figure 1, takes the feature maps at different resolutions obtained by the encoder at time $t$, the features extracted from the previous frame ($t-1$), the parallax map predicted at time $t-1$, and the camera motion transformation $T_t$ to predict the parallax map of the current frame $I_t$. This parallax $p_t$ is then transformed into a depth map $d_t$ following the transformation proposed in [11]. Each level of the decoder is composed of a preprocessing unit and a parallax refiner. The preprocessing unit is responsible for preparing the input to the parallax refiner at this level, and the parallax refiner is a stack of convolutional layers responsible for producing an estimate of the parallax map at each level.

Depth loss: The network is trained in an end-to-end fashion, and a scale-invariant $L_1$ loss is used to compute the error between the predicted depth map $\hat{d}_i$ and the ground-truth one $d_i$. The loss is computed at each decoder level and then accumulated across all the levels:

L_{depth}=\sum_{l=1}^{M}\frac{1}{N_{p}^{l}}\sum_{d_{i}^{l}}2^{l+1}\,|\log(d_{i})-\log(\hat{d}_{i})| \qquad (2)

where $M$ is the number of decoder levels and $N_{p}^{l}$ is the total number of pixels in the image at level $l$.
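As a reference, a minimal TensorFlow sketch of Eq. (2) is given below; the per-level ground-truth resizing and the small clamp before the logarithm are our own assumptions, not details taken from the released code.

```python
import tensorflow as tf

def depth_loss(d_gt, d_pred_levels):
    """Scale-invariant L1 depth loss accumulated over decoder levels (Eq. 2).

    d_gt: ground-truth depth map, shape [B, H, W, 1].
    d_pred_levels: list of predicted depth maps, one per level l = 1..M,
                   each possibly at a lower resolution than d_gt.
    """
    total = 0.0
    for l, d_hat in enumerate(d_pred_levels, start=1):
        h, w = d_hat.shape[1], d_hat.shape[2]
        # Resize ground truth to this level's resolution (nearest neighbour
        # keeps the depth values valid).
        d_l = tf.image.resize(d_gt, [h, w], method="nearest")
        # Clamp to avoid log(0); the threshold is an assumption of this sketch.
        d_l = tf.maximum(d_l, 1e-6)
        d_hat = tf.maximum(d_hat, 1e-6)
        abs_log_err = tf.abs(tf.math.log(d_l) - tf.math.log(d_hat))
        total += 2.0 ** (l + 1) * tf.reduce_mean(abs_log_err)
    return total
```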

IV-B M4Semantic Network

Inspired by M4Depth, we propose a similar architecture for semantic segmentation, depicted in Figure 2. The encoder is a stack of multiple levels and has a pyramidal structure in which the resolution of the feature map is decreased while proceeding forward through the encoder levels. The feature map produced at each encoder level is passed to its corresponding decoder level. Each decoder level is composed of a preprocessing unit and a semantic refiner. The preprocessing unit prepares the input for the semantic refiner, and the semantic refiner at each level gives an estimate of the semantic segmentation map at a specific resolution. The resolution of the semantic map is scaled up while proceeding forward through the decoder levels.

In Figure 3 we show the modules used in our architecture. The encoder at each level is composed of two convolutional layers. In the first level, DINL [53] is added after the first convolution to increase the network robustness to varied colors and luminosity conditions. ReLU activation is applied after each convolutional layer, and the resolution is decreased by a factor of 2 after each level. The pyramidal structure of the encoder helps in extracting both coarse and fine (global and local) features from the input image.
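The sketch below illustrates one such encoder level; the kernel sizes, channel widths, and the simplified per-channel normalization standing in for DINL are our own assumptions and are not taken from the released code.

```python
import tensorflow as tf
from tensorflow.keras import layers

def encoder_level(x, filters, first_level=False):
    """One pyramid level of the encoder: two 3x3 convolutions with ReLU,
    halving the resolution via the first convolution's stride. On the first
    level, a simple normalization stands in for DINL [53]."""
    x = layers.Conv2D(filters, 3, strides=2, padding="same")(x)
    if first_level:
        # Simplified stand-in for the domain-invariant normalization layer:
        # zero mean / unit variance per channel over the spatial dimensions.
        mean, var = tf.nn.moments(x, axes=[1, 2], keepdims=True)
        x = (x - mean) / tf.sqrt(var + 1e-6)
    x = tf.nn.relu(x)
    x = layers.Conv2D(filters, 3, strides=1, padding="same", activation="relu")(x)
    return x

# Example: build a 5-level feature pyramid from a 384x384 RGB input.
# The channel widths below are illustrative, not the ones used in the paper.
x = tf.random.uniform([1, 384, 384, 3])
feats = []
for level, f in enumerate([16, 32, 64, 96, 128]):
    x = encoder_level(x, f, first_level=(level == 0))
    feats.append(x)
```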

The preprocessing unit at each decoder level is a pure computational unit with no parameters to be trained. It performs two operations:

  • It upscales the semantic map $S^{L-1}$ and the semantic features $f_{S}^{L-1}$ estimated by the semantic refiner of the previous level by a factor of 2 to match the resolution of the current level.

  • It normalizes the feature map $f_{enc}^{L}$ received from the encoder.

Similar to the parallax refiner, the semantic refiner at each level is composed of a stack of convolutional layers. The last convolutional layer has a depth equal to 4 (the depth of the semantic feature map) plus $N_c$ (the number of semantic classes). The output of the semantic refiner is a predicted semantic feature map and an estimated semantic segmentation map. We apply a Softmax activation on the predicted semantic segmentation map to obtain a probability score for each class at every pixel.
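A sketch of one decoder level under the description above; the number and width of the refiner convolutions and the use of an L2 normalization in the preprocessing unit are our own assumptions — only the parameter-free upscaling, the 4 + $N_c$ output depth, and the final softmax come from the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

N_CLASSES = 7  # MidAir after our class remapping

def semantic_decoder_level(f_enc, f_sem_prev, s_prev):
    """One decoder level: a parameter-free preprocessing unit followed by a
    semantic refiner (for illustration; in a real model the layers would be
    created once and reused)."""
    # --- Preprocessing unit (no trainable parameters) ---
    h, w = f_enc.shape[1], f_enc.shape[2]
    f_sem_up = tf.image.resize(f_sem_prev, [h, w])             # upscale x2 features
    s_up = tf.image.resize(s_prev, [h, w], method="nearest")   # upscale x2 sem. map
    f_enc_norm = tf.math.l2_normalize(f_enc, axis=-1)          # normalize encoder features
    x = tf.concat([f_enc_norm, f_sem_up, s_up], axis=-1)

    # --- Semantic refiner: stack of convolutions, last one with depth 4 + Nc ---
    for filters in (64, 64, 32):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(4 + N_CLASSES, 3, padding="same")(x)

    f_sem = x[..., :4]                        # semantic feature map (4 channels)
    s_logits = x[..., 4:]                     # class logits (Nc channels)
    s_map = tf.nn.softmax(s_logits, axis=-1)  # per-pixel class probabilities
    return f_sem, s_map
```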

Different from M4Depth, our M4Semantic architecture works on single images. We removed the time dependency in the semantic decoder because this makes inference roughly two times faster than the time-dependent variant, with only a slight drop in accuracy (this architectural choice is discussed in Section V-E).

Semantic loss: the standard categorical cross-entropy loss is applied to the predicted semantic maps at each level. The ground truths are resized using nearest-neighbour interpolation to match the resolution of the predicted semantic maps at the intermediate levels. Then, these losses are aggregated through a weighted sum:

L_{semantic}=\sum_{l=1}^{M}\frac{1}{N_{p}^{l}}\sum_{p_{t}^{l}}-\log\left(\frac{p_{t}}{\sum_{j}^{N_{c}}p_{j}}\right) \qquad (3)

where $N_c$ is the number of semantic classes, $p_t$ is the output softmax probability score for the target class, and $p_j$ is the output softmax probability score for class $j$.
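A hedged TensorFlow sketch of Eq. (3) is given below, accumulating the per-level cross-entropy as described above; the equal per-level weighting is our reading of Eq. (3), and the released code may weight the levels differently.

```python
import tensorflow as tf

def semantic_loss(s_gt, s_pred_levels):
    """Cross-entropy loss of Eq. (3), summed over decoder levels.

    s_gt: integer ground-truth label map, shape [B, H, W, 1].
    s_pred_levels: list of per-level softmax probability maps, [B, h_l, w_l, Nc].
    """
    total = 0.0
    for s_hat in s_pred_levels:
        h, w = s_hat.shape[1], s_hat.shape[2]
        # Resize ground truth with nearest-neighbour interpolation (keeps labels valid).
        gt = tf.image.resize(s_gt, [h, w], method="nearest")
        gt = tf.cast(tf.squeeze(gt, axis=-1), tf.int32)
        # Per-pixel categorical cross-entropy on the softmax probabilities.
        ce = tf.keras.losses.sparse_categorical_crossentropy(gt, s_hat)
        total += tf.reduce_mean(ce)
    return total
```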

Figure 2: Our M4Semantic Architecture. It is composed of an encoder and a decoder module with a pyramidal structure. Each level of the decoder is composed of a preprocessing unit and a semantic refiner.
Figure 3: An illustration of the modules in our M4Semantic architecture. $N_c$ is the number of semantic classes.

IV-C Joint Co-SemDepth Network

To merge the two previously described networks, we adopt a multi-tasking shared encoder architecture [47, 48, 49, 51, 52]. The depth estimation and semantic segmentation networks share the encoder part for feature extraction, but each has its own decoder for the corresponding map prediction. An overview of our joint architecture is shown in Figure 1.

Loss Function: Our joint network is trained in an end-to-end fashion. The loss function for our architecture is defined as:

L_{total}=L_{depth}+w\,L_{semantic}. \qquad (4)

We incorporate a weighting factor $w$ that forces the loss values for the semantic task to lie within the same range as the losses for depth, thus ensuring a comparable contribution of the two losses during training.
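As a sketch, the combination of Eq. (4) on top of the two loss functions sketched after Eqs. (2) and (3); w = 0.1 is the value reported in Section V-B.

```python
W_SEMANTIC = 0.1  # weighting factor w from Section V-B

def total_loss(d_gt, d_pred_levels, s_gt, s_pred_levels):
    """Joint objective of Eq. (4); depth_loss and semantic_loss are the
    functions sketched after Eqs. (2) and (3)."""
    return depth_loss(d_gt, d_pred_levels) + W_SEMANTIC * semantic_loss(s_gt, s_pred_levels)
```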

V Experiments

In this section, we discuss the experiments we conducted to validate the effectiveness of our joint architecture. We first give a brief description of the datasets we used for training and evaluation, MidAir and Aeroscapes. Next, we report the implementation details. In the experimental analysis, we first evaluate the effectiveness of using our joint architecture compared to using the two single architectures. Then, we benchmark our model against other state-of-the-art methods. Finally, we present an architecture study.

V-A Datasets

MidAir [12] is a synthetic dataset collected using the AirSim simulator [54], consisting of 420K forward-view RGB video frames captured at low altitude in outdoor unstructured environments under various weather conditions. It contains annotations of depth maps, semantic segmentation, surface normals, stereo disparity, and camera locations. Hence, this dataset is suitable for training and testing our joint Co-SemDepth architecture.

We adopt the train-test split used in [11], but we randomly select 8 trajectories to create validation data. In the evaluation, the depth values are capped at 80.0 meters. We resized images to a resolution of 384x384 instead of the original 1024x1024. In the original semantic annotation of MidAir, there are 14 semantic classes: Sky, Animals, Trees, Dirt Ground, Ground Vegetation, Rocky Ground, Boulders, Empty, Water, Man-Made Construction, Road, Train Track, Road Sign, and Others. Since several classes are visually indistinguishable and some of them very small, we mapped them to a smaller set of 7 semantic classes: Sky, Water, Land, Trees, Boulders, Road, and Others. Specifically, we considered Ground Vegetation, Rocky Ground and Dirt Ground as Land, and Animals, Empty, Train Track and Road Sign as Others.
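The remapping can be implemented as a simple lookup table; a sketch is given below, where the integer ID order of the original labels and the assignment of Man-Made Construction (not stated explicitly above) are our assumptions, as noted in the comments.

```python
import numpy as np

# Our 7 target classes on MidAir.
TARGET = ["Sky", "Water", "Land", "Trees", "Boulders", "Road", "Others"]

# Original class name -> target class name, following the text. Mapping
# "Man-Made Construction" to "Others" is an assumption of this sketch.
REMAP = {
    "Sky": "Sky", "Water": "Water", "Trees": "Trees", "Boulders": "Boulders",
    "Road": "Road", "Dirt Ground": "Land", "Ground Vegetation": "Land",
    "Rocky Ground": "Land", "Animals": "Others", "Empty": "Others",
    "Train Track": "Others", "Road Sign": "Others", "Others": "Others",
    "Man-Made Construction": "Others",
}

def remap_labels(label_map, orig_names):
    """Convert an integer label map from original MidAir IDs to our 7-class IDs.
    orig_names lists the original class name for each integer ID (the ID order
    used by the dataset is an assumption here)."""
    lut = np.array([TARGET.index(REMAP[name]) for name in orig_names], dtype=np.int32)
    return lut[label_map]
```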

Aeroscapes [43] is a real dataset collected using drones at low to mid altitude in various outdoor environments. It consists of 3,269 images with an 80%-20% train-test split and a resolution of 1280x720. Some of these images are captured with a forward view while others are captured with a nadir (top) view. This dataset contains only semantic segmentation annotations. For this reason, we could not use it for the training of our joint architecture, which requires depth, semantic, and camera location annotations. However, we used Aeroscapes for the training and testing of our single M4Semantic network. There are 12 semantic classes in Aeroscapes, namely: Background, Person, Bike, Car, Drone, Boat, Animal, Obstacle, Construction, Vegetation, Road, and Sky.

V-B Implementation Details

We adopt the Adam optimizer with the default momentum parameters ($\beta_1=0.9$, $\beta_2=0.999$) and a fixed learning rate of $10^{-4}$. We apply image augmentation of random rotation, flipping, and color changes (contrast, brightness, hue, and saturation) during training, and we train with a batch size of 3. In the computation of the joint architecture loss $L_{total}$, after inspecting the loss ranges for depth (from 0.0 to ~1.0) and semantic (from 0.0 to ~7.0), we selected a weighting factor $w=0.1$. After training, we choose the checkpoint that produced the best validation results for evaluation on the test set.
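A sketch of the corresponding optimizer and color-augmentation setup in TensorFlow; the augmentation ranges below are placeholders of our own, and the geometric augmentations (rotation, flipping) would additionally have to be applied consistently to the image and to both ground-truth maps.

```python
import tensorflow as tf

# Optimizer settings and batch size as reported above.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999)
BATCH_SIZE = 3

def color_augment(image):
    """Random color augmentation (contrast, brightness, hue, saturation).
    The ranges below are illustrative placeholders, not the values we used."""
    image = tf.image.random_contrast(image, 0.8, 1.2)
    image = tf.image.random_brightness(image, 0.2)
    image = tf.image.random_hue(image, 0.05)
    image = tf.image.random_saturation(image, 0.8, 1.2)
    return image
```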

Our workstation has 16GB of RAM, an Intel Core i7 processor, and a single NVIDIA Quadro P5000 GPU running CUDA 11.4 with cuDNN 7.6.5 on Ubuntu. Due to the memory limits of our workstation, we predict the depth and semantic maps at a resolution equal to half the input resolution and then apply nearest-neighbour interpolation on the output maps to scale their resolution up to the original size. As reported in [55], decreasing the image resolution can slightly decrease the accuracy; however, it gives the advantage of reducing the computational runtime and memory footprint, which are critical factors for aerial robotics.

To quantitatively evaluate the depth prediction results, we consider the evaluation metrics commonly used in prior works [11, 3, 16]. These include the linear root mean square error (RMSE), the absolute relative error, and the accuracy under a threshold. For semantic segmentation, we use the commonly used mean Intersection over Union (mIoU) metric. The inference time (Inf. Time) is computed in milliseconds per frame (ms/f).
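For reference, these metrics can be sketched as follows (a NumPy sketch under our own conventions; the 80 m depth cap follows Section V-A).

```python
import numpy as np

def depth_metrics(d_gt, d_pred, cap=80.0):
    """RMSE, absolute relative error, and delta-threshold accuracies."""
    d_gt = np.clip(d_gt, 1e-3, cap)
    d_pred = np.clip(d_pred, 1e-3, cap)
    rmse = np.sqrt(np.mean((d_gt - d_pred) ** 2))
    abs_rel = np.mean(np.abs(d_gt - d_pred) / d_gt)
    ratio = np.maximum(d_gt / d_pred, d_pred / d_gt)
    deltas = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]  # delta < 1.25, 1.25^2, 1.25^3
    return rmse, abs_rel, deltas

def mean_iou(s_gt, s_pred, num_classes):
    """Mean Intersection over Union across the classes present in the data."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(s_gt == c, s_pred == c).sum()
        union = np.logical_or(s_gt == c, s_pred == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```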

V-C Joint vs Single Architecture

We conduct experiments to compare the performance of our joint architecture Co-SemDepth to the two single architectures: M4Depth and M4Semantic, see Table I. Each architecture was trained equally for 60 epochs.

We can notice that the accuracy values (in terms of depth and semantic metrics) of the joint architecture are close to those of the single ones. The inference time of the joint architecture (49.6 ms/f) is lower than the sum of the two single ones (44.9 + 9.8 ms/f). Moreover, the number of parameters of the joint architecture (5.2 million) is less than the sum of the two single ones (3.06 + 2.61 million) by around 500K parameters. During inference, Co-SemDepth required only 6.2GB of GPU memory, while running M4Depth and M4Semantic together required 14.6GB. This makes Co-SemDepth able to run on the NVIDIA Jetson TX2 embedded board, which has only 8GB of RAM and is widely used in robotics hardware.

The above signifies that using our joint architecture Co-SemDepth is more effective in terms of computational time and memory footprint than using the two single architectures, while achieving very close accuracies. The trade-off between accuracy and computational cost in multi-tasking architectures was previously discussed in [56], and it can differ depending on the environment.

TABLE I: Evaluation of our joint vs single architectures for depth estimation and semantic segmentation on MidAir dataset.
Architecture Output Params(M) Inf. Time (ms/f) Semantic Metrics Depth Metrics
mIoU ↑ RMSE ↓ AbsRelErr ↓ δ<1.25 ↑ δ<1.25² ↑ δ<1.25³ ↑
M4Depth D 3.06 44.9 - 6.821 0.0973 92.26% 95.62% 97.18%
M4Semantic S 2.61 9.8 75.64% - - - - -
Co-SemDepth D+S 5.2 49.6 74.24% 7.15 0.103 92.1% 95.4% 96.94%

V-D Benchmarking

MidAir: We compare the performance of Co-SemDepth with other open-source state-of-the-art methods (both single and joint architectures) in depth estimation and semantic segmentation; see Table II. For each method, we fix the input size to 384x384 and the number of training epochs to 60. Other parameters for each method were kept at their defaults. More details about the parameters used for each baseline method can be found on our GitHub page.

TABLE II: Benchmarking our joint architecture on the MidAir dataset against other state-of-the-art methods in both depth estimation and semantic segmentation. The top part reports single depth estimation methods, the middle part single semantic segmentation methods, and the bottom part joint methods. For marked methods, the depth metric values are those reported in [11].
Method Output Params(M) Inf. Time (ms/f) ↓ Semantic Metrics Depth Metrics
mIoU ↑ RMSE ↓ AbsRelErr ↓ δ<1.25 ↑ δ<1.25² ↑ δ<1.25³ ↑
MonoDepth2 [16] D 14.8 23.9 - 12.351 0.394 61.0% 75.1% 83.3%
ST-CLSTM [57] D 15.04 35.3 - 13.685 0.404 75.1% 86.5% 91.1%
ManyDepth [3] D 46.3 82.9 - 10.919 0.203 72.3% 87.6% 93.3%
PWCDC-Net [58] D 9.4 25.8 - 8.351 0.095 88.7% 93.8% 96.2%
FCN(VGG16) [59] S 14.7 58.3 72.93% - - - - -
FCN(MobileNetv2) [59] S 2.2 60.5 69.82% - - - - -
ERFNet [7] S 2.07 19.1 77.4% - - - - -
SegFormer-B0 [8] S 3.8 49.1 75.1% - - - - -
RefineNet [49] D+S 3.0 74.2 72.7% 9.74 0.2 74.9% 89% 94.5%
Co-SemDepth (Ours) D+S 5.2 49.6 74.24% 7.15 0.103 92.1% 95.4% 96.94%

We can clearly notice that our method outperforms the other joint network, RefineNet, in both semantic and depth accuracies and inference time.

Compared to the single depth estimation networks, Co-SemDepth maintains its superior accuracy in depth estimation. This indicates that transforming M4Depth into the joint Co-SemDepth did not have a negative effect on its depth estimation performance compared to the state of the art. Co-SemDepth also has notably fewer parameters than these methods.

For the single semantic segmentation networks, Co-SemDepth achieves a competitive mIoU, only slightly inferior to ERFNet, which is in any case a dedicated architecture.

The per-class IoU evaluation can be found in Table III, and a qualitative visualization of the semantic map predictions of three different methods (ERFNet, MobileNet, and Co-SemDepth) can be found in Figure 4. We can notice that the three methods capture the overall semantic layout of the input images. However, Co-SemDepth and ERFNet are remarkably better at capturing the details (notice the trees and the train track) than MobileNet. Compared to Co-SemDepth, ERFNet is slightly better at capturing some of the faraway details (the distant boulders and bushes in the third row).

TABLE III: Per-Class IoU Evaluation of Co-SemDepth architecture and other baseline methods on MidAir.
Method Sky Water Trees Land Boulders Road Others mIoU
FCN(VGG16) [59] 88.56% 83.12% 75.5% 82.3% 29.97% 88.04% 54.6% 72.93%
FCN(MobileNetv2) [59] 87.82% 82.42% 73.42% 81.28% 26.57% 84.64% 46.09% 69.82%
ERFNet [7] 91.5% 87.64% 82.48% 85.1% 40.63% 90.9% 63.7% 77.42%
SegFormer-B0 [8] 90.54% 88.58% 79.7% 83.33% 30.13% 92.19% 61.2% 75.1%
RefineNet [49] 89.7% 81.6% 79.7% 82.2% 30.15% 91.36% 54.1% 72.7%
Co-SemDepth 88.5% 83.7% 79.4% 81.7% 34.4% 94.7% 60.12% 74.24%
Figure 4: Qualitative evaluation of the semantic map predictions of ERFNet (3rd column), FCN MobileNet (4th column), and Co-SemDepth (5th column) on sample images from MidAir dataset.

Aeroscapes: We report the M4Semantic results in Table IV. Our network was trained for 200 epochs with a batch size of 3 and a learning rate of $10^{-4}$ for the first 70 epochs, then decreased to $10^{-5}$.

We implemented M4Semantic in TensorFlow as one whole model that can be trained in an end-to-end fashion, without separation between the weight files of the encoder and the decoder. This led to more compact code but prevented us from pretraining the encoder separately on ImageNet as done in the other methods. Nevertheless, we produce a competitive mIoU compared to the others. An output visualization can be found in Figure 5.

TABLE IV: Comparison of our M4Semantic architecture with other semantic segmentation methods benchmarked on Aeroscapes. P means pretrained on other datasets and S means trained from scratch.
Method P/S Open-Source mIoU \uparrow
FCN-8S [44] P No 43.12%
FCN-16S [44] P No 44.52%
FCN-32S [44] P No 45.51%
FCN-ImageNet-4S [43] P No 48.96%
FCN-Cityscapes [43] P No 49.55%
FCN-ADE20K [43] P No 51.62%
FCN-PASCAL [43] P No 52.02%
M4Semantic (Ours) S Yes 50.41%
Figure 5: Visualization of the predicted semantic maps using M4Semantic on the Aeroscapes dataset.

V-E Architecture Study

We conduct an architecture study on M4Semantic to highlight the importance of the addition or ablation of different modules; see Table V. From the top part of the table, we can notice that using 5 levels produces the best mIoU.

In the bottom part, Original means M4Semantic (5 levels). In Original+{SNCV}, we used the Spatial Neighbourhood Cost Volume module employed in the decoder of M4Depth [11] on the encoded feature maps, instead of adding the normalized feature maps directly in the preprocessing unit (Figure 3). Such a module measures the two-dimensional spatial autocorrelation of the scene and improved the performance in depth estimation. However, as can be seen in the table, using such a module in M4Semantic did not improve the performance in semantic segmentation and, moreover, it increased the inference time.

In Original+{$S^{t-1}$}, we test the addition of time dependency in M4Semantic. At each decoder level, the semantic segmentation map predicted for the previous frame, $S^{t-1}$, is warped using the camera motion information and the ground-truth depth map to give an initial prediction of the semantic map of the current frame. We concatenate this warped map with the output of the preprocessing module (Figure 3) at each decoder level. While this technique achieved a higher mIoU, the inference time increased due to the added warping computation. Also, the warping of the semantic segmentation map requires the ground-truth depth map, which is rarely available in practice.

Given this architecture study and the one done on the single depth estimation architecture [11], we set the number of levels of our joint architecture Co-SemDepth to 5.

TABLE V: Evaluation of our M4Semantic architecture on MidAir with the addition (+) or ablation (-) of different modules. The top part evaluates different numbers of levels. The bottom part was performed on M4Semantic (5 levels).
Architecture Inf. Time (ms/f) ↓ mIoU ↑
M4Semantic (4 levels) 9 73.93%
M4Semantic (5 levels) 9.8 75.64%
M4Semantic (6 levels) 10.9 73.7%
Original-{DINL} 9.5 74.42%
Original-{Normalize} 9.7 72.94%
Original+{SNCV} 17.7 70.65%
Original+{S^{t-1}} 16.7 77.2%

VI Conclusion

In this work, we presented Co-SemDepth, a fast joint architecture for depth estimation and semantic segmentation for aerial robots. Our architecture proves to be light and fast compared to the other methods while achieving better or on-par accuracies. This makes it well suited for deployment on UAV hardware for real-time scene analysis in outdoor environments. Our light architecture makes it possible for the drone to conduct scene analysis autonomously and independently on its onboard computer, without the need to send data to a remote server to carry out the analysis. The output depth and semantic maps provide the geometric and semantic understanding of the scene that is necessary to carry out a variety of UAV missions.

References

  • [1] M. Vel’as, M. Španěl, Z. Materna, and A. Herout, “Calibration of rgb camera with velodyne lidar,” 2014.
  • [2] C. F. Olson and H. Abi-Rached, “Wide-baseline stereo vision for terrain mapping,” Machine Vision and Applications, vol. 21, pp. 713–725, 2010.
  • [3] J. Watson, O. Mac Aodha, V. Prisacariu, G. Brostow, and M. Firman, “The temporal opportunist: Self-supervised multi-frame monocular depth,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1164–1174, 2021.
  • [4] S. F. Bhat, I. Alhashim, and P. Wonka, “Adabins: Depth estimation using adaptive bins,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4009–4018, 2021.
  • [5] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 270–279, 2017.
  • [6] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, “Icnet for real-time semantic segmentation on high-resolution images,” in Proceedings of the European conference on computer vision (ECCV), pp. 405–420, 2018.
  • [7] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, “Erfnet: Efficient residual factorized convnet for real-time semantic segmentation,” IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 1, pp. 263–272, 2017.
  • [8] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “Segformer: Simple and efficient design for semantic segmentation with transformers,” Advances in neural information processing systems, vol. 34, pp. 12077–12090, 2021.
  • [9] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Bisenet: Bilateral segmentation network for real-time semantic segmentation,” in Proceedings of the European conference on computer vision (ECCV), pp. 325–341, 2018.
  • [10] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, “Understanding convolution for semantic segmentation,” in 2018 IEEE winter conference on applications of computer vision (WACV), pp. 1451–1460, Ieee, 2018.
  • [11] M. Fonder, D. Ernst, and M. Van Droogenbroeck, “Parallax inference for robust temporal monocular depth estimation in unstructured environments,” Sensors, vol. 22, pp. 1–22, November 2022.
  • [12] M. Fonder and M. Van Droogenbroeck, “Mid-air: A multi-modal dataset for extremely low altitude drone flights,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp. 0–0, 2019.
  • [13] O. Özyeşil, V. Voroninski, R. Basri, and A. Singer, “A survey of structure from motion*.,” Acta Numerica, vol. 26, pp. 305–364, 2017.
  • [14] J. Kopf, X. Rong, and J.-B. Huang, “Robust consistent video depth estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1611–1621, 2021.
  • [15] Z. Zeng, Y. Wu, H. Park, D. Wang, F. Yang, S. Soatto, D. Lao, B.-W. Hong, and A. Wong, “Rsa: Resolving scale ambiguities in monocular depth estimators through language descriptions,” arXiv preprint arXiv:2410.02924, 2024.
  • [16] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, “Digging into self-supervised monocular depth estimation,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 3828–3838, 2019.
  • [17] R. Ranftl, A. Bochkovskiy, and V. Koltun, “Vision transformers for dense prediction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12179–12188, 2021.
  • [18] V. Patil, W. Van Gansbeke, D. Dai, and L. Van Gool, “Don’t forget the past: Recurrent depth estimation from monocular video,” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 6813–6820, 2020.
  • [19] J. Kopf, X. Rong, and J.-B. Huang, “Robust consistent video depth estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1611–1621, 2021.
  • [20] J. Li, R. Klein, and A. Yao, “A two-streamed network for estimating fine-scaled depth maps from single rgb images,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 3372–3380, 2017.
  • [21] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in 2016 Fourth international conference on 3D vision (3DV), pp. 239–248, IEEE, 2016.
  • [22] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” Advances in neural information processing systems, vol. 27, 2014.
  • [23] A. CS Kumar, S. M. Bhandarkar, and M. Prasad, “Depthnet: A recurrent neural network architecture for monocular depth prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 283–291, 2018.
  • [24] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer, “Tartanair: A dataset to push the limits of visual slam,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4909–4916, IEEE, 2020.
  • [25] H. Florea, V.-C. Miclea, and S. Nedevschi, “Wilduav: Monocular uav dataset for depth estimation tasks,” in 2021 IEEE 17th International Conference on Intelligent Computer Communication and Processing (ICCP), pp. 291–298, IEEE, 2021.
  • [26] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in 2012 IEEE conference on computer vision and pattern recognition, pp. 3354–3361, IEEE, 2012.
  • [27] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images.,” ECCV (5), vol. 7576, pp. 746–760, 2012.
  • [28] V. Casser, S. Pirk, R. Mahjourian, and A. Angelova, “Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos,” in Proceedings of the AAAI conference on artificial intelligence, vol. 33, pp. 8001–8008, 2019.
  • [29] A. Masoumian, H. A. Rashwan, S. Abdulwahab, J. Cristiano, M. S. Asif, and D. Puig, “Gcndepth: Self-supervised monocular depth estimation based on graph convolutional network,” Neurocomputing, vol. 517, pp. 81–92, 2023.
  • [30] T. Huang, S. Zhao, L. Geng, and Q. Xu, “Unsupervised monocular depth estimation based on residual neural network of coarse–refined feature extractions for drone,” Electronics, vol. 8, no. 10, p. 1179, 2019.
  • [31] V.-C. Miclea and S. Nedevschi, “Monocular depth estimation with improved long-range accuracy for uav environment perception,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–15, 2021.
  • [32] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223, 2016.
  • [33] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890, 2017.
  • [34] S. Jain, X. Wang, and J. E. Gonzalez, “Accel: A corrective fusion network for efficient semantic segmentation on video,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8866–8875, 2019.
  • [35] Y.-S. Xu, T.-J. Fu, H.-K. Yang, and C.-Y. Lee, “Dynamic video segmentation network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6556–6565, 2018.
  • [36] M. Paul, C. Mayer, L. V. Gool, and R. Timofte, “Efficient video semantic segmentation with labels propagation and refinement,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2873–2882, 2020.
  • [37] B. Mahasseni, S. Todorovic, and A. Fern, “Budget-aware deep semantic video segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1029–1038, 2017.
  • [38] J. Li, W. Wang, J. Chen, L. Niu, J. Si, C. Qian, and L. Zhang, “Video semantic segmentation via sparse temporal transformer,” in Proceedings of the 29th ACM International Conference on Multimedia, pp. 59–68, 2021.
  • [39] P. Hu, F. Caba, O. Wang, Z. Lin, S. Sclaroff, and F. Perazzi, “Temporally distributed networks for fast video semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8818–8827, 2020.
  • [40] Y. Liu, C. Shen, C. Yu, and J. Wang, “Efficient semantic video segmentation with per-frame inference,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16, pp. 352–368, Springer, 2020.
  • [41] G. Sun, Y. Liu, H. Ding, T. Probst, and L. Van Gool, “Coarse-to-fine feature mining for video semantic segmentation,” in proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3126–3137, 2022.
  • [42] S. Nedevschi et al., “Weakly supervised semantic segmentation learning on uav video sequences,” in 2021 29th European Signal Processing Conference (EUSIPCO), pp. 731–735, IEEE, 2021.
  • [43] I. Nigam, C. Huang, and D. Ramanan, “Ensemble knowledge transfer for semantic segmentation,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1499–1508, IEEE, 2018.
  • [44] G. Zheng, Z. Jiang, H. Zhang, and X. Yao, “Deep semantic segmentation of unmanned aerial vehicle remote sensing images based on fully convolutional neural network,” Frontiers in Earth Science, vol. 11, p. 1115805, 2023.
  • [45] Q. Xu, Y. Zeng, W. Tang, W. Peng, T. Xia, Z. Li, F. Teng, W. Li, and J. Guo, “Multi-task joint learning model for segmenting and classifying tongue images using a deep neural network,” IEEE journal of biomedical and health informatics, vol. 24, no. 9, pp. 2481–2489, 2020.
  • [46] C. Qin, W. Bai, J. Schlemper, S. E. Petersen, S. K. Piechnik, S. Neubauer, and D. Rueckert, “Joint learning of motion estimation and segmentation for cardiac mr image sequences,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part II 11, pp. 472–480, Springer, 2018.
  • [47] Y. Cao, C. Shen, and H. T. Shen, “Exploiting depth from single monocular images for object detection and semantic segmentation,” IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 836–846, 2016.
  • [48] A. Mousavian, H. Pirsiavash, and J. Košecká, “Joint semantic segmentation and depth estimation with deep convolutional networks,” in 2016 Fourth International Conference on 3D Vision (3DV), pp. 611–619, IEEE, 2016.
  • [49] V. Nekrasov, T. Dharmasiri, A. Spek, T. Drummond, C. Shen, and I. Reid, “Real-time joint semantic segmentation and depth estimation using asymmetric annotations,” in 2019 International Conference on Robotics and Automation (ICRA), pp. 7101–7107, IEEE, 2019.
  • [50] J. Liu, Y. Wang, Y. Li, J. Fu, J. Li, and H. Lu, “Collaborative deconvolutional neural networks for joint depth estimation and semantic segmentation,” IEEE transactions on neural networks and learning systems, vol. 29, no. 11, pp. 5655–5666, 2018.
  • [51] Z. Zhang, Z. Cui, C. Xu, Z. Jie, X. Li, and J. Yang, “Joint task-recursive learning for semantic segmentation and depth estimation,” in Proceedings of the European conference on computer vision (ECCV), pp. 235–251, 2018.
  • [52] L. He, J. Lu, G. Wang, S. Song, and J. Zhou, “Sosd-net: Joint semantic object segmentation and depth estimation from monocular images,” Neurocomputing, vol. 440, pp. 251–263, 2021.
  • [53] F. Zhang, X. Qi, R. Yang, V. Prisacariu, B. Wah, and P. Torr, “Domain-invariant stereo matching networks,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pp. 420–439, Springer, 2020.
  • [54] S. Shah, D. Dey, C. Lovett, and A. Kapoor, “Airsim: High-fidelity visual and physical simulation for autonomous vehicles,” in Field and Service Robotics: Results of the 11th International Conference, pp. 621–635, Springer, 2018.
  • [55] L. Chen, Z. Yang, J. Ma, and Z. Luo, “Driving scene perception network: Real-time joint detection, depth estimation and semantic segmentation,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1283–1291, IEEE, 2018.
  • [56] S. Vandenhende, S. Georgoulis, and L. Van Gool, “Mti-net: Multi-scale task interaction networks for multi-task learning,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pp. 527–543, Springer, 2020.
  • [57] H. Zhang, C. Shen, Y. Li, Y. Cao, Y. Liu, and Y. Yan, “Exploiting temporal consistency for real-time video depth estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1725–1734, 2019.
  • [58] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8934–8943, 2018.
  • [59] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440, 2015.
  • [60] H. Xu, F. Li, and Z. Feng, “Mlffnet: multilevel feature fusion network for monocular depth estimation from aerial images,” Journal of Applied Remote Sensing, vol. 16, no. 2, pp. 026506–026506, 2022.
  • [61] V.-C. Miclea and S. Nedevschi, “Dynamic semantically guided monocular depth estimation for uav environment perception,” IEEE Transactions on Geoscience and Remote Sensing, 2023.