
1 CINVESTAV Unidad Guadalajara, Mexico
2 Universidad Panamericana, Facultad de Ingenieria, Aguascalientes, Mexico
3 Esc. de Ingenieria y Ciencias, Tecnologico de Monterrey, Monterrey, N.L., Mexico
4 CRAN (UMR 7039), Université de Lorraine and CNRS, Nancy, France

EndoDepth: A Benchmark for Assessing Robustness in Endoscopic Depth Prediction

Ivan Reyes-Amezcua (1), Ricardo Espinosa (2,4), Christian Daul (4), Gilberto Ochoa-Ruiz (3), Andres Mendez-Vazquez (1)
Abstract

Accurate depth estimation in endoscopy is vital for successfully implementing computer vision pipelines for various medical procedures and CAD tools. In this paper, we present the EndoDepth benchmark, an evaluation framework designed to assess the robustness of monocular depth prediction models in endoscopic scenarios. Unlike traditional datasets, the EndoDepth benchmark incorporates common challenges encountered during endoscopic procedures. We present a consistent evaluation approach specifically designed to assess model robustness in endoscopic scenarios. Central to it is a novel composite metric, the mean Depth Estimation Robustness Score (mDERS), which offers an in-depth evaluation of a model’s accuracy against errors brought on by endoscopic image corruptions. Moreover, we present SCARED-C, a new dataset designed specifically to assess robustness in endoscopy. Through extensive experimentation, we evaluate state-of-the-art depth prediction architectures on the EndoDepth benchmark, revealing their strengths and weaknesses in handling challenging endoscopic imaging artifacts. Our results demonstrate the importance of specialized techniques for accurate depth estimation in endoscopy and provide valuable insights for future research directions.

https://github.com/Ivanrs297/endoscopycorruptions

Keywords:
Depth Estimation · Robustness · Medical Imaging

1 Introduction

Endoscopic procedures offer crucial insights into hollow organ interiors, providing essential tissue information. However, difficulties in controlling the endoscope’s trajectory and viewpoint complicate examinations, leading to suboptimal views and incomplete scans that hamper lesion diagnosis. Traditional 3D reconstruction methods such as structure-from-motion (SfM) [10], shape-from-shading (SfS) [15], and SLAM [13] excel in natural light settings but face limitations in endoscopic scenarios due to sparse textures and sudden illumination changes.

On the other hand, recent strides in depth estimation have relied on high-quality training data, whether annotated or not. For instance, supervised learning-based monocular depth estimation methods have gained considerable traction in conventional imaging, leveraging advances in deep learning architectures and large amounts of data acquired with other modalities to train models that predict depth effectively from single images. However, obtaining ground truth data for endoscopic imaging poses challenges, hindering the application of supervised methods [5, 4, 12].

Figure 1: Example of the EndoDepth Corruptions on an image from the SCARED-C dataset. In the last row are image corruptions that significantly affect the performance of monocular depth estimation in endoscopic imaging, including lens distortion, resolution alterations, specular reflection, and color changes.

To address the lack of ground truth for depth estimation, several studies have explored unsupervised techniques using auxiliary supervisory signals from data. Many leverage view synthesis based on structure-from-motion principles for self-supervision, using predicted depth and pose to reconstruct target frames from source frames. The discrepancy between re-projected and source frames serves as the learning objective. However, these methods face challenges in endoscopic imaging due to extreme acquisition conditions and brightness constancy violations [3, 7].

Indeed, when it comes to extreme acquisition conditions like those encountered in endoscopy, the aforementioned methods face unique challenges. In endoscopy, scene illumination is heavily influenced by the orientation of the endoscope relative to the tissue surface. The captured images frequently suffer from under- or over-exposure, depending on the surface shape, and are susceptible to specular reflections. Additionally, they may depict textureless scenes and become foggy due to rapid increases in endoscope temperature within human cavities.

Despite advancements in supervised and self-supervised depth prediction models in endoscopy, there remains a significant gap in understanding their robustness to out-of-distribution challenges, particularly in dealing with endoscopy-specific corruptions like adverse exposure and sensor malfunctions [13]. Conventional machine learning-based visual perception models are often sensitive to subtle changes in lighting, noise, and texture variations, leading to compromised depth prediction accuracy [1, 17]. While progress has been made with datasets like EndoSfMLearner, SCARED, and Hamlyn [1, 2, 14], a crucial gap remains: the absence of a robustness benchmark tailored specifically for developing resilient and scalable endoscopy depth prediction systems.

To address this gap, our work takes the first steps toward creating robust and dependable endoscopy depth prediction systems by introducing a benchmark built on the SCARED dataset. Our benchmark meticulously simulates common corruptions inherent to endoscopy environments. As illustrated in Figure 1, we categorize sixteen types of corruptions into four main groups: i) lighting conditions, ii) data processing complications, iii) sensor malfunctions and movements, and iv) corruptions common in endoscopy. These corruptions, further stratified by severity, encompass a wide range of scenarios that induce image distortions, texture alterations, or degraded visual quality. The code for this paper is available at https://github.com/Ivanrs297/endoscopycorruptions.

The main contributions of this paper, based on the new dataset and robustness evaluation framework, are summarized as follows:

1) We introduce the EndoDepth benchmark, the first systematically designed robustness evaluation for supervised and self-supervised depth estimation models, covering data corruptions, sensor failures, and style shifts.

2) We propose a new composite metric, the mean Depth Estimation Robustness Score (mDERS), which captures both accuracy and error resistance in the context of endoscopic image corruptions.

3) We evaluate the robustness of four state-of-the-art self-supervised depth prediction models in endoscopy using our novel dataset, SCARED-C.

4) Drawing from our findings, we provide an in-depth discussion and analysis of the key design considerations for developing more resilient self-supervised depth prediction models suitable for reliable, scalable, and practical applications.

2 Related Work

Several studies have demonstrated that Deep Neural Networks (DNNs) are susceptible to common corruptions, such as blur, Gaussian noise, and translations [8]. In the field of computer vision for endoscopy, several works [6, 1, 17] have shown that the performance of depth estimation models can decline significantly when faced with artifacts and other challenging imaging conditions unique to endoscopy. For instance, acquisition perturbations due to blurring, and specular reflections caused by movement of the endoscope tip, are significantly stronger than in natural imagery. To systematically investigate the robustness of DNNs against image corruption, different benchmarks have been introduced. These benchmarks were initially developed in the domain of image recognition [8] and have since been extended to various tasks, including object detection and semantic segmentation [11]. To address robustness in depth estimation for autonomous driving and real-world applications, a recent work named RoboDepth [9] has been introduced. By systematically evaluating depth estimation models under varied conditions, RoboDepth provides insights into the models’ robustness and performance in challenging real-world scenarios.

However, the endoscopic domain presents unique challenges that require specific attention and evaluation. Given the distinct characteristics and demands of endoscopic scenarios, a dedicated benchmark tailored to this domain becomes imperative. Existing benchmarks may not fully capture the intricacies and challenges inherent in endoscopic imaging, such as limited field of view, motion artifacts, specular reflections, and variations in tissue appearance.

To date, the most commonly used metric for assessing the robustness of depth prediction in general tasks has been the Mean Corruption Error (mCE) [8]. This metric requires a base model against which the performance of other models is evaluated across various corruptions [9]. However, using a reference model can bias the mCE results of other models, making them dependent on the chosen baseline.

Our proposal incorporates the most common error metrics for depth prediction (Abs Rel, Sq Rel, RMSE, RMSE log) and accuracy measures (a1, a2, a3), forming a metric that evaluates models on their own, without requiring a base model. Additionally, our metric provides a single output that establishes a robust relationship between errors and accuracy, enabling a straightforward and fair evaluation of the robustness of depth prediction models for endoscopy.

3 The EndoDepth Benchmark Design

This section describes the robustness study we carried out as part of our quantitative analysis, emphasizing the implications of these models for clinical applications and their resistance to common disturbances observed during endoscopic examinations. Our approach is inspired by the framework developed in [8] for classification tasks. In this seminal work, the mean Corruption Error (mCE) was proposed as a standardized aggregate performance metric, intended to serve as a proxy for how robust a model is to different kinds of corruption in image classification tasks.

To assess the robustness of endoscopic depth estimation methods, we propose a metric that encapsulates a model’s performance under a variety of realistic and challenging conditions specific to endoscopic procedures. Specifically, we introduce a metric for monocular depth estimation in endoscopy, which we refer to as the mean Depth Estimation Robustness Score (mDERS).

3.1 Mean Depth Estimation Robustness Score (mDERS)

Accuracy and robustness are crucial in depth estimation, particularly where image degradations impact performance. Traditional metrics often fail to capture model robustness against real-world image corruptions. To address this, we introduce the mean Depth Estimation Robustness Score (mDERS), a composite metric designed to comprehensively assess the robustness of depth estimation models.

Given the error measures $e = \{\text{AbsRel}, \text{SqRel}, \text{RMSE}, \text{RMSE}_{\log}\}$, for which lower values indicate better performance, and the accuracy measures $a = \{a_1, a_2, a_3\}$, for which higher values indicate better performance and which quantify how accurately a model can estimate depth in various scenarios, we determine the mean of each error metric ($\bar{e}$) and each accuracy metric ($\bar{a}$) across all severity levels for each corruption, combining these multi-level results into a single metric. This averaging reduces the impact of outliers and yields a reliable indicator of the model’s performance in the face of various forms of corruption. Thus, the mean Depth Estimation Robustness Score (mDERS) is formulated as follows:

$$\text{mDERS} = \frac{\frac{1}{n}\sum_{i=1}^{n}\bar{a}_{i}}{1 + \frac{1}{m}\sum_{i=1}^{m}\bar{e}_{i}} \quad (1)$$

Here, the numerator $\frac{1}{n}\sum_{i=1}^{n}\bar{a}_{i}$ is the mean accuracy across the $n$ accuracy metrics, i.e., their sum divided by the number of accuracy metrics ($n = 3$ in this context). The denominator $1 + \frac{1}{m}\sum_{i=1}^{m}\bar{e}_{i}$ accounts for the error metrics, computing the mean error across the $m$ error metrics ($m = 4$ in this case). The addition of 1 ensures that the denominator is always at least 1, preventing the overall score from becoming disproportionately high due to very low error values and maintaining a balanced scale.

Hence, by combining accuracy and error into a single, balanced score, mDERS offers a comprehensive assessment of a depth estimation model’s performance. A high mDERS value reflects a robust model that is both accurate and maintains low estimation errors.
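As a concrete illustration, Eq. (1) reduces to a few lines of Python. The following is a minimal sketch with function and argument names of our own choosing; it is not the API of the released package:

```python
import numpy as np

def mders(mean_errors, mean_accuracies):
    """Mean Depth Estimation Robustness Score (Eq. 1).

    mean_errors: the m = 4 error means (abs_rel, sq_rel, rmse, rmse_log),
        each already averaged over all severity levels of each corruption.
    mean_accuracies: the n = 3 accuracy means (a1, a2, a3), averaged the
        same way.
    """
    mean_acc = np.mean(mean_accuracies)  # (1/n) * sum of the a-bar_i
    mean_err = np.mean(mean_errors)      # (1/m) * sum of the e-bar_i
    return mean_acc / (1.0 + mean_err)   # higher means more robust
```

For purely illustrative input values, mders([0.3, 0.5, 4.2, 0.4], [0.75, 0.93, 0.98]) yields 0.8867 / (1 + 1.35) ≈ 0.377.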

3.2 Mean Corruption Error for Endoscopy Depth Estimation

The notion of Mean Corruption Error (mCE) for endoscopy is presented in this subsection. In this context, mCE is a metric that measures a model’s error response across different types and severity levels of corruption that are encountered during endoscopic procedures. We present a methodical procedure for computing mCE, which yields a normalized score indicating how robust the model is in comparison to a baseline or average performance.

Let us define $D_{f}^{s,c}$ as the depth estimation error of model $f$ under corruption type $c$ at severity level $s$. Examples of appropriate error metrics are the mean absolute error and the root mean squared error; the error metric selected can be customized to meet the needs of any given application. In our case, a wide variety of conditions found in endoscopic imaging negatively impact the quality of the acquired image and thus the precision of the depth estimate. Therefore, image degradations and artifacts found in endoscopy should be used as the perturbations against which the depth estimate is evaluated. The mCE for endoscopy depth estimation is formulated as follows:

$$\text{mCE} = \frac{1}{C}\sum_{c=1}^{C}\left(\frac{\sum_{s=1}^{S} D_{f}^{s,c}}{\sum_{s=1}^{S} D_{\text{baseline}}^{s,c}}\right) \quad (2)$$

where $D_{f}^{s,c}$ is the depth estimation error of model $f$ for corruption type $c$ at severity level $s$, and $D_{\text{baseline}}^{s,c}$ is the depth estimation error of a baseline model (or the average error across models) for the same corruption type and severity level. This normalization step ensures that the scores are comparable across different perturbations. $S$ is the number of severity levels considered for each corruption type, and $C$ is the total number of corruption types relevant to endoscopic depth estimation. Thus, the mCE condenses the model’s robustness to various endoscopic perturbations into a single score. Better robustness is indicated by a lower mCE, which means the model degrades less in demanding circumstances than the baseline.
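A minimal sketch of Eq. (2) in Python follows; the function name is ours, and the inputs are assumed to be pre-computed per-corruption, per-severity error values:

```python
import numpy as np

def mce(errors_f, errors_baseline):
    """Mean Corruption Error (Eq. 2).

    errors_f, errors_baseline: arrays of shape (C, S) holding the depth
    estimation error of the evaluated model and of the baseline model for
    each of C corruption types at S severity levels.
    """
    # Per-corruption ratio of summed errors, then the average over corruptions.
    ratios = errors_f.sum(axis=1) / errors_baseline.sum(axis=1)
    return float(ratios.mean())  # below 1.0: more robust than the baseline
```

Multiplying by 100 gives the percentage form reported in Table 1, where the baseline scores 100.00 by construction.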

Seven essential metrics are used to measure each model’s performance under these circumstances: absolute relative error (abs_rel), squared relative difference (sq_rel), root mean square error (rmse), root mean square error in logarithmic scale (rmse_log), and accuracy under threshold (a1, a2, a3). These measures offer a thorough understanding of the models’ precision, accuracy, and general capacity to sustain performance integrity in challenging circumstances. For more details about these metrics please refer to [7].
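These metrics are standard in the monocular depth estimation literature [7]. A reference sketch with NumPy, assuming flattened depth maps with strictly positive values, could look as follows:

```python
import numpy as np

def compute_metrics(gt, pred):
    """Standard depth metrics for a pair of strictly positive depth maps."""
    gt = np.asarray(gt, dtype=float).ravel()
    pred = np.asarray(pred, dtype=float).ravel()
    thresh = np.maximum(gt / pred, pred / gt)  # per-pixel threshold ratio
    return {
        "abs_rel":  float(np.mean(np.abs(gt - pred) / gt)),
        "sq_rel":   float(np.mean((gt - pred) ** 2 / gt)),
        "rmse":     float(np.sqrt(np.mean((gt - pred) ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))),
        "a1":       float(np.mean(thresh < 1.25)),
        "a2":       float(np.mean(thresh < 1.25 ** 2)),
        "a3":       float(np.mean(thresh < 1.25 ** 3)),
    }
```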

4 Methods and Data

4.1 Data

To investigate how the corruptions affect the performance of depth estimation, we conducted N perturbation experiments utilizing the SCARED dataset. The SCARED dataset, introduced by Allan et al. [2], consists of 35 endoscopic videos accompanied by point cloud and ego-motion ground truth annotations. It is partitioned into training (15,351 frames), validation (1,705 frames), and test (551 frames) sets.

Herein, we present a novel dataset, SCARED-C [16], carefully designed to test the robustness of depth estimation models in endoscopic imaging settings. With 16 different types of corruption, each at five levels of severity, SCARED-C covers a broad spectrum of typical endoscopic problems. We expect the SCARED-C dataset to prove an invaluable asset for the medical community, enabling substantial advancements in the robustness and dependability of depth estimation algorithms.

Metric     Monodepth2*   AF-SfMLearner   MonoViT   EndoSfMLearner
abs_rel    100.00        77.60           92.04     133.39
sq_rel     100.00        66.75           101.30    143.73
rmse       100.00        80.63           91.37     107.54
rmse_log   100.00        80.38           94.39     124.53
a1         100.00        106.60          103.91    91.76
a2         100.00        100.32          98.99     98.00
a3         100.00        100.14          99.50     99.86
Table 1: Mean Corruption Error (mCE), given in percentage (%), of four models: Monodepth2 (* baseline), AF-SfMLearner, MonoViT, and EndoSfMLearner across a range of metrics.

4.2 The EndoDepth Benchmark Corruptions

Image corruptions such as lens distortion, resolution changes, specular reflections, and color variations significantly affect monocular depth estimation in endoscopic imaging. These corruptions complicate spatial interpretation and depth perception, so addressing them is crucial for improving depth estimation models. We adapted several corruptions from [8] and provide a brief description of each as it pertains to endoscopic images; a minimal implementation sketch of a few of them follows the list. Robust modeling approaches are essential to mitigate these effects and ensure reliable depth measurements in endoscopic procedures.

  • Brightness: Variations in illumination intensity across the image. It can be mathematically represented as $I^{\prime} = \alpha I$, where $I^{\prime}$ is the modified image, $I$ is the original image, and $\alpha > 1$ indicates increased brightness.

  • Darkness: A reduction in illumination intensity, opposite to brightness. It follows the same formula as brightness but with $\alpha < 1$.

  • Contrast: The difference in luminance or color that makes an object distinguishable. Mathematically, contrast modification can be represented as $I^{\prime} = \alpha(I - \mu) + \mu$, where $\mu$ is the mean luminance of the image and $\alpha$ controls the level of contrast.

  • Fog: Simulates the scattering of light due to particles in the air, leading to reduced visibility. It can be modeled as $I^{\prime} = I(1 - \omega) + A\omega$, where $A$ is the atmospheric light and $\omega$ represents the amount of fog.

  • Defocus Blur: Occurs when the image is out of the focal plane, resulting in a blur that is uniform across the image. It is often modeled by convolving the image with a disk-shaped kernel.

  • Glass Blur: Simulates the distortion caused by viewing through a rough glass surface. It is typically achieved by displacing pixels randomly in a manner that mimics the refraction through glass.

  • Motion Blur: Results from the rapid movement of either the camera or the subject, leading to streaking or blurring in the direction of motion. It can be modeled by convolving the image with a linear kernel aligned with the direction of motion.

  • Zoom Blur: Occurs during rapid zooming, causing radial streaks from the center of the image outward. It is modeled by radially blurring the image from a central point.

  • Gaussian Noise: Adds statistical noise that follows a Gaussian distribution, represented as $I^{\prime} = I + N(0, \sigma^{2})$, where $N$ is the Gaussian distribution with mean 0 and variance $\sigma^{2}$.

  • Impulse Noise: Also known as salt-and-pepper noise, it randomly alters some of the image pixels to black or white, representing a sparse distribution of noise.

  • Shot Noise: Originates from the discrete nature of light itself, modeled as Poisson noise where the variance of the noise is signal-dependent.

  • ISO Noise: Graininess or noise introduced by high ISO settings in cameras, characterized by both luminance and color noise.

  • Lens Distortion: Causes straight lines to appear curved due to the optical design of lenses. It is often corrected by applying distortion correction algorithms.

  • Resolution Change: Involves altering the image’s resolution, which can affect the apparent depth cues in an image. This can involve either downsampling or upsampling techniques.

  • Specular Reflection: Bright spots that occur when light directly reflects off shiny surfaces into the camera. These can create high-intensity regions that may mislead depth estimation.

  • Color Changes: Variations in color balance and saturation, which can be due to different lighting conditions or camera settings. This can affect the perceived depth and texture in the image.
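As mentioned above, the following is a minimal NumPy sketch of a few of these corruptions. Images are assumed to be float arrays in $[0, 1]$, and the default parameter values are illustrative only; they are not the exact severity presets of the released endoscopycorruptions package:

```python
import numpy as np

def brightness(img, alpha=1.3):
    """I' = alpha * I with alpha > 1; alpha < 1 yields the darkness corruption."""
    return np.clip(alpha * img, 0.0, 1.0)

def contrast(img, alpha=1.5):
    """I' = alpha * (I - mu) + mu, where mu is the mean luminance."""
    mu = img.mean()
    return np.clip(alpha * (img - mu) + mu, 0.0, 1.0)

def fog(img, omega=0.4, atm_light=1.0):
    """I' = I * (1 - omega) + A * omega, blending toward atmospheric light A."""
    return np.clip(img * (1.0 - omega) + atm_light * omega, 0.0, 1.0)

def gaussian_noise(img, sigma=0.05, rng=None):
    """I' = I + N(0, sigma^2): additive zero-mean Gaussian noise."""
    rng = rng if rng is not None else np.random.default_rng(0)
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)
```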

Figure 2: Boxplot for the absolute relative error of four depth estimation models: Monodepth2 (blue), AF-SfMLearner (orange), MonoViT (green), and EndoSfMLearner (red), under various image corruption types: Lens Distortion (LD), Resolution Change (RC), Specular Reflection (SR), and Color Changes (CC).

4.3 Evaluation of Robustness in Monocular Depth Prediction

In this section, we provide an extensive assessment of the resilience of several state-of-the-art monocular depth prediction models using our mDERS. Considering the Monodepth2 model’s broad use in the depth prediction community, our assessment uses it as a baseline. We use a set of 16 different image corruptions (see Figure 1), each at 5 different levels of severity, throughout the testing phase to systematically evaluate the robustness of the models (Figure 2). For our experiments, we tested various recent state-of-the-art models for depth estimation and 3D reconstruction in endoscopy: AF-SfMLearner [17], MonoViT [18], and EndoSfMLearner [1].
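The overall protocol can be summarized by the following sketch. The names here are hypothetical: compute_metrics and mders are the routines sketched in Section 3, and each corruption function is assumed to accept a severity argument:

```python
import numpy as np

def evaluate_robustness(model, frames, gts, corruption_fns, severities=range(1, 6)):
    """Aggregate the seven metrics over all corruption/severity pairs, then score."""
    collected = {}  # metric name -> values over every (corruption, severity, frame)
    for corrupt in corruption_fns:            # the 16 corruption types
        for s in severities:                  # the 5 severity levels
            for frame, gt in zip(frames, gts):
                pred = model(corrupt(frame, severity=s))
                for name, value in compute_metrics(gt, pred).items():
                    collected.setdefault(name, []).append(value)
    means = {k: float(np.mean(v)) for k, v in collected.items()}
    errors = [means[k] for k in ("abs_rel", "sq_rel", "rmse", "rmse_log")]
    accuracies = [means[k] for k in ("a1", "a2", "a3")]
    return mders(errors, accuracies)
```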

         Monodepth2   AF-SfMLearner   MonoViT   EndoSfMLearner
mDERS    0.2608       0.3134          0.2759    0.2332
Table 2: mDERS comparison between the four models. The model with the highest score (AF-SfMLearner) is the most robust.

5 Experiments and Evaluation

Table 1 summarizes our experiments, in which we compared the tested models against the well-known Monodepth2 model, which served as the baseline for calculating the mCE. Similarly, Figure 2 visualizes each model’s robustness through the range and variability of its absolute relative errors (abs_rel).

Figure 3: Visual comparison of depth estimation outputs from multiple models. The first column shows the endoscopic images with the endoscopy corruptions applied at a severity of 2 using our method, followed by the ground truth and the depth predictions from the Monodepth2, AF-SfMLearner, MonoViT, and EndoSfMLearner models, respectively. The corruptions used are: Lens Distortion (LD), Resolution Change (RC), Specular Reflection (SR), and Color Changes (CC).

A quantitative comparison of the mean Depth Estimation Robustness Score (mDERS) for the four state-of-the-art depth estimation models is given in Table 2, while Figure 3 shows a qualitative comparison of the estimated depth maps produced by the different models under various perturbations.

The models’ overall performance in depth estimation tasks is assessed using our score mDERS, which combines accuracy and error values. Our experiments reveal that AF-SfMLearner emerges as the most reliable model, boasting the highest mDERS value of 0.3134. This indicates a successful balance between depth prediction accuracy and low error rates. However, AF-SfMLearner shows susceptibility to Resolution Changes, a common perturbation in endoscopy procedures, as depicted in Fig. 2. Conversely, EndoSfMLearner receives the lowest score, 0.2332, suggesting a need for improvement in its robustness under the studied conditions, particularly concerning strong illumination changes, specular reflections, and color changes, as illustrated in the box plots of Fig. 2. The qualitative results in Fig. 3 are consistent with the findings of Fig. 2: EndoSfMLearner under-performs under all types of corruptions, while AF-SfMLearner consistently produces good depth maps despite different types of artifacts.

6 Conclusion

In this work, our aim was to identify the weaknesses of state-of-the-art depth prediction models against imaging artifacts. By applying varying levels of severity across a range of corruptions, we can test the models’ limits and pinpoint areas for improvement. The mDERS metric was introduced to evaluate depth prediction in endoscopy and proved effective for characterizing model robustness. To foster further research in this area, the SCARED-C dataset is now publicly accessible on our GitHub repository [16].

Acknowledgments

The authors wish to acknowledge the Mexican Council for Science and Technology (CONACYT) for their support in terms of postgraduate scholarships in this project, and the Data Science Hub at Tecnologico de Monterrey for their support on this project. This work has been supported by Azure Sponsorship credits granted by Microsoft’s AI for Good Research Lab through the AI for Health program.

The project was also supported by the French-Mexican ANUIES CONAHCYT Ecos Nord grant 322537.

References

  • [1] Ozyoruk, K.B., et al.: EndoSLAM dataset and an unsupervised monocular visual odometry and depth estimation approach for endoscopic videos. Medical Image Analysis 71, 102058 (2021). https://doi.org/10.1016/j.media.2021.102058
  • [2] Allan, M., Mcleod, J., Wang, C., Rosenthal, J.C., Hu, Z., Gard, N., Eisert, P., Fu, K.X., Zeffiro, T., Xia, W., Zhu, Z., Luo, H., Jia, F., Zhang, X., Li, X., Sharan, L., Kurmann, T., Schmid, S., Sznitman, R., Psychogyios, D., Azizian, M., Stoyanov, D., Maier-Hein, L., Speidel, S.: Stereo correspondence and reconstruction of endoscopic data challenge (2021)
  • [3] Bian, J., Li, Z., Wang, N., Zhan, H., Shen, C., Cheng, M.M., Reid, I.: Unsupervised scale-consistent depth and ego-motion learning from monocular video. Advances in neural information processing systems 32 (2019)
  • [4] Cao, Y., Wu, Z., Shen, C.: Estimating depth from monocular images as classification using deep fully convolutional residual networks. IEEE Transactions on Circuits and Systems for Video Technology 28(11), 3174–3182 (2018). https://doi.org/10.1109/TCSVT.2017.2740321
  • [5] Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. CoRR abs/1406.2283 (2014)
  • [6] Garcia-Vega, A., Ochoa, G., Espinosa, R.: Endoscopic real-synthetic over- and underexposed frames for image enhancement (2022)
  • [7] Godard, C., Mac Aodha, O., Firman, M., Brostow, G.J.: Digging into self-supervised monocular depth estimation. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 3828–3838 (2019)
  • [8] Hendrycks, D., Dietterich, T.: Benchmarking neural network robustness to common corruptions and perturbations. In: International Conference on Learning Representations (2019), https://openreview.net/forum?id=HJz6tiCqYm
  • [9] Kong, L., Xie, S., Hu, H., Ng, L.X., Cottereau, B.R., Ooi, W.T.: Robodepth: Robust out-of-distribution depth estimation under corruptions. In: Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2023), https://openreview.net/forum?id=SNznC08OOO
  • [10] Leonard, S., Sinha, A., Reiter, A., Ishii, M., Gallia, G.L., Taylor, R.H., Hager, G.D.: Evaluation and stability analysis of video-based navigation system for functional endoscopic sinus surgery on in vivo clinical data. IEEE transactions on medical imaging 37(10), 2185–2195 (2018)
  • [11] Liu, J., Jin, Y.: A comprehensive survey of robust deep learning in computer vision. Journal of Automation and Intelligence 2(4), 175–195 (2023). https://doi.org/10.1016/j.jai.2023.10.002
  • [12] Liu, X., Sinha, A., Ishii, M., Hager, G.D., Reiter, A., Taylor, R.H., Unberath, M.: Dense depth estimation in monocular endoscopy with self-supervised learning methods. IEEE Transactions on Medical Imaging 39(5), 1438–1447 (2020)
  • [13] Ma, R., Wang, R., Pizer, S., Rosenman, J., McGill, S.K., Frahm, J.M.: Real-time 3D reconstruction of colonoscopic surfaces for determining missing regions. In: Medical Image Computing and Computer-Assisted Intervention. pp. 573–582. Springer (2019)
  • [14] Recasens, D., Lamarca, J., Fácil, J.M., Montiel, J., Civera, J.: Endo-depth-and-motion: Reconstruction and tracking in endoscopic videos using depth networks and photometric constraints. IEEE Robotics and Automation Letters 6(4), 7225–7232 (2021)
  • [15] Ren, Z., He, T., Peng, L., Liu, S., Zhu, S., Zeng, B.: Shape recovery of endoscopic videos by shape from shading using mesh regularization. In: Zhao, Y., Kong, X., Taubman, D. (eds.) Image and Graphics. pp. 204–213. Springer International Publishing, Cham (2017)
  • [16] Reyes-Amezcua, I., et al.: Endoscopycorruptions: A python package to simulate common image corruptions in endoscopic procedures. https://github.com/Ivanrs297/endoscopycorruptions (2024)
  • [17] Shao, S., Pei, Z., Chen, W., Zhu, W., Wu, X., Sun, D., Zhang, B.: Self-supervised monocular depth and ego-motion estimation in endoscopy: appearance flow to the rescue. Medical Image Analysis 77 (2022)
  • [18] Zhao, C., Zhang, Y., Poggi, M., Tosi, F., Guo, X., Zhu, Z., Huang, G., Tang, Y., Mattoccia, S.: Monovit: Self-supervised monocular depth estimation with a vision transformer. In: 2022 International Conference on 3D Vision (3DV). IEEE (Sep 2022). https://doi.org/10.1109/3dv57658.2022.00077