
EndoSLAM Dataset and An Unsupervised Monocular Visual Odometry and Depth Estimation Approach for Endoscopic Videos: Endo-SfMLearner

Kutsev Bengisu Ozyoruk bengisu.ozyoruk@boun.edu.tr Guliz Irem Gokceler Gulfize Coskun Kagan Incetan Yasin Almalioglu Faisal Mahmood Eva Curto Luis Perdigoto Marina Oliveira Hasan Sahin Helder Araujo Henrique Alexandrino Nicholas J. Durr Hunter B. Gilbert Mehmet Turan mehmet.turan@boun.edu.tr Institute of Biomedical Engineering, Bogazici University, Turkey Computer Science Department, University of Oxford, Oxford, UK Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA Cancer Data Science, Dana Farber Cancer Institute, Boston, MA, USA Cancer Program, Broad Institute of Harvard and MIT, Cambridge, MA, USA Institute for Systems and Robotics, University of Coimbra, Portugal Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD Department of Mechanical and Industrial Engineering, Louisiana State University, Baton Rouge, LA, USA Faculty of Medicine, Clinical Academic Center of Coimbra, University of Coimbra, Coimbra, Portugal
Abstract

Deep learning techniques hold promise for developing dense topography reconstruction and pose estimation methods for endoscopic videos. However, currently available datasets do not support effective quantitative benchmarking. In this paper, we introduce a comprehensive endoscopic SLAM dataset consisting of 3D point cloud data for six porcine organs, capsule and standard endoscopy recordings, as well as synthetically generated data. A Panda robotic arm, two commercially available capsule endoscopes, two conventional endoscopes with different camera properties, and two high-precision 3D scanners were employed to collect data from eight ex-vivo porcine gastrointestinal (GI)-tract organs. In total, 35 sub-datasets are provided with 6D pose ground truth for the ex-vivo part: 18 sub-datasets for the colon, 12 for the stomach and 5 for the small intestine, while four of these contain polyp-mimicking elevations created by an expert gastroenterologist. Synthetic capsule endoscopy frames from the stomach, colon and small intestine with both depth and pose annotations are included to facilitate the study of simulation-to-real transfer learning algorithms. Additionally, we propose Endo-SfMLearner, an unsupervised monocular depth and pose estimation method that combines residual networks with a spatial attention module in order to direct the network's focus toward distinguishable and highly textured tissue regions. The proposed approach makes use of a brightness-aware photometric loss to improve robustness under the fast frame-to-frame illumination changes commonly seen in endoscopic videos. To exemplify the use-case of the EndoSLAM dataset, the performance of Endo-SfMLearner is extensively compared with the state of the art: SC-SfMLearner, SfMLearner and Monodepth2. The code and a link to the dataset are publicly available at https://github.com/CapsuleEndoscope/EndoSLAM . A video demonstrating the experimental setup and procedure is accessible through https://www.youtube.com/watch?v=G_LCe0aWWdQ.

keywords:
SLAM Dataset, Capsule Endoscopy, Standard Endoscopy, Monocular Depth Estimation, Visual Odometry, Spatial Attention Module.
journal: Medical Image Analysis
This work was supported by the Scientific and Technological Research Council of Turkey (TUBITAK) with grant 2232 - The International Fellowship for Outstanding Researchers.

1 Introduction

Gastrointestinal (GI) cancers affect over 28 million patients annually, representing about 26% of the global cancer incidence and 35% of all cancer-related deaths [American Cancer Society, 2018]. Moreover, GI cancer is the second deadliest cancer type, with 3.4 million GI-related deaths reported globally in 2018 [Arnold et al., 2020]. Direct visual inspection (DVI) of these cancers is the simplest and most effective screening technique. Esophagogastroduodenoscopy (EGD) and colonoscopy are used to visualize gastrointestinal diseases, specifically in the colon and rectum, while capsule endoscopy (CE) is preferred for small bowel exploration [Redondo-Cerezo et al., 2014].

An analysis of endoscopic gastrointestinal procedures conducted by iData Research reveals that over 19 million colonoscopies were performed annually as of 2017, a substantial share of the 75 million endoscopies performed each year in the United States [iDa, 2018]. Malignant tumors developing in the small intestine, such as adenocarcinoma, intestinal lymphoma, leiomyosarcoma, and metastases from lung or breast cancer, are severe diseases that often result in death. Among these, small-bowel polyposis syndromes, including familial adenomatous polyposis, generalized juvenile polyposis, Peutz-Jeghers and Cronkhite-Canada syndromes, are the most lethal. Diagnosing these polyps and small-bowel tumors is challenging due to the rarity of the lesions, the lack of common symptoms across patients, and the variety of symptoms [Yano and Yamamoto, 2009]. In these cases, differential diagnosis from blood tests and symptoms alone is not sufficient, and visual examination through capsule endoscopy can provide valuable information. After visual confirmation of any feature of diagnostic importance, "where is it?" arises as a natural question. In the following subsection, we review the related work in the literature, all of which is motivated by this critical question.

1.1 Related work

Direction-of-arrival-based localization techniques, such as radio-frequency signal triangulation [Dey et al., 2017], received signal strength [Shah et al., 2006], electromagnetic tracking [Son et al., 2015], X-ray [Kuth et al., 2007] and positron emission markers [Than et al., 2014], have been widely investigated in robotics. Capsule endoscopes inherently provide visual information, which has drawn attention to the development of vision-based odometry and simultaneous localization and mapping (SLAM) systems, either to remove the need for additional pose-sensing hardware or to provide complementary information for 3D tracking. While current capsules are propelled by the peristaltic motion of the GI tract, active capsule endoscopes hold promise for drug delivery and biopsy [Ciuti et al., 2016]. Vision-based SLAM is of utmost importance to enable these functions and other forms of complementary situational awareness in decision support and augmented reality systems [Simaan et al., 2015]. With the rise of deep learning techniques [Turan et al., 2017a], public datasets enabling a broader research community to work on the localization and mapping problems have become crucial [Pogorelov et al., 2017, Bernal et al., 2017, Borgli et al., 2019] in medical image analysis. Several datasets are available to support research and development of advanced diagnostic features across a wide range of tasks, including segmentation, disease classification, tissue deformation and motion detection, and depth estimation. Some of them target endoscopy; these are summarized in Table 1 and explained in more detail in Appendix B.

Figure 1: Experimental Setup. The pre-harvested and cleaned porcine colons are cut and sewn to L- and Z-shaped scaffolds by a health practitioner. A data belt, copper wiring and a wireless transmitter are utilized for the data transfer between the receiver and the wireless capsule endoscope (WCE). A specially designed capsule holder is attached to the robotic arm for high-precision pose ground truth (see Fig. A.1). All experiments for the ex-vivo part of the dataset are conducted in a dark room to mimic the real capsule endoscopy procedure.
Table 1: Dataset Survey. An overview of existing datasets for disease classification, polyp recognition, segmentation, pose tracking and depth estimation. The size of each dataset in terms of number of images, and the corresponding organs, are also listed. Datasets collected via capsule endoscopy, standard endoscopy and laparoscopy are distinguished by superscript markers.
Dataset Name Findings Organs Tasks Size
Kvasir-SEG Polyps Colon Segmentation 1,000
Kvasir [Pogorelov et al., 2017] Z-line, pylorus, cecum, esophagitis, polyps, ulcerative colitis, dyed Colon Disease detection 6,000
Lifted polyps and dyed resection margins Colon Segmentation 2,000
Hamlyn Centre Datasets†⋆ Polyp Colon Segmentation 7,894
- Kidney Disparity 40,000
Polyp Colon Polyp recognition Localisation 2,000
- Liver, ureter, kidney, abdomen Tissue deformation Tracking -
KID Dataset [Koulaouzidis et al., 2017] Angioectasias, apthae, chylous cysts and polypoid, vascular and inflammatory lesions Small Bowel and colon Classification 2,448
NBI-InFrames [Moccia et al., 2018] Angioectasias, apthae, chylous cysts and polypoid Larynx Classification 720
EndoAbs [Penza et al., 2018] - Liver, kidney, spleen Classification 120
ASU-MAYO Clinic Polyp Colons Segmentation 22,701
ROBUST-MIS Challenge Rectal cancer Abdomen Segmentation 10,040

1.1.1 Survey for Depth and Pose Estimation

Depth estimation from a camera scene and visual odometry are challenging and active problems in computer vision. Various traditional multi-view stereo [Hartley and Zisserman, 2003] methods such as structure from motion [Wu et al., 2011, Leonard et al., 2018] and SLAM [Grasa et al., 2013] can be used to reconstruct a 3D map based on feature correspondences. However, their performance is still far from perfect, especially for endoscopic images that lack distinguishable features. Despite recent advances in image processing, colonoscopy remains a complicated setting for depth estimation because of the monocular camera with an insufficient light source, the limited working area and the frequently changing environment caused by muscle contractions. In that regard, deep-learning based methods have been applied to monocular depth estimation [Liu et al., 2020, Eigen et al., 2014, Liu et al., 2016]. CNN-based depth estimation methods have shown promising performance on single-image depth inference despite scale inconsistency [Laina et al., 2016]. Nevertheless, using CNNs in a fully supervised manner is challenging for endoscopy, since dense depth map ground truth corresponding directly to real endoscopic images is hard to obtain. Even if a labeled dataset is provided, patient-specific texture, shape and color make it difficult to obtain generalizable results without a large amount of ground truth. These issues are mostly overcome either by synthetically generated data or by simultaneous depth and pose estimation methods, in which the output of the pose network supervises the depth network instead of human expert annotations [Turan et al., 2018, Lu and Lu, 2019]. Mahmood et al. propose an unsupervised reverse domain adaptation framework to avoid these annotation requirements, accomplished by adversarial training that removes patient-specific details from real endoscopic images while preserving diagnostic details [Mahmood et al., 2018]. In [Mahmood and Durr, 2017], monocular depth estimation is formulated as a conditional random field learning problem, and a CNN-CRF framework consisting of unary and pairwise parts is introduced as a domain-adaptable approach.

Several self-supervised methods for single-frame depth estimation have been proposed in the generic field of computer vision [Garg et al., 2016, Zhang et al., 2020, Yin and Shi, 2018]. However, they are not generally applicable to endoscopy because their inter-frame photometric constancy assumption is broken by the inconsistent illumination profile that frequently appears in endoscopic videos. The jointly moving camera and light source cause the appearance of the same anatomy to differ substantially with varying camera poses, especially for tissue regions close to the camera. This can cause the network to get stuck in a local minimum during training, specifically in textureless regions where extracting reliable information from photometric appearance is extremely difficult [Chen et al., 2019a]. There are also studies focusing solely on monocular localization problems using CNNs [Jiao et al., 2018, Meng et al., EasyChair, 2018]. Unlike traditional artificial neural networks, Turan et al. use an RCNN, which is able to process arbitrarily long sequences through its directed cycles between hidden units and to infer correlative information across frames [Turan et al., 2017a]. However, estimating global scale from monocular images is inherently ambiguous [Eigen et al., 2014].
Despite all these efforts, visual odometry alone is insufficient for real-time localization, and vSLAM methods emerge as a solution that can only be tested via a comprehensive vSLAM dataset with accurate ground truth. In the work of Mountney et al., a vSLAM method based on Extended Kalman Filter SLAM (EKF-SLAM) is used for localization and soft tissue mapping, where sequential frames are acquired by a moving stereo endoscope [Mountney et al., 2006]. In robotic surgical systems such as da Vinci™, real-time 3D reconstruction methods have been applied and validated on phantom models [Stoyanov et al., 2010, Lin et al., 2013]. Lin et al. adopt and extend the Parallel Tracking and Mapping (PTAM) method to detect deformations on a non-rigid phantom, create a 3D reconstruction of an intestine model, and track endoscope position and orientation [Lin et al., 2013]. Other works focus on the more commonly used monocular endoscopes. Mirota et al. generate a 3D reconstruction from endoscopic video during sinus surgeries by using feature detection and registered data from CT scans while tracking the endoscope location [Mirota et al., 2011]. In Grasa et al. [2013], another monocular vSLAM method is used to provide a real-time 3D map of the abdominal cavity for hernia repair interventions.

Apart from standard endoscopes, vSLAM techniques have also been used in capsule endoscopy [Chen et al., 2019b, Turan et al., 2017b]. A robust and reliable SLAM module is indispensable for next-generation capsule robots equipped with functionalities such as biopsy, drug delivery and automated polyp detection [Turan et al., 2017c], but several technical challenges, such as the low frame rate and low resolution imposed by space limitations, make this need hard to meet. Specular reflections from extracellular fluids and a rapidly changing environment due to peristaltic motion are further inherent challenges. These problems have motivated the exploration of deep learning based approaches that eschew complex physical models, which in turn require large amounts of data.

Figure 2: Sample frames from the EndoSLAM Dataset. Images are acquired by a the MiroCam capsule endoscope, b the frontal camera of the PillCam, c the HighCam, d the LowCam, and e the virtually generated UnityCam. The ex-vivo part of the dataset offers the opportunity to test the robustness of pose estimation algorithms with images coming from various endoscope cameras. Since the EndoSLAM dataset contains both real and simulated frames, it is also a suitable platform for developing domain adaptation algorithms.

1.2 Contributions

In this work, we introduce the EndoSLAM dataset, a dedicated dataset designed for the development of 6-DoF pose estimation and dense 3D map reconstruction methods. The dataset is recorded using multiple endoscope cameras and ex-vivo porcine GI organs belonging to different animals and is designed to meet the following major requirements for scientific research and development of endoscopic SLAM methods:

1. Time-synchronized, ground-truth 6-DoF pose data
2. High-precision, ground-truth 3D reconstructions
3. Multiple organs from multiple individuals
4. Images from cameras with varying intrinsic properties
5. Image sequences with differing native frame rates
6. Images acquired from different camera viewing angles, such as perpendicular, vertical and tubular
7. Images under a variety of lighting conditions
8. Distinguishable features of diagnostic significance (e.g., presence/absence of polyps)

In addition to the experimentally collected data, synthetically generated data from a 3D simulation environment is included to facilitate the study of simulation-to-real problems such as domain adaptation and transfer learning. One of the biggest disadvantages of deep learning techniques is that large networks need massive amounts of domain-specific data for training. Research in recent years has shown that large amounts of synthetic data can improve the performance of learning-based vision algorithms and can ameliorate the difficulty and expense of obtaining real data in a variety of contexts. However, due to the large gap between simulated and real data, this path requires domain adaptation algorithms. With the synthetically generated data from the Unity 3D environment, we aim to provide a test-bed for overcoming the gap between the simulated and real endoscopic data domains.

In addition to the EndoSLAM dataset, we propose an unsupervised depth and pose estimation approach for endoscopic videos based on spatial attention and brightness-aware hybrid loss. The main idea and details of the proposed architecture are depicted in Fig. 4. Our main contributions are as follows:

1. Spatial Attention-based Visual Odometry and Depth Estimation: We propose a spatial attention based ResNet architecture for pose estimation optimized for endoscopic images.
2. Hybrid Loss: We propose a hybrid loss function which is specifically designed to cope with depth-of-field related defocus issues and fast frame-to-frame illumination changes in endoscopic images. It collaboratively combines the power of a brightness-aware photometric loss, a geometry consistency loss, and a smoothness loss.

The rest of the paper is organized as follows: Section 2 describes the experimental setup, gives details about the specifications of the devices used for dataset recording, and introduces the overall data tree structure. In Section 3, Endo-SfMLearner is described. In Section 4, various use-cases of the EndoSLAM dataset are exemplified by benchmarking Endo-SfMLearner against the state-of-the-art monocular depth and pose estimation methods SC-SfMLearner [Bian et al., 2019], SfMLearner [Zhou et al., 2017] and Monodepth2 [Godard et al., 2018]. In addition, fully dense 3D map reconstruction is demonstrated using the EndoSLAM dataset and Endo-SfMLearner. Finally, Section 5 discusses future plans and offers some concluding remarks.

2 Dataset Shooting

In this section, we introduce the experimental setup, the experimental procedure and the detailed structure of the EndoSLAM dataset.

2.1 Experimental Setup

The experimental setup was specifically designed to support the collection of endoscopic videos, accurate 6-DoF ground truth pose, and organ shape and topography data. The essential components are four endoscope video cameras (see Fig. A.1i, j, l, m), a robotic arm to track the trajectory and quantify the pose values (see Fig. A.1a), and high-precision 3D scanners for ground truth organ shape measurement (see Fig. A.1g-h). All of the equipment is illustrated in Fig. A.1. As camera devices, the MiroCam® (see Fig. A.1m) and PillCam® COLON2 (see Fig. A.1l) capsule endoscope cameras and two other cameras (HighCam and LowCam) representative of conventional endoscope cameras (see Fig. A.1i-j) were employed. Their specifications are as follows:

1. MiroCam® Regular MC1000-W endoscopic video capsule: 320×320 image resolution, 3 fps frame rate, 170° field of view, 7-20 mm depth of field, 6 white LEDs (Fig. A.1m) [Hong et al., 2012].
2. PillCam® COLON2 double-camera endoscope capsule: 256×256 image resolution per camera, 4 fps to 35 fps variable frame rate, 344° field of view (172° per camera), 4 LEDs per camera (Fig. A.1l) [Hong et al., 2012].
3. High Resolution Endoscope Camera (YPC-HD720P): 1280×720 image resolution, 20 fps frame rate, 120° field of view, 4-6 cm depth of field, 6 adjustable white LEDs (Fig. A.1i).
4. Low Resolution Endoscope 3 in 1 Camera: 640×480 image resolution, 20 fps frame rate, 130° field of view, 3-8 cm depth of field, 6 adjustable LEDs (Fig. A.1j).

Ground truth geometries of the organs were acquired via two commercially-available 3D scanners, the Artec 3D Eva and Shining 3D Einscan Pro 2x (see Fig. A.1g-h). 3D models of organs were reconstructed as in Fig. 3 and the depth distribution histograms for corresponding organs are given in Fig. F.3. Relevant performance specifications of the 3D scanners are as follows:

1. Artec 3D Eva: ±0.5 mm 3D resolution, ±0.1 mm 3D point accuracy, ±0.03% 3D accuracy over 100 cm distance [Artec3D, ].
2. Shining 3D EinScan Pro 2x: 0.2-2 mm point distance; ±0.5 mm 3D resolution, ±0.05 mm 3D point accuracy, ±0.03% 3D accuracy over 100 cm distance [3D, ].

Figure 3: 3D-Scanner images for the EndoSLAM Dataset. 3D-scanner images for six organs, which are fixed to scaffolds cut in O, Z and L shapes to mimic the GI-tract path from the ascending colon to the transverse colon. a RGB images of the scanned organs. b Corresponding 3D reconstructions from the .ply files recorded via the 3D scanners. c Heatmap reconstructions of the depth values obtained with the Computer Vision Toolbox of Matlab. The 3D point cloud data from two colons, one small intestine and three stomachs from different individuals make the dataset suitable not only for the development of 3D reconstruction algorithms but also for transfer learning problems.

In the rest of the paper, we refer to the MiroCam® Regular MC1000-W endoscopic video capsule as MiroCam, the PillCam® COLON2 as PillCam, the High Resolution Endoscope Camera (YPC-HD720P) as HighCam and the Low Resolution Endoscope 3 in 1 Camera as LowCam for brevity. A Franka Emika Panda robotic arm with 7 degrees of freedom and 0.1 mm pose repeatability is utilized to provide trajectory stabilization and ground-truth pose recording of the camera motions. The robot controls the motion of the cameras in hand-guided mode and records poses at a high frequency of 1 kHz (see Fig. A.1a).

2.2 Dataset Collection

The ex-vivo part of the dataset consists of a total of 42,700 frames. Of these, 21,428 images are from the HighCam, 17,978 images are from the LowCam, 239 images are from the PillCam and 3,055 images are from the MiroCam. The reader is referred to Fig. 2 and Fig. F.1 for illustrations of the recorded frames. Fig. E.1 shows the overall data tree structure. The Frames folder contains the endoscopic images recorded in a given trajectory, and Table F.3 summarizes, for each organ, which trajectories contain polyp-mimicking tumors. The trajectories for each organ can be found in the Poses folder in (.csv) and (.txt) format with 6D pose coordinates: four orientation parameters (x, y, z, w) as quaternions and three absolute position parameters (x, y, z) in meters. The Calibration folder contains intrinsic-extrinsic calibration data of the cameras as .mat files together with the calibration sessions. The 3D_Scanners folder consists of reconstructed 3D figures and point cloud data for six organs with a total size of 23.2 GB; the reader is referred to Table F.4 for the detailed point cloud distributions.
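As a minimal illustration of how the pose ground truth can be consumed, the sketch below loads one trajectory file and assembles homogeneous pose matrices. The file path and column names are assumptions for illustration only; check the header of the released .csv files before relying on them.

```python
# Hypothetical example of reading a ground-truth pose file from the Poses folder.
# Column names and ordering are assumed (positions in meters, quaternions as x, y, z, w).
import numpy as np
import pandas as pd
from scipy.spatial.transform import Rotation

poses = pd.read_csv("Poses/colon_traj1.csv")                      # placeholder path
t = poses[["trans_x", "trans_y", "trans_z"]].to_numpy()           # positions [m], assumed names
q = poses[["quat_x", "quat_y", "quat_z", "quat_w"]].to_numpy()    # quaternions, assumed names
R = Rotation.from_quat(q).as_matrix()                             # SciPy expects (x, y, z, w)

# assemble 4x4 homogeneous ground-truth poses, e.g. for the error metrics of Section 4
T = np.tile(np.eye(4), (len(poses), 1, 1))
T[:, :3, :3], T[:, :3, 3] = R, t
```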

2.3 Experimental Procedure

'L'-shaped and 'Z'-shaped semi-cylindrical and 'O'-shaped semi-spherical scaffolds were cut into rectangular high-density solid foams with dimensions of 30×30×17 cm to be used as substrates for the colons, small intestine and stomachs, respectively. The shape used for the colons mimics the GI-tract path from the ascending colon to the transverse colon. Cleaned porcine organs were cut and sewn to the foam by a practitioner, see Fig. 1.

All the capsules require a specific recorder to be worn by the patient. PillCam and MiroCam differ in multiple features; in particular, PillCam transmits video wirelessly whereas MiroCam transmits video via a wired connection. For both PillCam and MiroCam, the capsules are placed into a 3D-printed, non-conducting holder attached to the robotic arm. HighCam and LowCam were also used in place of the capsule cameras to record endoscopic images in a similar set-up but with different organs, Colon-IV, Small Intestine, Stomach-I, Stomach-II, and Stomach-III, which were sewn onto a white scaffold.

Recordings were made in a dark room with green and white background solid foams to create luminance and color contrast between the GI tract and the environment. The orientation of the cameras (capsule and conventional) throughout the procedure is mostly along the longitudinal axis of the semi-cylindrical and semi-spherical surfaces. In most of the trajectories the capsule endoscopes do not contact the tissue, while the conventional endoscope cameras never contact the tissue in any trajectory. In all the experiments the robot end-effector was driven by hand, with speeds ranging mostly between 16.76 mm/s and 25.97 mm/s with a peak speed of 286.68 mm/s, and with accelerations mostly ranging from 279.254 mm/s² to 519.361 mm/s² with a peak acceleration of 14,680.15 mm/s². Varying speeds and accelerations are quite important for SLAM evaluations, since the performance of SLAM methods in general depends significantly on the complexity of the trajectories. In that regard, we performed a detailed quantitative analysis of the robot motion, see Tables F.1 and F.2.

Experimental equipment calibration can be grouped into three parts: camera calibration, the hand-eye transformation from robot to camera, and temporal synchronization between the camera frames and the robot pose measurements. Each camera was calibrated against a pinhole camera model with non-linear radial lens distortion using the Camera Calibration Toolbox of MATLAB R2020a, based on the method of Zhang [Zhang, 1999], with the chessboard images illustrated in Fig. C.1. The hand-eye transformation between the robot end-effector coordinate frame and the camera frame was estimated with the procedure of Tsai and Lenz [Tsai and Lenz, 1989], and the resulting transformation matrices are given in Table C.1. Finally, temporal synchronization was performed by correlating the camera motion computed via optical flow with the velocity measurements from the robot. Table D.1 and Table D.2 show the correspondence between the start frame of each sequence and the matching sampling instant of the robot pose data. Further details of the calibration procedures are given in Appendix C and D.
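The dataset calibration itself was performed with the MATLAB toolbox and the Tsai-Lenz procedure; the following is only an equivalent sketch of the same two steps using OpenCV, in which the chessboard pattern size, square size, image file names and robot poses are placeholders.

```python
# Sketch of intrinsic (Zhang) and hand-eye (Tsai-Lenz) calibration with OpenCV.
# Not the pipeline used for the dataset; pattern size, square size and inputs are placeholders.
import cv2
import numpy as np

pattern, square = (9, 6), 0.010                       # inner corners, square edge length [m]
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_pts, img_pts = [], []
for path in ["calib_000.png", "calib_001.png"]:       # placeholder chessboard images
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        corners = cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1),
                                   (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_pts.append(objp)
        img_pts.append(corners)

# intrinsics and radial distortion from the chessboard views (Zhang's method)
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(obj_pts, img_pts, gray.shape[::-1], None, None)

# hand-eye calibration: one robot end-effector pose per chessboard view is required
R_gripper2base = [np.eye(3) for _ in obj_pts]         # placeholders; use the logged robot poses
t_gripper2base = [np.zeros(3) for _ in obj_pts]
R_target2cam = [cv2.Rodrigues(r)[0] for r in rvecs]
R_cam2gripper, t_cam2gripper = cv2.calibrateHandEye(
    R_gripper2base, t_gripper2base, R_target2cam, list(tvecs),
    method=cv2.CALIB_HAND_EYE_TSAI)
```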

2.4 Dataset Augmentation

For the purpose of studying the robustness of SLAM algorithms against artefacts, functions that change the properties and quality of the images were designed; examples are shown in Fig. F.2. The transformations include resizing, Gaussian blur, fish-eye distortion, vignetting, depth-of-field simulation via shift-variant defocus blurring, and frame-rate variation. The resize, vignetting and Gaussian blur transformations were implemented with the opencv-python library (version 4.2.0.32), the fish-eye distortion with the Pygame library (version 1.9.6), and the depth-of-field simulation with Matlab (version R2020a) [Pertuz, June 7, 2020]. All dataset augmentation functions are available at https://github.com/CapsuleEndoscope/EndoSLAM.
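For illustration, simplified stand-ins for three of these transforms (resize, Gaussian blur and vignetting) are sketched below with opencv-python; the released repository contains the authors' actual implementations, including the fish-eye and depth-of-field functions, and the input path here is a placeholder.

```python
# Simplified augmentation stand-ins, not the repository's exact functions.
import cv2
import numpy as np

def resize(img, scale=0.5):
    return cv2.resize(img, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)

def gaussian_blur(img, ksize=9, sigma=3.0):
    return cv2.GaussianBlur(img, (ksize, ksize), sigma)

def vignette(img, strength=200):
    # darken the image corners with a separable Gaussian mask
    rows, cols = img.shape[:2]
    mask = cv2.getGaussianKernel(rows, strength) @ cv2.getGaussianKernel(cols, strength).T
    mask = mask / mask.max()
    return (img * mask[..., None]).astype(img.dtype)

frame = cv2.imread("sample_frame.png")                 # placeholder frame path
augmented = vignette(gaussian_blur(resize(frame)))
```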

2.5 Synthetic Data Generation

In addition to the real ex-vivo part of the EndoSLAM dataset, we have generated synthetic capsule endoscopy frames to facilitate the study of simulation-to-real transfer of learning-based algorithms. The simulation environment, VRCaps [Incetan et al., 2020], provides synthetic data that is visually as well as morphologically realistic. The platform was built using real computed tomography (CT) images in DICOM format for topography and endoscopic images in RGB format for texture assignment. A cinematic rendering tool mimicking effects present in real capsule endoscopy recordings, such as specular reflection, distortion, chromatic aberration and field of view, was used to obtain more photo-realistic images. Operating the virtual capsule inside the virtual 3D GI tract, we recorded three sample endoscopic videos containing 21,887 frames from the colon, 12,558 frames from the small intestine and 1,548 frames from the stomach, with a resolution of 320×320 pixels and with both pose and pixel-wise depth ground truth.

3 Endo-SfMLearner

Recent works have proven that CNN-based depth and ego-motion estimators can achieve high performance using unlabelled monocular videos. However, the static-scene assumption, the scale ambiguity between consecutive frames, the brightness variation that mainly stems from the shallow depth of field, and organ tissues with non-Lambertian surfaces that reflect light non-diffusely make it difficult to provide both locally and globally consistent trajectory estimations. We propose the Endo-SfMLearner framework, which specifically addresses these gaps.

Endo-SfMLearner jointly trains a camera pose network and a depth estimation network from an unlabeled endoscopic dataset. Our method proposes two solutions to the light-source-related problems in depth and pose estimation. The first is to equate the brightness conditions throughout the training and validation sets with a brightness transformation function; the second is to weight the photometric loss with a brightness coefficient so that depth estimation under differing illumination conditions is penalized with a higher cost. Apart from these, we use a geometry consistency loss against the scale inconsistency between consecutive frames caused by the alternating distance between the camera and the organ tissue. In principle, we convert the predicted depth map of one frame into 3D space, project it onto the consecutive frame using the estimated ego-motion, and minimize the inconsistency between the estimated and projected depth maps. This implicitly compels the depth network to predict geometrically consistent (i.e., scale-consistent) results over consecutive frames. With iterative sampling and training, the frame-to-frame consistency eventually propagates through the entire video sequence. As the scale of the depths is strictly linked to the scale of the ego-motions, the ego-motion network can estimate scale-consistent relative camera poses over consecutive pairs. The detailed network architectures for both the depth and pose networks are introduced in the following subsections.

Figure 4: Endo-SfMLearner Architecture Overview. a Overall system architecture of the Endo-SfMLearner framework. First, two consecutive unlabeled images $(I_{i},I_{i+1})$ are fed into the depth network separately and their corresponding dense disparity maps $(D_{i},D_{i+1})$ are predicted. PoseNet outputs the relative 6-DoF camera pose $P_{i,i+1}$ for the same snippet. The reference image $\hat{I}_{i}$ is synthesized with the predicted depth and pose by warping the source image $I_{i+1}$. The difference between $\mathbf{T}_{b}(\hat{I}_{i})$ and $I_{i}$ drives the brightness-aware photometric loss. To deal with violations of the geometric assumptions in image reconstruction (due to limitations of endoscope cameras), we also use a geometry consistency loss which takes into account the difference between the warped $D_{i+1}^{i}$ and the interpolated $D_{i+1}^{\prime}$ pixel-wise disparity estimates. b Attention-PoseNet in open form. The encoder part of the network consists of four basic ResNet blocks with a spatial attention module between the ReLU and max pooling layers. The decoder consists of four convolution layers, each followed by a ReLU activation function except the last layer. c DispNet in open form. The DispNet encoder shares a similar structure with the PoseNet encoder, except for the ESAB block and the skip connections. The decoder consists of five layers, each consisting of two convolution layers followed by an ELU activation function. With a final convolution operation followed by a sigmoid activation function, it outputs the dense disparity map from a single image. d Flow diagram of the Endo-SfMLearner spatial attention block operations. The non-local operator deduces the relative weights of all positions on the feature maps, which measures the input covariance as a degree of affinity between feature maps at different channels. For GPU memory efficiency, which is crucial in global attention applications, max-pooling operations are integrated into the block. Thanks to the attention mechanism, PoseNet selectively focuses on texture details for more accurate position and orientation estimation. e Endo-SfMLearner objective function. We use a weighted sum of the brightness-aware photometric loss, the smoothness loss and the geometry consistency loss as the overall learning objective. An affine brightness transformation is utilized to equate the illumination conditions of the reference and target images before calculating the SSIM and their pixel-wise channel differences.

3.0.1 Endo-SfMLearner Depth Network (DispNet)

The depth network, which consists of encoder and decoder parts, takes a single image $I_{i}$ as input and outputs the corresponding disparity map $D_{i}$. For the sake of brevity, hereinafter we refer to the batch normalization layer as BN, the Rectified Linear Unit activation function as ReLU, and the exponential linear unit as ELU. Let RBk denote a basic ResNet block with k filters and Ck a 3×3 convolution layer with k filters. Cek, Csk and Crk stand for Ck followed by ELU, sigmoid and ReLU activations, respectively.

1. DispNet Encoder: The DispNet encoder starts with a C64 layer with kernel size 7, stride 2 and padding 3, followed by BN, a ReLU activation function with a slope of 0.01 and a max pooling operation with kernel size 3 and stride 2. Then, four ResNet basic blocks, RB64, RB128, RB256 and RB512, finalize the encoder structure. Each ResNet basic block consists of Ck (3×3), BN, ReLU, Ck, BN and ReLU with a skip connection.
2. DispNet Decoder: The DispNet decoder consists of five layers, each comprising two convolution operations, as follows: Ce256(x2) - Ce128(x2) - Ce64(x2) - Ce32(x2) - Ce16(x2) - Cs16

To establish the information flow between the encoder and decoder, we build skip connections from the $i^{th}$ to the $(n-i)^{th}$ layer, where $n$ indicates the total number of layers and $i\in\{0,1,2,3\}$; the reader is referred to Fig. 4c for an overview.
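A minimal PyTorch sketch of this encoder-decoder layout is given below. The layer widths follow the description above, while the strides, padding and skip-connection wiring are assumptions and may differ from the released implementation.

```python
# Sketch of the DispNet layout: ResNet-style encoder, five-layer decoder with skip
# connections and a sigmoid disparity output. Not the official implementation.
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """RBk: Ck-BN-ReLU-Ck-BN-ReLU with a skip connection."""
    def __init__(self, c_in, c_out, stride=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride, 1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, 1, 1), nn.BatchNorm2d(c_out))
        self.skip = nn.Conv2d(c_in, c_out, 1, stride)   # match shape for the residual sum
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.skip(x))

def upconv(c_in, c_out):
    """Ce layer of the decoder: upsample, 3x3 convolution, ELU."""
    return nn.Sequential(nn.Upsample(scale_factor=2, mode="nearest"),
                         nn.Conv2d(c_in, c_out, 3, 1, 1), nn.ELU(inplace=True))

class DispNet(nn.Module):
    def __init__(self):
        super().__init__()
        # encoder: C64 (7x7, stride 2) - BN - ReLU - maxpool - RB64 - RB128 - RB256 - RB512
        self.stem = nn.Sequential(nn.Conv2d(3, 64, 7, 2, 3), nn.BatchNorm2d(64),
                                  nn.ReLU(inplace=True), nn.MaxPool2d(3, 2, 1))
        self.enc = nn.ModuleList([BasicBlock(64, 64, 1), BasicBlock(64, 128),
                                  BasicBlock(128, 256), BasicBlock(256, 512)])
        # decoder with encoder skips concatenated before every upconv except the first and last
        self.dec = nn.ModuleList([upconv(512, 256), upconv(256 + 256, 128), upconv(128 + 128, 64),
                                  upconv(64 + 64, 32), upconv(32, 16)])
        self.out = nn.Sequential(nn.Conv2d(16, 1, 3, 1, 1), nn.Sigmoid())   # Cs: disparity map

    def forward(self, x):
        feats = []
        x = self.stem(x)
        for block in self.enc:
            x = block(x)
            feats.append(x)                      # keep encoder features for the skips
        skips = feats[:-1][::-1]                 # 256-, 128- and 64-channel maps
        x = self.dec[0](feats[-1])
        for i, up in enumerate(self.dec[1:]):
            if i < len(skips):
                x = torch.cat([x, skips[i]], dim=1)
            x = up(x)
        return self.out(x)

disparity = DispNet()(torch.randn(1, 3, 256, 256))   # (1, 1, 256, 256) disparity map
```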

3.0.2 Endo-SfMLearner Pose Network (Attention PoseNet)

The pose network takes the consecutive image pair $(I_{i},I_{i+1})$, stacked along the channel dimension, as input and outputs the relative 6-DoF pose $P_{i,i+1}$.

1. Attention PoseNet Encoder: We integrate the attention module into the PoseNet encoder between the ReLU and max pooling layers: C64-BN-ReLU-ESAB-RB64-RB128-RB256-RB512
2. Attention PoseNet Decoder: Cr256 - Cr256 - Cr256 - C6

An overview of the Attention PoseNet is given in Fig. 4b, and the details of the attention mechanism are introduced in the next subsection.
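A compact sketch of this layout follows. The attention module is left as a placeholder (see the ESAB sketch in the next subsection), the residual blocks are simplified, and the pooling of the decoder output into a 6-vector is an assumption rather than the published design.

```python
# Sketch of the Attention PoseNet layout (C64-BN-ReLU-ESAB-RB64-RB128-RB256-RB512
# encoder, Cr256-Cr256-Cr256-C6 decoder). Simplified, not the official implementation.
import torch
import torch.nn as nn

def rb(c_in, c_out, stride=2):                   # compact stand-in for a ResNet basic block
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride, 1), nn.BatchNorm2d(c_out),
                         nn.ReLU(inplace=True),
                         nn.Conv2d(c_out, c_out, 3, 1, 1), nn.BatchNorm2d(c_out),
                         nn.ReLU(inplace=True))

class AttentionPoseNet(nn.Module):
    def __init__(self, attention=None):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(6, 64, 7, 2, 3), nn.BatchNorm2d(64),
                                  nn.ReLU(inplace=True))
        self.esab = attention or nn.Identity()    # spatial attention block plugs in here
        self.pool = nn.MaxPool2d(3, 2, 1)
        self.encoder = nn.Sequential(rb(64, 64, 1), rb(64, 128), rb(128, 256), rb(256, 512))
        self.decoder = nn.Sequential(              # Cr256 x3 followed by C6
            nn.Conv2d(512, 256, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, 1, 1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 6, 1))

    def forward(self, img_pair):                  # img_pair: two RGB frames stacked, (B, 6, H, W)
        x = self.encoder(self.pool(self.esab(self.stem(img_pair))))
        return self.decoder(x).mean(dim=(2, 3))   # (B, 6): translation and rotation parameters

pose = AttentionPoseNet()(torch.randn(1, 6, 256, 256))
```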

3.0.3 Endo-SfMLearner Spatial Attention Block (ESAB)

The intuition behind the ESAB module in the encoder layers is to guide the pose network by emphasizing texture details and depth differences between pixels. In contrast to feature-based and object-based attention, spatial attention selects a specific region of the input image, and the features in that region are processed by the attention block. The ESAB mechanism is a non-local convolutional process. For any given input $\mathbf{X}\in\mathbb{R}^{N\times 64\times H\times W}$, the block operation can be summarized as:

\mathbf{Z}=f(\mathbf{X},\mathbf{X}^{\top})\,g(\mathbf{X}), (1)

where $f$ captures the pixelwise relations between all positions of the input $\mathbf{X}$ and $g$ computes an embedding of the input. The non-local operator extracts the relative weights of all positions on the feature maps.

In the ESAB block, we employ a dot product between the max-pooled $\phi$ and $\theta$ convolutions, activated by the ReLU function:

\mathbf{P}=\psi(\sigma_{relu}(\theta(\mathbf{X})\,\phi(\mathbf{X})^{\top})), (2)

where $\sigma_{relu}$ is the ReLU activation function. The dot product $\theta(\mathbf{X})\,\phi(\mathbf{X})^{\top}$ gives a measurement of the input covariance, which can be regarded as a degree of affinity between feature maps at different channels. We activate the $\psi$ convolution with the softmax function and perform a matrix multiplication between $g$ and the output of the softmax. Then, we convolve and upsample the result of the multiplication with $\phi$ to extract the attention map $\mathbf{S}$. Finally, an element-wise sum between the attention map $\mathbf{S}$ and the input $\mathbf{X}$ generates the output $\mathbf{F}\in\mathbb{R}^{N\times 64\times H\times W}$:

\mathbf{S}=\phi(\sigma_{softmax}(\mathbf{P})\,g(\mathbf{X})), (3)
\mathbf{F}=\mathbf{S}+\mathbf{X}, (4)

where $\sigma_{softmax}$ denotes the softmax function. The short connection between the input $\mathbf{X}$ and the output $\mathbf{F}$ finalizes the block operations for residual learning. The detailed flow diagram of the ESAB block operations is given in Fig. 4d.
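The sketch below approximates Eqs. (1)-(4) with a standard non-local attention block: 1×1 convolutions play the roles of $\theta$, $\phi$ and $g$, the pairwise term is ReLU-activated and softmax-normalized, max pooling reduces the key/value resolution for memory efficiency, and the result is projected back and added to the input. Channel widths and the pooling factor are illustrative assumptions.

```python
# Approximate ESAB spatial attention block (non-local attention with a residual sum).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ESAB(nn.Module):
    def __init__(self, channels=64, inter=32, pool=2):
        super().__init__()
        self.theta = nn.Conv2d(channels, inter, 1)                                    # query embedding
        self.phi = nn.Sequential(nn.Conv2d(channels, inter, 1), nn.MaxPool2d(pool))   # keys (pooled)
        self.g = nn.Sequential(nn.Conv2d(channels, inter, 1), nn.MaxPool2d(pool))     # values (pooled)
        self.out = nn.Conv2d(inter, channels, 1)                                      # project back

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)          # (B, HW, inter)
        k = self.phi(x).flatten(2)                            # (B, inter, HW/pool^2)
        v = self.g(x).flatten(2).transpose(1, 2)              # (B, HW/pool^2, inter)
        rel = F.relu(torch.bmm(q, k))                         # pairwise relations, cf. Eq. (2)
        attn = F.softmax(rel, dim=-1)                         # relative weights of all positions
        s = torch.bmm(attn, v).transpose(1, 2).reshape(b, -1, h, w)   # attention map, cf. Eq. (3)
        return x + self.out(s)                                # residual sum, cf. Eq. (4)

attended = ESAB(64)(torch.randn(2, 64, 64, 64))               # same shape as the input
```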

3.0.4 Learning Objectives for Endo-SfMLearner

Endo-SfMLearner is trained in both forward and backward directions, with losses calculated in the forward direction. We use three loss functions to guide the network without labels: a brightness-aware photometric loss, a smoothness loss and a geometry consistency loss.

Apart from the usual way of defining the photometric loss, we propose an affine brightness transformation between consecutive frames to deal with the problems stemming from the brightness constancy assumption of previous methods. First, the new reference image $\hat{I}_{i}$ is synthesized by interpolating $I_{i+1}$. Previous methods calculate the photometric loss by directly comparing the synthesized image $\hat{I}_{i}$ with the target image $I_{i}$. However, illumination differences between consecutive frames might mislead the network. We propose to equate the brightness conditions between these two images as a more robust way of supervising the training phase. To the best of our knowledge, this is the first use of this approach for pose and depth estimation in the literature. Moreover, the quickly changing distance between organ tissue and camera results in scale inconsistency. We use the geometry consistency loss [Bian et al., 2019] to cope with that problem. The overall objective of the system is to minimize the weighted sum of the brightness-aware photometric loss $\mathcal{L}_{bp}^{M}$, the smoothness loss $\mathcal{L}_{s}$ and the geometry consistency loss $\mathcal{L}_{GC}$, which can be formulated as:

\mathcal{L}=\alpha\mathcal{L}_{bp}^{M}+\beta\mathcal{L}_{s}+\gamma\mathcal{L}_{GC}, (5)

where $\alpha$, $\beta$, and $\gamma$ are the weights of the related loss functions, which do not necessarily add up to one.

The well-known photometric loss functions are based on the brightness constancy assumption, which can be violated by camera auto-exposure and fast illumination changes, to which neither L2 nor SSIM is invariant. To deal with this inconsistent illumination issue, which is common in endoscopic image sequences, the Endo-SfMLearner network predicts a set of brightness transformation parameters that aligns the brightness of the input images on the fly during training in a self-supervised manner. The evaluations demonstrate that the proposed brightness transformation significantly improves the pose and depth prediction accuracy. The brightness-aware photometric loss is formulated as follows:

\mathcal{L}_{bp}=\frac{1}{|P|}\sum_{p\in P}\left(\lambda_{p}\,\lVert\mathbf{T}_{b}(\hat{I}_{i}(p))-I_{i}(p)\rVert_{2}+\lambda_{s}\,\frac{1-SSIM_{\mathbf{T}_{b}(\hat{I}_{i}),I_{i}}}{2}\right), (6)
\mathbf{T}_{b}(\hat{I}_{i})=\hat{I}_{i}^{\,a_{t^{\prime}},c_{t^{\prime}}}=a_{t\rightarrow t^{\prime}}\hat{I}_{i}+c_{t\rightarrow t^{\prime}}, (7)

where $\hat{I}_{i}$ is the image synthesized by warping $I_{i+1}$, $\mathbf{T}_{b}$ is the brightness alignment function with affine transformation parameters $a_{t\rightarrow t^{\prime}}$ and $c_{t\rightarrow t^{\prime}}$, $P$ stands for the pixels successfully projected from the reference frame, and $SSIM$ is the image dissimilarity loss. By making use of the contrast, luminance and structure of $\mathbf{T}_{b}(\hat{I}_{i})$ and $I_{i}$, SSIM aims to measure image quality as perceived by the human visual system and is more sensitive than PSNR to high-frequency content such as textures and edges.
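A sketch of Eqs. (6)-(7) is given below: the warped reference image is brightness-aligned with the affine parameters and then compared to the target with an L2 term and an SSIM term over the valid pixels. The SSIM window size and the $\lambda$ weights are illustrative choices, not the published hyperparameters.

```python
# Brightness-aware photometric loss, a sketch of Eqs. (6)-(7); weights are illustrative.
import torch
import torch.nn.functional as F

def ssim_map(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel SSIM with 3x3 mean-pooling windows."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    var_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    cov = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

def brightness_aware_photometric(i_warped, i_target, a, c, valid_mask,
                                 lambda_p=0.15, lambda_s=0.85):
    # affine brightness alignment T_b of Eq. (7); a and c are per-pair scalars
    aligned = a.view(-1, 1, 1, 1) * i_warped + c.view(-1, 1, 1, 1)
    l2 = torch.norm(aligned - i_target, dim=1, keepdim=True)           # per-pixel L2 over channels
    dssim = (1.0 - ssim_map(aligned, i_target)).mean(1, keepdim=True) / 2.0
    per_pixel = lambda_p * l2 + lambda_s * dssim                        # Eq. (6) before averaging
    return (per_pixel * valid_mask).sum() / valid_mask.sum().clamp(min=1), per_pixel

# example usage with random tensors
loss, per_pixel = brightness_aware_photometric(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64),
                                               torch.ones(2), torch.zeros(2),
                                               torch.ones(2, 1, 64, 64))
```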

Since the photometric loss is not sufficiently informative for low-texture and homogeneous endoscopic images, we also incorporate a smoothness loss [Ranjan et al., 2018], which is calculated from the predicted depth and the input image for both the reference and target frames.

\mathcal{L}_{s}=\sum_{p\in P}\left(e^{-\nabla I_{i}(p)}\cdot\nabla D_{i}(p)\right)^{2}, (8)

where $\nabla$ is the first derivative along the spatial directions. Thanks to the smoothness loss, Endo-SfMLearner is guided by edges in the predicted depth and input images. Finally, the geometry consistency loss is integrated into our methodology. The main idea behind this loss is to confirm whether $D_{i}$ describes the same scene as $D_{i+1}$ under the transformation given by the predicted relative pose $P_{i,i+1}$. The difference between the predicted depths, $D_{diff}$, is calculated as:

D_{diff}(p)=\frac{|D_{i+1}^{i}(p)-D_{i+1}^{\prime}(p)|}{D_{i+1}^{i}(p)+D_{i+1}^{\prime}(p)}, (9)

where $D_{i+1}^{i}$ is the depth map of $I_{i+1}$ obtained by warping $D_{i}$ via $P_{i,i+1}$, and $D_{i+1}^{\prime}$ is the interpolated depth map from $D_{i+1}$. The geometry consistency loss is then defined as the sum of this difference over all valid pixel coordinates, normalized by the number of valid pixels:

\mathcal{L}_{GC}=\frac{1}{|P|}\sum_{p\in P}D_{diff}(p). (10)

This consistency constraint between consecutive depth maps paves the way for long-trajectory estimation with higher accuracy; the reader is referred to Fig. 4a. We also use the depth inconsistency map $D_{diff}$ to weight $\mathcal{L}_{bp}$ with the mask $M$ as follows:

M=1-D_{diff}, (11)
\mathcal{L}_{bp}^{M}=\frac{1}{|P|}\sum_{p\in P}M(p)\cdot\mathcal{L}_{bp}(p). (12)

Thanks to this operation, the brightness-aware photometric loss is down-weighted at pixels where the predicted and interpolated depth maps are inconsistent.
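The sketch below implements Eqs. (9)-(12) under the assumption that the warped depth $D_{i+1}^{i}$, the interpolated depth $D_{i+1}^{\prime}$, a validity mask and the per-pixel photometric map of Eq. (6) are already available as tensors.

```python
# Geometry consistency loss and inconsistency-weighted photometric loss, Eqs. (9)-(12).
import torch

def geometry_consistency(depth_warped, depth_interp, valid_mask, photometric_per_pixel):
    d_diff = (depth_warped - depth_interp).abs() / (depth_warped + depth_interp)   # Eq. (9)
    n_valid = valid_mask.sum().clamp(min=1)
    loss_gc = (d_diff * valid_mask).sum() / n_valid                                # Eq. (10)
    mask = 1.0 - d_diff                                                            # Eq. (11)
    loss_bp_masked = (mask * photometric_per_pixel * valid_mask).sum() / n_valid   # Eq. (12)
    return loss_gc, loss_bp_masked
```

The total objective of Eq. (5) is then the weighted sum of loss_bp_masked, the smoothness term and loss_gc.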

Figure 5: Pose estimations on real trajectories. Endo-SfMLearner, SC-SfMLearner, Monodepth2, and SfMLearner trajectory estimations are benchmarked on ex-vivo EndoSLAM data. a Results for the first trajectory of the small intestine recorded by the LowCam. Throughout the whole trajectory, the SfMLearner relative pose estimates vary only marginally, which results in an almost straight global pose estimate. We observed a network firing problem for Monodepth2 even after repeating the tests on data with different frame rates. In contrast, SC-SfMLearner and Endo-SfMLearner produce more reasonable predictions thanks to the geometry consistency loss. The most challenging parts of the trajectories are sharp corners, where the position and orientation of the camera change rapidly within short time intervals. At those points, Endo-SfMLearner yields higher accuracy than SC-SfMLearner, not just qualitatively but also quantitatively in terms of both translational and rotational errors. b Results for the first sub-trajectory of the small intestine recorded by the HighCam. It illustrates a case where SC-SfMLearner loses scale consistency after a sharp corner, which is not the case for Endo-SfMLearner; the same problems are observed for SfMLearner and Monodepth2 as in the previous case. Endo-SfMLearner tracks loopy sections of the trajectories with sufficient precision for up to 1,000 frames, leaving only a small offset with respect to the ground truth. Apart from these, the camera orientation estimates improve significantly, and the rotational relative pose error is reduced almost threefold compared to SC-SfMLearner, the baseline state-of-the-art method that achieves the closest performance to ours. c Results for the fourth trajectory of Colon-III recorded by the MiroCam. In contrast to the HighCam and LowCam, the MiroCam exhibits fish-eye camera properties with high lens distortion, so this trajectory tests the reliability of Endo-SfMLearner against differing camera intrinsic properties.

4 EndoSLAM Use-Case with Endo-SfMLearner

To illustrate the use-case of the EndoSLAM dataset, Endo-SfMLearner, our proposed learning-based structure-from-motion method, was benchmarked for the pose and depth estimation tasks. Additionally, we have tested both the dataset and Endo-SfMLearner with a traditional fully dense 3D reconstruction pipeline based on SIFT feature matching and non-Lambertian surface reconstruction; a detailed overview is given in Algorithm 1. The error metrics used to quantitatively assess the performance of the algorithms are introduced in the following subsections.

4.1 Error Metrics

Endo-SfMLearner pose estimation performance is tested with three metrics: absolute trajectory error (ATE), translational relative pose error (trans RPE) and rotational relative pose error (rot RPE). The monocular depth estimation performance is evaluated in terms of root mean square error (RMSE). Finally, the 3D reconstruction results are evaluated with the surface reconstruction error. These error metrics are defined as follows, based on estimated and ground truth trajectories represented by $\mathbf{P}_{1},\ldots,\mathbf{P}_{n}\in\mathrm{SE}(3)$ and $\mathbf{Q}_{1},\ldots,\mathbf{Q}_{n}\in\mathrm{SE}(3)$, respectively, where the subscript indexes frames and SE(3) is the Special Euclidean group in three dimensions.

Algorithm 1 3D Reconstruction and Evaluation Pipeline
1:Extract SIFT features between image pairs
2:Find k-nearest neighbours for each feature using a k-d tree
3:for each image do
4:  (i) Select m candidate matching images that have the most number of corresponding feature points
5:  (ii) Find geometrically consistent feature matches using RANSAC to solve for the homography between pairs of images.
6:end for
7:Find connected components of image matches
8:for each connected component do
9:  (i) Perform bundle adjustment for connected components in image matches
10:  (ii) Render final stitched image using multi-band blending
11:end for
12:Apply inpainting on the stitched image to suppress specularities
13:Reconstruct the surface using Tsai-Shah shape from shading method
14:Label a common line segment in ground truth data and reconstructed surface
15:Apply ICP algorithm using the common line as initialization
16:Compute iteratively the cloud-to-mesh distances to acquire RMSE
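A partial sketch of steps 1-12 is shown below, using OpenCV's high-level stitcher (which internally performs feature matching, RANSAC, bundle adjustment and multi-band blending) followed by specularity inpainting. It approximates, rather than reproduces, the SIFT-based pipeline above; the shape-from-shading and ICP evaluation steps are omitted, and the frame paths and thresholds are placeholders.

```python
# Stitching and specularity inpainting, approximating steps 1-12 of Algorithm 1.
import cv2
import numpy as np

frames = [cv2.imread(p) for p in ["frame_000.png", "frame_001.png", "frame_002.png"]]

stitcher = cv2.Stitcher_create(cv2.Stitcher_SCANS)
status, mosaic = stitcher.stitch(frames)
assert status == cv2.Stitcher_OK, f"stitching failed with status {status}"

# treat very bright, low-saturation pixels as specular highlights and inpaint them
hsv = cv2.cvtColor(mosaic, cv2.COLOR_BGR2HSV)
spec_mask = ((hsv[..., 2] > 230) & (hsv[..., 1] < 40)).astype(np.uint8) * 255
spec_mask = cv2.dilate(spec_mask, np.ones((5, 5), np.uint8))
clean = cv2.inpaint(mosaic, spec_mask, inpaintRadius=5, flags=cv2.INPAINT_TELEA)
cv2.imwrite("stitched_inpainted.png", clean)
```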

4.1.1 Absolute trajectory error (ATE)

The ATE is a measure of global consistency between two trajectories, comparing absolute distances between ground truth and predicted poses at each point in time. Let the rigid body transformation $\mathbf{S}$ be the best (least-squares) alignment of the trajectories [Horn, 1987]. Then the absolute trajectory error for the $i^{th}$ pose sample is calculated as follows:

\textrm{ATE}_{i}=\lVert\textrm{trans}(\mathbf{Q}_{i}^{-1}\mathbf{S}\mathbf{P}_{i})\rVert. (13)

The overall error throughout the trajectory is defined by the root mean square of $\textrm{ATE}_{i}$.
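A sketch of this computation is given below: the estimated positions are rigidly aligned to the ground truth in the least-squares sense (an SVD-based Horn/Kabsch alignment, without scale), and the per-pose errors are summarized by their RMSE.

```python
# ATE of Eq. (13) for position sequences; alignment is rigid (rotation + translation).
import numpy as np

def ate_rmse(gt_xyz, est_xyz):
    """gt_xyz, est_xyz: (N, 3) arrays of ground-truth and estimated positions."""
    mu_gt, mu_est = gt_xyz.mean(0), est_xyz.mean(0)
    H = (est_xyz - mu_est).T @ (gt_xyz - mu_gt)          # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                                   # best rotation mapping est -> gt
    t = mu_gt - R @ mu_est
    errors = np.linalg.norm(est_xyz @ R.T + t - gt_xyz, axis=1)   # ATE_i per pose sample
    return np.sqrt((errors ** 2).mean()), errors
```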

4.1.2 Relative Pose Error (RPE)

Relative pose error measures the difference in the change of pose over a fixed interval $\Delta$ between two trajectories. Defining $\mathbf{E}_{i}(\Delta)=(\mathbf{Q}_{i}^{-1}\mathbf{Q}_{i+\Delta})^{-1}(\mathbf{P}_{i}^{-1}\mathbf{P}_{i+\Delta})$, the translational and rotational RPE are given by:

\textrm{Trans RPE}_{i}(\Delta)=\lVert\textrm{trans}(\mathbf{E}_{i})\rVert, (14)
\textrm{Rot RPE}_{i}(\Delta)=\angle(\textrm{rot}(\mathbf{E}_{i})), (15)

where $\textrm{rot}(\mathbf{E}_{i})$ is the rotation matrix of $\mathbf{E}_{i}$ and $\angle(\cdot)$ is the positive angle of rotation. The errors are reported for $\Delta$ equal to 1.
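The corresponding sketch for $\Delta=1$ is shown below, assuming the trajectories are given as lists of 4×4 homogeneous pose matrices.

```python
# Translational and rotational RPE of Eqs. (14)-(15).
import numpy as np

def relative_pose_errors(gt, est, delta=1):
    trans_err, rot_err = [], []
    for i in range(len(gt) - delta):
        dq = np.linalg.inv(gt[i]) @ gt[i + delta]        # ground-truth relative motion
        dp = np.linalg.inv(est[i]) @ est[i + delta]      # estimated relative motion
        e = np.linalg.inv(dq) @ dp                       # E_i(Delta)
        trans_err.append(np.linalg.norm(e[:3, 3]))
        cos = np.clip((np.trace(e[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
        rot_err.append(np.degrees(np.arccos(cos)))       # positive rotation angle [deg]
    return np.asarray(trans_err), np.asarray(rot_err)
```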

Table 2: Quantitative results of pose prediction for various organs and trajectories. Endo-SfMLearner is compared with Endo-SfMLearner without the attention block (Ew/oAtt), Endo-SfMLearner without the brightness-aware photometric loss (Ew/oBr), SC-SfMLearner, Monodepth2, and SfMLearner. To test the robustness of the algorithms against tissue and trajectory differences, we performed tests on two separate trajectories from ex-vivo porcine stomach, colon and small intestine. Absolute trajectory error (ATE) and translational and rotational relative pose error are used as evaluation criteria. Moreover, for a better understanding of the effect of camera specifications on pose estimation, we compared the results coming from the high-resolution (HighCam) and low-resolution (LowCam) cameras for the same trajectories. We observed a considerable decrease in rotational errors for Endo-SfMLearner, which demonstrates the effectiveness of the spatial attention block integrated into the pose network encoder and of the brightness-aware photometric loss. Although most of the tests show the superiority of Endo-SfMLearner, for the third trajectory of Stomach-III from the HighCam, SC-SfMLearner performed with higher accuracy in terms of ATE. Nevertheless, the ablation studies do not provide sufficient cues to attribute this difference to either the spatial attention block or the brightness-aware photometric loss.
Organ, Trajectory Trajectory Length [m] ATE ↓ (mean ± std) [m] Trans RPE ↓ (mean ± std) [m] Rot RPE ↓ (mean ± std) [deg] Trajectory Length [m] ATE ↓ (mean ± std) [m] Trans RPE ↓ (mean ± std) [m] Rot RPE ↓ (mean ± std) [deg]
HighCam LowCam
EndoSfM Colon-IV,Traj-I 0.4286 0.0878 ± 0.0549 0.0009 ± 0.0027 0.488 ± 0.3217 0.6785 0.1046 ± 0.0343 0.0011 ± 0.006 0.4666 ± 1.3792
Colon-IV,Traj-V 1.2547 0.1731 ± 0.1179 0.0014 ± 0.002 0.2552 ± 0.417 1.1699 0.1771 ± 0.1177 0.0012 ± 0.002 0.1493 ± 0.2321
Intestine,Traj-IV 1.0557 0.0812 ± 0.0152 0.0010 ± 0.0013 0.173 ± 0.1942 0.8265 0.0558 ± 0.0356 0.0011 ± 0.0008 0.404 ± 0.5052
Stomach-I,Traj-I 1.4344 0.1183 ± 0.1062 0.0013 ± 0.0028 0.5988 ± 0.8185 0.8406 0.1732 ± 0.116 0.0021 ± 0.0034 0.8424 ± 1.0788
Stomach-III,Traj-III 0.8908 0.1177 ± 0.0543 0.0013 ± 0.0033 0.5543 ± 0.928 0.9714 0.1014 ± 0.0491 0.0011 ± 0.0007 0.6705 ± 0.3817
Ew/oAtt Colon-IV,Traj-I 0.4286 0.0894 ± 0.0274 0.0010 ± 0.0029 0.3502 ± 0.2621 0.6785 0.1548 ± 0.0591 0.0010 ± 0.3679 1.3613 ± 1.5908
Colon-IV,Traj-V 1.2547 0.1855 ± 0.0494 0.0014 ± 0.0022 0.4569 ± 0.5734 1.1699 0.1628 ± 0.0375 0.0014 ± 0.003 0.4168 ± 0.3149
Intestine,Traj-IV 1.0557 0.1055 ± 0.0379 0.0011 ± 0.0012 0.3343 ± 0.2653 0.8265 0.0691 ± 0.0305 0.001 ± 0.0009 0.654 ± 0.6042
Stomach-I,Traj-I 1.4344 0.1889 ± 0.0497 0.0015 ± 0.0038 0.893 ± 0.915 0.8406 0.1968 ± 0.1417 0.0025 ± 0.0037 1.1823 ± 1.2112
Stomach-III,Traj-III 0.8908 0.1362 ± 0.068 0.0016 ± 0.0032 0.8244 ± 1.0127 0.9714 0.1204 ± 0.0418 0.0010 ± 0.0009 1.0907 ± 0.5634
Ew/oBr Colon-IV,Traj-I 0.4286 0.1328 ± 0.0431 0.0010 ± 0.0026 0.7198 ± 0.4764 0.6785 0.1402 ± 0.0671 0.0010 ± 0.0060 0.7257 ± 1.424
Colon-IV,Traj-V 1.2547 0.1898 ± 0.0709 0.0015 ± 0.002 0.929 ± 0.7525 1.1699 0.1503 ± 0.0433 0.0013 ± 0.002 0.8989 ± 0.6199
Intestine,Traj-IV 1.0557 0.1467 ± 0.0848 0.002 ± 0.0010 0.6607 ± 0.3884 0.8265 0.1241 ± 0.0436 0.0009 ± 0.0008 1.106 ± 0.8081
Stomach-I,Traj-I 1.4344 0.1963 ± 0.0478 0.002 ± 0.0032 0.6899 ± 1.0401 0.8406 0.1923 ± 0.118 0.0023 ± 0.0032 0.9215 ± 1.1728
Stomach-III,Traj-III 0.8908 0.1277 ± 0.0805 0.0014 ± 0.0033 0.3933 ± 0.9258 0.9714 0.1101 ± 0.0257 0.0010 ± 0.0006 0.439 ± 0.2672
SC-SfM Colon-IV,Traj-I 0.4286 0.1545 ± 0.0441 0.0014 ± 0.0028 1.3532 ± 0.8541 0.6785 0.1898 ± 0.0718 0.0015 ± 0.0060 1.6388 ± 1.5908
Colon-IV,Traj-V 1.2547 0.2054 ± 0.1734 0.0024 ± 0.0029 1.2452 ± 0.965 1.1699 0.1667 ± 0.1263 0.0021 ± 0.003 1.2188 ± 0.7715
Intestine,Traj-IV 1.0557 0.1247 ± 0.1327 0.0015 ± 0.0009 0.9257 ± 0.584 0.8265 0.0908 ± 0.0819 0.0016 ± 0.0009 0.8989 ± 0.7854
Stomach-I,Traj-I 1.4344 0.2325 ± 0.127 0.002 ± 0.0038 1.2937 ± 1.2484 0.8406 0.191 ± 0.1399 0.0028 ± 0.0033 2.1322 ± 1.2601
Stomach-III,Traj-III 0.8908 0.0898 ± 0.035 0.0016 ± 0.0033 1.3071 ± 1.3187 0.9714 0.1927 ± 0.0561 0.0012 ± 0.0007 2.041 ± 0.8391
Mono2 Colon-IV,Traj-I 0.4286 0.1071 ± 0.0756 0.0012 ± 0.0028 0.3115 ± 0.268 0.6785 0.215 ± 0.1084 0.0009 ± 0.006 0.1679 ± 1.378
Colon-IV,Traj-V 1.2547 0.1872 ± 0.1404 0.0016 ± 0.002 0.1607 ± 0.4226 1.1699 0.2158 ± 0.1466 0.0018 ± 0.002 0.3921 ± 0.3362
Intestine,Traj-IV 1.0557 0.1507 ± 0.1165 0.009 ± 0.0013 0.1092 ± 0.1812 0.8265 0.1431 ± 0.132 0.0014 ± 0.001 0.3128 ± 0.5288
Stomach-I,Traj-I 1.4344 0.2878 ± 0.2293 0.0029 ± 0.0038 0.298 ± 0.7968 0.8406 0.2033 ± 0.0971 0.0019 ± 0.0011 0.5296 ± 0.3642
Stomach-III,Traj-III 0.8908 0.5841 ± 0.2742 0.0022 ± 0.0033 0.8178 ± 0.9059 0.9714 0.3876 ± 0.2322 0.0032 ± 0.0017 0.7345 ± 0.8349
SfM Colon-IV,Traj-I 0.4286 0.1584 ± 0.1064 0.0043 ± 0.0042 2.6624 ± 1.6822 0.6785 0.1946 ± 0.1708 0.0037 ± 0.0092 2.0718 ± 2.3018
Colon-IV,Traj-V 1.2547 0.5849 ± 0.5201 0.0092 ± 0.0175 4.4083 ± 4.6309 1.1699 0.2094 ± 0.1613 0.005 ± 0.0041 3.1999 ± 1.8304
Intestine,Traj-IV 1.0557 0.2119 ± 0.2022 0.0083 ± 0.016 3.9877 ± 5.2134 0.8265 0.2387 ± 0.1675 0.0048 ± 0.005 2.7019 ± 2.189
Stomach-I,Traj-I 1.4344 0.1741 ± 0.0744 0.0012 ± 0.0038 0.7249 ± 0.7904 0.8406 0.2226 ± 0.0989 0.007 ± 0.005 4.1709 ± 2.3479
Stomach-III,Traj-III 0.8908 0.3086 ± 0.1774 0.0018 ± 0.0035 0.6137 ± 0.996 0.9714 0.1711 ± 0.0548 0.0012 ± 0.0008 0.802 ± 0.4236

4.1.3 Surface Reconstruction Error

We use the methodology proposed by [Handa et al., 2014] to evaluate the surface reconstruction quality. As a first step, one line segment is manually identified in both the reconstructed and ground truth 3D maps. The matched points are used to coarsely align the two maps. This coarse alignment is used as an initialization for the iterative closest point (ICP) algorithm, which iteratively aligns the maps until a termination criterion of 0.001 cm deviation in RMSE is reached.
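A sketch of this evaluation with the open3d library is given below; the file names are placeholders, the identity matrix stands in for the coarse alignment obtained from the manually matched line segment, and the correspondence distance is an assumption that depends on the units of the scans.

```python
# ICP-based surface reconstruction evaluation (requires the open3d package).
import numpy as np
import open3d as o3d

gt_mesh = o3d.io.read_triangle_mesh("scanner_ground_truth.ply")        # placeholder
gt_cloud = gt_mesh.sample_points_uniformly(number_of_points=200_000)
recon = o3d.io.read_point_cloud("reconstructed_surface.ply")           # placeholder

init = np.eye(4)    # coarse alignment from the manually matched line segment goes here
icp = o3d.pipelines.registration.registration_icp(
    recon, gt_cloud, max_correspondence_distance=0.01, init=init,
    estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint(),
    criteria=o3d.pipelines.registration.ICPConvergenceCriteria(relative_rmse=1e-5,
                                                               max_iteration=200))
recon.transform(icp.transformation)
dists = np.asarray(recon.compute_point_cloud_distance(gt_cloud))       # cloud-to-mesh style distances
print("surface reconstruction RMSE:", np.sqrt((dists ** 2).mean()))
```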

4.2 Pose Estimation with Endo-SfMLearner

All methods, including Endo-SfMLearner, are trained with the same data and parameter set for the sake of fairness and unbiased results. The training and validation datasets consist of 2,039 and 509 colon images generated in the Unity simulation environment, respectively. We train all networks for 200 epochs with randomly shuffled batches of 4 images, optimize with ADAM with an initial learning rate of 10^{-4}, and validate after each epoch. According to the tests in terms of ATE, trans RPE, and rot RPE on the data recorded via the HighCam and LowCam, Endo-SfMLearner achieves the state of the art in most cases. The results in Table 2 show the clear advantage of the ESAB block integration and the brightness-aware photometric loss. In the majority of the Stomach-III results for both HighCam and LowCam, all models fail to follow the trajectory with sufficient accuracy. However, the trajectories predicted by Endo-SfMLearner are in general much better aligned with the ground truth than those of the other models.

Both quantitative and qualitative pose estimation results on sample trajectories of the HighCam, LowCam and MiroCam are given in Fig. 5. Under the above-mentioned training conditions, Monodepth2 and SfMLearner fail severely on endoscopic videos. Even though SC-SfMLearner exhibits the closest performance to our method in terms of absolute trajectory error, we observe improvements especially in the rotational motion estimates, which is reflected in the rotational relative pose errors. A similar observation is made on the Unity trajectories, see Fig. G.1. Since the rotations cannot be changed frequently and easily while recording clear images in the Unity environment, the trajectories are close to straight lines, which results in higher accuracy for all methods. The Endo-SfMLearner outputs generally follow the shape of the ground truth; in particular, they capture rotations more consistently, which is the main reason for the decrease in rotational relative pose error.

For a more comprehensive evaluation of the results with respect to camera motion, descriptive statistics of the camera speeds and accelerations are given in Fig. F.4, Table F.1 and Table F.2. Since robot motion strongly affects image quality, we expect a decrease in pose estimation accuracy for the Stomach-III trajectories, which have the highest mean speed and acceleration. This also explains the difficulty in aligning those trajectories and in stitching their frames for 3D reconstruction.

4.3 Depth Estimation with Endo-SfMLearner

In this subsection, we quantitatively analyse the monocular depth estimation performance of Endo-SfMLearner on the synthetically generated data of the EndoSLAM dataset.

Figure 6: Quantitative Depth Evaluations. The original input image, depth ground truth, predicted depth maps and error heatmaps by Endo-SfMLearner, Endo-SfMLearner without brightness loss integration (Endo_w/o_b1), Endo-SfMLearner without attention integration (Endo_w/o_a1), SC-SfMLearner (Endo-SfMLearner without the proposed loss and attention block), Monodepth2, published pretrained Monodepth2 (Monodepth2pre), SfMLearner and published pretrained SfMLearner (SfMLearnerpre) are shown from left to right, respectively. We benchmark the algorithms quantitatively on synthetically generated images acquired with a camera whose properties are equivalent to the MiroCam. Although the models subscripted with "1" are trained with the same data and parameter set, Endo-SfMLearner and SC-SfMLearner, which are guided by the geometry consistency loss, show considerably superior performance to the rest of the methods. In particular, Endo-SfMLearner estimates relatively distant regions more accurately than the remaining methods, although it is optimized for images obtained by shallow depth-of-field cameras. Moreover, its predictions conform to the camera light burst and to small depth alterations, resulting in the lowest RMSE for all organs, which also demonstrates the cross-organ adaptability of the method. By comparing Endo-SfMLearner, Endo_w/o_b1 and Endo_w/o_a1, one can deduce that the main benefit the ESAB block in PoseNet provides to DispNet is increased texture awareness, whereas the brightness-aware photometric loss focuses the network on light variations across pixels. Their combination significantly improves the performance, which is supported by the decreasing RMSE values. The published pre-trained models are trained on the KITTI dataset, which generally consists of images whose upper part represents distant sky and whose left and right edges contain closer points such as buildings or moving cars. This causes biased depth estimation, especially for Monodepth2pre, on endoscopic images from all organs.

Since the EndoSLAM dataset also provides pixelwise depth ground truth for the synthetically generated endoscopic frames, we show that Endo-SfMLearner quantitatively outperforms the benchmarked monocular depth estimation methods, as given in Fig. 6. The results are evaluated in terms of root mean square error (RMSE) on 1,548 stomach, 1,257 small intestine and 1,062 colon frames. Although the training and validation sets consist of synthetic colon frames, Endo-SfMLearner also performs well on the stomach and small intestine, with mean RMSE values of 0.2966 and 0.1785, respectively. The heatmaps also indicate that the errors decrease significantly for pixels representing regions farther than 14 mm.
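The RMSE computation itself is straightforward; a minimal sketch is shown below. The per-frame median scaling is a common convention for scale-ambiguous monocular predictions and is stated here as an assumption rather than the paper's exact protocol, and the arrays are synthetic stand-ins for a rendered frame and its ground truth.

```python
import numpy as np

def depth_rmse(pred: np.ndarray, gt: np.ndarray) -> float:
    """RMSE between a predicted and a ground-truth depth map of the same shape."""
    # Assumption: resolve the unknown monocular scale with per-frame median scaling.
    pred = pred * (np.median(gt) / np.median(pred))
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

# Example with synthetic arrays standing in for a 320x320 frame.
gt = np.random.uniform(1.0, 20.0, size=(320, 320))
pred = gt + np.random.normal(0.0, 0.3, size=gt.shape)
print(f"RMSE: {depth_rmse(pred, gt):.4f}")
```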

4.3.1 Ablation Studies for Spatial Attention Block

To increase the sensitivity of the pose and depth networks to edge and texture details, we integrated an attention block between the ReLU and max pooling operations in the PoseNet encoder. Through this attention mechanism, we expect to preserve low- and high-frequency information from the input endoscopic images by exploiting inter-dependencies across feature channels. In this subsection, we specifically investigate the following cases:

  • 1. Endo-SfMLearner with brightness-aware photometric loss and ESAB,

  • 2. Endo-SfMLearner with ESAB and without brightness-aware photometric loss,

  • 3. Endo-SfMLearner without ESAB and with brightness-aware photometric loss,

  • 4. Endo-SfMLearner without ESAB and without brightness-aware photometric loss (SC-SfMLearner).

Figure 7: 3D Map Reconstruction and Evaluation Pipeline. a Input image sequences from the Colon-IV, Small Intestine and Stomach-III trajectories, downsampled to 4 fps. The frames are separately given as input to the Scale Invariant Feature Transform (SIFT). b The final stitched image, formed by aligning and blending all input frames. Specularities are suppressed using the inpainting function of OpenCV. c Depth maps for the inpainted images, predicted using Endo-SfMLearner, SC-SfMLearner and shape from shading. d 3D scanner point cloud data for each organ in ply format. e The matched area between the reference and aligned point clouds, highlighted in green. The aligned regions are chosen to be the same for all compared groups for the sake of fairness. Iterative Closest Point (ICP) was used to align the ground truth data and the reconstructed surface after manually labelling a common line segment. f The cloud-to-mesh distances as heatmaps, with the colour bar displaying the root mean square error in cm. The RMSE values for Colon-IV are 0.51 cm, 0.86 cm and 0.65 cm for Endo-SfMLearner, SC-SfMLearner and shape from shading, respectively. The RMSE values for the Small Intestine are 0.40 cm, 1.02 cm and 0.54 cm for Endo-SfMLearner, SC-SfMLearner and shape from shading, respectively. The RMSE values for Stomach-III are 0.41 cm, 1.37 cm and 0.73 cm for Endo-SfMLearner, SC-SfMLearner and shape from shading, respectively. For all organs, we observe the superiority of Endo-SfMLearner over both SC-SfMLearner and shape from shading. Since the training and validation dataset of SC-SfMLearner consists of colon frames, the RMSE values for the colon are smaller than for the other organs. However, even though Endo-SfMLearner uses the same training and validation data, it performs highly effectively on the stitched stomach and intestine images in comparison with the remaining methods.

The pose tracking results given in Table 2 reveal the usefulness and effectiveness of the module. Although the attention module is inserted only in PoseNet, the simultaneous training of both networks also improves depth estimation, as depicted in Fig. 6. As seen from the quantitative ablation analysis, the attention module makes Endo-SfMLearner more responsive to depth alterations on the synthetically generated images from the colon, intestine and stomach. Even for the stomach and small intestine, which are not included in the training phase, Endo-SfMLearner achieves acceptable RMSE values, indicating its adaptability to texture differences.
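For readers unfamiliar with spatial attention, the sketch below shows a generic CBAM-style spatial attention block in PyTorch. It illustrates the idea of reweighting feature maps by a learned spatial mask placed between ReLU and pooling, but it is not the exact ESAB architecture, which is specified in the Endo-SfMLearner code; the tensor shapes are illustrative.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Generic spatial attention block (CBAM-style), for illustration only."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_pool = x.mean(dim=1, keepdim=True)           # channel-wise average
        max_pool, _ = x.max(dim=1, keepdim=True)         # channel-wise maximum
        mask = torch.sigmoid(self.conv(torch.cat([avg_pool, max_pool], dim=1)))
        return x * mask                                  # reweight feature maps spatially

# Example: a feature map from an encoder stage, placed between ReLU and pooling.
features = torch.relu(torch.randn(4, 64, 80, 80))
attended = SpatialAttention()(features)
pooled = nn.MaxPool2d(2)(attended)
print(pooled.shape)  # torch.Size([4, 64, 40, 40])
```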

4.4 3D Reconstruction Pipeline

The 3D reconstruction approach used in this work is depicted in Algorithm 1. The procedure consists of four fundamental steps: Otsu threshold-based reflection detection, OpenCV inpainting-based reflection suppression, image stitching based on feature matching and tracking, and non-Lambertian surface reconstruction. To establish feature point correspondences between frames, SIFT feature matching and RANSAC-based pair elimination are employed [Brown and Lowe, 2007]. Then, the depth map is estimated using the Tsai-Shah shape from shading approach.
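The sketch below illustrates the first three steps (Otsu-based specularity detection, OpenCV inpainting and SIFT/RANSAC matching) for a pair of consecutive frames. The file names are placeholders, and the full stitching, tracking and shape-from-shading stages follow the paper's Algorithm 1 rather than this simplified fragment.

```python
import cv2
import numpy as np

def specularity_mask(bgr: np.ndarray) -> np.ndarray:
    """Detect bright specular reflections with an Otsu threshold on intensity."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return cv2.dilate(mask, np.ones((3, 3), np.uint8))   # grow the mask slightly

def suppress_reflections(bgr: np.ndarray) -> np.ndarray:
    """Inpaint the masked specular regions (OpenCV Telea inpainting)."""
    return cv2.inpaint(bgr, specularity_mask(bgr), 3, cv2.INPAINT_TELEA)

# Placeholder file names for two consecutive (downsampled) endoscopic frames.
img1 = suppress_reflections(cv2.imread("frame_000.png"))
img2 = suppress_reflections(cv2.imread("frame_001.png"))

# SIFT feature matching with Lowe's ratio test, then RANSAC homography.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY), None)
kp2, des2 = sift.detectAndCompute(cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY), None)
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
print("Homography between consecutive frames:\n", H)
```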

This surface reconstruction method applies a discrete approximation of the gradients and then utilizes a linear approximation of the reflectance function directly in terms of depth. For further details of the Tsai-Shah method, the reader is referred to the original paper [Ping-Sing and Shah, 1994]. Fig. 7 demonstrates the steps of the approach and the output maps aligned with the ground truth scanner data. Using this pipeline, RMSEs of 0.65 cm, 0.54 cm and 0.73 cm are obtained for the Colon-IV, Small Intestine and Stomach-III trajectories, respectively. In addition, we apply the depth estimations of Endo-SfMLearner and SC-SfMLearner to the stitched images and compare them with the rule-based method. Endo-SfMLearner yields the lowest RMSE values on the aligned images, see Fig. 7.

5 Discussion and Future Works

In this paper, we introduce a novel endoscopic SLAM dataset that contains both capsule and standard endoscope camera images with 6D ground truth poses and high precision scanned 3D maps of the explored GI organs. Four different cameras were employed in total to collect data from eight ex-vivo porcine GI-tract organs, each from a different animal. Various additional post-processing effects, such as fish-eye distortion, Gaussian blur, downsampling and vignetting, can optionally be applied to diversify and enrich the dataset. In addition to the EndoSLAM dataset, Endo-SfMLearner is proposed as a monocular pose and depth estimation method based on a spatial attention mechanism and a brightness-aware hybrid loss. Although Endo-SfMLearner is specifically developed and optimized for endoscopic images, it also holds great promise for laparoscopy images due to their similar texture characteristics. Our future work will focus on generalizing the EndoSLAM dataset concept to other visualization techniques and on creating datasets with various other imaging modalities. Furthermore, we aim to examine and improve the data adaptability of Endo-SfMLearner and address these issues as next steps. Last but not least, we plan to investigate the combination of Endo-SfMLearner with segmentation, abnormality detection and classification tasks in the context of multi-task and meta-learning to enhance the performance of state-of-the-art methods.

Acknowledgment

Mehmet Turan, Kutsev Bengisu Ozyoruk, Guliz Irem Gokceler, Gulfize Coskun, and Kagan Incetan are especially grateful to the Scientific and Technological Research Council of Turkey (TUBITAK) for International Fellowship for Outstanding Researchers. We would like to express deep gratitude to Abdullhamid Obeid and Ebru Sagiroglu for their valuable support during experiments.

References

  • [Rob, ] Robust medical instrument segmentation (ROBUST-MIS) challenge 2019. https://www.synapse.org/#!Synapse:syn18779624/wiki/592660. Accessed: 2020-02-12.
  • iDa [2018] iData Research, 2018. An astounding 19 million colonoscopies are performed annually in the united states. https://idataresearch.com/an-astounding-19-million-colonoscopies-are-performed//-annually-in-the-united-states/. Accessed: 08/08/2018.
  • [3] Shining 3D. User handbook - Shining 3D EinScan. URL: https://www.einscan.com/wp-content/uploads/2020/05/EinScan-Pro-2X-Quick-Guide-1112.pdf. Last accessed August 2019.
  • American Cancer Society [2018] American Cancer Society, 2018. Global Cancer Facts and Figures 4th Edition. American Cancer Society.
  • Arnold et al. [2020] Arnold, M., et al., 2020. Global burden of 5 major types of gastrointestinal cancer. Gastroenterology URL: http://www.sciencedirect.com/science/article/pii/S0016508520304522, doi:10.1053/j.gastro.2020.02.068.
  • [6] Artec3D, . User handbook - artec3d eva. URL: https://www.artec3d.com/files/pdf/ArtecScanners-Booklet.pdf. last accessed August 2019.
  • Bernal et al. [2012] Bernal, J., Sanchez, J., Vilariño, F., 2012. Towards automatic polyp detection with a polyp appearance model. Pattern Recognition 45, 3166–3182. doi:10.1016/j.patcog.2012.03.002,.
  • Bernal et al. [2015] Bernal, J., et al., 2015. Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Computerized Medical Imaging and Graphics 43, 99—111. doi:10.1016/j.compmedimag.2015.02.007.
  • Bernal et al. [2017] Bernal, J., et al., 2017. Comparative validation of polyp detection methods in video colonoscopy: Results from the miccai 2015 endoscopic vision challenge. IEEE Transactions on Medical Imaging 36, 1231–1249. doi:10.1109/TMI.2017.2664042.
  • Bian et al. [2019] Bian, J.W., et al., 2019. Unsupervised scale-consistent depth and ego-motion learning from monocular video. arXiv:1908.10553.
  • Borgli et al. [2019] Borgli, H., et al., 2019. Hyper-kvasir: A comprehensive multi-class image and video dataset for gastrointestinal endoscopy. doi:10.31219/osf.io/mkzcq.
  • Brown and Lowe [2007] Brown, M., Lowe, D.G., 2007. Automatic panoramic image stitching using invariant features. International Journal of Computer Vision 74, 59–73.
  • Chen et al. [2019a] Chen, R., Bobrow, T., Athey, T., Mahmood, F., Durr, N., 2019a. Slam endoscopy enhanced by adversarial depth prediction.
  • Chen et al. [2019b] Chen, R.J., Bobrow, T.L., Athey, T.L., Mahmood, F., Durr, N.J., 2019b. Slam endoscopy enhanced by adversarial depth prediction. ArXiv abs/1907.00283.
  • Ciuti et al. [2016] Ciuti, G., et al., 2016. Frontiers of robotic endoscopic capsules: a review. Journal of Micro-Bio Robotics 11, 1–18.
  • Dey et al. [2017] Dey, N., Ashour, A.S., Shi, F., Sherratt, R.S., 2017. Wireless capsule gastrointestinal endoscopy: Direction-of-arrival estimation based localization survey. IEEE reviews in biomedical engineering 10, 2–11.
  • Eigen et al. [2014] Eigen, D., Puhrsch, C., Fergus, R., 2014. Depth map prediction from a single image using a multi-scale deep network, in: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (Eds.), Advances in Neural Information Processing Systems 27. Curran Associates, Inc., pp. 2366–2374. URL: http://papers.nips.cc/paper/5539-depth-map-prediction-from-a-single-image-using-a-//multi-scale-deep-network.pdf.
  • Garg et al. [2016] Garg, R., G, V.K.B., Reid, I.D., 2016. Unsupervised CNN for single view depth estimation: Geometry to the rescue. CoRR abs/1603.04992. URL: http://arxiv.org/abs/1603.04992, arXiv:1603.04992.
  • Godard et al. [2018] Godard, C., Aodha, O., Brostow, G., 2018. Digging into self-supervised monocular depth estimation. CoRR .
  • Grasa et al. [2013] Grasa, O.G., Bernal, E., Casado, S., Gil, I., Montiel, J., 2013. Visual slam for handheld monocular endoscope. IEEE transactions on medical imaging 33, 135–146.
  • Handa et al. [2014] Handa, A., Whelan, T., McDonald, J., Davison, A.J., 2014. A benchmark for rgb-d visual odometry, 3d reconstruction and slam, in: Robotics and automation (ICRA), 2014 IEEE international conference on, IEEE. pp. 1524–1531.
  • Hartley and Zisserman [2003] Hartley, R., Zisserman, A., 2003. Multiple View Geometry in Computer Vision. 2 ed., Cambridge University Press, New York, NY, USA.
  • Ho et al. [2017] Ho, H.W., de Croon, G.C., Chu, Q., 2017. Distance and velocity estimation using optical flow from a monocular camera. International Journal of Micro Air Vehicles 9, 198–208.
  • Honegger et al. [2012] Honegger, D., Greisen, P., Meier, L., Tanskanen, P., Pollefeys, M., 2012. Real-time velocity estimation based on optical flow and disparity matching, in: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE. pp. 5177–5182.
  • Hong et al. [2012] Hong, S.P., Cheon, J., Kim, T., Song, S., Kim, W., 2012. Comparison of the diagnostic yield of "mirocam" and "pillcam sb" capsule endoscopy. Hepato-gastroenterology 59, 778–81. doi:10.5754/hge10472.
  • Horn [1987] Horn, B.K., 1987. Closed-form solution of absolute orientation using unit quaternions. Josa a 4, 629–642.
  • Incetan et al. [2020] Incetan, K., Celik, I.O., Obeid, A., Gokceler, G.I., Ozyoruk, K.B., Almalioglu, Y., Chen, R.J., Mahmood, F., Gilbert, H., Durr, N.J., Turan, M., 2020. Vr-caps: A virtual environment for capsule endoscopy. arXiv:2008.12949.
  • Jha et al. [2020] Jha, D., Smedsrud, P.H., Riegler, M.A., Halvorsen, P., de Lange, T., Johansen, D., Johansen, H.D., 2020. The Kvasir-SEG Dataset. https://datasets.simula.no/kvasir-seg/.
  • Jiao et al. [2018] Jiao, J., Jiao, J., Mo, Y., Liu, W., Deng, Z., 2018. Magicvo: End-to-end monocular visual odometry through deep bi-directional recurrent convolutional neural network. CoRR abs/1811.10964. URL: http://arxiv.org/abs/1811.10964, arXiv:1811.10964.
  • Koulaouzidis et al. [2017] Koulaouzidis, A., et al., 2017. KID Project: an internet-based digital video atlas of capsule endoscopy for research purposes. Endosc Int Open 5, E477–E483.
  • Kuth et al. [2007] Kuth, R., Reinschke, J., Rockelein, R., 2007. Method for determining the position and orientation of an endoscopy capsule guided through an examination object by using a navigating magnetic field generated by means of a navigation device. US Patent App. 11/481,935.
  • Laina et al. [2016] Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N., 2016. Deeper depth prediction with fully convolutional residual networks. CoRR abs/1606.00373. URL: http://arxiv.org/abs/1606.00373, arXiv:1606.00373.
  • Leonard et al. [2018] Leonard, S., Sinha, A., Reiter, A., Ishii, M., Gallia, G., Taylor, R., Hager, G., 2018. Evaluation and stability analysis of video-based navigation system for functional endoscopic sinus surgery on in-vivo clinical data. IEEE Transactions on Medical Imaging doi:10.1109/TMI.2018.2833868.
  • Lin et al. [2013] Lin, B., Johnson, A., Qian, X., Sanchez, J., Sun, Y., 2013. Simultaneous tracking, 3d reconstruction and deforming point detection for stereoscope guided surgery, in: Augmented Reality Environments for Medical Imaging and Computer-Assisted Interventions. Springer, pp. 35–44.
  • Liu et al. [2016] Liu, F., Shen, C., Lin, G., Reid, I., 2016. Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 2024–2039.
  • Liu [2002] Liu, G., 2002. On velocity estimation using position measurements, in: Proceedings of the 2002 American Control Conference (IEEE Cat. No. CH37301), IEEE. pp. 1115–1120.
  • Liu et al. [2020] Liu, X., Sinha, A., Ishii, M., Hager, G.D., Reiter, A., Taylor, R.H., Unberath, M., 2020. Dense depth estimation in monocular endoscopy with self-supervised learning methods. IEEE Transactions on Medical Imaging 39, 1438–1447.
  • Lu and Lu [2019] Lu, Y., Lu, G., 2019. Deep unsupervised learning for simultaneous visual odometry and depth estimation, in: 2019 IEEE International Conference on Image Processing (ICIP), pp. 2571–2575.
  • Lucas and Kanade [1981] Lucas, B.D., Kanade, T., 1981. An iterative image registration technique with an application to stereo vision. Proceedings of the 7th international joint conference on Artificial intelligence .
  • Mahmood et al. [2018] Mahmood, F., Chen, R., Durr, N.J., 2018. Unsupervised reverse domain adaptation for synthetic medical images via adversarial training. IEEE Transactions on Medical Imaging 37, 2572–2581.
  • Mahmood and Durr [2017] Mahmood, F., Durr, N.J., 2017. Deep learning and conditional random fields-based depth estimation and topographical reconstruction from conventional endoscopy. CoRR abs/1710.11216. URL: http://arxiv.org/abs/1710.11216, arXiv:1710.11216.
  • McCarthy et al. [2008] McCarthy, C., Barnes, N., Mahony, R., 2008. A robust docking strategy for a mobile robot using flow field divergence. IEEE Transactions on Robotics 24, 832–842.
  • Meng et al. [EasyChair, 2018] Meng, X., Fan, C., Ming, Y., EasyChair, 2018. Visual odometry based on convolutional neural networks for large-scale scenes. EasyChair Preprint no. 413. doi:10.29007/mf57.
  • Mirota et al. [2011] Mirota, D.J., Wang, H., Taylor, R.H., Ishii, M., Gallia, G.L., Hager, G.D., 2011. A system for video-based navigation for endoscopic endonasal skull base surgery. IEEE transactions on medical imaging 31, 963–976.
  • Moccia et al. [2018] Moccia, S., et al., 2018. Learning-based classification of informative laryngoscopic frames. Computer Methods and Programs in Biomedicine 158. doi:10.1016/j.cmpb.2018.01.030.
  • Mountney et al. [2006] Mountney, P., Stoyanov, D., Davison, A., Yang, G.Z., 2006. Simultaneous stereoscope localization and soft-tissue mapping for minimal invasive surgery, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 347–354.
  • Penza et al. [2018] Penza, V., Ciullo, A.S., Moccia, S., Mattos, L.S., De Momi, E., 2018. Endoabs dataset: Endoscopic abdominal stereo image dataset for benchmarking 3d stereo reconstruction algorithms. The International Journal of Medical Robotics and Computer Assisted Surgery 14, e1926.
  • Pertuz [June 7, 2020] Pertuz, S., June 7, 2020. Defocus simulation. MATLAB Central File Exchange. URL: https://www.mathworks.com/matlabcentral/fileexchange/55095-defocus-simulation.
  • Ping-Sing and Shah [1994] Ping-Sing, T., Shah, M., 1994. Shape from shading using linear approximation. Image and Vision computing 12, 487–498.
  • Pogorelov et al. [2017] Pogorelov, K., et al., 2017. Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection, in: Proceedings of the 8th ACM on Multimedia Systems Conference, pp. 164–169. doi:10.1145/3083187.3083212.
  • Puglisi [2015] Puglisi, L.J., 2015. On the velocity and acceleration estimation from discrete time-position signal of linear encoders. Journal of Control Engineering and Applied Informatics 17, 30–40.
  • Ranjan et al. [2018] Ranjan, A., Jampani, V., Kim, K., Sun, D., Wulff, J., Black, M.J., 2018. Adversarial collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. CoRR abs/1805.09806. URL: http://arxiv.org/abs/1805.09806, arXiv:1805.09806.
  • Redondo-Cerezo et al. [2014] Redondo-Cerezo, E., Sánchez-Capilla, A., Torre-Rubio, P., Teresa, J., 2014. Wireless capsule endoscopy: Perspectives beyond gastrointestinal bleeding. World journal of gastroenterology : WJG 20, 15664–15673. doi:10.3748/wjg.v20.i42.15664.
  • Shah et al. [2006] Shah, T., Aziz, S.M., Vaithianathan, T., 2006. Development of a tracking algorithm for an in-vivo rf capsule prototype, in: 2006 International Conference on Electrical and Computer Engineering, IEEE. pp. 173–176.
  • Silva et al. [2014] Silva, J.S., Histace, A., Romain, O., Dray, X., Granado, B., 2014. Towards embedded detection of polyps in WCE images for early diagnosis of colorectal cancer. International Journal of Computer Assisted Radiology and Surgery 9, 283–293. doi:10.1007/s11548-013-0926-3.
  • Simaan et al. [2015] Simaan, N., Taylor, R.H., Choset, H., 2015. Intelligent surgical robots with situational awareness. Mechanical Engineering 137, S3–S6.
  • Son et al. [2015] Son, D., Yim, S., Sitti, M., 2015. A 5-d localization method for a magnetically manipulated untethered robot using a 2-d array of hall-effect sensors. IEEE/ASME Transactions on Mechatronics 21, 708–716.
  • Spyrou and Iakovidis [2013] Spyrou, E., Iakovidis, D.K., 2013. Video-based measurements for wireless capsule endoscope tracking. Measurement Science and Technology 25, 015002.
  • Stoyanov et al. [2010] Stoyanov, D., Scarzanella, M.V., Pratt, P., Yang, G.Z., 2010. Real-time stereo reconstruction in robotically assisted minimally invasive surgery, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 275–282.
  • Tajbakhsh et al. [2016] Tajbakhsh, N., Gurudu, S.R., Liang, J., 2016. Automated polyp detection in colonoscopy videos using shape and context information. IEEE Transactions on Medical Imaging 35, 630–644. doi:10.1109/TMI.2015.2487997.
  • Than et al. [2014] Than, T.D., et al., 2014. An effective localization method for robotic endoscopic capsules using multiple positron emission markers. IEEE Transactions on Robotics 30, 1174–1186.
  • Tsai and Lenz [1989] Tsai, R.Y., Lenz, R.K., 1989. A new technique for fully autonomous and efficient 3d robotics hand/eye calibration. IEEE Transactions on Robotics and Automation 5, 345–358. doi:10.1109/70.34770.
  • Turan et al. [2017a] Turan, M., Almalioglu, Y., Araújo, H., Konukoglu, E., Sitti, M., 2017a. Deep endovo: A recurrent convolutional neural network (RCNN) based visual odometry approach for endoscopic capsule robots. CoRR abs/1708.06822. URL: http://arxiv.org/abs/1708.06822, arXiv:1708.06822.
  • Turan et al. [2017b] Turan, M., Almalioglu, Y., Araujo, H., Konukoglu, E., Sitti, M., 2017b. A non-rigid map fusion-based direct slam method for endoscopic capsule robots. International journal of intelligent robotics and applications 1, 399–409.
  • Turan et al. [2017c] Turan, M., Pilavcı, Y., Jamiruddin, R., Araujo, H., Konukoglu, E., Sitti, M., 2017c. A fully dense and globally consistent 3d map reconstruction approach for gi tract to enhance therapeutic relevance of the endoscopic capsule robot.
  • Turan et al. [2018] Turan, M., et al., 2018. Unsupervised odometry and depth learning for endoscopic capsule robots. arXiv preprint arXiv:1803.01047 .
  • Wu et al. [2011] Wu, C., Agarwal, S., Curless, B., Seitz, S.M., 2011. Multicore bundle adjustment, in: In IEEE Conference on Computer Vision and Pattern Recognition (CVPR, IEEE. pp. 3057–3064.
  • Yano and Yamamoto [2009] Yano, T., Yamamoto, H., 2009. Vascular, polypoid, and other lesions of the small bowel. Best practice & research. Clinical gastroenterology 23, 61–74. doi:10.1016/j.bpg.2008.12.001.
  • Ye et al. [2016] Ye, M., Giannarou, S., Meining, A., Yang, G.Z., 2016. Online tracking and retargeting with applications to optical biopsy in gastrointestinal endoscopic examinations. Medical image analysis 30, 144–157.
  • Ye et al. [2017] Ye, M., Johns, E., Handa, A., Zhang, L., Pratt, P., Yang, G.Z., 2017. Self-supervised siamese learning on stereo image pairs for depth estimation in robotic surgery. arXiv preprint arXiv:1705.08260 .
  • Yin and Shi [2018] Yin, Z., Shi, J., 2018. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition , 1983–1992.
  • Zhang et al. [2020] Zhang, Y., Xu, S., Wu, B., Shi, J., Meng, W., Zhang, X., 2020. Unsupervised multi-view constrained convolutional network for accurate depth estimation. IEEE Transactions on Image Processing 29, 7019–7031.
  • Zhang [1999] Zhang, Z., 1999. Flexible camera calibration by viewing a plane from unknown orientations, in: Proceedings of the Seventh IEEE International Conference on Computer Vision, Ieee. pp. 666–673.
  • Zhou et al. [2017] Zhou, T., Brown, M., Snavely, N., Lowe, D.G., 2017. Unsupervised learning of depth and ego-motion from video, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6612–6619.

Appendix A Equipment

Figure A.1: Equipment. The overall equipment for dataset generation. a Franka Emika Panda: motion control device for the cameras. b Capsule Holder: two-piece holder serving as a kit between the WCE cameras and the robotic arm. c MiroCam® Data Belt. d Real Porcine Colon: sewn onto an 'L'-shaped semi-cylindrical scaffold in high-density foam. e MiroCam® MR1100 receiver: digital video grabber for the conversion of analog data into digital form and output to the computer. f PillCam® recorder. g Artec Eva: 3D scanner used to generate ground truth (.ply file). h EinScan Pro 2X: 3D scanner used to generate ground truth (.ply, .obj, .stl and .ASC files). i Wireless Endoscope Camera (YPC-HD720P): high resolution, 1280×720 and HD 640×480. j Endoscope 3-in-1 Camera: low resolution, 640×480. k Camera Holder: specially designed one-piece holder for stabilizing the high- and low-resolution endoscopes on the robotic arm. l PillCam™ COLON2: WCE double-tip camera. m MiroCam® Regular MC1000-W: WCE camera.

Appendix B Dataset Survey

  • 1.

    The KID Dataset is organized by the Medical Decision Support Systems (MDSS) research group of the University of Thessaly. The dataset is divided into two annotated sections. The first section has a total of 77 wireless capsule endoscopy (WCE) images acquired using MiroCam® (IntroMedic Co, Seoul, Korea) capsules and covers several types of abnormalities such as angioectasias, aphthae, chylous cysts and polypoid lesions. The second part consists of 2,371 MiroCam® WCE images. This dataset not only includes small bowel lesions such as polypoid, vascular and inflammatory lesions but also images from healthy esophagus, stomach, small bowel and colon. The Given Imaging Atlas Dataset consists of 20-second video clips recorded using PillCam capsules with a resolution of 576x576 pixels. In this database, 117 WCE video clips have been acquired from the small bowel, 5 from the esophagus and 13 from the colon [Spyrou and Iakovidis, 2013].

  • 2.

    The Kvasir dataset was collected with standard endoscopic equipment at the Vestre Viken (VV) Health Trust in Norway. The initial dataset consists of 4,000 images with eight classes, namely Z-line, pylorus, cecum, esophagitis, polyps, ulcerative colitis, dyed and lifted polyps, and dyed resection margins, each represented by 500 images. All images are annotated and verified by experienced endoscopists [Pogorelov et al., 2017]. Later, the dataset was extended to 8,000 images with the same eight classes [Borgli et al., 2019]. The Kvasir-SEG Dataset is an extension of the Kvasir dataset used for polyp segmentation. It comprises 1,000 polyp images and their corresponding ground truth from the second version of the Kvasir dataset [Jha et al., 2020].

  • 3.

    The Hyper-Kvasir dataset is the largest publicly available dataset related to the gastrointestinal tract, containing 110,079 images (10,662 labeled and 99,417 unlabeled) and 373 videos, making a total of 1.17 million frames. The entire dataset was collected during gastroscopy and colonoscopy examinations in Norway, and 10,662 images are labeled for 23 classes by practitioners [Borgli et al., 2019].

  • 4.

    The NBI-InfFrames dataset includes narrow-band imaging (NBI) endoscopy, which is commonly used as a diagnostic procedure to examine the back of the throat, the glottis, the vocal cords and the larynx. To generate this in vivo dataset, 18 different patients affected by laryngeal spinocellular carcinoma (diagnosed after histopathological examination) were involved. It consists of 180 informative (I), 180 blurred (B), 180 saliva or specular reflection (S) and 180 underexposed (U) frames, for a total of 720 video frames [Moccia et al., 2018].

  • 5.

    The EndoAbS (Endoscopic Abdominal Stereo Images) Dataset consists of 120 sub-datasets of endoscopic stereo images of abdominal organs (e.g., liver, kidney, spleen) with corresponding ground truth acquired via a laser scanner. To create variations in the dataset, frames were recorded under 3 different lighting conditions, in the presence of smoke, and at 2 different distances from the endoscope to the phantom (approximately 5 cm and 10 cm). The main purpose of generating this dataset was to validate 3D reconstruction algorithms for the computer-assisted surgery community [Penza et al., 2018].

  • 6.

    CVC-ColonDB is a database of annotated video sequences consisting of 15 short colonoscopy sequences, each showing one polyp. There are 1,200 different images containing original images, polyp masks, non-informative image masks and polyp contour masks. It can be used for the assessment of polyp detection [Bernal et al., 2012].

  • 7.

    MICCAI 2015 Endoscopic Vision Challenge [Bernal et al., 2017] provides three sub-databases which are CVC-ClinicDB, ETIS-Larib and ASU-Mayo Clinic polyp database and which can be used for polyp detection and localization. CVC-ClinicDB is a cooperative work of the Hospital Clinic and the Computer Vision Center, Barcelona, Spain. It contains 612 images from 31 different sequences. Each image has its annotated ground truth associated, covering the polyp [Bernal et al., 2015]. ETIS-Larib is a database consisting of 300 frames with polyps extracted from colonoscopy videos. Frames and their ground truths are provided by ETIS laboratory, ENSEA, University of Cergy-Pontoise, France [Silva et al., 2014]. The ASU-Mayo Clinic polyp database was acquired as a cooperative work of Arizona State University and Mayo Clinic, USA. It consists of 20 short colonoscopy videos (22,701 frames) with different resolution ranges and different area coverage values for training purposes. Each frame in its training dataset comes with a ground truth image or a binary mask that indicates the polyp region. In addition, it contains 18 videos without annotation for testing purposes [Tajbakhsh et al., 2016].

  • 8.

    The Hamlyn Centre Laparoscopic/Endoscopic Video Dataset consists of 37 subsets. The Gastrointestinal Endoscopic Dataset includes 10 videos and consists of 7,894 images with a size of 2.5 GB, collected during standard gastrointestinal examinations. The dataset includes images for polyp detection, localization and optical biopsy retargeting. Apart from the endoscopy dataset, one of the laparoscopy datasets, intended for depth estimation, contains approximately 40,000 pairs of rectified stereo images collected during partial nephrectomy with the da Vinci surgical robot. Its primary use has been training and testing deep learning networks for disparity (inverse depth) estimation [Ye et al., 2016], [Ye et al., 2017].

  • 9.

    The ROBUST-MIS Challenge provides a dataset created at Heidelberg University Hospital, Germany, during rectal resection and proctocolectomy surgeries. Videos from 30 minimally invasive surgical procedures covering three different types of surgery, together with 10,040 standard endoscopic image frames extracted from these 30 procedures, form the basis of this challenge. The images were acquired using a laparoscopic camera (Karl Storz Image 1) with a 30° optic and a resolution of 1920x1080 pixels. The images are then downscaled to 960x540 pixels and annotated with numbers indicating the absence or presence of medical instruments [Rob, ].

Appendix C Camera Calibration

For the coordinate transformation between the robot pose data and the capsule cameras, the hand-eye calibration procedure was repeated with two different checkerboards: one with 2×2 mm squares and one with 1.5×1.5 mm squares, both patterns with 8×7 squares in total. Four images of each checkerboard were acquired from different camera poses. For the pose conversions, only the checkerboard images from the MiroCam capsule were used, since the support structure is the same for both capsules (PillCam and MiroCam). Similarly, to calculate the transformation between the gripper holding the HighCam-LowCam and the camera positions, the same procedure was repeated using a checkerboard with 10.2×10.2 mm squares.

The Tsai and Lenz algorithm [Tsai and Lenz, 1989] was tested with 24 combinations of the 4 chessboard images in Fig. C.1. The transformation between a point $\mathbf{X}_c$ in the reference frame of the camera and a point $\mathbf{X}_g$ in the reference frame of the gripper is, thus, given by

$\mathbf{X}_{g}=\mathbf{R}^{c}_{g}\,\mathbf{X}_{c}+\mathbf{t}^{c}_{g}$ (16)

with the rotation matrices and translation vectors given in Table  C.1.

Table C.1: Robot Pose to Camera Transformation. The rotation matrices and translation vectors for the MiroCam, HighCam and LowCam used to apply the transformation given in Eqn. 16. These values are provided as a .txt file and as a .mat file in the calibration folders of the EndoSLAM Dataset.
Camera Rotation $\mathbf{R}^{c}_{g}$ Translation $\mathbf{t}^{c}_{g}$ (mm)
MiroCam $\begin{bmatrix}-0.9366&-0.3242&-0.1325\\ 0.1738&-0.1017&-0.9795\\ 0.3041&-0.9405&0.1516\end{bmatrix}$ $\begin{bmatrix}2.9793\\ -27.0224\\ 72.1070\end{bmatrix}$
HighCam $\begin{bmatrix}0.9463&-0.0921&-0.3098\\ -0.1389&0.7495&-0.6472\\ 0.2918&-0.6555&0.8965\end{bmatrix}$ $\begin{bmatrix}-46.2017\\ 20.9074\\ 94.6349\end{bmatrix}$
LowCam $\begin{bmatrix}0.8294&0.5577&0.0322\\ -0.5586&0.8286&0.0379\\ -0.0056&-0.0495&0.9988\end{bmatrix}$ $\begin{bmatrix}6.0169\\ 39.5114\\ 101.6431\end{bmatrix}$
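As a small usage example, the sketch below applies Eqn. 16 with the MiroCam values from Table C.1 to map a point from the camera frame to the gripper frame; the test point itself is arbitrary.

```python
import numpy as np

# MiroCam hand-eye parameters from Table C.1 (translation in mm).
R_cg = np.array([[-0.9366, -0.3242, -0.1325],
                 [ 0.1738, -0.1017, -0.9795],
                 [ 0.3041, -0.9405,  0.1516]])
t_cg = np.array([2.9793, -27.0224, 72.1070])

def camera_to_gripper(X_c: np.ndarray) -> np.ndarray:
    """Eqn. 16: X_g = R^c_g X_c + t^c_g."""
    return R_cg @ X_c + t_cg

# Arbitrary point expressed in the camera frame (mm).
X_c = np.array([10.0, 0.0, 50.0])
print("Point in gripper frame (mm):", camera_to_gripper(X_c))
```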

For a detailed description of the vision-based calibration technique and its MATLAB R2020a implementation, visit https://github.com/CapsuleEndoscope/EndoSLAM. Calibration was performed for both the MiroCam and PillCam capsules using images of a planar checkerboard with 8×7 squares of dimension 2×2 mm, and for the HighCam and LowCam using 8×7 squares of dimension 12.8×12.8 mm. The calibration checkerboard was printed using a laser printer and then glued onto the surface of a glass plate to ensure the planarity of the pattern.

The practical distance and orientation range at which the calibration checkerboard can be placed is limited by the low resolution and depth of field of the cameras. For each camera, 10 calibration images were used with the pattern placed at different poses. The average distance from the camera was approximately 10 mm for the capsule cameras. Fig. C.1 shows examples of the calibration images.
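A minimal OpenCV sketch of this kind of pinhole calibration is shown below; the image path, the use of OpenCV (rather than the MATLAB toolbox referenced above) and the 2 mm square size are illustrative assumptions. Note that an 8×7-square checkerboard exposes 7×6 inner corners, which is what the corner detector expects.

```python
import glob
import cv2
import numpy as np

# An 8x7-square checkerboard has 7x6 inner corners; 2 mm squares assumed here.
pattern = (7, 6)
square = 2.0  # mm
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_points, img_points, img_size = [], [], None
for path in glob.glob("calibration_images/*.png"):      # placeholder path
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)
        img_size = gray.shape[::-1]

rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, img_size, None, None)
print("Reprojection RMS error:", rms)
print("Intrinsic matrix:\n", K)
print("Distortion coefficients:", dist.ravel())
```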

Figure C.1: Camera Intrinsic-Extrinsic Calibration Images. Examples of planar checkerboard calibration images obtained by a MiroCam, b PillCam, c HighCam and d LowCam. The chessboards were printed with a laser printer and then glued onto a planar glass surface to ensure the planarity of the pattern. Since the dataset is recorded in a dark room, the chessboard images were taken under the same environmental conditions.
Figure C.2: Reprojection errors associated with the camera calibrations. The reprojection errors under the pinhole camera assumption for a the MiroCam, b the PillCam front-facing camera (Cam1) and c the PillCam backwards-facing camera (Cam2). d-f Reprojection errors for the same devices under the fisheye model assumption.
Figure C.3: Correction of lens distortions. Examples of correcting lens distortions using the camera parameters given in Table C.2 for images acquired by the PillCam and MiroCam. a Original 8×7 checkerboard image with 2×2 mm squares obtained by the PillCam, b undistorted checkerboard image with the pinhole calibration parameters, c undistorted checkerboard image with the fisheye parameters, d newspaper image, rich in texture details, taken by the frontal camera of the PillCam, e undistorted counterpart of the newspaper image with the parameters calculated under the fisheye camera assumption. Similarly, f original Colon-III image from the MiroCam and g its undistorted version using the parameters of the fisheye calibration model.

Table C.2 summarizes the estimated intrinsic parameters for each of the calibration models. Note that the PillCam capsule has both a front-facing and a backwards-facing camera. That dual-camera setup can lead to interesting novel visual SLAM approaches that make simultaneous use of front- and backwards-facing cameras. The reprojection errors associated with each calibration can be seen in Fig. C.2. Fig. C.3 shows examples of applying the estimated intrinsic parameters to correct the lens distortion effects on capsule images and produce undistorted pinhole images. Note that eight images were used due to the limited operational volume in which the chessboard can be placed; outside that volume the images become less focused or blurry, or, if the board is too far, the corners become undetectable due to the low image quality and resolution.

Table C.2: Intrinsic parameters for the MiroCam, PillCam, HighCam and LowCam under the pinhole assumption. Since the effects mentioned in Sec. 2.4 are directly related to the camera lens properties, the intrinsic parameters of the modified images also change accordingly. For the mixture of Gaussian blur, vignetting and resizing effects, the new parameters are given in the High-Modified and Low-Modified columns.
HighCam LowCam High-Modified Low-Modified MiroCam PillCam Cam1 PillCam Cam2
H x W 480×640 480×640 400×400 250×250 320×320 256×256 256×256
Pinhole Focal length f_x 957.4119 816.8598 603.5105 317.6319 156.0418 74.2002 76.0535
f_y 959.3861 814.8223 807.6887 423.1068 155.7529 74.4184 75.4967
Skew s 5.6242 0.2072 4.2831 -0.3334 0 0 0
Optical center c_x 282.1921 308.2864 173.7160 121.3764 178.5604 129.9724 130.9419
c_y 170.7316 158.3971 133.7022 82.5754 181.8043 129.1209 128.4882
Radial dist. coef. k_1 0.2533 0.2345 0.2645 0.2265 -0.2486 0.1994 0.1985
k_2 -0.2085 -0.7908 -0.4186 -0.8877 0.0614 -0.1279 -0.1317
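As a usage example, the following sketch undistorts a MiroCam frame with the pinhole parameters listed in Table C.2; the radial terms k1 and k2 are taken from the table, the tangential terms are assumed to be zero, and the image path is a placeholder.

```python
import cv2
import numpy as np

# MiroCam pinhole intrinsics from Table C.2 (zero skew).
K = np.array([[156.0418,   0.0,    178.5604],
              [  0.0,    155.7529, 181.8043],
              [  0.0,      0.0,      1.0   ]])
# Distortion coefficients [k1, k2, p1, p2]; tangential terms assumed zero.
dist = np.array([-0.2486, 0.0614, 0.0, 0.0])

img = cv2.imread("mirocam_frame.png")                  # placeholder 320x320 frame
undistorted = cv2.undistort(img, K, dist)
cv2.imwrite("mirocam_frame_undistorted.png", undistorted)
```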

Appendix D Temporal-Synchronization

The apparent velocity field in the image, i.e. the optical flow, is the projection of the 3D velocity of the scene (with respect to the camera) and is, in general, dependent on scene depth [Honegger et al., 2012]. In the dataset, the camera moves roughly in a straight line along its optical axis and with limited rotation for most of the trajectory. The scene is also relatively uniform and symmetric in terms of relative depth. To estimate this "forward" motion, the divergence of the flow vector field can be used [Ho et al., 2017, McCarthy et al., 2008]. To provide insight, Fig. D.1a shows the divergent optical flow field that would be obtained as a pinhole camera moves towards a fronto-parallel plane. Fig. D.1b shows an example, obtained from the dataset, of divergent optical flow in image areas of high contrast.

The divergence at an image point $(x,y)$ is given by

$D(x,y)=\frac{\partial u(x,y)}{\partial x}+\frac{\partial v(x,y)}{\partial y}$ (17)

where $u$ and $v$ are the velocity components of the optical flow field and $\partial$ denotes a partial derivative. The divergence measurement is averaged across all points, yielding a single estimated value for each image. MATLAB was used to compute the optical flow and the flow divergence, using the method in [Lucas and Kanade, 1981].
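The sketch below computes a mean flow-divergence value (Eq. 17) per frame pair with OpenCV; note that it uses Farnebäck dense flow as a stand-in for the Lucas-Kanade implementation used in MATLAB, and the frame paths are placeholders.

```python
import cv2
import numpy as np

def mean_flow_divergence(prev_gray: np.ndarray, next_gray: np.ndarray) -> float:
    """Average divergence (Eq. 17) of the dense optical flow between two frames."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    u, v = flow[..., 0], flow[..., 1]
    du_dx = np.gradient(u, axis=1)   # du/dx, x being the column direction
    dv_dy = np.gradient(v, axis=0)   # dv/dy, y being the row direction
    return float(np.mean(du_dx + dv_dy))

# Placeholder paths for two consecutive frames of a sequence.
prev_gray = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
next_gray = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)
print("Mean flow divergence:", mean_flow_divergence(prev_gray, next_gray))
```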

The robot encoder data provides the camera pose along the trajectory. The linear velocity is computed by applying a finite difference on the position data, followed by a low-pass filter [Liu, 2002, Puglisi, 2015]. Specifically, the camera velocity $v$ is obtained from

$v_{k}=\left\lVert\frac{\mathbf{X}_{k}-\mathbf{X}_{k-1}}{T}\right\rVert$ (18)

where $\mathbf{X}$ denotes the 3D position vector, $T$ denotes the sampling period, and the subscript $k$ indexes the sample instant. A low-pass Butterworth filter (with a cutoff frequency of 300 Hz) is then applied to the velocity measurement in order to reduce noise.
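A minimal SciPy sketch of this velocity computation is given below, assuming the robot positions are available in millimetres as an N×3 array sampled at 1 kHz; the filter order and the synthetic test trajectory are illustrative choices.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def robot_speed(positions: np.ndarray, fs: float = 1000.0) -> np.ndarray:
    """Speed magnitude (Eq. 18) from 3D positions, low-pass Butterworth filtered."""
    T = 1.0 / fs
    v = np.linalg.norm(np.diff(positions, axis=0), axis=1) / T   # mm/s if positions in mm
    b, a = butter(2, 300.0, btype="low", fs=fs)                  # 300 Hz cutoff
    return filtfilt(b, a, v)

# Example with a synthetic straight-line trajectory sampled at 1 kHz.
t = np.arange(0.0, 2.0, 0.001)
positions = np.stack([10.0 * t, np.zeros_like(t), np.zeros_like(t)], axis=1)
print("Mean speed [mm/s]:", robot_speed(positions).mean())
```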

The optical flow divergence and camera velocity measurements are correlated along the time axis to determine the best alignment. As an example, Fig. D.1c shows the camera velocity (magnitude) during the sixth sequence of the MiroCam recordings, calculated from the robot position data and estimated from the divergence of the optical flow field in the images. Both signals are shown already synchronized. Fig. D.1d shows a detail of the plot corresponding to the end of the trajectory, when the camera stops moving.
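The alignment itself can be computed by cross-correlating the two normalized signals and taking the lag of maximum correlation, as in the sketch below; the signals here are synthetic placeholders assumed to have been resampled to a common rate.

```python
import numpy as np

def best_lag(sig: np.ndarray, ref: np.ndarray) -> int:
    """Number of samples by which `sig` lags `ref` (peak of cross-correlation).

    Both signals are assumed to share a common sampling rate and are
    z-normalized so that only their shape drives the alignment.
    """
    a = (sig - sig.mean()) / (sig.std() + 1e-9)
    b = (ref - ref.mean()) / (ref.std() + 1e-9)
    corr = np.correlate(a, b, mode="full")
    return int(np.argmax(corr) - (len(b) - 1))

# Synthetic example: the 'speed' signal is a 25-sample delayed copy of 'divergence'.
rng = np.random.default_rng(0)
divergence = rng.standard_normal(500)
speed = np.roll(divergence, 25)
print("Estimated offset (samples):", best_lag(speed, divergence))  # ~25
```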

Tables D.1 and D.2 summarize the temporal synchronization for all trajectories in the dataset. They provide a correspondence between the start frame of each sequence and the matching sampling instant of the robot pose data.

Table D.1: Temporal synchronization. Correspondence, for each sequence of each organ, between the first frame of the trajectory for both HighCam and LowCam and the matching sample instant of the robot data with 1kHz recording frequency.
Camera Robot
Organ Trajectory HighCam Start Frame LowCam Start Frame HighCam Sample LowCam Sample
Colon-IV 1 741 393 35,295 15,845
2 44 128 2,561 2,561
3 69 82 3,975 3,975
4 138 120 15,792 15,092
5 99 144 1,270 3,270
SmallIntestine 1 149 95 5,162 4,512
2 133 112 4,913 2,763
3 186 144 6,095 7,845
4 121 79 3,205 3,205
5 138 105 3,807 3,307
Stom-I 1 60 135 4,443 8,093
2 111 144 4,177 2,277
3 71 447 6,058 19,008
4 47 316 2,839 13,289
Stom-II 1 255 125 9,641 5,141
2 1 2 3,358 3,358
3 150 83 5,797 2,247
4 78 85 2,742 4,192
Stom-III 1 195 89 6,746 2,846
2 302 108 1,523 2,725
3 387 105 17,261 2,861
4 125 60 4,451 2,101
Table D.2: Temporal synchronization. Correspondence, for each sequence, between the first frame of the trajectory and the matching sample instant (sample number) of the robot data. Note that, in the PillCam capsule, Cam1 (front-facing camera) and Cam2 (backward-facing camera) trigger alternately, one after the other, at equally spaced time intervals. The values indicated in the table correspond to Cam1.
Camera Robot
Sequence start frame framerate sample instant sampl. freq.
Mirocam 1 336 3 fps 72,050 1kHz
2 153 3 fps 961 1kHz
3 321 3 fps 47,667 1kHz
4 143 3 fps 33,943 1kHz
5 254 3 fps 2,886 1kHz
6 134 3 fps 3,044 1kHz
Pillcam "L" 1,127 0.117 fps 15,800 1kHz
"Z" 815 0.117 fps 11,650 1kHz
Figure D.1: Optical flow and temporal synchronization of robot data and images. The divergence of the optical flow field is used to estimate the forward motion of the camera. a The divergent optical flow field obtained as a pinhole camera moves towards a fronto-parallel plane. b Image from the MiroCam with divergent optical flow detected in image areas of high contrast. c The camera velocity for the sixth sequence of the MiroCam recordings, obtained from the robot position data (a low-pass filter was used to reduce noise) and estimated from the divergence of the optical flow field in the images. The vertical scale is normalized for both measurements. The signals are correlated and aligned to obtain temporal synchronization. d Detail of the plot corresponding to the end of the trajectory, when the camera movement stops.

Appendix E Data Tree Structure

Figure E.1: Data Tree. The EndoSLAM dataset is divided into four main parts: Cameras, Cameras (Modified), 3D_Scanners and Unity Camera. The Cameras and Cameras (Modified) folders include four subfolders: PillCam, MiroCam, HighCam and LowCam. Each of them branches out into calibration and organs subfolders. The calibration subfolder comprises intrinsic-extrinsic camera parameters and the corresponding calibration sessions, whereas the organs subfolder includes the images and poses of each trajectory. In addition to the Cameras section, the modified part includes sample sub-trajectories exemplifying the effects of image modification functions such as fish-eye distortion, Gaussian blur, vignetting, resizing and depth of field. The 3D_Scanners folder consists of reconstructed 3D figures (.fig), point cloud data (.ply), the surface geometry of three-dimensional objects without any colour or texture representation (.STL), the position of each vertex representing the 3D geometry (.obj) and ASCII-formatted point cloud data (.ASC). Finally, the Unity Camera folder includes synthetically generated images, pixelwise depths and corresponding poses.

Appendix F Dataset Analysis

Figure F.1: Sample frames from the EndoSLAM Dataset. Images are acquired by a a MiroCam capsule endoscope, b the frontal camera of a PillCam, c the HighCam, d the LowCam, and e the virtually generated UnityCam. The ex-vivo part of the dataset offers the opportunity to test the robustness of pose estimation algorithms with images coming from various endoscope cameras. Since the EndoSLAM dataset contains both real and simulated frames, it is also a suitable platform for developing domain adaptation algorithms.
Table F.1: Motion Analysis. Statistics for the robot poses matching the frames of the HighCam. For all trajectories of each organ, the count of robot sample instances together with the mean, standard deviation (std), first quantile (1st QT), median, third quantile (3rd QT), minimum and maximum of speed [mm/s] and acceleration [mm/s²] are given.
Speed Stomach-I Stomach-II Stomach-III Small Intestine Colon-IV
frame count 4695 3302 3230 6487 3697
mean[mm/s] 18.256 19.471 20.031 16.764 17.123
std[mm/s] 22.497 16.809 16.697 14.210 12.660
1st QT 5.931 7.606 7.055 5.684 7.658
median 14.642 16.021 16.489 13.849 15.096
3rd QT 25.32 26.64 28.68 24.342 24.324
min[mm/s] 0.02 0.028 0.02 0 0.007
max[mm/s] 25.32 140.898 116.984 104.08 104.759
Acceleration mean[mm/s²] 359.843 382.928 383.829 328.568 326.241
std 450.939 337.08 336.098 284.423 257.08
1st QT 110.408 140.982 111.71 103.129 122.807
median 284.729 314.784 315.012 269.316 283.728
3rd QT 501.31 528.799 556.451 477.991 469.113
min[mm/s²] 0.4 0.0 0.0 0.0 0.015
max[mm/s²] 14,680.15 2,817.962 2,339.683 2,079.994 2,095.182
Table F.2: Motion Analysis. Statistics for the robot poses matching the frames of the LowCam. For all trajectories of each organ, the count of robot sample instances together with the mean, standard deviation (std), first quantile (1st QT), median, third quantile (3rd QT), minimum and maximum of speed [mm/s] and acceleration [mm/s²] are given.
Speed Stomach-I Stomach-II Stomach-III Small Intestine Colon-IV
frame count 2302 2799 3900 5098 3857
mean[mm/s] 15.599 18.928 25.97 17.918 17.144
std[mm/s] 12.855 14.431 21.564 14.764 12.882
1st QT 5.407 8.259 10.789 6.126 7.401
median 13.18 15.871 21.284 15.322 15.148
3rd QT 22.956 26.436 35.763 26.146 24.455
min[mm/s] 0.02 0.0 0.02 0.0 0.028
max[mm/s] 79.042 103.254 286.68 97.315 106.271
Acceleration mean[mm/s²] 279.254 378.346 519.361 334.373 355.941
std 253.777 288.769 431.327 295.646 259.482
1st QT 66.573 164.972 215.786 119.299 130.256
median 221.297 317.344 425.678 303.582 291.994
3rd QT 428.253 528.695 715.263 520.839 482.864
min[mm/s²] 0.4 0.0 0.0 0.0 0.015
max[mm/s²] 1,580.846 2,065.071 5,733.593 1,946.305 2,125.42
Figure F.2: Image Modifications. a Resize: the size of the images, width×height, from left to right is 400×400, 300×300, 200×200, 150×150, 100×100 and 50×50. b Gaussian Blur: the convolution filter sizes (α) are 5×5, 5×5, 7×7, 11×11, 13×13 and 13×13, the standard deviations of the Gaussian distribution (β) are 5, 15, 20, 40, 70 and 100, and the numbers of filtering repetitions (γ) are 5, 5, 5, 7, 7 and 7. c Depth of Field effects for the focus positions 0.0821, 0.1785, 0.2428, 0.3392, 0.3714 and 0.4678. d Fish Eye distortion for discarding ratios ν of 1, 0.95, 0.85, 0.8, 0.75 and 0.7.
Table F.3: The Classification of Trajectories. The recorded trajectories for each organ are divided into two groups based on the tumorous properties of the tissue: tumor-containing and tumor-free. Approximately 10% of all trajectories are tumorous, which may be useful for segmentation and disease classification tasks.
Organs Tumor-free Trajectory # Tumor-containing Trajectory #
Colon-I I,II,III, IV
Colon-II I,III,IV,V VI,VII
Colon-III I,II III
Colon-IV I,II,III,IV,V -
Stomach-I I,II,III,IV -
Stomach-II I,II,III,IV -
Stomach-III I,II,III,IV -
Small Intestine I,II,III,IV,V -
Table F.4: 3D Point Count Data. The point cloud counts in the 3D_Scanner folder containing six polygon (.ply) files. Colon-III is scanned by the Artec 3D Eva with a precision of 0.1 mm; Colon-IV, Small Intestine and Stomach-I, -II and -III are scanned by the Shining 3D EinScan Pro 2X with a precision of 0.05 mm.
Organ 3D Point Count Scanner Precision
Colon-IV 2,106,046 3D EinScan Pro 2x 0.05 mm
Small Intestine 2,193,364 3D EinScan Pro 2x 0.05 mm
Stomach-I 2,597,906 3D EinScan Pro 2x 0.05 mm
Stomach-II 5,729,625 3D EinScan Pro 2x 0.05 mm
Stomach-III 2,234,849 3D EinScan Pro 2x 0.05 mm
Colon-III 151,846 Artec 3D Eva 0.10 mm
Figure F.3: Depth Evaluation of Point Cloud Data. The frequency distribution of depth values, in mm, for a Colon-I scanned by the Artec Eva 3D scanner, b Colon-IV, c Small Intestine, and d,e,f Stomach-I, II and III, all scanned by the EinScan Pro 2X.
Figure F.4: Motion Analysis Histograms. The frequency distributions of the positional differences between two consecutive frames along the a x, b y and c z axes, and of the rotational differences about the d x, e y and f z axes in terms of Euler angles.

Appendix G Results

Figure G.1: Pose Estimations on Unity Trajectories. In contrast to the real ex-vivo recordings, the synthetically generated trajectories are more straightforward and easier to follow, which results in an increase in the performance of all methods. However, SC-SfMLearner and Endo-SfMLearner track the route with higher accuracy thanks to the geometry consistency loss. Even though all algorithms are trained on the synthetic colon images, Monodepth2 and SfMLearner face the same problem as on the real trajectories. For all synthetically generated trajectories, Endo-SfMLearner exhibits the lowest mean absolute trajectory error (ATE). Although Monodepth2 and SfMLearner quantitatively have lower rotational errors, this cannot be interpreted as superior performance: since the rotations cannot be changed frequently and easily while recording clear images in the Unity environment, they remain close to the identity matrix, which is what Monodepth2 and SfMLearner generally predict.