
MBAPose: Mask and Bounding-Box Aware Pose Estimation of Surgical Instruments with Photorealistic Domain Randomization

Masakazu Yoshimura, Murilo M. Marinho, Kanako Harada, and Mamoru Mitsuishi

This work was supported by JSPS KAKENHI Grant Number JP19K14935. (Corresponding author: Murilo M. Marinho)

The authors are with the Department of Mechanical Engineering, the University of Tokyo, Tokyo, Japan. Emails: {m.yoshimura, murilo, kanako, mamoru}@nml.t.u-tokyo.ac.jp.
Abstract

Surgical robots are usually controlled using a priori models based on the robots’ geometric parameters, which are calibrated before the surgical procedure. One of the challenges in using robots in real surgical settings is that those parameters can change over time, consequently deteriorating control accuracy. In this context, our group has been investigating online calibration strategies without added sensors. In one step toward that goal, we have developed an algorithm to estimate the pose of the instruments’ shafts in endoscopic images. In this study, we build upon that earlier work and propose a new framework to more precisely estimate the pose of a rigid surgical instrument. Our strategy is based on a novel pose estimation model called MBAPose and the use of synthetic training data. Our experiments demonstrated an improvement of 21 % for translation error and 26 % for orientation error on synthetic test data with respect to our previous work. Results with real test data provide a baseline for further research.

I Introduction

Robot-assisted surgery is one of the promising contributions of robotics to the medical field. Robots enable surgeons to perform complex manipulations with robotic instruments that have multiple degrees of freedom at the tip. In this context, we have developed and validated a teleoperated surgical robot, SmartArm [1], with the aim of conducting procedures in constrained workspaces.

One of the applications of the SmartArm system is endoscopic transsphenoidal surgery, wherein instruments are inserted through the nostrils to remove tumors in the pituitary gland or at the base of the skull. Owing to the narrow endonasal workspace, suturing the dura mater after tumor removal is especially difficult and this task can highly benefit from robotic aid. We have developed a control strategy based on the kinematic model of the robots to make suturing in a narrow workspace possible by automatically preventing collisions between robots, instruments, and the patient [2, 3].

Even with a careful offline calibration of the robots’ parameters, there is a mismatch of a few millimeters between the kinematically calculated pose (pose here indicates combined position and orientation) and the real pose of the instruments. This mismatch tends to increase during surgery, for instance, owing to changes in temperature and interactions among the instruments and tissues. This mismatch has also been reported in related literature [4, 5] with the da Vinci Surgical System (Intuitive Surgical, USA), a commercially used surgical robot.

Aiming for an online calibration strategy, our group is investigating the use of synthetic images for training deep-learning algorithms to consistently address the requirements of data-hungry algorithms that currently dominate state-of-the-art object pose detection. We previously proposed a pose estimation method for the instruments’ shafts that uses monocular endoscopic images [6], in which the model was trained and validated with synthetic images. In another study, we have investigated physically-based rendering (PBR) as a means of creating photorealistic images [7]. We herein build upon [6, 7] and propose a novel framework for the pose estimation of rigid instruments with PBR-rendered images. Moreover, an improved pose estimation model is proposed.

I-A Related Works

Figure 1: Our proposed dataset that blends photorealistic rendering and domain randomization optimized for endoscopic views.

The biggest challenge in the use of synthetic images for training deep-learning algorithms is the so-called domain gap. In the context of this work, domain gap is considered as the difference in the performance of the algorithm trained with only synthetic images when it is evaluated against real images. Bridging the domain gap is an open research problem and is being investigated on many fronts, such as in improved rendering, domain adaptation, and domain randomization strategies.

I-A1 Rendering

One of the solutions for the domain gap is to render photorealistic images. To that end, PBR has been used in many contexts [8, 9, 10, 7, 11]. PBR yields visually plausible images by solving physics-inspired equations. Nonetheless, creating 3D models for all elements in the workspace can be time-consuming, especially when a large variety of images are required for training.

To partially address this difficulty, the cut-and-paste method [12] was proposed. Synthetic image variety is increased by cutting rendered objects and pasting them over real background images. Many works have used this strategy [13, 14, 15, 6, 16], but models trained with it generalize to real images relatively poorly when compared with PBR. Our prior work [7] has also provided evidence of this in a surgical-robot context.

I-A2 Domain Adaptation

Other works [17, 18, 19, 20] have proposed domain adaptation methods in which the style of the real image is transferred to the synthetic image using adversarial training methods [21]. With the state-of-the-art domain adaptation method for car detection [20], the average precision metric was further improved by 4.6% when compared to PBR. However, domain adaptation methods might be unstable and sub-optimal depending on the dataset [18].

I-A3 Domain Randomization

Another method to compensate for the domain gap is a type of data augmentation strategy called domain randomization (DR) [22]. The concept behind DR is to widen the synthetic data domain by adding randomized variations possibly without contextual plausibility or photorealism. DR has been used, for instance, in robot picking [22], robot pose estimation [16], and destination detection for drone racing [15]. The quality of algorithms trained with DR is competitive to PBR in some contexts, despite using cut-and-paste rendering [12]. The major interest in DR is to know what type of randomization is effective in a particular context. In previous methods, the following characteristics have been investigated: intensity of lights [14, 22, 16, 15], position of lights [14, 22, 16], number of lights [14, 22], type of lights [14], object textures [14, 22, 15], shape of the target objects [15], addition of distractors [22, 14, 15, 16], and randomized background images for cut-and-paste rendering [16, 14, 15].

I-A4 Pose Estimation

Several studies have proposed instrument pose estimation using deep learning. Most existing methods are two-stage: they estimate 2D information, such as keypoints [23] or the projected corner points of 3D bounding boxes [24], in a first stage and then run an optimization algorithm on that information in a second stage to obtain the instrument’s pose. Our strategy is single-stage and directly predicts the pose of the surgical instrument.

I-B Statement of Contributions

In this work, we build upon our previous work [6] and improve from the pose estimation of the instrument’s shaft to pose estimation of the instruments’ tips. Moreover, we (1) improved the PBR rendering strategy, (2) studied the effects of different DR strategies and proposed two new randomized components, (3) used an improved pose estimation model, and (4) discussed how to practically obtain a real instrument pose dataset and provided results of validation using real data.

To the best of our knowledge, this is the first work to combine those strategies and validate them in the pose estimation of rigid surgical instruments.

II Problem Statement

Figure 2: Robot setup (left), its inner view from a 70° endoscope (middle), and the definition of the pose (right).

The target application for our method is robot-assisted endoscopic transsphenoidal surgery, as shown in Fig. 2. For practical reasons, we attached one 3 mm 2-degrees-of-freedom (2-DoF) forceps to each robotic arm (in contrast with the hollow shafts used in [6]). These forceps are inserted through the nostrils of an anatomically realistic head model (BionicBrain [25]). The workspace is viewed through a 70° endoscope (Endoarm, Olympus, Japan) whose perspective angle is 95°.

In this study, our target is to estimate the pose of the instruments relative to the endoscope. The reference frame of each instrument is defined at the distal end of the instrument’s shaft, with the z-axis along the shaft.

The approximate range-of-motion of the instruments in the endonasal workspace is listed in Table I.

TABLE I: Possible range of the instrument pose relative to the endoscope.
Dimension | Range
x [mm] | -20 to 20
y [mm] | -20 to 20
z [mm] | 10 to 35
roll [degree] | 50 to 90
pitch [degree] | -40 to 40
yaw [degree] | 0 to 360
gripper [degree] | 0 to 60
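
As a concrete illustration of how these ranges can be used when generating randomized synthetic scenes (Section IV-B1), the following minimal Python sketch samples one instrument pose uniformly within the bounds of Table I; the function and dictionary names are our own illustrative choices, not part of the original pipeline.

```python
import random

# Ranges from Table I: (min, max) per pose dimension.
POSE_RANGES = {
    "x_mm": (-20.0, 20.0),
    "y_mm": (-20.0, 20.0),
    "z_mm": (10.0, 35.0),
    "roll_deg": (50.0, 90.0),
    "pitch_deg": (-40.0, 40.0),
    "yaw_deg": (0.0, 360.0),
    "gripper_deg": (0.0, 60.0),
}

def sample_instrument_pose(rng=random):
    """Draw one instrument pose uniformly within the endonasal workspace ranges."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in POSE_RANGES.items()}

if __name__ == "__main__":
    print(sample_instrument_pose())
```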

III Methodology

Our pose estimation strategy is based on three main points—a realistic rendering strategy, a DR strategy, and a single-stage pose estimation model.

III-A Proposed realistic rendering strategy

Figure 3: Samples rendered with the proposed realistic rendering strategy.

The proposed realistic rendering strategy combines elements of PBR and the cut-and-paste method. Light reflection, diffusion, and absorption are calculated faithfully with PBR using ray tracing [26] and the microfacet concept [27], in the same manner as in previous PBR methods [9, 8, 10, 11]. The background is not rendered using PBR; instead, as described in our previous work [6], we render a cube whose faces are real background images. The reflection of the surroundings on the instruments is rendered by tracing photons between the cube’s faces and the instruments’ surface.

In a direct comparison with [6], we improved several aspects of the rendering. First, we applied a bidirectional scattering distribution function (BSDF) with the Oren–Nayar reflectance model [28] on the cube’s surfaces so that they reflect light in various directions. This was necessary because we simplified the surroundings as a cube. A BSDF with the GGX microfacet distribution [29] (GGX is the name of the distribution itself and does not appear to be an acronym) was used on the instruments’ surface to obtain realistic reflections. Finally, we positioned a black plane with a hole in front of the camera to reproduce the circular endoscopic view. The plane was small and placed close to the camera so as to not interfere with the ray tracing.
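
As an illustration of this material setup, the sketch below configures the two BSDFs through Blender's Python API (bpy). Node and socket names follow Blender's Cycles conventions; the specific roughness values are illustrative assumptions and not the parameters used for our dataset.

```python
import bpy

def make_cube_background_material(image_path):
    """Diffuse BSDF for the background cube; a nonzero roughness makes Cycles
    use the Oren-Nayar model instead of pure Lambertian diffusion."""
    mat = bpy.data.materials.new("cube_background")
    mat.use_nodes = True
    nodes, links = mat.node_tree.nodes, mat.node_tree.links
    nodes.clear()
    tex = nodes.new("ShaderNodeTexImage")
    tex.image = bpy.data.images.load(image_path)
    diffuse = nodes.new("ShaderNodeBsdfDiffuse")
    diffuse.inputs["Roughness"].default_value = 0.8   # illustrative value
    out = nodes.new("ShaderNodeOutputMaterial")
    links.new(tex.outputs["Color"], diffuse.inputs["Color"])
    links.new(diffuse.outputs["BSDF"], out.inputs["Surface"])
    return mat

def make_instrument_material():
    """Glossy BSDF with the GGX microfacet distribution for the metallic shaft."""
    mat = bpy.data.materials.new("instrument_metal")
    mat.use_nodes = True
    nodes, links = mat.node_tree.nodes, mat.node_tree.links
    nodes.clear()
    glossy = nodes.new("ShaderNodeBsdfGlossy")
    glossy.distribution = "GGX"
    glossy.inputs["Roughness"].default_value = 0.15   # illustrative value
    out = nodes.new("ShaderNodeOutputMaterial")
    links.new(glossy.outputs["BSDF"], out.inputs["Surface"])
    return mat
```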

Examples of synthetic images created with the proposed rendering method are shown in Fig. 3.

III-B Domain Randomization

Figure 4: Effect of the DR components (dr1–dr11; see Table III). The components marked red are new methods.

DR methods have usually been applied together with classic or simple rendering. We herein combine PBR with DR and propose two further DR strategies.

We use the following previously proposed DR strategies: we randomize the position, intensity, number, and type of lights. In addition, the real background images used for the cubic texture are randomly selected. For the background images, we separately evaluated three options: only real images of the experimental environment; images of the experimental environment plus their augmented versions; or these plus non-contextual images. As for the surgical instruments, we subdivided the previously proposed texture randomization [22] into color randomization and glossiness randomization for the metallic surface. We were also interested in analyzing the effect of randomizing the textures for each individual instrument separately.

In addition to these well-known DR strategies, we propose a further surface randomization method. It randomizes the appearance of the instrument’s surface using normal mapping [30], a strategy that changes the apparent shape of an object’s surface by perturbing its normal vectors at rendering time. We randomly created concave and convex surface imperfections by changing the normal maps.
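
A minimal sketch of this idea follows, assuming the scratches are drawn as random lines on a height map whose gradients are converted into a tangent-space normal map; the use of OpenCV/NumPy and all parameter values are illustrative.

```python
import numpy as np
import cv2

def random_scratch_normal_map(size=512, n_scratches=40, strength=2.0, seed=None):
    """Create a tangent-space normal map with random scratch-like grooves."""
    rng = np.random.default_rng(seed)
    height = np.zeros((size, size), np.float32)
    for _ in range(n_scratches):
        p0 = tuple(rng.integers(0, size, 2).tolist())
        p1 = tuple(rng.integers(0, size, 2).tolist())
        depth = float(rng.uniform(0.2, 1.0))
        cv2.line(height, p0, p1, depth, thickness=int(rng.integers(1, 3)))
    height = cv2.GaussianBlur(height, (5, 5), 0)

    # Surface normal from the height-field gradients.
    dzdx = cv2.Sobel(height, cv2.CV_32F, 1, 0, ksize=3)
    dzdy = cv2.Sobel(height, cv2.CV_32F, 0, 1, ksize=3)
    normal = np.dstack([-strength * dzdx, -strength * dzdy, np.ones_like(height)])
    normal /= np.linalg.norm(normal, axis=2, keepdims=True)

    # Map from [-1, 1] to the usual 8-bit normal-map encoding.
    return ((normal * 0.5 + 0.5) * 255).astype(np.uint8)
```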

The performance of each strategy in the context of this work is evaluated in comprehensive ablation studies in the experiments section.

III-C Pose Estimation Model

Figure 5: Proposed model architecture: a ResNet34 first encoder, a decoder with a segmentation (mask guidance) output, and an extended ResNet18 second encoder whose features feed the Box-Aware Head, which predicts the bounding box, class, and pose. The Box-Aware Head is the Architecture-D model defined in [6].

We propose a new pose estimation model, named MBAPose, that predicts the pose of the instruments while being aware of their mask and bounding-box shape. The model combines two parts: a prediction head and a feature extraction model.

The prediction head, which we call the Box-Aware Head, is the one we proposed in [6]. It uses the features obtained from the feature extraction model to simultaneously predict the bounding box, class, and pose. By internally feeding its own bounding-box and class predictions into the pose prediction, this architecture enhances pose estimation.

The feature extraction model is novel and is composed of a double-encoder model with mask guidance. The double-encoder model itself comprises two smaller encoder models and one decoder model. We propose this feature extractor because our previous model [6] had difficulties in detecting small objects, as shown in [31]. The change was inspired by DSSD [32], which improved the detection of small objects by adding a decoder network. Differently from DSSD, we further add a second encoder. The aim is to create a mask-aware feature by adding a loss function through an additional convolution at the head of the decoder model during training. By adding a segmentation loss, the model estimates the pose of the objects while being aware of their contours. With the proposed double encoder, the contour information can be used more explicitly than in other approaches [33].
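
To make the data flow concrete, the following is a minimal PyTorch sketch of the double-encoder feature extractor with mask guidance. Layer sizes, the decoder design, and module names are simplifying assumptions rather than the exact implementation, and the Box-Aware Head from [6] is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class DoubleEncoderFeatureExtractor(nn.Module):
    def __init__(self, decoder_channels=16):
        super().__init__()
        # First encoder: ResNet34 trunk (everything before global pooling).
        resnet34 = torchvision.models.resnet34()
        self.encoder1 = nn.Sequential(*list(resnet34.children())[:-2])  # 512 ch, /32

        # Decoder: upsample back toward the input resolution with deconvolutions.
        chans = [512, 256, 128, 64, 32, decoder_channels]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                       nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
        self.decoder = nn.Sequential(*layers)

        # Mask-guidance head: 1x1 conv + sigmoid, supervised only at training time.
        self.mask_head = nn.Sequential(nn.Conv2d(decoder_channels, 1, 1), nn.Sigmoid())

        # Second encoder: a ResNet18 trunk fed with [input image, decoder features].
        resnet18 = torchvision.models.resnet18()
        resnet18.conv1 = nn.Conv2d(3 + decoder_channels, 64, 7,
                                   stride=2, padding=3, bias=False)
        self.encoder2 = nn.Sequential(*list(resnet18.children())[:-2])  # 512 ch map

    def forward(self, image):
        feat = self.encoder1(image)
        dec = self.decoder(feat)
        # The decoder output may differ slightly from the input size; align it.
        dec = F.interpolate(dec, size=image.shape[-2:], mode="bilinear",
                            align_corners=False)
        mask = self.mask_head(dec)                      # used by the segmentation loss
        feat2 = self.encoder2(torch.cat([image, dec], dim=1))
        return feat2, mask
```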

III-D Test dataset with real instrument’s image and pose information

Figure 6: Setup for test dataset acquisition using an optical tracker; optical markers are attached to the instrument and the endoscope.

One challenge we faced in our prior work [6] was the difficulty in creating a reliable dataset for testing the model with the image and pose information of the real instruments. In this work, we overcame those difficulties and created a test dataset as follows.

The pose of the real instruments was acquired using an optical tracker (Polaris Vega, NDI, Canada) attached to its optical marker at the base of the instruments and endoscope, as shown in Fig. 6 using 3D-printed attachments. However, even small mounting and fabrication errors induce a considerably large error at the tip of the instrument. To calibrate for the mounting error, we propose a practical markerless hand-eye calibration method using the contour of instruments in the view. The method is as follows.

The present hand-eye calibration problem can be represented using homogenous-transformation matrices as 𝑨𝑿=𝒁𝑪𝑩\boldsymbol{AX}=\boldsymbol{ZCB}, where 𝑨\boldsymbol{A}, 𝑩\boldsymbol{B}, 𝑪\boldsymbol{C}, 𝑿\boldsymbol{X}, and 𝒁\boldsymbol{Z} denote the transformation from the instrument’s tip pose to the endoscopic lens pose, the instrument’s maker pose with respect to the world, the endoscope’s marker pose with respect to the world, the transformation from the instruments’ maker pose to the instrument’s tip pose, and the transformation from the endoscope’s maker pose to the endoscopic lens pose. 𝑩\boldsymbol{B} and 𝑪\boldsymbol{C} are acquired from the optical tracker. We have the ideal values for 𝑿\boldsymbol{X} and 𝒁\boldsymbol{Z} using our CAD models; however, they are corrupted by the mounting errors and consequently our calibration target. 𝑨\boldsymbol{A} is usually estimated with a checkerboard; however, in our case, we do not obtain 𝑨\boldsymbol{A} explicitly because installing a checkerboard pattern on the tip of the instrument is not feasible. Instead, we indirectly obtain 𝑨\boldsymbol{A} using the contour of instruments on the images.

The present hand-eye calibration problem can be represented using homogeneous transformation matrices as $\boldsymbol{AX}=\boldsymbol{ZCB}$, where $\boldsymbol{A}$, $\boldsymbol{B}$, $\boldsymbol{C}$, $\boldsymbol{X}$, and $\boldsymbol{Z}$ denote, respectively, the transformation from the instrument's tip pose to the endoscopic lens pose, the instrument's marker pose with respect to the world, the endoscope's marker pose with respect to the world, the transformation from the instrument's marker pose to the instrument's tip pose, and the transformation from the endoscope's marker pose to the endoscopic lens pose. $\boldsymbol{B}$ and $\boldsymbol{C}$ are acquired from the optical tracker. We have ideal values for $\boldsymbol{X}$ and $\boldsymbol{Z}$ from our CAD models; however, they are corrupted by the mounting errors and are consequently our calibration targets. $\boldsymbol{A}$ is usually estimated with a checkerboard; in our case, however, we do not obtain $\boldsymbol{A}$ explicitly because installing a checkerboard pattern on the tip of the instrument is not feasible. Instead, we indirectly obtain $\boldsymbol{A}$ using the contours of the instruments in the images.

We search for $\boldsymbol{X}$ and $\boldsymbol{Z}$ by iteratively comparing manually annotated contours on the real images with the contours of the projected 3D CAD models given the current estimates of $\boldsymbol{X}$ and $\boldsymbol{Z}$. To do so, we use the following objective function

O = \sum_{n}^{N} \left\{ \alpha O_{iou}\left(x_{n}^{gt}, x_{n}^{proj}\right) + \left(1-\alpha\right) O_{edge}\left(x_{n}^{gt}, x_{n}^{proj}\right) \right\}, \qquad (1)

where $x_{n}^{gt}$ and $x_{n}^{proj}$ are the $n$-th contours of the ground truth and of the projected CAD model, respectively. $O_{iou}$ is the intersection over union (IoU), and $O_{edge}$ is a metric that quantifies edge overlap, defined as

O_{edge}\left(x_{gt}, x_{proj}\right) = \frac{\sum_{u}\sum_{v} p_{gt}\left(u,v\right) \, p_{proj}\left(u,v\right)}{\sum_{u}\sum_{v} p_{gt}\left(u,v\right)^{2}},

where $p(u,v)$ are the distance-transformed pixel values, defined as

p\left(u,v\right) = \left(d_{max} - \min\left(d_{nearest}\left(u,v\right), d_{max}\right)\right)^{2},

where $d_{nearest}(u,v)$ is the distance to the nearest edge and $d_{max}$ is the maximum distance. With this heuristic, we search for the $\boldsymbol{X}$ and $\boldsymbol{Z}$ that maximize $O$ using a ternary search, as shown in Fig. 7.
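
For clarity, a minimal sketch of the overlap objective in (1) is shown below, assuming the annotated and projected contours are available as binary masks and edge maps and using the hyperparameter values reported in Section IV-A (alpha = 0.8, d_max = 10); the function names and the use of OpenCV's distance transform are our own illustrative choices.

```python
import numpy as np
import cv2

def truncated_distance_weight(edge_map, d_max=10.0):
    """p(u, v) = (d_max - min(d_nearest(u, v), d_max))^2 from a binary edge map."""
    # distanceTransform measures the distance to the nearest zero pixel,
    # so edge pixels must be the zeros of the input.
    non_edge = np.where(edge_map > 0, 0, 255).astype(np.uint8)
    d_nearest = cv2.distanceTransform(non_edge, cv2.DIST_L2, 3)
    return (d_max - np.minimum(d_nearest, d_max)) ** 2

def overlap_objective(gt_masks, proj_masks, gt_edges, proj_edges,
                      alpha=0.8, d_max=10.0):
    """Eq. (1): weighted sum of mask IoU and edge overlap over N contour pairs."""
    total = 0.0
    for m_gt, m_pr, e_gt, e_pr in zip(gt_masks, proj_masks, gt_edges, proj_edges):
        inter = np.logical_and(m_gt, m_pr).sum()
        union = np.logical_or(m_gt, m_pr).sum()
        o_iou = inter / union if union > 0 else 0.0

        p_gt = truncated_distance_weight(e_gt, d_max)
        p_pr = truncated_distance_weight(e_pr, d_max)
        o_edge = (p_gt * p_pr).sum() / (p_gt * p_gt).sum()

        total += alpha * o_iou + (1.0 - alpha) * o_edge
    return total
```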

Figure 7: The proposed hand-eye calibration method. The CAD model is projected according to the current estimates of the corrections to X and Z, the IoU and edge-overlap scores are computed against the annotated contours, and the corrections are updated iteratively.

IV Evaluation

First, we evaluate the effect of the proposed hand-eye calibration, which is used to create test data with improved ground-truth real poses. Then, we investigate each component of the DR method and the proposed rendering method. Third, we evaluate the proposed MBAPose. Finally, the pose estimation performance using all proposed methods together is reported.

IV-A Evaluation of the Proposed Hand-Eye Calibration

In this section, we briefly test the performance of our practical hand-eye calibration.

First, we prepared three sequences of endoscopic images paired with the optical tracker’s pose estimates, synchronized at a rate of 20 Hz. In each sequence, the instrument was manually moved inside the head model. Only one instrument was used at this stage because the attachments and markers make it impossible to insert a second instrument simultaneously.

We calibrated the endoscopic camera with the method proposed by Zhang [34]. A total of 5301 data pairs were prepared. Given that the contours have to be manually annotated, we used 20 images from two sequences to estimate the mounting error. The hyperparameters $\alpha$ and $d_{max}$ were set to 0.8 and 10, respectively. The mounting errors were searched in the interval [-1, 1] mm for each translation dimension and [-1, 1] degree for each orientation dimension. These are reasonable boundaries given our fabrication and mounting accuracy.

We tested with another 20 manually annotated images from the unused sequence using the 2D projection overlap metric (1). The results are summarized in Table II. The results on the test images suggest that the mounting errors decreased owing to our method. The maximum estimated mounting error was 0.6 mm and 0.5 degree. Nonetheless, there was still around 2-3 mm of error between the projections. We believe the remaining error has two sources. First, the ternary search cannot always reach the optimal value and, in our case, depends on the hyperparameters. Second, the optical tracker itself has an associated accuracy, and our best estimates can only be as good as the tracker’s. The tracker can estimate the position of each passive marker with a 0.12 mm RMS error, and that error is amplified by assembly and fabrication errors in the adapters between the markers and the tips of the instrument and the endoscope.

TABLE II: 2D projection overlap metric (1) with and without the proposed hand-eye calibration.
Images used for optimization | Yes | No
Without calibration | 0.33 | 0.47
With calibration | 0.71 | 0.59

*Larger values mean better accuracy.

IV-B Dataset Creation

IV-B1 Training Dataset

We rendered 10,000 images for each DR component using the open-source rendering software Blender (Blender Foundation, Netherlands). The datasets were rendered not as sequences but as independent scenes with fully randomized poses. Because we fixed the random seeds, the rendered instruments had the same poses across every dataset; therefore, any difference in performance between DR cases stems only from the DR component itself. Even when no DR components were used, 754 contextual images of the head model (Fig. 2) were used as textures for the cube. When non-contextual images were also used as textures, images from 2840 real endoscopic transsphenoidal surgery scenes, images from the COCO dataset [35], and the 754 contextual images were mixed with a ratio of 0.25, 0.5, and 0.25, respectively. The other randomized parameters are listed in Table III. The image size was 299×299.

IV-B2 Test Dataset

The 5301 images in our calibrated real dataset (described in Section III-D) were used for testing.

IV-C Evaluation Metrics

We used the translation error, the centerline angle error, the mean average precision (mAP), and the detected rate as performance metrics. The centerline angle error is the angle between the estimated instrument-shaft centerline and the real one.
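
For reference, a minimal sketch of the two pose metrics, assuming each pose provides a tip position and a unit vector along the shaft centerline:

```python
import numpy as np

def translation_error(t_est, t_gt):
    """Euclidean distance between estimated and ground-truth tip positions [mm]."""
    return float(np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt)))

def centerline_angle_error(z_est, z_gt):
    """Angle [degrees] between the estimated and real shaft centerlines."""
    z_est = np.asarray(z_est) / np.linalg.norm(z_est)
    z_gt = np.asarray(z_gt) / np.linalg.norm(z_gt)
    cos_angle = np.clip(np.dot(z_est, z_gt), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))
```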

IV-D Ablation study for the DR components

In this section, the effect of each DR component is evaluated in our target application. We trained the model proposed in [6] for 120,000 iterations with a batch size of 20 and the one-cycle learning-rate policy [36], wherein the learning rate began at 10^{-5}, increased to 10^{-4}, and subsequently decreased to 5×10^{-7}.

The pose map has seven channels corresponding to the x, y, and z translations, the roll, pitch, and yaw rotations, and the gripper angle. The pose loss was weighted with empirical factors of 1, 1, 1.5, 2, 2, 0.1, and 0.1, respectively.
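
As a minimal sketch of this weighting (assuming an element-wise L1 pose loss evaluated only at valid cells of the pose map; the exact loss form in [6] may differ):

```python
import torch

# Channel order in the pose map: x, y, z, roll, pitch, yaw, gripper angle.
POSE_WEIGHTS = torch.tensor([1.0, 1.0, 1.5, 2.0, 2.0, 0.1, 0.1]).view(1, 7, 1, 1)

def weighted_pose_loss(pred, target, valid_mask):
    """L1 loss over the 7-channel pose map, weighted per channel and averaged
    over the spatial cells marked as valid (e.g. cells containing an object)."""
    diff = torch.abs(pred - target) * POSE_WEIGHTS.to(pred.device)  # (B, 7, H, W)
    diff = diff.sum(dim=1) * valid_mask                              # (B, H, W)
    return diff.sum() / valid_mask.sum().clamp(min=1)
```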

We used the following data augmentation strategies in all cases: random contrast, random hue, random saturation, and additive noise augmentation [6]. We did not use real images for training and instead concentrated on the effects of each synthetic dataset.

The results of this ablation study are summarized in Fig. 8. Interestingly, some DR strategies deteriorated the generalizability of the model to real images. For instance, randomizing the instruments’ glossiness strongly deteriorated generalizability, which suggests that metallic reflection is important for detecting the instruments and estimating their pose. Non-contextual images used as background textures improved all metrics. In addition, the proposed normal-map-based random scratch generation improved all metrics, although further randomization with non-contextual convex and concave shapes deteriorated the performance.

TABLE III: Randomized detail of each DR component.
Method | Randomized item | Detail
dr1 | light position | behind the camera
dr2 | light intensity (Blender intensity parameter) | 5 to 1×10^6
dr3 | light number | 1 to 2
dr4 | light kind | point, sun, area, hemi
dr5 | texture augmentation | crop, rotate, color jitter
dr6 | texture, non-contextual | see Section IV-B1
dr7 | instrument color | hue, brightness
dr8 | instrument glossiness | full range
dr9 | instrument, per part | 50%
dr10 | instrument scratch | a normal map of scratches plus augmentation
dr11 | instrument, non-contextual convex | 19 added non-contextual normal maps
Figure 8: Ablation study for the DR components (dr1–dr11). The performance gain per metric (translation, orientation, mAP, detected rate, and their average) is shown.

IV-E Evaluation of the Overall Synthetic Data Generation

For this evaluation, we selected the DR components that, on average, improved the metrics in the ablation study. We compared our strategy with the cut-and-paste method [12], without Poisson Blending.

The evaluation results are summarized in Table IV. The proposed renderer performed better than the cut-and-paste method, which suggests that ray-traced reflections are important for pose estimation in our context.

Furthermore, adding the optimized DR components improved the detected rate and the translation error, although the orientation error increased. DR thus improves the object detection rate, whereas for pose estimation it was only effective when the correct DR components were used.

TABLE IV: Performance comparison over the real endoscopic images.
Method | Tran. [mm] | Orien. [degree] | Detected rate [%]
Cut and paste [12] | 7.19 | 9.35 | 47.95
Our renderer | 5.29 | 6.23 | 69.67
Our renderer + all DR | 5.88 | 6.64 | 73.44
Our renderer + optimized DR | 2.80 | 8.04 | 73.44

Note that the ground truth of the real data had around 2-3 mm of error even after the hand-eye calibration.

IV-F Evaluation of the Pose Estimation Model

The performance of the proposed model was also evaluated. For this purpose, 10,000 photorealistic synthetic images without DR were used as test data, and 90,000 synthetic images with all DR components were used as training data. The other training settings were the same as those described in Section IV-D.

The implementation details of the pose estimation model were as follows. ResNet34 [37] was used as the first encoder. The output of the decoder had 16 channels and was concatenated with the original input image before being fed to the second encoder. The second encoder was based on ResNet18 but extended with two additional layers to detect instruments when they occupy most of the image, for instance, when they are close to the endoscope. Each additional layer comprised two residual blocks with a channel size of 512. The loss function was based on the multi-task loss defined in [6], with a binary cross-entropy loss for mask guidance added with a factor of 4.0. To allow a comparison with [6], in which only the shaft pose was estimated, the orientation angle error does not consider the z-axis rotation. In this experiment, we also added a rigid transformation augmentation [31], which rotates the images together with a suitable rotation of the paired instrument poses. We rotated in the range of [-15, 15] degrees with 80% probability. The inference speed was measured with Python 3.6 using PyTorch 1.4 and CUDA 10.0 on Ubuntu 18.04 with a Titan RTX graphics card.
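
A minimal sketch of the rigid rotation augmentation follows, under the assumption that rotating the image about the principal point corresponds to rotating the camera about its optical axis, so the paired pose is pre-multiplied by the matching z-axis rotation; helper names are illustrative, and the sign convention may need to be flipped depending on the image axis conventions.

```python
import numpy as np
import cv2

def rotate_image_and_pose(image, T_cam_instrument, angle_deg, principal_point):
    """Rotate the endoscopic image and update the paired instrument pose.

    image: HxWx3 array; T_cam_instrument: 4x4 pose of the instrument in the
    camera frame; principal_point: (cx, cy) in pixels.
    """
    # Rotate the image content about the principal point.
    M = cv2.getRotationMatrix2D(principal_point, angle_deg, 1.0)
    rotated = cv2.warpAffine(image, M, (image.shape[1], image.shape[0]))

    # Equivalent rigid transformation: rotation of the camera frame about the
    # optical (z) axis by the same angle. Depending on the image y-axis
    # convention, the sign of the angle here may need to be flipped.
    a = np.deg2rad(angle_deg)
    R_z = np.array([[np.cos(a), -np.sin(a), 0.0, 0.0],
                    [np.sin(a),  np.cos(a), 0.0, 0.0],
                    [0.0,        0.0,       1.0, 0.0],
                    [0.0,        0.0,       0.0, 1.0]])
    return rotated, R_z @ T_cam_instrument
```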

The results are summarized in Table V. The proposed double-encoder model is fast and accurate. Including the mask guidance further improved pose estimation, and the rigid geometric augmentation improved the results further. The 90,000 synthetic images used can be considered a small number, given that 8.6×10^9 images would be needed if a training dataset were created at intervals of 1 mm and 2 degrees. Therefore, the proposed method could estimate the pose from a relatively coarse sampling of the training data.

TABLE V: Evaluation of the proposed model architecture against synthetic test data.
Model | Tran. [mm] | Orien. [degree] | Detected rate [%] | fps
[6] | 1.11 | 3.58 | 99.66 | 41.0
+ double encoder | 0.99 | 3.11 | 99.80 | 79.4
+ mask guidance | 0.97 | 3.03 | 99.81 | 79.4
+ rigid geo. aug. [31] | 0.88 | 2.66 | 99.84 | 79.4

IV-G Overall Performance Evaluation

Finally, we report the best performance, obtained with the proposed PBR, the optimized DR, and the proposed pose estimation model, on the pose estimation of real images. In total, 90,000 synthetic images with optimized DR were used as training data. Moreover, the additional training method using a small number of pose-free real images [6] was also applied: the model trained with synthetic data was further trained with the same synthetic data plus 50 pose-free real images for 10,000 iterations, with each batch comprising 2 real images and 18 synthetic images. For the real images, all losses except the pose loss were used. To allow a comparison with [6], in which only the shaft pose was estimated, the orientation angle error does not consider the z-axis rotation.
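
A minimal sketch of how the pose loss could be disabled for the pose-free real images within each mixed batch (the structure and names are our own assumptions, not the exact implementation):

```python
import torch

def mixed_batch_loss(per_image_losses, is_real):
    """Combine per-image losses for a batch mixing synthetic and pose-free real images.

    per_image_losses: dict with (batch,)-shaped tensors for "detection",
    "mask", and "pose" losses; is_real: (batch,) boolean tensor.
    Real images contribute every term except the pose loss.
    """
    is_synth = (~is_real).float()
    det = per_image_losses["detection"].mean()
    mask = per_image_losses["mask"].mean()
    pose = (per_image_losses["pose"] * is_synth).sum() / is_synth.sum().clamp(min=1)
    return det + mask + pose
```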

The results are summarized in Table VI. Pose estimation improved on both synthetic and real images. In qualitative terms, the 2D projections of the poses estimated with the proposed method often seemed more accurate than those of the ground-truth poses, as shown in Fig. 9. The actual errors on real images might therefore be smaller than the values in Table VI, but we currently do not have a better sensor to test that hypothesis. The creation of a higher-accuracy real dataset is left for future work.

TABLE VI: Pose estimation performance by mixing all proposed methods.
Test data | Additional training [6] | Tran. [mm] | Orien. [degree] | Detected rate [%]
Synthetic | No | 0.84 | 2.50 | 99.84
Real | No | 2.61 | 7.43 | 74.36
Real | Yes | 2.18 | 5.11 | 82.57

Note that the ground truth of the real data had around 2-3 mm of error even after the hand-eye calibration.

Figure 9: Examples of 2D projections using the pose estimated by our model. The upper images are synthetic images, the middle ones are real endoscopic images, and the lower ones are the corresponding ground truth.

V Conclusions

In this paper, we proposed a pose estimation method for surgical instruments. We achieved more precise pose estimation than in previous works by combining three methods: a PBR strategy with simplified surroundings, DR strategies relevant to our particular application, and a new pose estimation network.

In future works, we plan to use sequential video information to increase the robustness of our predictions. Moreover, we will use the proposed method in the online calibration of surgical instruments.

References

  • [1] M. M. Marinho, K. Harada, A. Morita, and M. Mitsuishi, “SmartArm: Integration and validation of a versatile surgical robotic system for constrained workspaces,” The International Journal of Medical Robotics and Computer Assisted Surgery, 2020, (in press).
  • [2] M. M. Marinho, B. V. Adorno, K. Harada, K. Deie, A. Deguet, P. Kazanzides, R. H. Taylor, and M. Mitsuishi, “A unified framework for the teleoperation of surgical robots in constrained workspaces,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, may 2019, pp. 2721–2727.
  • [3] M. M. Marinho, B. V. Adorno, K. Harada, and M. Mitsuishi, “Dynamic active constraints for surgical robots using vector-field inequalities,” IEEE Transactions on Robotics, vol. 35, no. 5, pp. 1166–1185, oct 2019.
  • [4] A. Reiter, P. K. Allen, and T. Zhao, “Appearance learning for 3d tracking of robotic surgical tools,” The International Journal of Robotics Research, vol. 33, no. 2, pp. 342–356, 2014.
  • [5] R. Moccia, C. Iacono, B. Siciliano, and F. Ficuciello, “Vision-based dynamic virtual fixtures for tools collision avoidance in robotic surgery,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 1650–1655, apr 2020.
  • [6] M. Yoshimura, M. M. Marinho, K. Harada, and M. Mitsuishi, “Single-shot pose estimation of surgical robot instruments’ shafts from monocular endoscopic images,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2020, pp. 9960–9966.
  • [7] S. A. H. Perez, M. M. Marinho, K. Harada, and M. Mitsuishi, “The effects of different levels of realism on the training of CNNs with only synthetic images for the semantic segmentation of robotic instruments in a head phantom,” International Journal of Computer Assisted Radiology and Surgery, vol. 15, no. 8, pp. 1257–1265, may 2020.
  • [8] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig, “Virtual worlds as proxy for multi-object tracking analysis,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4340–4349.
  • [9] M. Wrenninge and J. Unger, “Synscapes: A photorealistic synthetic dataset for street scene parsing,” arXiv preprint arXiv:1810.08705, 2018.
  • [10] T. Hodaň, V. Vineet, R. Gal, E. Shalev, J. Hanzelka, T. Connell, P. Urbina, S. N. Sinha, and B. Guenter, “Photorealistic image synthesis for object instance detection,” in 2019 IEEE International Conference on Image Processing (ICIP).   IEEE, 2019, pp. 66–70.
  • [11] P. F. Proença and Y. Gao, “Deep learning for spacecraft pose estimation from photorealistic rendering,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2020, pp. 6007–6013.
  • [12] D. Dwibedi, I. Misra, and M. Hebert, “Cut, paste and learn: Surprisingly easy synthesis for instance detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1301–1310.
  • [13] W. Kehl, F. Manhardt, F. Tombari, S. Ilic, and N. Navab, “Ssd-6d: Making rgb-based 3d detection and 6d pose estimation great again,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1521–1529.
  • [14] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. Birchfield, “Training deep networks with synthetic data: Bridging the reality gap by domain randomization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 969–977.
  • [15] A. Loquercio, E. Kaufmann, R. Ranftl, A. Dosovitskiy, V. Koltun, and D. Scaramuzza, “Deep drone racing: From simulation to reality with domain randomization,” IEEE Transactions on Robotics, vol. 36, no. 1, pp. 1–14, 2019.
  • [16] T. E. Lee, J. Tremblay, T. To, J. Cheng, T. Mosier, O. Kroemer, D. Fox, and S. Birchfield, “Camera-to-robot pose estimation from a single image,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2020, pp. 9426–9432.
  • [17] K. Saito, Y. Ushiku, T. Harada, and K. Saenko, “Strong-weak distribution alignment for adaptive object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6956–6965.
  • [18] H.-K. Hsu, C.-H. Yao, Y.-H. Tsai, W.-C. Hung, H.-Y. Tseng, M. Singh, and M.-H. Yang, “Progressive domain adaptation for object detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 749–757.
  • [19] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell, “Cycada: Cycle-consistent adversarial domain adaptation,” in International conference on machine learning.   PMLR, 2018, pp. 1989–1998.
  • [20] P. Su, K. Wang, X. Zeng, S. Tang, D. Chen, D. Qiu, and X. Wang, “Adapting object detectors with conditional domain normalization,” in European Conference on Computer Vision.   Springer, 2020, pp. 403–419.
  • [21] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” arXiv preprint arXiv:1406.2661, 2014.
  • [22] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS).   IEEE, 2017, pp. 23–30.
  • [23] S. Peng, Y. Liu, Q. Huang, X. Zhou, and H. Bao, “Pvnet: Pixel-wise voting network for 6dof pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4561–4570.
  • [24] Y. Bukschat and M. Vetter, “Efficientpose–an efficient, accurate and scalable end-to-end 6d multi object pose estimation approach,” arXiv preprint arXiv:2011.04307, 2020.
  • [25] T. Masuda, S. Omata, A. Morita, T. Kin, N. Saito, J. Yamashita, K. Chinzei, A. Hasegawa, T. Fukuda, M. Mitsuishi, K. Harada, S. Adachi, and F. Arai, “Bionic-brain: Training of endoscopic endonasal skull base surgery,” in 2019 The Proceedings of JSME annual Conference on Robotics and Mechatronics (Robomec).   The Japan Society of Mechanical Engineers, jun 2019, pp. 2P2–R08.
  • [26] J. T. Kajiya, “The rendering equation,” in Proceedings of the 13th annual conference on Computer graphics and interactive techniques, 1986, pp. 143–150.
  • [27] R. L. Cook and K. E. Torrance, “A reflectance model for computer graphics,” ACM Transactions on Graphics (ToG), vol. 1, no. 1, pp. 7–24, 1982.
  • [28] M. Oren and S. K. Nayar, “Generalization of lambert’s reflectance model,” in Proceedings of the 21st annual conference on Computer graphics and interactive techniques, 1994, pp. 239–246.
  • [29] B. Walter, S. R. Marschner, H. Li, and K. E. Torrance, “Microfacet models for refraction through rough surfaces.” Rendering techniques, vol. 2007, p. 18th, 2007.
  • [30] P. Cignoni, C. Montani, C. Rocchini, and R. Scopigno, “A general method for preserving attribute values on simplified meshes,” in Proceedings Visualization’98 (Cat. No. 98CB36276).   IEEE, 1998, pp. 59–66.
  • [31] M. Yoshimura, M. M. Marinho, K. Harada, and M. Mitsuishi, “On the single-shot pose estimation of objects using an improved ssd6d architecture,” in 29th Annual Congress of Japan Society of Computer Aided Surgery, 2020, pp. 308–309.
  • [32] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, “Dssd: Deconvolutional single shot detector,” arXiv preprint arXiv:1701.06659, 2017.
  • [33] Y. Hu, J. Hugonot, P. Fua, and M. Salzmann, “Segmentation-driven 6d object pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3385–3394.
  • [34] Z. Zhang, “A flexible new technique for camera calibration,” IEEE Transactions on pattern analysis and machine intelligence, vol. 22, no. 11, pp. 1330–1334, 2000.
  • [35] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision.   Springer, 2014, pp. 740–755.
  • [36] L. N. Smith, “A disciplined approach to neural network hyper-parameters: Part 1–learning rate, batch size, momentum, and weight decay,” arXiv preprint arXiv:1803.09820, 2018.
  • [37] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European Conference on Computer Vision, 2016.