UACANet: Uncertainty Augmented Context Attention for Polyp Segmentation
Abstract.
We propose the Uncertainty Augmented Context Attention network (UACANet) for polyp segmentation, which considers the uncertain area of the saliency map. We construct a modified U-Net-shaped network with additional encoders and a decoder, compute a saliency map in each prediction module of the bottom-up stream, and propagate it to the next prediction module. In each prediction module, the previously predicted saliency map is used to compute foreground, background and uncertain area maps, and the feature map is aggregated with these three area maps into a representation for each area. We then compute the relation between each representation and each pixel in the feature map. We conduct experiments on five popular polyp segmentation benchmarks, Kvasir, CVC-ClinicDB, ETIS, CVC-ColonDB and CVC-300, and achieve state-of-the-art performance. In particular, we achieve 76.6% mean Dice on the ETIS dataset, a 13.8% improvement over the previous state-of-the-art method. Source code is publicly available at https://github.com/plemeri/UACANet
1. Introduction
Image segmentation is one of the fundamental and challenging topics in computer vision; it aims to classify each pixel of a given image. Recent studies adapt image segmentation to specific domains such as salient object detection, which focuses on classifying whether each pixel belongs to the most salient object or not. Similarly, these techniques can be applied for medical purposes.
Medical image segmentation is a widely used technique for tasks such as segmenting organs in tomography images, e.g., pancreas segmentation (Oktay et al., 2018), detecting cells in microscopy images (Ronneberger et al., 2015), or discriminating abnormal regions from normal regions of the body, such as brain tumor (Haghighi et al., 2020) or polyp segmentation (Fan et al., 2020).
Polyps are abnormal tissue growths on a body surface and can be found in the colon, rectum, stomach or even throat. In most cases polyps are benign, meaning they do not indicate illness or malignancy. However, because polyps are potentially cancerous, they require long-term monitoring of their size and location and of whether they have become malignant. Thus, detecting polyps in colonoscopy images aids the early diagnosis of polyp-related diseases.
Previous polyp segmentation networks usually adopt methodologies from salient object detection (SOD), since the two tasks share the same main interest: attending more to the salient (polyp) region than to the surrounding scene. Current state-of-the-art SOD methods with strong performance rely heavily on edge guidance (Yang et al., 2017; Su et al., 2019). However, acquiring additional edge data is often expensive. Reverse attention (Chen et al., 2018) suggests using the reverse saliency map to obtain boundary cues, but since the boundary region corresponds to ambiguous saliency scores, the saliency map without the reverse operation already contains such boundary information.
In this paper, we propose the Uncertainty Augmented Context Attention network (UACANet), which augments the uncertain area of the saliency map, a region closely related to boundary information. Our method computes the region with ambiguous saliency scores and combines it with the foreground and background areas in a context attention module. On top of a modified U-Net (Ronneberger et al., 2015) structure with additional encoders and a decoder, we aggregate the feature map over the three areas with weighted summation to obtain a representative context vector for each area. We then compute the similarity between each context vector and the feature map. We validate our method on five popular polyp segmentation benchmarks, Kvasir, CVC-ClinicDB, ETIS, CVC-ColonDB and CVC-300, and achieve state-of-the-art performance among previous methods.
2. Related Work
2.1. Semantic Segmentation
Fully Convolutional Network (FCN) (Long et al., 2015) introduced a fully convolutional architecture to efficiently train a model to classify each pixel, and utilized a multi-scale scheme by aggregating low-level feature maps from the early stages of the network. Noh et al. proposed a deconvolution network (Noh et al., 2015) to compensate for the spatial information degradation caused by pooling operations, leveraging transposed convolution and unpooling. Pyramid Scene Parsing Network (PSPNet) utilized grid-wise pooling with multiple grid sizes to deal with multi-scale objects. Deeplab (Chen et al., 2017) constructed multiple convolution layers with different dilation rates to diversify receptive fields within a single module and extract multi-scale context information. Dual Attention Network (Fu et al., 2019) brought the self-attention mechanism to semantic segmentation and comprises a dual-path network over the spatial and channel dimensions, incorporating local features with their global dependencies. Object Contextual Representation (OCR) (Yuan et al., 2019) extended the non-local operation to consider semantic regions by aggregating the representation of the pixels of each class and computing the similarity with each pixel of the feature map.
2.2. Salient Object Detection
Salient object detection (SOD) resembles semantic segmentation but has its own criterion. Rather than labeling every region of the image with an object class, SOD focuses on the object itself, separating the most important regions from the surrounding areas. Since a salient object does not belong to any specific object class, the task may seem superficially easier than multi-class segmentation, but it is much harder when it comes to accurately detecting objects and estimating their priorities among surrounding objects. Unlike semantic segmentation, it is hard to point to a strong SOD baseline with a high-performing yet simple architecture.
Current state-of-the-art methods exploit the boundary region of objects as complementary information in a multi-task learning strategy to enhance the quality of the saliency map. Edge Guidance Network (EGNet) (Yang et al., 2017) used a bottom-up stream for the edge detection branch and a side-out fusion strategy, which aggregates each path of the top-down stream, for the salient object branch. To clarify our terminology, the bottom-up stream runs from high-level features to low-level features of the backbone network and is commonly used for segmentation and heat-map related tasks such as pose estimation, while the side-out fusion strategy aggregates multiple feature maps from the backbone by concatenation or summation. Boundary-aware Network (BANet) (Su et al., 2019) did the opposite, using side-out fusion for the boundary branch and a single stream for the object branch; rather than regarding edge detection as a separate task, it combines the edge result and the object result to generate the saliency map. These methods consistently show competitive qualitative results (Wang et al., 2021), which demonstrates that edge guidance helps produce edge-preserving representations of objects. However, acquiring an additional edge dataset is often expensive, and even with image processing techniques such as Canny edge detection (Canny, 1986), the result contains redundant edges that are usually unrelated to the object.
Reverse attention (Chen et al., 2018) explicitly multiplies the reverse area of the prediction onto the feature map in order to capture residual details for saliency refinement. However, based on both their experimental results and our experiments, the performance with or without reverse attention was not very different. Nevertheless, their work provides the intuition that, even without explicit edge guidance, we can access edge-related context through the saliency map. Based on this idea, we broaden the reverse saliency map with an additional uncertain area, the ambiguous region whose saliency score is biased toward neither foreground nor background. They also suggested an efficient yet practical network architecture: a decoder at the end of the backbone produces a global saliency map, and in the bottom-up stream only the saliency map from the previous level is used, unlike the concatenated feature maps of U-Net-shaped networks. Each prediction thus requires only a low-level feature map and the saliency map from the previous level as guidance. While U-Net-like networks predict the saliency map at the last bottom-up layer, the architecture proposed in Reverse attention (Chen et al., 2018) computes saliency maps at intermediate feature levels using only the saliency map from the bottom-up stream.

2.3. Polyp Segmentation
Polyp segmentation aims to precisely segment polyps from a given colonoscopy image; one can regard it as binary-class semantic segmentation. While fully convolutional architectures can be adopted to solve the problem, colonoscopy images belong to a different image domain than general images, so researchers focus on extracting semantic features together with detail information. Popular architectures such as PSPNet and Deeplab apply a multi-scale scheme on top of the backbone network, which helps capture details with multiple receptive field sizes, but they usually deploy such modules at the end of the backbone, which is spatially too coarse to recover precise spatial information. DeeplabV3+ (Chen et al., 2017) tried to compensate for this by concatenating a low-level feature map from the backbone, but this was not sufficient to recover the details present in the input image.
U-Net (Ronneberger et al., 2015) introduced incremental up-sampling of the feature maps alongside the corresponding scales of the low-level feature maps. Such "skip-connections" also appear in Feature Pyramid Networks (FPN) (Lin et al., 2017), but FPN uses element-wise addition while U-Net aggregates features with channel-wise concatenation. Although FPN is designed similarly to U-Net to extract multi-scale features, it does not need fine-grained detail features, since object detection requires locating a bounding box rather than the precise shape of an object. U-Net++ (Zhou et al., 2018) added extra layers and dense connectivity to reduce the gap between low-level and high-level features.
As the significance of polyp segmentation has increased, studies dedicated solely to polyp datasets have recently been conducted. ResUNet++ (Jha et al., 2019) constructs a U-Net-shaped network with well-known CNN modules, including residual blocks from ResNet, the Atrous Spatial Pyramid Pooling (ASPP) module from Deeplab, and the squeeze-and-excitation mechanism from SENet (Hu et al., 2018). Selective Feature Aggregation network (SFA) (Fang et al., [n.d.]) added another bottom-up path for boundary estimation, similar to SOD networks with edge guidance (Yang et al., 2017; Su et al., 2019). Parallel Reverse Attention Network (PraNet) (Fan et al., 2020) achieved state-of-the-art performance on five polyp segmentation benchmarks by adopting most of the network design and techniques from (Chen et al., 2018). It added the parallel partial decoder from (Wu et al., 2019) to combine low-level feature maps from the backbone, because the architecture of (Chen et al., 2018) clearly lacks feature sharing between the top-down and bottom-up streams. In our network design, we place intermediate encoders for low-level features and use the outputs of the encoder and the previous decoder features in the bottom-up stream to incorporate high-level semantic features.
3. Methodology
In this section, we present the architecture of UACANet and the details of its modules. We first explain the overall structure of the network, then describe the fundamental components, including the Parallel Axial Attention encoder and decoder and Uncertainty Augmented Context Attention.
3.1. Overall Architecture
We design UACANet based on the overall architecture of PraNet (Fan et al., 2020), which is a modified version of (Chen et al., 2018). As shown in Figure 1, we add an additional encoder network, the Parallel Axial Attention encoder (PAA-e), for the bottom-up stream and the side-out fusion path. This reduces the computational cost of both paths by reducing the number of channels of their input feature maps. The feature maps from the three PAA-e modules are used for the side-out fusion path (purple arrows), the Parallel Axial Attention decoder (PAA-d) and Uncertainty Augmented Context Attention (UACA). We concatenate the three PAA-e feature maps for PAA-d, which predicts the initial saliency map for polyps. The feature maps from PAA-e and PAA-d are then concatenated for UACA, and the output saliency map from PAA-d is used as context guidance (yellow arrow). We describe how UACA incorporates the feature maps and the context guidance in Section 3.3. The output saliency map from UACA is added to the previously computed saliency map from PAA-d. After the first UACA, we concatenate the PAA-e feature map and the previous UACA feature map for the next UACA (gray and black arrows with the concatenation symbol), and the saliency map from the previous UACA is used as context guidance for the next UACA (yellow arrows). The output of each UACA is forwarded to a final point-wise convolution and then added to the previous context guidance to form the current saliency map. After three consecutive UACAs, the final output is computed with bilinear up-sampling with a scale factor of 4 followed by a sigmoid function.
To sum up, the backbone features are encoded with PAA-e, and the encoded features are forwarded to PAA-d to produce the initial saliency map, which serves as an initial guidance map and lets each UACA learn a residual saliency map apart from the initial map. This helps the subsequent UACA modules focus more on uncertain areas such as boundaries rather than on clearly evident regions.
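The wiring described above can be sketched as follows. This is a minimal sketch only: the PAA-e, PAA-d and UACA modules are replaced by placeholder convolution blocks, and the channel counts and the way the guidance map is injected are assumptions made purely to illustrate the residual saliency refinement and the four deeply supervised outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder block standing in for PAA-e, PAA-d and UACA, just to make the
# wiring below runnable; the real modules are described in Sections 3.2 and 3.3.
def conv_block(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class UACANetWiring(nn.Module):
    """Sketch of the bottom-up refinement: an initial map from the decoder is
    repeatedly refined by stages that each add a residual saliency map."""
    def __init__(self, in_chs=(512, 1024, 2048), ch=32):
        super().__init__()
        self.encoders = nn.ModuleList([conv_block(c, ch) for c in in_chs])          # PAA-e stand-ins
        self.decoder = conv_block(ch * 3, ch)                                        # PAA-d stand-in
        self.decoder_head = nn.Conv2d(ch, 1, 1)
        self.stages = nn.ModuleList([conv_block(ch * 2 + 1, ch) for _ in in_chs])    # UACA stand-ins
        self.heads = nn.ModuleList([nn.Conv2d(ch, 1, 1) for _ in in_chs])

    def forward(self, feats):
        # feats: backbone features ordered from low level (large) to high level (small)
        enc = [e(f) for e, f in zip(self.encoders, feats)]
        size = enc[-1].shape[2:]
        fused = torch.cat([F.interpolate(f, size=size, mode='bilinear', align_corners=False)
                           for f in enc], dim=1)
        f_prev = self.decoder(fused)
        m_prev = self.decoder_head(f_prev)            # initial saliency map (guidance)
        maps = [m_prev]
        # bottom-up stream: high-level stage first, then progressively lower levels
        for f_enc, stage, head in zip(reversed(enc), reversed(list(self.stages)), reversed(list(self.heads))):
            size = f_enc.shape[2:]
            f_prev = F.interpolate(f_prev, size=size, mode='bilinear', align_corners=False)
            m_prev = F.interpolate(m_prev, size=size, mode='bilinear', align_corners=False)
            f_prev = stage(torch.cat([f_enc, f_prev, m_prev], dim=1))
            m_prev = head(f_prev) + m_prev            # residual saliency refinement
            maps.append(m_prev)
        out = torch.sigmoid(F.interpolate(maps[-1], scale_factor=4, mode='bilinear', align_corners=False))
        return out, maps                              # maps feed the four supervision losses
```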
We use both binary cross entropy (BCE) loss and intersection over union (IoU) loss. The loss function is computed as follows,
(1)  $\mathcal{L}(\mathbf{y},\hat{\mathbf{y}}) = -\sum_{i}\Big[\mathbf{y}_{i}\log\hat{\mathbf{y}}_{i} + (1-\mathbf{y}_{i})\log(1-\hat{\mathbf{y}}_{i})\Big] + \Big(1 - \dfrac{\sum_{i}\mathbf{y}_{i}\hat{\mathbf{y}}_{i}}{\sum_{i}(\mathbf{y}_{i}+\hat{\mathbf{y}}_{i}-\mathbf{y}_{i}\hat{\mathbf{y}}_{i})}\Big)$
where $i$ refers to a pixel in the output and the ground truth, $\mathbf{y}$ denotes the ground truth and $\hat{\mathbf{y}}$ denotes the output. As shown in Figure 1, we add the four losses from the four predictions of PAA-d and the UACA modules using the loss function in Equation 1 (red arrows in Figure 1).
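A minimal PyTorch sketch of Equation (1) follows; treating the prediction as a raw logit map and using mean reductions are our assumptions.

```python
import torch
import torch.nn.functional as F

def bce_iou_loss(pred_logits: torch.Tensor, gt: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Sketch of Equation (1): binary cross entropy plus a soft IoU loss.
    pred_logits: raw (pre-sigmoid) prediction of shape (B, 1, H, W); gt: binary mask of the same shape."""
    bce = F.binary_cross_entropy_with_logits(pred_logits, gt, reduction='mean')
    pred = torch.sigmoid(pred_logits)
    inter = (pred * gt).sum(dim=(2, 3))
    union = (pred + gt - pred * gt).sum(dim=(2, 3))
    iou_loss = (1.0 - (inter + eps) / (union + eps)).mean()
    return bce + iou_loss
```

Each of the four predicted maps (one from PAA-d and three from the UACA stages) would be up-sampled to the ground-truth resolution and passed through this loss.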

3.2. Parallel Axial Attention encoder and decoder
In the era of deep learning, researchers working on semantic segmentation and related tasks such as polyp segmentation have sought better architectures for extracting fine-grained feature maps that carry both high-level semantic information and low-level detail, because such features are hard both to extract and to combine. The self-attention mechanism (Zhang et al., 2019) is one of the major context modules in computer vision but requires heavy computation. Axial attention (Ho et al., 2019) alleviates this problem by performing the non-local operation along a single axis and connecting the per-axis operations sequentially.
We propose Parallel Axial Attention (PAA) for extracting both global dependencies and local representations. We adopt the axial attention strategy by computing non-local operations along the horizontal and vertical axes, but arrange them in parallel. With vertical and horizontal attention placed side by side, both contribute to the final output almost equally, unlike the sequential arrangement. While sequentially connected axial attention added a trainable positional encoding, we do not use an encoding scheme, since positional encoding is not substantially effective at relatively small scales. We also find that, with the parallel connection, element-wise summation is as effective as concatenation for aggregating the feature maps, without performance degradation, since the same input is used for both axes and they contribute to the output almost equally. Moreover, since attention along a single axis can cause unexpected deformation, element-wise summation helps compensate for such artifacts. As shown in Figure 2, we compute two non-local operations on the input feature map, one along the horizontal axis and the other along the vertical axis.
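The parallel arrangement can be sketched in PyTorch as follows; the query/key/value projections, the channel-reduction factor and the residual connection are our assumptions rather than details taken from Figure 2.

```python
import torch
import torch.nn as nn

class ParallelAxialAttention(nn.Module):
    """Sketch of PAA: two single-axis non-local operations (height axis and width axis)
    computed from the same input in parallel and fused by element-wise summation."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        mid = max(channels // reduction, 1)
        self.query = nn.Conv2d(channels, mid, 1)
        self.key = nn.Conv2d(channels, mid, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def _attend(self, q, k, v):
        # q, k: (B*, L, C'); v: (B*, L, C)  ->  attention along the length-L axis
        attn = torch.softmax(q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5), dim=-1)
        return attn @ v

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)

        # vertical-axis attention: treat each column independently, attend across H
        qh = q.permute(0, 3, 2, 1).reshape(b * w, h, -1)
        kh = k.permute(0, 3, 2, 1).reshape(b * w, h, -1)
        vh = v.permute(0, 3, 2, 1).reshape(b * w, h, -1)
        out_h = self._attend(qh, kh, vh).reshape(b, w, h, c).permute(0, 3, 2, 1)

        # horizontal-axis attention: treat each row independently, attend across W
        qw = q.permute(0, 2, 3, 1).reshape(b * h, w, -1)
        kw = k.permute(0, 2, 3, 1).reshape(b * h, w, -1)
        vw = v.permute(0, 2, 3, 1).reshape(b * h, w, -1)
        out_w = self._attend(qw, kw, vw).reshape(b, h, w, c).permute(0, 3, 1, 2)

        # parallel branches contribute almost equally; element-wise summation instead of concatenation
        return x + out_h + out_w
```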
We actively use this module in the encoder and decoder for a better, globally refined representation. First, we design the Parallel Axial Attention encoder (PAA-e), which aggregates the low-level feature maps from the top-down stream to be used in the bottom-up stream. Since the U-Net structure uses the low-level features without channel reduction, redundant information may hinder performance, and the number of channels is large because backbone networks are trained for image classification. To reduce the number of channels without losing detail information, we design PAA-e with the Receptive Field Block (RFB) (Liu et al., 2018) strategy. As shown in Figure 3(a), the feature map from the backbone network (green box) is forwarded to each receptive-field path. We add PAA for additional global refinement at each scale, concatenate the outputs, and forward them to consecutive convolution layers. As shown in Figure 1, the outputs of PAA-e are used both for the decoder module and for the bottom-up stream. We also design the Parallel Axial Attention decoder (PAA-d) with a simple structure, adding an extra PAA for the final aggregation of the different levels of PAA-e features, denoted by the purple arrows in Figure 1 and Figure 3(b).
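A rough sketch of PAA-e along these lines is shown below, reusing the ParallelAxialAttention class from the previous sketch; the branch count, dilation rates and 1x1 channel reduction are assumptions, not the exact configuration of Figure 3(a).

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, kernel=3, dilation=1):
    pad = dilation * (kernel - 1) // 2
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel, padding=pad, dilation=dilation, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class PAAEncoderSketch(nn.Module):
    """Sketch of PAA-e: reduce backbone channels, split into receptive-field branches
    with growing dilation rates (RFB-style), refine each branch with PAA,
    then concatenate and fuse."""
    def __init__(self, in_ch: int, out_ch: int, dilations=(1, 3, 5, 7)):
        super().__init__()
        self.reduce = conv_bn_relu(in_ch, out_ch, kernel=1)
        self.branches = nn.ModuleList([
            nn.Sequential(conv_bn_relu(out_ch, out_ch, kernel=3, dilation=d),
                          ParallelAxialAttention(out_ch))   # global refinement per scale
            for d in dilations])
        self.fuse = conv_bn_relu(out_ch * len(dilations), out_ch, kernel=3)

    def forward(self, x):
        x = self.reduce(x)
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```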

3.3. Uncertainty Augmented Context Attention
While reverse attention (Chen et al., 2018) does not bring a large performance gain in either SOD or polyp segmentation, it clearly produces some qualitatively better results. This phenomenon is closely related to boundary-guided SOD networks (Yang et al., 2017; Su et al., 2019), which show state-of-the-art performance on multiple SOD benchmarks. Object boundaries used as extra supervision in SOD networks tend to compensate for false negative areas, in other words, object regions with low saliency scores. Reverse attention is thus a potentially effective way to introduce implicit edge guidance without explicit boundary supervision.

We examine both the saliency and reverse saliency maps used in reverse attention and find that the boundary region usually appears where the saliency score is ambiguous; in other words, the boundary region is highly related to saliency scores around 0.5. From this property, we assume that the saliency map and the reverse saliency map carry almost the same amount of edge information, since the reverse saliency map is obtained by simply subtracting the saliency map from 1. Based on this assumption, extracting the ambiguous region from the saliency map, alongside the foreground and background areas, should improve attentive methods such as self-attention.
Based on these observations, we propose the Uncertainty Augmented Context Attention (UACA) module, a novel self-attention mechanism which incorporates the uncertain area for rich semantic feature extraction without extra boundary guidance. We denote the previously computed input saliency map as $\mathbf{m}$ and generate the corresponding foreground map $\mathbf{m}_{f}$, background map $\mathbf{m}_{b}$ and uncertain area map $\mathbf{m}_{u}$ as follows,
(2)  $\mathbf{m}_{f} = \max(\mathbf{m} - 0.5,\ 0), \qquad \mathbf{m}_{b} = \max(0.5 - \mathbf{m},\ 0), \qquad \mathbf{m}_{u} = 0.5 - |\mathbf{m} - 0.5|$
We compute the foreground and background maps with the max operation in order to disentangle them not only from each other but also from the uncertain area map, since the uncertain area map already represents their joint region; leaving the overlap would create redundant information and diminish the role of uncertainty.
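A small sketch of how Equation (2) might be implemented is given below; treating the incoming map as a logit and the exact scaling of the three maps are assumptions.

```python
import torch

def area_maps(saliency_logits: torch.Tensor):
    """Sketch of Equation (2): split a saliency map into foreground, background and
    uncertain area maps. Assumes the map is a logit that is squashed to [0, 1] first."""
    s = torch.sigmoid(saliency_logits)
    m_fg = torch.clamp(s - 0.5, min=0.0)     # confident foreground only
    m_bg = torch.clamp(0.5 - s, min=0.0)     # confident background only
    m_uc = 0.5 - torch.abs(s - 0.5)          # peaks where the score is ambiguous (around 0.5)
    return m_fg, m_bg, m_uc
```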
While reverse attention applies explicit channel-wise multiplication to the feature map, which resembles the Convolutional Block Attention Module (CBAM) (Woo et al., 2018), we compose our context module as a set of non-local operations. Similar to OCR (Yuan et al., 2019), we first compute representative vectors for the foreground, background and uncertain area maps by aggregating the pixel representations of the input feature map $\mathbf{x}$ weighted by each area map, as follows,
(3)  $\mathbf{f}_{k} = \sum_{i} \mathbf{m}_{k}(i)\,\mathbf{x}(i), \qquad k \in \{f, b, u\}$
where $i$ denotes a pixel in the spatial dimension. We implement Equation 3 with matrix multiplication, as shown in Figure 4. Each vector is a representative feature vector: $\mathbf{f}_{f}$ represents the foreground feature and $\mathbf{f}_{u}$ represents the uncertain area. We then compute the similarity between each representative vector ($\mathbf{f}_{f}$, $\mathbf{f}_{b}$ and $\mathbf{f}_{u}$) and each pixel of the input feature map as follows,
(4)  $\mathbf{s}_{k}(i) = \dfrac{\exp\big(\phi(\mathbf{x}(i))^{\top}\psi(\mathbf{f}_{k})\big)}{\sum_{k' \in \{f,b,u\}} \exp\big(\phi(\mathbf{x}(i))^{\top}\psi(\mathbf{f}_{k'})\big)}$
Table 1. Ablation study on the PAA module.

| Dataset | Method | Mean Dice | Mean IoU | MAE |
| --- | --- | --- | --- | --- |
| CVC-ClinicDB | UACANet-S (w/o PAA) | 0.902 | 0.858 | 0.008 |
| CVC-ClinicDB | UACANet-S | 0.916 | 0.870 | 0.008 |
| ETIS | UACANet-S (w/o PAA) | 0.684 | 0.603 | 0.029 |
| ETIS | UACANet-S | 0.694 | 0.615 | 0.023 |
Table 2. Ablation study on the uncertain area (CANet vs. UACANet).

| Dataset | Method | Mean Dice | Mean IoU | MAE |
| --- | --- | --- | --- | --- |
| CVC-ClinicDB | CANet-S | 0.911 | 0.857 | 0.009 |
| CVC-ClinicDB | CANet-L | 0.912 | 0.861 | 0.009 |
| CVC-ClinicDB | UACANet-S | 0.916 | 0.870 | 0.008 |
| CVC-ClinicDB | UACANet-L | 0.926 | 0.880 | 0.006 |
| ETIS | CANet-S | 0.691 | 0.613 | 0.026 |
| ETIS | CANet-L | 0.678 | 0.604 | 0.019 |
| ETIS | UACANet-S | 0.694 | 0.615 | 0.023 |
| ETIS | UACANet-L | 0.766 | 0.689 | 0.012 |
Finally, we compute the context feature map by the weighted summation of the representative vectors $\mathbf{f}_{f}$, $\mathbf{f}_{b}$ and $\mathbf{f}_{u}$ with the similarity scores $\mathbf{s}_{f}$, $\mathbf{s}_{b}$ and $\mathbf{s}_{u}$ as follows,

(5)  $\mathbf{t}(i) = \sum_{k \in \{f,b,u\}} \mathbf{s}_{k}(i)\,\delta(\mathbf{f}_{k})$
Note that $\phi$, $\psi$ and $\delta$ are point-wise convolutions. Each pixel of the context feature map, $\mathbf{t}(i)$, can be interpreted as a weighted average of the three representative vectors $\mathbf{f}_{f}$, $\mathbf{f}_{b}$ and $\mathbf{f}_{u}$. The context feature map $\mathbf{t}$ and the input feature map $\mathbf{x}$ are concatenated along the channel axis and fed to a point-wise convolution for the final output feature map, as shown in Figure 4.
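Putting Equations (2)-(5) together, a self-contained sketch of a UACA-style module could look as follows. It reuses the area_maps helper from the earlier sketch; the spatial softmax normalization of the area maps, the channel sizes and the omission of the residual saliency-map head are our assumptions.

```python
import torch
import torch.nn as nn

class UACASketch(nn.Module):
    """Sketch of UACA: build foreground / background / uncertain area maps from the
    incoming saliency map, pool one representative vector per area (Eq. 3), then
    redistribute the vectors to every pixel by attention (Eqs. 4-5)."""
    def __init__(self, channels: int, mid: int = 32):
        super().__init__()
        self.query = nn.Conv2d(channels, mid, 1)    # point-wise convolutions (phi, psi, delta)
        self.key = nn.Conv1d(channels, mid, 1)
        self.value = nn.Conv1d(channels, mid, 1)
        self.out = nn.Conv2d(channels + mid, channels, 1)

    def forward(self, x, saliency_logits):
        b, c, h, w = x.shape
        m_fg, m_bg, m_uc = area_maps(saliency_logits)               # Eq. (2), from the sketch above
        maps = torch.cat([m_fg, m_bg, m_uc], dim=1)                 # (B, 3, H, W)
        maps = torch.softmax(maps.reshape(b, 3, -1), dim=-1)        # normalize over space (assumption)
        # Eq. (3): representative vector per area via matrix multiplication
        feat = x.reshape(b, c, -1)                                  # (B, C, HW)
        regions = torch.bmm(maps, feat.transpose(1, 2))             # (B, 3, C)
        # Eq. (4): similarity between every pixel and the three representative vectors
        q = self.query(x).reshape(b, -1, h * w)                     # (B, mid, HW)
        k = self.key(regions.transpose(1, 2))                       # (B, mid, 3)
        v = self.value(regions.transpose(1, 2))                     # (B, mid, 3)
        sim = torch.softmax(torch.bmm(k.transpose(1, 2), q), dim=1) # (B, 3, HW), softmax over areas
        # Eq. (5): context features as a similarity-weighted sum of representative vectors
        context = torch.bmm(v, sim).reshape(b, -1, h, w)            # (B, mid, H, W)
        return self.out(torch.cat([x, context], dim=1))             # concat with input, point-wise conv
```

In the full network the output of this module would be followed by the point-wise prediction head whose map is added to the incoming guidance, as described in Section 3.1.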
Table 3. Quantitative comparison on Kvasir and CVC-ClinicDB.

| Dataset | Method | Mean Dice | Mean IoU | MAE |
| --- | --- | --- | --- | --- |
| Kvasir | U-Net (Ronneberger et al., 2015) | 0.818 | 0.746 | 0.055 |
| Kvasir | U-Net++ (Zhou et al., 2018) | 0.821 | 0.743 | 0.048 |
| Kvasir | ResUNet++ (Jha et al., 2019) | 0.813 | 0.793 | - |
| Kvasir | SFA (Fang et al., [n.d.]) | 0.723 | 0.611 | 0.075 |
| Kvasir | PraNet (Fan et al., 2020) | 0.898 | 0.840 | 0.030 |
| Kvasir | UACANet-S (Ours) | 0.905 | 0.852 | 0.026 |
| Kvasir | UACANet-L (Ours) | 0.912 | 0.859 | 0.025 |
| CVC-ClinicDB | U-Net (Ronneberger et al., 2015) | 0.823 | 0.755 | 0.019 |
| CVC-ClinicDB | U-Net++ (Zhou et al., 2018) | 0.794 | 0.729 | 0.022 |
| CVC-ClinicDB | ResUNet++ (Jha et al., 2019) | 0.796 | 0.796 | - |
| CVC-ClinicDB | SFA (Fang et al., [n.d.]) | 0.700 | 0.607 | 0.042 |
| CVC-ClinicDB | PraNet (Fan et al., 2020) | 0.899 | 0.849 | 0.009 |
| CVC-ClinicDB | UACANet-S (Ours) | 0.916 | 0.870 | 0.008 |
| CVC-ClinicDB | UACANet-L (Ours) | 0.926 | 0.880 | 0.006 |
4. Experimental Results
In this section, we describe our implementation details, the datasets and benchmarks used in the experiments, and experimental results including ablation studies on the PAA module and the uncertain area as well as a comparison with previous state-of-the-art methods on five polyp segmentation benchmarks. We also visualize feature maps and the uncertainty map from UACA, along with qualitative comparisons with other methods.
Table 4. Quantitative comparison on ETIS, CVC-ColonDB and CVC-300.

| Method | ETIS mDice | ETIS mIoU | ETIS MAE | CVC-ColonDB mDice | CVC-ColonDB mIoU | CVC-ColonDB MAE | CVC-300 mDice | CVC-300 mIoU | CVC-300 MAE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| U-Net (Ronneberger et al., 2015) | 0.398 | 0.335 | 0.036 | 0.512 | 0.444 | 0.061 | 0.710 | 0.627 | 0.022 |
| U-Net++ (Zhou et al., 2018) | 0.401 | 0.344 | 0.035 | 0.483 | 0.410 | 0.064 | 0.707 | 0.624 | 0.018 |
| SFA (Fang et al., [n.d.]) | 0.297 | 0.217 | 0.109 | 0.469 | 0.347 | 0.094 | 0.467 | 0.329 | 0.065 |
| PraNet (Fan et al., 2020) | 0.628 | 0.567 | 0.031 | 0.709 | 0.640 | 0.045 | 0.871 | 0.797 | 0.010 |
| UACANet-S (Ours) | 0.694 | 0.615 | 0.023 | 0.783 | 0.704 | 0.034 | 0.902 | 0.837 | 0.006 |
| UACANet-L (Ours) | 0.766 | 0.689 | 0.012 | 0.751 | 0.678 | 0.039 | 0.910 | 0.849 | 0.005 |
4.1. Implementation Details
Most of the model architecture is described in Section 3. The number of channels of the convolution layers outside the backbone network is unified to 32 for the small model and 256 for the large model; we denote the small model as UACANet-S and the large model as UACANet-L. We use Res2Net (Gao et al., 2019) as the backbone network. Intermediate backbone feature maps are taken from the last residual block of each stage (green boxes in Figure 1). For UACANet-L, similar to DeeplabV3+ (Chen et al., 2017), we modify the strides and dilation rates to increase the spatial size of the feature maps. We resize images to a fixed resolution for both training and inference and resize the prediction back to its original size. Unlike PraNet (Fan et al., 2020) and other state-of-the-art methods, we adopt additional data augmentation techniques that are popular in semantic segmentation, including random flipping along both the horizontal and vertical axes and random image scaling. We apply random rotation, since colonoscopy images may rotate during examination, and add random dilation and erosion of the ground truth label to enhance generalization. We use the Adam optimizer (Kingma and Ba, 2014) with polynomial learning rate decay (Chen et al., 2017). Compared to PraNet, which trained for only 60 epochs, we increase the number of training epochs to 240 since we apply diverse data augmentation. We implement our model in PyTorch (Paszke et al., 2019) and train it on a single Titan RTX GPU.
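As a sketch of the optimization schedule, the setup below pairs Adam with a polynomial decay; the base learning rate and decay power are placeholder values (only the 240-epoch count is stated above), not the paper's exact settings.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model: torch.nn.Module, base_lr: float = 1e-4, power: float = 0.9, epochs: int = 240):
    """Adam optimizer with polynomial learning-rate decay over the training epochs.
    base_lr and power are assumed values for illustration."""
    optimizer = Adam(model.parameters(), lr=base_lr)
    poly = lambda epoch: (1.0 - epoch / float(epochs)) ** power   # polynomial decay factor
    scheduler = LambdaLR(optimizer, lr_lambda=poly)
    return optimizer, scheduler
```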
4.2. Datasets
Following (Fan et al., 2020), randomly selected images from Kvasir and CVC-ClinicDB are used for training; for a fair comparison, we use the same training split already extracted from Kvasir and CVC-ClinicDB, which contains 1450 images in total. For benchmarking, we use five different datasets.
- CVC-ClinicDB: CVC-ClinicDB (Bernal et al., 2015), also known as CVC-612, contains 612 images selected from 29 sequences of 25 colonoscopy videos. 62 images from this dataset are used for testing and the remaining images are used for training.
- CVC-300: CVC-300 is the test set of EndoScene (Vázquez et al., 2017). EndoScene contains 912 images from 44 colonoscopy sequences acquired from 36 patients in total. Since the EndoScene dataset is a combination of CVC-ClinicDB and CVC-300, following D.-P. Fan et al., we use CVC-300, 60 samples in total, as a test set.
- CVC-ColonDB: CVC-ColonDB (Bernal et al., 2012) is collected from 15 different colonoscopy sequences, with 380 images sampled from these sequences.
- ETIS: ETIS (Silva et al., 2013) contains 196 images collected from 34 colonoscopy videos. Its images are the largest among the datasets. Although polyps in this dataset vary in size and shape, they are mostly small and hard to find, which makes this dataset more challenging.
- Kvasir: Kvasir (Jha et al., 2020) consists of 1000 polyp images with corresponding annotations. Unlike the other datasets, the images vary in size, and the polyps that appear in them also vary in size and shape. There are 700 large polyps, 48 small polyps and 323 medium polyps between the two scales. 900 images are used for training and 100 for testing.
4.3. Ablation Study on Parallel Axial Attention
To demonstrate the validity of the PAA module, we evaluate UACANet without PAA modules. We design another model whose architecture is identical to UACANet except that the PAA modules (yellow boxes in Figure 3) are excluded. We use three metrics: mean Dice (mDice), mean intersection over union (mIoU) and mean absolute error (MAE). We choose these two datasets for the ablation study because CVC-ClinicDB is sampled for training while ETIS is not. As shown in Table 1, UACANet-S with the PAA module performs better.
4.4. Ablation Study on uncertain area
We conduct an experiment to demonstrate the effectiveness of UACA. We keep all details identical to UACA except that the uncertainty map ($\mathbf{m}_{u}$ in Figure 4) is excluded, yielding Context Attention (CA). We substitute UACA with CA in Figure 1 to build CANet, with the same small and large versions, namely CANet-S and CANet-L. Table 2 shows quantitative results of CANet and UACANet on CVC-ClinicDB and ETIS; we choose these two datasets for the same reason as in Section 4.3. UACANet consistently outperforms CANet on all three metrics.

We also visualize the output feature maps of the attention modules in both settings in Figure 5 to verify the effectiveness of the uncertain area qualitatively. UACANet produces noticeably more precise results than CANet. In the first and third rows, the visualized feature map and the output of CANet-L show that, although it detects a substantial portion of the polyp, it fails to detect the ambiguous region that is hard to discriminate from the mucosa, the surface of the colon. On the other hand, UACANet consistently segments polyp regions precisely. We also visualize the uncertain area of UACA ($\mathbf{m}_{u}$ in Figure 4); it is easy to see that the uncertain area is closely related to the boundary of polyps. Especially in the third row, $\mathbf{m}_{u}$ also helps detect the ambiguous areas denoted by red boxes. Moreover, in the second row, where the ground truth misses another polyp in the bottom-left corner, UACANet detects the missing polyp.
4.5. Experiments with State-of-the-Art methods
As mentioned above, we compare our method with previous state-of-the-art methods on five polyp segmentation benchmarks. Since we train on data sampled from Kvasir and CVC-ClinicDB, the test results are still on unseen images, but the image domain of these two datasets is similar to the training data. Thus, we first report results on these two datasets in Table 3. On both datasets, UACANet achieves the best performance among all methods. In particular, UACANet-L achieves 92.6% mean Dice on CVC-ClinicDB, a 2.7% improvement over PraNet, the latest state-of-the-art method. In Table 4, we evaluate our method on three completely unseen datasets. Although ETIS is the most challenging of the five datasets, UACANet-L achieves 76.6% mean Dice, a 13.8% improvement over PraNet.
We also show qualitative results of previous state-of-the-art methods and our method on the five benchmarks (Figure 5). On Kvasir and CVC-ClinicDB (first and second rows), since the two datasets are similar to the training data, all methods are able to locate the polyps, but our method produces the results closest to the ground truth. On the ETIS dataset (third row), the most challenging of the benchmarks, both UACANet-S and UACANet-L detect a polyp that is very small and hard to notice, while the other methods fail to detect it.
5. Conclusion
We propose a novel polyp segmentation network, UACANet, which augments the uncertain area of the context representation for accurate polyp detection. Without expensive edge annotations, we show that the uncertain area is capable of representing boundary information. We propose Parallel Axial Attention encoders for the backbone features and a decoder for the initial saliency map, as well as Uncertainty Augmented Context Attention, which augments the uncertain area representing complementary edge information. A series of quantitative and qualitative experiments shows that our method outperforms previous state-of-the-art methods.
Acknowledgement
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. B0101-15-0266, Development of High Performance Visual BigData Discovery Platform for Large-Scale Realtime Data Analysis), (No. 2017-0-00897, Development of Object Detection and Recognition for Intelligent Vehicles) and (No. 2018-0-01290, Development of an Open Dataset and Cognitive Processing Technology for the Recognition of Features Derived From Unstructured Human Motions Used in Self-driving Cars).
References
- Bernal et al. (2015) Jorge Bernal, F. Javier Sánchez, Gloria Fernández-Esparrach, Debora Gil, Cristina Rodríguez, and Fernando Vilariño. 2015. WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Computerized Medical Imaging and Graphics 43 (2015), 99–111. https://doi.org/10.1016/j.compmedimag.2015.02.007
- Bernal et al. (2012) J. Bernal, J. Sánchez, and F. Vilariño. 2012. Towards automatic polyp detection with a polyp appearance model. Pattern Recognition 45, 9 (2012), 3166–3182. https://doi.org/10.1016/j.patcog.2012.03.002 Best Papers of Iberian Conference on Pattern Recognition and Image Analysis (IbPRIA’2011).
- Canny (1986) J. Canny. 1986. A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-8, 6 (1986), 679–698. https://doi.org/10.1109/TPAMI.1986.4767851
- Chen et al. (2017) Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. 2017. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017).
- Chen et al. (2018) Shuhan Chen, Xiuli Tan, Ben Wang, and Xuelong Hu. 2018. Reverse attention for salient object detection. In Proceedings of the European Conference on Computer Vision (ECCV). 234–250.
- Fan et al. (2020) Deng-Ping Fan, Ge-Peng Ji, Tao Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, and Ling Shao. 2020. Pranet: Parallel reverse attention network for polyp segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 263–273.
- Fang et al. ([n.d.]) Yuqi Fang, Cheng Chen, Yixuan Yuan, and Kai-yu Tong. [n.d.]. Selective Feature Aggregation Network with Area-Boundary Constraints for Polyp Segmentation. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2019. Springer-Verlag, Berlin, Heidelberg, 302–310. https://doi.org/10.1007/978-3-030-32239-7_34
- Fu et al. (2019) Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. 2019. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3146–3154.
- Gao et al. (2019) Shanghua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, and Philip HS Torr. 2019. Res2net: A new multi-scale backbone architecture. IEEE transactions on pattern analysis and machine intelligence (2019).
- Haghighi et al. (2020) Fatemeh Haghighi, Mohammad Reza Hosseinzadeh Taher, Zongwei Zhou, Michael B Gotway, and Jianming Liang. 2020. Learning Semantics-enriched Representation via Self-discovery, Self-classification, and Self-restoration. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 137–147.
- Ho et al. (2019) Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. 2019. Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180 (2019).
- Hu et al. (2018) Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 7132–7141.
- Jha et al. (2020) Debesh Jha, Pia H Smedsrud, Michael A Riegler, Pål Halvorsen, Thomas de Lange, Dag Johansen, and Håvard D Johansen. 2020. Kvasir-seg: A segmented polyp dataset. In International Conference on Multimedia Modeling. Springer, 451–462.
- Jha et al. (2019) Debesh Jha, Pia H Smedsrud, Michael A Riegler, Dag Johansen, Thomas De Lange, Pål Halvorsen, and Håvard D Johansen. 2019. Resunet++: An advanced architecture for medical image segmentation. In 2019 IEEE International Symposium on Multimedia (ISM). IEEE, 225–2255.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
- Lin et al. (2017) Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2017. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2117–2125.
- Liu et al. (2018) Songtao Liu, Di Huang, et al. 2018. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV). 385–400.
- Long et al. (2015) Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3431–3440.
- Noh et al. (2015) Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. 2015. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision. 1520–1528.
- Oktay et al. (2018) Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y Hammerla, Bernhard Kainz, et al. 2018. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999 (2018).
- Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703 (2019).
- Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention. Springer, 234–241.
- Silva et al. (2013) Juan Silva, Aymeric Histace, Olivier Romain, Xavier Dray, and Bertrand Granado. 2013. Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer. International Journal of Computer Assisted Radiology and Surgery 9, 2 (sep 2013), 283–293. https://doi.org/10.1007/s11548-013-0926-3
- Su et al. (2019) Jinming Su, Jia Li, Yu Zhang, Changqun Xia, and Yonghong Tian. 2019. Selectivity or invariance: Boundary-aware salient object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3799–3808.
- Vázquez et al. (2017) David Vázquez, Jorge Bernal, F Javier Sánchez, Gloria Fernández-Esparrach, Antonio M López, Adriana Romero, Michal Drozdzal, and Aaron Courville. 2017. A benchmark for endoluminal scene segmentation of colonoscopy images. Journal of healthcare engineering 2017 (2017).
- Wang et al. (2021) Wenguan Wang, Qiuxia Lai, Huazhu Fu, Jianbing Shen, Haibin Ling, and Ruigang Yang. 2021. Salient object detection in the deep learning era: An in-depth survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
- Woo et al. (2018) Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. 2018. Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV). 3–19.
- Wu et al. (2019) Zhe Wu, Li Su, and Qingming Huang. 2019. Cascaded partial decoder for fast and accurate salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3907–3916.
- Yang et al. (2017) Bing Yang, Xiaoyun Zhang, Li Chen, Hua Yang, and Zhiyong Gao. 2017. Edge guided salient object detection. Neurocomputing 221 (2017), 60–71.
- Yuan et al. (2019) Yuhui Yuan, Xilin Chen, and Jingdong Wang. 2019. Object-contextual representations for semantic segmentation. arXiv preprint arXiv:1909.11065 (2019).
- Zhang et al. (2019) Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. 2019. Self-attention generative adversarial networks. In International conference on machine learning. PMLR, 7354–7363.
- Zhou et al. (2018) Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. 2018. Unet++: A nested u-net architecture for medical image segmentation. In Deep learning in medical image analysis and multimodal learning for clinical decision support. Springer, 3–11.