Deep Self-cleansing for Medical Image Segmentation with Noisy Labels

Jiahua Dong^a¹¹1Jiahua Dong and Yue Zhang contribute equally to this work. Yue Zhang^b,c²²2Jiahua Dong and Yue Zhang contribute equally to this work. Qiuli Wang^d Ruofeng Tong^a Shihong Ying^e Shaolin Gong^e Xuanpu Zhang^e Lanfen Lin^a Yen-Wei Chen^f S. Kevin Zhou^b,c,g,i a College of Computer Science and Technology, Zhejiang University, Hangzhou, China b School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China c Center for Medical Imaging, Robotics, Analytic Computing & Learning (MIRACLE), Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, Jiangsu, China d 7T Magnetic Resonance Translational Medicine Research Center, Department of Radiology, Southwest Hospital, Army Medical University (Third Military Medical University), Chongqing, China e Department of Radiology, The First Affiliated Hospital, College of Medicine, Zhejiang University, Hangzhou, China f College of Information Science and Engineering, Ritsumeikan University, Shiga, Japan g Key Laboratory of Precision and Intelligent Chemistry, University of Science and Technology of China, Hefei Anhui, China i Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS

Abstract

Medical image segmentation is crucial in the field of medical imaging, aiding in disease diagnosis and surgical planning. Most established segmentation methods rely on supervised deep learning, in which clean and precise labels are essential for supervision and significantly impact the performance of models. However, manually delineated labels often contain noise, such as missing labels and inaccurate boundary delineation, which can hinder networks from correctly modeling target characteristics. In this paper, we propose a deep self-cleansing segmentation framework that can preserve clean labels while cleansing noisy ones in the training phase. To achieve this, we devise a gaussian mixture model-based label filtering module that distinguishes noisy labels from clean labels. Additionally, we develop a label cleansing module to generate pseudo low-noise labels for identified noisy samples. The preserved clean labels and pseudo-labels are then used jointly to supervise the network. Validated on a clinical liver tumor dataset and a public cardiac diagnosis dataset, our method can effectively suppress the interference from noisy labels and achieve prominent segmentation performance.

keywords:

Medical image segmentation, Noisy labels, Label self-cleansing

^†^†journal: xxx

1 Introduction

Medical image segmentation is a fundamental task in medical image analysis, which delineates specific organs or tumors from medical images such as computed tomography (CT) and Magnetic Resonance Imaging (MRI). Accurate segmentation provides critical diagnostic information, including shapes, locations, and textures of target tissues, thereby assisting clinicians in disease diagnosis and surgical planning [40, 9]. Recently, supervised deep learning-based methods [29, 42, 25, 30, 13, 34, 17, 5, 12, 14, 18, 4, 8, 22] have achieved immense success in medical image segmentation, with U-Net [29] being the most popular one. Based on U-Net, a lot of variants have been developed by improving skip connections [42, 25, 30, 13], embracing transformers [5, 12, 14, 18, 4], etc.

Refer to caption — Figure 1: Typical types of label noise. Take CT slices with liver tumors as an example, (a) illustrates a missing label that radiologists omitted the tumor region. (b) illustrates the boundary noises around the tumor region.

Supervised methods generally rely on high-quality, large-scale training datasets annotated with clean labels that ensure networks learn correct representations. However, in clinical practice, obtaining flawless labels is hardly possible [43] since radiologists inevitably experience visual fatigue, which leads to inconsistent and noisy labels.

Typically, label noise within medical images can be grouped into missing label noise (as shown in Fig. 1(a)) and boundary delineation noise (as shown in Fig. 1(b)). Missing label noise occurs when radiologists omit target-included slices, resulting in target tissues that are not delineated. Boundary delineation noise, on the other hand, occurs when the low contrast between target tissues and surroundings makes it difficult for radiologists to accurately determine the boundaries, hence leading to inaccurate edge delineations. Training datasets corrupted by label noise can degrade the performance of deep models, as the models learn incorrect characteristics of targets. Therefore, it is crucial to develop methods that make networks resistant to such noise.

In recent years, learning with noisy labels has attracted considerable attention. Most existing methods focus on classification tasks with natural images, including but not limited to designing robust losses [7, 10, 39], re-weighting samples [20, 27, 31], and label correction [11, 32, 41]. However, learning segmentation with noisy labels in medical images has not been extensively investigated. Few known methods in this field can be classified into two categories: 1) Image-level cleansing methods, which estimate the overall noise level of each image and reduce the impact of noisy samples on network training [43]. For example, Zhu et al. [43] proposed a quality awareness module to evaluate the quality of labels in the training set, and assign lower weights for noisy labels when constructing losses for chest image segmentation. 2) Pixel-level cleansing methods, which identify specific noisy regions (pixels) within each image and modify the labels through pseudo-labeling [19, 37, 36, 35, 33, 16]. For example, Liu et al. [19] and Zhang et al. [37] used network predictions as corrected labels to cleanse noisy labels. Zhang et al. [36] and Xu et al. [35] employed Confident Learning [24] to estimate wrongly-labeled pixels and make corrections. Wei et al. [33] used two different networks to select confident pixels within labels for cross-training. Li et al. [16] integrated superpixel representations to guide the label refinement process. Although image-level methods could suppress the influence of noisy samples, local noisy regions within low-noise samples still affect the performance; Meanwhile, pixel-level methods could not distinguish between noisy and clean samples, leading to the indiscriminate modification of all labels (including clean and low-noise labels), which is inefficient and prone to introducing new noise.

To address the above issues, we combine the merits of image-level and pixel-level cleansing methods for medical image segmentation with noisy labels. We propose a deep self-cleansing network that preserves clean labels while cleansing noisy labels in an iterative manner. First, we devise an image-level label filtering module (LFM) to distinguish between noisy and clean labels. Given that clean labels typically generate smaller loss values than noisy ones in the early training stage [6], the LFM uses Gaussian Mixture Models (GMM) [28] to model loss distributions and classify labels into Clean (need preserving) and Noisy (need cleansing) categories. To cleanse the Noisy labels, we then propose a pixel-level label cleansing module (LCM) based on pseudo-labeling. The LCM extracts representative target regions from network outputs as prototypes, and computes pseudo-labels based on the similarity between pixel positions and prototypes. Finally, the preserved Clean labels and generated pseudo-labels are used to jointly supervise the network. Extensive experiments on an abdominal CT dataset of liver tumors and an MRI dataset for cardiac diagnosis demonstrate that the proposed self-cleansing framework is resistant to label noise and achieves excellent segmentation performance.

In summary, our work makes the following contributions:

We present a noise-resistant framework for medical image segmentation with noisy labels. The devised framework can preserve clean labels and iteratively cleanse noisy labels during the training phase.

Table 1: Detailed structure of the segmentation backbone.

Layer Name	Output Size	Convolution, kernel size,
		output channels, stride
Input	256×256	-
Conv_1	256×256	Conv, 3×3, 16, 1
Residual_block_1	128×128	Conv, 2×2, 32, 2
		Conv, 3×3, 32, 1
Residual_block_2	64×64	Conv, 2×2, 64, 2
		Conv, 3×3, 64, 1
		Conv, 3×3, 64, 1
Residual_block_3	32×32	Conv, 2×2, 128, 2
		Conv, 3×3, 128, 1
		Conv, 3×3, 128, 1
		Conv, 3×3, 128, 1
Residual_block_4	64×64	ConvTranspose, 2×2, 64, 2
		Conv, 3×3, 128, 1
		Conv, 3×3, 128, 1
		Conv, 3×3, 128, 1
Residual_block_5	128×128	ConvTranspose, 2×2, 32, 2
		Conv, 3×3, 64, 1
		Conv, 3×3, 64, 1
Residual_block_6	256×256	ConvTranspose, 2×2, 16, 2
		Conv, 3×3, 32, 1
Conv_2	256×256	Conv, 3×3, 2, 1
		Conv, 1×1, 1, 1
		Sigmoid

2.

We present a GMM-based image-level label filtering module to distinguish between noisy and clean labels. Based on this module, our framework focuses on refining only noisy labels, thereby making the training stage stable and controllable.
3.

We devise a pixel-level label cleansing module to correct noisy labels by identifying representative prototypes and generating low-noise pseudo-labels.
4.

We verify the proposed method on an abdominal CT dataset of liver tumors and a public MRI dataset for cardiac diagnosis. Experimental results demonstrate the effectiveness and robustness of our method when training with noisy labels.

2 Method

Fig. 2 illustrates the schematic view of the proposed self-cleansing method for medical image segmentation. This method comprises three main components: the segmentation backbone, the label filtering module (LFM), and the label cleansing module (LCM). During training, our network iteratively preserves clean labels and cleanses noisy ones using LFM and LCM. In each cleansing iteration, the LFM first distinguishes noisy labels from clean labels based on per-sample loss distributions. Subsequently, the LCM generates low-noise pseudo-labels for the identified noisy samples based on class prototypes. The two modules are applied every five epochs throughout the training phase. The following subsections will provide detailed descriptions of each component.

2.1 Segmentation Backbone

The proposed framework can use existing segmentation networks as its backbone. Given the U-Net architecture’s effectiveness in combining high- and low-level features for medical image segmentation [29], we employ an enhanced version of U-Net with residual connections as the segmentation backbone in this paper. Our backbone comprises three down-sampling stages and three up-sampling stages to extract multi-level and multi-scale features progressively. The specific network structure is detailed in Table 1. To ensure that the backbone network can extract effective features and be prepared for subsequent noisy label cleansing, we initially warm up the backbone network for 50 epochs.

2.2 Image-Level Label Filtering

Existing pixel-level cleansing methods generate pseudo-labels for all samples. However, in real-world scenarios, annotated samples often contain varying levels of label noise. Modifying all labels indiscriminately can introduce new noise into clean and low-noise labels, thereby impacting the model’s stability and performance. Therefore, the first step in label self-cleansing is to filter out noisy labels for cleansing while preserving the clean labels.

Gaussian Mixture Model for Label Classification: It has been observed that neural networks tend to first fit clean labels during the early training stage [1, 6]. During this period, the network’s predictions usually result in smaller losses for clean labels, whereas the losses for noisy labels are typically larger. Therefore, our LFM leverages the loss characteristic to filter out noisy samples. Specifically, we use the Dice Similarity Coefficient (DSC) loss [22], which is the most commonly used metric in segmentation, as an example to illustrate this process.

Let $X={\left\{x_{i}\right\}}_{i=1}^{N}$ denote the input images (where $N$ is the number of training samples), $Y={\left\{y_{i}\right\}}_{i=1}^{N}$ denote the manual labels, and $\hat{Y}={\left\{\hat{y}_{i}\right\}}_{i=1}^{N}$ denote the network predictions, the DSC loss between $\hat{Y}$ and $Y$ is formulated as:

L_{DSC}={\{{l_{i}\}}}_{i=1}^{N}=\{1-\frac{2\left|\hat{y_{i}}\cap y_{i}\right|}{\left|\hat{y_{i}}\right|+\left|y_{i}\right|}\}_{i=1}^{N}

(1)

Given the flexibility of the Gaussian Mixture Model (GMM) in fitting distributions with varying sharpness [15], we employ a two-component GMM to model per-sample loss distributions and to classify the training labels into Clean (to be preserved) and Noisy (to be cleansed) categories. After normalizing $L_{DSC}$ into the range of [0,1], we use the Exception Maximization algorithm [23] to fit a two-component GMM to the distribution of $L_{DSC}$ . This GMM allows us to compute the posterior probabilities of each label $y_{i}$ being classified as either Clean or Noisy. These posterior probabilities are denoted as $w_{Clean}^{(i)}$ and $w_{Noisy}^{(i)}$ , respectively:

w_{Clean}^{(i)}=p(g_{small}|l_{i})

(2)

w_{Noisy}^{(i)}=p(g_{large}|l_{i})

(3)

where $g_{small}$ is the gaussian component with the smaller mean value, and $g_{large}$ is the gaussian component with the larger mean value. We classify each label $y_{i}$ as Clean if $w_{Clean}^{(i)}>w_{Noisy}^{(i)}$ ; otherwise, it is classified as Noisy.

Cascaded Label Filtering: In this paper, our network is devised to filter and cleanse noisy labels through pseudo-labeling every $k$ epochs, making the training stage an iterative cleansing process. To ensure efficient label cleansing and minimize potential noise introduced by pseudo-labeling, we retain both Clean manual labels and Clean previous pseudo-labels (generated in the previous iteration) and modify only the remaining labels. To achieve this, we propose a cascaded label filtering algorithm using two GMMs to filter the labels that require cleansing.

Let $Y^{\prime}={\left\{y^{\prime}_{i}\right\}}_{i=1}^{N}$ denote the pseudo label set, initially containing all-zero images, and let $L_{DSC}^{\prime}={\{{l^{\prime}_{i}\}}}_{i=1}^{N}$ denote losses between $\hat{Y}$ and $Y^{\prime}$ . We first fit a GMM (termed GMM-1 in Fig. 2) to $L_{DSC}$ to classify $Y$ into Noisy and Clean categories. Subsequently, we fit another GMM (termed GMM-2 in Fig. 2) to $L^{\prime}_{DSC}$ to categorize $Y^{\prime}$ into Noisy and Clean categories. Finally, we cleanse only those samples whose both manual label and pseudo label are identified as Noisy. The new filtering criteria is as follows:

-

If a sample $x_{i}$ has a Clean manual label $y_{i}$ , its pseudo label $y^{\prime}_{i}$ is kept the same as $y_{i}$ , thereby preserving the Clean manual label;
-

If a sample $x_{i}$ has a Noisy manual label and a Clean pseudo label, $y^{\prime}_{i}$ remains unchanged, thus preserving Clean previous pseudo-label;
-

If both the manual label and pseudo label of $x_{i}$ are Noisy, the sample should be filtered out. And its corresponding $y^{\prime}_{i}$ will be cleansed by the LCM with newly generated pseudo labels.

After each cleansing iteration, $Y^{\prime}$ consists of three parts as shown in Fig. 2: 1) preserved Clean manual labels, 2) preserved Clean pseudo labels generated in the previous iteration, and 3) newly generated pseudo-labels. By maintaining a portion of low-noise manual labels, $Y^{\prime}$ ensures the stability of the training process and mitigates cumulative errors to a certain extent.

2.3 Pixel-Level Label Cleansing

Sec. 2.2 introduces how LFM preserves clean labels and filters out noisy labels to be cleansed. Here, we elaborate on the details of LCM based on pseudo-labeling. Known pseudo-labeling methods typically measure feature similarity between pixels and target prototypes to compute new classification scores. Prototypes are often calculated as either the average feature of the suspected region (network predictions) or as several feature vectors of discrete representative pixels. However, these methods struggle to accurately represent local characteristics of targets, leading to inaccuracies in pseudo-label generation. To address this issue, LCM refines the process by selecting the most representative region from the suspected area as prototypes, while discarding low-confidence regions. The features from this refined representative region are then used as prototype features for calculating pseudo-labels.

Prototype Construction: For an input image $x_{i}$ (abbreviated as $x$ ), its feature map, i.e., the output of "Residual_block_6" in Table 1, is denoted as $f$ , and the suspected target region is denoted as $R$ . We construct prototypes based on representative features from $f$ . Specifically, we divide $f$ into $K\times K$ grids ( $K=64$ in experiments) and select the grids most representative of $R$ . The features of these representative grids are then used as prototypes. Note that we process feature maps based on grids instead of pixels, as calculating the representation degree for each pixel sequentially is highly memory-intensive.

Let $G={\{g_{j}\}}_{j=1}^{K^{2}}$ denote the grids, the feature of a grid $g_{j}$ is an aggregated feature calculated as follows:

f(g_{j})=\frac{\sum_{p\in g_{j}}w(p)f(p)}{\sum_{p\in g_{j}}w(p)},\quad w(p)=\left\{\begin{matrix}1,&p\in R\\ 0,&p\notin R\end{matrix}\right.

(4)

where $p$ is a pixel position in $g_{j}$ and $f(p)$ is the feature vector of $p$ . The representation degree $D$ of each grid can be measured by the following equation:

D(g_{j})=\left\{\begin{matrix}\sum_{g_{k}\in R}\cos\left(f(g_{j}),f(g_{k})\right),&g_{j}\in R\\ 0,&g_{j}\notin R\end{matrix}\right.

(5)

where $\cos\ (*,*)$ denotes the cosine similarity. $D(g_{j})$ reflects the similarity between $g_{j}$ and $R$ . A larger $D(g_{j})$ indicates that $g_{j}$ is more representative of target areas. Fig. 3 provides examples of grid-based representation maps. After normalizing $D$ to the range [0,1], we set a threshold $\sigma$ (empirically set to 0.7) to extract grids with $D>\sigma$ as representative grids. Finally, the features of these representative grids, calculated using Eq. 4, are used as prototypes, denoted as $PROTO={\{proto_{m}\}}_{m=1}^{M}(M<K^{2})$ .

Pseudo Labeling: Subsequently, LCM generates a pseudo label for $x$ by updating the classification score of each pixel. For a pixel $q\in x$ , its new foreground probability $P_{fg}(q)$ is calculated by measuring feature similarity between $q$ and prototypes:

P_{fg}(q)=\frac{1}{M}\sum_{m=1}^{M}\cos(f(q),proto_{m})

(6)

The final pseudo label can be obtained by converting the soft probability map into a binary mask with a threshold $\mu$ (empirically set to 0.7).

After obtaining the pseudo-labels, we further refine them through post-processing, which involves two steps: 1) Removing connected foreground regions with small areas. Since the generated pseudo-labels may contain fragmented regions, we empirically remove any connected foreground regions whose area is smaller than 10% of the total foreground area; 2) Filling small holes within the foreground. Similarly, we fill any holes with an area smaller than 10% of the total foreground area.

2.4 Training Scheme

Loss Function: We devise a hybrid loss $L_{\text{hybrid}}$ to optimize the proposed network, which can be expressed as:

L_{\text{hybrid }}=\lambda_{1}(e)L_{DSC}+\lambda_{2}(e)L^{\prime}_{DSC}

(7)

where $L_{DSC}$ measures the difference between network outputs and manual labels, while $L^{\prime}_{DSC}$ reflects the difference between outputs and pseudo labels. The coefficients $\lambda_{1}(e)$ and $\lambda_{2}(e)$ are two epoch-dependent parameters, which are defined as:

\begin{split}\lambda_{1}(e)=\left\{\begin{array}[]{ll}1,&0\leq e<E_{1}\\ \frac{E_{2}-e}{E_{2}-E_{1}},&E_{1}\leq e<E_{2}\\ 0,&E_{2}\leq e<E\end{array}\right.\\ \lambda_{2}(e)=\left\{\begin{array}[]{ll}0,&0\leq e<E_{1}\\ \frac{e-E_{1}}{E_{2}-E_{1}},&E_{1}\leq e<E_{2}\\ 1,&E_{2}\leq e<E\end{array}\right.\end{split}

(8)

where $e$ represents the training epoch, $E$ is the total training epochs ( $E=500$ ), $E_{1}$ is the warm-up epochs ( $E_{1}=E/10$ ), and $E_{2}=E/2$ .

Training Pipeline: The entire training pipeline is illustrated in Fig. 4. In the early training stage, our network relies only on manual labels to learn features. After a warm-up period of 50 epochs, the network begins to update pseudo-labels every $k$ epochs (set to 5 in our experiments), with the contribution of pseudo-labels gradually increasing. As the training becomes stable, the network is fine-tuned using only pseudo-label set. It is important to note that the pseudo-label set always includes some Clean manual labels to ensure that the training process remains controlled.

Our method is implemented by PyTorch 1.5.0 [26] and deployed on an NVIDIA GTX 2080 Ti GPU (12 GB). We use the SGD optimizer [3] with a mini-batch size of 16. The initial learning rate is set to $4\times 10^{-4}$ and is divided by 10 every 50 epochs.

3 Materials and Experiments

3.1 Materials

Two datasets, an in-house CT dataset of liver tumors (HCC dataset) and a public MRI dataset for automatic cardiac diagnosis (ACDC dataset), are used to verify the effectiveness of the proposed method.

The HCC dataset [38] is an in-house liver tumor dataset collected from the First Affiliated Hospital of Zhejiang University School of Medicine, consisting of 231 cases of CT images (30,828 CT slices in total) of hepatocellular carcinoma. Each case has an intra-slice resolution of $512\times 512$ , a slice thickness of 0.5 mm, and an inter-plane resolution varying from $\rm 0.56\times 0.56mm^{2}$ to $\rm 0.85\times 0.85mm^{2}$ . The tumor regions within each sample in the dataset is delineated by experienced radiologists. In our study, all slices are resized to $256\times 256$ and truncated into the range of [-70, 180] Hu to eliminate irrelevant tissues. The dataset is randomly divided into three parts: 139 CT scans for training, 46 CT scans for validation, and 46 CT scans for testing.

The ACDC dataset [2] is a publicly available dataset for automated cardiac diagnosis, acquired at the University Hospital of Dijon (France). It consists of 200 MRI scans from 100 patients, obtained using two MRI scanners with magnetic field strengths of 1.5T and 3.0T. The scans have in-plane resolutions ranging from $\rm 0.70\times 0.70\ mm^{2}$ to $\rm 1.92\times 1.92\ mm^{2}$ , and through-plane resolutions ranging from 5 mm to 10 mm. All data were collected at the End-Diastolic and End-Systolic stages, with annotations provided for the left ventricle (LV), right ventricle (RV), and myocardium (Myo). Following [21], we extracted $128\times 128$ patches centered on cardiac regions for our experiments. These patches were further split into 1488 patches for training, 414 patches for validation and 414 patches for testing.

3.2 Noise Simulation

To comprehensively evaluate our method under disturbance from varying levels of label noise, we augment the training set by adding label noise of different proportions and amplitudes.

For the HCC dataset, we randomly select $\alpha_{1}$ ( $\alpha_{1}\in\{30\%,50\%,70\%\}$ ) proportion of images from the training set and randomly add missing-label noise or boundary noises. Specifically, missing-label noise is introduced by removing annotations, while boundary noise is added by randomly dilating or eroding tumor masks (labels) by $\beta_{1}\pm 3$ ( $\beta_{1}\in\{5,10\}$ ) pixels. This augmentation process results in seven different training sets (including the original set), each corrupted by varying levels of label noise.

Likewise, we generate multiple training sets with different levels of label noise for the ACDC dataset. We randomly select $\alpha_{2}$ ( $\alpha_{2}\in{30\%,50\%,70\%}$ ) of the images from the training set and introduce either missing-label noise or boundary noise. Missing-label noise is added by removing the organ masks, while boundary noise is applied by randomly dilating or eroding the organ masks by $\beta_{2}\pm 3$ pixels ( $\beta_{2}\in\{5,10\}$ ).

3.3 Experimental Setup

Competing Methods: To evaluate the effectiveness of the proposed method, we first compare it with the Baseline method, i.e., the backbone network without any noise reduction strategy. Besides, we compare our method with four relevant noise-resistance segmentation methods:

1.

Pick-and-learn (PL) [43], an image-level cleansing method that assesses image-level quality of labels in a training batch and assigns lower weights for noisy labels when computing losses.
2.

Confident-learning (CL) [36], a pixel-level cleansing method that identifies noisy regions within labels and generates pseudo-labels using Confident Learning [24].
3.

Joint Co-Regularization (JoCoR) [33],a pixel-level cleansing method that uses two networks to select confident pixels (regions) for cross-training.
4.

Superpixel-guided iterative learning (SP) [16], a pixel-level cleansing method that uses superpixel representations to guide noisy label correction.

To ensure the fairness of the comparison, all methods adopt the same segmentation backbone (as described in Sec. 2.1) and are trained by 500 epochs. Besides, to thoroughly assess the noise-resistance capability of each method, we evaluate the performance of both the B-model and L-model for each method. The B-model represents the best checkpoint selected using the validation set, reflecting the actual segmentation performance on the testing set. Besides, the L-model corresponds to the final checkpoint obtained at the $500^{th}$ epoch, reflecting the stability and sustained noise-resistance performance over time.

Evalutation Metrics: Following PL [43] and CL [36], we use the Dice Similarity Coefficient (DSC) metric to quantify segmentation performance by measuring the similarity between the network outputs and the ground truth labels. A higher DSC score indicates better segmentation results and greater robustness to label noise.

Table 2: Quantitative comparison results in DSC (%) on the HCC dataset.The Baseline method, trained on the noise-free training set, establishes the upper bound for performance. The B-model is the best checkpoint selected based on the validation set and the L-model is the last checkpoint obtained at the

500^{th}

epoch. Red numbers indicate the best results, while blue numbers indicate the second-best results.

Noise level		Method	Liver tumor
Noise level		Method	B-model	L-model
Noise-free (upper bound)		Baseline	71.45	70.81
$\alpha_{1}=30\%$	$\beta_{1}=5$	Baseline	70.15	67.90
		PL	70.36	64.57
		CL	70.51	63.10
		JoCoR	70.84	68.51
		SP	70.63	68.92
		Ours	70.85	69.52
	$\beta_{1}=10$	Baseline	62.36	57.72
		PL	64.68	58.94
		CL	65.56	54.53
		JoCoR	69.11	63.32
		SP	68.94	61.00
		Ours	69.55	67.76
$\alpha_{1}=50\%$	$\beta_{1}=5$	Baseline	66.89	64.27
		PL	67.65	65.13
		CL	68.25	64.73
		JoCoR	68.60	65.35
		SP	68.35	65.93
		Ours	69.07	66.54
	$\beta_{1}=10$	Baseline	62.81	58.62
		PL	62.94	59.03
		CL	60.88	52.11
		JoCoR	67.90	59.81
		SP	66.44	59.26
		Ours	68.29	64.31
$\alpha_{1}=70\%$	$\beta_{1}=5$	Baseline	65.51	64.06
		PL	67.02	63.42
		CL	66.87	62.78
		JoCoR	67.07	63.55
		SP	67.96	62.99
		Ours	68.73	64.38
	$\beta_{1}=10$	Baseline	59.71	51.19
		PL	60.21	52.50
		CL	62.43	50.16
		JoCoR	63.38	48.77
		SP	60.73	51.79
		Ours	67.02	63.55

3.4 Results

In this section, we analyze the comparative experimental results between our method and competing methods on the HCC dataset, both qualitatively and quantitatively. First, we analyze the qualitative results from the visual perspective. Fig. 5 shows several examples of the segmentation results (produced by B-model with $\alpha_{1}=70\%$ , $\beta_{1}=10$ ), where difference maps highlight the discrepancies between network-produced results and ground truths. It is evident that the Baseline network is significantly impacted by noisy samples, leading to obvious false negatives (e.g., Case #2 and Case #3). While the four competing methods exhibit some noise-resistance ability, there still remains a notable gap between their segmentation results and the ground truths. The issue with PL is that it addresses only image-level label noise by reducing the influence of noisy samples but fails to account for disturbances from local noisy regions within each sample. In contrast, the three pixel-level methods (CL, JoCoR, SP) identify pixel-level noise within all samples indiscriminately and rectify the training process. These methods may either introduce extra noise into clean labels or lead to decreased network efficiency and stability. Particularly, CL performs label cleansing in a one-shot manner, meaning the generated pseudo-labels are fixed and cannot be further refined as training progresses. In contrast, our method demonstrates superior performance in noise reduction and achieves optimal segmentation results. By combining the advantages of both image-level and pixel-level methods, our approach preserves clean samples while effectively cleansing noisy ones during training. Furthermore, our method is an iterative method that updates pseudo-labels to reduce noise throughout the training process. As training progresses, these pseudo-labels are continuously refined, enabling the network to better learn robust target features.

Further, we present quantitative results for different methods in Table 2. The Baseline method, trained on the noise-free training set, provides the upper bound for performance. The B-model, representing the best checkpoint selected using the validation set, reflects the actual segmentation performance of the network. The L-model, corresponding to the last checkpoint obtained at the $500^{th}$ epoch, indicates network stability and sustained noise-resistance ability during training. From Table 2, it is evident that PL demonstrates robustness to label noise in the B-model and shows some sustained noise resistance in the L-model. CL exhibits noise resistance in the B-model but shows poor stability and sustained noise robustness (as measured by the L-model). As the level of noise increases, (with larger $\alpha_{1}$ and $\beta_{1}$ ), both methods experience a more significant decline in stability and noise resistance. JoCoR and SP show improvements in stability (as reflected in the L-model), but the results are still not ideal. In contrast, our method achieves the best results in both the B-model and L-model. More importantly, as the noise level increases, our method shows increasingly effective noise-resistance performance. Based on both quantitative and qualitative analyses, it can be concluded that our method outperforms competing methods in terms of noise-resistance capability and stability.

Table 3: Generalization validation on the ACDC dataset (measured in DSC (%)). The Baseline method, trained on the noise-free training set, provides the upper bound for performance. The B-model is the best checkpoint selected using the validation set and the L-model is the last checkpoint obtained in the

500^{th}

epoch. Red numbers indicate the best results, while blue numbers indicate the second-best results.

Noise level		Method	RV		Myo		LV		Average
Noise level		Method	B-model	L-model	B-model	L-model	B-model	L-model	B-model	L-model
Noise-free		Baseline	76.60	72.89	78.85	77.30	83.59	81.19	79.68	77.12
$\alpha_{2}=30\%$	$\beta_{2}=5$	Baseline	76.08	59.58	75.75	58.99	81.57	65.54	77.80	61.37
		PL	73.96	59.73	75.88	72.63	80.67	76.93	76.83	69.76
		CL	78.89	59.37	76.09	66.89	82.15	72.69	79.04	66.31
		JoCoR	75.50	62.88	76.63	67.36	82.21	74.23	78.11	68.16
		SP	75.91	68.07	77.09	70.78	81.63	78.46	78.21	72.44
		Ours	80.32	71.66	78.20	76.47	82.47	80.09	80.33	76.07
	$\beta_{2}=10$	Baseline	74.90	53.28	77.32	60.99	81.75	67.17	77.99	60.48
		PL	75.32	58.99	75.05	73.91	80.67	79.90	77.01	70.93
		CL	76.05	58.73	77.37	66.70	82.17	71.60	78.53	65.68
		JoCoR	76.87	63.57	77.26	66.81	82.13	73.87	78.75	68.08
		SP	75.30	61.64	77.39	69.30	81.42	75.57	78.04	68.83
		Ours	77.87	69.46	78.10	77.25	84.06	82.20	80.01	76.30
$\alpha_{2}=50\%$	$\beta_{2}=5$	Baseline	71.11	41.67	72.27	42.64	78.14	51.10	73.84	45.14
		PL	69.03	61.58	74.03	63.92	63.98	61.58	69.01	62.36
		CL	70.96	55.11	73.79	52.01	79.78	58.67	74.84	55.26
		JoCoR	71.22	56.83	73.69	56.79	80.13	64.82	75.01	59.48
		SP	71.28	63.70	74.65	64.70	79.19	75.25	75.04	67.81
		Ours	73.05	66.95	76.06	74.86	80.14	79.27	76.41	73.69
	$\beta_{2}=10$	Baseline	72.94	42.05	74.46	47.16	80.97	55.66	76.12	48.29
		PL	70.54	66.16	71.58	66.86	77.72	72.46	73.28	68.49
		CL	71.88	61.61	74.98	53.79	81.22	63.82	76.02	59.74
		JoCoR	70.48	61.83	71.18	62.95	78.08	69.88	73.25	64.89
		SP	71.68	65.44	74.75	64.95	80.76	73.75	75.73	68.04
		Ours	73.21	67.12	75.74	71.84	81.33	75.77	76.76	71.57
$\alpha_{2}=70\%$	$\beta_{2}=5$	Baseline	61.30	23.50	58.09	16.33	69.55	27.51	62.98	22.45
		PL	65.85	49.43	62.87	38.03	74.10	48.55	67.60	45.34
		CL	62.42	46.26	57.09	37.47	66.63	49.31	62.04	44.35
		JoCoR	62.01	38.70	61.11	32.40	75.80	38.83	66.31	36.64
		SP	65.80	56.59	62.76	47.91	73.60	62.09	67.39	55.53
		Ours	77.70	64.52	75.65	68.46	80.64	71.73	78.00	68.24
	$\beta_{2}=10$	Baseline	53.11	25.89	53.57	24.30	60.16	31.45	55.61	27.21
		PL	66.26	40.66	72.68	40.31	76.96	40.70	71.96	40.56
		CL	54.94	35.53	60.12	29.42	67.32	41.60	60.79	35.52
		JoCoR	63.36	28.20	63.09	31.58	68.72	38.60	65.05	32.79
		SP	64.32	55.39	58.69	50.43	65.54	61.99	62.58	55.94
		Ours	72.45	69.89	72.84	66.84	80.12	74.72	75.14	70.48

Table 4: Ablation study on the HCC dataset (in DSC(%)).

Components	B-model	L-model
Baseline	59.71	51.19
Baseline+LCM	61.07	53.72
Baseline+LCM (pred)	60.08	52.42
Baseline+LCM (avg)	60.13	52.36
Baseline+LCM+LFM	67.02	63.55

3.5 Generalization Validation

To assess the generalization ability of the proposed method across different datasets and segmentation tasks, we conduct additional validation experiments on the ACDC dataset. Specifically, we focus on segmenting three different organs—left ventricle (LV), right ventricle (RV), and myocardium (Myo)—from MRI images. Fig. 6 provides visual examples of segmentation results produced by different methods (produced by B-model with $\alpha_{2}=70\%$ , $\beta_{2}=10$ ). It is shown that our method produces results that are most similar to the ground truths. Table 3 presents the quantitative results, including the analysis of both the B-model and L-model for each method. The results demonstrate that the proposed method outperforms competing methods across the three-organ segmentation tasks. Besides, the improvement of the stability and sustained noise-resistance ability of the network is particularly evident compared with other four methods. These findings on the ACDC dataset demonstrate the generalization capability of our method, implying its noise-resistance ability across different datasets and segmentation tasks.

3.6 Ablation Study

To verify the effectiveness of each component of the proposed method, we conduct an ablation study on the HCC dataset (with $\alpha_{1}=70\%$ , $\beta_{1}=10$ ). We start from the baseline network (the segmentation backbone without any noise-cleansing strategy) and incrementally add the proposed components. Table. 4 summarizes the results of ablation study. It is observed that the B-model of the baseline achieves the segmentation performance of 59.71% (in DSC) on the test set. However, as training proceeds, the network gradually overfits to noisy samples, the segmentation performance eventually falls back to 51.19% (L-model).

Effectiveness of LCM: We first add LCM to the baseline to assess its effectiveness, whose specific structure is described in Section 2.3. After the warm-up training, LCM indiscriminately corrects all samples every 5 epochs without preserving clean samples. Fig. 7 illustrates several examples of cleansed labels produced by the LCM, demonstrating its effectiveness in correcting both missing labels (Case #1) and boundary noise (Cases #2 and #3). Table 4 provides the corresponding quantitative results, which shows that the LCM improves the segmentation performance of the B-model by +1.36% (in DSC) and also enhances the stability of the baseline, resulting in a +2.53% (in DSC) improvement in the performance of the L-model.

Meanwhile, we compare the pseudo-labeling method used in LCM with two other pseudo-labeling methods: 1) LCM (pred), which directly employs network predictions as pseudo labels; 2) LCM (avg), which yields pseudo labels based on averaged prototypes (the averaged features of suspected target regions). The experimental results of the comparison are shown in Table 4. The results indicate that LCM based on representative prototypes produces more accurate pseudo-labels as it generates better representative features of the target regions.

Effectiveness of LFM: Although LCM effectively improves label quality, indiscriminately modifying all labels can corrupt clean and low-noise manual labels, as the generated pseudo-labels still contain some noise. Such noise can accumulate as training progresses, which potentially limits the performance and stability of the network.

To address this issue, we introduce LFM to allow the network to preserve clean samples and correct only noisy samples in each cleansing iteration, thereby reducing the potential noise introduced by pseudo-labeling. LFM achieves this by modeling per-sample loss distributions and classify labels into Clean and Noisy categories. Incorporating LFM ensures that the network is consistently supervised by low-noise manual labels, which stabilizes the training process and prevents error accumulation. As shown in Table 4, LFM significantly boosts performance by +5.95% in the B-model and +9.83% in the L-model, which demonstrates its effectiveness.

4 Discussion

In this paper, we propose a self-cleansing network for medical image segmentation that is highly resistant to label noise. To achieve this, our method starts by filtering out noisy labels that require cleaning while preserving clean labels based on per-sample loss distributions. It then generates pseudo-labels for the noisy labels using representative target prototypes. Finally, both the preserved clean labels and the generated pseudo-labels are used together to supervise the network. The experimental results of our method are detailed in Section 3.4. We validate its effectiveness on the HCC dataset of liver tumors and it shows superior sustained noise-resistance compared to competing methods. Additionally, validation on the publicly available ACDC dataset for automatic cardiac diagnosis demonstrates the generation ability of our method across diverse datasets and segmentation tasks.

The proposed method offers two main advantages. Firstly, it effectively preserves clean and low-noise labels in each self-cleansing iteration. This is achieved by identifying image-level label noise from both manual labels and previously generated pseudo-labels using LFM. This strategy allows the network to cleanse only a subset of samples in each iteration, ensuring that a certain portion of clean labels are preserved and enhancing stability throughout the training process. Secondly, the method employs continuous representative regions as prototypes for updating pseudo-labels through LCM. These prototypes accurately capture local information within target regions, which helps to generate more precise pseudo-labels.

Despite these advantages, our approach has certain limitations. For example, the framework relies on a single segmentation network for label noise cleansing. Once the network begins fitting noisy labels, the LFM based on loss distributions may struggle to accurately assess the noise level. As training progresses, the network might amplify these errors, which is a phenomenon known as confirmation bias [19]. This can negatively impact the network’s performance and lead to a gradual decline in accuracy during the later stages of training. To address this issue, future work could introduce a co-training approach. Such approach would involve training two segmentation networks simultaneously and allow them to mutually identify and correct label noise, thus avoiding confirmation bias. Each network would update its pseudo-labels based on the other network’s predictions, and both networks would adjust their parameters using the pseudo-labels generated by their counterpart. This co-training mode could help mitigate confirmation bias caused by using a single network, which has the potential to further improve the performance.

5 Conclusion

In this paper, we propose a deep self-cleansing network for medical image segmentation to effectively reduce label noise in training datasets. Our network is designed to preserve clean labels while iteratively cleansing noisy labels throughout the training process. We develop a Label Filtering Module based on a gaussian mixture model to distinguish between noisy and clean labels. Additionally, we introduce a Label Cleansing Module that uses pseudo-labeling to correct the identified noisy samples. Experimental evaluations on a clinical CT dataset of liver tumors and a publicly available MRI dataset for automated cardiac diagnosis show that our method consistently outperforms several competing methods. Furthermore, the results demonstrate that our method is robust when dealing with varying levels of label noise.

References

[1] Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In International conference on machine learning, pages 233–242. PMLR, 2017.
[2] Olivier Bernard, Alain Lalande, Clement Zotti, Frederick Cervenansky, Xin Yang, Pheng-Ann Heng, Irem Cetin, Karim Lekadir, Oscar Camara, Miguel Angel Gonzalez Ballester, et al. Deep learning techniques for automatic mri cardiac multi-structures segmentation and diagnosis: is the problem solved? IEEE transactions on medical imaging, 37(11):2514–2525, 2018.
[3] Léon Bottou. Stochastic gradient descent tricks. In Neural networks: Tricks of the trade, pages 421–436. Springer, 2012.
[4] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. Swin-unet: Unet-like pure transformer for medical image segmentation. In European conference on computer vision, pages 205–218. Springer, 2022.
[5] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.
[6] Pengfei Chen, Ben Ben Liao, Guangyong Chen, and Shengyu Zhang. Understanding and utilizing deep neural networks trained with noisy labels. In International Conference on Machine Learning, pages 1062–1070. PMLR, 2019.
[7] Hao Cheng, Zhaowei Zhu, Xingyu Li, Yifei Gong, Xing Sun, and Yang Liu. Learning with instance-dependent label noise: A sample sieve approach. arXiv preprint arXiv:2010.02347, 2020.
[8] Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger. 3d u-net: learning dense volumetric segmentation from sparse annotation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016: 19th International Conference, Athens, Greece, October 17-21, 2016, Proceedings, Part II 19, pages 424–432. Springer, 2016.
[9] Shuanhu Di, Yu-Qian Zhao, Miao Liao, Fan Zhang, and Xiong Li. Td-net: A hybrid end-to-end network for automatic liver tumor segmentation from ct images. IEEE Journal of Biomedical and Health Informatics, 27(3):1163–1172, 2022.
[10] Aritra Ghosh, Himanshu Kumar, and P Shanti Sastry. Robust loss functions under label noise for deep neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 31, 2017.
[11] Jiangfan Han, Ping Luo, and Xiaogang Wang. Deep self-learning from noisy labels. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5138–5147, 2019.
[12] Ali Hatamizadeh, Yucheng Tang, Vishwesh Nath, Dong Yang, Andriy Myronenko, Bennett Landman, Holger R Roth, and Daguang Xu. Unetr: Transformers for 3d medical image segmentation. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 574–584, 2022.
[13] Huimin Huang, Lanfen Lin, Ruofeng Tong, Hongjie Hu, Qiaowei Zhang, Yutaro Iwamoto, Xianhua Han, Yen-Wei Chen, and Jian Wu. Unet 3+: A full-scale connected unet for medical image segmentation. In ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 1055–1059. IEEE, 2020.
[14] Yun Jiang, Yuan Zhang, Xin Lin, Jinkun Dong, Tongtong Cheng, and Jing Liang. Swinbts: A method for 3d multimodal brain tumor segmentation using swin transformer. Brain sciences, 12(6):797, 2022.
[15] Junnan Li, Richard Socher, and Steven CH Hoi. Dividemix: Learning with noisy labels as semi-supervised learning. arXiv preprint arXiv:2002.07394, 2020.
[16] Shuailin Li, Zhitong Gao, and Xuming He. Superpixel-guided iterative learning from noisy labels for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, pages 525–535. Springer, 2021.
[17] Xiaomeng Li, Hao Chen, Xiaojuan Qi, Qi Dou, Chi-Wing Fu, and Pheng-Ann Heng. H-denseunet: hybrid densely connected unet for liver and tumor segmentation from ct volumes. IEEE transactions on medical imaging, 37(12):2663–2674, 2018.
[18] Ailiang Lin, Bingzhi Chen, Jiayu Xu, Zheng Zhang, Guangming Lu, and David Zhang. Ds-transunet: Dual swin transformer u-net for medical image segmentation. IEEE Transactions on Instrumentation and Measurement, 71:1–15, 2022.
[19] Luyan Liu, Zhengdong Zhang, Shuai Li, Kai Ma, and Yefeng Zheng. S-cuda: self-cleansing unsupervised domain adaptation for medical image segmentation. Medical Image Analysis, 74:102214, 2021.
[20] Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on pattern analysis and machine intelligence, 38(3):447–461, 2015.
[21] Xiangde Luo, Wenjun Liao, Jieneng Chen, Tao Song, Yinan Chen, Shichuan Zhang, Nianyong Chen, Guotai Wang, and Shaoting Zhang. Efficient semi-supervised gross target volume of nasopharyngeal carcinoma segmentation via uncertainty rectified pyramid consistency. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part II 24, pages 318–329. Springer, 2021.
[22] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 fourth international conference on 3D vision (3DV), pages 565–571. Ieee, 2016.
[23] Todd K Moon. The expectation-maximization algorithm. IEEE Signal processing magazine, 13(6):47–60, 1996.
[24] Curtis Northcutt, Lu Jiang, and Isaac Chuang. Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 70:1373–1411, 2021.
[25] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y Hammerla, Bernhard Kainz, et al. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999, 2018.
[26] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
[27] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In International conference on machine learning, pages 4334–4343. PMLR, 2018.
[28] Douglas A Reynolds. Gaussian mixture models. Encyclopedia of biometrics, 741(659-663), 2009.
[29] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
[30] Hyunseok Seo, Charles Huang, Maxime Bassenne, Ruoxiu Xiao, and Lei Xing. Modified u-net (mu-net) with incorporation of object-dependent high level features for improved liver and liver-tumor segmentation in ct images. IEEE transactions on medical imaging, 39(5):1316–1325, 2019.
[31] Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, and Deyu Meng. Meta-weight-net: Learning an explicit mapping for sample weighting. Advances in neural information processing systems, 32, 2019.
[32] Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5552–5560, 2018.
[33] Hongxin Wei, Lei Feng, Xiangyu Chen, and Bo An. Combating noisy labels by agreement: A joint training method with co-regularization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13726–13735, 2020.
[34] Xiao Xiao, Shen Lian, Zhiming Luo, and Shaozi Li. Weighted res-unet for high-quality retina vessel segmentation. In 2018 9th international conference on information technology in medicine and education (ITME), pages 327–331. IEEE, 2018.
[35] Zhe Xu, Donghuan Lu, Yixin Wang, Jie Luo, Jagadeesan Jayender, Kai Ma, Yefeng Zheng, and Xiu Li. Noisy labels are treasure: mean-teacher-assisted confident learning for hepatic vessel segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 3–13. Springer, 2021.
[36] Minqing Zhang, Jiantao Gao, Zhen Lyu, Weibing Zhao, Qin Wang, Weizhen Ding, Sheng Wang, Zhen Li, and Shuguang Cui. Characterizing label errors: confident learning for noisy-labeled image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 721–730. Springer, 2020.
[37] Tianwei Zhang, Lequan Yu, Na Hu, Su Lv, and Shi Gu. Robust medical image segmentation from non-expert annotations with tri-network. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part IV 23, pages 249–258. Springer, 2020.
[38] Yue Zhang, Chengtao Peng, Liying Peng, Yingying Xu, Lanfen Lin, Ruofeng Tong, Zhiyi Peng, Xiongwei Mao, Hongjie Hu, Yen-Wei Chen, et al. Deeprecs: From recist diameters to precise liver tumor segmentation. IEEE Journal of Biomedical and Health Informatics, 26(2):614–625, 2021.
[39] Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. Advances in neural information processing systems, 31, 2018.
[40] Rencheng Zheng, Qidong Wang, Shuangzhi Lv, Cuiping Li, Chengyan Wang, Weibo Chen, and He Wang. Automatic liver tumor segmentation on dynamic contrast enhanced mri using 4d information: deep learning model based on 3d convolution and convolutional lstm. IEEE Transactions on Medical Imaging, 41(10):2965–2976, 2022.
[41] Songzhu Zheng, Pengxiang Wu, Aman Goswami, Mayank Goswami, Dimitris Metaxas, and Chao Chen. Error-bounded correction of noisy labels. In International Conference on Machine Learning, pages 11447–11457. PMLR, 2020.
[42] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4, pages 3–11. Springer, 2018.
[43] Haidong Zhu, Jialin Shi, and Ji Wu. Pick-and-learn: automatic quality evaluation for noisy-labeled image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 576–584. Springer, 2019.