
Weakly Supervised Segmentation Framework for Thyroid Nodule Based on High-confidence Labels and High-rationality Losses

Jianning Chi, Member, IEEE, Zelan Li, Geng Lin, Mingyang Sun, and Xiaosheng Yu. Manuscript submitted on February 26, 2025. (Corresponding author: Jianning Chi.) Jianning Chi is with the Faculty of Robot Science and Engineering, Northeastern University, Shenyang, 110169, China (e-mail: chijianning@mail.neu.edu.cn). Zelan Li is with the Faculty of Robot Science and Engineering, Northeastern University, Shenyang, 110169, China (e-mail: 2410844@stu.neu.edu.cn). Geng Lin is with the Faculty of Robot Science and Engineering, Northeastern University, Shenyang, 110169, China (e-mail: 2202048@stu.neu.edu.cn). Mingyang Sun is with the Faculty of Robot Science and Engineering, Northeastern University, Shenyang, 110169, China (e-mail: 2302161@stu.neu.edu.cn). Xiaosheng Yu is with the Faculty of Robot Science and Engineering, Northeastern University, Shenyang, 110169, China (e-mail: yuxiaosheng@mail.neu.edu.cn).
Abstract

Weakly supervised segmentation methods can delineate thyroid nodules in ultrasound images efficiently using training data with coarse labels, but suffer from: 1) low-confidence pseudo-labels that simply follow topological priors, introducing significant label noise, and 2) low-rationality loss functions that rigidly compare the segmentation with labels, ignoring discriminative information for nodules with diverse and complex shapes. To solve these problems, we clarify the objective and the required references of weakly supervised ultrasound image segmentation, and present a framework with high-confidence pseudo-labels that represent topological and anatomical information, and high-rationality losses that learn multi-level discriminative information. Specifically, we fuse geometric transformations of four-point annotations with results from the MedSAM model prompted by these annotations to generate high-confidence box, foreground, and background labels. We design a high-rationality learning strategy comprising: 1) an alignment loss that measures the spatial projection consistency between the segmentation and the box label, and the topological continuity of the segmentation within the foreground label, guiding the network to perceive the location arrangement of nodule features; 2) a contrastive loss that pulls together features sampled from labeled foreground regions while pushing apart features sampled from labeled foreground and background regions, guiding the network to capture the regional distribution of nodule and background features; 3) a prototype correlation loss that measures the consistency between correlation maps derived by comparing features with the foreground and background prototypes respectively, shrinking uncertain regions toward precise nodule edge delineation. Experimental results demonstrate that our method achieves state-of-the-art performance on the publicly available TN3K and DDTI datasets. The code is publicly available at HCL-HRL.

Index Terms

Medical image segmentation; Thyroid nodule; Deep learning; Weakly supervised segmentation.

1 Introduction


Thyroid nodule segmentation in the ultrasound image is critical for accurate thyroid disease diagnosis [1, 2], but suffers from the blurred structures of anatomy with speckle noise, making it highly dependent on the expertise of the radiologist [3]. Employing deep learning algorithms [4, 5, 6, 7] for thyroid nodule segmentation can significantly enhance diagnostic efficiency for healthcare professionals. While fully supervised algorithms [8, 9, 10, 11, 12] achieve promising performance on specific datasets where precise ground truth masks are available for training, acquiring a large number of delicate annotations remains resource-intensive and time-consuming [13].

Weakly supervised segmentation (WSS) algorithms offer attractive alternatives by utilizing coarse annotations such as bounding boxes [14, 15, 16, 17, 18], points [19, 20, 21], or scribbles [22, 23, 24] to achieve accurate segmentation results. These approaches are particularly suitable for thyroid nodule segmentation in clinical practice, since they can utilize simple four-point annotations provided by clinicians as training supervision. However, existing weakly supervised methods still face the following challenges: 1) they typically generate low-confidence pseudo-labels based only on topological geometric priors [14, 15, 25, 26, 21, 16], introducing label noise and potentially misleading training under such uncertain or ambiguous conditions; 2) they primarily adopt rigid learning strategies such as comparing the segmentation with fixed-shape labels or pseudo-labels [14, 15, 27, 28], severely limiting their flexibility and adaptability in handling diverse and complex nodule variations.

To address the aforementioned challenges, we propose a weakly supervised segmentation framework that leverages clinical four-point annotations to generate high-confidence pseudo-labels with both topological and anatomical information, and learns location-level, region-level, and edge-level discriminative information through high-rationality losses. Specifically, we fuse geometric transformations of four-point annotations with results from MedSAM prompted by these annotations to generate high-confidence box, foreground, and background labels. We then design a high-rationality multi-level learning strategy consisting of: 1) an alignment loss that measures the spatial projection consistency between the segmentation and the box label, and the topological continuity of the segmentation within the foreground label, guiding the network to perceive the location arrangement of nodule features; 2) a contrastive loss that reduces the distances between features sampled from the labeled foreground regions while increasing the distances between features sampled from the labeled foreground and background regions, guiding the network to capture the regional distribution of nodule and background features; 3) a prototype correlation loss that measures the consistency between correlation maps derived by comparing deep features with the foreground and background prototypes respectively, so that uncertain regions gradually evolve into a precise nodule edge delineation. In summary, the contributions of our work are as follows:

  1. We propose a high-confidence pseudo-label generation method that fuses geometric transformations of point annotations with segmentations provided by MedSAM using prompts derived from those annotations, preventing label noise from misleading the network during training.

  2. We introduce a series of high-rationality losses, including an alignment loss, a contrastive loss, and a prototype correlation loss. These losses guide the network to capture multi-level discriminative information on thyroid nodules’ locations and shapes, significantly enhancing the reliability of the training process.

  3. Extensive experiments show that our method achieves state-of-the-art performance on the publicly available thyroid nodule ultrasound datasets TN3K [29] and DDTI [30]. Additionally, the code for this paper is publicly available at HCL-HRL.

2 Related Work

2.1 Medical Image Segmentation

Medical image segmentation is a fundamental task in radiology and pathology, enabling automated analysis of medical images to assist in diagnosis and treatment planning [31, 32]. In recent years, deep learning technology has achieved significant progress in the field of medical image segmentation [31, 33]. Models based on convolutional neural networks (CNN) [8, 12, 34, 35], such as U-Net [8] and its variant networks [12, 34, 35], adopt an encoder-decoder architecture that enables them to maintain high resolution while extracting multi-scale feature information. On the other hand, Vision Transformer-based networks [11, 6, 36, 37, 38], like TransUnet [38] and SWin-Unet [11], utilize attention mechanisms during encoding or decoding processes to capture both local and global features of images through transformers, thereby learning more precise results for medical image segmentation. Moreover, models based on Mamba [39, 40, 41, 42, 43], such as SegMamba [41] and VM-UNet [42], effectively capture long-range dependencies in full-scale features across various scales using state space models, achieving competitive performance in medical image segmentation tasks.

In recent years, algorithms for thyroid nodule segmentation based on fully supervised precise labels [5, 44, 45, 46, 47, 48] have been extensively studied. Chi et al. [5] employed transformer attention mechanisms to extract intra-frame and inter-frame contextual features within thyroid nodule regions, achieving competitive segmentation results. Chen et al. [44] developed a multi-view learning model, which introduced deep convolutional neural networks to encode local view features and a cross-layer graph convolution module to learn the correlations between high-level and low-level features for superior segmentation performance. Wu et al. [45] introduced dynamic conditional encoding and a feature frequency parser based on the diffusion probabilistic model, achieving excellent results in thyroid nodule segmentation on ultrasound images.

Despite their competitive performance in medical image segmentation, these deep learning methods require extensive annotated data, which demands significant efforts and time in data collection and management, making them impractical for clinical settings.

2.2 Weakly Supervised Segmentation Methods

Weakly supervised learning represents an emerging learning paradigm that requires only a small amount of coarse-grained annotation information for model training [49, 50]. This approach significantly reduces the annotation workload while maintaining promising segmentation accuracy [51].

Typical methods focus on directly exploiting sparse annotations or inaccurate geometric shapes to generate pseudo-labels [13] for pixel-to-pixel region learning. For instance, some approaches incorporate conditional probability modeling techniques, such as conditional random fields (CRF) [14, 15], and uncertainty estimates [25, 26] into the training process directly using weakly supervised labels to learn predictions. Other methods generate pseudo-masks based on topological geometric transforms [22, 21, 52, 20]. For example, Zhao et al. [21] employ quadrilaterals as conservative labels and irregular ellipses as radical labels while introducing dual-branch designs to improve the consistency of pseudo-labels during training, thereby enhancing prediction accuracy. Similarly, Li et al. [20] propose a method that generates octagons from point annotations to serve as initial contours for iteratively refining thyroid nodule boundaries through active contour learning.

Recently, BoxInst [16] employs box annotations to localize segmentation targets, combining color similarity with graph neural networks to delineate segmentation boundaries. Nevertheless, for thyroid ultrasound images with low contrast and blurred boundaries, color similarity cannot fully indicate the thyroid nodule’s boundary. Inspired by BoxInst, Du et al. [27] proposed an algorithm that learns the location and geometric prior of organs mainly relying on the region of interest (ROI) feature, which is useful for organ segmentation with fixed prior shapes but not suitable for thyroid nodules with diverse and complex shapes.

Although recent advancements in weakly supervised segmentation have yielded promising results, delicate segmentation learning is still hindered by pseudo-label noise arising from dependence on low-confidence pseudo-labels and by rigid learning strategies that compare the segmentation with fixed-shape labels or pseudo-labels.

3 Method

Figure 1: Overview of the proposed HCL-HRL framework. 1) High-confidence multi-level labels are generated by combining MedSAM results with geometric transformation outputs. 2) High-rationality multi-level learning strategy: The upper branch processes deep features through the segmentation head to generate segmentation predictions and applies alignment loss for location-level learning. The lower branch refines feature representations by calculating contrastive loss for region-level learning and the prototype correlation loss for edge-level learning.

3.1 Overall Framework

As shown in Fig. 1, we propose a novel weakly supervised segmentation (WSS) framework for thyroid nodule segmentation. The framework consists of a high-confidence multi-level label generation flow and high-rationality multi-level learning branches. In the label generation flow, we fuse MedSAM results with geometric transformations of four-point annotations to generate high-confidence multi-level labels. In the multi-level learning branches, we use high-rationality losses consisting of an alignment loss, a contrastive loss, and a prototype correlation loss to jointly learn precise segmentation locations and delicate shapes.

3.2 Objective

Image segmentation aims to locate and delineate regions precisely. Fully supervised methods can use ground truth labels with a clear location Loc and shape S to learn feature distributions through pixel-to-pixel comparison. However, weakly supervised approaches for thyroid ultrasound images can only refer to coarse localization cues without precise shape details. This limitation renders traditional optimization objectives for shape S learning infeasible, prompting a shift toward directly assessing the rationality of feature distributions across the image space.

Let I denote the image, with R_{f} as the foreground region and R_{b} as the background region. Sample sets within these regions are denoted by X_{f} and X_{b}. The segmentation network is represented by F(x;\theta), where x is the input of the network and \theta are the parameters. Besides the basic conditions that 1) the union of R_{f} and R_{b} equals the entire image (R_{f}\cup R_{b}=I), 2) their intersection is empty (R_{f}\cap R_{b}=\emptyset), and 3) both R_{f} and R_{b} are fully connected, weakly supervised algorithms that identify regions R_{f} and R_{b} should also satisfy:

Objective 1: R_{f} should lie within a predefined location range Loc, while R_{b} should be outside this range.

R_{f}\subseteq Loc \And R_{b}\cap Loc=\emptyset \qquad (1)

Objective 2: The prototype of the predicted foreground regions P(F(X_{f};\theta)) should closely match the reference foreground prototype P_{f}, while the prototype of the predicted background regions P(F(X_{b};\theta)) should align with the reference background prototype P_{b}, as shown below:

\mathcal{D}(P(F(X_{f};\theta)),P_{f})<\epsilon_{f}, \qquad (2)
\mathcal{D}(P(F(X_{b};\theta)),P_{b})<\epsilon_{b},

where \mathcal{D}(\cdot,\cdot) denotes the distance between feature prototypes, and \epsilon_{f}, \epsilon_{b} are small thresholds ensuring proximity of the segmentation prototypes to the reference prototypes.

Objective 3: The sampled feature distribution in the predicted foreground F(X\in X_{f};\theta) should align with the foreground prototype P(F(X_{f};\theta)), and that in the predicted background F(X\in X_{b};\theta) should match the background prototype P(F(X_{b};\theta)), as shown below:

\mathcal{D}(F(X\in X_{f};\theta),P(F(X_{f};\theta)))<\delta_{f}, \qquad (3)
\mathcal{D}(F(X\in X_{b};\theta),P(F(X_{b};\theta)))<\delta_{b},

where \delta_{f}, \delta_{b} are thresholds ensuring similarity between the sampled feature distributions and the corresponding prototypes.

To achieve these objectives, high-confidence references used in the optimization process are essential, as follows:

Reference 1: A correct range G_{box} must be provided to guide the segmentation location process Loc.

Reference 2: High-confidence labels G_{f} and G_{b} are necessary to define precise foreground and background references P_{f} and P_{b}.

3.3 High-confidence Multi-level labels Generation

According to References 1 and 2 discussed in Section 3.2, weakly supervised segmentation requires spatial labels for location learning and region distribution labels for shape learning. In this section, we integrate geometric transformations of point annotations and segmentations from MedSAM to generate high-confidence box labels G_{box} for location Loc learning in (1), as well as high-confidence foreground labels G_{f} and background labels G_{b} for shape S learning in (2) and (3).

Table 1: The precision of different labels as region prototypes. Box represents using the bounding box as the label, MedSAM denotes using the MedSAM result as the label, and HC f/b represents our proposed high-confidence foreground/background labels.
Labels | TN3K Foreground | TN3K Background | DDTI Foreground | DDTI Background
Box | 66.14% ± 7.08% | 99.98% ± 0.75% | 73.64% ± 5.59% | 99.98% ± 0.56%
MedSAM | 92.36% ± 4.44% | 95.85% ± 3.21% | 93.76% ± 3.66% | 96.97% ± 2.38%
HC f/b | 99.66% ± 2.82% | 99.99% ± 0.50% | 99.79% ± 0.88% | 99.99% ± 0.01%

Specifically, as illustrated in Fig. 1, we derive three geometric transformations representing low-level location information from clinical annotations:

  • Connecting the endpoints of the annotations along each axis to form quadrilateral regions.

  • Identifying and filling the minimum bounding box enclosing the four points of each target to create box regions encompassing all foreground pixels.

  • Negating the bounding box regions to obtain background-only regions.

The region representing high-level semantic information is generated by MedSAM:

  • Using MedSAM with prompts computed from the point annotations to obtain segmentation masks that reflect anatomical distributions in the input images.

Finally, we fuse the geometric transformations of four-point annotations and results from MedSAM to generate high-confidence box, foreground, and background labels.
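For illustration, the following is a minimal sketch of how the multi-level labels could be assembled from one four-point annotation and a MedSAM prediction. The fusion rule shown (foreground as the intersection of the quadrilateral and the MedSAM mask; background as the complement of the union of the box and the MedSAM mask) is our assumption of one plausible instantiation, not the authors' exact implementation, and medsam_mask is assumed to be precomputed by prompting MedSAM with the box derived from the same points.

```python
import numpy as np
import cv2  # used here only for polygon filling


def high_confidence_labels(points, medsam_mask):
    """Sketch of multi-level label generation from a four-point annotation.

    points:      (4, 2) array of annotation points (x, y), assumed ordered as
                 the top/right/bottom/left extreme points of the nodule.
    medsam_mask: (H, W) binary mask predicted by MedSAM when prompted with the
                 box derived from the same points (assumed available).
    Returns G_box, G_f (foreground), G_b (background) as binary (H, W) masks.
    """
    h, w = medsam_mask.shape

    # 1) Minimum bounding box enclosing the four points -> box label G_box.
    x0, y0 = points.min(axis=0)
    x1, y1 = points.max(axis=0)
    g_box = np.zeros((h, w), np.uint8)
    g_box[int(y0):int(y1) + 1, int(x0):int(x1) + 1] = 1

    # 2) Quadrilateral formed by connecting the four annotation endpoints.
    quad = np.zeros((h, w), np.uint8)
    cv2.fillPoly(quad, [points.astype(np.int32)], 1)

    # 3) Fusion (assumed rule): high-confidence foreground lies inside both the
    #    quadrilateral and the MedSAM mask; high-confidence background lies
    #    outside both the box and the MedSAM mask.
    g_f = ((quad > 0) & (medsam_mask > 0)).astype(np.uint8)
    g_b = ((g_box == 0) & (medsam_mask == 0)).astype(np.uint8)
    return g_box, g_f, g_b
```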

As illustrated in Table 1, our generated high-confidence foreground and background labels achieve precision exceeding 99.5% across both dataset distributions, guaranteeing that precise foreground and background references, as outlined in (2) and (3), can be learned from these label regions.

3.4 High-rationality Multi-level Learning Strategy

According to Objectives 1, 2, and 3 proposed in Section 3.2, we design a high-rationality learning strategy that guides the network to learn location and shape features satisfying all conditions of the weakly supervised segmentation task. The strategy consists of three losses: the alignment loss, the contrastive loss, and the prototype correlation loss.

3.4.1 Alignment Loss for Location Learning

To learn the location information as described in (1), the alignment loss consists of two components: 1) alignment of projected segmentation results with bounding box labels, and 2) topological continuity loss in high-confidence foreground regions.

For a predicted result m_{i} and a bounding box label G_{box} derived from the point annotations, we define the projections of each mask as:

p_{x}=\max_{col}(G_{box}),\quad p_{y}=\max_{row}(G_{box}), \qquad (4)
\tilde{p_{x}}=\max_{col}(m_{i}),\quad \tilde{p_{y}}=\max_{row}(m_{i}),

where p_{x} and p_{y} denote the projections of the box label G_{box} onto the x-axis and y-axis, respectively, and \tilde{p_{x}} and \tilde{p_{y}} are the projections of the predicted result m_{i} onto the x and y axes.

The first part L_{ali-proj} of the alignment loss between the predicted mask and the box label is then computed as follows:

L_{ali-proj}=\frac{2\times|\tilde{p_{x}}\cap p_{x}|}{|\tilde{p_{x}}|+|p_{x}|}+\frac{2\times|\tilde{p_{y}}\cap p_{y}|}{|\tilde{p_{y}}|+|p_{y}|}. \qquad (5)

This projection loss provides region localization constraints with high feasibility and efficiency.

To prevent extreme predictions on the coordinate projections, we introduce a topological continuity loss as the second component of the alignment loss. Given the high-confidence foreground label G_{f} and the prediction result m_{i}, the target regions in the high-confidence foreground are calculated as:

m_{i}^{\prime}=m_{i}\cdot G_{f}, \qquad (6)

where m_{i}^{\prime} represents the predicted areas within the high-confidence foreground regions. The topological continuity loss L_{ali-topo} is defined as:

L_{ali-topo}=-\left[m_{i}^{\prime}\log(G_{f})+(1-m_{i}^{\prime})\log(1-G_{f})\right]. \qquad (7)

This loss ensures topological continuity within the predicted foreground regions covered by the high-confidence foreground. Despite variations in the shapes of thyroid nodules, pixels within these high-confidence regions achieve a classification accuracy of over 99.66%. This loss is applied exclusively to the foreground regions to prevent local optima in the coordinate-axis-based optimization.

By combining the loss L_{ali-proj} and the loss L_{ali-topo}, we obtain the final alignment loss L_{align} as follows:

L_{align}=L_{ali-proj}+L_{ali-topo}. \qquad (8)

This loss allows weak supervision labels to constrain segmentation localization effectively while avoiding over-fitting to a single shape.
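The sketch below illustrates how the alignment loss could be implemented in PyTorch. It is a minimal sketch rather than the released code: we assume the projection term is turned into a loss as 1 minus the Dice agreement of the projections (so that perfect agreement yields zero loss), and the topological term is implemented as a standard binary cross-entropy with G_f as the target.

```python
import torch
import torch.nn.functional as F


def alignment_loss(pred, g_box, g_f, eps=1e-6):
    """Sketch of the alignment loss (Eqs. 4-8).

    pred:  (B, 1, H, W) sigmoid probabilities m_i.
    g_box: (B, 1, H, W) binary (float) box label G_box.
    g_f:   (B, 1, H, W) binary (float) high-confidence foreground label G_f.
    """
    # Projection onto the x-axis (max along each column) and y-axis (max along each row).
    px, py = g_box.amax(dim=2), g_box.amax(dim=3)
    px_hat, py_hat = pred.amax(dim=2), pred.amax(dim=3)

    def proj_dice(p, q):
        inter = (p * q).sum(dim=-1)
        return (2 * inter + eps) / (p.sum(dim=-1) + q.sum(dim=-1) + eps)

    # Assumed form: 1 - average Dice of the two projections (Eq. 5).
    l_proj = 1 - 0.5 * (proj_dice(px_hat, px) + proj_dice(py_hat, py)).mean()

    # Topological continuity: cross-entropy restricted to G_f, using
    # m_i' = m_i * G_f as the prediction and G_f as the target (Eqs. 6-7).
    m_prime = pred * g_f
    l_topo = F.binary_cross_entropy(m_prime.clamp(eps, 1 - eps), g_f)

    return l_proj + l_topo  # Eq. 8
```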

3.4.2 Contrastive Loss for Region-level Shape Feature Learning

Due to the absence of ground truth masks in weak supervision, as outlined in Section 3.2, we propose to learn the segmentation shape by refining the feature representations of the foreground and background regions to achieve Objective 2 in (2).

The proposed contrastive loss aims to learn region-level discriminative feature representations from the high-confidence foreground areas G_{f} and background areas G_{b}. The generic contrastive loss with a sampling scale of k\times k, denoted as L_{con}^{k}, is formally defined as:

L_{con}^{k}=\frac{1}{n}\sum_{q_{k}^{+}\in Q_{p}}-\log\left(\frac{e^{q_{k}\cdot q_{k}^{+}/\tau}}{e^{q_{k}\cdot q_{k}^{+}/\tau}+\sum_{q_{k}^{-}\in Q_{n}}e^{q_{k}\cdot q_{k}^{-}/\tau}}\right), \qquad (9)

where Q_{p} and Q_{n} represent the sets of positive and negative feature embedding queues, respectively. q_{k}^{+} denotes a foreground feature sample of size k\times k, while q_{k}^{-} represents the corresponding background feature samples. The anchor feature queue q_{k} is selected from the high-confidence foreground regions G_{f}. The temperature parameter \tau>0 controls the slope of the loss function and its smoothness. Empirically, we evaluate the contrastive loss at scale sizes k=3 and k=1, with \tau=0.07.

The contrastive loss is designed to minimize the distance between intra-class features (i.e., either both foregrounds or both backgrounds) in the embedding space and maximize the distance between inter-class features (i.e., foreground and background). Through this learning mechanism, the obtained feature representations effectively sharpen the classification boundaries between foreground and background regions. This enhancement directly improves the network’s ability to distinguish between these regions, ultimately refining its predicted prototypes for both categories to align closely with reference standards.
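As a concrete illustration, the following is a minimal PyTorch sketch of Eq. (9) for one sampling scale, assuming the k×k patch features have already been sampled from G_f and G_b and flattened into vectors; the queue construction and sampling strategy are omitted.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(anchors, positives, negatives, tau=0.07):
    """Sketch of the region-level contrastive loss (Eq. 9).

    anchors:   (N, C) anchor features sampled from the high-confidence foreground G_f.
    positives: (N, C) foreground samples paired with each anchor (queue Q_p).
    negatives: (M, C) background samples drawn from G_b (queue Q_n).
    """
    anchors = F.normalize(anchors, dim=1)
    positives = F.normalize(positives, dim=1)
    negatives = F.normalize(negatives, dim=1)

    pos_logit = (anchors * positives).sum(dim=1, keepdim=True) / tau  # (N, 1)
    neg_logits = anchors @ negatives.t() / tau                        # (N, M)

    # -log( exp(pos) / (exp(pos) + sum_j exp(neg_j)) ), averaged over anchors.
    logits = torch.cat([pos_logit, neg_logits], dim=1)
    return -(pos_logit.squeeze(1) - torch.logsumexp(logits, dim=1)).mean()
```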

3.4.3 Prototype Correlation Loss for Edge-level Shape learning

To learn more precise segmentation shapes, building on the segmentation objective defined in (3), we extend the shape constraint to the edge level by introducing a prototype correlation loss, as illustrated in Fig. 2. This loss comprises two components: 1) complementary consistency between feature correlations referring to the high-confidence foreground prototype and the background prototype, and 2) consistency between the network segmentation results and the fused correlation segmentation results.

Figure 2: Diagram of the prototype correlation loss computation.

Given the high-confidence foreground and background labels G_{f} and G_{b}, we first extract the corresponding region features P_{f} and P_{b}. Global pooling is then applied to reduce the dimensionality of these features to C\times 1\times 1, where C denotes the number of channels, to obtain the prototype features. Using metric learning, we evaluate the correlation response of each position in the feature map F with respect to the foreground and background prototypes, obtaining the foreground correlation response C_{f} and the background correlation response C_{b} as follows:

C_{f}=\max\left(0,\frac{F^{T}\cdot P_{f}}{\|P_{f}\|_{2}+\epsilon}\right), \qquad (10)
C_{b}=\max\left(0,\frac{F^{T}\cdot P_{b}}{\|P_{b}\|_{2}+\epsilon}\right).

Since the foreground and background are mutually exclusive in segmentation tasks, a complementary foreground correlation response \hat{C_{f}} is derived from the background correlation response C_{b} as follows:

\hat{C_{f}}=1-C_{b}. \qquad (11)

The correlation map represents the similarity between image features and each prototype. The first component L_{cor-fe} of the prototype correlation loss reflects the complementary consistency between the feature correlations referring to the high-confidence foreground and background prototypes, which is defined as follows:

L_{cor-fe}=-\frac{1}{2}\left[C_{f}\log(\hat{C_{f}})+(1-C_{f})\log(1-\hat{C_{f}})\right]-\frac{1}{2}\left[\hat{C_{f}}\log(C_{f})+(1-\hat{C_{f}})\log(1-C_{f})\right]. \qquad (12)

The fused prediction \hat{C_{i}}, which should exhibit the same region distribution as that of the segmentation branch, is obtained by balancing the foreground correlation map C_{f} derived from the foreground prototype and the complementary map \hat{C_{f}} derived from the background prototype. Therefore, the second component L_{cor-seg} of the correlation loss, which measures the consistency between the fused correlation map and the segmentation prediction, is defined as:

L_{cor-seg}=\frac{2\times|m_{i}\cap\hat{C_{i}}|}{|m_{i}|+|\hat{C_{i}}|}. \qquad (13)

The total prototype correlation loss L_{corr} is then calculated as follows:

L_{corr}=L_{cor-fe}+L_{cor-seg}. \qquad (14)

By considering the complementary consistency of the foreground and background prototype correlations, the algorithm obtains segmentation edges with low uncertainty. By directly propagating the fused segmentation edges to the predicted results, we achieve explicit shape learning, effectively addressing the challenge of refining the edges of the nodule region.
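The following PyTorch sketch illustrates one way Eqs. (10)-(14) could be realized. It is a sketch under stated assumptions rather than the released implementation: the correlation in Eq. (10) is approximated by a ReLU-clipped cosine similarity, the fused map \hat{C_{i}} is assumed to be the average of C_{f} and 1 - C_{b}, the Dice term of Eq. (13) is used as 1 minus the Dice so that agreement lowers the loss, and the targets of the symmetric cross-entropy are detached because PyTorch's binary cross-entropy does not differentiate through its target.

```python
import torch
import torch.nn.functional as F


def prototype_correlation_loss(feat, pred, g_f, g_b, eps=1e-6):
    """Sketch of the prototype correlation loss (Eqs. 10-14).

    feat: (B, C, H, W) deep features F; pred: (B, 1, H, W) segmentation m_i.
    g_f, g_b: (B, 1, H, W) high-confidence foreground/background labels,
    assumed resized to the feature resolution.
    """
    # Prototypes via masked global average pooling -> (B, C, 1, 1).
    p_f = (feat * g_f).sum(dim=(2, 3), keepdim=True) / (g_f.sum(dim=(2, 3), keepdim=True) + eps)
    p_b = (feat * g_b).sum(dim=(2, 3), keepdim=True) / (g_b.sum(dim=(2, 3), keepdim=True) + eps)

    # Correlation responses (Eq. 10), approximated by ReLU-clipped cosine similarity.
    c_f = F.relu(F.cosine_similarity(feat, p_f.expand_as(feat), dim=1)).unsqueeze(1)
    c_b = F.relu(F.cosine_similarity(feat, p_b.expand_as(feat), dim=1)).unsqueeze(1)
    c_f_hat = 1 - c_b  # Eq. 11

    # Symmetric cross-entropy between the two foreground maps (Eq. 12).
    cf, cfh = c_f.clamp(eps, 1 - eps), c_f_hat.clamp(eps, 1 - eps)
    l_fe = 0.5 * (F.binary_cross_entropy(cfh, cf.detach()) +
                  F.binary_cross_entropy(cf, cfh.detach()))

    # Consistency between the fused map and the prediction (Eq. 13, as 1 - Dice).
    c_i = 0.5 * (c_f + c_f_hat)
    inter = (pred * c_i).sum()
    l_seg = 1 - (2 * inter + eps) / (pred.sum() + c_i.sum() + eps)

    return l_fe + l_seg  # Eq. 14
```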

3.4.4 Overall Loss Function

By combining the losses of learning location information and shape information mentioned above, the overall loss function during the training process can be expressed as:

L_{all}=L_{align}+\lambda L_{con}+\beta L_{corr}, \qquad (15)

where L_{all} indicates the overall loss, L_{align} represents the alignment loss, L_{con} represents the contrastive loss, L_{corr} is the prototype correlation loss, and \lambda and \beta are weighting parameters.
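Put together, the overall objective can be sketched as below, reusing the loss sketches from the previous subsections; the default weights follow our reading of the ablation study in Sec. 4.3 (\lambda = 0.8, and \beta = 1.0 on TN3K or 0.5 on DDTI).

```python
def total_loss(pred, feat, g_box, g_f, g_b, anchors, positives, negatives,
               lam=0.8, beta=1.0):
    """Sketch of the overall objective (Eq. 15), combining the three loss
    sketches defined above."""
    l_align = alignment_loss(pred, g_box, g_f)
    l_con = contrastive_loss(anchors, positives, negatives)
    l_corr = prototype_correlation_loss(feat, pred, g_f, g_b)
    return l_align + lam * l_con + beta * l_corr
```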

4 Experiments

4.1 Experimental Materials

4.1.1 Dataset

To evaluate the effectiveness of the proposed segmentation framework, we conduct ablation and comparative experiments on two publicly available thyroid ultrasound datasets: TN3K[29] and DDTI[30]. The TN3K dataset consists of 3,494 high-resolution thyroid nodule images, following a standardized clinical split protocol with 2,879 images designated for training and 614 for testing. For the DDTI dataset, which contains 637 thyroid ultrasound scans, the data is randomly split into training and testing sets at a 4:1 ratio. Additionally, 10-fold cross-validation was applied to both datasets to address the challenge of limited data while ensuring statistical reliability.

Notably, in the TN3K and DDTI datasets, the ground truth annotations of thyroid nodules in ultrasound images for metric calculation were performed by experienced radiologists, while point annotations were carried out by less-experienced senior medical students to simulate a real-world weakly supervised learning scenario.

4.1.2 Evaluation Metrics

We performed a quantitative comparison using four common segmentation evaluation metrics: Mean Intersection over Union (mIoU) [53], Dice Similarity Coefficient (DSC) [53, 54, 55], Hausdorff Distance (HD) [55], and Prediction Precision (Pr) [53, 54], which evaluate the segmentation’s overall accuracy, precision of segmented regions, boundary matching, and the reliability of predicted positives, respectively.
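For clarity, a minimal sketch of how these metrics can be computed for a single binary prediction/ground-truth pair is given below; mIoU is then the mean of the per-image IoU values. Pixel spacing is assumed isotropic and both masks are assumed non-empty.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff


def segmentation_metrics(pred, gt, spacing_mm=1.0, eps=1e-6):
    """Sketch of IoU, DSC, precision, and Hausdorff distance for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / (union + eps)
    dsc = 2 * inter / (pred.sum() + gt.sum() + eps)
    precision = inter / (pred.sum() + eps)

    # Symmetric Hausdorff distance between the two point sets, scaled to mm.
    p_pts, g_pts = np.argwhere(pred), np.argwhere(gt)
    hd = max(directed_hausdorff(p_pts, g_pts)[0],
             directed_hausdorff(g_pts, p_pts)[0]) * spacing_mm
    return iou, dsc, precision, hd
```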

4.1.3 Parameter Setting and Implementation

For training the network, we set a learning rate of 0.0001, a batch size of 16, and trained for 100 epochs with images resized to 256×256256\times 256. The network was optimized using the Adam optimizer. For comparative experiments, we adhered to the parameter configurations outlined in their respective papers. All experiments were carried out using PyTorch on an Nvidia 3090 GPU equipped with 24GB of memory.
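The training configuration above can be summarized by the following sketch; ThyroidDataset and SegNet are hypothetical placeholders for the dataset wrapper and the segmentation backbone (assumed to return both the prediction and the deep feature map), and feature sampling for the contrastive term is omitted for brevity.

```python
import torch
from torch.utils.data import DataLoader

# Hypothetical placeholders; not part of the released code.
loader = DataLoader(ThyroidDataset(split="train", size=(256, 256)),
                    batch_size=16, shuffle=True)
model = SegNet().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(100):
    for images, g_box, g_f, g_b in loader:
        pred, feat = model(images.cuda())
        # Contrastive term omitted here; see the sketch in Sec. 3.4.2.
        loss = (alignment_loss(pred, g_box.cuda(), g_f.cuda())
                + prototype_correlation_loss(feat, pred, g_f.cuda(), g_b.cuda()))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```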

4.2 Comparisons with State-of-the-art

4.2.1 Comparative Results on TN3K Dataset

Table 2: Quantitative comparison results of different methods on the TN3K and DDTI datasets
Method (backbone) | TN3K mIoU(%)↑ | TN3K HD(mm)↓ | TN3K DSC(%)↑ | TN3K Pr(%)↑ | DDTI mIoU(%)↑ | DDTI HD(mm)↓ | DDTI DSC(%)↑ | DDTI Pr(%)↑
U-Net [8] | 64.76 ± 25.01 | 5.83 ± 2.20 | 75.13 ± 23.32 | 77.66 ± 25.89 | 54.24 ± 22.48 | 7.05 ± 1.82 | 66.80 ± 18.35 | 68.75 ± 19.84
Dense-UNet [9] | 66.61 ± 25.74 | 5.26 ± 2.06 | 76.27 ± 25.19 | 77.95 ± 26.47 | 52.07 ± 23.38 | 7.09 ± 1.87 | 64.99 ± 23.00 | 65.37 ± 28.39
CENet [12] | 74.07 ± 21.11 | 4.70 ± 1.84 | 82.95 ± 18.80 | 83.52 ± 19.48 | 66.63 ± 21.72 | 5.88 ± 1.76 | 77.46 ± 19.50 | 77.44 ± 23.63
SCRF [14] | 56.68 ± 21.85 | 6.58 ± 1.89 | 69.19 ± 23.00 | 63.47 ± 22.81 | 50.29 ± 20.38 | 7.86 ± 1.62 | 64.22 ± 20.19 | 57.27 ± 24.04
UNCRF [15] | 56.27 ± 23.89 | 6.51 ± 1.97 | 68.21 ± 25.25 | 66.29 ± 24.58 | 49.55 ± 20.67 | 7.64 ± 1.58 | 63.42 ± 20.84 | 61.52 ± 25.6
BoxInst [16] | 65.42 ± 22.54 | 5.47 ± 2.10 | 76.22 ± 21.47 | 76.99 ± 22.68 | 44.62 ± 17.03 | 7.50 ± 1.39 | 59.67 ± 17.44 | 64.52 ± 29.32
WSDAC [20] | 57.22 ± 22.45 | 5.41 ± 1.85 | 70.58 ± 17.33 | 76.81 ± 28.02 | 52.05 ± 17.46 | 6.62 ± 1.59 | 66.41 ± 18.05 | 70.93 ± 26.04
S2ME [22] | 59.47 ± 23.37 | 5.34 ± 2.10 | 70.97 ± 23.41 | 75.42 ± 25.73 | 51.60 ± 23.65 | 7.01 ± 1.52 | 64.54 ± 23.66 | 68.97 ± 28.17
IDMPS [21] | 62.76 ± 26.64 | 5.21 ± 1.99 | 73.47 ± 26.73 | 80.40 ± 25.45 | 57.03 ± 23.11 | 6.55 ± 1.84 | 69.43 ± 22.02 | 69.44 ± 26.89
HCL-HRL (U-Net) | 69.30 ± 22.38 | 5.01 ± 2.01 | 79.10 ± 21.37 | 80.64 ± 22.69 | 59.82 ± 21.98 | 6.41 ± 1.68 | 72.14 ± 20.76 | 72.96 ± 24.60
HCL-HRL (Dense-UNet) | 68.77 ± 23.68 | 5.12 ± 2.10 | 78.32 ± 22.77 | 79.04 ± 23.23 | 59.00 ± 21.07 | 6.56 ± 1.69 | 71.64 ± 19.77 | 72.50 ± 25.21
HCL-HRL (CENet) | 73.32 ± 20.12 | 4.76 ± 1.89 | 82.56 ± 17.97 | 85.31 ± 18.80 | 69.18 ± 20.34 | 5.81 ± 1.83 | 79.66 ± 17.89 | 77.73 ± 22.36

The segmentation performance of our proposed method was compared against the state-of-the-art weakly supervised methods SCRF [14], UNCRF [15], BoxInst [16], WSDAC [20], S2ME [22], and IDMPS [21], and the fully supervised methods U-Net [8], Dense-UNet [9], and CENet [12]. The quantitative results on TN3K and DDTI are shown in Table 2. The qualitative results on TN3K and DDTI are shown in Fig. 3 and Fig. 4, respectively.

Figure 3: Qualitative comparison results on the TN3K dataset. Red indicates correct thyroid predictions, green represents missing thyroid segmentation, and blue shows an overestimation of other organs as the thyroid.

As shown in Fig. 3, SCRF and UNCRF tended to produce box-like segmentation with significant over-segmentation. WSDAC and S2ME frequently under-segmented thyroid nodules in images with multiple nodules or irregular nodule shapes. While BoxInst and IDMPS outperformed other existing methods in high-contrast scenarios, such as the example in the third row, they still struggled with under-segmentation of incomplete nodules located at image boundaries. In contrast, our proposed method achieved more consistent segmentation regions and delicate segmentation edges, while exhibiting even less over- and under-segmentation than fully supervised networks using the same backbone. Quantitative results in Table 2 further support that our framework using the U-Net backbone outperformed state-of-the-art weakly supervised methods. Specifically, it achieved an average mIoU of 69.30%, a Hausdorff distance (HD) of 5.01 mm, a DSC of 79.10%, and a precision (Pr) of 80.64%. These results were even better than those of the fully supervised U-Net, which achieved 64.76% mIoU, 5.83 mm HD, 75.13% DSC, and 77.66% Pr.

4.2.2 Comparative results on DDTI dataset

Figure 4: Qualitative comparison results on the DDTI dataset. Red indicates correct thyroid predictions, green represents missing thyroid segmentation, and blue shows an overestimation of other organs as the thyroid.

As illustrated in Fig. 4, fully supervised networks exhibited substantial over-segmentation in images where the foreground and background tissues are similar. Weakly supervised algorithms, such as SCRF, UNCRF, and S2ME, struggled with severe over-segmentation of small targets, as well as in images with similar foreground and background tissues. Both BoxInst and WSDAC encountered significant challenges with over- and under-segmentation when processing such images. Notably, BoxInst delivered superior segmentation results with minimal over-segmentation in simpler background scenarios, while WSDAC tended to exhibit more under-segmentation in these cases. IDMPS outperformed other weakly supervised algorithms by reducing over-segmentation in small nodules; however, it still faced substantial issues with under-segmentation and excessive over-segmentation in more complex cases. In contrast, our algorithm not only reduced over-segmentation but also demonstrated enhanced shape adaptation, surpassing fully supervised methods. Additionally, by incorporating a more efficient feature extraction backbone such as CENet, we achieved better segmentation precision, with accurate shape delineation and fine edge fitting. The quantitative comparison presented in Table 2 further substantiates these qualitative observations.

4.3 Ablation Analysis

We conducted an ablation study to analyze the effectiveness of individual components in our framework. Table 3 provides an overview of different design strategies (Models A to E). The quantitative assessment of each constraint condition is detailed in Table 4, with feature maps inferred and visualized across two datasets as shown in Fig. 5.

Table 3: Design Strategies of the Ablation Models
Model | single-level learning | L_align | L_con | L_corr
A | ✓ | | |
B | | ✓ | |
C | | ✓ | ✓ |
D | | ✓ | | ✓
E | | ✓ | ✓ | ✓
Figure 5: Qualitative results of ablation experiments. Top 3 rows: validation on TN3K; bottom 3 rows: validation on DDTI. Red: correct thyroid predictions, green: under-fitting thyroid segmentation, blue: over-fitting of thyroid.
Table 4: Quantitative Ablation Results of Different Design Strategies on the TN3K and DDTI Datasets
Model | TN3K mIoU(%)↑ | TN3K HD(mm)↓ | TN3K DSC(%)↑ | TN3K Pr(%)↑ | DDTI mIoU(%)↑ | DDTI HD(mm)↓ | DDTI DSC(%)↑ | DDTI Pr(%)↑
A | 58.97 ± 22.61 | 5.61 ± 2.03 | 70.13 ± 23.50 | 74.48 ± 23.26 | 50.43 ± 20.41 | 7.26 ± 1.54 | 63.65 ± 20.33 | 69.46 ± 27.28
B | 60.78 ± 23.47 | 5.79 ± 2.14 | 72.16 ± 22.57 | 77.37 ± 23.76 | 51.97 ± 22.12 | 7.42 ± 1.64 | 65.58 ± 21.64 | 70.82 ± 27.10
C | 67.49 ± 22.49 | 5.23 ± 2.04 | 77.67 ± 21.58 | 78.49 ± 22.38 | 58.25 ± 22.25 | 6.66 ± 1.90 | 70.68 ± 20.15 | 72.35 ± 24.84
D | 67.61 ± 24.72 | 5.08 ± 2.03 | 77.23 ± 24.08 | 78.63 ± 24.34 | 58.84 ± 22.46 | 6.59 ± 1.82 | 71.05 ± 20.74 | 71.60 ± 26.08
E | 69.30 ± 22.38 | 5.01 ± 2.01 | 79.10 ± 21.37 | 80.64 ± 22.69 | 59.82 ± 21.98 | 6.41 ± 1.68 | 72.14 ± 20.76 | 72.96 ± 24.60

4.3.1 Effectiveness of Alignment Loss for Spatial Learning

The effectiveness of the alignment loss for spatial learning is demonstrated through quantitative segmentation results on two datasets, summarized in Table 4. Compared to Model A, Model B showed nearly 1.5% improvements in mIoU and DSC across both datasets, indicating that decoupling weakly supervised tasks and incorporating positional constraints improves segmentation accuracy and region consistency. While precision also increased significantly, the Hausdorff Distance (HD) slightly increased by 0.18 for TN3K and 0.16 for DDTI, because the alignment loss focused primarily on region localization and required complementary shape loss for shape refinement.

Fig. 5 shows that Model B reduced under-segmentation and over-segmentation compared to Model A, achieving more accurate location capture and shape fitting. These observations further illustrate the network’s improved ability to accurately capture target locations, without misleading the segmentation shapes, after the inclusion of the alignment loss.

4.3.2 Effectiveness of Contrastive Loss for Semantic Feature Learning

Figure 6: Performance comparison of Models C and D on the TN3K and DDTI datasets. The lines represent the performance of each model at different weight parameters (λ\lambda for contrastive loss and β\beta for prototype correlation loss).

Model C combines the alignment loss with the contrastive loss, where the weight of the contrastive loss is controlled by the parameter \lambda. Experimental results demonstrate that Model C achieved its best overall performance when \lambda was set to 0.8, as shown in Fig. 6.

Table 4 shows significant improvements of Model C over Model B across various metrics. Specifically, Model C achieved improvements in mIoU and DSC of more than 5.5% on TN3K and over 5% on DDTI. Similarly, there is a notable increase in precision for both datasets. Additionally, the Hausdorff distance decreased by 0.56 mm on TN3K and 0.76 mm on DDTI. These improvements indicate that the contrastive loss contributes to more accurate and refined segmentation. The segmentation feature heatmaps shown in Fig. 5 visually support the quantitative findings.

Furthermore, by integrating contrastive loss with other losses, Model E also showed more precise segmentation regions than Model D without contrastive loss, reducing over-segmentation and under-segmentation. The inclusion of contrastive loss leads to predicted regions that closely align with the ground truth, demonstrating its effectiveness in enhancing feature learning of segmentation shapes.

4.3.3 Effectiveness of Prototype Correlation Loss for Semantic Shape Learning

Model D combines the alignment loss with the prototype correlation loss, where the weight of the prototype correlation loss is controlled by the parameter \beta. Experimental results illustrate that Model D achieved its best overall performance when \beta was set to 1.0 on the TN3K dataset and 0.5 on the DDTI dataset, as shown in Fig. 6.

To validate the benefits of adding the prototype correlation loss as a shape constraint, we compared Model D with Model B, which only uses the alignment loss. The experimental results show that Model D achieved nearly 7% improvement in mIoU on the TN3K dataset and 6% improvement on the DDTI dataset compared to Model B, and the Hausdorff distance was significantly reduced, with a decrease of 0.71 on the TN3K dataset and 0.67 on the DDTI dataset. These results are supported by the qualitative results shown in Fig. 5: Model D generated more accurate segmentation shapes, with lower-uncertainty boundaries in the feature heatmaps.

When comparing Model E and Model C, the integration of the prototype correlation loss provided superior segmentation regions and edges. This resulted in clear boundary delineation and segmentation outcomes closely aligned with the ground truth annotations, as shown in Fig. 5. Although the effects of the combined losses were not linear, this approach led to a 0.98% to 1.69% improvement in mIoU, positioning Model E as the best-performing model among all variants. These observations confirm that the prototype correlation loss effectively constrains shape information, enhancing segmentation accuracy.

5 Discussion

5.1 Comparison with Weakly Supervised Segmentation (WSS) Methods

Weakly supervised segmentation (WSS) methods have attracted increasing attention due to their ability to utilize sparse annotations for generating segmentation results, thereby reducing the need for fully annotated masks. However, these methods are often limited by label noise stemming from low-confidence pseudo-labels and by rigid learning strategies that extract insufficient discriminative features for diverse and complex nodule variations. SCRF and UNCRF generated box-like predictions because they directly utilized box labels to learn segmentation results, which introduced significant inaccuracies in shape representation and misguided the training process. Similarly, S2ME and IDMPS performed well on the DDTI dataset, where shape variations are minimal. However, their performance decreased on the TN3K dataset, which contains diverse and irregular nodule shapes. This decline in effectiveness was due to their reliance on fixed geometric pseudo-labels for shape learning, which failed to capture the discriminative features needed for accurately segmenting nodules with complex and variable shapes. BoxInst exhibited reasonable performance on the larger TN3K dataset but struggled with the smaller DDTI dataset due to its heavy dependence on color similarity for learning segmentation shapes. This approach was effective for high-contrast images but required a larger volume of training data. WSDAC consistently produced under-segmentation results because it heavily relied on initial contours and image gradients, which proved to be less effective for images with blurred boundaries.

In contrast, our proposed method effectively addresses these challenges through two key innovations: (1) the generation of high-confidence labels to mitigate label noise and improve training stability, and (2) the introduction of high-rationality loss functions designed to capture location-level, region-level, and edge-level features for segmentation location and shape learning. These advancements are integrated into our proposed framework, which shows results comparable to, or even surpassing, those of fully supervised networks, demonstrating its robustness and effectiveness in addressing the unique challenges posed by thyroid ultrasound image segmentation.

5.2 Limitations and Future work

Our method showed convincing segmentation results on the thyroid nodule datasets but still faces the following challenges:

First, although our algorithm achieved a moderate inference speed (0.0065 seconds per image), it relied on MedSAM to generate high-confidence labels for training, which increased computational complexity and training time. Future work will focus on developing more efficient, lightweight models that incorporate anatomical prior information to generate high-confidence labels.

Second, the feature extraction backbone in our framework can be seamlessly swapped as needed. In this work, we employed general-purpose feature extraction backbones; tailoring these backbones to specialized tasks could further enhance segmentation accuracy. Future research built on our framework can focus on addressing feature extraction challenges related to low contrast and speckle noise.

6 Conclusion

In this paper, we present a novel weakly supervised segmentation framework for thyroid nodules based on clinical point annotations. We clarify the segmentation objective that integrates location and shape elements to guide the learning process. Our method combines geometric transformations carrying topological priors with MedSAM predictions carrying anatomical information to generate high-confidence labels. Furthermore, we propose a multi-level learning strategy through high-rationality losses: the alignment loss for precise location learning, and the contrastive and prototype correlation losses for robust shape understanding. Experimental results demonstrate superior performance compared to state-of-the-art weakly supervised methods on the benchmark datasets TN3K and DDTI. The framework is highly versatile and can be seamlessly integrated with various feature extraction architectures, offering flexibility for diverse application scenarios.

References

  • [1] J. Chen, H. You, and K. Li, “A review of thyroid gland segmentation and thyroid nodule segmentation methods for medical ultrasound images,” Computer methods and programs in biomedicine, vol. 185, p. 105329, 2020.
  • [2] N. G. Inan, O. Kocadağlı, D. Yıldırım, İ. Meşe, and Ö. Kovan, “Multi-class classification of thyroid nodules from automatic segmented ultrasound images: Hybrid resnet based unet convolutional neural network approach,” Computer Methods and Programs in Biomedicine, vol. 243, p. 107921, 2024.
  • [3] H. G. Khor, G. Ning, X. Zhang, and H. Liao, “Ultrasound speckle reduction using wavelet-based generative adversarial network,” IEEE Journal of Biomedical and Health Informatics, vol. 26, no. 7, pp. 3080–3091, 2022.
  • [4] L. Chen, W. Zheng, and W. Hu, “Mtn-net: a multi-task network for detection and segmentation of thyroid nodules in ultrasound images,” in International Conference on Knowledge Science, Engineering and Management, pp. 219–232, Springer, 2022.
  • [5] J. Chi, Z. Li, Z. Sun, X. Yu, and H. Wang, “Hybrid transformer unet for thyroid segmentation from ultrasound scans,” Computers in Biology and Medicine, vol. 153, p. 106453, 2023.
  • [6] A. Ozcan, Ö. Tosun, E. Donmez, and M. Sanwal, “Enhanced-transunet for ultrasound segmentation of thyroid nodules,” Biomedical Signal Processing and Control, vol. 95, p. 106472, 2024.
  • [7] Z. Xiang, X. Tian, Y. Liu, M. Chen, C. Zhao, L.-N. Tang, E.-S. Xue, Q. Zhou, B. Shen, F. Li, et al., “Federated learning via multi-attention guided unet for thyroid nodule segmentation of ultrasound images,” Neural Networks, vol. 181, p. 106754, 2025.
  • [8] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pp. 234–241, Springer, 2015.
  • [9] S. Cai, Y. Tian, H. Lui, H. Zeng, Y. Wu, and G. Chen, “Dense-unet: a novel multiphoton in vivo cellular image segmentation model based on a convolutional neural network,” Quantitative imaging in medicine and surgery, vol. 10, no. 6, p. 1275, 2020.
  • [10] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr, et al., “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6881–6890, 2021.
  • [11] H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, and M. Wang, “Swin-unet: Unet-like pure transformer for medical image segmentation,” arXiv preprint arXiv:2105.05537, 2021.
  • [12] H. Tao, C. Xie, J. Wang, and Z. Xin, “Cenet: A channel-enhanced spatiotemporal network with sufficient supervision information for recognizing industrial smoke emissions,” IEEE internet of things journal, vol. 9, no. 19, pp. 18749–18759, 2022.
  • [13] Y. Liu, L. Lin, K. K. Wong, and X. Tang, “Procns: Progressive prototype calibration and noise suppression for weakly-supervised medical image segmentation,” IEEE Journal of Biomedical and Health Informatics, 2024.
  • [14] N. Zhang, S. Francis, R. A. Malik, and X. Chen, “A spatially constrained deep convolutional neural network for nerve fiber segmentation in corneal confocal microscopic images using inaccurate annotations,” in 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pp. 456–460, IEEE, 2020.
  • [15] G. K. Mahani, R. Li, N. Evangelou, S. Sotiropolous, P. S. Morgan, A. P. French, and X. Chen, “Bounding box based weakly supervised deep convolutional neural network for medical image segmentation using an uncertainty guided and spatially constrained loss,” in 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI), pp. 1–5, IEEE, 2022.
  • [16] Z. Tian, C. Shen, X. Wang, and H. Chen, “Boxinst: High-performance instance segmentation with box annotations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5443–5452, 2021.
  • [17] Q. Chen, Y. Chen, Y. Huang, X. Xie, and L. Yang, “Region-based online selective examination for weakly supervised semantic segmentation,” Information Fusion, vol. 107, p. 102311, 2024.
  • [18] X. Zhao, Z. Li, X. Luo, P. Li, P. Huang, J. Zhu, Y. Liu, J. Zhu, M. Yang, S. Chang, et al., “Ultrasound nodule segmentation using asymmetric learning with simple clinical annotation,” IEEE Transactions on Circuits and Systems for Video Technology, 2024.
  • [19] T. Zhao and Z. Yin, “Weakly supervised cell segmentation by point annotation,” IEEE Transactions on Medical Imaging, vol. 40, no. 10, pp. 2736–2747, 2020.
  • [20] Z. Li, S. Zhou, C. Chang, Y. Wang, and Y. Guo, “A weakly supervised deep active contour model for nodule segmentation in thyroid ultrasound images,” Pattern Recognition Letters, vol. 165, pp. 128–137, 2023.
  • [21] X. Zhao, Z. Li, X. Luo, P. Li, P. Huang, J. Zhu, Y. Liu, J. Zhu, M. Yang, S. Chang, et al., “Ultrasound nodule segmentation using asymmetric learning with simple clinical annotation,” IEEE Transactions on Circuits and Systems for Video Technology, 2024.
  • [22] A. Wang, M. Xu, Y. Zhang, M. Islam, and H. Ren, “S 2 me: Spatial-spectral mutual teaching and ensemble learning for scribble-supervised polyp segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 35–45, Springer, 2023.
  • [23] M. Han, X. Luo, X. Xie, W. Liao, S. Zhang, T. Song, G. Wang, and S. Zhang, “Dmsps: Dynamically mixed soft pseudo-label supervision for scribble-supervised medical image segmentation,” Medical Image Analysis, vol. 97, p. 103274, 2024.
  • [24] Z. Li, Y. Zheng, D. Shan, S. Yang, Q. Li, B. Wang, Y. Zhang, Q. Hong, and D. Shen, “Scribformer: Transformer makes cnn work better for scribble-based medical image segmentation,” IEEE Transactions on Medical Imaging, 2024.
  • [25] W. Lei, Q. Su, T. Jiang, R. Gu, N. Wang, X. Liu, G. Wang, X. Zhang, and S. Zhang, “One-shot weakly-supervised segmentation in 3d medical images,” IEEE Transactions on Medical Imaging, 2023.
  • [26] Z. Fan, R. Jiang, J. Wu, X. Huang, T. Wang, H. Huang, and M. Xu, “Enhancing weakly supervised 3d medical image segmentation through probabilistic-aware learning,” arXiv preprint arXiv:2403.02566, 2024.
  • [27] H. Du, Q. Dong, Y. Xu, and J. Liao, “Weakly-supervised 3d medical image segmentation using geometric prior and contrastive similarity,” IEEE Transactions on Medical Imaging, 2023.
  • [28] S. Zhai, G. Wang, X. Luo, Q. Yue, K. Li, and S. Zhang, “Pa-seg: Learning from point annotations for 3d medical image segmentation using contextual regularization and cross knowledge distillation,” IEEE transactions on medical imaging, vol. 42, no. 8, pp. 2235–2246, 2023.
  • [29] H. Gong, G. Chen, R. Wang, X. Xie, M. Mao, Y. Yu, F. Chen, and G. Li, “Multi-task learning for thyroid nodule segmentation with thyroid region prior,” in 2021 IEEE 18th international symposium on biomedical imaging (ISBI), pp. 257–261, IEEE, 2021.
  • [30] L. Pedraza, C. Vargas, F. Narváez, O. Durán, E. Muñoz, and E. Romero, “An open access thyroid ultrasound image database,” in 10th International symposium on medical information processing and analysis, vol. 9287, pp. 188–193, SPIE, 2015.
  • [31] M. E. Rayed, S. S. Islam, S. I. Niha, J. R. Jim, M. M. Kabir, and M. Mridha, “Deep learning for medical image segmentation: State-of-the-art advancements and challenges,” Informatics in Medicine Unlocked, p. 101504, 2024.
  • [32] R. Obuchowicz, M. Strzelecki, and A. Piórkowski, “Clinical applications of artificial intelligence in medical imaging and image processing—a review,” 2024.
  • [33] D. Das, M. S. Iyengar, M. S. Majdi, J. J. Rodriguez, and M. Alsayed, “Deep learning for thyroid nodule examination: a technical review,” Artificial Intelligence Review, vol. 57, no. 3, p. 47, 2024.
  • [34] Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: A nested u-net architecture for medical image segmentation,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4, pp. 3–11, Springer, 2018.
  • [35] F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, and K. H. Maier-Hein, “nnu-net: a self-configuring method for deep learning-based biomedical image segmentation,” Nature methods, vol. 18, no. 2, pp. 203–211, 2021.
  • [36] Y.-T. Zhou, T.-Y. Yang, X.-H. Han, and J.-C. Piao, “Thyroid-detr: Thyroid nodule detection model with transformer in ultrasound images,” Biomedical Signal Processing and Control, vol. 98, p. 106762, 2024.
  • [37] H. Bi, C. Cai, J. Sun, Y. Jiang, G. Lu, H. Shu, and X. Ni, “Bpat-unet: Boundary preserving assembled transformer unet for ultrasound thyroid nodule segmentation,” Computer methods and programs in biomedicine, vol. 238, p. 107614, 2023.
  • [38] “Transunet: Rethinking the u-net architecture design for medical image segmentation through the lens of transformers,” Medical Image Analysis, vol. 97, p. 103280, 2024.
  • [39] C. Chen, J. Miao, D. Wu, A. Zhong, Z. Yan, S. Kim, J. Hu, Z. Liu, L. Sun, X. Li, et al., “Ma-sam: Modality-agnostic sam adaptation for 3d medical image segmentation,” Medical Image Analysis, vol. 98, p. 103310, 2024.
  • [40] Z. Wang, J.-Q. Zheng, Y. Zhang, G. Cui, and L. Li, “Mamba-unet: Unet-like pure visual mamba for medical image segmentation,” arXiv preprint arXiv:2402.05079, 2024.
  • [41] Z. Xing, T. Ye, Y. Yang, G. Liu, and L. Zhu, “Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 578–588, Springer, 2024.
  • [42] J. Ruan, J. Li, and S. Xiang, “Vm-unet: Vision mamba unet for medical image segmentation,” arXiv preprint arXiv:2402.02491, 2024.
  • [43] J. Liu, H. Yang, H.-Y. Zhou, L. Yu, Y. Liang, Y. Yu, S. Zhang, H. Zheng, and S. Wang, “Swin-umamba†: Adapting mamba-based vision foundation models for medical image segmentation,” IEEE Transactions on Medical Imaging, 2024.
  • [44] G. Chen, G. Tan, M. Duan, B. Pu, H. Luo, S. Li, and K. Li, “Mlmseg: a multi-view learning model for ultrasound thyroid nodule segmentation,” Computers in Biology and Medicine, vol. 169, p. 107898, 2024.
  • [45] J. Wu, R. Fu, H. Fang, Y. Zhang, Y. Yang, H. Xiong, H. Liu, and Y. Xu, “Medsegdiff: Medical image segmentation with diffusion probabilistic model,” in Medical Imaging with Deep Learning, pp. 1623–1639, PMLR, 2024.
  • [46] C. Li, R. Du, Q. Luo, R. Wang, and X. Ding, “A novel model of thyroid nodule segmentation for ultrasound images,” Ultrasound in Medicine & Biology, vol. 49, no. 2, pp. 489–496, 2023.
  • [47] H. Gong, J. Chen, G. Chen, H. Li, G. Li, and F. Chen, “Thyroid region prior guided attention for ultrasound segmentation of thyroid nodules,” Computers in biology and medicine, vol. 155, p. 106389, 2023.
  • [48] X. Ma, B. Sun, W. Liu, D. Sui, S. Shan, J. Chen, and Z. Tian, “Tnseg: adversarial networks with multi-scale joint loss for thyroid nodule segmentation,” The Journal of Supercomputing, vol. 80, no. 5, pp. 6093–6118, 2024.
  • [49] J. Guo and Y. Ge, “Temporal contrastive and spatial enhancement coarse grained network for weakly supervised group activity recognition,” Engineering Applications of Artificial Intelligence, vol. 133, p. 108115, 2024.
  • [50] L. Lin, Y. Liu, J. Wu, P. Cheng, Z. Cai, K. K. Wong, and X. Tang, “Fedlppa: learning personalized prompt and aggregation for federated weakly-supervised medical image segmentation,” IEEE Transactions on Medical Imaging, 2024.
  • [51] H. R. Roth, D. Yang, Z. Xu, X. Wang, and D. Xu, “Going to extremes: weakly supervised medical image segmentation,” Machine Learning and Knowledge Extraction, vol. 3, no. 2, pp. 507–524, 2021.
  • [52] Z. Wang and I. Voiculescu, “Weakly supervised medical image segmentation through dense combinations of dense pseudo-labels,” in MICCAI Workshop on Data Engineering in Medical Imaging, pp. 1–10, Springer, 2023.
  • [53] Y. Ling, Y. Wang, W. Dai, J. Yu, P. Liang, and D. Kong, “Mtanet: Multi-task attention network for automatic medical image segmentation and classification,” IEEE Transactions on Medical Imaging, 2023.
  • [54] R. Wang, T. Lei, R. Cui, B. Zhang, H. Meng, and A. K. Nandi, “Medical image segmentation using deep learning: A survey,” IET image processing, vol. 16, no. 5, pp. 1243–1267, 2022.
  • [55] E. Zhu, H. Feng, L. Chen, Y. Lai, and S. Chai, “Mp-net: A multi-center privacy-preserving network for medical image segmentation,” IEEE Transactions on Medical Imaging, 2024.