

MiM: Mask in Mask Self-Supervised Pre-Training for 3D Medical Image Analysis

Jiaxin Zhuang, Linshan Wu, Qiong Wang, Peng Fei, Varut Vardhanabhuti, Lin Luo, and Hao Chen, Senior Member, IEEE. This work was supported by Hong Kong Innovation and Technology Fund (Project No. ITS/028/21FP, ITCPD/17-9 and MHP/002/22), Shenzhen Science and Technology Innovation Committee Fund (Project No. SGDX20210823103201011), and the Project of Hetao Shenzhen-Hong Kong Science and Technology Innovation Cooperation Zone (HZQB-KCZYB-2020083).
Jiaxin Zhuang and Linshan Wu are with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong (email: {jzhuangad,linshan.wu}@cse.ust.hk).
Qiong Wang is with Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China. (email: wangqiong@siat.ac.cn).
Peng Fei is with the School of Optical and Electronic Information-Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan, 430074, China (email: feipeng@hust.edu.cn).
Varut Vardhanabhuti is with the Department of Diagnostic Radiology, The University of Hong Kong, Hong Kong SAR. (email: varv@hku.hk).
Lin Luo is with the College of Engineering, Peking University, Beijing, China. (email:luol@pku.edu.cn).
Hao Chen is with the Department of Computer Science and Engineering, Department of Chemical and Biological Engineering, and State Key Laboratory of Molecular Neuroscience, Hong Kong University of Science and Technology, Hong Kong, and HKUST Shenzhen-Hong Kong Collaborative Innovation Research Institute, Futian, Shenzhen, China (corresponding author: jhc@cse.ust.hk).
Abstract

The Vision Transformer (ViT) has demonstrated remarkable performance in Self-Supervised Learning (SSL) for 3D medical image analysis, and Masked AutoEncoder (MAE) pre-training can further unleash its potential on various medical vision tasks. However, due to the large spatial size and high dimensionality of 3D medical images, the lack of hierarchical design in MAE may hinder the performance of downstream tasks. In this paper, we propose a novel Mask in Mask (MiM) pre-training framework for 3D medical images, which advances MAE by learning discriminative representations from hierarchical visual tokens across varying scales. We introduce multiple levels of granularity for masked inputs from the volume, which are reconstructed simultaneously at both fine and coarse levels. Additionally, a cross-level alignment mechanism is applied to adjacent-level volumes to enforce anatomical similarity hierarchically. Furthermore, we adopt a hybrid backbone to enhance hierarchical representation learning efficiently during pre-training. MiM was pre-trained on a large-scale collection of available 3D volumetric images, i.e., Computed Tomography (CT) scans covering various body parts. Extensive experiments on twelve public datasets demonstrate the superiority of MiM over other SSL methods in organ/tumor segmentation and disease classification. We further scale up MiM to a large pre-training dataset with more than 10k volumes, showing that large-scale pre-training can further enhance the performance of downstream tasks. Code and checkpoints will be made available upon acceptance.

Index Terms

CT, Self-Supervised Learning, Segmentation, Classification, 3D medical images.

1 Introduction

The advent of deep learning has catalyzed unprecedented advances in medical image analysis, predominantly within the supervised learning paradigm. However, this paradigm’s fundamental limitation lies in its reliance on extensively labeled training data, particularly challenging for 3D medical images where annotation demands substantial domain expertise, time, and resources [1, 2, 3, 4, 5, 6]. Self-Supervised Learning (SSL) has emerged as a transformative solution, offering a powerful mechanism to learn robust, transferable representations from unlabeled data that significantly enhance downstream task performance [7, 8, 9, 10, 11, 12, 13, 14, 15, 16]. This paradigm shift has enabled unprecedented advances in medical image analysis while substantially reducing the annotation burden [2, 4].

In the realm of SSL, Masked Autoencoders (MAE) [17] have demonstrated remarkable success in natural image analysis through their innovative approach to reconstructing distorted views via Vision Transformer architectures [17, 18]. However, extending these methods to 3D medical imaging introduces unique challenges due to the inherent complexity of volumetric data, characterized by high dimensionality and intricate anatomical structures spanning multiple scales. Recent advances have made significant strides in addressing these challenges. MAE3D [19] pioneered the adaptation to 3D medical imaging by reconstructing non-overlapping patches from cropped sub-volumes, establishing a foundation for understanding complex anatomical structures. GL-MAE [20] further advanced this direction by introducing local-global contrastive learning to enhance multi-granular anatomical structure modeling, while SwinUNETR [9] incorporated multi-scale feature learning through hybrid transformers [21]. Although these approaches demonstrated promising results, their reliance on single-scale cropped volumes inherently constrains their ability to capture comprehensive anatomical relationships. Recent efforts have explored alternative strategies to address these limitations: SwinMM [22] leveraged multi-view consistency to enhance volume-wise representations, while Alice [7] utilized pre-trained models [23] to capture rich intra-volume relationships. Despite these advances, three critical challenges persist: the spatial context limitations imposed by volume cropping, which hinder the understanding of complete anatomical structures; the lack of explicit modeling of hierarchical relationships across different anatomical scales; and the substantial computational demands of processing full-resolution 3D volumes. These interconnected challenges underscore the need for a more sophisticated approach that can efficiently capture both fine-grained anatomical details and global contextual information while maintaining practical computational requirements.

Figure 1: Different SSL paradigms for 3D medical image analysis. Current Masked Image Modeling methods for 3D medical images primarily (a) rely on pretext tasks, e.g., inpainting, at a single level, utilizing hybrid transformers that process all tokens, or (b) employ an MAE that reconstructs at a single level using only unmasked tokens. In contrast, (c) we observe that 3D medical images inherently exhibit hierarchical properties. Thus, our Mask in Mask (MiM) framework encodes multi-level 3D medical image information across hierarchical visual tokens at various scales through multi-level reconstruction and cross-level alignment (we set the number of levels $L$ to 3 in this figure). Additionally, our framework employs a hybrid transformer while only using unmasked tokens.

The intrinsic hierarchical nature of medical images, particularly those with expansive spatial dimensions, necessitates a multi-level analytical framework for comprehensive clinical interpretation [24, 9]. As shown in Fig. 1, we introduce MiM, a novel hierarchical framework that fundamentally advances MAE-based representation learning for 3D medical images. Our framework systematically addresses the key limitations of existing approaches through three synergistic components. First, to overcome the limited context of cropped volumes, we propose a multi-level volume generation strategy that processes a larger view of the 3D volume at multiple scales simultaneously, enabling our model to capture both fine-grained anatomical details and their broader contextual relationships. Second, to explicitly model hierarchical representations, we design a multi-level reconstruction mechanism that operates across different anatomical scales. This mechanism preserves critical anatomical details at varying granularities while enforcing consistency through a cross-level alignment strategy, ensuring coherent interpretation between local structures and their global context. Third, to address the computational challenges of processing larger views of 3D medical images, we incorporate an efficient hybrid backbone design inspired by MCMAE [25]. This architecture significantly reduces computational overhead while maintaining the advantages of transformer-based models, making it practical to analyze high-resolution 3D medical images. These innovations work together synergistically, enabling MiM to effectively model the complete anatomical hierarchy while maintaining computational efficiency.

The principal contributions of this work are threefold:

  1.

    We present MiM, a computationally efficient SSL framework that advances MAE through hierarchical design for 3D medical image pre-training. Our approach effectively manages the complexity of 3D medical data while enabling the simultaneous capture of anatomical features across multiple scales, crucial for accurate medical image analysis.

  2.

    We introduce a comprehensive methodology for encoding multi-level visual information through two synergistic proxy tasks: multi-level reconstruction and cross-level alignment. This design enables robust local and global representation learning while maintaining anatomical consistency across scales through our novel cross-level alignment mechanism.

  3.

    Through extensive experimental validation across twelve diverse datasets, utilizing pre-training sets ranging from 1k to 10k volumes, we demonstrate state-of-the-art performance and establish a clear correlation between pre-training dataset scale and model effectiveness. Our comprehensive evaluation shows significant improvements across various medical imaging tasks, with our multi-level approach consistently outperforming single-scale alternatives.

2 Related works

Recent advances in SSL for medical image analysis have evolved through different paradigms, with dense prediction tasks such as segmentation being crucial yet challenging for 3D medical images. In this section, we first review SSL methods in medical imaging, focusing on the progression from contrastive-based to generative-based approaches and their effectiveness in such dense prediction tasks. This evolution leads to our discussion of Masked Image Modeling, a powerful generative SSL technique showing promise in 3D medical image analysis. However, current Masked Image Modeling approaches struggle to capture the complex hierarchical nature of 3D medical data, motivating our examination of hierarchical SSL designs. Through this review, we demonstrate how existing methods fall short in effectively modeling the multi-granularity of features in 3D medical images, thereby establishing the context for our proposed hierarchical MiM framework.
SSL for Medical Image Analysis. SSL methods in medical imaging have evolved through two primary paradigms: contrastive-based and generative-based approaches [26, 27, 28]. Contrastive-based methods focus on aligning representations of positive pairs while separating negative pairs in feature space [29]. While foundational works like SimCLR [29], MoCov2 [30], and MoCov3 [31] demonstrated success in medical classification tasks [32, 33], their effectiveness in medical imaging has been further enhanced through domain-specific innovations. For instance,  [34] introduced sub-volume based contrastive learning, while  [35, 23] advanced multi-granular understanding by contrasting features at both local and global scales. However, these approaches’ focus on instance-level alignment has limited their effectiveness in dense prediction tasks like segmentation [32]. Generative-based approaches address these limitations by explicitly modeling spatial structures and preserving local-global consistency [36]. The field has progressed from basic reconstruction tasks using 3D UNet [37, 38] to sophisticated approaches incorporating modern architectures like 3D Swin Transformers with inpainting [9, 34]. This evolution has culminated in Masked Image Modeling, demonstrating robust performance across various 3D medical imaging applications [17, 39, 19, 20, 40].

Medical Image Analysis with Masked Image Modeling. Masked Image Modeling has emerged as a powerful technique for medical image analysis, building upon the success of generative SSL. Initially proposed by [17], this approach demonstrated that high-ratio masking creates an effective self-supervisory task through raw pixel restoration. The field has advanced through innovations in masking strategies [41] and reconstruction targets [18]. While early applications showed promise in 2D medical tasks [42, 43], including disease classification [44] and segmentation [43], the extension to 3D medical imaging introduced new challenges. MAE3D [19] pioneered the adaptation of masked autoencoding to volumetric data, while GL-MAE [20] enhanced anatomical structure understanding by incorporating global-local consistency through contrastive learning. SwinUNETR [9] further advanced the field by recognizing the importance of multi-scale representation learning and incorporating hybrid transformers [21]. However, these methods [19, 20, 9, 22], despite their innovations, process 3D medical images at a single scale with limited receptive fields or rely solely on backbone-level multi-scale processing. As illustrated in Fig. 1, this approach becomes problematic when handling the significantly larger spatial dimensions and varying anatomical structures present in 3D medical images.

Hierarchical SSL. The importance of hierarchical structures in SSL has gained recognition in both the natural and medical image domains. In natural image processing, approaches like [18] and [39] have integrated Masked Image Modeling with hierarchical backbones such as the Swin Transformer [45]. [46] explored multi-scale feature reconstruction, while [25] advanced masked patch prediction using unmasked contexts. In medical imaging, Adam [47] introduced multi-granular contrastive learning with ResNet [48]. However, these approaches either suffer from the limited dimensionality of 2D images [47, 25] or treat hierarchical levels independently [47], failing to capture the cross-level semantic relationships crucial for 3D medical images.
Distinguished from these approaches, our MiM framework advances the state-of-the-art in 3D medical image analysis through four key innovations: (1) We advance beyond MAE3D [19] by designing a powerful multi-level volume generation strategy that enables simultaneous reconstruction of features at both fine and coarse scales, addressing the limitations of existing single-scale methods with limited receptive fields [19, 9, 22, 20]. (2) Inspired by [49, 20], we develop a cross-level alignment mechanism specifically for 3D medical volumes that enforces anatomical consistency across hierarchical levels, achieving significant improvements in anatomical feature learning. (3) Our 3D hybrid backbone architecture, adapted from 2D natural images [25], achieves superior efficiency in capturing multi-scale features during pre-training while reducing computational demands. (4) MiM demonstrates strong scalability by successfully pre-training on datasets exceeding 10,000 volumes, surpassing the scale of existing 3D SSL medical imaging methods [38, 32, 19, 20, 9, 8]. The effectiveness of these innovations is validated through extensive experiments across twelve public datasets on organ/tumor segmentation and disease classification tasks, establishing MiM as a substantial advancement in the field.

3 Methodology

This section provides an overview of our proposed MiM method. Firstly, the overall framework of the MiM method is introduced in Section 3.1. Secondly, the process of multi-level reconstruction is presented in Section 3.2. Then, the cross-level alignment via contrastive learning in our proposed MiM method is described in Section 3.3. Finally, the backbone of the MiM method in the pre-training period is introduced in Section 3.4.

Figure 2: The overall view of our MiM pre-training framework. The number of levels $L$ is set to 3 for better illustration. We first conduct multi-level masked volume generation. The multi-level reconstruction module is responsible for reconstructing the masked volumes at different levels, while the cross-level alignment module aligns representations of volumes from adjacent levels, aiming to enforce anatomical similarity hierarchically.

3.1 Overall framework

The overall framework of the proposed MiM method is presented in Fig. 2; it comprises multi-level reconstruction modules and cross-level alignment modules. To pre-train the model with MiM, we first generate multi-level volumes from the input 3D medical images. Each input volume is divided into non-overlapping patches, which are split into unmasked patches and masked patches. The unmasked patches are mapped into a high-dimensional feature space by a typical backbone (CNN [48] and transformer [50]). The masked patches are used to generate the next level of masked volumes for multi-level learning. The goal is to restore the masked patches of the masked volumes at the different levels. Instead of using a single level of masked volume [41, 46], we propose to ease this goal with multiple levels of masked volumes. We develop a reconstruction loss $\mathcal{L}_{\mathcal{R}}$ to supervise the final predictions. In addition, we use a cross-level alignment loss $\mathcal{L}_{\mathcal{C}}$ to align the shared semantics between cross-level volumes, aiming to learn global semantic content as well as local details. Further details are presented in Section 3.2 and Section 3.3.

3.2 Multi-level reconstruction

Generation of multi-level masked volumes. Given a volumetric image $x\in\mathbb{R}^{C\times H\times W\times D}$ (e.g., $C=1$ for CT), we aim to generate multi-level volumes $\{x^{l}\in\mathbb{R}^{C\times H^{l}\times W^{l}\times D^{l}}, l\in L\}$, which contain $L$ levels, i.e., different granularities of information from the coarse to the fine level. The generation of multi-level volumes is illustrated in Fig. 3. We start by cropping a large sub-volume from $x$ as the Level-1 volume $x^{1}\in\mathbb{R}^{C\times H^{1}\times W^{1}\times D^{1}}$. The Level-1 volume $x^{1}$ is then patchified into $N$ non-overlapping visual tokens with a patch size of $\{\frac{H^{1}}{6},\frac{W^{1}}{6},\frac{D^{1}}{6}\}$, as in [19, 20]. As in previous MAE-based methods [20], we apply a high mask ratio $\mu$ (e.g., 60%) to the $N$ non-overlapping tokens (e.g., 216), resulting in unmasked tokens and masked tokens. Each unmasked token is spatially resized to $\mathbb{R}^{C\times h\times w\times d}$ before being further processed by a linear projection, as in previous MAE-based methods [19]. The masked tokens of the Level-1 volume $x^{1}$ are treated as the next-level volumes, i.e., the Level-2 volumes $x^{2}\in\mathbb{R}^{C\times H^{2}\times W^{2}\times D^{2}}$. Note that the Level-2 volumes $x^{2}$ are generated from the masked patches of the Level-1 volume $x^{1}$, rather than the unmasked patches. The reconstruction targets at different levels therefore share content: the Level-1 and Level-2 volumes reconstruct overlapping regions, but at different granularities. This effectively captures the hierarchical structure of 3D medical images and improves representation learning [46]. An ablation study of the reconstruction targets, i.e., the choice of masked or unmasked patches as the next-level volumes, is presented in Section 4.4.1. Since there are many masked tokens, we randomly sample $\gamma$ of them without replacement as next-level volumes for computational efficiency. This generation process is repeated until the last-level volume $x^{L}$ is generated.
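To make the generation procedure concrete, the following is a minimal sketch under simplifying assumptions (PyTorch, $L=3$, a 6×6×6 patch grid, and a fixed 96³ resize at every level); all function names and sizes are illustrative and do not reproduce the exact crop/resize schedule of Table 2 or the authors' released code.

```python
# A minimal sketch of multi-level masked volume generation (assumptions: PyTorch,
# L = 3, a 6x6x6 patch grid, a fixed 96^3 resize at each level).
import torch
import torch.nn.functional as F


def patchify(vol, grid=6):
    """Split a (C, H, W, D) volume into grid**3 non-overlapping patches."""
    C, H, W, D = vol.shape
    ph, pw, pd = H // grid, W // grid, D // grid
    patches = (vol.unfold(1, ph, ph).unfold(2, pw, pw).unfold(3, pd, pd)
                  .permute(1, 2, 3, 0, 4, 5, 6)
                  .reshape(grid ** 3, C, ph, pw, pd))
    return patches                              # (N, C, ph, pw, pd), N = 216 for grid = 6


def mask_split(patches, mask_ratio=0.6):
    """Randomly split the N patches into unmasked and masked subsets."""
    N = patches.shape[0]
    perm = torch.randperm(N)
    n_mask = int(mask_ratio * N)
    return patches[perm[n_mask:]], patches[perm[:n_mask]]   # unmasked, masked


def resize(vol, size=(96, 96, 96)):
    """Resize a (C, H, W, D) patch so it can be patchified again at the next level."""
    return F.interpolate(vol.unsqueeze(0), size=size, mode="trilinear",
                         align_corners=False).squeeze(0)


def generate_levels(x1, num_levels=3, mask_ratio=0.6, gamma=4):
    """Masked patches of level l become candidate level-(l+1) volumes;
    gamma of them are sampled without replacement (cf. Fig. 3)."""
    levels = [{"volumes": [x1]}]
    for _ in range(num_levels - 1):
        next_volumes = []
        for vol in levels[-1]["volumes"]:
            unmasked, masked = mask_split(patchify(vol), mask_ratio)
            levels[-1].setdefault("unmasked", []).append(unmasked)   # fed to the encoder
            pick = torch.randperm(masked.shape[0])[:gamma]
            next_volumes += [resize(p) for p in masked[pick]]
        levels.append({"volumes": next_volumes})
    return levels


if __name__ == "__main__":
    x1 = torch.randn(1, 384, 384, 192)              # a Level-1 crop, C = 1 for CT
    hierarchy = generate_levels(x1)
    print(len(hierarchy), len(hierarchy[1]["volumes"]))   # 3 levels, 4 Level-2 volumes
```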

Figure 3: Illustration of the multi-level masked volume generation. Slices drawn from the 3D medical images are shown only for better illustration. The Level-$(l{+}1)$ volume $x^{l+1}$ is randomly sampled from the masked patches of the Level-$l$ volume $x^{l}$ (i.e., the patches with red boxes).

Thus, all unmasked tokens from the different levels are fed into the backbone to extract high-dimensional features $z$. Following [17, 19], a lightweight decoder is used to project $z$, along with learnable masked tokens, into latent features $q$. For reconstruction, the decoder followed by a simple prediction head is used to reconstruct the masked tokens $y$.

Single-level reconstruction. Our reconstruction target is the pixel values of the masked tokens at each level. With features extracted by the backbone and the decoder with a prediction head (i.e., a linear layer), following previous MAE methods [19], we reshape the prediction results and reconstruction targets into one-dimensional vectors, i.e., $\hat{y}\in\mathbb{R}^{1\times C}$ and $y\in\mathbb{R}^{1\times C}$, where $C$ is the number of dimensions. We empirically set $C$ to 768, as in [19, 20].

Then we compute the difference $d$ between the reconstruction target $y$ and the prediction result $\hat{y}$. Specifically, we use the MSE distance [19] to measure the difference $d_{m}$ for each masked token $m$ as follows:

d_{m}=\|y_{m}-\hat{y}_{m}\|_{2},\quad m\in M,\ M=\mu N, \qquad (1)

where $M$ denotes the number of masked tokens. To minimize the difference $d_{m}$, we define the reconstruction loss $\mathcal{L}_{\mathcal{R}}^{l}$ for each level $l\in L$ of masked volumetric images as follows:

\mathcal{L}_{\mathcal{R}}^{l}=\frac{1}{|M|}\sum_{m=1}^{|M|}d_{m}^{l},\quad l\in L, \qquad (2)

where $|M|$ represents the number of masked tokens.

Loss function for multi-level reconstruction. To learn multi-granularity details, we apply single-level reconstruction to each level of masked volumetric images. Thus, the multi-level reconstruction loss $\mathcal{L}_{\mathcal{R}}$ can be formulated as follows:

\mathcal{L}_{\mathcal{R}}=\frac{1}{L}\sum_{l=1}^{L}\mathcal{L}_{\mathcal{R}}^{l} \qquad (3)
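A minimal sketch of Eqs. (1)-(3) is given below, assuming the per-level predictions and masked-token targets have already been flattened to (M, C) tensors; the function names are illustrative, not the authors' implementation.

```python
# A minimal sketch of the multi-level reconstruction loss, Eqs. (1)-(3).
import torch


def single_level_recon_loss(pred, target):
    """Eqs. (1)-(2): d_m = ||y_m - y_hat_m||_2, averaged over the masked tokens."""
    d = torch.linalg.vector_norm(pred - target, dim=-1)   # (M,)
    return d.mean()


def multi_level_recon_loss(preds, targets):
    """Eq. (3): average the single-level losses over the L levels."""
    losses = [single_level_recon_loss(p, t) for p, t in zip(preds, targets)]
    return torch.stack(losses).mean()


if __name__ == "__main__":
    L, M, C = 3, 129, 768                                 # e.g. 60% of 216 tokens, C = 768
    preds = [torch.randn(M, C) for _ in range(L)]
    targets = [torch.randn(M, C) for _ in range(L)]
    print(multi_level_recon_loss(preds, targets))
```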

3.3 Cross-level alignment

The alignment between the shared semantic patches of cross-level volumes, from fine to coarse, enforces anatomical similarity in a hierarchical manner. Fig. 4 illustrates the process of cross-level alignment. Since we generate the finer-level volumes from the coarser-level volumes (e.g., Level-2 volumes are generated from the masked patches of the Level-1 volume), these volumes necessarily share semantic context and can be regarded as positive pairs. In contrast, the non-overlapping patches in the coarser volumes (e.g., the remaining patches in the Level-1 volume) are considered negative patches. To increase the high-dimensional feature consistency between patches sharing anatomical structure (i.e., positive pairs) and the discrepancy between non-overlapping patches (i.e., negative pairs), we apply contrastive learning [29] to the context vector $c$ and the patches $p$. Specifically, we reshape the context vector and the patches of its coarser volume into one-dimensional vectors $c\in\mathbb{R}^{1\times D}$ and $\{p_{i}\in\mathbb{R}^{1\times D}, i\in N\}$, where $D$ is the number of dimensions; we empirically set $D$ to 2048, as in [49].

Figure 4: Illustration of the cross-level alignment module. We set the number of levels $L$ to 2 for better illustration.

First, given the context vector $c_{i}$ and the patches $p$, we compute the cosine similarity $s_{ij}$ between the context vector $c_{i}$ and each patch $p_{j}$ as follows:

s_{ij}=\mathrm{cosSim}(c_{i},p_{j})=\frac{c_{i}\cdot p_{j}}{\|c_{i}\|\,\|p_{j}\|},\quad i\in M,\ j\in N. \qquad (4)

We aim to maximize the cosine similarity between the context vector $c_{i}$ and the positive patches, and to minimize the cosine similarity between the context vector $c_{i}$ and the negative patches $p_{j}$ drawn from the remaining patches of the volume. Thus, we apply a contrastive loss [29] to achieve this goal as follows,

\ell_{ij}=-\log\frac{\exp(s_{ij}/\tau)}{\sum_{k=1}^{N}\mathbbm{1}_{k\neq i}\exp(s_{ik}/\tau)},\quad i\in M,\ j\in N,\ M=\mu N. \qquad (5)

Thus, the cross-level alignment loss $\mathcal{L}_{\mathcal{C}}^{l,l+1}$ between adjacent levels $l$ and $l+1$ is computed as follows:

\mathcal{L}_{\mathcal{C}}^{l,l+1}=\frac{1}{M\cdot N}\sum_{i=1}^{M}\sum_{j=1}^{N}\ell_{ij},\quad l\in L-1,\ M=\mu N. \qquad (6)

The cross-level alignment loss is computed as the sum of the negative log-likelihoods of the cosine similarities between the context vector and the remaining patches of the coarser volumes.

The overall cross-level alignment loss $\mathcal{L}_{\mathcal{C}}$ is defined as the average of the cross-level alignment losses between adjacent-level volumes, as shown in Eq. 7,

\mathcal{L}_{\mathcal{C}}=\frac{1}{L-1}\sum_{l=1}^{L-1}\mathcal{L}_{\mathcal{C}}^{l,l+1} \qquad (7)

By minimizing the overall cross-level alignment loss, we can enforce anatomical similarity hierarchically.
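A minimal sketch of the cross-level alignment objective (Eqs. 4-7) is given below. It assumes that, for each adjacent-level pair, one context vector per finer-level volume and the patch embeddings of its coarser volume are available, together with the index of the positive (source) patch; casting the InfoNCE term as a cross-entropy over similarities is an implementation choice of this sketch, not necessarily the authors' code.

```python
# A minimal sketch of the cross-level alignment loss, Eqs. (4)-(7).
import torch
import torch.nn.functional as F


def cross_level_alignment_loss(context, patches, pos_idx, tau=0.1):
    """One term L_C^{l,l+1}: pull each context vector toward its source patch
    (positive) and push it away from the other coarser-level patches (negatives)."""
    context = F.normalize(context, dim=-1)        # unit vectors -> dot product = cosine similarity
    patches = F.normalize(patches, dim=-1)
    logits = context @ patches.t() / tau          # (M, N) similarities scaled by temperature
    return F.cross_entropy(logits, pos_idx)       # InfoNCE over the coarser-level patches


def total_cross_level_loss(pairs):
    """Eq. (7): average over the L-1 adjacent-level pairs."""
    losses = [cross_level_alignment_loss(c, p, idx) for c, p, idx in pairs]
    return torch.stack(losses).mean()


if __name__ == "__main__":
    M, N, D = 4, 216, 2048
    pairs = [(torch.randn(M, D), torch.randn(N, D), torch.randint(0, N, (M,)))
             for _ in range(2)]                   # (Level-1, Level-2) and (Level-2, Level-3)
    print(total_cross_level_loss(pairs))
```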

3.4 Backbone

While previous hybrid transformers like the Swin Transformer [21] generate pyramid features, they process all tokens during Masked Image Modeling pre-training, leading to computational inefficiency [25]. We address this limitation by extending MCMAE's backbone [25] to 3D medical images through an additional depth dimension, allowing our model to process only unmasked tokens in the transformer layers. This optimization significantly improves computational efficiency and scalability, as validated in Section 4.4.1. As illustrated in Fig. 5, our FPN architecture adopts MCMAE's [25] hierarchical design, processing features at four scales (from H/2×W/2 to H/16×W/16) with channel dimensions from C to 4C. Each scale utilizes StrideConv [25] downsampling and is then patchified and masked to generate unmasked patches. The bottom-up pathway integrates features through lateral connections and summations, producing comprehensive multi-scale representations while maintaining computational efficiency.
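To make the masked-token efficiency concrete, below is a minimal sketch of the idea (not the MCMAE backbone itself): a 3D convolutional stem produces downsampled features, and only the unmasked tokens of the resulting token map are passed through the transformer layers. All channel sizes, depths, and the masking interface are illustrative assumptions.

```python
# A minimal sketch of a hybrid conv-transformer encoder that applies the
# transformer layers only to unmasked tokens (simplified; not MCMAE itself).
import torch
import torch.nn as nn


class HybridMaskedEncoder(nn.Module):
    def __init__(self, in_ch=1, base_ch=48, embed_dim=384, depth=4, heads=6):
        super().__init__()
        self.stem = nn.Sequential(                       # H -> H/2 -> H/4 -> H/8
            nn.Conv3d(in_ch, base_ch, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv3d(base_ch, 2 * base_ch, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv3d(2 * base_ch, 4 * base_ch, 3, stride=2, padding=1), nn.GELU(),
        )
        self.proj = nn.Conv3d(4 * base_ch, embed_dim, kernel_size=2, stride=2)  # H/16 token map
        layer = nn.TransformerEncoderLayer(embed_dim, heads,
                                           dim_feedforward=4 * embed_dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x, keep_idx):
        """x: (B, C, H, W, D); keep_idx: (B, K) indices of unmasked tokens."""
        tokens = self.proj(self.stem(x)).flatten(2).transpose(1, 2)   # (B, N, embed_dim)
        keep = keep_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        visible = torch.gather(tokens, 1, keep)                        # drop masked tokens
        return self.blocks(visible)                                    # transformer sees only unmasked tokens


if __name__ == "__main__":
    enc = HybridMaskedEncoder()
    x = torch.randn(2, 1, 96, 96, 96)
    keep = torch.stack([torch.randperm(6 ** 3)[:87] for _ in range(2)])  # ~40% of tokens kept
    print(enc(x, keep).shape)   # torch.Size([2, 87, 384])
```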

Refer to caption
Figure 5: Architecture of the FPN adapted from MCMAE [25], illustrating multi-scale feature extraction and fusion through hierarchical feature maps.

Overall objective function. Our MiM introduces a hierarchically designed approach to 3D medical image representation learning through multi-level reconstruction $\mathcal{L}_{\mathcal{R}}$ and cross-level alignment $\mathcal{L}_{\mathcal{C}}$. We empirically set the number of levels of masked volumes to $L=3$, since three levels provide a good balance between representation learning and computational efficiency. An ablation study of $L$ is presented in Section 4.4.1. Specifically, the multi-level reconstruction loss in Eq. 3 can be further expanded as follows:

\mathcal{L}_{\mathcal{R}}=\mathcal{L}_{\mathcal{R}}^{1}+\mathcal{L}_{\mathcal{R}}^{2}+\mathcal{L}_{\mathcal{R}}^{3}. \qquad (8)

Then, since the cross-level alignment loss is applied between adjacent-level volumes, Eq. 7 expands as follows:

\mathcal{L}_{\mathcal{C}}=\mathcal{L}_{\mathcal{C}}^{1,2}+\mathcal{L}_{\mathcal{C}}^{2,3}. \qquad (9)

Thus, the total loss function $\mathcal{L}$ is the combination of these two losses, as shown in Eq. 10,

\mathcal{L}=\mathcal{L}_{\mathcal{R}}+\alpha\mathcal{L}_{\mathcal{C}}, \qquad (10)

where the hyper-parameter $\alpha$ balances the relative contributions of the two losses. We empirically set $\alpha$ to 0.1 based on our experimental results. The ablation study of the hyper-parameter $\alpha$ is presented in Section 4.4.2.
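For completeness, a minimal sketch of how the two objectives are combined (Eqs. 8-10) is given below; the per-level loss values are assumed to come from sketches like those above, and $\alpha=0.1$ follows the setting stated in the text.

```python
# A minimal sketch of the overall objective, Eqs. (8)-(10).
import torch


def mim_objective(recon_losses, align_losses, alpha=0.1):
    loss_r = torch.stack(recon_losses).sum()      # Eq. (8): L_R^1 + L_R^2 + L_R^3
    loss_c = torch.stack(align_losses).sum()      # Eq. (9): L_C^{1,2} + L_C^{2,3}
    return loss_r + alpha * loss_c                # Eq. (10)


if __name__ == "__main__":
    recon = [torch.tensor(0.8), torch.tensor(0.6), torch.tensor(0.5)]
    align = [torch.tensor(2.1), torch.tensor(1.9)]
    print(mim_objective(recon, align))            # tensor(2.3000)
```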

4 Experiments

This section will commence by introducing the datasets and evaluation metrics. Then, we will elaborate on the implementation details of our MiM. Lastly, we will present the experimental results of MiM in comparison to existing methods, along with an analysis of our proposed approach.

4.1 Datasets and Evaluation

Pre-training datasets. To conduct a fair comparison with previous works [19, 9, 8, 22, 51, 38, 32, 52, 20], we also carried out pre-training experiments on two public datasets, i.e., BTCV [53] and TCIA Covid19 [54], and combined them to form a new dataset named 1k. Additionally, to explore the scaling ability of our proposed method compared to previous state-of-the-art methods [19, 9], we collected eight publicly accessible 3D medical image datasets consisting of 10,502 CT scans to establish our pre-training datasets, which we named 10k. It is important to note that 10k is only used for exploring the scaling ability of our proposed method, while we mainly focus on the 1k dataset for fair comparison and analysis of our proposed method. Table 1 provides a summary of the sources of each collected dataset.

Table 1: The details of each dataset in our pre-training datasets. A check (✓) indicates inclusion in the 1k and/or 10k pre-training set.
Datasets  Region of Interest  1k  10k  #Samples
BTCV [53]  Abdomen  ✓  ✓  24
TCIA Covid19 [54]  Chest  ✓  ✓  722
LUNA16 [55]  Chest  -  ✓  843
STOIC 2021 [56]  Chest  -  ✓  2,000
FLARE23 [15]  Abdomen  -  ✓  4,000
LiDC [57]  Chest  -  ✓  589
HNSCC [58]  Head/Neck  -  ✓  1,071
TotalSegmentator [59]  Head/Neck/Chest/Leg/Abdomen/Pelvis/Feet  -  ✓  1,203
Total  10,502

Downstream datasets. We conduct experiments on twelve public datasets, i.e., BTCV [53], MM-WHS [60], Spleen [61], Flare22 [62], Amos22 [63], MSD Task03 [61], MSD Task06 [61], MSD Task07 [61], MSD Task08 [61], MSD Task10 [61], BraTS 21 [61], and CC-CCII [64]. These datasets cover segmentation and classification tasks: the first ten datasets are used for organ and lesion segmentation, the eleventh (BraTS 21) for tumor segmentation, and the last (CC-CCII) for disease classification. For the BTCV [53] dataset, we adhere strictly to the data splits defined in prior studies [9, 20, 11], which comprise only training and validation sets. The training split is utilized for both pre-training and fine-tuning, while the validation split is excluded from pre-training and reserved solely for evaluation. All other datasets are unseen during pre-training. Furthermore, to evaluate the cross-modality generalization ability, we transferred the model pre-trained on CT scans to the MRI dataset BraTS 21 [61], adopting settings consistent with previous works [19, 9, 20].

Evaluation metrics. Following [9, 20], we utilized Dice Similarity Coefficient (DSC) and Normalized Surface Distance (NSD) to evaluate the segmentation tasks. We then utilized Accuracy (ACC) and the Area Under the Curve (AUC) to evaluate the disease classification tasks.
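As a reference for the segmentation metric, a minimal sketch of the Dice Similarity Coefficient for a single binary mask is shown below; NSD additionally requires surface-distance computation and is omitted. The function name and smoothing term are illustrative.

```python
# A minimal sketch of the Dice Similarity Coefficient (DSC) for one binary mask.
import torch


def dice_score(pred, target, eps=1e-6):
    """pred, target: tensors of the same shape with values in {0, 1}."""
    pred, target = pred.bool(), target.bool()
    intersection = (pred & target).sum().float()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)


if __name__ == "__main__":
    pred = torch.zeros(4, 4, 4); pred[:2] = 1
    target = torch.zeros(4, 4, 4); target[1:3] = 1
    print(dice_score(pred, target))   # ~0.5: half of each mask overlaps
```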

Table 2: Pre-training settings.
Pre-training settings
Steps 45K
Optimizer AdamW
Learning rate (LR) 1e-4
LR scheduler cosine annealing schedule
Warmup steps 100
Regularization weight 1e-2
Batch size 256
MiM Level-1 size $H^{1}\times W^{1}\times D^{1}$ 384, 384, 192
MiM Level-2 size $H^{2}\times W^{2}\times D^{2}$ 96, 96, 96
MiM Level-3 size $H^{3}\times W^{3}\times D^{3}$ 16, 16, 16
MiM resize after crop $h\times w\times d$ 96, 96, 96
MiM mask ratio $\mu$ 0.6
MiM sampling times for the next level $\gamma$ 4
Table 3: Experiment results on BTCV dataset [53] across thirteen organs. The best results are highlighted in red. The term ‘From Scratch’ signifies the supervised baseline without self-supervised pre-training. {\dagger} denotes that we re-implement the approach. Most results are drawn from [19, 42, 20] or their papers.
Method Pre-training dataset Network Dice Score(%) Avg
Spl RKid LKid Gall Eso Liv Sto Aor IVC Veins Pan RAG LAG
From Scratch
UNETR [65] - - 93.02 94.13 94.12 66.99 70.87 96.11 77.27 89.22 82.10 70.16 76.65 65.32 59.21 79.82
Swin-UNETR [21] - - 94.06 93.54 93.80 65.51 74.60 97.09 75.94 91.80 82.36 73.63 75.19 68.00 61.11 80.53
General SSL
SimCLR [29] 1k UNETR 92.79 93.04 91.41 49.65 50.99 98.49 77.92 85.56 80.58 64.37 67.16 59.04 48.99 73.85
MoCov3 [66] 1k UNETR 91.96 92.85 92.42 68.25 72.77 94.91 78.82 88.21 81.59 71.15 75.76 66.48 58.81 79.54
DINO{\dagger} [49] 1k UNETR 93.64 92.95 92.77 74.70 71.87 96.47 77.85 89.49 83.30 72.12 78.41 67.26 63.88 81.22
localMIM{\dagger} [46] 1k UNETR 95.31 94.16 94.17 74.52 73.69 96.57 82.21 89.92 84.67 72.12 76.89 67.68 62.29 81.96
HPM{\dagger} [41] 1k UNETR 94.47 93.46 93.86 75.62 74.07 96.11 80.92 90.01 84.42 71.25 79.29 67.34 64.40 82.03
SimMIM [39] 1k Swin-UNETR 95.51 93.61 93.49 67.91 73.50 96.46 81.15 89.78 84.86 72.45 75.70 66.89 64.46 81.41
MCMAE{\dagger} [25] 1k Swin-UNETR 94.60 94.08 93.87 62.66 75.13 96.26 82.08 90.27 85.68 75.99 81.18 68.78 64.68 82.20
3D Medical SSL
ROT [34] 1k 3D UNet 91.75 93.13 91.62 65.09 76.55 94.21 86.16 89.74 83.08 71.13 81.55 67.90 63.72 81.20
ModelGen [38] 1k 3D UNet 91.99 93.52 91.81 65.11 76.14 95.98 86.88 89.29 83.59 71.79 81.62 67.97 63.18 81.45
PCRLv1{\dagger} [32] 1k 3D UNet 94.44 92.50 92.75 56.46 74.95 96.62 81.64 89.86 87.12 72.78 75.24 69.73 68.18 81.30
PCRLv2{\dagger} [52] 1k 3D UNet 95.50 91.43 89.52 76.15 73.54 97.28 79.64 90.16 84.17 75.20 78.71 68.74 62.93 81.74
UniMiss{\dagger} [51] 1k MiT 95.24 93.74 93.78 72.69 73.61 96.23 83.08 88.50 83.31 71.89 78.60 66.22 68.03 82.05
GL-MAE [20] 1k UNETR 94.54 94.39 94.37 73.19 74.93 96.51 83.49 89.74 83.11 70.80 75.71 69.39 63.12 82.33
MAE3D [19] 1k UNETR 95.81 94.38 94.48 69.96 76.85 96.69 80.44 90.33 84.33 73.65 80.11 68.65 64.44 82.40
Rubik++ [36] 1k Swin-UNETR 96.21 90.41 89.33 75.22 72.64 97.44 79.25 89.65 83.76 74.74 78.35 67.14 61.97 81.38
SwinMM [22] 1k Swin-UNETR 94.33 94.18 94.16 72.97 74.75 96.37 83.23 89.56 82.91 70.65 75.52 69.17 62.90 81.81
Adam{\dagger} [47] 1k Swin-UNETR 94.16 93.65 93.43 66.14 71.28 96.18 76.93 89.83 85.35 71.16 80.37 63.97 60.83 80.45
GVSL{\dagger} [8] 1k Swin-UNETR 95.27 91.22 92.25 72.69 73.56 96.44 82.40 88.90 84.22 70.84 76.42 67.48 63.25 81.87
Swin-UNETR{\dagger} [9] 1k Swin-UNETR 95.91 94.48 94.42 69.57 76.47 96.94 78.50 90.31 85.77 75.12 81.33 67.37 64.92 82.58
MiM 1k Swin-UNETR 96.05 94.58 94.53 75.73 77.36 97.03 83.23 90.37 87.64 74.97 82.62 71.02 68.80 84.46
\cdashline1-15 MAE3D{\dagger} [19] 10k UNETR 95.91 94.28 94.26 73.82 75.35 97.07 83.53 91.12 86.74 75.13 83.33 68.43 66.40 83.52
Swin-UNETR{\dagger} [9] 10k Swin-UNETR 95.20 94.30 94.15 76.53 74.02 96.61 82.25 89.84 84.20 74.23 82.53 70.74 67.49 83.20
MiM 10k Swin-UNETR 96.37 94.83 94.75 81.02 80.08 97.12 85.30 90.36 87.66 75.99 84.41 71.94 69.64 85.41

*Note: Spl: spleen, RKid: right kidney, LKid: left kidney, Gall: gallbladder, Eso: esophagus, Liv: liver, Sto: stomach, Aor: aorta, IVC: inferior vena cava, Veins: portal and splenic veins, Pan: pancreas, RAG/LAG: right and left adrenal glands.

Figure 6: Visualization of segmentation results on BTCV validation dataset [53]. We compared MiM with Unimiss [51], PCRLv2 [52], GVSL [8], Swin-UNETR [9] and MAE3D [19].

4.2 Implementation Details

During pre-training, we followed the settings of previous works [19, 9] and provide the details of our MiM pre-training settings in Table 2. Specifically, the Level-1 volume was randomly cropped from the entire CT volume. We used the backbone of [25] as the encoder for efficient token processing. As in previous SSL works [49, 20], the prediction head and projection head are implemented with MLP layers for aligning the dimensions. During fine-tuning, we used Swin-UNETR [21] for segmentation tasks and Swin-ViT for classification tasks, following the settings of previous works [9, 20]. Specifically, for the segmentation tasks, we discarded the decoder and only used the backbone during fine-tuning. For the classification tasks, we strictly adhered to the methodology of prior work in general computer vision [49, 17] and 3D medical imaging [20], utilizing only the features from the final layer. While incorporating multi-scale features could enhance classification performance [45, 67, 68], we opted not to follow this approach to ensure a fair comparison with other methods. Therefore, we use only the final layer's features, combined with a global average pooling (GAP) layer and a simple MLP classifier, to predict the category. We initialized the encoder of the network with the parameters learned during pre-training and fine-tuned the overall network. For inference on these datasets, we applied sliding-window inference with overlapping to enable a fair comparison with previous works [9]. It is important to note that, to evaluate the pure effectiveness of our proposed method, we did not use any foundation model or post-processing techniques [7, 69].
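A minimal sketch of the classification fine-tuning head described above (GAP over the encoder's final feature map followed by a simple MLP) is given below; the feature dimension, hidden width, and number of classes are illustrative assumptions, not the exact fine-tuning configuration.

```python
# A minimal sketch of the classification head: global average pooling + MLP.
import torch
import torch.nn as nn


class ClassificationHead(nn.Module):
    def __init__(self, feat_dim=768, hidden_dim=256, num_classes=3):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool3d(1)                  # pool the final 3D feature map
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, feat):                                # feat: (B, C, h, w, d)
        return self.mlp(self.gap(feat).flatten(1))          # (B, num_classes)


if __name__ == "__main__":
    head = ClassificationHead()
    print(head(torch.randn(2, 768, 3, 3, 3)).shape)         # torch.Size([2, 3])
```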

Comparison methods. We compared our MiM method with both general SSL and medical SSL methods. First, we compare with the typical SSL methods MoCov3 [66] and MAE [19, 17], since they represent the two mainstream SSL paradigms. We also report the results of SimCLR [29] according to [19, 20]. We further evaluate the performance of SimMIM [39], HPM [41], localMIM [46], and MCMAE [25], since they are related to our advanced hybrid MAE design. We also compare with Adam [47] due to its relevance to our hierarchical, multi-granularity design. Additionally, we compared with most existing SOTA medical SSL methods. Following common practices in SSL for natural images [70, 29, 17, 66, 49] and 3D medical images [9, 19, 20, 32, 52], we performed a single round of pre-training and fine-tuning for all methods to obtain results.

Table 4: Experiment results on BTCV [53], MM-WHS [60], MSD Spleen [61], Amos22 [63] and Flare22 [62] with varying ratios of training datasets. The evaluation was conducted using the DSC(%) metric. {\dagger} denotes we re-implement the approach and report the results. Most results are drawn from [19, 20].
Method Pre-training dataset BTCV MM-WHS MSD Spleen Amos22 Flare22 Avg
25% 50% 100% 25% 50% 100% 25% 50% 100% 25% 50% 100% 25% 50% 100%
From scratch
UNETR [65] - 58.99 75.20 79.82 78.77 85.31 85.85 89.67 92.23 94.20 73.53 81.46 84.34 75.32 76.29 76.94 80.53
Swin-UNETR [21] - 63.08 75.38 80.53 79.59 85.65 86.11 90.08 93.29 94.90 76.03 82.60 84.85 77.43 80.50 83.08 82.21
General SSL
SimCLR [29] 1k 59.03 75.31 79.85 78.88 85.52 86.00 89.88 92.52 94.11 73.62 81.65 84.66 75.34 76.33 77.03 80.65
MoCov3{\dagger} [66] 1k 49.55 75.65 79.54 79.50 86.13 84.16 92.12 93.56 94.23 76.11 82.70 84.95 77.62 80.91 83.22 81.33
DINO{\dagger} [49] 1k 59.68 75.53 81.22 78.80 86.31 89.80 93.58 93.84 95.79 74.40 84.43 87.24 78.42 82.01 85.99 83.14
SimMIM{\dagger} [39] 1k 62.01 75.98 81.41 79.71 86.38 90.22 93.87 93.48 94.94 79.57 86.97 89.74 78.01 82.49 86.05 84.01
localMIM{\dagger} [46] 1k 62.26 77.43 81.96 82.21 87.15 89.99 91.80 92.48 93.37 79.84 85.72 88.09 78.43 82.42 86.48 83.98
HPM{\dagger} [41] 1k 62.95 76.32 82.03 82.34 87.64 89.86 93.78 93.11 93.86 79.89 85.70 88.07 78.20 82.42 86.15 84.16
MCMAE{\dagger} [25] 1k 64.70 78.01 82.20 85.24 88.74 90.29 94.92 95.88 96.04 75.82 84.76 86.63 78.97 80.79 84.31 84.49
3D Medical SSL
ROT [34] 1k 53.43 74.44 81.20 80.12 86.98 87.12 90.99 95.01 95.45 76.88 86.01 87.12 77.11 81.16 84.51 82.50
Rubik++ [36] 1k 53.00 74.74 81.38 80.88 87.18 88.02 91.91 95.22 95.73 77.43 86.91 88.20 78.43 82.85 86.49 83.22
ModelGen [38] 1k 54.18 61.26 81.45 80.33 86.26 89.69 92.01 95.22 95.73 77.49 86.20 88.29 78.17 82.37 84.02 82.18
PCRLv1{\dagger} [32] 1k 55.08 70.92 81.30 83.34 87.59 90.32 89.69 92.01 95.10 79.46 86.24 86.02 78.44 84.14 87.34 83.13
PCRLv2{\dagger} [52] 1k 58.67 71.40 81.32 84.51 88.13 90.36 90.32 95.16 95.46 77.32 84.92 88.79 79.10 85.37 87.45 83.89
SwinMM{\dagger} [22] 1k 59.40 73.66 81.81 84.61 88.32 90.40 90.36 95.23 95.55 77.33 84.95 88.88 79.26 85.62 87.56 84.20
Adam{\dagger} [47] 1k 64.12 73.66 80.45 83.12 87.26 88.82 88.73 92.09 86.90 59.89 73.89 88.87 79.54 84.43 87.24 81.27
GVSL{\dagger} [8] 1k 52.03 74.23 81.87 85.24 88.00 90.22 95.36 96.20 96.57 78.72 87.41 89.92 81.89 85.38 87.42 84.70
Unimiss{\dagger} [51] 1k 65.96 77.15 82.05 83.64 88.49 90.37 93.30 94.74 95.10 77.03 85.08 88.23 80.37 85.21 87.66 84.96
GL-MAE [20] 1k 66.44 76.37 82.33 76.16 87.72 88.88 93.65 94.36 95.72 74.65 84.65 87.40 80.20 85.03 87.26 84.05
Swin-UNETR{\dagger} [9] 1k 62.12 76.13 82.58 85.33 87.50 90.32 95.13 95.00 95.02 78.41 87.49 89.83 83.21 85.72 87.17 85.40
MAE3D{\dagger} [19] 1k 63.66 78.57 82.48 85.57 88.62 90.03 95.12 95.11 95.50 77.63 87.08 89.71 83.31 85.85 87.31 85.70
MiM 1k 70.71 81.22 84.66 86.52 88.99 91.04 95.99 96.42 96.98 80.55 88.94 90.96 85.55 86.29 88.96 87.59
\cdashline1-18 Swin-UNETR{\dagger} [9] 10k 69.96 78.32 83.20 85.88 88.83 90.47 92.30 95.97 96.26 78.15 86.61 89.96 84.54 86.27 88.04 86.32
MAE3D{\dagger} [19] 10k 70.59 79.44 83.52 85.78 89.13 90.75 95.57 96.19 96.92 77.01 86.94 88.55 84.44 86.27 88.50 86.64
MiM 10k 75.34 81.80 85.61 86.41 89.77 91.12 96.36 96.68 96.99 81.29 89.06 91.60 86.29 87.79 89.67 88.34

4.3 Experiments on downstream tasks

4.3.1 Comparison on the BTCV dataset

We first conducted experiments on the BTCV dataset [53], and the results are presented in Table 3. Among the compared methods, SimCLR [29], MoCov3 [66], DINO [49], localMIM [46], HPM [41], MAE3D [19], and GL-MAE [20] adopted UNETR [65] as the network architecture. Most other methods, including our MiM, used Swin-UNETR [21], following the settings of previous works [9]. Table 3 includes the backbone details for 3D UNet [37], UNETR [65], and Swin-UNETR [21]: UNETR [65] uses ViT [50], and Swin-UNETR [21] employs the Swin Transformer [45]. For all experiments, we used ViT-Base [50] for UNETR [65] and Swin-Base [45] for Swin-UNETR [21], balancing performance and computational efficiency. These pre-trained encoders were used to initialize the encoders of the respective segmentation networks.

Remark. As shown in Table 3, we observed that general SSL methods performed worse than medical SSL methods. Specifically, SimCLR [29] and MoCov3 [66] achieved only 73.85% and 79.54% DSC, respectively. This is because these methods rely on large batch sizes and negative samples to avoid trivial solutions, which is impractical for 3D medical images. Additionally, the negative relationships between different images used in SimCLR [29] and MoCov3 [66] are not suitable for 3D medical images. DINO [49] also achieved only limited improvements. Our proposed MiM outperformed MAE-based methods such as MAE3D [19], GL-MAE [20], localMIM [46], HPM [41], and MCMAE [25] by a significant margin. We conclude that general SSL methods are not well suited to 3D medical images, and that it is crucial to consider the characteristics of 3D medical images when designing SSL methods.

The scratch Swin-UNETR [21] only achieves 80.53% DSC. By pre-training MiM on a 1k unlabeled dataset, we gained a 3.93% improvement with 84.46% DSC, which outperforms existing methods by a clear margin. Among the compared methods, Swin-UNETR [9] and MAE3D [19] achieved the best 82.58% and second-best 82.40% DSC, respectively. Our MiM surpasses these two methods by 1.88% and 2.06% DSC, respectively, which is a clear improvement on this dataset.

It’s worth noting that scaling laws [71] also apply to 3D medical image pre-training. By pre-training with larger scale unlabeled 10k dataset, we observed that Swin-UNETR [9] and MAE3D [19] achieved DSC scores of 83.20% and 83.52%, respectively. Our MiM with 10k achieved a DSC score of 85.41%, which consistently outperformed these two methods significantly. These results suggest that scaling plays an important role in pre-training for 3D medical images and that our MiM method is effective for pre-training on larger datasets.

Qualitative results. As shown in Fig. 6, MiM improves the completeness of the segmentation results, which are visibly better than those of existing methods.

4.3.2 Comparison on the Unseen datasets

We further conduct experiments on datasets unseen during pre-training, i.e., MM-WHS [60], Spleen [61], Amos22 [63], and Flare22 [62]. The results on these four datasets are shown in Table 4. It can be observed that our MiM consistently outperformed all existing methods by a clear margin, which demonstrates promising generalizability to unseen datasets. Specifically, MiM outperformed existing methods by at least 1.89% DSC on average. By pre-training with the larger-scale unlabeled 10k dataset, Swin-UNETR [9] and MAE3D [19] improved by 0.92% and 0.94% DSC to 86.32% and 86.64%, respectively. Our MiM also improved, to 88.34% DSC, and outperformed these two methods consistently. MiM also shows label efficiency when fine-tuning with fewer labels [9]: MiM fine-tuned with only 50% of the labels surpassed the scratch Swin-UNETR trained with 100% of the labels by a clear margin.

Table 5: Experiment results on Five CT-based tasks on MSD dataset [61].
Method Pre-training dataset Network Task03 Liver Task06 Lung Task07 Pancreas Task08 Hepatic Vessel Task10 Colon Avg
DSC(%) NSD(%) DSC(%) NSD(%) DSC(%) NSD(%) DSC(%) NSD(%) DSC(%) NSD(%) DSC(%) NSD(%)
From scratch
3D UNet [37] - - 94.41 93.94 48.09 42.65 75.33 90.98 56.11 77.58 40.97 54.63 62.98 71.95
UNETR [65] - - 94.27 94.00 44.26 40.74 74.12 88.79 58.45 79.34 27.39 42.55 59.70 69.08
Swin-UNETR [21] - - 94.52 94.08 51.97 48.68 75.75 88.87 57.24 77.84 44.57 56.15 64.84 73.12
General SSL
MoCov3 [66] 1k UNETR 94.15 93.22 52.11 48.16 74.16 78.13 58.11 78.13 43.99 55.13 64.90 72.55
localMIM [46] 1k UNETR 94.72 93.69 52.93 48.75 74.74 79.77 58.86 78.77 44.52 54.35 65.55 73.10
HPM [41] 1k UNETR 94.63 93.25 53.63 48.34 74.74 78.13 59.89 78.93 42.67 55.13 65.51 72.81
SimMIM [39] 1k Swin-UNETR 94.50 92.45 56.93 52.41 74.99 79.77 60.44 78.99 42.42 54.35 66.26 73.65
MCMAE [25] 1k Swin-UNETR 95.49 94.33 57.08 57.18 75.78 78.13 60.44 78.93 45.63 55.13 67.28 74.96
3D Medical SSL
ModelGen [38] 1k 3D Unet 94.49 94.22 56.26 51.80 76.92 91.70 58.28 78.34 47.15 58.08 66.68 74.83
PCRLv2 [52] 1k 3D Unet 94.61 95.78 62.57 59.67 77.02 92.67 53.63 76.25 39.78 53.50 65.52 75.57
GL-MAE [20] 1k UNETR 94.65 93.48 53.28 49.51 72.96 87.90 60.26 78.65 45.52 59.45 65.53 74.20
MAE3D [19] 1k UNETR 95.53 96.62 58.30 52.32 77.31 91.92 60.44 78.80 45.40 56.00 66.76 75.33
Rubik++ [36] 1k Swin-UNETR 94.49 94.22 56.43 52.84 76.14 90.88 60.42 78.71 40.49 50.78 65.63 73.69
GVSL [8] 1k Swin-UNETR 95.62 95.78 59.44 54.59 77.82 92.61 57.74 77.93 50.55 61.42 68.23 76.76
Swin-UNETR [9] 1k Swin-UNETR 95.55 96.61 60.23 57.22 76.93 90.88 60.44 79.08 43.21 55.43 68.98 75.16
MiM 1k UNETR 96.01 97.03 63.06 60.11 78.39 92.71 60.71 79.44 51.02 62.57 69.79 78.37
MiM 1k Swin-UNETR 96.14 97.43 63.52 60.92 78.86 92.80 60.85 79.55 51.02 63.03 70.07 78.75
\cdashline1-15 MiM 10k Swin-UNETR 96.41 97.63 64.46 62.59 79.19 93.14 61.72 80.06 52.73 64.85 70.76 79.67

4.3.3 Comparison on the MSD datasets

To evaluate the generalizability on organ segmentation tasks, we conducted experiments on the five CT-based tasks of the MSD dataset [61], i.e., Task03 Liver, Task06 Lung, Task07 Pancreas, Task08 Hepatic Vessel, and Task10 Colon. Since existing methods did not conduct experiments with the same pre-training dataset, we re-implemented these methods for a fair comparison. It can be observed in Table 5 that MiM achieves the best average DSC (70.07%) and NSD (78.75%) across all tasks. Since the scratch Swin-UNETR [21] performs better than UNETR [65] in terms of average DSC (64.84% vs 59.70%) and NSD (73.12% vs 69.08%), we further pre-trained UNETR [65] with MiM for a fair comparison. With MiM pre-training, Swin-UNETR [21] gained average improvements of 5.23% and 5.63% in DSC and NSD, respectively; with UNETR [65] as the network, we observed average improvements of 10.09% and 9.29% in DSC and NSD, respectively. Moreover, by pre-training with the larger-scale unlabeled 10k dataset, MiM further improved to 70.76% DSC and 79.67% NSD.

4.3.4 Comparison on CC-CCII dataset

To evaluate the generalizability of our MiM on the classification task, we fine-tuned it on the CC-CCII [64] dataset and compared its performance with state-of-the-art general and medical SSL methods. Since existing SSL methods did not conduct experiments on this dataset, we reproduced the related methods and report the results. As shown in Table 6, MiM achieved the best performance in terms of ACC and AUC, surpassing all other methods with 93.63% and 99.39%, respectively. These findings demonstrate that the representation learned by MiM transfers well to classification problems and can be used effectively for medical image classification tasks. With the larger-scale 10k pre-training dataset, our MiM further improved to 94.26% ACC and 99.69% AUC, which shows the scalability of our proposed method when transferring across tasks.

Table 6: Experiment results on CC-CCII dataset [64].
Method Pre-training dataset Network Disease classification
ACC(%) AUC
From Scratch
ResNet [48] - - 85.94 87.61
ViT [50] - - 85.92 96.26
Swin-ViT [45] - - 87.15 97.32
General SSL
SimCLR [29] 1k ResNet 87.12 96.21
MoCov3 [66] 1k ResNet 87.01 95.49
localMIM [46] 1k ViT 88.15 97.58
HPM [41] 1k ViT 88.26 97.65
SimMIM [39] 1k Swin-ViT 89.62 98.16
MCMAE [25] 1k Swin-ViT 90.26 97.12
Medical SSL
PCRLv1 [32] 1k ResNet 88.84 97.69
PCRLv2 [52] 1k ResNet 89.35 98.05
GL-MAE [20] 1k ViT 88.00 96.97
MAE3D [19] 1k ViT 91.30 98.13
Rubik++ [36] 1k Swin-ViT 89.93 98.55
SwinMM [22] 1k Swin-ViT 89.99 98.77
Swin-UNETR [9] 1k Swin-ViT 91.81 98.97
MiM 1k Swin-ViT 93.63 99.39
\cdashline1-5 MiM 10k Swin-ViT 94.26 99.69

4.3.5 Comparison on the BraTS 21 dataset

To evaluate the generalizability of our MiM on MRI data, we fine-tuned it on the BraTS 21 [61] MRI tumor segmentation dataset and compared its performance with state-of-the-art general and medical SSL methods. WT, TC, and ET denote the whole tumor, tumor core, and enhancing tumor, respectively. It can be observed in Table 7 that SSL methods improve the model's performance in segmenting tumors on the BraTS 21 [61] dataset. This is because CT and MRI scans, although acquired for different purposes, depict similar anatomical structures, so the knowledge learned by SSL methods from unlabeled CT datasets can be transferred to MRI datasets [72, 9]. Our MiM outperformed all other methods by at least 1.34%, reaching 79.28% average DSC. By pre-training with the larger unlabeled 10k dataset, our MiM further improved to 79.92%, which shows the scalability of our proposed method when transferring across modalities.

Table 7: Experiment results on BraTS-21 [61].
Method Pre-training dataset Network Dice Score(%)
TC WT ET AVG
From Scratch
UNETR [65] - - 81.62 87.81 57.34 75.58
Swin-UNETR [21] - - 81.28 88.67 57.73 75.89
General SSL
MoCov3 [66] 1k UNETR 82.60 88.89 57.69 76.39
localMIM{\dagger} [46] 1k UNETR 82.44 88.78 58.64 76.62
SimMIM{\dagger} [39] 1k Swin-UNETR 84.06 90.43 59.07 77.85
MCMAE{\dagger} [25] 1k Swin-UNETR 84.27 90.52 59.04 77.94
3D Medical SSL
MAE3D{\dagger} [19] 1k UNETR 82.34 90.35 59.18 77.29
PCRLv2 [52] 1k Swin-UNETR 82.13 90.06 57.70 76.63
Swin-UNETR [9] 1k Swin-UNETR 82.51 89.08 58.15 76.58
SwinMM [22] 1k Swin-UNETR 83.48 90.47 58.72 77.56
MiM 1k Swin-UNETR 84.99 91.92 60.94 79.28
\cdashline1-5 MiM 10k Swin-UNETR 85.96 92.34 61.45 79.92

4.4 Analysis of our proposed method

All models were pre-trained on the 1k dataset and then evaluated on BTCV [53] and MM-WHS [60].

4.4.1 Ablation study

Loss functions. We conducted comprehensive ablation studies on the BTCV [53] and MM-WHS [60] validation datasets to evaluate the effectiveness of our hierarchical design, focusing on the multi-level reconstruction and cross-level alignment components. As shown in Table 8, the multi-level reconstruction losses $\mathcal{L}_{\mathcal{R}}^{1}$, $\mathcal{L}_{\mathcal{R}}^{2}$, and $\mathcal{L}_{\mathcal{R}}^{3}$ substantially enhance model performance by capturing features at multiple granularities from 3D medical images, outperforming traditional single-level masked auto-encoding approaches (Rows 3-4). The importance of proper loss composition is evident when examining the cross-level alignment loss $\mathcal{L}_{\mathcal{C}}$ in isolation (Row 2), which resulted in poor convergence and modest performance: 81.92% DSC, 77.11% NSD on BTCV and 89.99% DSC, 74.12% NSD on MM-WHS. This observation indicates that the reconstruction losses $\mathcal{L}_{\mathcal{R}}$ play a fundamental role in establishing robust within-level representations before cross-level alignment $\mathcal{L}_{\mathcal{C}}$ can be effectively applied. Further analysis reveals that alignment $\mathcal{L}_{\mathcal{C}}^{2,3}$ between the finer levels $x^{2}$ and $x^{3}$ (Row 5) shows modest gains, since fine-grained details are primarily addressed by the reconstruction losses, whereas alignment $\mathcal{L}_{\mathcal{C}}^{1,2}$ between the coarser levels $x^{1}$ and $x^{2}$ (Row 6) yields more substantial improvements in both DSC and NSD by effectively capturing global context. The optimal performance on both datasets was achieved by combining all reconstruction and cross-level alignment losses (last row), demonstrating the synergistic relationship between multi-level reconstruction and cross-level alignment in hierarchical representation learning.

Table 8: Evaluation of the loss terms $\mathcal{L}_{\mathcal{R}}$ and $\mathcal{L}_{\mathcal{C}}$. Each row corresponds to a different combination of $\mathcal{L}_{\mathcal{R}}^{1}$, $\mathcal{L}_{\mathcal{R}}^{2}$, $\mathcal{L}_{\mathcal{R}}^{3}$, $\mathcal{L}_{\mathcal{C}}^{1,2}$, and $\mathcal{L}_{\mathcal{C}}^{2,3}$ (see text). We report DSC(%) and NSD(%) on BTCV and MM-WHS.
Row  BTCV DSC(%)  BTCV NSD(%)  MM-WHS DSC(%)  MM-WHS NSD(%)
Row 1  81.49  75.92  89.42  73.06
Row 2  81.92  77.11  89.99  74.12
Row 3  83.10  79.51  90.44  74.53
Row 4  84.25  81.29  90.74  75.50
Row 5  84.29  81.45  90.89  75.97
Row 6  84.43  81.56  90.89  76.27
Row 7 (all losses)  84.66  82.11  91.04  76.65

Patch types for generating subsequent-level volumes. During the patchification process (Fig. 3), each level-$l$ volume $x^{l}$ is divided into masked and unmasked patches. We evaluate both patch types for generating the subsequent level-$(l{+}1)$ volume $x^{l+1}$ in our framework. As shown in Table 9, using masked patches from $x^{l}$ to generate $x^{l+1}$ consistently achieves better performance than using unmasked patches from $x^{l}$. This superiority stems from masked patches forcing the model to recover missing information, thereby promoting effective reconstruction of overlapping regions across levels and the learning of multi-scale semantic representations. In contrast, unmasked patches directly expose original volume information to the model within the same iteration, effectively creating an information shortcut. This shortcut reduces the learning challenge by allowing the model to simply copy unmasked features rather than learning to infer and reconstruct them, ultimately limiting its ability to capture rich semantic details and develop robust generalization capabilities.

Table 9: Evaluation of the patch types derived from the level-$l$ volumes $x^{l}$ that are used to generate the subsequent-level volumes $x^{l+1}$. We report DSC(%) and NSD(%) on BTCV [53] and MM-WHS [60].
Patch types for generating $x^{l+1}$  BTCV  MM-WHS
  DSC (%)  NSD (%)  DSC (%)  NSD (%)
Unmasked patches from $x^{l}$  82.18  76.22  89.33  73.65
Masked patches from $x^{l}$  84.66  82.11  91.04  76.65

Negative pairs for $\mathcal{L}_{\mathcal{C}}$. In Eq. 6, we employ the InfoNCE loss [29] as the default choice. This loss maximizes the similarity between cross-level images and pushes away the negative samples. An alternative is the BYOL cosine loss [73]; the primary distinction is whether negative samples are used. As shown in Table 10, negative samples aid in learning better representations [29], and thus we include them in our default configuration.

Table 10: Evaluation of $\mathcal{L}_{\mathcal{C}}$ with and without negative samples.
Loss function $\mathcal{L}_{\mathcal{C}}$ BTCV MM-WHS
DSC(%) NSD(%) DSC(%) NSD(%)
BYOL-style cosine loss [73] 83.68 80.12 90.53 74.66
Contrastive loss [29] 84.66 82.11 91.04 76.65

Efficiency of MiM. This study adopts a hybrid convolution-transformer backbone that integrates convolutional blocks with transformer layers. By incorporating convolutional blocks, the architecture enhances inductive-bias learning and enables the reuse of multi-scale features, effectively supporting hybrid representation learning [25]. This design enhances the transformer's capability to process 3D medical images with greater efficiency and effectiveness; as such, this backbone is a foundational component of our proposed framework. Table 11 presents a comparative analysis of computational costs, i.e., FLOPs and pre-training time, across various methods. Our evaluation highlights the computational efficiency of MAE-based approaches when coupled with UNETR, as they leverage only the unmasked tokens of masked 3D medical images. In contrast, MAE-based methods that are naively combined with hybrid backbones such as Swin-UNETR must process all tokens. Our proposed method extends the hybrid backbone architecture [25] to 3D medical images in the pre-training stage, achieving significantly superior performance while maintaining computational efficiency.

Table 11: Evaluation of model complexity during pre-training and subsequent performance on downstream tasks.
Method Networks Tokens FLOPs (G) Pre-training duration (h) BTCV DSC (%) MM-WHS DSC (%)
MAE3D UNETR Partial 12.3 9.0 82.20 90.03
HPM UNETR Partial 34.1 9.9 82.04 89.86
localMIM UNETR Partial 26.6 10.3 81.38 89.99
Adam Swin-UNETR All 43.7 27.0 80.45 88.82
SimMIM Swin-UNETR All 70.8 26.6 82.24 90.22
Swin-UNETR Swin-UNETR All 394.84 130.0 82.58 90.32
MiM Swin-UNETR Partial 45.8 14.6 84.66 91.04

4.4.2 Hyper-parameter analysis

Weights of $\mathcal{L}_{\mathcal{R}}$ and $\mathcal{L}_{\mathcal{C}}$. In Eq. 10, the loss function of MiM consists of two parts. In Fig. 7, we varied the weight $\alpha$ of the cross-level alignment term. The optimal value of $\alpha$ was $1e^{-1}$. Further increasing the weight of cross-level alignment did not bring additional benefits, possibly because the magnitude of the cross-level alignment loss then dominates that of the reconstruction loss, causing the reconstruction objective to be neglected.
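A minimal sketch of how we assume the two terms of Eq. 10 are combined; the per-level aggregation is simplified for illustration.

def total_loss(recon_losses, align_losses, alpha=0.1):
    """Combine per-level reconstruction losses with cross-level alignment.

    recon_losses: [L_R^1, L_R^2, L_R^3]  (masked-patch reconstruction at each level)
    align_losses: [L_C^12, L_C^23]       (alignment between adjacent levels)
    alpha:        weight of the alignment term (best value ~1e-1 in Fig. 7).
    """
    return sum(recon_losses) + alpha * sum(align_losses)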

Figure 7: Evaluation of the loss weight $\alpha$ for $\mathcal{L}_{\mathcal{C}}$.
Figure 8: Evaluation of the number of levels $L$.

Weight of different levels of $\mathcal{L}_{\mathcal{R}}$. To determine the optimal strategy for multi-level reconstruction, we evaluated four learning processes during pre-training by varying the weights of the reconstruction losses at different levels, as shown in Table 12. On the BTCV [53] and MM-WHS [60] datasets, the baseline without hierarchical reconstruction (w/o Coarse & Fine, with zero weights on $\mathcal{L}_{R}^{2}$ and $\mathcal{L}_{R}^{3}$) yielded the lowest performance, with DSC scores of 81.49% and 89.42%. The Coarse→Fine process, which initially emphasizes the coarser levels ($\mathcal{L}_{R}^{1}$ and $\mathcal{L}_{R}^{2}$) before prioritizing the finer level ($\mathcal{L}_{R}^{3}$), improved performance to 83.99% and 90.70% DSC. The Fine→Coarse process, which reverses this order by beginning with finer-level reconstruction, achieved even better results, with DSC scores of 84.26% and 90.85%. The Simultaneously process, which keeps equal weights across all levels throughout training, was the most effective, achieving the highest DSC scores of 84.66% and 91.04%. These results demonstrate that hierarchical learning strategies significantly enhance representation learning, with simultaneous multi-level reconstruction proving most beneficial.
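For clarity, the sketch below parameterizes the schedules of Table 12 as per-level weights that move from an initial to a final value over training; the linear interpolation, and the exact endpoints beyond our reading of Table 12, are our own assumptions for illustration.

def level_weight_schedule(initial, final, progress):
    """Interpolate per-level reconstruction weights over training.

    initial, final: weights for [L_R^1, L_R^2, L_R^3] at the start/end of
    pre-training (the arrow notation "2 -> 1" in Table 12 maps to
    initial=2, final=1 for that level); progress is in [0, 1].
    Linear interpolation is an assumption made for illustration.
    """
    return [a + (b - a) * progress for a, b in zip(initial, final)]

# The four schedules of Table 12, written as (initial, final) weight lists:
SCHEDULES = {
    "w/o Coarse & Fine": ([1, 0, 0], [1, 0, 0]),
    "Coarse -> Fine":    ([2, 2, 0], [1, 1, 1]),
    "Fine -> Coarse":    ([0, 2, 2], [1, 1, 1]),
    "Simultaneously":    ([1, 1, 1], [1, 1, 1]),
}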

Table 12: Evaluation of the model with different learning processes. We report the DSC (%) on BTCV [53] and MM-WHS [60].
Learning process Weight of level-$l$ reconstruction loss Datasets
$\mathcal{L}_{R}^{1}$ $\mathcal{L}_{R}^{2}$ $\mathcal{L}_{R}^{3}$ BTCV MM-WHS
w/o Coarse & Fine 1→1 0→0 0→0 81.49 89.42
Coarse→Fine 2→1 2→1 0→1 83.99 90.70
Fine→Coarse 0→1 2→1 2→1 84.26 90.85
Simultaneously 1→1 1→1 1→1 84.66 91.04

Number of levels $L$. Our adoption of a three-level architecture is supported by both empirical evidence and practical considerations. As illustrated in Fig. 9, our multi-level strategy preserves anatomical structures through carefully selected resolutions: Level-1 ($384\times384\times192$) captures broad contextual information, Level-2 ($96\times96\times96$) extracts intermediate features, and Level-3 ($16\times16\times16$) retains fine details. Through the experiments shown in Fig. 8, we investigated the impact of varying the number of levels $L$ and found that $L=3$ yields optimal performance. Adding a fourth level would result in an extremely low resolution ($1\times1\times1$), which, while potentially simplifying reconstruction, leads to suboptimal performance due to excessive information loss, consistent with findings in similar architectures [17, 39].

Figure 9: Illustration of our multi-level volume generation strategy. The framework processes volumes at three hierarchical levels with decreasing patch sizes: Level-1 (96 px patches), Level-2 (16 px patches), and Level-3 (3x3 patches). This design captures anatomical information at complementary scales while maintaining computational efficiency. Further subdivision beyond Level-3 would result in patches too small to preserve meaningful anatomical features. Gray patches indicate masked regions, while color intensity (light to dark) represents the progression from coarse to fine granularity across three hierarchical levels.

4.4.3 Reconstruction results

We provide reconstruction results of MiM on BTCV [53] in Fig. 10, where the first and second rows show the original volumes and the masked 3D medical images, respectively, and the last row shows the reconstructions, which faithfully recover the masked anatomical structures.

Figure 10: Reconstruction results on the BTCV validation dataset [53].

4.4.4 Incorporating Parameter-Efficient Fine-Tuning Methods

Parameter-efficient fine-tuning methods, such as LoRA [74] and Ladder fine-tuning [75], offer a practical option in realistic scenarios with limited computational resources [74, 75]. Specifically, we evaluated LoRA [74] for both classification and segmentation tasks on 3D medical images using our proposed method, implementing it via the official codebase (https://github.com/microsoft/LoRA). Our implementation introduces low-rank matrices into attention and convolutional layers to approximate weight updates while freezing the original parameters during fine-tuning. As demonstrated in Table 13 and Table 14, LoRA achieves performance comparable to full fine-tuning while significantly reducing the number of updated parameters, consistent with previous findings [74], [75]. We observed that increasing the rank $r$ improves performance, approaching that of full fine-tuning, though at the cost of increased training time and memory usage. These results show that our method can effectively leverage LoRA [74] during fine-tuning, achieving a good balance between computational resources and performance.
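For reference, below is a from-scratch PyTorch sketch of a LoRA-wrapped linear layer; our experiments rely on the official microsoft/LoRA codebase, so the class LoRALinear and its parameters here are illustrative only.

import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update (W + B @ A)."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pre-trained weights
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.empty(r, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.scaling = alpha / r              # rescale the low-rank update

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.t() @ self.lora_B.t()) * self.scaling

Training only such low-rank factors while keeping the pre-trained weights frozen is what yields the small trainable-parameter ratios reported in Tables 13 and 14.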

Table 13: Comparison of different LoRA fine-tuning strategies applied to our method on the MM-WHS dataset [60]. The Update Param. column shows the ratio between LoRA’s trainable parameters and the total trainable parameters, excluding the segmentation decoder which was added later for downstream tasks.
Fine-tuning strategies Update Param. (%) Memory Usage (GB) Segmentation
Train Inference DSC(%) NSD(%)
Full fine-tuning (default) 100 16.74 0.81 91.12 96.19
LoRA [74] (r=16) 10.51 17.03 0.82 91.09 96.72
LoRA [74] (r=8) 5.55 16.72 0.81 90.97 96.58
LoRA [74] (r=4) 2.85 16.58 0.81 91.01 96.78
Table 14: Comparison of different LoRA fine-tuning strategies applied to our method on the CC-CCII dataset [64]. The Update Param. column shows the ratio between LoRA’s trainable parameters and the total trainable parameters, excluding the classifier which was added later for classification downstream tasks.
Fine-tuning strategies Update Param. (%) Memory Usage (GB) Classification
Train Inference ACC(%) AUC
Full fine-tuning (default) 100 14.52 0.20 94.26 99.69
LoRA [74] (r=16) 11.17 17.18 0.21 94.12 98.99
LoRA [74] (r=8) 8.10 16.46 0.21 93.78 98.74
LoRA [74] (r=4) 5.37 16.05 0.21 93.95 98.78

5 Conclusion

In this paper, we introduce the Mask in Mask (MiM) pre-training framework, which significantly advances 3D medical image analysis. By incorporating hierarchical designs, i.e., multi-level reconstruction and cross-level alignment, MiM efficiently encodes multi-granularity visual cues of structure and detail into the learned representation. To facilitate a fair and comprehensive comparison with existing methods, we collected ten public datasets and curated pre-training datasets at two scales, i.e., 1k and 10k volumes. The results reveal that the hierarchical design of the MiM framework is crucial for achieving superior performance on 3D medical images. We further explored scaling up the pre-training dataset to 10k volumes, and the results show that the performance of MiM can be further improved by doing so. This finding emphasizes the importance of large-scale pre-training for building foundation models for 3D medical images.

Several promising directions can be explored upon this work: (1) building a larger-scale pre-training dataset, e.g., more than 100k volumes; (2) exploring more downstream tasks, e.g., 3D medical image registration; (3) exploring cooperation with other modalities, e.g., language, for multi-modal pre-training; and (4) exploring learnable level embeddings instead of hard-coding the number of levels. Among these, (1) is the cornerstone of a foundation model for 3D medical images, (2) would make the evaluation more comprehensive, (3) would complement the 3D medical foundation model with information from other modalities, and (4) would further improve the method's design.

References

  • [1] N. Tajbakhsh, L. Jeyaseelan, Q. Li, J. N. Chiang, Z. Wu, and X. Ding, “Embracing imperfect datasets: A review of deep learning solutions for medical image segmentation,” MIA, vol. 63, p. 101693, 2020.
  • [2] Y. He, F. Huang, X. Jiang, Y. Nie, M. Wang, J. Wang, and H. Chen, “Foundation model for advancing healthcare: Challenges, opportunities, and future directions,” arXiv, 2024.
  • [3] P. R. Bassi, W. Li, Y. Tang, F. Isensee, Z. Wang, J. Chen, Y.-C. Chou, Y. Kirchhoff, M. Rokuss, Z. Huang et al., “Touchstone benchmark: Are we on the right way for evaluating ai algorithms for medical segmentation?” arXiv, 2024.
  • [4] C. Jin, Z. Guo, Y. Lin, L. Luo, and H. Chen, “Label-efficient deep learning in medical image analysis: Challenges and future directions,” arXiv, 2023.
  • [5] Y. Yang, R. Wang, T. Zhang, and J. Su, “Semi-supervised medical image segmentation via feature-perturbed consistency,” in BIBM.   IEEE, 2023, pp. 1635–1642.
  • [6] J.-X. Zhuang, J. Cai, J. Zhang, W.-s. Zheng, and R. Wang, “Class attention to regions of lesion for imbalanced medical image recognition,” Neurocomputing, vol. 555, p. 126577, 2023.
  • [7] Y. Jiang et al., “Anatomical invariance modeling and semantic alignment for self-supervised learning in 3d medical image analysis,” in ICCV, 2023, pp. 15 859–15 869.
  • [8] Y. He, G. Yang, R. Ge, Y. Chen, J.-L. Coatrieux, B. Wang, and S. Li, “Geometric visual similarity learning in 3d medical image self-supervised pre-training,” in CVPR, 2023, pp. 9538–9547.
  • [9] Y. Tang, D. Yang, W. Li, H. R. Roth, B. Landman, D. Xu, V. Nath, and A. Hatamizadeh, “Self-supervised pre-training of swin transformers for 3d medical image analysis,” in CVPR, 2022, pp. 20 730–20 740.
  • [10] L. Wu, J. Zhuang, and H. Chen, “Large-scale 3d medical image pre-training with geometric context priors,” arXiv, 2024.
  • [11] Wu, Linshan and Zhuang, Jiaxin and Chen, Hao, “Voco: A simple-yet-effective volume contrastive learning framework for 3d medical image analysis,” in CVPR, 2024.
  • [12] X. Ni, L. Wu, J. Zhuang, Q. Wang, M. Wu, V. Vardhanabhuti, L. Zhang, H. Gao, and H. Chen, “Mg-3d: Multi-grained knowledge-enhanced 3d medical vision-language pre-training,” arXiv, 2024.
  • [13] L. Wu, J. Zhuang, X. Ni, and H. Chen, “Freetumor: Advance tumor segmentation via large-scale tumor synthesis,” arXiv, 2024.
  • [14] W. Li, L. Luo, and H. Chen, “Bootstrapping radiography pre-training via siamese masked vision-language modeling with complementary self-distillation,” in BIBM.   IEEE, 2024.
  • [15] J. Ma, Y. Zhang, S. Gu, C. Ge, E. Wang, Q. Zhou, Z. Huang, P. Lyu, J. He, and B. Wang, “Automatic organ and pan-cancer segmentation in abdomen ct: the flare 2023 challenge,” arXiv, 2024.
  • [16] Y. Xu, Y. Wang, F. Zhou, J. Ma, S. Yang, H. Lin, X. Wang, J. Wang, L. Liang, A. Han et al., “A multimodal knowledge-enhanced whole-slide pathology foundation model,” arXiv, 2024.
  • [17] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in CVPR, 2022, pp. 16 000–16 009.
  • [18] K. Chen, Z. Liu, L. Hong, H. Xu, Z. Li, and D.-Y. Yeung, “Mixed autoencoder for self-supervised visual representation learning,” in CVPR, 2023, pp. 22 742–22 751.
  • [19] Z. Chen, D. Agarwal, K. Aggarwal, W. Safta, M. M. Balan, and K. Brown, “Masked image modeling advances 3d medical image analysis,” in WACV, 2023, pp. 1970–1980.
  • [20] J.-X. Zhuang, L. Luo, and H. Chen, “Advancing volumetric medical image segmentation via global-local masked autoencoder,” arXiv, 2023.
  • [21] A. Hatamizadeh, V. Nath, Y. Tang, D. Yang, H. R. Roth, and D. Xu, “Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images,” in MICCAIW.   Springer, 2021, pp. 272–284.
  • [22] Y. Wang et al., “Swinmm: masked multi-view with swin transformers for 3d medical image segmentation,” in MICCAI, 2023, pp. 486–496.
  • [23] K. Yan, J. Cai, D. Jin, S. Miao, D. Guo, A. P. Harrison, Y. Tang, J. Xiao, J. Lu, and L. Lu, “SAM: Self-Supervised Learning of Pixel-Wise Anatomical Embeddings in Radiological Images,” TMI, vol. 41, pp. 2658–2669, 2022.
  • [24] R. J. Chen, C. Chen, Y. Li, T. Y. Chen, A. D. Trister, R. G. Krishnan, and F. Mahmood, “Scaling vision transformers to gigapixel images via hierarchical self-supervised learning,” in CVPR, 2022.
  • [25] P. Gao, T. Ma, H. Li, Z. Lin, J. Dai, and Y. Qiao, “MCMAE: Masked Convolution Meets Masked Autoencoders,” in NeurIPS, vol. 35, 2022, pp. 35 632–35 644.
  • [26] Y. Du, Q. Zhai, W. Dai, and X. Li, “Teach clip to develop a number sense for ordinal regression,” ECCV, 2024.
  • [27] Y. Gao, J.-X. Zhuang, S. Lin, H. Cheng, X. Sun, K. Li, and C. Shen, “Disco: Remedying self-supervised learning on lightweight models with distilled contrastive learning,” in ECCV, 2022, pp. 237–253.
  • [28] J. Gui, T. Chen, J. Zhang, Q. Cao, Z. Sun, H. Luo, and D. Tao, “A survey on self-supervised learning: Algorithms, applications, and future trends,” IEEE T-PAMI, 2024.
  • [29] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in ICML, 2020, pp. 1597–1607.
  • [30] X. Chen, H. Fan, R. Girshick, and K. He, “Improved baselines with momentum contrastive learning,” in CVPR, 2020, pp. 9729–9738.
  • [31] X. Chen, S. Xie, and K. He, “An empirical study of training self-supervised visual transformers,” arXiv, 2021.
  • [32] H.-Y. Zhou, C. Lu, S. Yang, X. Han, and Y. Yu, “Preservational learning improves self-supervised medical image models by reconstructing diverse contexts,” in ICCV, 2021, pp. 3499–3509.
  • [33] Y. Ye, J. Zhang, Z. Chen, and Y. Xia, “Desd: Self-supervised learning with deep self-distillation for 3d medical image segmentation,” in MICCAI.   Springer, 2022, pp. 545–555.
  • [34] A. Taleb, W. Loetzsch, N. Danz, J. Severin, T. Gaertner, B. Bergner, and C. Lippert, “3d self-supervised methods for medical imaging,” NeurIPS, vol. 33, pp. 18 158–18 172, 2020.
  • [35] Y. Xie, J. Zhang, Z. Liao, Y. Xia, and C. Shen, “Pgl: prior-guided local self-supervised learning for 3d medical image segmentation,” arXiv, 2020.
  • [36] J. Zhu, Y. Li, Y. Hu, K. Ma, S. K. Zhou, and Y. Zheng, “Rubik’s cube+: A self-supervised feature learning framework for 3d medical image analysis,” MIA, vol. 64, p. 101746, 2020.
  • [37] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI, 2015, pp. 234–241.
  • [38] Z. Zhou, V. Sodha, J. Pang, M. B. Gotway, and J. Liang, “Models genesis,” MIA, vol. 67, p. 101840, 2021.
  • [39] Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu, “SimMIM: A simple framework for masked image modeling,” in CVPR, 2022, pp. 9653–9663.
  • [40] C. Prabhakar, H. Li, J. Yang, S. Shit, B. Wiestler, and B. Menze, “Vit-ae++: improving vision transformer autoencoder for self-supervised medical image representations,” in MIDL.   PMLR, 2024, pp. 666–679.
  • [41] H. Wang, K. Song, J. Fan, Y. Wang, J. Xie, and Z. Zhang, “Hard patches mining for masked image modeling,” in CVPR, 2023, pp. 10 375–10 385.
  • [42] C. Zhang, H. Zheng, and Y. Gu, “Dive into the details of self-supervised learning for medical image analysis,” MIA, vol. 89, p. 102879, 2023.
  • [43] W. Wang, J. Wang, C. Chen, J. Jiao, Y. Cai, S. Song, and J. Li, “Fremim: Fourier transform meets masked image modeling for medical image segmentation,” in WACV, 2024, pp. 7860–7870.
  • [44] J. Xiao, Y. Bai, A. Yuille, and Z. Zhou, “Delving into masked autoencoders for multi-label thorax disease classification,” in WACV, 2023, pp. 3588–3600.
  • [45] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in CVPR, 2021, pp. 10 012–10 022.
  • [46] H. Wang, Y. Tang, Y. Wang, J. Guo, Z.-H. Deng, and K. Han, “Masked image modeling with local multi-scale reconstruction,” in CVPR, 2023, pp. 2122–2131.
  • [47] M. R. Hosseinzadeh Taher, M. B. Gotway, and J. Liang, “Towards foundation models learned from anatomy in medical imaging via self-supervision,” in MICCAI, 2023, pp. 94–104.
  • [48] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
  • [49] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in ICCV, 2021, pp. 9650–9660.
  • [50] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” ICLR, 2020.
  • [51] Y. Xie, J. Zhang, Y. Xia, and Q. Wu, “Unimiss: Universal medical self-supervised learning via breaking dimensionality barrier,” in ECCV, 2022, pp. 558–575.
  • [52] H.-Y. Zhou, C. Lu, C. Chen, S. Yang, and Y. Yu, “A unified visual information preservation framework for self-supervised pre-training in medical image analysis,” IEEE T-PAMI, 2023.
  • [53] B. Landman, Z. Xu, J. Igelsias, M. Styner, T. Langerak, and A. Klein, “Miccai multi-atlas labeling beyond the cranial vault–workshop and challenge,” in MICCAIW, vol. 5, 2015, p. 12.
  • [54] K. Clark et al., “The cancer imaging archive (tcia): maintaining and operating a public information repository,” Journal of digital imaging, vol. 26, pp. 1045–1057, 2013.
  • [55] S. A. Harmon et al., “Artificial intelligence for the detection of covid-19 pneumonia on chest ct using multinational datasets,” Nat. Commun., vol. 11, no. 1, p. 4080, 2020.
  • [56] M.-P. Revel et al., “Study of thoracic ct in covid-19: the stoic project,” Radiology, vol. 301, no. 1, pp. E361–E370, 2021.
  • [57] S. G. Armato III et al., “The lung image database consortium and image database resource initiative: a completed reference database of lung nodules on ct scans,” Med. Phys., vol. 38, no. 2, pp. 915–931, 2011.
  • [58] A. J. Grossberg et al., “Imaging and clinical data archive for head and neck squamous cell carcinoma patients treated with radiotherapy,” Sci. Data, vol. 5, no. 1, pp. 1–10, 2018.
  • [59] J. Wasserthal, M. Meyer, H.-C. Breit, J. Cyriac, S. Yang, and M. Segeroth, “Totalsegmentator: robust segmentation of 104 anatomical structures in ct images,” Radiology: Artificial Intelligence, 2022.
  • [60] X. Zhuang, “Multivariate mixture model for myocardial segmentation combining multi-source images,” IEEE T-PAMI, vol. 41, no. 12, pp. 2933–2946, 2018.
  • [61] A. L. Simpson et al., “A large annotated medical image dataset for the development and evaluation of segmentation algorithms,” arXiv, 2019.
  • [62] J. Ma et al., “Unleashing the strengths of unlabeled data in pan-cancer abdominal organ quantification: the flare22 challenge,” arXiv, 2023.
  • [63] Y. Ji et al., “Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation,” NeurIPS, 2022.
  • [64] K. Zhang et al., “Clinically applicable ai system for accurate diagnosis, quantitative measurements, and prognosis of covid-19 pneumonia using computed tomography,” Cell, vol. 181, no. 6, pp. 1423–1433, 2020.
  • [65] A. Hatamizadeh, Y. Tang, V. Nath, D. Yang, A. Myronenko, B. Landman, H. R. Roth, and D. Xu, “UNETR: Transformers for 3d medical image segmentation,” in WACV, 2022, pp. 574–584.
  • [66] X. Chen, S. Xie, and K. He, “An empirical study of training self-supervised vision transformers,” in ICCV, 2021, pp. 9620–9629.
  • [67] H. Fan, B. Xiong, K. Mangalam, Y. Li, Z. Yan, J. Malik, and C. Feichtenhofer, “Multiscale vision transformers,” in ICCV, 2021, pp. 6824–6835.
  • [68] H. Du, J. Wang, M. Liu, Y. Wang, and E. Meijering, “Swinpa-net: Swin transformer-based multiscale feature pyramid aggregation network for medical image segmentation,” TNNLS, vol. 35, no. 4, pp. 5355–5366, 2022.
  • [69] J. Liu et al., “Clip-driven universal model for organ segmentation and tumor detection,” in ICCV, 2023, pp. 21 152–21 164.
  • [70] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in CVPR, 2020, pp. 9729–9738.
  • [71] J. Kaplan et al., “Scaling laws for neural language models,” arXiv, 2020.
  • [72] Q. Lyu and G. Wang, “Conversion between ct and mri images using diffusion and score-matching models,” arXiv, 2022.
  • [73] J.-B. Grill et al., “Bootstrap your own latent-a new approach to self-supervised learning,” NeurIPS, vol. 33, pp. 21 271–21 284, 2020.
  • [74] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” ICLR, 2021.
  • [75] Y.-L. Sung, J. Cho, and M. Bansal, “Lst: Ladder side-tuning for parameter and memory efficient transfer learning,” NeurIPS, vol. 35, pp. 12 991–13 005, 2022.