Dual-channel Prototype Network for few-shot Classification of Pathological Images

Hao Quan, Xinjia Li, Dayu Hu, Tianhang Nan and Xiaoyu Cui This paper was submitted for review on November 12, 2023. This work was supported in part by the China Key Research and Development Program (Grant No. 2023YFC2508200), the Natural Science Foundation of Liaoning Province (Grant No. 2022-MS-105), Ningbo Science and Technology Bureau (Grant No. 2021Z027), the Fundamental Research Funds for the Central Universities (Grant No. N2219001), and the Liaoning Province Medical Engineering Cross Joint Fund (Grant No. 2022-YGJC-76).Hao Quan, Xinjia Li, Tianhang Nan and Xiaoyu Cui was with the College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, China (e-mail: h.quan@siat.ac.cn, xj.li6@siat.ac.cn, nth15833133507@outlook.com, cuixy@bmie.neu.edu.cn).Dayu Hu was with the School of Computer, National University of Defense Technology, No. 109 Deya Road, 410073, Changsha, Hunan, China (e-mail: hzauhdy@gmail.com ).

Abstract

In pathology, the rarity of certain diseases and the complexity in annotating pathological images significantly hinder the creation of extensive, high-quality datasets. This limitation impedes the progress of deep learning-assisted diagnostic systems in pathology. Consequently, it becomes imperative to devise a technology that can discern new disease categories from a minimal number of annotated examples. Such a technology would substantially advance deep learning models for rare diseases. Addressing this need, we introduce the Dual-channel Prototype Network (DCPN), rooted in the few-shot learning paradigm, to tackle the challenge of classifying pathological images with limited samples. DCPN augments the Pyramid Vision Transformer (PVT) framework for few-shot classification via self-supervised learning and integrates it with convolutional neural networks. This combination forms a dual-channel architecture that extracts multi-scale, highly precise pathological features. The approach enhances the versatility of prototype representations and elevates the efficacy of prototype networks in few-shot pathological image classification tasks. We evaluated DCPN using three publicly available pathological datasets, configuring small-sample classification tasks that mirror varying degrees of clinical scenario domain shifts. Our experimental findings robustly affirm DCPN’s superiority in few-shot pathological image classification, particularly in tasks within the same domain, where it achieves the benchmarks of supervised learning.

Index Terms:

Few-shot learning, Self-supervised learning, Transformer, Pathology image, Classification.

I Introduction

Histopathological slides provide crucial feature information for the accurate diagnosis of cancer [1]. Traditionally, pathologists identify tumor characteristics by directly observing tissue slides under a microscope, a process that is not only time-consuming but also susceptible to individual subjectivity [2]. Considering the complexity of classifying over a hundred types of cancerous tissues, there is an urgent need for more efficient and precise computer-assisted diagnostic methods. The emergence of digital pathology has promoted significant developments in computational pathology [3, 4], where deep learning, particularly deep neural networks, has demonstrated remarkable potential in medical image analysis. However, the availability of large-scale labeled data is a key pillar for the superior performance of deep learning methods [5]. This stands in contrast to the real-world clinical setting where high-quality, labeled medical imaging data are challenging to obtain [6].

Few-shot learning (FSL) is dedicated to addressing the challenge of data scarcity and has made significant progress in the field of natural image analysis [7, 8, 9]. In few-shot classification tasks, the essence is that classifiers usually learn from a limited number of samples and handle new classes not seen during the training process. Data augmentation-based FSL methods, primarily through Generative Adversarial Networks (GANs) [10] and mixup data augmentation [11], are used to acquire additional training samples to tackle the issue of insufficient data. However, the data generated by these methods may exhibit distributional biases and may not accurately reflect the true data distribution, potentially reducing the model’s generalization ability. Meta-learning approaches, such as Model-Agnostic Meta-Learning (MAML) [8], focus on model initialization to facilitate rapid adaptation to new tasks after a few gradient updates. However, its training process involves complex second-order gradient calculations and storage of multiple gradient states, leading to high computational and memory costs. Metric learning approaches, such as Siamese Networks [12], Matching Networks [13], and Prototypical Networks [14], enhance generalizability by optimizing distance metrics between samples and have advantages in computational efficiency and training stability compared to data augmentation and meta-learning. These methods are particularly effective in the field of medical image analysis because they can capture subtle visual differences and adapt to common issues of class imbalance and sample scarcity.

In medical image analysis, FSL methods are widely applied to the classification and segmentation tasks of CT and MRI images [15, 16, 17, 18], but there is less research on pathological images. The work used for the study of pathological images mainly focuses on metric learning-based Siamese Networks and Prototypical Networks. Siamese Networks achieve classification by measuring the similarity between two inputs through multiple sub-networks with the same structure and shared parameters. Medela et al. [19] constructed a Siamese Network with three backbones for the classification of colon, lung, and breast tissue pathological images, demonstrating that Siamese Networks are significantly superior to model fine-tuning when data is insufficient. Prototypical Networks [14], on the other hand, use an embedding net to cluster the feature embeddings of samples around the prototype representation of their respective categories. Shaikh et al. [20] used Prototypical Networks to filter pathological images containing artifacts, optimizing the analysis of downstream tasks. Deuschel et al. [21] expanded the single prototype of Prototypical Networks to multiple prototypes, an improvement that extended the feature space of prototype representations, thereby enhancing the network’s rapid adaptability to new data. Expanding the feature space of prototype representations is considered key to improving the generalizability and achieving superior performance of Prototypical Networks.

Currently, Convolutional Neural Networks (CNNs) are widely used as the embedding net for Prototypical Networks. CNNs struggle to encode long-range dependencies within images and lack global modeling capabilities, which limits the generalization of prototype representations [22]. In contrast, the Vision Transformer (ViT) [23] has demonstrated significant advantages in global feature modeling for visual recognition tasks, promising to further improve the generalizability of prototype representations. Gan et al. [24] constructed global features by encoding the correlations between context features within the support set through transformer blocks. However, this approach only uses transformer blocks to integrate local features extracted by CNNs and does not directly extract global features from input images. Indeed, due to the data-hungry nature of ViTs, it is challenging to directly apply them to FSL methods that aim to solve the problem of sample insufficiency [25]. How to extend the ViT architecture to few-shot classification tasks based on Prototypical Networks and improve prototype representations remains a question worth exploring.

Based on the aforementioned analysis, this study constructs a Dual-channel Prototype Network (DCPN) that integrates Transformers and CNNs, successfully extending the Transformer model to few-shot classification tasks. First, we employ a self-supervised learning strategy to pre-train the Transformer model, enabling it to learn discriminative feature representations on a large-scale unlabeled pathological image dataset to enhance its generalizability and reduce the likelihood of overfitting. Next, by combining it with CNNs, which have an inherent inductive bias, the model can effectively capture local features of images. Subsequently, we merge the features encoded by the Transformer and CNN to output multi-scale features, thereby improving the generalizability of the prototype representation. Finally, we employ a soft voting strategy to construct a robust classifier [26]. Furthermore, we introduce a regularization strategy to further mitigate the overfitting problem of Transformers in FSL [27]. In summary, the main contributions of this paper are as follows:

1. We proposed a Dual-channel Prototype Network (DCPN) suitable for few-shot classification tasks of pathological images, achieving state-of-the-art (SOTA) classification performance.

2. By employing self-supervised learning, we applied the Transformer model to few-shot classification tasks of pathological images, with the potential to be extended to other label-hungry problems.

3. We validated the effectiveness of the proposed method through the design of three small-sample pathological image classification tasks with varying degrees of domain shift.

II Problem Definition

The long-tail issue of diseases results in the majority of illnesses having only a limited amount of data, with a lack of large-scale annotated datasets. This constrains the application of traditional supervised deep learning in medical imaging. Additionally, the emergent nature of diseases demands that deep learning models possess flexible scalability. Consequently, we define the pathological image classification task using the FSL paradigm. Generally, FSL refers to the algorithmic approach where a model is trained on a base dataset $D_{\text{base}}=\{(x_{i},y_{i})\}_{i=1}^{N_{\text{base}}}$ to develop discriminative capabilities, which are then generalized to recognize new classes with only a few labeled samples $D_{\text{novel}}=\{(x_{t},y_{t})\}_{t=1}^{N_{\text{novel}}}$ , where the label spaces $y_{i}\cap y_{t}=\emptyset$ . We attempt to conceptualize FSL as an N-way K-shot meta-learning framework, employing an episodic approach to accomplish FSL for pathological images. Meta-learning is divided into meta-training and meta-testing phases, each requiring the construction of numerous meta-tasks, with Figure 1 providing an example of a 5-way 1-shot meta-learning scenario. Meta-tasks consist of a support set $S$ and a query set $Q$ , denoted as $T=\{(S_{j},Q_{j})\}_{j=1}^{N_{\text{task}}}$ , where $N_{\text{task}}$ is the number of meta-tasks sampled per epoch. The data for constructing meta-tasks in the meta-training and meta-testing phases are sampled from $D_{\text{base}}$ and $D_{\text{novel}}$ , respectively. If the meta-tasks for both meta-training and meta-testing are derived from $D_{\text{joint}}=D_{\text{base}}\cap D_{\text{novel}}$ , we refer to this type of FSL as generalized few-shot learning (GFSL).

Refer to caption — Figure 1: An example of meta-tasks within the meta-learning framework. Each row represents a meta-task composed of a support set and a query set, corresponding to a 5-way 1-shot pathological image few-shot classification task.

III Methodology

We propose a multi-scale prototype network approach for the FSL classification task of pathological images. Figure 2 illustrates the overall workflow of the method, which primarily encompasses three parts: 1. Self-supervised pre-training of the PVT (Pyramid Vision Transformer) network; 2. Construction of a dual-channel prototype network to extract multi-scale prototype representations; 3. Development of a robust classifier based on a soft voting strategy.

III-A Self-Supervised Pre-training of the PVT Model

The global feature modeling capability of Vision Transformers has been demonstrated to hold significant advantages in multiple sub-domains of computer vision. Among these, the Pyramid Vision Transformer (PVT) [28], as an improved version of ViT, aims to reduce the computational complexity of multi-head self-attention by introducing a pyramid structure, thereby capturing richer contextual features. Hence, in this paper, the PVT network is chosen as the encoder for global feature representation of pathological images.

The Masked Autoencoder (MAE) [29] exhibits outstanding performance across multiple visual tasks through its asymmetric encoder-decoder architecture. This achievement is partly attributed to the Vanilla Vision Transformer’s ability to extract global features. However, the PVT network typically introduces operations within local windows to reduce the quadratic complexity of global self-attention, with the possibility of complete loss of visual token information within these randomly constituted visual token sequences. To address this issue, our study introduces a uniform masking strategy [30], which improves upon the random masking method in MAE, enabling the effective application of the MAE architecture to the pre-training of PVT.

Specifically, Uniform masking includes two stages of masking: Uniform Sampling (US) and Secondary Masking (SM). As illustrated in Figure 3, the original input image is first partitioned into chunks; then, during the US phase, 25% of the patches are sampled from the image based on a uniform distribution strategy, ensuring that at least one patch is selected in every 2×2 grid across the entire image, resulting in a uniformly distributed set of sparse image blocks. Following this, SM further randomly masks a portion (about 25%) of the sampled areas without completely discarding the masked blocks. Instead, these are treated as a shared, learnable token sequence and combined with visible visual tokens as input to the encoder.

The design of the encoder, decoder, and reconstruction target follows the original setup of MAE. However, unlike the original MAE, the encoder here is the PVT, which contains four stages, each outputting feature maps of different resolutions with a total stride of 32. This pyramid-structured transformer encodes the input 16*16 patches into sub-pixel level latent representations. Therefore, before inputting into the decoder, a PixelShuffle operation is required to restore the resolution of the latent features. Finally, the decoder reconstructs the missing pixels of the original image and uses the Mean Squared Error (MSE) as the loss function to optimize the model, with the MSE formula as follows:

MSE=\frac{1}{N}\sum_{i=1}^{N}(Y_{i}-\hat{Y}_{i})^{2}

(1)

where $N$ is the total number of missing pixels in the original image, $Y_{i}$ is the true pixel value of the $i$ -th pixel, and $\hat{Y}_{i}$ is the predicted pixel value by the model. By minimizing the MSE loss, the model is trained to accurately reconstruct each pixel, thereby learning the global features of the image.

III-B Construction of Multi-scale Prototype Representations

The multi-scale prototype representation is achieved through a dual-channel encoder, as shown in Figure 4(a). We utilize a pretrained PVT to extract the global features of pathological images, while a CNN is employed to extract local features. Specifically, for a given input image $x$ :

Z_{G}=f_{\text{PVT}}(x)

(2)

Z_{L}=f_{\text{CNN}}(x)

(3)

Here, $Z_{G}$ and $Z_{L}$ represent the global and local feature embeddings of the input image $x$ , respectively, each with a dimension of $D$ . To further refine these features and reduce their dimensionality, we apply Principal Component Analysis (PCA) [31] to process the global and local feature embeddings, extracting a set of linearly independent feature embeddings and removing redundant information as much as possible. The processed features are then concatenated, resulting in a mixed semantic feature:

Z^{\prime}_{G},Z^{\prime}_{L}=\text{PCA}(Z_{G},Z_{L})

(4)

Z_{\text{Mix}}=\text{Concat}(Z^{\prime}_{G},Z^{\prime}_{L})

(5)

Here, $Z^{\prime}_{G}$ and $Z^{\prime}_{L}$ are the global and local features post-PCA dimensionality reduction, each with a dimension of $D/2$ . $Z_{\text{Mix}}$ is the mixed semantic feature obtained by concatenating $Z^{\prime}_{G}$ and $Z^{\prime}_{L}$ , with a dimension of $D$ . At this point, the input image $x$ can be represented as a collection of three different scale feature sets $\{Z_{G},Z_{L},Z_{\text{Mix}}\}$ .

Next, to construct the multi-scale prototype representation, we consider a support set $D_{\text{support}}$ containing $N$ categories, with each category $C_{N}$ comprising a set of images $\{x_{1},x_{2},...,x_{K}\}_{i=1,2,..,K}$ . Here, $K$ denotes the number of samples per category in the support set, i.e., K-shot. Each image is mapped to the multi-scale feature space, resulting in $\{Z^{i}_{G},Z^{i}_{L},Z^{i}_{\text{Mix}}\}_{i=1,2,..,K}$ . For each category $C_{N}$ , we compute its multi-scale prototype representation in the respective feature spaces, which is the mean of all image features in the support set:

MP_{C_{N}}=\{\frac{1}{K}\sum_{i=1}^{K}Z_{G}^{i},\frac{1}{K}\sum_{i=1}^{K}Z_{L}^{i},\frac{1}{K}\sum_{i=1}^{K}Z_{\text{Mix}}^{i}\}

(6)

Here, MP denotes the multi-scale prototype. Thus, each $C_{N}$ category is represented by a set of multi-scale prototype representations $MP_{C_{N}}=\{P_{G}^{C_{N}},P_{L}^{C_{N}},P_{\text{Mix}}^{C_{N}}\}$ . Ultimately, we obtain a multi-scale prototype representation matrix:

MP=\begin{bmatrix}P_{G}^{C_{1}},P_{L}^{C_{1}},P_{\text{Mix}}^{C_{1}}\\ P_{G}^{C_{2}},P_{L}^{C_{2}},P_{\text{Mix}}^{C_{2}}\\ \vdots\\ P_{G}^{C_{N}},P_{L}^{C_{N}},P_{\text{Mix}}^{C_{N}}\\ \end{bmatrix}_{N\times 3}

(7)

This matrix reflects the average features of each category sample in the support set across different scales, aiming to provide a rich and effective prototype representation for the task of pathological image classification in FSL scenarios.

III-C Multi-scale Feature Soft Voting Classifier

The soft voting strategy, a common classification method in ensemble learning [26, 32, 33], typically relies on voting based on the probability predictions of multiple classifiers. This approach fully considers the confidence of each classifier and can enhance the performance of the classifier. Unlike traditional soft voting methods, in this paper, we adopt a soft voting strategy from the perspective of multi-scale features on a single classifier to avoid the overfitting problem caused by training multiple classifiers, as illustrated in Figure 4(b).

Given a query set sample $x_{q}$ , we first extract a set of multi-scale feature sets $\{Z_{G}^{q},Z_{L}^{q},Z_{Mix}^{q}\}$ through a dual-channel encoder. To determine the category of $x_{q}$ , we calculate the Euclidean distance between the query sample and the prototypes of each category in the multi-scale prototype representation matrix at the corresponding scale, as follows:

\begin{bmatrix}d_{C_{N}}^{G}\\ d_{C_{N}}^{L}\\ d_{C_{N}}^{Mix}\\ \end{bmatrix}=\begin{bmatrix}\lVert Z_{G}^{q}-P_{G}^{C_{N}}\rVert_{2}\\ \lVert Z_{L}^{q}-P_{L}^{C_{N}}\rVert_{2}\\ \lVert Z_{Mix}^{q}-P_{Mix}^{C_{N}}\rVert_{2}\\ \end{bmatrix}

(8)

where $d_{C_{N}}^{G}$ , $d_{C_{N}}^{L}$ , and $d_{C_{N}}^{Mix}$ respectively represent the Euclidean distances between the global, local, and mixed features of the query sample $x_{q}$ and the prototype representation of the $C_{N}$ th category at the corresponding scale.

To convert the distances into confidences, we use a negative exponential function to prevent numerical instability caused by distances approaching zero. The confidence level for each scale is calculated through the negative exponential of the distance:

\alpha_{C_{N}}=\exp(-d_{C_{N}}^{G})+\exp(-d_{C_{N}}^{L})+\exp(-d_{C_{N}}^{Mix})

(9)

Subsequently, we normalize these confidences using the softmax function to calculate the probability distribution of the query sample $x_{q}$ belonging to each category:

p(C_{N}|x_{q})=\frac{\exp(\alpha_{C_{N}})}{\sum_{j=1}^{N}\exp(\alpha_{C_{j}})}

(10)

Finally, we select the category with the highest probability as the prediction result:

y=\arg\max_{C_{N}}p(C_{N}|x_{q})

(11)

where $y$ represents the predicted category. The optimization process for model training employs the following loss function:

\text{Loss}(x_{q},y_{q})=-\log(p(C_{N}|x_{q}))

(12)

where $y_{q}$ is the true label corresponding to $x_{q}$ .

IV Experiments

IV-A Datasets

Our study selected three public histological datasets for model evaluation, with details as follows:

•

CRCTP dataset [34]: This dataset includes 280,000 non-overlapping patches of size 150 $\times$ 150, extracted at a 20 $\times$ magnification from 20 H&E stained colorectal cancer Whole Slide Images (WSIs), encompassing seven different tissue types: Tumor, Stroma, Complex Stroma, Muscle, Debris, Inflammatory, and Benign. Within this dataset, 70% is used as the training set, and 30% as the test set.
•

NCTCRC dataset [35]: Comprising 107,180 non-overlapping patches of size 224 $\times$ 224, extracted from 86 H&E stained colorectal cancer WSIs, this dataset includes nine different tissue types: Adipose (ADI), background (BACK), debris (DEB), lymphocytes (LYM), mucus (MUC), smooth muscle (MUS), normal colon mucosa (NORM), cancer-associated stroma (STR), and colorectal adenocarcinoma epithelium (TUM). Here, 100,000 patches are designated for training, with the remaining 7,180 for testing. Although both NCTCRC and CRCTP originate from colorectal sources, they differ in disease categories and regional data sources.
•

LC25000 dataset [36]: This dataset contains 25,000 patches of size 768 $\times$ 768, with five different tissue types: colon adenocarcinomas, benign colonic tissues, lung adenocarcinomas, lung squamous cell carcinomas, and benign lung tissues, with 5,000 patches for each tissue type. In this dataset, 70% is allocated for training and 30% for testing.

The diversity in disease and source among these three datasets allows us to construct tasks with varying degrees of domain shift. Specifically, we define three different data realms based on the origin of new classes and base classes: when new and base classes come from the same organ and data source, it is considered same domain. When new and base classes originate from the same organ but different data sources, it is categorized as near domain. When new and base classes derive from not entirely the same organ and different data sources, it is defined as mixture domain. The specific task settings are as follows:

1.

Same-domain task: New and base classes both come from the CRCTP dataset, defining a 7-way n-shot task. In this task, meta-training and meta-testing are carried out on the training and testing sets of the CRCTP dataset, respectively.
2.

Near-domain task: Utilizing CRCTP as the base dataset for meta-training, and NCTCRC as the new class dataset for meta-testing, specifically a 5-way n-shot task.
3.

Mixture-domain task: Similar to the Near-domain task, CRCTP is used as the base dataset for meta-training, but LC25000 is chosen as the new class dataset for meta-testing, specifically a 5-way n-shot task. It is noteworthy that within LC25000, two classes are related to colorectal tissues, and three to pulmonary tissues, thereby defining it as a mixture-domain.

We employed the t-SNE method to reduce the datasets of near domain and mixture domain tasks to three-dimensional space for visualization, observing the data distribution under different tasks, as shown in Figure 5. To quantitatively assess the difficulty of tasks, we calculated the Euclidean distance between datasets. Specifically, the Euclidean distance between CRCTP and NCTCRC datasets in the near domain task is 0.20; while in the mixture domain task, the distance between CRCTP and LC25000 datasets is 1.01. This indicates that the task difficulty of the mixture domain is significantly higher than that of the near domain, consistent with our task design expectations.

IV-B Experimental Setup

During the pre-training phase, our study utilized the UM-MAE algorithm to pre-train a PVT as the encoder for global features on the training sets of the aforementioned three datasets, aiming to mitigate the difficulty of training on FSL tasks. The size of the input images was adjusted to 224 $\times$ 224, with a batch size set to 256 and an efficient batch size of 1024. Optimization was carried out using the AdamW optimizer with betas=(0.9, 0.95), a learning rate of $1\times 10^{-3}$ , and epochs set to 100. The remaining parameters were kept consistent with the original UM-MAE paper.

In the FSL phase, the size of the input images remained at 224 $\times$ 224, with a batch size of $N\times K$ (depending on the specific $N$ -Way $K$ -shot task), and optimization was performed using the Adam optimizer with a learning rate of $1\times 10^{-3}$ and epochs set to 100. For the evaluation phase, 1000 meta-tasks were randomly drawn for each task to assess the model, with each meta-task utilizing 15 samples as a query set and presenting accuracy as the evaluation criterion. All results are the mean of the outcomes from 1000 meta-tasks. To address the imbalance in the number of base and new classes, we followed the settings detailed in [37]. The evaluation metrics involved in this paper are as follows:

\text{Accuracy}=\frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}}

(13)

\text{Precision}=\frac{\text{TP}}{\text{TP}+\text{FP}}

(14)

\text{Recall}=\frac{\text{TP}}{\text{TP}+\text{FN}}

(15)

\text{F1 Score}=2\times\frac{\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}

(16)

where TP, TN, FP, FN respectively stand for True Positives, True Negatives, False Positives, False Negatives.

IV-C Comparison with Other SOTA Methods

Our experiment compared various SOTA FSL methods to validate the performance of the proposed approach. This included four classical FSL algorithms: Matching Networks [13], ProtoNet [14], MetaOpt [38], and MAML [8]; as well as three pathology-specific FSL algorithms: Latent Augmentation [39], Histology Siamese Network [19], and Multi-ProtoNets [21]. The experimental results are presented in Table I.

Firstly, in the task of few-shot classification of histological images, DCPN significantly outperformed other classical FSL methods in accuracy. For instance, in the same-domain 10-shot scenario, DCPN achieved an accuracy of 84.67%, which is at least a 6.28% improvement over Matching Networks at 74.69%, ProtoNet at 78.39%, MetaOpt at 74.21%, and MAML at 71.32%.

Compared to the three pathology-specific methods, DCPN still attained the highest accuracy or was on par in all three scenarios. For example, DCPN showed a 2.26% increase in accuracy over Latent Augmentation in the same-domain 10-shot scenario. The Latent Augmentation method primarily increases data diversity by introducing latent variables during training, allowing the model to better adapt to new, unseen classes. However, the randomly generated latent variables may introduce noise that can diminish the representation capability of feature embeddings, limiting further performance enhancements in FSL tasks. DCPN’s advantage was even more pronounced over the Histology Siamese Network, with an accuracy improvement of 5.24%. The Histology Siamese Network applies a deep Siamese neural network directly to the task of few-shot classification of pathological images, learning inherent characteristics that map to class distance on the source domain dataset. However, due to the lack of in-depth modifications tailored to the characteristics of pathological images, its actual effectiveness is limited. Multi-ProtoNets, constructing multiple prototype representations through k-means, achieved the best performance in the same-domain and mixture-domain 10-shot scenarios, but since it does not consider features at different scales, DCPN still achieved the best performance in most scenarios.

The DCPN algorithm, which utilizes multi-scale features, generally surpassed other methods and could even match the performance of supervised training with a full dataset (as shown in Table III). This indicates that using a dual-channel network for feature mining can improve prototype representations, and also proves the importance of multi-scale features for pathological image classification tasks. Furthermore, we observed a significant decrease in network classification accuracy with the increase in domain difficulty. This decline in model generalization capability is due to the growing disparity in data distribution between different domains, as demonstrated in Section 4.1.

TABLE I: Comparison of State-Of-The-Art Methods

Method	Backbone	Same domain			Near domain			Mixture domain
Method	Backbone	1-shot	5-shot	10-shot	1-shot	5-shot	10-shot	1-shot	5-shot	10-shot
Matching Networks [13]	ResNet50	62.47	72.85	74.69	60.98	71.26	72.08	50.41	55.74	57.82
ProtoNet [14]	ResNet50	65.35	76.85	78.39	61.81	72.75	73.98	53.91	59.45	60.96
MetaOpt [38]	ResNet50	63.32	70.08	74.21	59.2	67.83	70.74	52.82	54.81	56.96
MAML [8]	ResNet50	60.32	68.41	71.32	58.23	66.45	70.89	52.18	54.23	58.97
Latent Augmentation [39]	ResNet50	66.92	79.63	82.41	62.1	72.09	75.61	53.31	57.19	63.73
Histology Siamese Network [19]	ResNet50	63.21	75.82	79.43	61.85	71.46	74.31	52.02	55.42	61.36
Multi-ProtoNets [21]	ResNet50	67.91	80.45	85.12	62.98	73.42	76.67	52.87	58.45	65.21
DCPN	ResNet50 and PVT	68.72	82.02	84.67	63.78	75.65	77.43	54.59	59.8	65.01

IV-D Comparing the classification performance of different backbones

The experiment aimed to compare the classification performance of different backbones on a pathological image dataset to determine the optimal feature extraction module, thereby supporting the design of a dual-channel feature embedding network. The CRCTP dataset (base dataset) was used for the training and evaluation of the models. In this experiment, we assessed 11 different backbones, including 6 classic CNN models and 5 Transformer models. The overall experimental results are summarized in Table II. The findings indicate that within the CNN category, ResNet50 achieved the best performance, with its accuracy (acc), precision (pre), recall, F1 score (f1), and area under the curve (auc) being 85.87%, 86.85%, 86.39%, 86.62%, and 98.63%, respectively. In contrast, VGG showed the lowest performance among all the CNN models. As an early deep CNN model, VGG is simply composed of multiple convolutional and pooling layers stacked together, which results in a large number of parameters and, consequently, its relatively lower performance. Among the Transformer series, the Pyramid Transformer achieved the highest performance, with accuracy, precision, recall, F1 score, and AUC of 85.15%, 86.50%, 85.43%, 85.96%, and 98.60%, respectively. Nevertheless, overall, the performance of the Transformer series models was still slightly inferior to that of ResNet50. We speculate that this may be related to the size of the CRCTP dataset, as Transformer models typically adapt better to large-scale datasets, and may not easily demonstrate their potential advantages on relatively smaller datasets. Taking into account the above experimental results, we selected ResNet50 and the Pyramid Transformer (PVT-small) as the backbones for the dual-channel embedding network.

TABLE II: Comparison of Classification Performance with Different Backbones on the CRCTP Dataset

	Backbone	Acc	Pre	Recall	F1	AUC
CNN	VGG16 [40]	82.59	82.87	81.69	82.28	95.41
	VGG19 [40]	80.74	80.95	81.25	81.10	94.22
	DenseNet121 [41]	85.26	86.23	85.89	86.06	98.56
	EfficientNet-B0 [42]	84.28	82.68	85.21	83.93	97.11
	ResNet34 [43]	84.96	85.85	86.89	86.37	97.42
	ResNet50 [43]	85.87	86.85	86.39	86.62	98.63
Transformer	ViT-base [23]	84.74	85.74	85.41	85.57	98.52
	LeViT [44]	82.91	83.29	83.84	83.56	93.18
	BoTNet-50 [45]	84.87	85.09	83.92	84.50	97.96
	Swin-Tiny [46]	84.71	83.82	84.18	84.00	95.82
	PVT-small [28]	85.15	86.50	85.43	85.96	98.60

IV-E The Impact of Pretraining on the Feature Encoding Performance of PVT

This experiment compared the impact of pre-training on the performance of the PVT model in few-shot classification tasks across three scenarios: same domain, near domain, and mixture domain. Meta-training was completed on the CRCTP dataset for all three scenarios, with meta-testing respectively performed on the CRCTP, NCTCRC, and LC25000 datasets, thereby simulating the aforementioned scenarios. The experimental results are shown in Table III.

Initially, we solely utilized PVT as the encoder component of a prototype network. After self-supervised pre-training, PVT achieved significant performance improvements in few-shot classification tasks across all three scenarios, with an average increase of 28.54%. This notable enhancement is primarily attributed to self-supervised learning conducting specific masking prediction tasks on a large volume of unlabeled data, enabling the model to learn rich feature representations without the need for annotations. This unsupervised pre-training approach provided PVT with a saturated feature space to alleviate its data-hungry characteristics, thereby reducing the risk of overfitting in few-shot data training tasks and enabling the successful application of transformers to few-shot classification tasks.

Subsequently, we constructed a dual-channel prototype network DCPN by combining PVT with ResNet and further explored the effect of pre-training within this architecture. Self-supervised pre-training also resulted in a performance increase in DCPN, with an average improvement of 3.48%. More crucially, in the absence of pre-training, DCPN exhibited a significant advantage over the prototype network using PVT alone, with a performance enhancement of 28.69%. We believe this gain stems from the inherent inductive bias of CNNs: local spatial continuity. When dealing with high-resolution medical images, local features may contain critical information about diseases or other medical conditions. Thus, CNNs provide the model with a more robust and fine-grained representation of pathological images. Such representation can compensate for the shortcomings of PVT, effectively mitigating its data-hungry issue.

TABLE III: The Impact of Self-Supervised Pretraining on PVT Models in Few-Shot Classification Tasks

Method	Pretrained	Same domain			Near domain			Mixture domain
Method	Pretrained	1-shot	5-shot	10-shot	1-shot	5-shot	10-shot	1-shot	5-shot	10-shot
ProtoNet	N	36.76	42.67	45.12	31.37	40.42	41.82	30.01	34.71	39.18
ProtoNet	Y	63.21	76.85	78.39	60.08	71.22	74.96	53.84	58.96	61.85
DCPN	N	62.46	77.93	78.64	61.06	70.89	75.42	53.87	57.85	62.15
DCPN	Y	68.72	82.02	84.67	63.78	75.65	77.43	54.59	59.80	65.01

IV-F The Necessity of Multi-scale Features

The experiment aimed to compare the impact of features at different scales on the task of few-shot classification of pathological images, with the results summarized in Table IV. Here, ”global” and ”local” refer to features extracted by PVT and ResNet respectively, while ”mix” represents the concatenated features of global and local after dimensionality reduction through PCA. Furthermore, ”multi-scale” denotes the use of all three types of features in concert for the few-shot classification task of pathological images, as proposed in our DCPN architecture.

In the ”Same domain” scenario, the 1-shot, 5-shot, and 10-shot accuracy rates of ”local” features were 65.35%, 76.85%, and 78.39% respectively, which were notably superior to the accuracy rates of ”global” features, at 63.21%, 73.71%, and 75.89%. Further observations in the ”Near domain” and ”Mixture domain” scenarios revealed that ”local” features continued to exhibit better classification performance than ”global” features. These experimental results suggest that local features appear to have better discriminability in the few-shot classification tasks of pathological images, potentially because global features, which contain overarching information, may dilute the local characteristics beneficial for classification tasks in pathology, as demonstrated in Figure 6.

Additionally, examining the performance of ”mix” features, we find that by combining the principal components of local and global features, it outperforms the singular ”local” or ”global” features in all three scenarios. This confirms that integrating multiple scales of features can provide a more comprehensive image representation, thereby compensating for each other’s shortcomings. The feature visualization in Figure 6 also distinctly shows the differences and complementarity between PVT and ResNet features. The specific performance of ”multi-scale” features was the most outstanding, and we attribute this to several reasons: 1. The ”mix” feature, formed by the PCA-reduced concatenation of ”global” and ”local” features, may lose some discriminative features beneficial for classification; 2. The ”multi-scale” utilizes a soft voting strategy that can dynamically adjust the contribution of different scale features to the classification decision based on the confidence of the classifier outputs, thereby achieving optimal classification performance.

TABLE IV: The Impact of Multi-scale Features on the Performance of Few-shot Classification Tasks

Feature	Same domain			Near domain			Mixture domain
Feature	1-shot	5-shot	10-shot	1-shot	5-shot	10-shot	1-shot	5-shot	10-shot
Global	63.21	73.71	75.89	60.08	71.22	74.96	53.84	58.96	61.85
Local	65.35	76.85	78.39	61.81	72.75	73.98	53.91	59.45	60.96
Mix	66.73	79.96	81.37	62.49	74.26	76.13	54.08	59.71	63.64
Multi-scale	68.72	82.02	84.67	63.78	75.65	77.43	54.59	59.80	65.01

TABLE V: The Impact of Different Distance Metric Methods on the DCPN Model

Method	Same domain			Near domain			Mixture domain
Method	1-shot	5-shot	10-shot	1-shot	5-shot	10-shot	1-shot	5-shot	10-shot
Cosine similarity	64.43	75.58	80.42	58.21	71.09	73.98	51.52	57.39	62.07
Euclidean distance	68.72	82.02	84.67	63.78	75.65	77.43	54.59	59.80	65.01

IV-G The Impact of Distance Evaluation Methods on DCPN

This experiment aimed to compare whether cosine similarity or Euclidean distance is more suitable for assessing the distance between query samples and prototype features within prototype-based few-shot classification methods for pathological images. The results in Table V indicate that across all three scenarios, employing Euclidean distance as the metric method outperforms cosine similarity. Particularly in the ”Same domain” scenario, the 5-shot classification task using Euclidean distance achieved an accuracy of 82.02%, significantly higher than the 75.58% using cosine similarity. Moreover, in the 10-shot task in the ”Mixture domain” scenario, using Euclidean distance achieved an accuracy of 65.01%, compared to 62.07% with cosine similarity, showing a clear advantage. This conclusion is consistent with research findings in the field of natural images [14].

Cosine similarity measures the similarity between two feature vectors by calculating the cosine of the angle between them, primarily reflecting the difference in vector direction rather than scale. In contrast, Euclidean distance involves not only the directional difference of vectors but also their scale difference, meaning the absolute numerical discrepancy between feature vectors. This suggests that Euclidean distance may provide more information when evaluating the differences between samples and prototype representations. It is noteworthy that when Bregman divergence is used as the metric, the prototype network can be seen as a process of probabilistic distribution estimation over the support set. However, cosine distance is not a Bregman divergence and hence does not possess the corresponding properties. This may explain why, in this study, the performance of squared Euclidean distance (a type of Bregman divergence) exceeded that of cosine distance.

V Conclusion

In this study, we introduced a novel Dual-Channel Prototype Network (DCPN), aimed at addressing the classification problem of pathological images in few-shot learning scenarios. Given the inherent long-tail distribution characteristics of pathological images and the scarcity of annotated samples, our designed DCPN model integrates the Pyramid Vision Transformer (PVT) and Convolutional Neural Network (CNN), effectively mining multi-scale and fine-grained pathological features, which significantly enhances the generalizability of the prototype representation. Experimental results based on three public pathological datasets simulate real-world clinical problems and confirm the significant advantage of DCPN in few-shot pathological image classification tasks. Particularly in same-domain tasks, its performance can even be comparable to traditional supervised learning methods. In summary, our research provides an efficient and practical new method for few-shot classification of pathological images, laying a solid foundation for future clinical applications and research.

References

[1] Z. Zhang, P. Chen, M. McGough, F. Xing, C. Wang, M. Bui, Y. Xie, M. Sapkota, L. Cui, J. Dhillon et al., “Pathologist-level interpretable whole-slide cancer diagnosis with deep learning,” Nature Machine Intelligence, vol. 1, no. 5, pp. 236–245, 2019.
[2] G. Campanella, M. G. Hanna, L. Geneslaw, A. Miraflor, V. Werneck Krauss Silva, K. J. Busam, E. Brogi, V. E. Reuter, D. S. Klimstra, and T. J. Fuchs, “Clinical-grade computational pathology using weakly supervised deep learning on whole slide images,” Nature medicine, vol. 25, no. 8, pp. 1301–1309, 2019.
[3] J. Pallua, A. Brunner, B. Zelger, M. Schirmer, and J. Haybaeck, “The future of pathology is digital,” Pathology-Research and Practice, vol. 216, no. 9, p. 153040, 2020.
[4] M. Cui and D. Y. Zhang, “Artificial intelligence and computational pathology,” Laboratory Investigation, vol. 101, no. 4, pp. 412–422, 2021.
[5] J. Van der Laak, G. Litjens, and F. Ciompi, “Deep learning in histopathology: the path to the clinic,” Nature medicine, vol. 27, no. 5, pp. 775–784, 2021.
[6] H.-P. Chan, L. M. Hadjiiski, and R. K. Samala, “Computer-aided diagnosis in the era of deep learning,” Medical physics, vol. 47, no. 5, pp. e218–e227, 2020.
[7] A. Parnami and M. Lee, “Learning from few examples: A summary of approaches to few-shot learning,” arXiv preprint arXiv:2203.04291, 2022.
[8] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in International conference on machine learning. PMLR, 2017, pp. 1126–1135.
[9] Y. Gong, “Meta-learning with differentiable convex optimization,” EasyChair, Tech. Rep., 2023.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014.
[11] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” arXiv preprint arXiv:1710.09412, 2017.
[12] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature verification using a” siamese” time delay neural network,” Advances in neural information processing systems, vol. 6, 1993.
[13] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., “Matching networks for one shot learning,” Advances in neural information processing systems, vol. 29, 2016.
[14] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” Advances in neural information processing systems, vol. 30, 2017.
[15] G. Xiao, S. Tian, L. Yu, Z. Zhou, and X. Zeng, “Siamese few-shot network: a novel and efficient network for medical image segmentation,” Applied Intelligence, pp. 1–13, 2023.
[16] H. Jiang, M. Gao, H. Li, R. Jin, H. Miao, and J. Liu, “Multi-learner based deep meta-learning for few-shot medical image classification,” IEEE Journal of Biomedical and Health Informatics, vol. 27, no. 1, pp. 17–28, 2022.
[17] F.-M. Guo and Y. Fan, “Zero-shot and few-shot learning for lung cancer multi-label classification using vision transformer,” arXiv preprint arXiv:2205.15290, 2022.
[18] A. Cai, W. Hu, and J. Zheng, “Few-shot learning for medical image classification,” in International Conference on Artificial Neural Networks. Springer, 2020, pp. 441–452.
[19] A. Medela, A. Picon, C. L. Saratxaga, O. Belar, V. Cabezón, R. Cicchi, R. Bilbao, and B. Glover, “Few shot learning in histopathological images: reducing the need of labeled data on biological datasets,” in 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019). IEEE, 2019, pp. 1860–1864.
[20] N. N. Shaikh, K. Wasag, and Y. Nie, “Artifact identification in digital histopathology images using few-shot learning,” in 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI). IEEE, 2022, pp. 1–4.
[21] J. Deuschel, D. Firmbach, C. I. Geppert, M. Eckstein, A. Hartmann, V. Bruns, P. Kuritcyn, J. Dexl, D. Hartmann, D. Perrin et al., “Multi-prototype few-shot learning in histopathology,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 620–628.
[22] P. Song, J. Li, Z. An, H. Fan, and L. Fan, “Ctmfnet: Cnn and transformer multiscale fusion network of remote sensing urban scene imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–14, 2022.
[23] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[24] T. Gan, W. Li, Y. Lu, and Y. He, “Transformer-based few-shot learning for image classification,” in Artificial Intelligence for Communications and Networks: Third EAI International Conference, AICON 2021, Xining, China, October 23–24, 2021, Proceedings, Part I 3. Springer, 2021, pp. 68–74.
[25] D. Qi, “Improving few-shot learning with vision transformer,” in Journal of Physics: Conference Series, vol. 2253, no. 1. IOP Publishing, 2022, p. 012025.
[26] A. Manconi, G. Armano, M. Gnocchi, and L. Milanesi, “A soft-voting ensemble classifier for detecting patients affected by covid-19,” Applied Sciences, vol. 12, no. 15, p. 7554, 2022.
[27] B. Zhou, P. Wang, J. Wan, Y. Liang, and F. Wang, “Effective vision transformer training: A data-centric perspective,” arXiv preprint arXiv:2209.15006, 2022.
[28] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 568–578.
[29] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16 000–16 009.
[30] X. Li, W. Wang, L. Yang, and J. Yang, “Uniform masking: Enabling mae pre-training for pyramid-based vision transformers with locality,” arXiv preprint arXiv:2205.10063, 2022.
[31] H. Abdi and L. J. Williams, “Principal component analysis,” Wiley interdisciplinary reviews: computational statistics, vol. 2, no. 4, pp. 433–459, 2010.
[32] M. U. Salur and İ. Aydın, “A soft voting ensemble learning-based approach for multimodal sentiment analysis,” Neural Computing and Applications, vol. 34, no. 21, pp. 18 391–18 406, 2022.
[33] S. Karlos, G. Kostopoulos, and S. Kotsiantis, “A soft-voting ensemble based co-training scheme using static selection for binary classification problems,” Algorithms, vol. 13, no. 1, p. 26, 2020.
[34] S. Javed, A. Mahmood, M. M. Fraz, N. A. Koohbanani, K. Benes, Y.-W. Tsang, K. Hewitt, D. Epstein, D. Snead, and N. Rajpoot, “Cellular community detection for tissue phenotyping in colorectal cancer histology images,” Medical image analysis, vol. 63, p. 101696, 2020.
[35] J. N. Kather, N. Halama, and A. Marx, “100,000 histological images of human colorectal cancer and healthy tissue,” Zenodo10, vol. 5281, 2018.
[36] A. A. Borkowski, M. M. Bui, L. B. Thomas, C. P. Wilson, L. A. DeLand, and S. M. Mastorides, “Lung and colon cancer histopathological image dataset (lc25000),” arXiv preprint arXiv:1912.12142, 2019.
[37] F. Shakeri, M. Boudiaf, S. Mohammadi, I. Sheth, M. Havaei, I. B. Ayed, and S. E. Kahou, “Fhist: a benchmark for few-shot classification of histological images,” arXiv preprint arXiv:2206.00092, 2022.
[38] K. Lee, S. Maji, A. Ravichandran, and S. Soatto, “Meta-learning with differentiable convex optimization,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 10 657–10 665.
[39] J. Yang, H. Chen, J. Yan, X. Chen, and J. Yao, “Towards better understanding and better generalization of low-shot classification in histology images with contrastive learning,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=kQ2SOflIOVC
[40] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[41] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
[42] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in International conference on machine learning. PMLR, 2019, pp. 6105–6114.
[43] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[44] B. Graham, A. El-Nouby, H. Touvron, P. Stock, A. Joulin, H. Jégou, and M. Douze, “Levit: a vision transformer in convnet’s clothing for faster inference,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 12 259–12 269.
[45] A. Srinivas, T.-Y. Lin, N. Parmar, J. Shlens, P. Abbeel, and A. Vaswani, “Bottleneck transformers for visual recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 16 519–16 529.
[46] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10 012–10 022.