Department of Computer Science and Engineering
email: wkjeong@korea.ac.kr
DiffMix: Diffusion Model-based Data Synthesis for Nuclei Segmentation and Classification in Imbalanced Pathology Image Datasets
Abstract
Nuclei segmentation and classification is a significant process in pathology image analysis. Deep learning-based approaches have greatly improved the accuracy of this task. However, these approaches suffer from imbalanced nuclei data composition, which leads to lower classification performance on rare nuclei classes. In this paper, we propose a realistic data synthesis method using a diffusion model. We generate two types of virtual patches to enlarge the training data distribution: one balances the nuclei class variance, and the other increases the chance of observing diverse nuclei. We then use a semantic-label-conditioned diffusion model to generate realistic, high-quality image samples. We demonstrate the efficacy of our method with experiments on two imbalanced nuclei datasets, improving state-of-the-art networks. The experimental results suggest that the proposed method improves the classification of rare nuclei types while achieving superior segmentation and classification performance on imbalanced pathology nuclei datasets.
Keywords: Diffusion models · Data augmentation · Nuclei segmentation and classification
1 Introduction
In digital pathology, nuclei segmentation and classification are crucial tasks for the diagnosis of diseases. Because nuclei are diverse in nature (e.g., shape, size, and color) and large in number, nuclei analysis in whole slide images (WSIs) is a challenging task for which computerized processing has become a de facto standard [11]. With the advent of deep learning, many challenging problems in nuclei analysis, such as color inconsistency, overlapping nuclei, and clustered nuclei, are effectively handled via data-driven approaches [15, 8, 23, 6]. Some recent works tackle nuclei segmentation and classification simultaneously. For example, HoVer-Net [8] and SONNET [6] perform nuclei segmentation using a distance map to identify nucleus instances, and then assign a class to each of them. Although such deep learning-based algorithms have shown promising performance and overcome various challenges in nuclei analysis, data imbalance among nuclei types in the training data has become a major performance bottleneck [5].
Data augmentation [16, 2] can be an effective solution to compensate for data imbalance and to improve the generalization of deep neural networks (DNNs) by enlarging the learnable training distribution with virtual training data. Several previous works address the image classification task. Mixup [22] interpolates pairs of images and labels to generate virtual training data. CutOut [3] randomly masks out square regions of the input during training. CutMix [21] cuts out patches from original images and pastes them onto other training images. Recently, generative adversarial networks (GANs) [10, 7, 12, 18] have been actively studied for pathology data augmentation. However, training a GAN is challenging because of its instability and its need for hyper-parameter tuning [4]. Moreover, most of the previous works focus on nuclei segmentation only, without considering nuclei classification. More recently, Doan et al. [5] proposed a data regularization scheme that addresses the data imbalance problem in pathology images. The main idea is to cut the nuclei from a scarce-class image and paste them onto the nuclei from an abundant-class image. Since the source and target nuclei differ in size and shape, a distance-based blending scheme is proposed. This method reduces the data imbalance problem to some extent, but it considers only pixel values for blending, and unrealistic blending artifacts can be observed, which is its main limitation.
The main motivation for this work stems from recent advances in generative models. Recently, the denoising diffusion probabilistic model (DDPM) [9] has gained much attention because its performance surpasses that of conventional GANs, and it has been successfully adapted to conditional settings [13, 4, 20]. Among these works, we were specifically inspired by the semantic diffusion model (SDM) of Wang et al. [19], which can synthesize an image conditioned on a semantic label map. Since data augmentation for nuclei segmentation and classification requires accurate pairs of images and semantic label maps, we believe SDM fits the data augmentation scenario of our imbalanced nuclei data well, while allowing much more realistic pathology image generation than the pixel-blending or GAN-based prior work.

In this paper, we propose DiffMix, a novel data augmentation technique using a conditioned diffusion model for imbalanced pathology nuclei datasets. DiffMix consists of the following steps. First, we train SDM with semantic map guidance that consists of instance and class-type maps. Next, we build custom label maps by modifying the existing imbalanced label maps. We change nuclei labels and randomly shift the locations of nuclei masks so that the number of nuclei per class is balanced and the data distribution is expanded. Finally, we synthesize more diverse, semantically realistic, and well-balanced new pathology nuclei images using SDM conditioned on our custom label maps. The main contributions of our work are summarized as follows.
- We introduce a data augmentation framework for imbalanced pathology image datasets which can generate realistic samples using a semantic diffusion model conditioned on two types of custom label maps that enlarge the data distribution.
- Our experiments demonstrate that the optimal approach for data augmentation depends on the level of imbalance, with balancing sample numbers and enlarging the training data distribution being critical factors to consider.
2 Method
In this section, we describe the proposed method in detail. DiffMix works in several steps. First, we train SDM on the training data. Next, we prepare two types of custom label maps: balancing label maps, which contain many rare-class labels, and enlarging label maps, in which some nuclei are randomly shifted. Lastly, using the pretrained SDM and the custom label maps, we synthesize realistic data for training on imbalanced datasets. Before we dive into DiffMix, we start with a brief introduction to SDM. The overview of the proposed method is shown in Fig. 1.
2.1 Preliminaries
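SDM is a conditional denoising diffusion probabilistic model (CDPM) conditioned on semantic label maps. Following the notation of [19], we denote the image by $y_0$ and the semantic label map by $x$. Based on CDPM, SDM follows two fundamental diffusion processes, i.e., the forward and reverse processes. The reverse process is a Markov chain with Gaussian transitions. When the added noise is large enough, the reverse process is approximated by a random variable $y_T \sim \mathcal{N}(0, \mathbf{I})$, defined as follows:
$$p_\theta(y_{0:T} \mid x) = p(y_T) \prod_{t=1}^{T} p_\theta(y_{t-1} \mid y_t, x) \tag{1}$$
$$p_\theta(y_{t-1} \mid y_t, x) = \mathcal{N}\left(y_{t-1};\ \mu_\theta(y_t, x, t),\ \Sigma_\theta(y_t, x, t)\right) \tag{2}$$
The forward process implements Gaussian noise addition for $T$ timesteps based on a variance schedule $\beta_1, \dots, \beta_T$ as below:
$$q(y_t \mid y_{t-1}) = \mathcal{N}\left(y_t;\ \sqrt{1 - \beta_t}\, y_{t-1},\ \beta_t \mathbf{I}\right) \tag{3}$$
With $\alpha_t := 1 - \beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$, we can write the marginal distribution as follows,
$$q(y_t \mid y_0) = \mathcal{N}\left(y_t;\ \sqrt{\bar{\alpha}_t}\, y_0,\ (1 - \bar{\alpha}_t) \mathbf{I}\right) \tag{4}$$
The conditional DDPM is optimized to minimize the negative log-likelihood of the data for the given input and condition information. If noise in the data follows a Gaussian distribution with a diagonal covariance matrix $\Sigma_\theta(y_t, x, t) = \sigma_t \mathbf{I}$, denoising can be the optimization target by removing the noise assumed to be present in the data as follows,
$$\mathcal{L}_{t-1} = \mathbb{E}_{y_0, \epsilon}\left[\left\lVert \epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\, y_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ x,\ t\right) \right\rVert_2\right] \tag{5}$$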
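The closed-form marginal in Eq. (4) and the denoising objective in Eq. (5) can be illustrated with a minimal NumPy sketch. The linear variance schedule, the toy image, and the helper names `linear_beta_schedule`, `q_sample`, and `noise_prediction_loss` are assumptions made for illustration only; the loss is written as a mean squared error, a common practical choice rather than a setting confirmed by this paper.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Variance schedule beta_1..beta_T (linear spacing is an assumed choice)."""
    return np.linspace(beta_start, beta_end, num_steps)

def q_sample(y0, t, alpha_bar, rng):
    """Draw y_t ~ q(y_t | y_0) in closed form; t is a 1-based timestep."""
    eps = rng.standard_normal(y0.shape)
    a = alpha_bar[t - 1]
    y_t = np.sqrt(a) * y0 + np.sqrt(1.0 - a) * eps
    return y_t, eps

def noise_prediction_loss(eps, eps_pred):
    """Mean squared error between the true and the predicted noise."""
    return float(np.mean((eps - eps_pred) ** 2))

betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)          # cumulative product of alpha_t

rng = np.random.default_rng(0)
y0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))  # toy "image" scaled to [-1, 1]
y_t, eps = q_sample(y0, t=500, alpha_bar=alpha_bar, rng=rng)

# A perfect denoiser would return eps itself, which gives zero loss.
perfect_loss = noise_prediction_loss(eps, eps)
```

In a real training loop, `eps_pred` would come from the denoising network evaluated on `y_t`, the label map, and the timestep.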
2.2 Semantic Diffusion Model (SDM)
SDM is a U-Net-based network that estimates the noise from the noisy input image. Unlike other conditional DDPMs, the denoising network of SDM processes the semantic label map and the noisy input independently. While the noisy image is fed into the encoder, the semantic label map is injected into the decoder to fully leverage the semantic information [19]. SDM is trained in a manner similar to the improved DDPM [14], so that it not only predicts the noise involved, in order to reconstruct the input image, but also predicts variances to enhance the log-likelihood of the generated images. To improve sample quality, SDM uses classifier-free guidance at inference.
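SDM replaces the semantic label map $x$ with an empty (null) map $\emptyset$ in order to separate the noise estimated under the label map guidance, $\epsilon_\theta(y_t \mid x)$, from the noise estimated in the unconditioned case, $\epsilon_\theta(y_t \mid \emptyset)$, where $y_t$ is the noisy image at timestep $t$. This strategy allows for the inference of the gradient of the log probability, expressed as follows,
$$\epsilon_\theta(y_t \mid x) - \epsilon_\theta(y_t \mid \emptyset) \propto \nabla_{y_t} \log p(y_t \mid x) - \nabla_{y_t} \log p(y_t) \propto \nabla_{y_t} \log p(x \mid y_t) \tag{6}$$
In the sampling process, the disentangled component is increased by a guidance scale $s$ to improve the samples from conditional diffusion models, formulated as follows,
$$\hat{\epsilon}_\theta(y_t \mid x) = \epsilon_\theta(y_t \mid x) + s \cdot \left(\epsilon_\theta(y_t \mid x) - \epsilon_\theta(y_t \mid \emptyset)\right) \tag{7}$$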
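The guidance rule in Eq. (7) amounts to a single linear combination of two noise estimates. The sketch below assumes the two estimates are already available as arrays; the function name `guided_noise`, the toy values, and the scale of 1.5 are hypothetical and are not the settings used in this paper.

```python
import numpy as np

def guided_noise(eps_cond, eps_uncond, scale):
    """Classifier-free guidance: eps_cond + scale * (eps_cond - eps_uncond)."""
    return eps_cond + scale * (eps_cond - eps_uncond)

# Toy noise estimates standing in for the conditioned and null-label network outputs.
eps_cond = np.array([0.2, -0.4, 1.0])
eps_uncond = np.array([0.1, -0.1, 0.5])

eps_plain = guided_noise(eps_cond, eps_uncond, scale=0.0)   # reduces to eps_cond
eps_guided = guided_noise(eps_cond, eps_uncond, scale=1.5)  # hypothetical scale
```

With a scale of zero the conditioned estimate is used unchanged; larger scales push the sample further in the direction that distinguishes the conditioned estimate from the unconditioned one.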
2.3 Custom Semantic Label Maps Generation
Fig. 1 illustrates the process of creating custom label maps, derived from the label map of the original input image, to condition the semantic diffusion model so that it synthesizes the desired data. We prepared two types of custom semantic label maps to direct SDM toward data that improve our imbalanced datasets. First, balancing maps balance the number of nuclei among the different nuclei types. GradMix [5] increases the number of nuclei of the rarest type in a dataset by cutting, pasting, and smoothing both images and labels. In contrast, we used only its mixed labels in our experiments. Second, enlarging maps stretch the data distribution available for training. We randomly moved nuclei positions on the semantic maps, so that conditioning SDM on these unfamiliar semantic maps leads the diffusion model to generate patches that are as diverse as possible.
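The two kinds of label-map edits can be sketched on a toy instance map. This is a minimal illustration under stated assumptions, not the authors' implementation: the balancing step here simply relabels existing nuclei to the rare class, which is a simplified stand-in for the GradMix mixed labels, and the function names, the `max_shift` range, and the collision rule are all hypothetical.

```python
import numpy as np

def shift_nuclei(inst_map, type_map, max_shift, rng):
    """Enlarging map: randomly translate each nucleus, keeping a move only if it
    stays inside the patch and does not overlap another nucleus."""
    h, w = inst_map.shape
    new_inst, new_type = inst_map.copy(), type_map.copy()
    for inst_id in np.unique(inst_map):
        if inst_id == 0:                      # 0 is background
            continue
        ys, xs = np.nonzero(new_inst == inst_id)
        cls = new_type[ys[0], xs[0]]
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        ny, nx = ys + dy, xs + dx
        if ny.min() < 0 or nx.min() < 0 or ny.max() >= h or nx.max() >= w:
            continue                          # would leave the patch: keep as is
        new_inst[ys, xs] = 0                  # erase, then test the target area
        new_type[ys, xs] = 0
        if np.any(new_inst[ny, nx] != 0):
            ny, nx = ys, xs                   # collision: restore original place
        new_inst[ny, nx] = inst_id
        new_type[ny, nx] = cls
    return new_inst, new_type

def relabel_to_rare(inst_map, type_map, rare_cls, fraction, rng):
    """Balancing map (simplified): relabel a fraction of non-rare nuclei."""
    new_type = type_map.copy()
    ids = [i for i in np.unique(inst_map) if i != 0]
    candidates = [i for i in ids if type_map[inst_map == i][0] != rare_cls]
    n_change = int(round(fraction * len(candidates)))
    for i in rng.permutation(candidates)[:n_change]:
        new_type[inst_map == i] = rare_cls
    return new_type

rng = np.random.default_rng(0)
inst = np.zeros((16, 16), dtype=np.int32)     # instance ids
typ = np.zeros((16, 16), dtype=np.int32)      # class ids (2 = rare class here)
inst[2:5, 2:5], typ[2:5, 2:5] = 1, 1
inst[8:12, 3:6], typ[8:12, 3:6] = 2, 1
inst[6:9, 10:14], typ[6:9, 10:14] = 3, 2

enl_inst, enl_type = shift_nuclei(inst, typ, max_shift=2, rng=rng)
bal_type = relabel_to_rare(inst, typ, rare_cls=2, fraction=1.0, rng=rng)
```

Both functions leave nucleus shapes untouched, so the resulting maps remain valid conditioning inputs with the same number of instances.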
2.4 Image Synthesis
We synthesize the virtual data with the pretrained SDM conditioned on the custom semantic label maps. Fig. 1 depicts the data sampling process in SDM. To synthesize virtual images, we first add noise to the original input image up to a pre-defined timestep, so that each image is sampled from an existing but noised patch. We then feed the noised image and the custom semantic label map to the pretrained denoising network, the former to the encoder and the latter to the decoder, following [19]. While generating samples, SDM also uses the empty label to produce the unconditioned output. The semantic label map thus conditions SDM to synthesize image data that satisfies it, and we obtain patches that conform to the custom semantic label maps. We sample the new data with the DDIM [17] process, which reduces the number of sampling steps while retaining high sample quality.
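The sampling procedure described above (noise an existing patch to an intermediate step, then denoise it under a label map with DDIM) can be sketched as follows. This is an illustration under stated assumptions: `eps_model` is a hypothetical stand-in for the trained SDM denoiser, the linear schedule is assumed, timesteps are 0-based here, and interpreting the starting point as index 55 of a 100-step schedule is a guess rather than a confirmed setting. Classifier-free guidance is omitted for brevity.

```python
import numpy as np

def ddim_timesteps(num_train_steps, num_sample_steps):
    """Evenly spaced, ascending subsequence of 0-based training timesteps."""
    stride = num_train_steps // num_sample_steps
    return np.arange(0, num_train_steps, stride)

def ddim_step(y_t, eps_hat, a_t, a_prev):
    """Deterministic DDIM update (eta = 0) from cumulative alphas a_t to a_prev."""
    y0_hat = (y_t - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)
    return np.sqrt(a_prev) * y0_hat + np.sqrt(1.0 - a_prev) * eps_hat

def synthesize(y0, label_map, eps_model, alpha_bar, steps, start_idx, rng):
    """Noise y0 up to steps[start_idx], then denoise it under label_map."""
    t = steps[start_idx]
    eps = rng.standard_normal(y0.shape)
    y = np.sqrt(alpha_bar[t]) * y0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    for i in range(start_idx, -1, -1):
        t = steps[i]
        a_prev = alpha_bar[steps[i - 1]] if i > 0 else 1.0
        y = ddim_step(y, eps_model(y, label_map, t), alpha_bar[t], a_prev)
    return y

betas = np.linspace(1e-4, 2e-2, 1000)        # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)
steps = ddim_timesteps(1000, 100)

rng = np.random.default_rng(0)
y0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))  # toy patch
label = np.zeros((8, 8), dtype=np.int32)     # toy label map (unused by the oracle)

def oracle_eps(y_t, label_map, t):
    """Stand-in for the trained network: it 'knows' y0, so sampling returns y0."""
    return (y_t - np.sqrt(alpha_bar[t]) * y0) / np.sqrt(1.0 - alpha_bar[t])

out = synthesize(y0, label, oracle_eps, alpha_bar, steps, start_idx=55, rng=rng)
```

With the oracle denoiser the loop recovers the input patch, which checks the update rule; with a trained, label-conditioned network the output would instead follow the custom label map.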
3 Experiments
Table 1: Segmentation (Dice, AJI, DQ, SQ, PQ) and classification (Acc and per-type F1-scores) results on GLySAC and CoNSeP. GLySAC has no spindle-shaped class.
Dataset | Method | Dice | AJI | DQ | SQ | PQ | Acc | F1 (Epi.) | F1 (Lym./Inf.) | F1 (Misc.) | F1 (Spindle)
---|---|---|---|---|---|---|---|---|---|---|---
GLySAC | HoVer-Net | 0.839 | 0.670 | 0.807 | 0.787 | 0.637 | 0.713 | 0.565 | 0.556 | 0.315 | -
GLySAC | HoVer-Net + GradMix | 0.839 | 0.672 | 0.809 | 0.789 | 0.640 | 0.703 | 0.551 | 0.551 | 0.320 | -
GLySAC | HoVer-Net + DiffMix-E | 0.838 | 0.669 | 0.806 | 0.789 | 0.640 | 0.719 | 0.572 | 0.560 | 0.321 | -
GLySAC | HoVer-Net + DiffMix-B | 0.840 | 0.673 | 0.811 | 0.791 | 0.642 | 0.697 | 0.573 | 0.519 | 0.304 | -
GLySAC | HoVer-Net + DiffMix | 0.837 | 0.669 | 0.806 | 0.790 | 0.639 | 0.716 | 0.582 | 0.541 | 0.324 | -
GLySAC | SONNET | 0.835 | 0.660 | 0.789 | 0.792 | 0.627 | 0.679 | 0.511 | 0.511 | 0.305 | -
GLySAC | SONNET + GradMix | 0.835 | 0.658 | 0.787 | 0.790 | 0.625 | 0.680 | 0.506 | 0.509 | 0.312 | -
GLySAC | SONNET + DiffMix-E | 0.837 | 0.662 | 0.793 | 0.793 | 0.631 | 0.700 | 0.533 | 0.524 | 0.334 | -
GLySAC | SONNET + DiffMix-B | 0.839 | 0.661 | 0.788 | 0.792 | 0.627 | 0.687 | 0.530 | 0.507 | 0.300 | -
GLySAC | SONNET + DiffMix | 0.837 | 0.663 | 0.793 | 0.791 | 0.630 | 0.694 | 0.538 | 0.513 | 0.312 | -
CoNSeP | HoVer-Net | 0.835 | 0.545 | 0.636 | 0.758 | 0.483 | 0.799 | 0.588 | 0.490 | 0.204 | 0.478
CoNSeP | HoVer-Net + GradMix | 0.836 | 0.562 | 0.658 | 0.765 | 0.504 | 0.802 | 0.598 | 0.519 | 0.144 | 0.494
CoNSeP | HoVer-Net + DiffMix-E | 0.832 | 0.550 | 0.645 | 0.760 | 0.492 | 0.804 | 0.602 | 0.486 | 0.223 | 0.493
CoNSeP | HoVer-Net + DiffMix-B | 0.835 | 0.558 | 0.653 | 0.762 | 0.499 | 0.809 | 0.595 | 0.496 | 0.324 | 0.498
CoNSeP | HoVer-Net + DiffMix | 0.836 | 0.563 | 0.658 | 0.766 | 0.505 | 0.818 | 0.604 | 0.501 | 0.363 | 0.508
CoNSeP | SONNET | 0.841 | 0.564 | 0.646 | 0.766 | 0.496 | 0.863 | 0.610 | 0.618 | 0.367 | 0.560
CoNSeP | SONNET + GradMix | 0.840 | 0.561 | 0.639 | 0.764 | 0.489 | 0.861 | 0.600 | 0.639 | 0.348 | 0.555
CoNSeP | SONNET + DiffMix-E | 0.842 | 0.567 | 0.648 | 0.767 | 0.498 | 0.860 | 0.604 | 0.600 | 0.374 | 0.557
CoNSeP | SONNET + DiffMix-B | 0.842 | 0.562 | 0.636 | 0.765 | 0.488 | 0.857 | 0.600 | 0.606 | 0.335 | 0.557
CoNSeP | SONNET + DiffMix | 0.844 | 0.570 | 0.649 | 0.766 | 0.499 | 0.873 | 0.622 | 0.627 | 0.463 | 0.575
3.1 Datasets
In this study, we used two imbalanced nuclei segmentation and classification datasets. First, GLySAC [6] consists of 59 H&E images of size 1000×1000 pixels, split into 34 training images and 25 test images. GLySAC has 30875 nuclei grouped into three nuclei types: 12081 lymphocytes, 12287 epithelial, and 6507 miscellaneous nuclei. Second, CoNSeP [8] consists of 41 H&E images of size 1000×1000 pixels, divided into 27 training images and 14 test images. CoNSeP has 24319 nuclei in total across four nuclei classes; its training set contains 5537 epithelial, 3941 inflammatory, 5700 spindle-shaped, and 371 miscellaneous nuclei.
3.2 Implementation Details
We trained SDM on one NVIDIA RTX A6000 GPU for 10000 epochs. For data synthesis, we used a DDIM-based diffusion process that reduces the number of sampling steps from 1000 to 100, and we added noise to the input image of SDM, setting the noise timestep to 55. In our scheme, the null label is defined as the all-zero vector, as in [19], and the same guidance scale is used when sampling both datasets. We ran experiments on two baseline networks, SONNET [6] and HoVer-Net [8]. SONNET is implemented with TensorFlow 1.15 [1] and trained on two NVIDIA GeForce 2080 Ti GPUs. HoVer-Net is implemented with PyTorch 1.11.0 and trained on one NVIDIA GeForce 3090 Ti GPU. We performed 4-fold cross-validation for SONNET and 5-fold cross-validation for HoVer-Net. For a fair comparison, we trained every configuration for the same number of iterations, changing only the number of epochs according to the training set size. DiffMix and GradMix use all the original patches and the same number of synthesized patches. DiffMix-B uses the original data and the balancing-map-based patches. Likewise, the DiffMix-E training set consists of the enlarging patches together with the original training set. Therefore, we trained each baseline network for 100 epochs, DiffMix-B and DiffMix-E for 75 epochs, and GradMix and DiffMix for 50 epochs.
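The epoch counts follow from holding the total number of training iterations fixed while the training set grows. The short sketch below shows the arithmetic; the patch count `n` and the one-third extra size are purely hypothetical values chosen so that the calculation reproduces 75 epochs, and the paper does not state the actual set sizes.

```python
def epochs_for_equal_iterations(base_epochs, base_size, new_size):
    """Epochs over new_size samples matching base_epochs over base_size samples."""
    return round(base_epochs * base_size / new_size)

n = 1200  # hypothetical number of original training patches

# Original patches plus the same number of synthesized patches (2n in total).
epochs_doubled = epochs_for_equal_iterations(100, n, 2 * n)

# Hypothetical case: original patches plus one third as many synthesized patches.
epochs_partial = epochs_for_equal_iterations(100, n, n + n // 3)

print(epochs_doubled, epochs_partial)
```

Doubling the training set halves the epoch count, which matches the 50 epochs reported for GradMix and DiffMix.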

3.3 Results
Fig. 2 presents a qualitative comparison of synthesized patches. The original patch is on the left, followed by the enlarging, balancing, and GradMix patches. Compared with the GradMix patches, our two types of patches blend more naturally with the surrounding structure. Moreover, with our scheme, many patches can be synthesized from a single semantic diffusion model.
For the quantitative evaluation, we applied two state-of-the-art networks, HoVer-Net and SONNET, to two public imbalanced nuclei-type datasets, GLySAC and CoNSeP. Table 1 shows the results of 5 experiments per network for each dataset. We also conducted ablation studies on the balancing (DiffMix-B) and enlarging (DiffMix-E) patch datasets. Before analyzing the results, we computed the proportion of the least frequent nuclei type in each training set. Miscellaneous nuclei account for around 19% in GLySAC, but only 2.4% in CoNSeP. This means that GLySAC is more balanced in terms of nuclei types than CoNSeP. Taking this into account, we analyzed our results. First, DiffMix-E showed the highest classification accuracy on the GLySAC dataset for both networks. This indicates that enlarging-map-based data synthesis had enough opportunity to enlarge the learning distribution for GLySAC. On the CoNSeP dataset, however, DiffMix-E performed lower than the other methods. Taken together, these results suggest that when a dataset is already somewhat balanced, enlarging the available data distribution is the more important factor, whereas enlarging alone is not sufficient for a severely imbalanced dataset. On CoNSeP, DiffMix showed the highest performance in most metrics, with margins of about 4% and 9% (HoVer-Net and SONNET, respectively) over the second-highest result in classifying Miscellaneous nuclei, successfully reducing the variability of classification performance among class types. Furthermore, DiffMix improved the segmentation and classification performance of the two state-of-the-art networks, even compared to GradMix.
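The imbalance figures and margins quoted above can be re-derived from the numbers reported in this paper. The sketch below uses the per-class training counts and the Miscellaneous F1-scores on CoNSeP from Table 1; the variable and function names are illustrative only.

```python
# Training-set nuclei counts per class, copied from the dataset statistics.
train_counts = {
    "GLySAC": {"lymphocyte": 7409, "epithelial": 7154, "miscellaneous": 3386},
    "CoNSeP": {"inflammatory": 3941, "epithelial": 5537,
               "miscellaneous": 371, "spindle": 5700},
}

def rarest_class_share(counts):
    """Return (class name, percentage) of the least frequent class."""
    total = sum(counts.values())
    name = min(counts, key=counts.get)
    return name, 100.0 * counts[name] / total

shares = {ds: rarest_class_share(c) for ds, c in train_counts.items()}

# Miscellaneous F1 on CoNSeP: (DiffMix, second-best method) per network, from Table 1.
misc_f1 = {"HoVer-Net": (0.363, 0.324), "SONNET": (0.463, 0.374)}
margins = {net: round(100 * (best - second))
           for net, (best, second) in misc_f1.items()}

for ds, (name, pct) in shares.items():
    print(f"{ds}: rarest class = {name}, {pct:.1f}% of training nuclei")
print(margins)
```

The margins are differences in F1-score expressed in percentage points, not relative improvements.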
4 Conclusion
In this paper, we introduced DiffMix, a semantic diffusion model-based data augmentation framework for imbalanced pathology nuclei datasets. We have experimentally demonstrated that our method can synthesize virtual data that balance and enlarge imbalanced pathology nuclei datasets. Our method also outperforms the state-of-the-art GradMix in both qualitative and quantitative comparisons. Moreover, DiffMix enhances the segmentation and classification performance of two state-of-the-art networks, HoVer-Net and SONNET, even on severely imbalanced datasets such as CoNSeP. Our results suggest that DiffMix can be used to improve the performance of medical image processing tasks in various applications. In the future, we plan to improve the diffusion model so that it can generate various pathology tissue types.
References
- [1] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: TensorFlow: a system for large-scale machine learning. In: OSDI. vol. 16, pp. 265–283. Savannah, GA, USA (2016)
- [2] Chapelle, O., Weston, J., Bottou, L., Vapnik, V.: Vicinal risk minimization. Advances in neural information processing systems 13 (2000)
- [3] DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (2017)
- [4] Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems 34, 8780–8794 (2021)
- [5] Doan, T.N.N., Kim, K., Song, B., Kwak, J.T.: GradMix for nuclei segmentation and classification in imbalanced pathology image datasets. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part II. pp. 171–180. Springer (2022)
- [6] Doan, T.N., Song, B., Vuong, T.T., Kim, K., Kwak, J.T.: SONNET: A self-guided ordinal regression neural network for segmentation and classification of nuclei in large-scale multi-tissue histology images. IEEE Journal of Biomedical and Health Informatics 26(7), 3218–3228 (2022)
- [7] Gong, X., Chen, S., Zhang, B., Doermann, D.: Style consistent image generation for nuclei instance segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 3994–4003 (2021)
- [8] Graham, S., Vu, Q.D., Raza, S.E.A., Azam, A., Tsang, Y.W., Kwak, J.T., Rajpoot, N.: HoVer-Net: Simultaneous segmentation and classification of nuclei in multi-tissue histology images. Medical Image Analysis 58, 101563 (2019)
- [9] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 6840–6851. Curran Associates, Inc. (2020), https://proceedings.neurips.cc/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf
- [10] Hou, L., Agarwal, A., Samaras, D., Kurc, T.M., Gupta, R.R., Saltz, J.H.: Robust histopathology image analysis: To label or to synthesize? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8533–8542 (2019)
- [11] Li, X., Li, C., Rahaman, M.M., Sun, H., Li, X., Wu, J., Yao, Y., Grzegorzek, M.: A comprehensive review of computer-aided whole-slide image analysis: from datasets to feature extraction, segmentation, classification and detection approaches. Artificial Intelligence Review 55(6), 4809–4878 (2022)
- [12] Lin, Y., Wang, Z., Cheng, K.T., Chen, H.: InsMix: Towards realistic generative data augmentation for nuclei instance segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer (2022)
- [13] Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M.: Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021)
- [14] Nichol, A.Q., Dhariwal, P.: Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning. pp. 8162–8171. PMLR (2021)
- [15] Raza, S.E.A., Cheung, L., Shaban, M., Graham, S., Epstein, D., Pelengaris, S., Khan, M., Rajpoot, N.M.: Micro-net: A unified model for segmentation of various objects in microscopy images. Medical image analysis 52, 160–173 (2019)
- [16] Simard, P.Y., LeCun, Y.A., Denker, J.S., Victorri, B.: Transformation invariance in pattern recognition—tangent distance and tangent propagation. In: Neural networks: tricks of the trade, pp. 239–274. Springer (2002)
- [17] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
- [18] Wang, H., Xian, M., Vakanski, A., Shareef, B.: Sian: Style-guided instance-adaptive normalization for multi-organ histopathology image synthesis. arXiv preprint arXiv:2209.02412 (2022)
- [19] Wang, W., Bao, J., Zhou, W., Chen, D., Chen, D., Yuan, L., Li, H.: Semantic image synthesis via diffusion models. arXiv preprint arXiv:2207.00050 (2022)
- [20] Wolleb, J., Bieder, F., Sandkühler, R., Cattin, P.C.: Diffusion models for medical anomaly detection. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VIII. pp. 35–45. Springer (2022)
- [21] Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: Cutmix: Regularization strategy to train strong classifiers with localizable features. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019)
- [22] Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)
- [23] Zhao, B., Chen, X., Li, Z., Yu, Z., Yao, S., Yan, L., Wang, Y., Liu, Z., Liang, C., Han, C.: Triple u-net: Hematoxylin-aware nuclei segmentation with progressive dense feature aggregation. Medical Image Analysis 65, 101786 (2020)
Dataset | Split | Lymphocyte/Inflammatory | Epithelial | Miscellaneous | Spindle | Total
---|---|---|---|---|---|---
GLySAC | Train | 7409 (41.3%) | 7154 (39.9%) | 3386 (18.9%) | - | 17949
GLySAC | Test | 4672 (36.1%) | 5133 (39.7%) | 3121 (24.1%) | - | 12926
GLySAC | Total | 12081 (39.1%) | 12287 (39.8%) | 6507 (21.1%) | - | 30875
CoNSeP | Train | 3941 (25.3%) | 5537 (35.6%) | 371 (2.4%) | 5700 (36.7%) | 15549
CoNSeP | Test | 1638 (18.7%) | 3214 (36.6%) | 561 (6.4%) | 3357 (38.3%) | 8770
CoNSeP | Total | 5579 (22.9%) | 8751 (36.0%) | 932 (3.8%) | 9057 (37.2%) | 24319

Seg. | Description | Cla. | Description
---|---|---|---
Dice | Dice coefficient | Acc | Accuracy for each nuclei type
AJI | Aggregated Jaccard Index | F1 (Epi.) | F1-score for epithelial nuclei
DQ | Detection Quality | F1 (Lym./Inf.) | F1-score for lymphocyte/inflammatory nuclei
SQ | Segmentation Quality | F1 (Misc.) | F1-score for miscellaneous nuclei
PQ | Panoptic Quality | F1 (Spindle) | F1-score for spindle-shaped nuclei
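Two of the segmentation metrics listed above can be written down compactly. The sketch below uses the standard definitions of the Dice coefficient and of panoptic quality (PQ = DQ × SQ, computed from matched instance pairs); the exact matching rule and per-image averaging used in the experiments are not specified here, so the function names and toy inputs are assumptions for illustration.

```python
import numpy as np

def dice(pred, gt):
    """Dice coefficient between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom else 1.0

def panoptic_quality(matched_ious, n_fp, n_fn):
    """Return (DQ, SQ, PQ) from the IoUs of matched pairs and the counts of
    unmatched predictions (false positives) and unmatched ground truths."""
    tp = len(matched_ious)
    dq = tp / (tp + 0.5 * n_fp + 0.5 * n_fn) if (tp + n_fp + n_fn) else 0.0
    sq = sum(matched_ious) / tp if tp else 0.0
    return dq, sq, dq * sq

pred = np.zeros((4, 4), dtype=np.uint8)
gt = np.zeros((4, 4), dtype=np.uint8)
pred[0, 0:4] = 1          # 4 predicted foreground pixels
gt[0, 2:4] = 1            # 2 of them overlap the ground truth
gt[1, 2:4] = 1            # 2 ground-truth pixels are missed

d = dice(pred, gt)
dq, sq, pq = panoptic_quality([0.8, 0.6], n_fp=1, n_fn=1)
```

Here the two masks share 2 of their 4 pixels each, and PQ combines how many instances are detected (DQ) with how well the detected ones are delineated (SQ).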
