
Prior-guided Diffusion Model for Cell Segmentation in Quantitative Phase Imaging

Zhuchen Shao, University of Illinois Urbana-Champaign, Department of Electrical and Computer Engineering, Urbana, Illinois, United States

Mark A. Anastasio, University of Illinois Urbana-Champaign, Department of Electrical and Computer Engineering, Urbana, Illinois, United States; University of Illinois Urbana-Champaign, Department of Bioengineering, Urbana, Illinois, United States

Hua Li, University of Illinois Urbana-Champaign, Department of Bioengineering, Urbana, Illinois, United States; Washington University School of Medicine, Department of Radiation Oncology, Saint Louis, Missouri, United States

keywords:

medical image segmentation, diffusion model, prior-guided, content information

Purpose: Quantitative phase imaging (QPI) is a label-free technique that provides high-contrast images of tissues and cells without the use of chemicals or dyes. Accurate semantic segmentation of cells in QPI is essential for various biomedical applications. While diffusion model (DM)-based segmentation has demonstrated promising results, its reliance on multiple sampling steps reduces efficiency. This study aims to enhance DM-based segmentation by introducing prior-guided content information into the starting noise, thereby reducing the inefficiency associated with multiple samplings.

Approach: A prior-guided mechanism is introduced into DM-based segmentation, replacing randomly sampled starting noise with noise informed by content information. This mechanism utilizes another trained DM and DDIM inversion to incorporate content information from the to-be-segmented images into the starting noise. An evaluation method is also proposed to assess the quality of the starting noise, considering both content and distribution information.

Results: Extensive experiments on various QPI datasets for cell segmentation showed that the proposed method achieved superior performance in DM-based segmentation with only a single sampling. Ablation studies and visual analysis further highlighted the significance of content priors in DM-based segmentation.

Conclusion: The proposed method effectively leverages prior content information to improve DM-based segmentation, providing accurate results while reducing the need for multiple samplings. The findings emphasize the importance of integrating content priors into DM-based segmentation methods for optimal performance.

*Hua Li, li.hua@wustl.edu

1 Introduction

Quantitative phase imaging (QPI) is a powerful, label-free approach that generates high-contrast images by measuring optical path length differences in tissue and cell samples. It is particularly effective for visualizing transparent or low-contrast biological cell structures [1]. This non-invasive approach eliminates the need for chemicals or dyes, allowing for continuous observation and monitoring of cell growth and behavior over time without interference [2, 3]. Precise cell segmentation in QPI datasets enables comprehensive examination of cell morphology and dynamics, which is crucial for a wide range of biomedical research applications [4, 5, 6, 7].

Traditional U-Net-based methods [8, 9] have been widely used in cell segmentation and have shown promising performance, benefiting from their ability to extract multi-scale features [10, 11]. In contrast, diffusion models (DM) [12, 13] generate the semantic segmentation map through a stochastic diffusion process [14]. Given noise randomly sampled from a Gaussian distribution, a trained DM denoises this noise to generate the segmentation mask conditioned on the input image.

As the sampling begins with randomly sampled Gaussian noise, inherent randomness may lead to performance variations in segmentation. Therefore, DM-based segmentation employs multiple random samplings with various starting noises to generate multiple predictions, which are then refined using ensemble learning methods like majority voting for improved accuracy [15, 16, 17]. With the help of multiple samplings and ensemble learning, DM-based segmentation has shown superior results compared to traditional segmentation methods in various medical image segmentation tasks [18, 15, 16, 17]. Within the sampling process of DM, two common methods are employed: the stochastic sampling method introduced in denoising diffusion probabilistic models (DDPM) [12], and the deterministic sampling method introduced in denoising diffusion implicit models (DDIM) [19]. These methods are respectively referred to as “DDPM sampling” and “DDIM sampling” throughout this paper. As the sampling process in DM is iterative and processed step-by-step, lengthy sampling times are a significant drawback of both sampling methods [19, 20, 21]. Multiple samplings further extend these sampling times, making the segmentation process more inefficient and time-consuming.

We propose a novel DM-based cell segmentation approach for QPI images that reduces the need for multiple samplings and achieves better segmentation results. Because QPI provides high-contrast imaging, which enhances the distinctiveness and separation of different content within the images, the proposed method integrates content information from the to-be-segmented QPI images into the starting noise used in DM sampling. In our study, this content information is referred to as the prior information of the starting noise. In addition, an evaluation method is proposed to interpretably assess the quality of the starting noise in terms of both content and distribution information. Extensive experiments were conducted on various QPI datasets for cell segmentation. With only a single sampling, our method achieved better performance than ensemble predictions from multiple random samplings.

The remainder of the paper is organized as follows. Section 2 provides background information on diffusion models, DDIM sampling, and its inversion process. Section 3 describes the proposed method. Section 4 details the numerical studies, and the experimental results are presented in Section 5. Finally, Section 6 provides a summary and discussion of the work.

2 Background

2.1 Diffusion model

Diffusion models (DM) are advanced generative models capable of generating high-quality images [12, 22] and effectively learning useful semantic information [23, 24]. The training process of a DM involves two main steps: forward diffusion and reverse diffusion. In the forward diffusion process, a transition is incrementally made from the original data distribution towards a Gaussian distribution across a sequence of discrete time steps $t$ , which range from $0$ to $T-1$ . At each step, a predetermined amount of Gaussian noise $\bm{\epsilon}$ is added to the input images $\bm{x}$ , progressively transforming the data as follows:

\begin{equation}
\bm{x_{t}}=\sqrt{\bar{\alpha}_{t}}\bm{x}+\sqrt{1-\bar{\alpha}_{t}}\bm{\epsilon},
\tag{1}
\end{equation}

where $\bar{\alpha}_{t}$ is a hyper-parameter that controls the noise level at step $t$, $\{\bm{x_{t}}\}^{T}_{t=1}$ are the noise-added images, and $T$ is the maximum number of diffusion steps. During the reverse diffusion process, the denoising U-Net, denoted as $\bm{\epsilon_{\theta}}$, is responsible for predicting the noise in $\bm{x_{t}}$. The training objective is to minimize the error between the added noise $\bm{\epsilon}$ and the predicted noise $\bm{\epsilon_{\theta}}\left(\bm{x_{t}}\right)$, which can be represented as follows:

\begin{equation}
L_{DM}=\mathbb{E}_{\bm{x},\bm{\epsilon}\sim\bm{\mathcal{N}}(0,1),t}\left[\left\|\bm{\epsilon}-\bm{\epsilon_{\theta}}\left(\bm{x_{t}}\right)\right\|_{2}^{2}\right].
\tag{2}
\end{equation}
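As an illustration of Eqs. 1 and 2, the following minimal sketch applies one forward diffusion step to a toy flattened image and evaluates the squared-error term for a single sample. The linear $\beta$ schedule, its end points, and the placeholder `eps_theta` are assumptions made for this example only; the text does not specify the schedule, and a real denoising U-Net would replace the placeholder.

```python
import math
import random

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products alpha_bar_t for an ASSUMED linear beta schedule."""
    alpha_bar, prod = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar

def forward_diffuse(x, eps, a_bar_t):
    """Eq. 1: x_t = sqrt(a_bar_t) * x + sqrt(1 - a_bar_t) * eps (element-wise)."""
    return [math.sqrt(a_bar_t) * xi + math.sqrt(1.0 - a_bar_t) * ei
            for xi, ei in zip(x, eps)]

def noise_prediction_loss(eps, eps_pred):
    """Single-sample version of Eq. 2: squared L2 error between the two noises."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

def eps_theta(x_t):
    """Hypothetical placeholder for the denoising U-Net; always predicts zero noise."""
    return [0.0] * len(x_t)

rng = random.Random(0)
alpha_bar = make_alpha_bar()
x = [0.5, -0.2, 0.1, 0.9]                # toy flattened "image"
eps = [rng.gauss(0.0, 1.0) for _ in x]   # Gaussian noise to be added
t = 300                                  # arbitrary time step for the example
x_t = forward_diffuse(x, eps, alpha_bar[t])
loss = noise_prediction_loss(eps, eps_theta(x_t))
```

In training, the expectation in Eq. 2 would be approximated by averaging this term over mini-batches of images, noise draws, and randomly chosen time steps.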

Implementing a traditional DM poses challenges due to high computational costs. To address this challenge, the latent diffusion model (LDM) [13] was proposed. This method incorporates the idea of compression by introducing pre-trained autoencoding models that encode images into a latent space, which makes the training of the diffusion model more efficient. Specifically, a pre-trained autoencoding model includes an encoder $\bm{\mathcal{E}}$ and a decoder $\bm{\mathcal{D}}$. All input images $\bm{x}$ for the LDM are encoded into the latent representation $\bm{z}$ by $\bm{\mathcal{E}}$, and the subsequent diffusion training takes place in the latent space. The training objective can be represented as follows:

\begin{equation}
L_{LDM}=\mathbb{E}_{\bm{z},\bm{\epsilon}\sim\bm{\mathcal{N}}(0,1),t}\left[\left\|\bm{\epsilon}-\bm{\epsilon_{\theta}}\left(\bm{z_{t}}\right)\right\|_{2}^{2}\right].
\tag{3}
\end{equation}

2.2 DDIM sampling and its inversion process

Stochastic DDPM sampling [12] and deterministic DDIM sampling [19] are two commonly used sampling methods for a trained DM to generate segmentation masks. Specifically, DDPM sampling follows a Markov chain process, and the sampling process is as follows:

\begin{equation}
\bm{x_{t-1}}=\frac{1}{\sqrt{\alpha_{t}}}\left(\bm{x_{t}}-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\bm{\epsilon_{\theta}}\left(\bm{x_{t}}\right)\right)+\sigma_{t}\bm{\epsilon},
\tag{4}
\end{equation}

where $\bm{\epsilon_{\theta}}$ represents the pre-trained denoising U-Net within the DM, and $\bm{\epsilon}$ represents randomly sampled Gaussian noise. The hyper-parameters $\alpha_{t}$, $\bar{\alpha}_{t}$, $\beta_{t}$, and $\sigma_{t}$ determine the scales of the noise. Due to the inclusion of the random noise $\bm{\epsilon}$, the sampling process in DDPM is inherently stochastic.

In contrast, DDIM sampling is a deterministic process, formulated as follows:

\begin{equation}
\begin{aligned}
\bm{x_{t-1}} &= \sqrt{\bar{\alpha}_{t-1}}\left(\frac{\bm{x_{t}}-\sqrt{1-\bar{\alpha}_{t}}\bm{\epsilon_{\theta}}\left(\bm{x_{t}}\right)}{\sqrt{\bar{\alpha}_{t}}}\right)+\sqrt{1-\bar{\alpha}_{t-1}}\bm{\epsilon_{\theta}}\left(\bm{x_{t}}\right) \\
&= \sqrt{\frac{\bar{\alpha}_{t-1}}{\bar{\alpha}_{t}}}\bm{x_{t}}+\sqrt{\bar{\alpha}_{t-1}}\left(\sqrt{\frac{1}{\bar{\alpha}_{t-1}}-1}-\sqrt{\frac{1}{\bar{\alpha}_{t}}-1}\right)\bm{\epsilon_{\theta}}\left(\bm{x_{t}}\right),
\end{aligned}
\tag{5}
\end{equation}

where $\bm{\epsilon_{\theta}}$ represents the pre-trained denoising U-Net within the DM, while the hyper-parameter $\bar{\alpha}_{t}$ determines the scales of the noise. The DDIM process does not involve the random noise $\bm{\epsilon}$. It is therefore a deterministic sampling method and can be reformulated to derive its corresponding ordinary differential equation (ODE) [19]. Consequently, numerical approximation methods, including the Euler method, can approximate the inversion process of DDIM sampling. When the Euler method is applied with a sufficient number of discrete time steps, the DDIM inversion is effectively deterministic.
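For reference, the Euler-based inversion step is commonly obtained by applying the second line of Eq. 5 in the opposite direction, that is, with $t-1$ replaced by $t+1$. The form below is a sketch taken from the general DDIM literature rather than a formula stated in this paper, and it relies on the assumption that $\bm{\epsilon_{\theta}}$ changes little between adjacent time steps:

\begin{equation*}
\bm{x_{t+1}}=\sqrt{\frac{\bar{\alpha}_{t+1}}{\bar{\alpha}_{t}}}\bm{x_{t}}+\sqrt{\bar{\alpha}_{t+1}}\left(\sqrt{\frac{1}{\bar{\alpha}_{t+1}}-1}-\sqrt{\frac{1}{\bar{\alpha}_{t}}-1}\right)\bm{\epsilon_{\theta}}\left(\bm{x_{t}}\right).
\end{equation*}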

The deterministic characteristic of DDIM sampling and DDIM inverse sampling ensures consistent transformation between input images and their corresponding noise. To facilitate the recovery of the original image from noise, the DDIM inverse sampling method preserves important content information during its noise addition process. Inspired by the ability to retain content information, DDIM inversion is employed in areas such as image translation [25, 26] and image editing [27] for extracting content information of input images.
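The consistency between DDIM inversion and DDIM sampling can be illustrated with a small round-trip experiment. The sketch below uses the second line of Eq. 5 for both directions; the geometric $\bar{\alpha}$ schedule and the toy noise predictor `eps_theta` are assumptions chosen only to keep the example self-contained, and they stand in for a trained denoising U-Net.

```python
import math

def ddim_move(x, eps, a_bar_from, a_bar_to):
    """Second line of Eq. 5, written for a generic move between two noise levels."""
    scale = math.sqrt(a_bar_to / a_bar_from)
    coef = math.sqrt(a_bar_to) * (math.sqrt(1.0 / a_bar_to - 1.0)
                                  - math.sqrt(1.0 / a_bar_from - 1.0))
    return [scale * xi + coef * ei for xi, ei in zip(x, eps)]

def eps_theta(x):
    """Hypothetical stand-in for the denoising U-Net: a small, smooth toy function."""
    return [0.1 * math.tanh(xi) for xi in x]

def ddim_invert(x0, alpha_bars):
    """Image -> noise: walk from low to high noise level (Euler approximation)."""
    x = list(x0)
    for a_from, a_to in zip(alpha_bars[:-1], alpha_bars[1:]):
        x = ddim_move(x, eps_theta(x), a_from, a_to)
    return x

def ddim_sample(xT, alpha_bars):
    """Noise -> image: Eq. 5 applied from high to low noise level."""
    x = list(xT)
    rev = alpha_bars[::-1]
    for a_from, a_to in zip(rev[:-1], rev[1:]):
        x = ddim_move(x, eps_theta(x), a_from, a_to)
    return x

# ASSUMED toy schedule: alpha_bar decays geometrically from 0.9999 to 0.01
steps = 200
alpha_bars = [0.9999 * (0.01 / 0.9999) ** (i / steps) for i in range(steps + 1)]
x0 = [0.5, -0.2, 0.1, 0.9]               # toy flattened "image"
xT = ddim_invert(x0, alpha_bars)         # content-informed starting noise
x0_rec = ddim_sample(xT, alpha_bars)     # deterministic reconstruction
max_err = max(abs(a - b) for a, b in zip(x0, x0_rec))
```

Because each inversion step evaluates the predictor at the current state rather than at the next one, the round trip is only approximate, and `max_err` is expected to shrink as the number of steps grows.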

Figure 1: Proposed prior-guided DM-based segmentation (PG-DiffSeg) method. (a) The process of prior-guided DM-based segmentation. (b) The process of prior information extraction. The pre-trained LDM-P generates content-informed starting noise directly from a given image through the DDIM inversion process. The resulting noise, which is informed by the image’s content and follows a Gaussian distribution, is input to LDM-S for segmentation.

3 Proposed framework

3.1 Overview of the proposed method

As shown in Fig. 1, the proposed method is named the prior-guided DM-based segmentation (PG-DiffSeg). Leveraging the high-contrast content information in QPI images, this method incorporates content information into the starting noise of the sampling process, enabling DM-based segmentation to achieve precise results without multiple samplings and ensemble learning. Specifically, this method utilizes two latent diffusion models (LDMs) [13]: one LDM for prior extraction (LDM-P) and another for segmentation (LDM-S). By exploiting the deterministic DDIM inversion process, LDM-P transforms to-be-segmented images into corresponding noise, which conforms to a Gaussian distribution while retaining content information. Instead of relying on randomly sampled noise, LDM-S utilizes the prior-informed noise extracted by LDM-P as the starting noise to generate the segmentation mask. Starting from the prior-informed noise better supports LDM-S in stably generating a precise segmentation mask.

3.2 Latent diffusion model for prior extraction (LDM-P)

To generate Gaussian noise while retaining content information in images, a latent diffusion model (LDM) with the DDIM inverse sampling process is employed to transform input images into their corresponding noise. The LDM used in this process is named LDM-P. Specifically, LDM-P is trained in an unsupervised manner using both the forward and reverse diffusion processes. For efficient training, the input images are transformed into latent representations $\bm{z}$ using a pre-trained encoder $\bm{\mathcal{E}}$ [13]. During the forward diffusion process, random noise is added to $\bm{z}$, as illustrated in Eq. 1. Subsequently, during the reverse diffusion process, LDM-P predicts the added noise by minimizing the training loss defined in Eq. 3. Given to-be-segmented images, the trained LDM-P converts the input latent representation into prior-informed noise for the DM-based segmentation described below.

3.3 Latent diffusion model for segmentation (LDM-S)

Given the prior-informed noise generated by LDM-P, another LDM is employed to predict the segmentation mask of the to-be-segmented image. In our study, this LDM for segmentation tasks is referred to as LDM-S. Training LDM-S involves both images and segmentation masks and follows a supervised learning process. The paired inputs are first pre-processed. The to-be-segmented images $\bm{x}$ are transformed into the latent representation $\bm{z}$ using the pre-trained encoder $\bm{\mathcal{E}}$ [13]. The segmentation mask $\bm{y}$ is first converted into multiple binary segmentation masks through one-hot encoding and then resized to match the dimensions of the latent representation $\bm{z}$. The training of LDM-S involves both forward and reverse diffusion processes. In the forward diffusion, Gaussian noise is added to the pre-processed segmentation mask $\bm{y}$, as illustrated in Eq. 1. During the reverse diffusion, a denoising U-Net $\bm{\epsilon_{\theta}}$, conditioned on the to-be-segmented image, is used to predict the noise added to the segmentation masks $\bm{y_{t}}$. The training objective is to minimize the loss function, defined as the squared error between the added noise $\bm{\epsilon}$ and the noise predicted by the model, $\bm{\epsilon_{\theta}}(\bm{y_{t}},\bm{z})$:

\begin{equation}
L=\mathbb{E}_{\bm{y},\bm{z},\bm{\epsilon}\sim\mathcal{N}(0,1),t}\left[\left\|\bm{\epsilon}-\bm{\epsilon_{\theta}}\left(\bm{y_{t}},\bm{z}\right)\right\|_{2}^{2}\right].
\tag{6}
\end{equation}

The trained LDM-S model predicts the segmentation mask using the prior-informed noise $\bm{\epsilon_{p}}$ as the starting noise, which is generated by the trained LDM-P from the given to-be-segmented image. The segmentation in LDM-S requires the dimensions of the starting noise to match those of the segmentation mask. The prior-informed noise obtained from LDM-P has dimensions $\bm{\epsilon_{p}}\in\mathbb{R}^{h/f\times w/f\times 3}$, where $h\times w$ corresponds to the original image size and $f$ is the down-sampling factor. Averaging and replication operations are performed on $\bm{\epsilon_{p}}$ to pre-process it into $\bm{\epsilon_{p}^{r}}\in\mathbb{R}^{h/f\times w/f\times n}$, where $n$ represents the number of classes in the segmentation mask.

Using this pre-processed starting noise, the trained LDM-S employs the DDIM sampling process (Eq. 5) to generate segmentation mask predictions. Subsequently, a resizing operation is applied to convert the prediction from the latent space size to the image space. The testing process of LDM-S is presented in Algorithm 1.

Algorithm 1: Testing of the LDM-S

Input: Input image $\bm{x}$; informed noise $\bm{\epsilon_{p}}$ from Section 3.2; pre-trained encoder $\bm{\mathcal{E}}$ from [13]. LDM-S is referred to as $\bm{\mathcal{D_{S}}}$; $h\times w$ indicates the original image dimensions, while $h_{f}\times w_{f}$ represents the reduced image dimensions in the latent space.

Output: Predicted segmentation result $\bm{\hat{y}}$.

1) Pre-processing the prior-informed noise and input image.

$\bm{\epsilon_{p}^{a}}=\operatorname{Average}(\bm{\epsilon_{p}})$, where $\bm{\epsilon_{p}^{a}}\in\mathbb{R}^{h_{f}\times w_{f}\times 1}$

$\bm{\epsilon_{p}^{r}}=\operatorname{Repeat}(\bm{\epsilon_{p}^{a}})$, where $\bm{\epsilon_{p}^{r}}\in\mathbb{R}^{h_{f}\times w_{f}\times n}$ and $n$ is the number of classes

$\bm{z}=\bm{\mathcal{E}}\left(\bm{x}\right)$, where $\bm{z}\in\mathbb{R}^{h_{f}\times w_{f}\times 3}$

2) Using pre-trained LDM-S to predict segmentation mask.

$\bm{\hat{y}}=\operatorname{Resize}\left(\bm{\mathcal{D_{S}}}\left(\bm{\epsilon_{p}^{r}},\bm{z}\right)\right)$, where $\bm{\hat{y}}\in\mathbb{R}^{h\times w\times n}$
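The data flow of Algorithm 1 can be sketched in a few lines. In the example below, nested Python lists stand in for tensors, `ldm_s_sample` is a hypothetical stub for DDIM sampling with the trained LDM-S, and nearest-neighbour interpolation is an assumed choice for the resizing operation. No rescaling is applied after channel averaging, since the text does not describe one.

```python
import random

def average_channels(eps_p):
    """Average over the channel axis: (h_f, w_f, 3) -> (h_f, w_f, 1)."""
    return [[[sum(px) / len(px)] for px in row] for row in eps_p]

def repeat_channels(eps_a, n):
    """Replicate the single channel n times: (h_f, w_f, 1) -> (h_f, w_f, n)."""
    return [[px * n for px in row] for row in eps_a]

def resize_nearest(y, h, w):
    """Nearest-neighbour resize from latent size to image size (ASSUMED choice)."""
    h_f, w_f = len(y), len(y[0])
    return [[y[i * h_f // h][j * w_f // w] for j in range(w)] for i in range(h)]

def ldm_s_sample(eps_r, z):
    """Hypothetical stub for DDIM sampling with the trained LDM-S (D_S).
    A real model would denoise eps_r conditioned on z; this stub returns eps_r."""
    return eps_r

def segment(z, eps_p, n_classes, h, w):
    """Algorithm 1, given the latent z = E(x) and the prior-informed noise eps_p."""
    eps_a = average_channels(eps_p)            # step 1: average
    eps_r = repeat_channels(eps_a, n_classes)  # step 1: repeat
    y_latent = ldm_s_sample(eps_r, z)          # step 2: single DDIM sampling
    return resize_nearest(y_latent, h, w)      # step 2: back to image size

rng = random.Random(0)
h, w, f, n = 8, 8, 4, 3                        # toy sizes, not those of the paper
h_f, w_f = h // f, w // f
z = [[[0.0] * 3 for _ in range(w_f)] for _ in range(h_f)]   # placeholder for E(x)
eps_p = [[[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(w_f)]
         for _ in range(h_f)]
y_hat = segment(z, eps_p, n, h, w)
```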

4 Numerical Studies

4.1 Datasets

Two QPI datasets from a cell viability assessment study [6], HeLa and CHO-2, were used to evaluate model performance on semantic segmentation, with the aim of distinguishing between live and dead cells. For simplicity, CHO-2 is referred to as CHO in this study. Each image is 832 $\times$ 832 pixels and is accompanied by a corresponding segmentation mask. The HeLa dataset contains 1,199 images, while the CHO dataset comprises 2,051 images. The original images were subdivided into patches of 256 $\times$ 256 pixels and split into training, validation, and testing datasets. Details about the data can be found in Table 1.

Table 1: Statistical overview of datasets after pre-processing

	HeLa	CHO
#image	19,184	22,816
#train	14,400	16,416
#validation	3,200	3,200
#test	1,584	3,200
image size	256 $\times$ 256	256 $\times$ 256
#classes	3	3

4.2 Model architecture

The LDM-P and LDM-S were used for prior extraction and segmentation, respectively. The denoising U-Net within both LDMs follows the architecture designed in LDM [13], with the same number of U-Net layers, ResNet-structured blocks, and self-attention blocks. The model architecture for comparative methods is described in Sec. 4.4. The LDM-P prioritizes learning overall content information with a larger receptive field. Accordingly, it features larger hidden layers with dimensions of 224, 448, 672, and 896, along with higher down-sampling factors, where the spatial resolution of input features to attention modules is set to 8, 16, and 32. Conversely, the LDM-S emphasizes learning structure and local information, which requires a larger spatial resolution in the self-attention blocks. Accordingly, the spatial resolution of input features to attention modules is set to 16, 32, and 64. To reduce the computational burden, it uses smaller hidden layers with dimensions of 128, 256, 384, and 512. LDM-P and LDM-S have 274M and 84.6M parameters, respectively.

4.3 Implementation details

In the experiments, the parameters predefined by LDM [13] were used. The configuration involved a maximum diffusion step of 1000, and the pre-trained weight named “vq-f4” was selected for the encoder $\bm{\mathcal{E}}$, which reduces the spatial size of the input images from 256 $\times$ 256 to 64 $\times$ 64. In the training process, the LDM-P used a batch size of 20 and a learning rate of 2e-06, while the LDM-S used a batch size of 10 and a learning rate of 2e-06. The training for both LDMs involved standard data augmentation techniques, including rotation, horizontal and vertical flipping, and random adjustments to brightness and contrast. During testing, the LDM-P used a 100-step DDIM inversion to transform input images into prior-informed noise. The LDM-S employed a 200-step DDIM sampling (Eq. 5) to generate the segmentation mask.
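To make the step counts concrete, the following sketch lists the time steps visited by a 100-step inversion and a 200-step sampling when the maximum diffusion step is 1000. Uniform striding is an assumption here; it is a common choice in DDIM implementations, but the text does not state which spacing was used.

```python
def ddim_timesteps(total_steps, num_steps):
    """Evenly strided subset of [0, total_steps) visited by DDIM (ASSUMED uniform)."""
    stride = total_steps // num_steps
    return list(range(0, total_steps, stride))

inversion_ts = ddim_timesteps(1000, 100)        # ascending: image -> noise, stride 10
sampling_ts = ddim_timesteps(1000, 200)[::-1]   # descending: noise -> mask, stride 5
```

Under this assumption, a larger number of DDIM steps simply corresponds to a smaller stride between consecutive time steps.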

4.4 Comparison methods

Five state-of-the-art segmentation methods were implemented for comparison. Three of them are deterministic segmentation methods, while the other two are diffusion model-based segmentation methods using multiple sampling and ensemble learning.

4.4.1 Deterministic segmentation methods

The proposed method was compared to three deterministic segmentation methods: 1) ResNet-U-Net [28, 8], 2) Efficient-U-Net [29, 8], and 3) DeepLabv3 [30]. Both ResNet-U-Net and DeepLabv3 used ResNet-101 [28] as the encoder, and Efficient-U-Net used EfficientNet-B4 [29] as the encoder. For the decoder, both ResNet-U-Net and Efficient-U-Net used a U-Net decoder. In contrast, the decoder in DeepLabv3 used an atrous spatial pyramid pooling (ASPP) module to capture multi-scale information. ResNet-U-Net, Efficient-U-Net, and DeepLabv3 had 51.5M, 20.2M, and 58.6M parameters, respectively. To ensure a fair comparison, all the comparative methods were trained from scratch.

These three methods were trained with a batch size of 50 and a learning rate of 1e-04 using the Lookahead + RAdam optimizer [31]. The training process employed the same data augmentation as our method, and Dice loss [32] was used as the training loss.

4.4.2 DM-based segmentation methods

The proposed method was also compared with two DM-based approaches: 1) DM-based segmentation based on multiple samplings [18] and 2) the CCDM method [14], a DM-based segmentation guided by a pre-trained DINO [33]. Randomly sampled starting noise is used in both methods. Additionally, CCDM uses DINO to extract features from the input images and combines them with the features extracted in the denoising U-Net, thereby incorporating image content information into the segmentation process.

The training of the traditional diffusion model-based segmentation was similar to that of the proposed PG-DiffSeg. For training CCDM, a batch size of 8 and a learning rate of 2e-06 were used. The pre-trained weight for DINO [33] was applied with the “dino_vits8” setting, and the dimension of the extracted features was 384. These features were concatenated with the input features of the fourth U-Net block in LDM. The LDM-implemented CCDM had 108M parameters.

4.5 Evaluation metrics

Mean Intersection over Union (mIoU) and F1 score (F1) were used to evaluate the segmentation performance. Higher values for these metrics indicate better performance. The Structural Similarity Index (SSIM) was used to assess the content information retained in the starting noise for sampling. SSIM takes into account structure, texture, and luminance information and ranges from -1 to 1, with 1 indicating perfect similarity. Additionally, the Probability Density Function (PDF) of the starting noise is estimated using the Gaussian Kernel Density Estimation technique. Kullback-Leibler Divergence (KLD) is applied to measure the difference between the estimated PDF of the starting noise and that of the standard Gaussian distribution. The value of KLD ranges from 0 to positive infinity, with 0 indicating a standard Gaussian distribution. Together, the content assessment and the distribution assessment are used to analyze how well the prior-informed noise retains image content information while still conforming to a Gaussian distribution. With three-fold cross-validation, all evaluation metrics were reported as the mean and standard deviation (SD). Independent samples t-tests were conducted to identify significant differences in the results, and the corresponding p-values were reported.
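For concreteness, the sketch below computes mIoU and F1 from flattened label maps. Macro-averaging over classes and skipping classes that are absent from both the prediction and the ground truth are assumptions of this example; the text does not state which averaging convention was used.

```python
def miou_and_f1(y_true, y_pred, n_classes):
    """Macro-averaged IoU and F1 over classes (ASSUMED averaging convention)."""
    ious, f1s = [], []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        if tp + fp + fn == 0:
            continue  # class absent from both maps: skipped (ASSUMED)
        ious.append(tp / (tp + fp + fn))
        f1s.append(2 * tp / (2 * tp + fp + fn))
    return sum(ious) / len(ious), sum(f1s) / len(f1s)

y_true = [0, 0, 1, 1, 2, 2]   # toy flattened ground-truth mask
y_pred = [0, 1, 1, 1, 2, 0]   # toy flattened prediction
miou, f1 = miou_and_f1(y_true, y_pred, n_classes=3)
```

In this toy case the per-class IoU values are 1/3, 2/3, and 1/2, so the mIoU is 0.5.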

5 Results

5.1 Cell segmentation results

Experiments were conducted on two label-free imaging datasets to assess the effectiveness of the method in stratifying and contouring live and dead cells. Section 4.1 provided the dataset details, while Sections 4.4 and 4.5 discussed the comparative methods and evaluation metrics, respectively. The results are summarized in Table 2. Here, “Random $\times$ 1” represents DM-based segmentation with only one sampling, while “Random $\times$ 3/ $\times$ 5” and “CCDM $\times$ 3/ $\times$ 5” represent ensembles of 3 or 5 samplings. Majority voting is employed to obtain the ensemble prediction from these multiple predictions. Our proposed PG-DiffSeg is denoted as “Prior”.
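The ensemble step used for the multi-sampling baselines can be sketched as a pixel-wise majority vote. The tie-breaking rule in the example below (the label that appears first among the samplings wins) is an assumption, as the text does not specify how ties are resolved.

```python
from collections import Counter

def majority_vote(predictions):
    """Pixel-wise majority vote over several flattened label maps.
    Ties go to the label that appears first among the samplings (ASSUMED)."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

# Three toy samplings of a four-pixel mask, e.g. for a "Random x 3" setting
samplings = [
    [0, 1, 1, 2],
    [0, 1, 2, 2],
    [1, 1, 2, 0],
]
ensemble = majority_vote(samplings)
```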

1) Compared to two U-Net-based segmentation methods, the DM-based method with single sampling (“Random $\times$ 1”) can achieve comparable performance. By using multiple samplings in the DM-based method, the segmentation performance increased compared to “Random $\times$ 1” and also achieved better results than all the deterministic segmentation methods. This observation confirms that multiple sampling plays an important role in improving the performance of DM-based segmentation methods.

2) Nonetheless, increasing the number of samplings does not always guarantee a performance improvement. For example, from “Random $\times$ 1” to “Random $\times$ 3”, the mIoU and F1 increased by 0.035 and 0.031 (3.5 and 3.1 percentage points) on the HeLa dataset, respectively. However, from “Random $\times$ 3” to “Random $\times$ 5”, the mIoU and F1 remained essentially unchanged (mIoU of 0.617 vs. 0.618, and F1 of 0.688 in both cases). A similar trend was observed for CCDM. A possible reason is that, after a certain point, additional samplings may not provide significant new information or may even introduce noise into the segmentation process, resulting in a negative effect on performance. Compared to “Random”, the “CCDM” method showed higher performance on the HeLa dataset but not on the CHO dataset. The distinction between “Random” and “CCDM” lies in the use of a DINO model, pre-trained on a large number of natural images, to guide the diffusion training process in CCDM. The content information extracted by DINO had a positive impact on the HeLa dataset, whereas it had a negative effect on the CHO dataset. This difference may be attributed to the more complex structure of cell images within the CHO dataset, which differs significantly from the characteristics of the natural images used for pre-training DINO, making the CCDM method less effective.

3) Our proposed prior-guided method for DM-based segmentation significantly enhanced performance in terms of mIoU and F1 over both the HeLa and CHO datasets. Furthermore, by incorporating content information specific to the to-be-segmented images into the starting noise, only a single sampling was required to achieve optimal performance.

Table 2: Results for cell segmentation on the QPI datasets

	HeLa		CHO
	mIoU	F1	mIoU	F1
ResNet-U-Net	0.589 (0.026)	0.671 (0.019)	0.584 (0.047)	0.652 (0.046)
Efficient-U-Net	0.592 (0.006)	0.675 (0.011)	0.570 (0.045)	0.645 (0.041)
DeepLabv3	0.618 (0.047)	0.696 (0.031)	0.627 (0.064)	0.692 (0.060)
Random $\times$ 1	0.582 (0.045)	0.657 (0.030)	0.584 (0.094)	0.646 (0.090)
Random $\times$ 3	0.617 (0.067)	0.688 (0.051)	0.640 (0.086)	0.698 (0.082)
Random $\times$ 5	0.618 (0.050)	0.688 (0.034)	0.656 (0.081)	0.712 (0.078)
CCDM $\times$ 1	0.602 (0.058)	0.669 (0.045)	0.519 (0.049)	0.581 (0.047)
CCDM $\times$ 3	0.637 (0.096)	0.698 (0.083)	0.601 (0.079)	0.661 (0.077)
CCDM $\times$ 5	0.630 (0.078)	0.690 (0.066)	0.619 (0.080)	0.678 (0.078)
Prior (Ours)	0.653 (0.061)	0.725 (0.045)	0.661 (0.070)	0.715 (0.070)

Note: The results are reported as “average performance (standard deviation)” over the cross-validation folds. The highest values were presented in bold, and the second-highest values were underlined.

5.2 Ablation study

5.2.1 Effects of DM-based prior extraction network

The LDM-P serves as the prior extraction network in the proposed method. To evaluate the effectiveness of this trainable module (“LDM-P” in Table 3), two alternative starting noises were compared in DM-based segmentation, namely “Random” and “Forward Diff.”. In the “Random” approach, the starting noise is randomly sampled from a Gaussian distribution. In the “Forward Diff.” approach, the starting noise is generated by applying the forward diffusion process (Eq. 1) directly to the input images. Three starting noises were generated by using 0, 300, and 600 forward diffusion steps, respectively. Increasing the number of forward diffusion steps introduces more Gaussian noise and consequently preserves less content information. Additionally, in the “LDM-P” method, three different starting noises were generated using 50, 100, and 200 DDIM inversion steps, respectively. A larger number of DDIM inversion steps corresponds to a smaller stride between consecutive time steps and thus a more precise inversion process.

Table 3 illustrates the results for different approaches on two datasets of HeLa and CHO. According to the metrics of mIoU and F1, it is evident that on both datasets, incorporating content prior information into the starting noise improved segmentation accuracy compared to the random sampling setting, regardless of whether it was generated by LDM-P or the “Forward Diff.” method. In addition, utilizing DDIM inversion yielded better average performance compared to the non-trainable forward diffusion method, with the average mIoU improving by at least 0.9% in the HeLa dataset and by 6.3% in the CHO dataset.

To explain how the starting noise affects the segmentation performance, SSIM and KLD metrics were used to evaluate the quality of the starting noise. The results shown in Table 3 indicate that our proposed prior extraction method struck a better balance by preserving more content information in the starting noise while remaining closely aligned with the Gaussian distribution. In the “Random” setting, the starting noise consisted of pure Gaussian noise without any content information, as indicated by SSIM and KLD values of 0. In “Forward Diff. 0”, the starting noise is the image itself, meaning it retained the most content information but also deviated the most from the standard Gaussian distribution. In “Forward Diff. 300” and “Forward Diff. 600”, Gaussian noise was added to the to-be-segmented image through 300 and 600 forward diffusion steps, respectively. Due to the limited noise addition, the KLD metric in “Forward Diff. 300” was roughly 10 times higher than that of the other methods, which negatively impacted its performance. “Forward Diff. 600” achieved a lower KLD by applying more forward diffusion steps, but it also lost much of the content information in this process, leading to suboptimal segmentation results. Apart from “Forward Diff. 0” and “Forward Diff. 300”, the non-trainable and trainable methods that incorporated content prior information exhibited significantly larger SSIM values than random sampling (p-value < 0.05) while keeping KLD small, indicating a close alignment with the standard Gaussian distribution. Interestingly, no significant difference was observed when applying different numbers of DDIM inversion steps in our LDM-P, indicating that the method is robust to the choice of this hyper-parameter.
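The distribution assessment can be illustrated with a small one-dimensional example. The sketch below fits a Gaussian kernel density estimate to a set of noise values and numerically integrates the KLD against the standard Gaussian. The direction of the divergence, Silverman's rule for the bandwidth, and the integration grid are all assumptions of this example, since the text only states that Gaussian kernel density estimation and KLD were used.

```python
import math
import random

def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    def pdf(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return pdf

def kld_to_standard_normal(samples, lo=-6.0, hi=6.0, n_grid=241):
    """Approximate KL(p_hat || N(0, 1)) on a grid, with p_hat the KDE of samples."""
    n = len(samples)
    mean = sum(samples) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in samples) / (n - 1))
    bandwidth = 1.06 * std * n ** (-0.2)      # Silverman's rule of thumb (ASSUMED)
    pdf = gaussian_kde(samples, bandwidth)
    dx = (hi - lo) / (n_grid - 1)
    kld = 0.0
    for i in range(n_grid):
        x = lo + i * dx
        p = pdf(x)
        q = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        if p > 0.0:
            kld += p * math.log(p / q) * dx   # rectangle-rule integration
    return kld

rng = random.Random(0)
gaussian_like = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # well-behaved noise
shifted = [rng.gauss(2.0, 1.0) for _ in range(1000)]         # noise with a mean offset
kld_gaussian = kld_to_standard_normal(gaussian_like)
kld_shifted = kld_to_standard_normal(shifted)
```

With a finite number of samples, the estimate for Gaussian-like noise is small but not exactly zero, whereas noise that deviates from the standard Gaussian yields a clearly larger value.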

Table 3: Ablation study for DM-based prior extraction network.

	HeLa
	mIoU $\uparrow$	F1 $\uparrow$	SSIM $\uparrow$	KLD $\downarrow$
Random	0.582(0.045)	0.657(0.030)	0	0
Forward Diff. 0	0.636(0.089)	0.704(0.072)	1	inf
Forward Diff. 300	0.650(0.058)	0.721(0.044)	0.655(0.169)	0.019(0.001)
Forward Diff. 600	0.644(0.064)	0.715(0.052)	0.215(0.063)	0.001(0.000)
LDM-P 50	0.656(0.062)	0.728(0.046)	0.229(0.040)	0.001(0.000)
LDM-P 100	0.653(0.061)	0.725(0.045)	0.230(0.043)	0.001(0.000)
LDM-P 200	0.653(0.061)	0.725(0.045)	0.230(0.044)	0.001(0.000)
	CHO
	mIoU $\uparrow$	F1 $\uparrow$	SSIM $\uparrow$	KLD $\downarrow$
Random	0.584(0.093)	0.646(0.090)	0	0
Forward Diff. 0	0.620(0.016)	0.677(0.014)	1	inf
Forward Diff. 300	0.639(0.070)	0.698(0.068)	0.609(0.001)	0.009(0.000)
Forward Diff. 600	0.597(0.102)	0.657(0.099)	0.164(0.000)	0.002(0.000)
LDM-P 50	0.660(0.071)	0.715(0.069)	0.192(0.016)	0.001(0.000)
LDM-P 100	0.661(0.070)	0.716(0.069)	0.194(0.017)	0.001(0.000)
LDM-P 200	0.661(0.070)	0.716(0.069)	0.197(0.019)	0.001(0.000)

The results were reported as “average performance (standard deviation)” due to the cross validation. The highest values were presented in bold, and the second-highest values were underlined.
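The two starting-noise quality metrics can be illustrated with a short sketch. It is one possible way to compute such scores and may differ from the exact definitions used in this study: the KLD is estimated by fitting a single Gaussian to the noise values and comparing it with the standard Gaussian, and the SSIM is computed over one global window rather than sliding windows. The function names `gaussian_kld` and `global_ssim` and the `data_range` default are assumptions.

```python
import numpy as np

def gaussian_kld(noise):
    """KL( N(mu, var) || N(0, 1) ) using the sample mean and variance of `noise`."""
    mu = float(np.mean(noise))
    var = float(np.var(noise))
    return 0.5 * (var + mu ** 2 - 1.0 - np.log(var))

def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over the whole image (no sliding window)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float(((2 * mx * my + c1) * (2 * cov + c2)) /
                 ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(128, 128))  # placeholder for a QPI image
random_noise = rng.standard_normal(image.shape)

print("KLD of random noise:", gaussian_kld(random_noise))
print("SSIM(image, random noise):", global_ssim(image, random_noise))
print("SSIM(image, image):", global_ssim(image, image))
```

A starting noise with good content and distribution properties would score a high SSIM against the input image while keeping the KLD close to zero.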

5.2.2 Effects of DDIM inversion method

The trained LDM-P employed DDIM inversion to extract content information from the to-be-segmented images and to embed it in starting noise that conforms to a Gaussian distribution. This study compared DDIM inversion with two alternative methods for content information extraction. In the first method, features were extracted from the self-attention module of the LDM-P, which contain segmentation-relevant content information [26]. However, the starting noise generated by this feature extraction may not conform to the standard Gaussian distribution, potentially reducing the efficiency of DM-based segmentation. In the second method, content information was extracted from the input images using a U-Net block [23] and then used as the starting noise for the subsequent segmentation process. Both methods relied on features extracted from the fifth block of the denoising LDM-P, with 400 noise addition steps in the forward diffusion process. The “Random” method, which denotes DM-based segmentation without DDIM inversion, was also included for comparison. mIoU, F1, SSIM, and KLD were used as the evaluation metrics, and the results are summarized in Table 4.

Based on the mIoU and F1 results, the features extracted from the self-attention module performed better than those from the U-Net block. This finding indicates that the long-range information extraction capability of the self-attention module can produce better starting noise than the U-Net block. However, neither feature extraction method achieved satisfactory KLD values, and both showed a significant deviation from the standard Gaussian distribution. Although these methods yielded higher SSIM values on the CHO dataset, their poor distribution alignment negatively impacted the segmentation results. In contrast, DDIM inversion proved to be the most effective of the compared approaches for using the trained LDM-P to obtain starting noise for the subsequent DM-based segmentation task.

Table 4: Ablation study for DDIM inversion method.

	HeLa
	mIoU $\uparrow$	F1 $\uparrow$	SSIM $\uparrow$	KLD $\downarrow$
Random / w/o DDIM inversion	0.582(0.045)	0.657(0.030)	0	0
Feature Extraction (U-Net block)	0.514(0.171)	0.569(0.172)	0.104(0.075)	0.106(0.025)
Feature Extraction (Self-attention)	0.619(0.079)	0.689(0.058)	0.087(0.052)	0.127(0.027)
DDIM inversion	0.653(0.061)	0.725(0.045)	0.230(0.043)	0.001(0.000)
	CHO
	mIoU $\uparrow$	F1 $\uparrow$	SSIM $\uparrow$	KLD $\downarrow$
Random / w/o DDIM inversion	0.584(0.093)	0.646(0.090)	0	0
Feature Extraction (U-Net block)	0.531(0.062)	0.578(0.063)	0.274(0.010)	0.050(0.004)
Feature Extraction (Self-attention)	0.649(0.055)	0.705(0.055)	0.383(0.037)	0.078(0.014)
DDIM inversion	0.661(0.070)	0.716(0.069)	0.194(0.017)	0.001(0.000)

The results were reported as “average performance (standard deviation)” due to the cross validation. The highest values were presented in bold, and the second-highest values were underlined.
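For readers unfamiliar with DDIM inversion, the deterministic update it relies on can be sketched as follows. This is an illustrative outline under stated assumptions rather than the LDM-P implementation: the linear noise schedule, the total of 1000 steps, and the function `toy_eps_model` (a stand-in for the trained denoiser that is exact only when the data are standard Gaussian) are all hypothetical.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """alpha_bar[t] for t = 0..num_steps, with alpha_bar[0] = 1 (no noise)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def ddim_invert(x0, eps_model, alpha_bar, timesteps):
    """Deterministically map a clean input to a noisy latent along `timesteps`.

    `timesteps` is an increasing integer sequence that starts at 0.
    """
    x = np.asarray(x0, dtype=np.float64)
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        a_cur, a_next = alpha_bar[t_cur], alpha_bar[t_next]
        eps = eps_model(x, t_cur)  # predicted noise at the current step
        x0_pred = (x - np.sqrt(1.0 - a_cur) * eps) / np.sqrt(a_cur)
        # Re-noise the predicted clean input to the next (noisier) level.
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps
    return x

alpha_bar = make_alpha_bar()

def toy_eps_model(x, t):
    """Hypothetical denoiser: the optimal prediction if the data were N(0, I)."""
    return np.sqrt(1.0 - alpha_bar[t]) * x

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 16, 16))  # placeholder for an encoded image
timesteps = np.linspace(0, 1000, 50 + 1).astype(int)  # 50 inversion steps
start_noise = ddim_invert(latent, toy_eps_model, alpha_bar, timesteps)
print(start_noise.mean(), start_noise.std())
```

Because no random noise is drawn during inversion, the resulting starting noise is a deterministic function of the input, which is what allows content information to be carried into it.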

5.3 Visualization

Visualizations were used to demonstrate the effectiveness of the proposed method. Segmentation results are shown in Fig. 2, and the content prior is examined in Fig. 3.

In Fig. 2, the HeLa and CHO datasets were obtained using label-free imaging, and the semantic segmentation task was to distinguish live cells, dead cells, and the background. The dashed box shows that DM-based segmentation starting from random noise generates multiple different predictions. Ensemble strategies, such as majority voting, can improve overall prediction performance. However, these multiple predictions exhibit unpredictable variance, which can mislead the ensemble. In contrast, our method used prior information to produce accurate predictions with just one sample. Compared with deterministic methods such as U-Net, our approach also achieved better segmentation accuracy.
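The majority-vote ensemble mentioned above, together with a simple view of the variance among sampled predictions, can be sketched as follows. This is a minimal illustration; the function names and the class indices (0, 1, 2 as arbitrary labels for the three classes) are assumptions, not the evaluation code of this study.

```python
import numpy as np

def majority_vote(masks, num_classes):
    """Per-pixel majority vote over a stack of label maps with shape (n, H, W)."""
    masks = np.asarray(masks)
    counts = np.stack([(masks == c).sum(axis=0) for c in range(num_classes)])
    return counts.argmax(axis=0)  # ties resolve to the lowest class index

def pixelwise_disagreement(masks, num_classes):
    """Fraction of samples that disagree with the voted label at each pixel."""
    masks = np.asarray(masks)
    voted = majority_vote(masks, num_classes)
    return (masks != voted).mean(axis=0)

# Three hypothetical predictions sampled from different random starting noise.
samples = [
    [[0, 1], [2, 2]],
    [[0, 1], [1, 2]],
    [[1, 1], [2, 0]],
]
print(majority_vote(samples, num_classes=3))
print(pixelwise_disagreement(samples, num_classes=3))
```

Pixels with a high disagreement fraction are those where the vote is least reliable, which is the situation a single prior-guided sample is intended to avoid.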

In Fig. 3, random sampling (Gaussian distribution), the non-trainable method (forward diffusion), and the trainable method (DDIM inversion) are compared. The Canny edge detector [34] is employed to analyze content information, with contrast enhancement followed by Gaussian blur used to preprocess the starting noise. Both the HeLa and CHO datasets are used for the content information and distribution analyses. The histograms show that the starting noise obtained by both the non-trainable and trainable methods follows a distribution similar to that of Gaussian noise. Additionally, contour information is retained in the starting noise for these two methods, whereas Gaussian noise retains no content information because it is purely random. Moreover, the red box indicates that the trainable method retains more contour information than the non-trainable method, which is important for segmentation.

6 Discussion and conclusion

Utilizing the high-contrast content information in QPI images, this study introduces a novel DM-based segmentation framework called PG-DiffSeg, which integrates a prior extraction network to enhance DM-based segmentation in both segmentation accuracy and speed. By utilizing DDIM inversion to extract content prior information from the to-be-segmented images, this method addresses the issue of lacking content information in randomly sampled starting noise. Furthermore, this paper proposes a unique approach involving SSIM and KLD metrics to assess the effectiveness of starting noise, emphasizing the importance of content prior and data distribution information. Extensive experiments were conducted on two QPI datasets for cell segmentation. Our method achieved satisfactory results with just one sampling. The ablation studies confirm the effectiveness of the proposed approaches and the robustness of the hyper-parameter selection.

Future research on the prior-guided DM-based segmentation framework could expand in multiple directions. One potential direction is to improve the prior extraction network by integrating classification-related prior information into the starting noise. This adjustment may enhance the performance of semantic segmentation, particularly when dealing with a larger number of classes. Additionally, in the LDM-based segmentation method, replacing Gaussian noise with Bernoulli noise could be advantageous, given that the generation target is a one-hot encoded segmentation mask. Furthermore, additional experiments on a more diverse range of datasets would allow the model’s generalization capabilities to be tested comprehensively.

Disclosures

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Code, Data, and Materials Availability

Code and data will be made publicly available upon acceptance of the paper.

References

[1] Y. Park, C. Depeursinge, and G. Popescu, “Quantitative phase imaging in biomedicine,” Nature photonics 12(10), 578–589 (2018).
[2] M. Mir, B. Bhaduri, R. Wang, et al., “Quantitative phase imaging,” in Progress in optics, 57, 133–217, Elsevier (2012).
[3] T. Vicar, J. Balvan, J. Jaros, et al., “Cell segmentation methods for label-free contrast microscopy: review and comprehensive comparison,” BMC bioinformatics 20, 1–25 (2019).
[4] J. Park, B. Bai, D. Ryu, et al., “Artificial intelligence-enabled quantitative phase imaging methods for life sciences,” Nature Methods 20(11), 1645–1660 (2023).
[5] Y. Jo, H. Cho, S. Y. Lee, et al., “Quantitative phase imaging and artificial intelligence: a review,” IEEE Journal of Selected Topics in Quantum Electronics 25(1), 1–14 (2018).
[6] C. Hu, S. He, Y. J. Lee, et al., “Live-dead assay on unlabeled cells using phase imaging with computational specificity,” Nature communications 13(1), 713 (2022).
[7] C. Stringer, T. Wang, M. Michaelos, et al., “Cellpose: a generalist algorithm for cellular segmentation,” Nature methods 18(1), 100–106 (2021).
[8] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Germany, 2015, 234–241, Springer (2015).
[9] X.-X. Yin, L. Sun, Y. Fu, et al., “U-net-based medical image segmentation,” Journal of Healthcare Engineering 2022 (2022).
[10] F. Isensee, P. F. Jaeger, S. A. Kohl, et al., “nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation,” Nature methods 18(2), 203–211 (2021).
[11] N. Siddique, S. Paheding, C. P. Elkin, et al., “U-net and its variants for medical image segmentation: A review of theory and applications,” IEEE Access 9, 82031–82057 (2021).
[12] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems 33, 6840–6851 (2020).
[13] R. Rombach, A. Blattmann, D. Lorenz, et al., “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 10684–10695 (2022).
[14] L. Zbinden, L. Doorenbos, T. Pissas, et al., “Stochastic segmentation with conditional categorical diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 1119–1129 (2023).
[15] M. Sun, W. Huang, and Y. Zheng, “Instance-aware diffusion model for gland segmentation in colon histology images,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, 662–672, Springer (2023).
[16] T. Chen, C. Wang, and H. Shan, “Berdiff: Conditional bernoulli diffusion model for medical image segmentation,” arXiv preprint arXiv:2304.04429 (2023).
[17] J. Wolleb, R. Sandkühler, F. Bieder, et al., “Diffusion models for implicit image segmentation ensembles,” in International Conference on Medical Imaging with Deep Learning, 1336–1348, PMLR (2022).
[18] T. Amit, T. Shaharbany, E. Nachmani, et al., “Segdiff: Image segmentation with diffusion probabilistic models,” arXiv preprint arXiv:2112.00390 (2021).
[19] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502 (2020).
[20] T. Salimans and J. Ho, “Progressive distillation for fast sampling of diffusion models,” arXiv preprint arXiv:2202.00512 (2022).
[21] H. Zheng, W. Nie, A. Vahdat, et al., “Fast sampling of diffusion models via operator learning,” in International Conference on Machine Learning, 42390–42402, PMLR (2023).
[22] F.-A. Croitoru, V. Hondru, R. T. Ionescu, et al., “Diffusion models in vision: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
[23] Z. Shao, S. Sengupta, H. Li, et al., “Semi-supervised semantic segmentation of cell nuclei via diffusion-based large-scale pre-training and collaborative learning,” arXiv preprint arXiv:2308.04578 (2023).
[24] Z. Shao, L. Dai, Y. Wang, et al., “Augdiff: Diffusion based feature augmentation for multiple instance learning in whole slide image,” arXiv preprint arXiv:2303.06371 (2023).
[25] X. Su, J. Song, C. Meng, et al., “Dual diffusion implicit bridges for image-to-image translation,” arXiv preprint arXiv:2203.08382 (2022).
[26] N. Tumanyan, M. Geyer, S. Bagon, et al., “Plug-and-play diffusion features for text-driven image-to-image translation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1921–1930 (2023).
[27] D. Miyake, A. Iohara, Y. Saito, et al., “Negative-prompt inversion: Fast image inversion for editing with text-guided diffusion models,” arXiv preprint arXiv:2305.16807 (2023).
[28] K. He, X. Zhang, S. Ren, et al., “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778 (2016).
[29] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in International conference on machine learning, 6105–6114, PMLR (2019).
[30] L.-C. Chen, G. Papandreou, F. Schroff, et al., “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587 (2017).
[31] M. R. Zhang and J. Lucas, “Lookahead optimizer: k steps forward, 1 step back,” in International Conference on Learning Representations, (2019).
[32] C. H. Sudre, W. Li, T. Vercauteren, et al., “Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: Third International Workshop, DLMIA 2017, and 7th International Workshop, ML-CDS 2017, Held in Conjunction with MICCAI 2017, Québec City, QC, Canada, September 14, Proceedings 3, 240–248, Springer (2017).
[33] M. Caron, H. Touvron, I. Misra, et al., “Emerging properties in self-supervised vision transformers,” in Proceedings of the International Conference on Computer Vision (ICCV), (2021).
[34] L. Ding and A. Goshtasby, “On the Canny edge detector,” Pattern recognition 34(3), 721–725 (2001).