Diffusion Model with Clustering-based Conditioning for Food Image Generation
Abstract.
Image-based dietary assessment serves as an efficient and accurate solution for recording and analyzing nutrition intake using eating occasion images as input. Deep learning-based techniques are commonly used to perform image analysis such as food classification, segmentation, and portion size estimation, which rely on large amounts of food images with annotations for training. However, such data dependency poses significant barriers to real-world applications, because acquiring a substantial, diverse, and balanced set of food images can be challenging. One potential solution is to use synthetic food images for data augmentation. Although existing work has explored the use of generative adversarial networks (GAN) based structures for generation, the quality of synthetic food images still remains subpar. In addition, while diffusion-based generative models have shown promising results for general image generation tasks, the generation of food images can be challenging due to the substantial intra-class variance. In this paper, we investigate the generation of synthetic food images based on the conditional diffusion model and propose an effective clustering-based training framework, named ClusDiff, for generating high-quality and representative food images. The proposed method is evaluated on the Food-101 dataset and shows improved performance when compared with existing image generation works. We also demonstrate that the synthetic food images generated by ClusDiff can help address the severe class imbalance issue in long-tailed food classification using the VFN-LT dataset.
1. Introduction
Dietary intake has a profound impact on personal health and disease. A healthy diet is essential for maintaining well-being across the lifespan for current and future generations. Dietary factors are crucial in determining the risk of developing obesity, heart disease, diabetes, and even cancer (key2020diet, 46). Dietary assessment has been utilized as a fundamental approach for understanding diet’s effects on human health. Traditional methods of dietary assessment are mainly based on self-reporting, which makes it difficult to obtain an accurate dietary assessment of individuals due to both random and systematic measurement errors (bailey2021overview, 2). Recently, image-based dietary assessment has been widely used to assist researchers and participants in recording dietary intake with high accuracy and efficiency (boushey2017new, 4). Methods such as FoodLog (food_log, 1), DietCam (diet_cam, 47), and FoodCam (8, 44) allow participants to capture their food intake using a mobile device; the captured images are then analyzed by trained dietitians to estimate the nutrient composition. To further reduce the human effort in analyzing the captured eating occasion images, several recent approaches (shao2021_ibdasystem, 69) automate this process by using deep learning-based techniques to recognize the food (he2020multitask, 29, 27, 53) and estimate the portion size (fang-icip2018, 14, 13, 68) in the captured images. Besides using mobile devices to capture food intake images, passive capture of eating occasion images utilizes wearable cameras assisted by on-device sensors to collect dietary intake data (sun2022improved, 73, 24).
Image-based dietary assessment methods reduce the time and labor required for nutrient analysis and allow real-time feedback to participants. The major challenge with deep learning-based approaches is that their performance depends heavily on the quality and quantity of the datasets used for training the models. It is difficult to obtain a large number of accurately annotated food images in real-world scenarios. Furthermore, food images collected in daily life usually suffer from data imbalance because some foods are consumed more frequently than others (he2022long, 25, 32). One way to increase the size of a food image dataset is to use data augmentation methods to create more training samples. Traditional data augmentation methods utilize basic image manipulations including color space transformations, random erasing, geometric transformations, and kernel filtering (castro2018elastic, 7, 57). However, these augmentation methods do not work well when the training data is limited and can result in overfitting (shorten2019survey, 71). With the development of Generative Adversarial Networks (GANs), generating synthetic images from GANs for data augmentation has become widely used across deep learning methods (deepsynth, 10, 23).
Generative adversarial networks (GANs) have shown great potential in generating realistic synthetic images (goodfellow2014generative, 20). Moreover, conditional generative models (mirza2014conditional, 56) can generate images for a specific food class. Previous works on synthetic food image generation (ito2018food, 37, 36, 17) have typically focused on only one or a limited number of food classes, primarily due to the challenge of creating a dataset with a diverse variety of foods. This limitation in scale narrows the applicability of current food image generation methods for data augmentation. Diffusion models have recently become a prominent research direction among generative models, showing great capability in image generation (rombach2022high, 65), super-resolution (gao2023implicit, 19), image modification (kawar2023imagic, 45), and image inpainting (lugmayr2022repaint, 51). However, there is no existing work utilizing the diffusion model for food image generation. One of the major challenges of applying the diffusion model is that food data usually suffers from high intra-class variance due to differences in recipes or cooking methods for foods made from the same ingredients (jiang2020deepfood, 38). As a result, mode collapse can easily occur during the training of the image generation model. Mode collapse refers to the condition where the model only generates a limited set of examples instead of covering the entire training data distribution (thanh2020catastrophic, 75).
In this paper, we first explore the performance of latent diffusion methods on food image generation by fine-tuning a stable diffusion model (rombach2022high, 65) on the Food-101 dataset (bossard2014food, 3) and comparing the results to images generated using one of the latest GAN-based methods, StyleGAN3 (karras2021alias, 41). We propose a novel clustering-based training framework, ClusDiff, to address the high intra-class variance issue when training the food image generation model. For training ClusDiff, food images within each class are clustered into sub-classes, and the model is then trained with these sub-classes. When generating images for a specific class of food, ClusDiff is conditioned on its sub-class labels, where the occurrence of the labels follows the distribution of sub-classes within the class. We evaluate ClusDiff on the Food-101 dataset and demonstrate that our proposed method achieves better food image generation performance in terms of the Frechet Inception Distance Metric (FID) (heusel2017gans, 34) compared to the baseline, which is a fine-tuned stable diffusion model. We also explore the benefit of using synthetic images for data augmentation by providing a case study that uses synthetic food images generated by ClusDiff to address the class-imbalance issue in the VFN-LT dataset (he2022long, 25) for long-tailed food image classification. Overall, the contributions of this work can be summarized as follows:
- This work represents the first investigation and application of the diffusion model to food image generation.
- We propose ClusDiff, a diffusion model with clustering-based conditioning, and show that the proposed method generates better food images than existing methods.
- We demonstrate synthetic food images generated by ClusDiff can be used to address the data imbalance issue in food classification through a case study.
2. Related Work
2.1. Image-based Dietary Assessment
With recent advances in machine learning and modern computer vision, image-based dietary assessment methods have been introduced to automatically record and analyze participants’ food intake. The most important aspect of image-based dietary assessment is to detect and classify the food and beverages in the images captured by participants. DiaWear(shroff2008wearable, 72) is one of the pioneering works that uses a neural network to perform context-aware food recognition on food images captured by mobile devices. Deep learning-based detection methods, such as Faster R-CNN(girshick2017faster, 64) and YOLO(redmon2016you, 62) are frequently used for food detection and have shown good results when trained on a large number of ground truth images(han2021improving, 24, 11, 40). The most recent work focuses on estimating food portion size or energy directly from the eating occasion images. In (fang2016comparison, 15), a geometric model is introduced to estimate the food portion size. Conditional GAN (mirza2014conditional, 56) is used in (fang-icip2018, 14) to generate the energy distribution map with an end-to-end regression framework for food energy estimation (fang2019end, 13), which is further improved in (shao2021towards, 68) by integrating information from RGB images and in (shao2023endtoend, 70) by leveraging the 3D reconstruction. By combining these components, systems like goFOOD(lu2020gofoodtm, 50) and Technology-Assisted Dietary Assessment (TADA)(shao2021_ibdasystem, 69) can provide an end-to-end automatic nutrient estimation for a meal.
Despite the promising advancements in image-based dietary assessment methods, the performance of deep learning-based methods relies heavily on the quality and quantity of the datasets. Training a model for food recognition or portion estimation requires a large, diverse, and accurately annotated dataset. Nevertheless, the food images we collect from real-world scenarios often exhibit stark differences from established food image datasets such as Food2K (Food2K, 55), Food-101 (bossard2014food, 3), and UEC-256 (uec-256, 43). This discrepancy primarily stems from inherent limitations associated with real-world data, such as variability, inconsistent sizes, and potential inaccuracies in annotations. While various food recognition systems have been developed to target real-world scenarios, such as continual learning (ILIO, 28, 31, 26, 61, 30), few-shot learning (food_fewshot, 39), and long-tailed classification (gao2022dynamic_LTingredient, 18, 25, 32), they usually necessitate more sophisticated training regimes. This amplification in complexity invariably leads to a surge in computational requirements, which are typically constrained in real-world situations. A potential solution to mitigate the problem of data scarcity in the real world without requiring any changes to the existing image-based dietary assessment system is to use synthetic food images for the training process. However, generating synthetic food images could be challenging due to inter-class similarity and intra-class variability. In this work, we investigate the application of the stable diffusion model(rombach2022high, 65) for synthetic food image generation. We propose a novel framework for training a stable diffusion model to generate more diverse synthetic food images and demonstrate that these images can be used to address the data imbalance issue in tasks involving long-tailed food classification.

2.2. Synthetic Image Generation
One of the main challenges of supervised machine learning or deep learning methods is the requirement for a large amount of annotated training samples, and the performance heavily relies on the quality of the dataset. One way to address this issue is by generating high-quality image data to augment the training dataset. For a long time, Generative Adversarial Networks (GANs) have been the most impactful methods and have shown remarkable results for synthetic image generation. These synthetic images have found applications in various fields, from video games to forensic scenarios. Many applications have utilized GANs to create synthetic images that can be used for generating ground truth information used in training machine learning networks (deepsynth, 10, 23).
There are a few studies on applying GANs to food image synthesis, such as RamenGAN (ito2018food, 37), the Multi-ingredient Pizza Image Generator (MPG) (han2020mpg, 22), and CookGAN (han2020cookgan, 21). RamenGAN (ito2018food, 37) and its related work (horita2019unseen, 36) generate synthetic food images of noodles and rice based on a conditional GAN network. MPG (han2020mpg, 22) uses a conditional StyleGAN2 (karras2020analyzing, 42) to generate realistic pizza images with various ingredients. CookGAN (han2020cookgan, 21) proposes a method to generate a meal image conditioned on a list of ingredients, with the food types limited to salads, cookies, and muffins.
In recent years, diffusion models have shown the ability to produce high-quality images while maintaining ease of use. The diffusion model defines a Markov chain to gradually add random noise to the data, and the model learns to reverse the diffusion process. As a result, the trained model is capable of constructing the desired data samples from the noise (ho2020denoising, 35). In (dhariwal2021diffusion, 9), the authors demonstrate that the diffusion model outperforms GANs and achieves better image generation results on multiple datasets. Additionally, in (rombach2022high, 65), the authors propose latent diffusion models, which apply the diffusion process in the latent space of the data, resulting in a computational reduction and a boost in the visual fidelity of the generated images. Despite diffusion models being applied to many downstream tasks (pinaya2022brain, 60, 8, 52), the study of applying them to food image generation is still lacking. In this paper, our work represents the first investigation and application of the diffusion model to food image generation.
3. Method
A block diagram of our proposed clustering-based conditional training strategy for the latent diffusion model, ClusDiff, is shown in Fig. 1. The system includes three parts: latent diffusion model training and inference, feature space food image clustering using affinity propagation, and clustering-based conditioning. The motivation for this work is that food datasets usually exhibit high intra-class variance, with different cooking methods leading to variations within the same food class. As a result, training a generative model with simple class labels may not sufficiently cover the distribution in the training dataset. In the training step of the generative model, we employ a clustering method to divide food images from a class label into several sub-classes based on their visual representations. Subsequently, the generative model is trained conditionally on these sub-classes to further capture the variations within a food class. When we use this trained generative model to generate food images, a food class label for conditional generation is represented as one of its sub-classes, determined by the frequency of occurrence in the training dataset. By doing this, the food images generated using ClusDiff better represent the distribution of the training dataset.
3.1. Latent Diffusion Model
There are two processes in diffusion models (ho2020denoising, 35): (i) a forward diffusion process to diffuse an image into random noise, and (ii) a reverse diffusion process that converts the noise into the image from a learned data distribution during the iterative denoising process.
In the forward diffusion process, a Markov chain is employed to gradually add noise to transform the training data into Gaussian noise. This discrete Markov chain is constructed with a predefined sequence of variance values $\beta_1, \dots, \beta_T$. Given the input data distribution $x_0 \sim q(x_0)$, the forward diffusing process at time $t$ is defined as Equation 1:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t \mathbf{I}\big), \qquad x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon \qquad (1)$$

where $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ denotes the injected noise, and the forward process can be interpreted as sampling $x_t$ at any time step $t$ in a closed form by using the notation $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, which represent the noise level at each time step. At the final time step $T$, $x_T$ approximately becomes isotropic Gaussian noise.
In the reverse diffusion process, we can recover the input data distribution $x_0$ from the Gaussian noise $x_T$. The equations for this process are defined as follows:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \qquad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\big) \qquad (2)$$
The training objective of the diffusion model is to find a denoising autoencoder $\epsilon_\theta(x_t, t)$ to predict the injected noise $\epsilon$. The corresponding objective can be simplified as follows:

$$L_{DM} = \mathbb{E}_{x_0,\, \epsilon \sim \mathcal{N}(0, \mathbf{I}),\, t}\Big[\big\| \epsilon - \epsilon_\theta(x_t, t) \big\|_2^2\Big] \qquad (3)$$

where $t$ is uniformly sampled from $\{1, \dots, T\}$.
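To make the training objective concrete, the following is a minimal PyTorch sketch of one diffusion training step: the input is diffused in closed form as in Equation 1, and the network is trained to predict the injected noise as in Equation 3. The variance schedule values and the `model` network are illustrative placeholders rather than the actual configuration used in this work.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of one DDPM training step (Equations 1 and 3).
# `model` is any noise-prediction network epsilon_theta(x_t, t).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # predefined variance schedule beta_1..beta_T
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # \bar{alpha}_t = prod_{s<=t} alpha_s

def diffusion_loss(model, x0):
    """Sample a time step, diffuse x0 in closed form, and predict the noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)       # time step sampled uniformly (0-indexed)
    noise = torch.randn_like(x0)                           # epsilon ~ N(0, I)
    a_bar = alpha_bars.to(x0.device)[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # closed-form forward diffusion (Eq. 1)
    pred = model(x_t, t)                                   # epsilon_theta(x_t, t)
    return F.mse_loss(pred, noise)                         # simplified objective (Eq. 3)
```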
The latent diffusion model (rombach2022high, 65) follows similar processes as the diffusion model, except that it uses the latent space of images as input samples to the diffusion model. It utilizes a perceptual compression model (esser2021taming, 12) consisting of an encoder $\mathcal{E}$ and a decoder $\mathcal{D}$. The encoder $\mathcal{E}$ is used to map the image $x$ to the latent space $z = \mathcal{E}(x)$, and the decoder $\mathcal{D}$ is used to reconstruct the image from the latent space. The latent space representation reduces the dimension of the input data, therefore focusing on the critical features of the input data and reducing the computational complexity of the diffusion model.
3.2. Image Clustering using Affinity Propagation
We utilize Affinity Propagation (dueck2009affinity, 16) to cluster the training images from one class into sub-classes. This helps in addressing the high intra-class variance problem of the food dataset. A pre-trained ResNet-18 (he2016deep, 33), fine-tuned on the Food-101 dataset for food classification, is used to map the images into feature vectors. These feature vectors represent the visual features of the input food images and are used for clustering. Affinity Propagation creates clusters by sending messages between samples until convergence, and it does not require a pre-defined number of clusters. A small number of the most representative samples, referred to as exemplars, are used to describe the dataset. Messages sent between pairs of samples indicate how well one sample can serve as the exemplar of the other, and these messages are iteratively updated in response to the values from other pairs. The updating process continues until convergence, and the final exemplars are used to form the clusters.
Suppose $x_i$ and $x_k$ are the feature vectors of two food images. We define $s(i, k)$ as the similarity between $x_i$ and $x_k$, which is the cosine distance between the two vectors. In Affinity Propagation, the two messages sent between samples are the responsibility $r(i, k)$ and the availability $a(i, k)$. Responsibility $r(i, k)$ describes how well suited $x_k$ is to serve as the exemplar for $x_i$, relative to other candidate exemplars for $x_i$, and it is defined as:

$$r(i, k) = s(i, k) - \max_{k' \neq k} \big\{ a(i, k') + s(i, k') \big\} \qquad (4)$$
Availability $a(i, k)$ describes how appropriate it would be for $x_i$ to pick $x_k$ as its exemplar, taking into account other samples' preference for $x_k$ as an exemplar, and it is defined as:

$$a(i, k) = \min\Big(0,\; r(k, k) + \sum_{i' \notin \{i, k\}} \max\big(0,\, r(i', k)\big)\Big) \qquad (5)$$
Both responsibility and availability are set to zero at the beginning and updated during each iteration until convergence. A damping factor $\lambda$ is used during the process to prevent numerical oscillations. The update functions for both responsibility and availability during each iteration are shown in Equation 6:

$$r_{t+1}(i, k) = \lambda \cdot r_{t}(i, k) + (1 - \lambda) \cdot \hat{r}_{t+1}(i, k), \qquad a_{t+1}(i, k) = \lambda \cdot a_{t}(i, k) + (1 - \lambda) \cdot \hat{a}_{t+1}(i, k) \qquad (6)$$

where $t$ indicates the iteration step, and $\hat{r}_{t+1}(i, k)$ and $\hat{a}_{t+1}(i, k)$ are the responsibility and availability values computed from Equations 4 and 5 at the current iteration.
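The sketch below illustrates this clustering step for a single food class using scikit-learn's AffinityPropagation on ResNet-18 features. The backbone weights (ImageNet, as a stand-in for the Food-101 fine-tuned classifier described above), the `images` tensor, and the use of cosine similarity as the precomputed affinity are assumptions for illustration.

```python
import torch
import torchvision
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import cosine_similarity

# Sketch of sub-class discovery for one food class, assuming `images` is a
# float tensor of shape (N, 3, 224, 224) holding that class's training images.
backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()            # keep the 512-d pooled features
backbone.eval()

with torch.no_grad():
    feats = backbone(images).numpy()         # (N, 512) visual feature vectors

# Pairwise similarity s(i, k); Affinity Propagation selects exemplars without a
# pre-defined number of clusters. `damping` is the factor lambda in Equation 6.
sim = cosine_similarity(feats)
clusterer = AffinityPropagation(affinity="precomputed", damping=0.5, random_state=0)
sub_class_ids = clusterer.fit_predict(sim)   # cluster ID assigned to each image
```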

3.3. Clustering-based Conditioning
After the input food images are clustered into sub-classes, we denote the new class label of each food image by its food name followed by the cluster ID (e.g., burger_1 for an image of a burger in cluster 1 of its class). A diffusion model is then trained to model the conditional distributions $p(z \mid y)$, where $z = \mathcal{E}(x)$ is the latent representation of the input image $x$ and $y$ is the new class label of the input image. To achieve this, we train a conditional denoising autoencoder $\epsilon_\theta(z_t, t, y)$, as mentioned in Section 3.1, which allows us to control the image generation process through the class label $y$. By conditioning the diffusion model on the new class label, we can generate food images that belong to a specific cluster of a food class, and this helps in generating more diverse food images within each food class.
Similar to the latent diffusion model (rombach2022high, 65), we use a U-Net with an attention mechanism as our conditional denoising autoencoder. Figure 2 shows how the sub-class labels control the denoising process through the U-Net. To incorporate the sub-class information into the denoising process, we use a pre-trained CLIP text encoder $\tau_\theta$ to project $y$ into an embedding vector $\tau_\theta(y)$, which is then added to the intermediate layers of the U-Net using a cross-attention mechanism. The implementation of the cross-attention mechanism is shown in the following equation:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d}}\right) \cdot V, \qquad Q = W^{(i)}_{Q} \cdot \varphi_i(z_t), \quad K = W^{(i)}_{K} \cdot \tau_\theta(y), \quad V = W^{(i)}_{V} \cdot \tau_\theta(y) \qquad (7)$$

where $Q$, $K$, and $V$ are the query, key, and value for the attention mechanism, respectively. $i$ denotes the intermediate layer of the U-Net, and $\varphi_i(z_t)$ is the intermediate vector representation of the U-Net. $W^{(i)}_{Q}$, $W^{(i)}_{K}$, and $W^{(i)}_{V}$ are learnable projection matrices, and $\mathrm{softmax}\big(Q K^{T} / \sqrt{d}\big)$ is the attention map of the cross-attention mechanism.
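A minimal sketch of the cross-attention layer in Equation 7 is given below, where the U-Net features supply the queries and the CLIP embedding of the sub-class label supplies the keys and values. The feature dimensions are illustrative assumptions and do not correspond to the actual stable diffusion U-Net configuration.

```python
import math
import torch
import torch.nn as nn

# Sketch of the cross-attention in Equation 7 with illustrative dimensions.
class CrossAttention(nn.Module):
    def __init__(self, unet_dim=320, text_dim=768, d=64):
        super().__init__()
        self.W_Q = nn.Linear(unet_dim, d, bias=False)   # W_Q^(i)
        self.W_K = nn.Linear(text_dim, d, bias=False)   # W_K^(i)
        self.W_V = nn.Linear(text_dim, d, bias=False)   # W_V^(i)
        self.d = d

    def forward(self, phi, tau_y):
        # phi:   (B, N, unet_dim) flattened intermediate U-Net representation phi_i(z_t)
        # tau_y: (B, L, text_dim) CLIP embedding of the sub-class label tau_theta(y)
        Q, K, V = self.W_Q(phi), self.W_K(tau_y), self.W_V(tau_y)
        attn = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(self.d), dim=-1)
        return attn @ V                                 # Attention(Q, K, V)
```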
The new training objective of the conditional latent diffusion model is shown below:

$$L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0, \mathbf{I}),\, t}\Big[\big\| \epsilon - \epsilon_\theta\big(z_t, t, \tau_\theta(y)\big) \big\|_2^2\Big] \qquad (8)$$

where both the denoising autoencoder $\epsilon_\theta$ and the text encoder $\tau_\theta$ are jointly optimized.
When generating images for a specific class of food, the latent diffusion model is conditioned with its sub-class labels to control the generation process, where the occurrence of the labels will follow the distribution of sub-classes within the class.
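The sketch below illustrates this conditioning step: for a requested food class, a sub-class label (e.g., burger_1) is drawn with probability equal to its frequency among the training images of that class. The label format and the `train_labels` list are assumptions for illustration.

```python
import random
from collections import Counter

# Sketch of sub-class conditioning at generation time. `train_labels` is an
# assumed list of sub-class labels produced by the clustering step,
# e.g., ["burger_0", "burger_1", "sushi_2", ...].
def sample_sub_class(food_name, train_labels):
    counts = Counter(l for l in train_labels if l.startswith(food_name + "_"))
    sub_classes = list(counts.keys())
    weights = [counts[s] for s in sub_classes]          # empirical sub-class frequencies
    return random.choices(sub_classes, weights=weights, k=1)[0]

# Example: conditioning labels for 100 synthetic burger images follow the
# sub-class distribution of the "burger" class.
# prompts = [sample_sub_class("burger", train_labels) for _ in range(100)]
```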

4. Experiments
In this section, we first train a conditional StyleGAN3 (karras2021alias, 41) on the Food-101 (bossard2014food, 3) dataset as the GAN-based method. We then fine-tune a pre-trained stable diffusion model (rombach2022high, 65) on the Food-101 dataset as the baseline diffusion method. Next, we fine-tune a pre-trained stable diffusion model with clustering-based conditioning as our proposed ClusDiff. We evaluate ClusDiff and compare it to the GAN-based method and the baseline diffusion method using the FID score (heusel2017gans, 34). Finally, we demonstrate the benefit of using generated synthetic food images for data augmentation by providing a case study that addresses the class-imbalance issue on the VFN-LT (he2022long, 25) dataset for long-tailed food image classification.
4.1. Dataset
4.1.1. Food-101
The Food-101 (bossard2014food, 3) dataset consists of 101 food categories, each containing 1,000 images. The dataset contains a small amount of noisy data, such as images with intense colors or wrong labels. All images in Food-101 were rescaled to have a maximum side length of 512 pixels. We further resize the images to 512 by 512 pixels for training the generative models.
4.1.2. VFN-LT
The VFN-LT (he2022long, 25) dataset is the long-tailed version of the VFN dataset (mao2020visual, 54), constructed by removing data from each food class based on real-world food consumption frequency (lin2021most, 48). Overall, the training set of VFN-LT is highly imbalanced, containing 2.5K images from 74 common food classes with a maximum of 288 images per class and a minimum of 1 image per class. The test set of VFN-LT is kept balanced with 25 images per class for the long-tailed classification problem.
4.2. Evaluation Metric
4.2.1. Frechet Inception Distance Metric (FID)
FID (heusel2017gans, 34) is a metric that calculates the distance between the feature distributions of the training and generated images, where features are extracted by a pre-trained Inception-v3 (szegedy2016rethinking, 74) model. A lower FID score indicates that the feature distribution of the generated images is close to that of the real images, suggesting that the generator captures the data distribution of the training images well. A higher FID score indicates that the generated images deviate from the real images, meaning the generator performs poorly.
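For reference, the sketch below computes the FID from Inception-v3 features following its standard definition, $\mathrm{FID} = \|\mu_r - \mu_g\|_2^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big)$; the feature arrays are assumed to be precomputed Inception pool features of the real and generated image sets.

```python
import numpy as np
from scipy import linalg

# Sketch of the FID computation, assuming `real_feats` and `gen_feats` are
# (N, 2048) Inception-v3 pool features of real and generated images.
def fid(real_feats, gen_feats):
    mu_r, mu_g = real_feats.mean(0), gen_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)   # matrix square root of Sigma_r Sigma_g
    covmean = covmean.real                                  # discard tiny imaginary parts
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * covmean))
```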
4.2.2. Long-tailed classification
We use Top-1 image classification accuracy as the evaluation metric. In addition, we provide the performance on both head (instance-rich) and tail (instance-rare) classes. Following the benchmark protocol (he2022long, 25), there are 22 head classes and 52 tail classes in the VFN-LT dataset.
4.3. Implementation Details
4.3.1. Food image generation
The StyleGAN3 model is trained on the Food-101 dataset for 150 epochs with a batch size of 8 and a fixed learning rate of 0.002. We did not use pre-trained weights for StyleGAN3 because its released pre-trained models have only been trained on face images, which do not contribute effectively to food image generation. Both the baseline diffusion method and the clustering-based conditional diffusion method (ClusDiff) are fine-tuned from the pre-trained weights 'stable-diffusion-v1-4' for 100 epochs with a batch size of 4 and a fixed learning rate of 1e-5. The pre-trained weights of the stable diffusion model are trained on a subset of the LAION-5B dataset (schuhmann2022laion, 67), which consists of 5.85 billion general image-text pairs, including some food images. It is common practice to fine-tune these pre-trained weights for downstream applications.
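As an illustration of the generation setup, the sketch below loads the 'stable-diffusion-v1-4' weights with the Hugging Face diffusers library and samples an image from a sub-class prompt. The checkpoint identifier, sampler settings, and prompt string are assumptions for illustration; in practice the fine-tuned ClusDiff weights would be loaded in place of the original checkpoint.

```python
import torch
from diffusers import StableDiffusionPipeline

# Sketch of sampling a food image from stable diffusion weights with a
# sub-class prompt following the naming convention of Section 3.3.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Generate one image conditioned on an assumed sub-class label.
image = pipe("hamburger_2", num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("hamburger_sample.png")
```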
4.3.2. Long-tailed classification
We use ResNet-18 (he2016deep, 33) pre-trained on ImageNet (IMAGENET1000, 66) as the backbone and train the model for 150 epochs with batch size 128 using the SGD optimizer. The initial learning rate is set as 0.1 and decreased to 0.0001 using a cosine learning rate scheduler.
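A minimal sketch of this classifier training setup is given below; the momentum and weight decay values are assumptions not specified above.

```python
import torch
import torchvision

# Sketch of the long-tailed classification setup: ImageNet pre-trained
# ResNet-18, SGD, and a cosine learning rate schedule over 150 epochs.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, 74)      # 74 VFN-LT food classes

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)   # momentum/decay assumed
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=150, eta_min=1e-4)                    # decay lr from 0.1 to 0.0001

# for epoch in range(150):
#     train_one_epoch(model, optimizer)   # standard cross-entropy training loop
#     scheduler.step()
```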
Table 1. FID scores of different food image generation methods on the Food-101 dataset.

| Methods | FID |
|---|---|
| StyleGAN3 (karras2021alias, 41) | 39.05 |
| Finetuned Latent Diffusion (rombach2022high, 65) | 30.39 |
| ClusDiff | 27.73 |
4.4. Image Generation Results on Food-101
Compared methods. We compare our proposed ClusDiff with two other techniques: one of the latest GAN-based methods, StyleGAN3, and the baseline latent diffusion method. StyleGAN architectures have shown an ability to generate highly realistic images across multiple applications, including generating food images (han2020mpg, 22). Therefore, we choose the latest variant, StyleGAN3 (karras2021alias, 41), as the GAN-based method used for comparison. As for the baseline diffusion method, we select the latent diffusion model(rombach2022high, 65) to showcase the effectiveness of our proposed clustering-based training framework. This will allow us to demonstrate the improvements achieved by our modifications compared to the baseline.
We generate 100 synthetic images for each type of food in the Food-101 dataset using each image generation method and compare the FID scores. Fig. 3 shows the sample results of synthetic food images generated by each method. We manually selected three classes of food for demonstration, and the generated food images were randomly chosen. Table 1 shows the comparison results for different food image generation methods.
From Fig. 3, we can see that compared to the synthetic food images generated by StyleGAN3, the images generated by the latent diffusion method show significant improvement in the background of food images. Furthermore, the latent diffusion method generates more refined details in the food region compared to StyleGAN3, as evidenced by the intricate textures of the sushi fillings and the distinct presentation of hamburger ingredients. The FID scores in Table 1 further confirm our visual inspection, showing that the baseline diffusion method outperforms StyleGAN3 by approximately 10 FID points.
As shown in Fig. 3, the visual difference between the synthetic images generated by the fine-tuned latent diffusion method and our proposed ClusDiff is minimal. This is because ClusDiff maintains the original network structure of the latent diffusion model.
The comparison between ClusDiff and the baseline latent diffusion model is shown in Table 1. Since both methods are fine-tuned on the same pre-trained weight, this comparison demonstrates that ClusDiff significantly enhances the FID results of the latent diffusion model. This improvement indicates that our proposed modifications in ClusDiff effectively assist the latent diffusion model in capturing the underlying distribution from the training images, leading to better diversity and fidelity in the generated images.
Table 2. Long-tailed food classification results on the VFN-LT dataset.

| Methods | Head (%) | Tail (%) | Overall (%) |
|---|---|---|---|
| Baseline | 62.3 | 24.4 | 35.8 |
| ROS (ros, 76) | 61.7 | 24.9 | 35.9 |
| RUS (rus_ros, 5) | 54.6 | 26.3 | 34.8 |
| CMO (cmo, 58) | 60.8 | 33.6 | 42.1 |
| LDAM (ldam, 6) | 60.4 | 29.7 | 38.9 |
| BS (bsloss, 63) | 61.3 | 32.9 | 41.9 |
| IB (ibloss, 59) | 60.2 | 30.8 | 39.6 |
| Focal (focalloss, 49) | 60.1 | 28.3 | 37.8 |
| Food2stage (he2022long, 25) | 61.9 | 37.8 | 45.1 |
| Baseline + ClusDiff | 68.7 | 42.4 | 49.5 |
4.5. Long-tailed Food Classification Results on VFN-LT
We demonstrate the effective use of synthetic food images in addressing the class imbalance associated with the long-tailed classification problem. Specifically, we generate food images for each of the 74 food classes in the VFN-LT dataset to construct a balanced training set. For a fair evaluation, the food images are generated using a stable diffusion model with our clustering-based conditioning, trained on the Food-101 dataset.
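A simple sketch of how a balanced set can be constructed is shown below, where each class is topped up with synthetic images until it matches the largest class. The exact balancing target is an assumption, and `train_labels` is an assumed list of class labels for the original VFN-LT training images.

```python
from collections import Counter

# Sketch of balancing the VFN-LT training set with synthetic images.
counts = Counter(train_labels)                   # images per class in the imbalanced set
target = max(counts.values())                    # assumed target: size of the largest class
num_to_generate = {cls: target - n for cls, n in counts.items()}

# Each missing image would then be generated by the Food-101 fine-tuned ClusDiff
# model, conditioned on sub-class labels sampled as described in Section 3.3.
```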
Compared methods. We compare with existing long-tailed image classification methods, including data augmentation-based approaches such as random over-sampling (ROS) (ros, 76), random under-sampling (RUS) (rus_ros, 5), context-rich oversampling (CMO) (cmo, 58), and visual-aware hybrid sampling (Food2stage) (he2022long, 25). In addition, several loss functions are commonly used to address class imbalance, including label-distribution-aware margin loss (LDAM) (ldam, 6), balanced Softmax (BS) (bsloss, 63), influence-balanced loss (IB) (ibloss, 59), and Focal loss (focalloss, 49). We use vanilla training as the baseline and denote our method as Baseline + ClusDiff, which uses a balanced training set containing both the original and the synthetic food images generated by our proposed ClusDiff.
Table 2 summarizes the results on the VFN-LT dataset. Our method achieves the best performance in terms of head, tail, and overall classification accuracy. Compared with existing data augmentation-based approaches, using synthetic images is more effective, improving generalization and avoiding over-fitting for both instance-rich and instance-rare classes. Moreover, the use of synthetic data streamlines the training process by eliminating the need for new loss functions, offering an advantage over existing methods and showing great potential for real-life applications.
5. Conclusion and Future Work
In this paper, we first demonstrate the promising performance of latent diffusion methods for food image generation by comparing their results with one of the latest GAN-based methods, StyleGAN3. We then propose a clustering-based conditional training framework, ClusDiff, to address the challenge of high intra-class variance within food datasets. This framework enables us to generate more diverse and representative food images that better reflect the underlying distribution of the training data. Experimental results show that our proposed method outperforms the baseline diffusion model, showcasing its effectiveness in enhancing food image generation. Additionally, we conduct a case study on the use of synthetic food images to address the class-imbalance issue in long-tailed food classification.
In the future, we plan to investigate integrating cluster information into the loss function or network structure of the diffusion model to better capture the variance of food images. This approach could potentially lead to further improvements in the quality and diversity of the generated food images.
References
- (1) Kiyoharu Aizawa and Makoto Ogawa “FoodLog: Multimedia Tool for Healthcare Applications” In IEEE MultiMedia 22.2, 2015, pp. 4–8
- (2) Regan L Bailey “Overview of dietary assessment methods for measuring intakes of foods, beverages, and dietary supplements in research studies” In Current Opinion in Biotechnology 70 Elsevier, 2021, pp. 91–96
- (3) Lukas Bossard, Matthieu Guillaumin and Luc Van Gool “Food-101–mining discriminative components with random forests” Zurich, Switzerland In Proceedings of the European Conference on Computer Vision, 2014, pp. 446–461 Springer
- (4) CJ Boushey et al. “New mobile methods for dietary assessment: review of image-assisted and image-based dietary assessment methods” London, UK In Proceedings of the Nutrition Society 76.3 Cambridge University Press, 2017, pp. 283–294
- (5) Mateusz Buda, Atsuto Maki and Maciej A Mazurowski “A systematic study of the class imbalance problem in convolutional neural networks” In Neural Networks Elsevier, 2018, pp. 249–259
- (6) Kaidi Cao et al. “Learning imbalanced datasets with label-distribution-aware margin loss” In Advances in Neural Information Processing Systems 32, 2019, pp. 1567–1578
- (7) Eduardo Castro, Jaime S Cardoso and Jose Costa Pereira “Elastic Deformations for Data Augmentation in Breast Cancer Mass Detection” Las Vegas, NV In Proceedings of the IEEE EMBS International Conference on Biomedical & Health Informatics, 2018, pp. 230–234 IEEE
- (8) Xin Chen et al. “Executing your Commands via Motion Diffusion in Latent Space” Vancouver, Canada In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18000–18010
- (9) Prafulla Dhariwal and Alexander Nichol “Diffusion models beat gans on image synthesis” In Advances in Neural Information Processing Systems 34, 2021, pp. 8780–8794
- (10) K.W. Dunn et al. “DeepSynth: Three-dimensional nuclear segmentation of biological images using neural networks trained with synthetic data” In Scientific Reports 9.1, 2019, pp. 18295–18309
- (11) Takumi Ege and Keiji Yanai “Multi-task learning of dish detection and calorie estimation” Stockholm, Sweden In Proceedings of the Joint Workshop on Multimedia for Cooking and Eating Activities and Multimedia Assisted Dietary Management, 2018, pp. 53–58
- (12) Patrick Esser, Robin Rombach and Bjorn Ommer “Taming transformers for high-resolution image synthesis” Virtual In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 12873–12883
- (13) S. Fang et al. “An End-to-End Image-Based Automatic Food Energy Estimation Technique Based on Learned Energy Distribution Images: Protocol and Methodology” In Nutrients 11.4 Multidisciplinary Digital Publishing Institute, 2019, pp. 877
- (14) Shaobo Fang et al. “Single-View Food Portion Estimation: Learning Image-to-Energy Mappings Using Generative Adversarial Networks” Athens, Greece In Proceedings of the IEEE International Conference on Image Processing, 2018, pp. 251–255
- (15) Shaobo Fang et al. “A comparison of food portion size estimation using geometric models and depth images” Phoenix, AZ In Proceedings of the IEEE International Conference on Image Processing (ICIP), 2016, pp. 26–30 IEEE
- (16) Brendan J. Frey and Delbert Dueck “Clustering by Passing Messages Between Data Points” In Science 315.5814, 2007, pp. 972–976
- (17) Wenjin Fu et al. “Conditional synthetic food image generation” San Francisco, CA In Proceedings of the IS&T International Symposium on Electronic Imaging 35 Society for Imaging Science and Technology, 2023, pp. 1–6
- (18) Jixiang Gao, Jingjing Chen, Huazhu Fu and Yu-Gang Jiang “Dynamic Mixup for Multi-Label Long-Tailed Food Ingredient Recognition” In IEEE Transactions on Multimedia IEEE, 2022, pp. 1–10
- (19) Sicheng Gao et al. “Implicit diffusion models for continuous super-resolution” Vancouver, Canada In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10021–10030
- (20) Ian J. Goodfellow et al. “Generative Adversarial Nets” Montreal, Canada In Proceedings of the International Conference on Neural Information Processing Systems, 2014, pp. 2672–2680
- (21) Fangda Han, Ricardo Guerrero and Vladimir Pavlovic “CookGAN: Meal Image Synthesis from Ingredients” Colorado, USA In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 1450–1458
- (22) Fangda Han, Guoyao Hao, Ricardo Guerrero and Vladimir Pavlovic “MPG: A Multi-ingredient Pizza Image Generator with Conditional StyleGANs” In arXiv preprint arXiv:2012.02821, 2020
- (23) Yue Han et al. “An Ensemble Method With Edge Awareness for Abnormally Shaped Nuclei Segmentation” Vancouver, Canada In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR) Workshops, 2023, pp. 4314–4324
- (24) Yue Han et al. “Improving Food Detection For Images From a Wearable Egocentric Camera” Burlingame, CA In Proceedings of the IS&T International Symposium on Electronic Imaging 33 Society for Imaging Science and Technology, 2021, pp. 1–7
- (25) Jiangpeng He, Luotao Lin, Heather A Eicher-Miller and Fengqing Zhu “Long-Tailed Food Classification” In Nutrients 15.12 MDPI, 2023, pp. 2751
- (26) Jiangpeng He et al. “Long-Tailed Continual Learning For Visual Food Recognition” In arXiv preprint arXiv:2307.00183, 2023
- (27) Jiangpeng He et al. “An end-to-end food image analysis system” Burlingame, CA In Proceedings of the IS&T International Symposium on Electronic Imaging 2021.8 Society for Imaging Science and Technology, 2021, pp. 285–1
- (28) Jiangpeng He, Runyu Mao, Zeman Shao and Fengqing Zhu “Incremental Learning In Online Scenario” Virtual In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 13926–13935
- (29) Jiangpeng He et al. “Multi-task Image-Based Dietary Assessment for Food Recognition and Portion Size Estimation” Virtual In Proceedings of the IEEE Conference on Multimedia Information Processing and Retrieval, 2020, pp. 49–54
- (30) Jiangpeng He and Fengqing Zhu “Exemplar-Free Online Continual Learning” In 2022 IEEE International Conference on Image Processing, 2022, pp. 541–545
- (31) Jiangpeng He and Fengqing Zhu “Online Continual Learning for Visual Food Classification” Virtual In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2021, pp. 2337–2346
- (32) Jiangpeng He and Fengqing Zhu “Single-Stage Heavy-Tailed Food Classification” In arXiv preprint arXiv:2307.00182, 2023
- (33) Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun “Deep residual learning for image recognition” Las Vegas, NV In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778 IEEE
- (34) Martin Heusel et al. “Gans trained by a two time-scale update rule converge to a local nash equilibrium” In Advances in Neural Information Processing Systems 30, 2017, pp. 6629–6640
- (35) Jonathan Ho, Ajay Jain and Pieter Abbeel “Denoising Diffusion Probabilistic Models” In Advances in Neural Information Processing Systems 33 Curran Associates, Inc., 2020, pp. 6840–6851
- (36) Daichi Horita, Wataru Shimoda and Keiji Yanai “Unseen Food Creation by Mixing Existing Food Images with Conditional StyleGAN” Nice, France In Proceedings of the International Workshop on Multimedia Assisted Dietary Management, 2019, pp. 19–24
- (37) Yoshifumi Ito, Wataru Shimoda and Keiji Yanai “Food Image Generation Using a Large Amount of Food Images with Conditional GAN: RamenGAN and RecipeGAN” Stockholm, Sweden In Proceedings of the Joint Workshop on Multimedia for Cooking and Eating Activities and Multimedia Assisted Dietary Management, 2018, pp. 71–74
- (38) Landu Jiang et al. “DeepFood: food image analysis and dietary assessment via deep model” In IEEE Access 8 IEEE, 2020, pp. 47477–47489
- (39) Shuqiang Jiang, Weiqing Min, Yongqiang Lyu and Linhu Liu “Few-Shot Food Recognition via Multi-View Representation Learning” In ACM Transactions on Multimedia Computing, Communications, and Applications 16.3, 2020, pp. 1–20
- (40) Fahad Jubayer et al. “Detection of mold on the food surface using YOLOv5” In Current Research in Food Science 4 Elsevier, 2021, pp. 724–728
- (41) Tero Karras et al. “Alias-free generative adversarial networks” In Advances in Neural Information Processing Systems 34, 2021, pp. 852–863
- (42) Tero Karras et al. “Analyzing and Improving the Image Quality of StyleGAN” Colorado, USA In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 8110–8119
- (43) Yoshiyuki Kawano and Keiji Yanai “Automatic expansion of a food image dataset leveraging existing categories with domain adaptation” Zurich, Switzerland In Proceedings of the European Conference on Computer Vision Workshop, 2014, pp. 3–17 Springer
- (44) Yoshiyuki Kawano and Keiji Yanai “FoodCam: A Real-Time Mobile Food Recognition System Employing Fisher Vector” Dublin, Ireland In Proceedings of the International Conference on Multimedia Modeling, 2014, pp. 369–373
- (45) Bahjat Kawar et al. “Imagic: Text-based real image editing with diffusion models” Vancouver, Canada In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6007–6017
- (46) Timothy J Key et al. “Diet, nutrition, and cancer risk: what do we know and what is the way forward?” In BMJ 368 British Medical Journal Publishing Group, 2020, pp. m511
- (47) Fanyu Kong and Jindong Tan “DietCam: Automatic dietary assessment with mobile camera phones” In Pervasive and Mobile Computing 8.1, 2012, pp. 147–163
- (48) Luotao Lin, Fengqing Zhu, Edward Delp and Heather Eicher-Miller “The Most Frequently Consumed and the Largest Energy Contributing Foods of US Insulin Takers Using NHANES 2009–2016” In Current Developments in Nutrition 5 Oxford University Press, 2021, pp. 426–426
- (49) Tsung-Yi Lin et al. “Focal loss for dense object detection” Honolulu, HI In Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988
- (50) Ya Lu et al. “goFOODTM: an artificial intelligence system for dietary assessment” In Sensors 20.15 MDPI, 2020, pp. 4283
- (51) Andreas Lugmayr et al. “Repaint: Inpainting using denoising diffusion probabilistic models” New Orleans, LA In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition IEEE, 2022, pp. 11461–11471
- (52) Shitong Luo and Wei Hu “Diffusion probabilistic models for 3d point cloud generation” Virtual In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2837–2845
- (53) Runyu Mao et al. “Improving Dietary Assessment Via Integrated Hierarchy Food Classification” Virtual In Proceedings of the IEEE International Workshop on Multimedia Signal Processing, 2021, pp. 1–6
- (54) Runyu Mao et al. “Visual aware hierarchy based food recognition” Virtual In Proceedings of the International Conference on Pattern Recognition Workshop, 2021, pp. 571–598
- (55) Weiqing Min et al. “Large scale visual food recognition” In IEEE Transactions on Pattern Analysis and Machine Intelligence 45 IEEE, 2023, pp. 9932–9949
- (56) Mehdi Mirza and Simon Osindero “Conditional generative adversarial nets” In arXiv preprint arXiv:1411.1784, 2014
- (57) Daniel Mas Montserrat, Qian Lin, Jan Allebach and Edward J Delp “Training Object Detection and Recognition CNN Models Using Data Augmentation” Burlingame, CA In Proceedings of the IS&T International Symposium on Electronic Imaging 2017.10 Society for Imaging Science and Technology, 2017, pp. 27–36
- (58) Seulki Park et al. “The Majority Can Help The Minority: Context-rich Minority Oversampling for Long-tailed Classification” New Orleans, LA In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 6887–6896
- (59) Seulki Park, Jongin Lim, Younghan Jeon and Jin Young Choi “Influence-balanced loss for imbalanced visual classification” Virtual In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 735–744
- (60) Walter HL Pinaya et al. “Brain imaging generation with latent diffusion models” Singapore, Singapore In Proceedings of the MICCAI Workshop on Deep Generative Models, 2022, pp. 117–126 Springer
- (61) Siddeshwar Raghavan, Jiangpeng He and Fengqing Zhu “Online Class-Incremental Learning For Real-World Food Classification” In arXiv preprint arXiv: 2301.05246, 2023
- (62) Joseph Redmon, Santosh Divvala, Ross Girshick and Ali Farhadi “You only look once: Unified, real-time object detection” Las Vegas, NV In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788 IEEE
- (63) Jiawei Ren et al. “Balanced meta-softmax for long-tailed visual recognition” In Advances in Neural Information Processing Systems 33, 2020, pp. 4175–4186
- (64) Shaoqing Ren, Kaiming He, Ross Girshick and Jian Sun “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks” In IEEE Transactions on Pattern Analysis and Machine Intelligence 39.6, 2017, pp. 1137–1149
- (65) Robin Rombach et al. “High-resolution image synthesis with latent diffusion models” New Orleans, LA In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10684–10695 IEEE
- (66) Olga Russakovsky et al. “ImageNet Large Scale Visual Recognition Challenge” In International Journal of Computer Vision 115.3, 2015, pp. 211–252
- (67) Christoph Schuhmann et al. “Laion-5b: An open large-scale dataset for training next generation image-text models” In Advances in Neural Information Processing Systems 35, 2022, pp. 25278–25294
- (68) Zeman Shao et al. “Towards learning food portion from monocular images with cross-domain feature adaptation” Tampere, Finland In Proceedings of the IEEE 23rd International Workshop on Multimedia Signal Processing (MMSP), 2021, pp. 1–6 IEEE
- (69) Zeman Shao et al. “An Integrated System for Mobile Image-Based Dietary Assessment” Virtual In Proceedings of the 3rd Workshop on AIxFood, 2021, pp. 19–23
- (70) Zeman Shao, Gautham Vinod, Jiangpeng He and Fengqing Zhu “An End-to-end Food Portion Estimation Framework Based on Shape Reconstruction from Monocular Image” In arXiv preprint arXiv:2308.01810, 2023
- (71) Connor Shorten and Taghi M Khoshgoftaar “A survey on image data augmentation for deep learning” In Journal of Big Data 6.1 SpringerOpen, 2019, pp. 1–48
- (72) Geeta Shroff, Asim Smailagic and Daniel P Siewiorek “Wearable context-aware food recognition for calorie monitoring” Pittsburgh, PA In Proceedings of the IEEE International Symposium on Wearable Computers, 2008, pp. 119–120 IEEE
- (73) Mingui Sun et al. “Improved wearable devices for dietary assessment using a new camera system” In Sensors 22.20 MDPI, 2022, pp. 8006
- (74) Christian Szegedy et al. “Rethinking the inception architecture for computer vision” Las Vegas, NV In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826
- (75) Hoang Thanh-Tung and Truyen Tran “Catastrophic forgetting and mode collapse in GANs” Glasgow, UK In Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2020, pp. 1–10 IEEE
- (76) Jason Van Hulse, Taghi M Khoshgoftaar and Amri Napolitano “Experimental perspectives on learning from imbalanced data” Corvallis, OR In Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 935–942