
SGD-Mix: Enhancing Domain-Specific Image Classification with Label-Preserving Data Augmentation

Yixuan Dong1  Fang-Yi Su1,2  Jung-Hsien Chiang1
1Department of Computer Science and Information Engineering,
National Cheng Kung University, Tainan City 701, Taiwan, R.O.C.
2Department of Biomedical Informatics,
Harvard Medical School, Harvard University, Boston, MA, USA
radondong@iir.csie.ncku.edu.tw, fang-yi_su@hms.harvard.edu
Equal contribution.
Abstract

Data augmentation for domain-specific image classification tasks often struggles to simultaneously address diversity, faithfulness, and label clarity of generated data, leading to suboptimal performance in downstream tasks. While existing generative diffusion model-based methods aim to enhance augmentation, they fail to cohesively tackle these three critical aspects and often overlook intrinsic challenges of diffusion models, such as sensitivity to model characteristics and stochasticity under strong transformations. In this paper, we propose a novel framework that explicitly integrates diversity, faithfulness, and label clarity into the augmentation process. Our approach employs saliency-guided mixing and a fine-tuned diffusion model to preserve foreground semantics, enrich background diversity, and ensure label consistency, while mitigating diffusion model limitations. Extensive experiments across fine-grained, long-tail, few-shot, and background robustness tasks demonstrate our method’s superior performance over state-of-the-art approaches.

1 Introduction

In the early stages of data augmentation research for domain-specific image classification tasks [33, 28, 56], non-generative mixup-based methods (e.g., Mixup [68] and CutMix [67]) typically linearly blend two images and assign labels to the generated data based on their mixing ratios. However, due to constraints imposed by the linear sample space and issues such as unnatural overlaps in salient regions and unclear boundaries, the generated data often fail to ensure sufficient diversity and faithfulness. Moreover, determining labels solely based on mixing ratios can lead to label-semantic mismatches [3, 41], ultimately hindering performance improvements in downstream tasks.
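
To make the blending scheme above concrete, here is a minimal PyTorch-style sketch of Mixup-style augmentation; the Beta(α, α) mixing prior and one-hot label vectors are conventional choices rather than details taken from this paper.

```python
import torch

def mixup(x1, y1, x2, y2, alpha=1.0):
    """Linearly blend two images and their one-hot labels (Mixup-style).

    The mixed label is determined solely by the mixing ratio `lam`,
    which is precisely what can cause the label-semantic mismatches
    discussed above when the salient content does not blend linearly.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x_mix = lam * x1 + (1.0 - lam) * x2   # pixel-level linear interpolation
    y_mix = lam * y1 + (1.0 - lam) * y2   # label follows the same ratio
    return x_mix, y_mix
```

Here `x1`, `x2` are image tensors of identical shape and `y1`, `y2` their one-hot label vectors.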

Figure 1: Top row: Non-generative mixup-based methods. Bottom row: Generative diffusion-based methods. Inside the dashed box: Methods that mix two different training images and their corresponding labels. Outside the dashed box: Label-preserving methods without label mixing. Note that the translation strength for the bottom row in the figure is consistently set to 0.7.

Subsequent non-generative improvements, such as GuidedMixup [22], which employs saliency-guided pixel-level fusion, and SUMix [41], which integrates semantic distance metrics, have helped alleviate issues related to unnatural overlaps and label-semantic mismatches, thereby improving the quality of the generated data as well as the reliability of label assignments. A growing body of evidence identifies diversity, faithfulness, and label clarity as key factors in the quality and effectiveness of generated data. However, these methods still struggle to enhance these three factors simultaneously.

With the advent of generative models [25, 13, 16], particularly diffusion models [16, 34, 58, 59] capable of nonlinear and high-fidelity synthesis [46, 52, 9, 35], a new direction for data augmentation in domain-specific image classification tasks [4] has emerged. Generated data produced by diffusion models exhibits greater diversity and faithfulness, surpassing previous non-generative mixup-based and GAN-based methods [68, 67, 22, 42, 24, 2]. Contemporary state-of-the-art (SOTA) diffusion-based methods such as Diff-Mix [64] and DiffuseMix [20] have shown promising results.

However, deeper analysis reveals potential limitations: like most existing approaches, these methods, in their design, lack a comprehensive consideration of diversity, faithfulness, and label clarity in generated data. Furthermore, they overlook inherent constraints imposed by the characteristics of diffusion models, which may hinder their ability to fully address these three critical aspects of generated data. We further analyze these methods and their limitations in Section 4 and Supplementary Materials 1 & 2.

Overall, whether considering traditional non-generative or emerging generative methods, most solutions focus on only one or two key factors, lacking a holistic approach that concurrently addresses the diversity, faithfulness, and label clarity of generated data. For a summary of how existing approaches balance these attributes, please refer to Table 1.

In response, we propose Saliency-Guided Diffusion Mix (SGD-Mix), a novel framework that simultaneously ensures the diversity, faithfulness, and label clarity of generated data. As illustrated in Figure 1, SGD-Mix achieves label-preserving, high-quality data augmentation by retaining the semantic foreground of the source image while incorporating a diverse, contextually consistent background from another image, thus avoiding label ambiguity. This is accomplished by leveraging saliency maps [21] to align and mask the source and target images, strategically preserving the foreground of the source image while using the target image as the background. The resulting masked image is then refined by a domain-specific fine-tuned diffusion model. By adjusting the translation strength, SGD-Mix flexibly balances the diversity and faithfulness of the generated images, consistently ensuring that the labels remain unambiguous.

Method Diversity Faithfulness Label Clarity
Mixup [68]
GuidedMixup [22]
DiffuseMix [20]
Diff-Aug [64]
Diff-Mix [64]
SGD-Mix (Ours)
Table 1: Comparison of representative data augmentation methods with respect to Diversity, Faithfulness, and Label Clarity. A green check (✓) indicates that the method guarantees the corresponding factor, while a red cross (✗) indicates it does not.

In summary, our key contributions are:

  • We are the first to explicitly identify and emphasize the importance of simultaneously considering the diversity, faithfulness, and label clarity of generated data for effective data augmentation in domain-specific image classification tasks, rather than focusing on only one or two of these aspects.

  • We propose a simple yet effective data augmentation framework that simultaneously meets all three criteria, demonstrating superior performance across various domain-specific image classification tasks.

  • We conduct comprehensive experiments and analyses, including detailed comparisons with previous methods and thorough ablation studies, to validate the effectiveness and robustness of our proposed approach.

2 Related Work

Saliency Detection and Thresholding Methods. Saliency maps play a crucial role in computer vision tasks by highlighting the most informative regions in an image. Early approaches [21] modeled saliency based on low-level features such as color, intensity, and orientation. Subsequent advancements incorporated spectral residual analysis [17] and deep learning-based gradient methods [57, 69, 54, 70], improving the identification of class-discriminative regions. In many applications, thresholding techniques such as Otsu’s method [37] are used to convert saliency maps into binary masks for segmentation, object detection, and data augmentation.

Mix-Based Data Augmentation. Mix-based augmentation techniques aim to improve model robustness by generating new training samples through the combination of existing ones. Mixup [68] introduced a linear interpolation strategy to blend images and labels, while CutMix [67] replaced rectangular regions between images to preserve more semantic structure. ResizeMix [42] further refined this approach by incorporating scale transformations. More recently, methods like PuzzleMix [24] and GuidedMixup [22] have leveraged saliency information to enhance semantic consistency in mixed samples. These methods have been widely used in domain-specific image classification tasks and continue to evolve with more sophisticated strategies for region selection and blending.

Generative Model-Based Data Augmentation. Generative models have provided a powerful alternative to mix-based augmentation by directly synthesizing new training samples. GANs [13, 2] were among the first models used for this purpose, generating diverse data distributions to improve model generalization [1]. More recently, diffusion models [16, 34, 58, 59] have emerged as a state-of-the-art approach, offering high-fidelity image generation with strong controllability. Several data augmentation methods, such as Diff-Mix [64] and DiffuseMix [20], have utilized diffusion models to generate augmented samples through noise scheduling and prompt-based conditioning [8, 9]. These approaches demonstrate the potential of generative models in enhancing domain-specific classification tasks.

3 Preliminary

3.1 Image-to-image Translation Through Diffusion Models

Let $\mathbf{x}_0^{\mathrm{ref}}$ be a reference image. We define a forward process that injects noise up to time step $\lfloor sT\rfloor$ (where $0\leq s\leq 1$ and $T$ is the total number of timesteps), generating

$$\mathbf{x}_{\lfloor sT\rfloor} = \sqrt{\alpha_{\lfloor sT\rfloor}}\,\mathbf{x}_0^{\mathrm{ref}} + \sqrt{1-\alpha_{\lfloor sT\rfloor}}\,\boldsymbol{\epsilon}, \qquad (1)$$

with $\boldsymbol{\epsilon}$ drawn from a Gaussian distribution and $\alpha_{\lfloor sT\rfloor}$ following a noise schedule. The diffusion model then performs the backward (denoising) steps from $t=\lfloor sT\rfloor$ down to $t=0$:

$$\mathbf{x}_{t-1} = \mathbf{x}_t - \Delta\bigl(\mathbf{x}_t, c, t; \theta\bigr), \quad t=\lfloor sT\rfloor,\dots,1, \qquad (2)$$

where $\Delta(\cdot)$ represents the model-predicted noise removal and sampling correction term. The scalar $s$ is the translation strength: larger $s$ induces higher noise injection (more diverse outputs), while smaller $s$ preserves more details of $\mathbf{x}_0^{\mathrm{ref}}$ [16, 5, 31, 61, 58].
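
For illustration, below is a minimal PyTorch-style sketch of Eqs. (1)–(2); `alphas_cumprod` and `denoise_step` are placeholders standing in for a concrete noise schedule and the model-predicted correction term $\Delta$, so this is a schematic of the translation procedure rather than a full sampler.

```python
import torch

def img2img_translate(x0_ref, alphas_cumprod, denoise_step, s=0.7):
    """SDEdit-style image-to-image translation sketch (Eqs. 1-2).

    alphas_cumprod: 1-D tensor of cumulative schedule coefficients alpha_t.
    denoise_step(x_t, c, t): placeholder for the model's correction Delta.
    s: translation strength; larger s injects more noise (more diversity).
    """
    T = alphas_cumprod.shape[0]
    t_start = max(1, int(s * T))                      # floor(sT)

    # Forward process: jump directly to the noise level of step floor(sT) (Eq. 1)
    eps = torch.randn_like(x0_ref)
    a = alphas_cumprod[t_start - 1]
    x_t = a.sqrt() * x0_ref + (1.0 - a).sqrt() * eps

    # Backward process: iteratively denoise from floor(sT) down to 0 (Eq. 2)
    for t in range(t_start, 0, -1):
        x_t = x_t - denoise_step(x_t, None, t)        # text condition c omitted
    return x_t
```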

3.2 Diffusion Model Fine-tuning

Let $\epsilon_\theta(\mathbf{x}_t, c, t)$ be the noise predictor of a diffusion model, where $\mathbf{x}_t$ is a noisy sample at time step $t$, and $c$ is a text condition (e.g., a prompt). The simplified training objective commonly used is

$$\mathcal{L}(\theta) = \mathbb{E}_{\mathbf{x}_0, t, \boldsymbol{\epsilon}}\bigl\|\boldsymbol{\epsilon} - \epsilon_\theta(\mathbf{x}_t, c, t)\bigr\|^2. \qquad (3)$$

Textual Inversion [11]. Define a new learnable token embedding $\mathbf{v}$, appended to the text encoder vocabulary. During fine-tuning, we replace $c$ with $[\dots,\mathbf{v},\dots]$ in Eq. (3), and optimize $\mathbf{v}$ such that the model captures the desired concept:

$$\min_{\mathbf{v}} \; \mathbb{E}_{\mathbf{x}_0, t, \boldsymbol{\epsilon}}\bigl\|\boldsymbol{\epsilon} - \epsilon_\theta(\mathbf{x}_t, c(\mathbf{v}), t)\bigr\|^2. \qquad (4)$$
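
A schematic training step for Textual Inversion is sketched below; `unet`, `text_encoder`, `scheduler`, and their call signatures are assumed placeholders for an actual diffusion backbone, and only the new embedding `v` is registered with the optimizer.

```python
import torch
import torch.nn.functional as F

def textual_inversion_step(unet, text_encoder, token_ids, v, x0, scheduler, optimizer):
    """One optimization step of Eq. (4): the diffusion model is frozen and
    only the new token embedding `v` receives gradients."""
    t = torch.randint(0, scheduler.num_train_timesteps, (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    x_t = scheduler.add_noise(x0, eps, t)                 # forward process (Eq. 1)

    cond = text_encoder(token_ids, new_token_embedding=v) # prompt containing [v]
    eps_pred = unet(x_t, t, cond)                         # frozen noise predictor

    loss = F.mse_loss(eps_pred, eps)                      # Eq. (4)
    loss.backward()
    optimizer.step()                                      # optimizer holds only v
    optimizer.zero_grad()
    return loss.item()
```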

DreamBooth fine-tuning with LoRA. Instead of learning a textual embedding $\mathbf{v}$ alone, DreamBooth [49] fine-tunes a subset of model parameters $\phi\subset\theta$, specifically focusing on the U-Net [47] of the diffusion model. To reduce the number of parameters to be updated, LoRA (Low-Rank Adaptation) [18] is commonly employed, which injects low-rank learnable matrices into the weight tensors of the U-Net. Formally, let $\mathbf{W}\in\mathbb{R}^{d_{\mathrm{out}}\times d_{\mathrm{in}}}$ be a model weight; we decompose its update as

$$\Delta\mathbf{W} = \mathbf{A}\mathbf{B}, \quad \text{where} \quad \mathbf{A}\in\mathbb{R}^{d_{\mathrm{out}}\times r}, \quad \mathbf{B}\in\mathbb{R}^{r\times d_{\mathrm{in}}}, \qquad (5)$$

and $r\ll\min(d_{\mathrm{in}}, d_{\mathrm{out}})$. During fine-tuning, we optimize only $\mathbf{A}$ and $\mathbf{B}$, leaving the original $\mathbf{W}$ frozen, so the number of effective trainable parameters is greatly reduced.

We formulate the overall training objective as:

$$\min_{\phi} \; \mathbb{E}_{\mathbf{x}_0, t, \boldsymbol{\epsilon}}\bigl\|\boldsymbol{\epsilon} - \epsilon_{\theta\setminus\phi,\,\phi}(\mathbf{x}_t, c, t)\bigr\|^2, \qquad (6)$$

where $\theta\setminus\phi$ denotes the frozen parameters and $\phi$ the low-rank adapted subset. By restricting the model updates to the LoRA components within the U-Net, we retain much of the original diffusion model's capacity while guiding it to learn the specific concept or style from limited reference images [27, 39, 43].
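
The low-rank update of Eq. (5) can be wrapped around a frozen linear layer as in the sketch below; the zero initialization of A (so that ΔW = AB = 0 at the start of fine-tuning) and the scaling factor are conventional LoRA choices, not details specified in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update: W x + (A B) x (Eq. 5)."""

    def __init__(self, base: nn.Linear, r: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                 # freeze the original W
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.zeros(d_out, r))     # zero-init so AB = 0 initially
        self.B = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.scale = scale

    def forward(self, x):
        delta_w = self.A @ self.B                        # d_out x d_in, rank <= r
        return self.base(x) + self.scale * F.linear(x, delta_w)
```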

Figure 2: Semantic drift in DiffuseMix [20] generated images using the prompt “A transformed version of image into autumn”. Stronger transformations reduce semantic fidelity to the source image. More examples in Supplementary Materials 2.

4 Motivation

Current SOTA methods, such as Diff-Mix [64] and DiffuseMix [20], use diffusion models for data augmentation in domain-specific image classification. Yet they struggle to balance diversity, faithfulness, and label clarity, because their designs do not explicitly target these aspects and because of inherent diffusion model limitations (see Supplementary Materials 1 & 2 for a detailed breakdown of these dependencies). Below, we briefly analyze these shortcomings to motivate our proposed SGD-Mix.

Figure 3: The pipeline of the proposed SGD-Mix method. The process involves three major stages: (1) saliency-based target selection, (2) saliency-guided mixing, and (3) refining mixed images using our domain-specific fine-tuned diffusion model.

Limitations of Diff-Mix.

Diff-Mix generates inter-class images using a fine-tuned diffusion model and assigns labels via a nonlinear mixing formula, $\tilde{y} = (1-s^{\gamma})y^i + s^{\gamma}y^j$, where $s\in[0,1]$ controls translation strength, $\gamma$ (often $\approx 0.5$) introduces nonlinearity, and $y^i$ and $y^j$ represent the labels of the reference (source) and target image classes, respectively. While effective in some cases, this approach has two key limitations. First, its label mixing lacks generalizability: for a given $s$, the amount of information retained from the source image varies with the characteristics of the diffusion model (e.g., its noise scheduler [34]) and with dataset-specific factors (e.g., image distribution), yet these images are assigned the same label regardless of such variations, affecting the consistency and reliability of the labeling process. Second, higher $s$ values increase stochasticity [31], yielding images with inconsistent semantics for the same $s$ that may mismatch their assigned labels. These issues stem from relying solely on $s$ for label determination, limiting robustness across different diffusion backbones and datasets. See Figure 1 for an example where a Diff-Mix-generated image has a label composition over 0.8 for the source class and under 0.2 for the target class, which is clearly incorrect.
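
To make the dependence on $s$ explicit, the short sketch below evaluates Diff-Mix's label rule over several translation strengths (the specific values are illustrative); it shows that the label split is a fixed function of $s$ alone, independent of how much source content a particular backbone or dataset actually retains.

```python
def diffmix_label_weights(s, gamma=0.5):
    """Diff-Mix label rule: y_tilde = (1 - s**gamma) * y_i + s**gamma * y_j."""
    w_target = s ** gamma
    return 1.0 - w_target, w_target           # (source weight, target weight)

for s in (0.1, 0.3, 0.5, 0.7, 0.9):
    w_src, w_tgt = diffmix_label_weights(s)
    print(f"s={s:.1f}: source {w_src:.2f}, target {w_tgt:.2f}")
# The same s always produces the same label split, regardless of the
# diffusion backbone, noise scheduler, or dataset characteristics.
```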

Limitations of DiffuseMix.

In contrast to Diff-Mix’s focus on inter-class mixing, DiffuseMix transforms a source image with a conditional prompt, stylizing it via a pre-trained diffusion model without disentangling the foreground and background, then merging it with the original using a binary mask before blending with a fractal image for extra variation. Our experiments and findings from DEADiff [40] indicate that while this approach is creative, it may suffer from semantic drift, where the stylized image deviates significantly from the source image’s semantic content, particularly under strong transformations (see Figure 2). Similar to CutMix [67], DiffuseMix also introduces unnatural boundaries due to the binary mask, which can further degrade the visual coherence of the generated image. Assigning the source image’s label to the resulting image without label adjustment increases the risk of mislabeling. Moreover, retaining half of the source image in the mix limits diversity, thereby reducing the effectiveness of data augmentation.

Our analysis reveals that both Diff-Mix and DiffuseMix are hindered by their designs and by diffusion model limitations. Diff-Mix's inter-class mixing leads to label ambiguity, while DiffuseMix's prompt-driven stylization risks semantic drift and limited diversity. To address these issues, we propose SGD-Mix, a saliency-guided framework that, with a fine-tuned diffusion model, avoids inter-class mixing and mitigates semantic drift to ensure clear labels throughout the design, while also providing sufficient diversity and faithfulness for robust data augmentation.

5 SGD-Mix

The pipeline of SGD-Mix consists of three major stages: (1) selecting a target image and its saliency map based on the overlap of foreground regions, measured by the L2 distance between saliency maps, (2) generating a mixed image using saliency-based binary masks, and (3) refining the mixed image using our domain-specific fine-tuned diffusion model. The overall process is illustrated in Figure 3. The details of each stage are described below.

5.1 Saliency-Based Target Selection

The purpose of this step is to find a target image whose foreground region overlaps the most with that of the source image, ensuring that the target image's background can be maximally retained in the subsequent steps. Given a source image $I_i$ and its normalized saliency map $Z_i$, we aim to select a target image $I_j$ and its corresponding normalized saliency map $Z_j$ from a target batch $\{(I_j, Z_j)\}_{j=1}^{N}$. The target batch is randomly sampled from the training dataset, where $N$ is a hyperparameter specifying the batch size. The selection criterion is based on the overlap of foreground regions, quantified using the L2 distance between saliency maps:

$$j = \arg\min_{j\in\{1,\dots,N\}} \|Z_i - Z_j\|^2, \qquad (7)$$

where $j$ is the index of the selected target image. Saliency maps $Z$ can be generated using various existing methods, including rule-based [17] or gradient-based methods [57, 69, 54, 70, 19], and are all normalized to the range $[0, 1]$ using min-max scaling.
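
A minimal PyTorch sketch of this selection step (Eq. 7) is shown below, assuming the saliency maps are (H, W) tensors already normalized to [0, 1] and that `target_batch` is the randomly sampled list of N (image, saliency) pairs.

```python
import torch

def select_target(Z_i, target_batch):
    """Saliency-based target selection (Eq. 7): return the (I_j, Z_j) pair
    whose saliency map is closest to Z_i in squared L2 distance."""
    dists = torch.stack([((Z_i - Z_j) ** 2).sum() for _, Z_j in target_batch])
    j = int(torch.argmin(dists))
    return target_batch[j]
```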

Figure 4: Visualization of attention maps before and after saliency-guided mixing in SGD-Mix. For a source image $I_i$ and target image $I_j$, the attention maps (bottom row) consistently focus on $I_i$'s foreground region in the mixed image $I_{(i,j)}$, preserving semantic consistency. More examples in Supplementary Materials 4.1.

5.2 Saliency-Guided Mixing

The purpose of this step is to generate a mixed image in which the foreground semantics are preserved from the source image while the background is fully replaced with that of the target image. After identifying the target image $I_j$ and saliency map $Z_j$, binary masks are created for both the source and target saliency maps using Otsu's [37] thresholding method:

$$M_i = \text{Otsu}(Z_i), \quad M_j = \text{Otsu}(Z_j). \qquad (8)$$

This thresholding operation segments the salient foreground from the less significant background, resulting in binary masks $M_i$ and $M_j$ that highlight the foreground regions. The masks are then combined using a union operation to form $M_{(i,j)}$:

$$M_{(i,j)} = M_i \cup M_j, \qquad (9)$$

where $M_{(i,j)}$ prioritizes the retention of the source foreground while suppressing interference from the target foreground. Using this mask, a saliency-guided mixed image $I_{(i,j)}$ is generated via pixel-wise composition:

$$I_{(i,j)} = M_{(i,j)} \odot I_i + (1 - M_{(i,j)}) \odot I_j, \qquad (10)$$

where $\odot$ denotes pixel-wise multiplication. The mixed image retains the target image's background while ensuring that the foreground semantics exclusively originate from the source image.

The combined objective of the two steps is to completely replace the background of the source image with that of the target image, avoiding the introduction of foreign foreground semantics from other images. At the same time, the original foreground semantics of the source image are preserved, ensuring that the label of the mixed image remains consistent with that of the source image. This process is visualized in Figure 4, which highlights how attention remains focused on the source image’s foreground before and after mixing.
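
The masking and composition of Eqs. (8)–(10) can be sketched with NumPy and scikit-image's Otsu thresholding as follows; the (H, W, 3) image layout and the assumption that source and target have already been resized to a common shape are implementation choices, not requirements stated above.

```python
import numpy as np
from skimage.filters import threshold_otsu

def saliency_guided_mix(I_i, Z_i, I_j, Z_j):
    """Saliency-guided mixing (Eqs. 8-10).

    I_i, I_j: float arrays of shape (H, W, 3); Z_i, Z_j: saliency maps
    of shape (H, W) normalized to [0, 1].
    """
    M_i = (Z_i > threshold_otsu(Z_i)).astype(np.float32)   # Eq. (8)
    M_j = (Z_j > threshold_otsu(Z_j)).astype(np.float32)
    M_ij = np.maximum(M_i, M_j)                            # union of masks, Eq. (9)
    M_ij = M_ij[..., None]                                 # broadcast over channels
    return M_ij * I_i + (1.0 - M_ij) * I_j                 # composition, Eq. (10)
```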

5.3 Refining Mixed Images Using Our Domain-Specific Fine-Tuned Diffusion Model

To generate high-quality, class-specific images, it is feasible to directly fine-tune a diffusion model using DreamBooth [49] fine-tuning with LoRA [18] on a domain-specific dataset. By providing class names as prompts (e.g., "Prothonotary Warbler"), the model can generate images corresponding to these classes. However, this approach faces significant challenges in datasets containing fine-grained categories with high inter-class similarity. For instance, species such as "Prothonotary Warbler" and "Mourning Warbler" exhibit substantial visual and semantic overlap. This similarity can hinder the model's ability to distinguish between closely related classes, leading to convergence issues and reduced specificity in generated outputs.

Figure 5: Examples of SGD-Mix generated images under varying translation strengths $S\in\{0.1, 0.3, 0.5, 0.7, 0.9\}$. The generated image retains the source image's foreground semantics while the background evolves with increasing $S$, balancing diversity and faithfulness. More examples in Supplementary Materials 4.2.

To address these challenges, we draw inspiration from Diff-Mix [64] and adopt a structured approach that combines DreamBooth fine-tuning with LoRA and Textual Inversion [11]. First, we replace the direct usage of class names with structured identifiers to disentangle fine-grained semantics. Specifically, the class names are reformulated as "[$v^i$] [metaclass]", where $v^i$ is a learnable embedding representing the $i$-th class ($i\in\{1,2,\dots,n\}$), and [metaclass] describes the overarching category (e.g., "bird"). DreamBooth fine-tuning with LoRA is then utilized to efficiently adapt the diffusion model's U-Net [47] to the domain-specific distributions, enabling the model to generate images that align with the target dataset's visual characteristics.

This joint fine-tuning process optimizes:

  • Learnable identifiers ($v^i$): These embeddings provide compact and distinct representations for each fine-grained category, ensuring effective semantic disentanglement.

  • Low-Rank Adaptations (LoRA): Efficiently adjust the model parameters to capture domain-specific nuances.

During inference, the fine-tuned diffusion model utilizes the mixed image $I_{(i,j)}$ as a conditioning input and generates images guided by a structured prompt:

“a photo of a [$v^s$] [metaclass]”,

where $v^s$ is the learnable embedding corresponding to the source image class. This structured prompt ensures that the generated image aligns with the semantic content of the source image while preserving the saliency of the source foreground and incorporating the background diversity introduced in previous steps.

Finally, the label of the generated image is determined by our label-preserving process. Since the mixed image $I_{(i,j)}$ inherits its label from the source image $I_i$, and the diffusion model generates outputs conditioned solely on $v^s$ without introducing semantics from other classes, the label of the generated image remains consistent with that of the source image, which ensures that our method achieves label clarity:

$$\text{Label}(\hat{I}_{(i,j)}) = \text{Label}(I_{(i,j)}) = \text{Label}(I_i). \qquad (11)$$

To further enhance flexibility, we introduce a strength parameter $S$, which controls the noise injection level during image generation. By adjusting $S$, users can balance faithfulness to the source image with diversity in the generated outputs. This mechanism ensures that the refined images achieve the desired levels of faithfulness, diversity, and label clarity, making them suitable for downstream tasks. The effect of varying $S$ on the generated images is demonstrated in Figure 5.
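
As an illustration of this refinement stage, the sketch below uses Hugging Face diffusers' image-to-image pipeline; the checkpoint name, the LoRA and textual-inversion weight paths, and the learned token `<v_s>` are placeholders, and the actual fine-tuning artifacts would come from the procedure described above.

```python
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Load a Stable Diffusion backbone and attach the domain-specific
# DreamBooth+LoRA and Textual Inversion weights (paths are hypothetical).
pipe = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.load_lora_weights("path/to/domain_lora")
pipe.load_textual_inversion("path/to/learned_embeds.bin", token="<v_s>")

mixed = Image.open("mixed_image.png")         # the mixed image I_(i,j) from Eq. (10)
refined = pipe(
    prompt="a photo of a <v_s> bird",         # structured prompt "[v^s] [metaclass]"
    image=mixed,
    strength=0.7,                             # translation strength S
).images[0]
# By Eq. (11), the label of `refined` is simply the label of the source image I_i.
```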

Algorithm 1 SGD-Mix

Input: A single training dataset of $m$ image-saliency pairs $\{(I_k, Z_k)\}_{k=1}^{m}$, batch size $N$, domain-specific fine-tuned diffusion model $\mathcal{D}$

Output: Generated image set $\hat{\mathcal{I}}$

1: Initialize $\hat{\mathcal{I}} \leftarrow \emptyset$
2: for $i = 1$ to $m$ do
3:     Sample Target Batch (from the same dataset):
4:       Randomly sample $\{(I_j, Z_j)\}_{j=1}^{N}$ from $\{(I_k, Z_k)\}_{k=1}^{m} \setminus \{(I_i, Z_i)\}$
5:     Target Selection:
6:       $j \leftarrow \arg\min_{j\in\{1,\dots,N\}} \|Z_i - Z_j\|^2$
7:     Mask Generation:
8:       Compute $M_i \leftarrow \mathrm{Otsu}(Z_i)$ and $M_j \leftarrow \mathrm{Otsu}(Z_j)$
9:       Merge masks: $M_{(i,j)} \leftarrow M_i \cup M_j$
10:     Saliency-Guided Mixing:
11:       Compute mixed image: $I_{(i,j)} \leftarrow M_{(i,j)} \odot I_i + (1 - M_{(i,j)}) \odot I_j$
12:     Diffusion-Based Refinement:
13:       Generate final image: $\hat{I}_{(i,j)} \leftarrow \mathcal{D}(I_{(i,j)})$
14:       Update generated set: $\hat{\mathcal{I}} \leftarrow \hat{\mathcal{I}} \cup \{\hat{I}_{(i,j)}\}$
15: end for
16: return $\hat{\mathcal{I}}$

6 Experiments

Method | ResNet50@448: CUB / Aircraft / Flower / Car / Dog / Avg | ViT-B/16@384: CUB / Aircraft / Flower / Car / Dog / Avg
- 86.64 89.09 99.27 94.54 87.48 91.40 89.37 83.50 99.56 94.21 92.06 91.74
CutMix [67] 87.23 89.44 99.25 94.73 87.59 91.65 90.52 83.50 99.64 94.83 92.13 92.12
Mixup [68] 86.68 89.41 99.40 94.49 87.42 91.48 90.32 84.31 99.73 94.98 92.02 92.27
GuidedMixup [22]\ddagger 86.50 88.60 99.49 94.99 87.52 91.42 90.00 82.99 98.98 94.20 92.51 91.70
Real-Filtering [15] 85.60 88.54 99.09 94.59 87.30 91.22 89.49 83.07 99.36 94.66 91.91 91.69
Real-Guidance [15] 86.71 89.07 99.25 94.55 87.40 91.59 89.54 83.17 99.59 94.65 92.05 91.80
Da-Fusion [60] 86.30 87.64 99.37 94.69 87.33 91.07 89.40 81.88 99.61 94.53 92.07 91.50
Diff-Mix [64] 87.16 90.25 99.54 95.12 87.74 91.96 90.05 84.33 99.64 95.09 91.99 92.22
DiffuseMix [20]\ddagger 86.19 88.81 99.08 94.59 87.39 91.21 88.99 83.02 99.20 94.17 91.63 91.40
SGD-Mix (Ours) 87.66 90.16 99.59 95.27 88.01 92.14 90.44 85.21 99.61 95.32 91.99 92.51
Table 2: Fine-grained classification accuracy (%).
Method | CUB-LT (IF=100: Many / Medium / Few / All; IF=50: All; IF=10: All) | Flower-LT (IF=100: Many / Medium / Few / All; IF=50: All; IF=10: All)
- 79.11 64.28 13.48 33.65 44.82 58.13 99.19 94.95 58.18 80.43 90.87 95.07
CMO [38] 78.32 58.57 14.78 32.94 44.08 57.62 99.25 95.19 67.45 83.95 91.43 95.19
CMO+DRW [6] 78.97 56.36 14.66 32.57 46.43 59.25 99.97 95.06 67.31 84.07 92.06 95.92
Copy-Paste (LSJ) [12]\ddagger 78.55 58.48 11.99 30.95 43.80 57.92 98.73 93.02 57.00 75.89 89.02 94.97
SaliencyMix [62]\ddagger 81.93 62.05 14.29 33.71 46.03 61.01 99.04 93.30 58.36 76.71 89.90 95.02
DA-fusion [60]\ddagger 81.20 61.99 27.50 41.87 49.55 59.98 99.20 94.97 64.04 80.16 91.02 94.97
Real-Gen [64] 84.88 65.23 30.68 45.86 53.43 61.42 98.64 95.55 66.10 83.56 91.84 95.22
Real-Mix [64] 84.63 66.34 34.44 47.75 55.67 62.27 99.87 96.26 68.53 85.19 92.96 96.04
Diff-Mix [64] 84.07 67.79 36.55 50.35 58.19 64.48 99.25 96.98 78.41 89.46 93.86 96.63
DiffuseMix [20]\ddagger 80.96 63.97 31.02 44.65 51.98 59.87 99.04 95.32 65.94 81.20 90.94 95.99
SGD-Mix (Ours) 84.34 66.80 37.00 49.48 58.68 64.98 99.68 96.30 79.98 88.66 94.03 96.89
Table 3: Long-tail classification accuracy (%).

We evaluate our proposed SGD-Mix method through four key studies: (1) fine-grained visual classification, (2) long-tail classification, (3) few-shot classification, and (4) background robustness. These assess SGD-Mix under diverse challenges, including high-resolution inputs, imbalanced distributions, data scarcity, and background shifts, comparing it against SOTA data augmentation methods to highlight its effectiveness in domain-specific image classification. Unless otherwise specified, we set the batch size to $N=50$, compute saliency maps via a gradient-based method (Grad-CAM) [54], and follow standard protocols for fair comparisons. Results marked with \ddagger in the tables are our reproductions; the others are reported by Diff-Mix [64] and slightly outperform our reproductions under consistent settings, and we adopt their values for fairness and rigor. More details of the experimental setup are in Supplementary Materials 3.

6.1 Fine-Grained Vision Classification

Fine-grained visual classification (FGVC) involves distinguishing subtle intra-category differences. We evaluate SGD-Mix on five datasets (CUB [63], Stanford Cars [26], Oxford Flowers [36], Stanford Dogs [23], and FGVC Aircraft [30]) using ResNet50 [14] (ImageNet1K-pretrained [50], $448\times 448$ input) and ViT-B/16 [10] (ImageNet21K-pretrained [45], $384\times 384$ input). SGD-Mix employs a translation strength of $S\in\{0.5, 0.7, 0.9\}$, label smoothing [32] with confidence 0.9, an expansion multiplier of 5, and a replacement probability of $p=0.1$. Competitors include generative diffusion-based methods (Diff-Mix [64], DiffuseMix [20], Real-Filtering [15], Real-Guidance [15], Da-Fusion [60]) and non-generative methods (Mixup [68], CutMix [67], GuidedMixup [22]). For fair comparison, Diff-Mix and DiffuseMix adopt the same $S$ as SGD-Mix, with Diff-Mix using $\gamma=0.5$, while Da-Fusion uses a randomly sampled $S\in\{0.25, 0.5, 0.75, 1.0\}$.

Table 2 shows SGD-Mix outperforming SOTA, achieving the highest average accuracy (92.14% ResNet50, 92.51% ViT) across datasets, surpassing Diff-Mix (91.96% ResNet50, 92.22% ViT) by +0.18% and +0.29%, with notable gains on Dog (+0.27% ResNet50) and Aircraft (+0.88% ViT). SGD-Mix succeeds by generating data with reliable foregrounds and diverse backgrounds, capturing subtle class distinctions. See Q1 in Section 7 for details.

6.2 Long-Tail Classification

Long-tail classification addresses class imbalance, a common real-world challenge. We test SGD-Mix on the CUB-LT and Flower-LT datasets [53, 6, 29, 38] with varying imbalance factors (IF = 100, 50, 10), using SYNAuG [65] to balance distributions with synthetic data. Competitors include generative diffusion-based methods (DiffuseMix [20], Diff-Mix [64], Real-Mix [64], Real-Gen [64], DA-fusion [60]) and non-generative methods (SaliencyMix [62], Copy-Paste [12], CMO [38], CMO+DRW [6]), with $S=0.7$ for SGD-Mix/Diff-Mix/Real-Mix and $\gamma=0.5$ for Diff-Mix/Real-Mix.

Table 3 reveals SGD-Mix surpassing SOTA Diff-Mix on CUB-LT (64.98% vs. 64.48% at IF=10) and Flower-LT (96.89% vs. 96.63%), excelling in Few classes (+0.45%/+1.57% on CUB-LT/Flower-LT IF=100). Reliable intra-class foregrounds in reference images ensure accurate augmentation for rare classes. See Q1 in Section 7 for details.

Method 1-shot (p=0.5) 5-shot (p=0.3) 10-shot (p=0.2) All-shot (p=0.1)
-\ddagger 16.10 50.98 68.69 81.74
Copy-Paste (LSJ) [12]\ddagger 16.81 52.97 69.02 81.77
SaliencyMix [62]\ddagger 17.47 53.02 69.50 81.90
DA-fusion [60]\ddagger 20.02 53.04 69.97 81.79
Diff-Aug [64]\ddagger 20.50 57.49 71.00 82.02
Diff-Mix [64]\ddagger 25.49 59.30 71.51 82.34
Diff-Gen [64]\ddagger 26.10 58.35 71.14 82.26
SGD-Mix (Ours) 25.68 59.61 71.90 82.88
Table 4: Few-shot classification accuracy (%).

6.3 Few-Shot Classification

Few-shot classification tests generalization from limited data. On CUB [63] with ResNet50 [14] ($224\times 224$ input), we evaluate 1-shot, 5-shot, 10-shot, and all-shot settings, setting $S=0.9$ for SGD-Mix, Diff-Mix [64], and Diff-Aug [64], with an expansion multiplier of 5 and replacement probabilities $p=\{0.5, 0.3, 0.2, 0.1\}$, respectively.

Table 4 shows SGD-Mix leading in 5-shot (59.61% vs. 59.30% Diff-Mix), 10-shot (71.90% vs. 71.51%), and all-shot (82.88% vs. 82.34%), with competitive 1-shot performance (25.68% vs. 26.10% Diff-Gen [64]). Reliable intra-class foregrounds in reference images ensure correct foreground generation despite limited samples. See Q1 in Section 7 for details.

6.4 Background Robustness

To assess whether diverse generated samples improve resilience to background shifts, we evaluate on the out-of-distribution Waterbird dataset [51] (CUB [63] foregrounds composited with Places [71] backgrounds), split into (waterbird, water), (waterbird, land), (landbird, land), and (landbird, water) groups. Models trained on CUB and synthetic data are evaluated for robustness.

Table 5 indicates SGD-Mix slightly edges out Diff-Mix (72.49% vs. 72.47% avg.), with a standout +6.5% gain in (waterbird, land) over the baseline. Its foreground-background disentanglement mitigates background bias. See Q1 in Section 7 for details.

Method (waterbird, water) (waterbird, land) (landbird, land) (landbird, water) Avg
- 59.50 56.70 73.48 73.97 70.19
CutMix [67] 62.46 60.12 73.39 74.72 71.23
Copy-Paste (LSJ) [12]\ddagger 59.97 56.70 73.17 72.95 69.80
SaliencyMix [62]\ddagger 62.15 60.28 73.48 73.53 70.78
DA-Fusion [60] 60.90 58.10 72.94 72.77 69.90
Diff-Aug [64] 61.83 60.12 73.04 73.52 70.28
Diff-Mix [64] 63.83 63.24 75.64 74.36 72.47
SGD-Mix (Ours) 63.86 63.24 75.57 74.50 72.49
Table 5: Classification accuracy (%) on the Waterbird dataset.
Dataset Baseline SGD-Mix (SR) SGD-Mix
CUB 86.64 86.73 87.66
Aircraft 89.09 89.11 90.16
Flower 99.27 99.49 99.59
Car 94.54 94.70 95.27
Dog 87.48 86.96 88.01
Table 6: Comparison of Baseline, SGD-Mix (SR), and SGD-Mix using ResNet50 at 448×448, per Section 6.1. SGD-Mix (SR) uses spectral residual [17] saliency maps.

7 Discussion

Q1: Why does SGD-Mix consistently outperform across diverse tasks?

A: SGD-Mix excels by jointly optimizing diversity, faithfulness, and label clarity, the key factors for effective data augmentation in domain-specific image classification. It leverages saliency maps to separate foreground and background, preserving the source image's foreground semantics while diversifying backgrounds with target images. A fine-tuned diffusion model further refines these mixes, generating high-quality outputs through non-linear synthesis. Unlike Diff-Mix [64], which risks label ambiguity due to inter-class mixing and requires filters [44] for unnatural images, SGD-Mix's label-preserving method ensures semantic alignment, eliminating mismatches without the need for additional filtering. Similarly, unlike DiffuseMix [20], our approach does not suffer from semantic drift. It follows the foreground-background disentanglement principle [64], producing reliable foregrounds with diverse backgrounds to enhance robust feature learning. Moreover, its saliency-guided mixing inherently aligns with the vicinal risk minimization (VRM) principle [7]. These combined advantages strengthen robustness in fine-grained classification (capturing subtle details), long-tail settings (handling rare classes), few-shot learning (adapting to scarce data), and background robustness (mitigating bias).

Figure 6: Accuracy on CUB (%) with ResNet50 as a function of the batch size $N$ (setup of Section 6.1), comparing SGD-Mix against the baseline. SGD-Mix improves with $N$ and saturates near $N=50$.

Q2: Why avoid additional background datasets or segmentation models, and how does SGD-Mix differ from simple background swapping?

A: We prioritize a method using only the input dataset, as external background datasets may cause semantic inconsistency due to uncertain contextual relevance. Segmentation models, though viable, are computationally costly and less adaptable, struggling to match foregrounds and backgrounds within a dataset without an optimal pairing mechanism. In contrast, SGD-Mix uses L2 distance between saliency maps to pair similar foregrounds with diverse backgrounds efficiently and precisely in a lightweight framework. Unlike simple background swapping [48, 66, 55], which only swaps backgrounds without changing foregrounds and thus lacks diversity, SGD-Mix preserves foreground semantics, adds diverse backgrounds, and leverages a fine-tuned diffusion model for nonlinear and high-fidelity generation, overcoming these limitations.

Q3: Is SGD-Mix tied to gradient-based saliency maps?

A: No. SGD-Mix supports any saliency map method. While gradient-based (Grad-CAM) [54] saliency maps yield the best accuracy, ablation studies (Table 6) with spectral residual saliency maps [17] show minor performance drops yet sustained effectiveness, demonstrating the framework's flexibility.

Q4: How does batch size NN impact performance?

A: Ablation studies (Figure 6) testing $N=10, 20, 30, 50, 100$ reveal that: 1) a small $N$ limits target selection, reducing diversity; 2) a larger $N$ improves augmentation quality but increases computation (see Supplementary Materials 3.3.2 for details), plateauing beyond 50 with diminishing returns. We recommend $N\in[30, 50]$ for a good balance of efficiency and effectiveness.

8 Conclusion

In this work, we introduce SGD-Mix, a data augmentation framework for domain-specific image classification that simultaneously enhances diversity, faithfulness, and label clarity in generated data. Using saliency-guided mixing and a fine-tuned diffusion model, SGD-Mix overcomes prior method limitations and diffusion model challenges. Experiments on fine-grained, long-tail, few-shot, and background robustness tasks show it outperforms state-of-the-art methods with balanced, robust augmentation.

Limitations

SGD-Mix incurs computational overhead from saliency map processing, which may hinder scalability in resource-limited settings. We detail this in Supplementary Materials 3.3.2 and plan to explore lightweight saliency processing techniques in future work.

References

  • [1] A. Antoniou, A. Storkey, and H. Edwards. Augmenting image classifiers using data augmentation generative adversarial networks. In International conference on artificial neural networks, pages 594–603. Springer, 2018.
  • [2] A. Antoniou, A. Storkey, and H. Edwards. Data augmentation generative adversarial networks. stat, 1050:8, 2018.
  • [3] N. Araslanov and S. Roth. Self-supervised augmentation consistency for adapting semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15384–15394, 2021.
  • [4] S. Azizi, S. Kornblith, C. Saharia, M. Norouzi, and D. J. Fleet. Synthetic data from diffusion models improves imagenet classification. Transactions on Machine Learning Research.
  • [5] T. Brooks, A. Holynski, and A. A. Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18392–18402, 2023.
  • [6] K. Cao, C. Wei, A. Gaidon, N. Arechiga, and T. Ma. Learning imbalanced datasets with label-distribution-aware margin loss. Advances in neural information processing systems, 32, 2019.
  • [7] O. Chapelle, J. Weston, L. Bottou, and V. Vapnik. Vicinal risk minimization. Advances in neural information processing systems, 13, 2000.
  • [8] T. Chen. On the importance of noise scheduling for diffusion models. arXiv preprint arXiv:2301.10972, 2023.
  • [9] P. Dhariwal and A. Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  • [10] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
  • [11] R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In The Eleventh International Conference on Learning Representations.
  • [12] G. Ghiasi, Y. Cui, A. Srinivas, R. Qian, T.-Y. Lin, E. D. Cubuk, Q. V. Le, and B. Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2918–2928, 2021.
  • [13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [15] R. He, S. Sun, X. Yu, C. Xue, W. Zhang, P. Torr, S. Bai, and X. QI. Is synthetic data from generative models ready for image recognition? In The Eleventh International Conference on Learning Representations.
  • [16] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • [17] X. Hou and L. Zhang. Saliency detection: A spectral residual approach. In 2007 IEEE Conference on computer vision and pattern recognition, pages 1–8. IEEE, 2007.
  • [18] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
  • [19] S. Huang, X. Wang, and D. Tao. Snapmix: Semantically proportional mixing for augmenting fine-grained data. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 1628–1636, 2021.
  • [20] K. Islam, M. Z. Zaheer, A. Mahmood, and K. Nandakumar. Diffusemix: Label-preserving data augmentation with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27621–27630, 2024.
  • [21] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254–1259, 2002.
  • [22] M. Kang and S. Kim. Guidedmixup: an efficient mixup strategy guided by saliency maps. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 1096–1104, 2023.
  • [23] A. Khosla, N. Jayadevaprakash, B. Yao, and F.-F. Li. Novel dataset for fine-grained image categorization: Stanford dogs. In Proc. CVPR workshop on fine-grained visual categorization (FGVC), volume 2, 2011.
  • [24] J.-H. Kim, W. Choo, and H. O. Song. Puzzle mix: Exploiting saliency and local statistics for optimal mixup. In International conference on machine learning, pages 5275–5285. PMLR, 2020.
  • [25] D. P. Kingma, M. Welling, et al. Auto-encoding variational bayes.
  • [26] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013.
  • [27] N. Kumari, B. Zhang, R. Zhang, E. Shechtman, and J.-Y. Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1931–1941, 2023.
  • [28] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez. A survey on deep learning in medical image analysis. Medical image analysis, 42:60–88, 2017.
  • [29] Z. Liu, Z. Miao, X. Zhan, J. Wang, B. Gong, and S. X. Yu. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2537–2546, 2019.
  • [30] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. 2013.
  • [31] C. Meng, Y. He, Y. Song, J. Song, J. Wu, J.-Y. Zhu, and S. Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations.
  • [32] R. Müller, S. Kornblith, and G. E. Hinton. When does label smoothing help? Advances in neural information processing systems, 32, 2019.
  • [33] A. Mumuni and F. Mumuni. Data augmentation: A comprehensive survey of modern approaches. Array, 16:100258, 2022.
  • [34] A. Q. Nichol and P. Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR, 2021.
  • [35] A. Q. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. Mcgrew, I. Sutskever, and M. Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, pages 16784–16804. PMLR, 2022.
  • [36] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008.
  • [37] N. Otsu et al. A threshold selection method from gray-level histograms. Automatica, 11(285-296):23–27, 1975.
  • [38] S. Park, Y. Hong, B. Heo, S. Yun, and J. Y. Choi. The majority can help the minority: Context-rich minority oversampling for long-tailed classification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6887–6896, 2022.
  • [39] M. Patel, T. Gokhale, C. Baral, and Y. Yang. Conceptbed: Evaluating concept learning abilities of text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 14554–14562, 2024.
  • [40] T. Qi, S. Fang, Y. Wu, H. Xie, J. Liu, L. Chen, Q. He, and Y. Zhang. Deadiff: An efficient stylization diffusion model with disentangled representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8693–8702, 2024.
  • [41] H. Qin, X. Jin, H. Zhu, H. Liao, M. A. El-Yacoubi, and X. Gao. Sumix: Mixup with semantic and uncertain information. In European Conference on Computer Vision, pages 70–88. Springer, 2024.
  • [42] J. Qin, J. Fang, Q. Zhang, W. Liu, X. Wang, and X. Wang. Resizemix: Mixing data with preserved object information and true labels. arXiv preprint arXiv:2012.11101, 2020.
  • [43] Z. Qiu, W. Liu, H. Feng, Y. Xue, Y. Feng, Z. Liu, D. Zhang, A. Weller, and B. Schölkopf. Controlling text-to-image diffusion by orthogonal finetuning. Advances in Neural Information Processing Systems, 36:79320–79362, 2023.
  • [44] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [45] T. Ridnik, E. Ben-Baruch, A. Noy, and L. Zelnik-Manor. Imagenet-21k pretraining for the masses. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).
  • [46] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • [47] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
  • [48] C. Rother, V. Kolmogorov, and A. Blake. "GrabCut": Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (TOG), 23(3):309–314, 2004.
  • [49] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023.
  • [50] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015.
  • [51] S. Sagawa, P. W. Koh, T. B. Hashimoto, and P. Liang. Distributionally robust neural networks. In International Conference on Learning Representations.
  • [52] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.
  • [53] D. Samuel, Y. Atzmon, and G. Chechik. From generalized zero-shot learning to long-tail with class descriptors. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 286–295, 2021.
  • [54] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
  • [55] S. Sengupta, V. Jayaram, B. Curless, S. M. Seitz, and I. Kemelmacher-Shlizerman. Background matting: The world is your green screen. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2291–2300, 2020.
  • [56] C. Shorten and T. M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of big data, 6(1):1–48, 2019.
  • [57] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: visualising image classification models and saliency maps. International Conference on Learning Representations, 2014.
  • [58] J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations.
  • [59] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations.
  • [60] B. Trabucco, K. Doherty, M. Gurinas, and R. Salakhutdinov. Effective data augmentation with diffusion models. In R0-FoMo: Robustness of Few-shot and Zero-shot Learning in Large Foundation Models.
  • [61] N. Tumanyan, M. Geyer, S. Bagon, and T. Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1921–1930, 2023.
  • [62] A. S. Uddin, M. S. Monira, W. Shin, T. Chung, and S.-H. Bae. Saliencymix: A saliency guided data augmentation strategy for better regularization. In International Conference on Learning Representations.
  • [63] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
  • [64] Z. Wang, L. Wei, T. Wang, H. Chen, Y. Hao, X. Wang, X. He, and Q. Tian. Enhance image classification via inter-class image mixup with diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17223–17233, 2024.
  • [65] M. Ye-Bin, N. Hyeon-Woo, W. Choi, N. Kim, S. Kwak, and T.-H. Oh. Exploiting synthetic data for data imbalance problems: baselines from a data perspective. arXiv preprint arXiv:2308.00994, 6, 2023.
  • [66] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Generative image inpainting with contextual attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5505–5514, 2018.
  • [67] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6023–6032, 2019.
  • [68] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.
  • [69] R. Zhao, W. Ouyang, H. Li, and X. Wang. Saliency detection by multi-context deep learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1265–1274, 2015.
  • [70] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2921–2929, 2016.
  • [71] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence, 40(6):1452–1464, 2017.