Exploring Bias in over 100 Text-to-Image
Generative Models
Abstract
We investigate bias trends in text-to-image generative models over time, focusing on the increasing availability of models through open platforms like Hugging Face. While these platforms democratize AI, they also facilitate the spread of inherently biased models, often shaped by task-specific fine-tuning. Ensuring ethical and transparent AI deployment requires robust evaluation frameworks and quantifiable bias metrics. To this end, we assess bias across three key dimensions: (i) distribution bias, (ii) generative hallucination, and (iii) generative miss-rate. Analyzing over 100 models, we reveal how bias patterns evolve over time and across generative tasks. Our findings indicate that artistic and style-transferred models exhibit significant bias, whereas foundation models, benefiting from broader training distributions, are becoming progressively less biased. By identifying these systemic trends, we contribute a large-scale evaluation corpus to inform bias research and mitigation strategies, fostering more responsible AI development.
1 Introduction
Text-to-image (T2I) generative models, while capable of high-fidelity image synthesis, inherently reflect the biases present in their training data (Garcia et al., 2023; Mehrabi et al., 2021; Zhang et al., 2023). The wide accessibility of training, fine-tuning and deployment resources has resulted in a plethora of T2I models being published by AI practitioners and hobbyists alike. While the biased nature of these models is widely debated, there is no concrete evidence of how the community is responding in terms of accounting for bias in T2I generative models, particularly in light of the volume of models that continue to be released. Hence, we conduct this crucial research.
The abundance of publicly available data and models democratizes AI development, but also underscores the need for responsible usage (Arrieta et al., 2020; Bakr et al., 2023; Teo et al., 2024) and comprehensive evaluation tools that characterize bias characteristics of these black box models (Bakr et al., 2023; Chinchure et al., 2024; D’Incà et al., 2024; Hu et al., 2023; Luo et al., 2024; Vice et al., 2023). The ability to develop unsafe, inappropriate or biased models presents a significant challenge and evaluating fundamental bias characteristics is a crucial step in the right direction.
Biased representations in generated images stem from factors such as class imbalances in training data, human labeling biases, and hyperparameter choices during model training and fine-tuning (Garcia et al., 2023; Mehrabi et al., 2021; Zhang et al., 2023). Theoretically, generative model biases are not confined to a single concept or direction. Analyzing a model’s overall bias provides a more comprehensive understanding of its learned representations and underlying manifold structure. For instance, when generating generic images of “animals,” a model may disproportionately favor certain species or environments. While social biases (e.g., those related to age, race, or gender) are particularly consequential in public-facing applications (Abid et al., 2021; Luccioni et al., 2023; Naik & Nushi, 2023; Seshadri et al., 2023), they are manifestations of broader model biases, observed from a specific viewpoint. Since biases extend beyond social domains, it is essential to first characterize the general bias properties of learned concepts to better understand their implications.
In this work, we perform an extensive analysis of publicly available T2I models to examine how bias characteristics have evolved over time and across different generative tasks. We construct a comprehensive evaluation framework that considers: (i) distribution bias, (ii) Jaccard hallucination, (iii) generative miss-rate, (iv) log-based bias scores, (v) model popularity, and (vi) metadata features such as the intended generative task and timestamp.
Repositories such as the HuggingFace Hub offer a vast array of fine-tuned models, including approximately 56,240 text-to-image (T2I) models (as of the time of writing this manuscript). This extensive collection enables our comprehensive evaluations. The field of conditional image generation has evolved significantly, from the widely-used Stable Diffusion architecture (Rombach et al., 2022) (spanning versions v1 to v3/XL) to the latest rectified-flow transformer (FLUX)-based models (BlackForestLabs, 2024). To capture this progression, we conduct extensive evaluations across more than 100 unique models, varying in artistic style, generative task, and release date.
To quantify bias along the distribution bias ‘$B_D$’, Jaccard hallucination ‘$H_J$’ and generative miss-rate ‘$M_G$’ dimensions, we utilize the open-source “Try Before You Bias” (TBYB) evaluation code (Vice et al., 2023), which aligns well with models hosted on HuggingFace. We introduce a log-based bias score ‘$\mathcal{B}$’ that integrates these metrics into a single, interpretable value, computable in black-box settings. This approach provides a unified framework for evaluating and comparing model biases.
Our evaluations offer valuable insights into the bias characteristics of various categories of generative models, revealing a trade-off between artistic style transfer and perceived bias. We also observe that modern foundation models and photo-realism models have benefited from larger datasets, improved architectures, and careful curation efforts, leading to a positive trend in bias mitigation over time. By analyzing model popularity, we further explore whether user engagement is influenced by bias. This study represents a significant step forward in understanding how the community responds to biases in T2I models, particularly in light of the rapid proliferation of diverse models.
Through this work we contribute:

1. an extensive evaluation of bias trends in generative text-to-image models over time, uncovering key observations across three dimensions: distribution bias, hallucination and generative miss-rate.

2. a singular, log-based bias evaluation score that advances existing methodologies. This score enables end-to-end bias assessments in black-box settings, eliminating the need for normalization relative to a corpus of evaluated models.

3. a categorization and analysis of bias characteristics across several classes of trained and fine-tuned text-to-image models, namely: foundation, photo realism, animation and art. Additionally, we provide a quantifiable measure of model popularity, offering insights into how bias may influence user engagement and adoption.
2 Background and Related Work
Generative Text-to-Image Models have gained significant attention among AI practitioners and the wider general public. These models, composed of tokenizers, text encoders, denoising networks, and schedulers, enable users to generate unique images from conditional prompts. The foundational de-noising process proposed by Sohl-Dickstein et al. (2015) inspired many of the underlying generative capabilities of modern T2I models. Subsequent advancements include denoising diffusion probabilistic models (DDPMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020), denoising diffusion implicit models (DDIMs) (Song et al., 2020a), and stochastic differential equation (SDE)-based approaches (Song et al., 2020b). Rectified flow-based de-noising paradigms have recently gained prominence, as seen in Stable Diffusion 3 (Esser et al., 2024), FLUX (BlackForestLabs, 2024) and PixArt-α/Σ (Chen et al., 2023; 2025).
These models often use a modified, conditional U-Net (Ronneberger et al., 2015) for latent denoising. Conditional generative models integrate a network to convert user inputs into guidance vectors, steering the denoising process to match input prompts. In T2I models, Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) and T5 encoders (Ni et al., 2021) are commonly used to map textual inputs into semantically rich embedding spaces. Larger models often combine multiple text encoders to enhance performance (Esser et al., 2024; BlackForestLabs, 2024).
By combining embedded denoising networks and text encoders, various T2I foundation models have been developed and released to the public. Notable examples include Stable Diffusion (v1 to v3.5/XL variants) (Rombach et al., 2022; Esser et al., 2024; Podell et al., 2023), DALL-E 2/3 (Ramesh et al., 2022; Betker et al., 2023), and Imagen (Saharia et al., 2022). Through cost-effective fine-tuning techniques like DreamBooth (Ruiz et al., 2023), Low-Rank Adaptation (LoRA) (Hu et al., 2021), and Textual Inversion (Gal et al., 2022), AI practitioners and hobbyists can create custom T2I models with tailored representations of learned concepts. However, these models are often shared on platforms like the HuggingFace Hub without sufficient acknowledgment of their potential biases, raising concerns about their responsible dissemination.
Bias and Ethical AI Evaluation Frameworks. Modern foundation models are trained on large, uncurated internet datasets, which often contain harmful, inaccurate, or biased representations that can manifest in generated outputs (Ferrara, 2023; Mehrabi et al., 2021). Unlike biased classification systems, bias in generative models is subtler and harder to detect due to their expansive input/output spaces and complex semantic relationships arising from massive training datasets. Without proper mitigation or quantification, these biases can lead to the proliferation of harmful stereotypes and misinformation. Compounded training and fine-tuning processes can thereby exacerbate or shift a model’s bias characteristics, raising ethical concerns, especially in front-facing applications. This underscores the critical need for bias quantification to address ethical AI considerations.
Several ethical AI evaluation frameworks have manifested as a result of these open research questions (Cho et al., 2023; Luccioni et al., 2023; Luo et al., 2024; Chinchure et al., 2024; Vice et al., 2023; Bakr et al., 2023; Hu et al., 2023; Teo et al., 2024; Gandikota et al., 2024; Huang et al., 2024; Schramowski et al., 2023; Seshadri et al., 2023; Naik & Nushi, 2023; D’Incà et al., 2024), addressing issues of fairness, bias, reliability and safety. While this work focuses primarily on biases, it is important to consider the synergy that exists across these four ethical AI dimensions. To conduct these evaluations, many works deploy auxiliary captioning or VLM/VQA models to facilitate the extraction of descriptive metrics.
The TIFA method introduced by Hu et al. (2023) defines a comprehensive list of quantifiable T2I statistics, leveraging a VQA model to provide extensive evaluation results on generated image and model characteristics. In a similar vein, the HRS benchmark proposed by Bakr et al. (2023) also considers a wide range of T2I model characteristics beyond the bias dimension, as it considers image quality and semantic alignment (scene composition). The StableBias (Luccioni et al., 2023) and DALL-Eval (Cho et al., 2023) methods have been proposed to assess reasoning skills and social biases (including gender/ethnicity) of text-to-image models, deploying captioning and VQA models for their analyses. Similarly, frameworks like FAIntbench (Luo et al., 2024), TIBET (Chinchure et al., 2024) and OpenBias (D’Incà et al., 2024) each consider the recognition of biases along several dimensions, proposing a wider definition of biases, all incorporating LLM and/or VQA models in their evaluation frameworks. FAIntbench considers four dimensions of bias, i.e., manifestation, visibility and acquired/protected attributes (Luo et al., 2024). In comparison, the TIBET framework identifies relevant, potential biases w.r.t. the input prompt (Chinchure et al., 2024). The ‘Try Before You Bias’ (TBYB) evaluation tool encompasses the evaluation methodology proposed by Vice et al. (2023), characterizing bias through: hallucination, distribution bias and generative miss-rate.
While evaluation frameworks are extensive, large-scale bias analysis of open-source, community-driven models remains limited. Existing efforts often focus on narrow subsets of models, leaving a critical need for a systematic, scalable approach. We bridge this gap with a comprehensive evaluation of over 100 models, utilizing the TBYB tool for its compatibility with the HuggingFace Hub.
3 Methodology
In this work, we conduct comprehensive bias evaluations of 103 unique T2I models released from August 2022 to December 2024. To identify general bias characteristics, we employ the general bias evaluation methodology defined in Vice et al. (2023) to generate images of 100 random objects (3 images/prompt = 300 images per evaluated model). This allows us to infer diverse, fundamental bias characteristics of each model.
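To make the evaluation protocol concrete, the following is a minimal sketch of the generation loop, assuming a diffusers-compatible model; the `OBJECTS` list and the prompt template are hypothetical stand-ins for the 100 random object prompts used by TBYB.

```python
import torch
from diffusers import DiffusionPipeline

# Hypothetical stand-in for the 100 random objects used in the protocol.
OBJECTS = ["apple", "bicycle", "lighthouse"]  # ... 100 objects in total
IMAGES_PER_PROMPT = 3

pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

generated = []
for obj in OBJECTS:
    prompt = f"a picture of a {obj}"  # assumed prompt template
    # 3 images per prompt -> 300 images per model for 100 objects.
    images = pipe(prompt, num_images_per_prompt=IMAGES_PER_PROMPT).images
    generated.extend((obj, img) for img in images)
```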
3.1 Evaluation Metrics
Data biases can propagate into T2I models, leading to skewed representations in their outputs. Furthermore, compounded training and fine-tuning of large foundation models can fundamentally alter their bias characteristics. Regardless of intent, the severity of these biases must be quantifiable and must capture the diverse ways in which bias can manifest. To address these requirements, we employ three metrics for quantifying bias, motivated by fundamental examples that illustrate their relevance and applicability in evaluating model behavior.
(i) When prompted with “a picture of an apple”, a text-to-image model may generate an apple hanging off a tree. While semantically logical, one could argue that generating the tree evidences a hallucinated object in the scene (by addition), as it was not explicitly requested in the prompt. Alternatively, the model may generate an apple tree with no apples, omitting the object in the prompt. To account for both cases, we compute Jaccard hallucination ‘$H_J$’, derived from the Intersection over Union (IoU).
(ii) A hypothetical Nation-X commissions the development of a generative model for producing tourism content with blended national-flag iconography. The distribution of generated content would reflect the intentional skew by showing peaks in the number of occurrences of concepts relating to Nation-X. Thus, we consider distribution bias ‘$B_D$’ as a quantifiable means of evaluating this phenomenon.
(iii) A T2I model has been fine-tuned on an intentionally-biased dataset that relabels images of ‘car’ as ‘person’. This results in an intentionally-biased and misaligned output space that would cause misclassification w.r.t. the label provided by the input prompt. This justifies the need for quantifying generative miss-rate ‘$M_G$’.
Covering the underlying motivations of the above examples, we use $B_D$, $H_J$ and $M_G$ to analyze model bias. We also combine them into a single, log-based bias evaluation score ‘$\mathcal{B}$’ to characterize the overall bias behavior, which is useful for independently ranking different models. We visualize our bias evaluation framework in Fig. 1.

Jaccard Hallucination - $H_J$. While usually discussed in the context of language models (Gunjal et al., 2023; Ji et al., 2023), hallucinations are a common side effect in many foundation models (Rawte et al., 2023). They have been proposed as a vehicle for image out-painting (Xiao et al., 2020) and generative model improvement (Li et al., 2022b; Xiao et al., 2020) tasks. When drawing representations of objects and classes from a learned distribution, it is logical that the semantically-rich manifolds may cause a model to also generate semantically-relevant objects as a result.
Here, $H_J$ considers two hallucination perspectives, i.e.: (i) the addition of unspecified objects in the output and (ii) the omission of objects specified in the input. For a set of $N$ output images $I_i$, generated from input prompts $p_i$,
$$H_J = \frac{1}{N}\sum_{i=1}^{N}\left(1 - \frac{|\mathcal{X}_i \cap \mathcal{Y}_i|}{|\mathcal{X}_i \cup \mathcal{Y}_i|}\right) \qquad (1)$$
where ‘$\mathcal{X}_i$’ defines the input objects extracted from prompt $p_i$ and ‘$\mathcal{Y}_i$’ defines the objects detected in the output image $I_i$, extracted from a generated caption. A lower $H_J$ indicates a smaller discrepancy between the input and output objects/concepts and thus, demonstrates less hallucinatory (biased) behavior.
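A minimal sketch of Eq. (1), assuming the prompt and caption have already been reduced to sets of object tokens:

```python
def jaccard_hallucination(input_objects, output_objects):
    """H_J for one image: 1 - IoU between prompted and detected objects."""
    x, y = set(input_objects), set(output_objects)
    if not x and not y:
        return 0.0  # nothing prompted, nothing detected
    return 1.0 - len(x & y) / len(x | y)

def mean_jaccard_hallucination(pairs):
    """Average H_J over all N (input, output) object-set pairs (Eq. 1)."""
    return sum(jaccard_hallucination(x, y) for x, y in pairs) / len(pairs)

# The "apple" example above: an apple is prompted; an apple and a tree
# are detected in the output, so the tree is hallucinated by addition.
assert abs(jaccard_hallucination({"apple"}, {"apple", "tree"}) - 0.5) < 1e-9
```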
Distribution Bias - $B_D$ is derived from the area under the curve (AuC) of detected objects, capturing the frequency of objects/concepts that appear in generated images (that were not specified in the prompt) (Vice et al., 2023). After generating images and filtering objects, an object token dictionary ‘$\mathcal{T}$’ is constructed, containing concept (word) ‘$w_j$’ and number-of-occurrences ‘$c_j$’ pairs. The distribution bias can be calculated through the AuC, after sorting (high to low) and applying min-max normalization:
$$\hat{c}_j = \frac{c_j - \min(c)}{\max(c) - \min(c)}, \quad j = 1, \ldots, |\mathcal{T}| \qquad (2)$$

$$B_D = \mathrm{AuC}(\hat{c}) \approx \sum_{j=1}^{|\mathcal{T}|-1} \frac{\hat{c}_j + \hat{c}_{j+1}}{2} \qquad (3)$$
Peaks in the generated object distribution may indicate that significant attention is being applied along a specific bias direction; thus, $B_D$ represents another avenue in which bias can manifest itself.
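A minimal sketch of Eqs. (2)-(3); the trapezoidal estimator for the AuC is an assumption here:

```python
def distribution_bias(counts):
    """B_D: AuC of the sorted, min-max-normalized occurrence curve (Eqs. 2-3).

    `counts` maps each detected concept w_j to its occurrence count c_j.
    """
    c = sorted(counts.values(), reverse=True)  # sort high to low
    lo, hi = min(c), max(c)
    norm = [(v - lo) / (hi - lo) for v in c] if hi > lo else [1.0] * len(c)
    # Trapezoidal area under the normalized curve.
    return sum((norm[j] + norm[j + 1]) / 2 for j in range(len(norm) - 1))

# A flat distribution (fairer) yields a larger AuC than a peaked one.
print(distribution_bias({"tree": 40, "table": 39, "grass": 38, "sky": 37}))  # 1.5
print(distribution_bias({"tree": 90, "table": 3, "grass": 2, "sky": 1}))     # ~0.53
```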
Generative Miss Rate - $M_G$. Bias can affect model performance, particularly if it shifts the output representations in a way that causes significant misalignment (Vice et al., 2023). As visualized in Fig. 1, a separate vision transformer (ViT) is deployed to classify generated images and determine $M_G$. Generally, model alignment should be high and thus, the miss-rate should demonstrate a low variance across models. A significantly high $M_G$ may indicate that a model’s learned biases are shifting output representations away from the expected output (as governed by the prompt). For models trained to complete specific tasks (like generating a particular art style), we may find that the miss rate is much higher, potentially by design.
Given a prompt (classifier target label) ‘$y_i$’ and generated image $I_i$, the deployed ViT outputs a prediction ‘$\hat{y}_i$’, measuring the alignment of the image $I_i$ to the label $y_i$. For $N$ generated images,
$$M_G = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i \neq y_i\right] \qquad (4)$$
where $y_i$ represents the target class. If the classifier fails to detect the generated image as a valid representation of $y_i$, then $M_G$ increases. A higher $M_G$ indicates a greater misalignment with input prompts, which may be (a) a symptom of a biased output space and/or (b) the result of a task that causes significant changes in output representations. We visualize how $H_J$ and $M_G$ manifest in the output representations of these models in Fig. 2.
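A hedged sketch of the miss-rate computation; the specific ViT checkpoint and the label-matching rule are assumptions here, not necessarily those used by TBYB:

```python
from transformers import pipeline

# Any ImageNet-style ViT classifier works as a stand-in here.
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")

def generative_miss_rate(pairs):
    """M_G: fraction of images whose predictions miss the target label (Eq. 4).

    `pairs` is a list of (target_label, PIL.Image) tuples.
    """
    misses = 0
    for label, image in pairs:
        preds = classifier(image, top_k=5)
        # Loose match: a hit if the target appears in any top-5 label.
        if not any(label.lower() in p["label"].lower() for p in preds):
            misses += 1
    return misses / len(pairs)
```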
The Try Before You Bias (TBYB) Tool is a publicly available, practical software implementation of the three-dimensional bias evaluation framework discussed above. The TBYB interface allows users to evaluate T2I models hosted on the HuggingFace Hub in a black-box evaluation set-up, provided repositories contain a model_index.json file. The BLIP (Li et al., 2022a) model is deployed for image captioning. Synonym detection functions in the NLTK (Bird & Loper, 2009) package are deployed to mitigate natural language discrepancies between the input prompt and the generated caption.
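The sketch below approximates this captioning-and-matching pipeline; the non-noun filtering and the similarity threshold for near-synonym matching (see Section 3.2) are assumed values:

```python
import nltk
from nltk.corpus import wordnet as wn
from transformers import pipeline

nltk.download("wordnet", quiet=True)
# BLIP captioner for describing generated images.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def caption_objects(image):
    """Extract candidate object tokens from a generated caption."""
    caption = captioner(image)[0]["generated_text"]
    return set(caption.lower().split())  # TBYB additionally filters non-nouns

def are_similar(word_a, word_b, threshold=0.8):
    """Similarity-score matching for near-synonyms, e.g. 'sneakers'/'shoes'.

    WordNet path_similarity lies in (0, 1]; the 0.8 threshold is assumed.
    """
    scores = [
        sa.path_similarity(sb)
        for sa in wn.synsets(word_a)
        for sb in wn.synsets(word_b)
    ]
    return max((s for s in scores if s is not None), default=0.0) >= threshold
```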

3.2 Systematic Bias Evaluation Strategy
Based on the generated outputs and model metadata, we identify each model’s type as one of {foundation, photo realism, art (non-anime), animation/anime}. We define foundation models as those designed for general purposes, encompassing a wider range of tasks. Photo-realism models are those fine-tuned for higher-fidelity, photo-realistic generation tasks. Art-based models are those designed for style-transfer tasks in which non-anime artistic styles are the target. Animation/anime-tuned models are designed to replicate anime-inspired art styles, a common application of models hosted on HuggingFace.
For time-based evaluations, we construct a timeline spanning from August 2022 to December 2024, analyzing trends across various model types. We then extrapolate these trends to understand how different categories of models are evolving. As part of this analysis, we investigate whether larger, more sophisticated foundation models, such as Stable Diffusion 3/XL, have achieved better alignment, reduced hallucinations, and fairer distributions of generated objects. Additionally, we provide a detailed analysis of each model type and explore the relationship between model popularity and bias statistics. Finally, we conduct bias evaluations across different noise schedulers to identify potential bias behaviors associated with their deployment.
In this work, we improve on the similarity detection function in (Vice et al., 2023) by incorporating a similarity-score-based approach to handle similar concepts, e.g., ‘sneakers’ vs. ‘shoes’. Additionally, we omit commonly occurring primary (red, blue, yellow), secondary (green, orange, purple) and neutral colors (black, white, brown, grey) from generated captions, as our analyses found that color descriptions are not a reliable symptom of hallucination and can adversely skew results in many cases. Furthermore, we propose combining the three metrics into a single bias score, using a log scale to account for varied metric ranges, such that:
$$\mathcal{B} = \ln\!\left(\frac{1}{B_D\,(1 - H_J)\,(1 - M_G)}\right) = -\ln(B_D) - \ln(1 - H_J) - \ln(1 - M_G) \qquad (5)$$
where a proportional relationship exists between observed model bias and $\mathcal{B}$. This allows for the calculation of biases for a single model, in a black-box setup, without relying on normalized relationships to a set of evaluated models as initially proposed in (Vice et al., 2023).
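As a consistency check, Eq. (5) can be evaluated directly against the rows of Table 1; a minimal sketch:

```python
import math

def bias_score(b_d, h_j, m_g):
    """Combined log-scaled bias score (Eq. 5); larger values = more biased."""
    return -math.log(b_d) - math.log(1.0 - h_j) - math.log(1.0 - m_g)

# Reproduces (up to rounding) the CompVis/stable-diffusion-v1-4 row of
# Table 1: B_D = 11.7258, H_J = 0.5621, M_G = 0 -> -1.6360.
print(round(bias_score(11.7258, 0.5621, 0.0), 4))
```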
Model Popularity. As part of our analysis, we aim to analyze the relationship (if any) between model popularity and bias. To quantify model popularity, we designed a quantifiable score ‘$P$’, leveraging reported engagement information on the HuggingFace Hub, i.e., the number of likes (historical) ‘$n_L$’ and the number of downloads ‘$n_D$’ in the last month (recent engagement). Given that the number of likes is generally far lower than the number of downloads, we apply logarithmic scaling and proportional scaling factors ‘$\alpha$’ and ‘$\beta$’ to account for the importance of continued engagement ($n_D$) and mitigate spikes in $P$ associated with recency bias. Thus, we define:
$$P = \alpha\,\ln(n_L) + \beta\,\ln(n_D) \qquad (6)$$
where we deploy $\alpha = 0.6$ and $\beta = 0.4$ in our experiments to account slightly more for historical influence while managing recency bias, such that $\alpha + \beta = 1$.
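A sketch of Eq. (6) using live engagement statistics from the HuggingFace Hub API; note that the `downloads` field covers roughly the last 30 days, matching the recent-engagement term above, and the guard against zero counts is an added assumption:

```python
import math
from huggingface_hub import HfApi

def popularity(model_id, alpha=0.6, beta=0.4):
    """Popularity score P (Eq. 6) from HuggingFace Hub engagement data."""
    info = HfApi().model_info(model_id)
    n_likes = max(info.likes or 0, 1)          # historical engagement, n_L
    n_downloads = max(info.downloads or 0, 1)  # ~last-month downloads, n_D
    return alpha * math.log(n_likes) + beta * math.log(n_downloads)

print(popularity("CompVis/stable-diffusion-v1-4"))
```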
4 Results and Discussion
Our appraisal of the general bias characteristics of text-to-image models allows us to conduct a suite of evaluation studies to explore and formalize relationships between observed biases and model characteristics. Temporal-, categorical- and popularity-based analyses allow us to identify how bias characteristics: (i) have evolved over time, (ii) change with respect to different generative tasks or embedded de-noising schedulers and, (iii) impact how users engage with these models.
High-level Observations of General Bias Characteristics. We report a truncated list of evaluation results in Table 1, highlighting models that exhibit high, low and median bias behavior. Along with these, we also report results for highly popular foundation models like the various Stable Diffusion versions. Analyzing Table 1 and Figs. 2, 3, we observe that photo-realism and foundation models tend to generate relatively unbiased representations, which is expected given that these models are designed for general user inference tasks and, in the case of photo-realism models, improvements in generative fidelity. In comparison, at the bottom of Table 1, we observe that many animation and art-tuned models report relatively more biased behavior, resulting from their task-oriented fine-tuning. Observing the outputs of these models, we found that the tendency to focus on generating specific characters or art styles irrespective of the prompt resulted in high levels of hallucination and misalignment (see Figs. 2, 3).
Model | Task Category | Denoiser | Resolution | Release (dd/mm/yy) | $P$ | $B_D$ | $H_J$ | $M_G$ | $\mathcal{B}$
---|---|---|---|---|---|---|---|---|---|
Envvi/Inkpunk-Diffusion | art | PNDMScheduler | 512 x 512 | 25/11/22 | 7.2323 | 18.9000 | 0.5346 | 0.0033 | -2.1711 |
Yntec/beLIEve | photo realism | DPMSolverMultistepScheduler | 768 x 768 | 01/08/24 | 5.2547 | 17.6176 | 0.5083 | 0.0000 | -2.1589 |
segmind/SSD-1B | photo realism | EulerDiscreteScheduler | 1024 x 1024 | 19/10/23 | 6.7116 | 15.7000 | 0.4747 | 0.0000 | -2.1098 |
RunDiffusion/Juggernaut-X-v10 | foundation | EulerDiscreteScheduler | 1024 x 1024 | 20/04/24 | 6.8125 | 16.3571 | 0.4992 | 0.0000 | -2.1031 |
prompthero/openjourney-v4 | photo realism | PNDMScheduler | 512 x 512 | 12/12/22 | 8.4414 | 15.9211 | 0.4881 | 0.0000 | -2.0981 |
Lykon/dreamshaper-8 | photo realism | DEISMultistepScheduler | 512 x 512 | 27/08/23 | 6.8769 | 17.3947 | 0.5467 | 0.0000 | -2.0649 |
RunDiffusion/Juggernaut-XL-v9 | foundation | DDPMScheduler | 1024 x 1024 | 19/02/24 | 8.6025 | 14.4048 | 0.4847 | 0.0000 | -2.0046 |
stabilityai/sd-turbo | foundation | EulerDiscreteScheduler | 512 x 512 | 28/11/23 | 7.9498 | 14.5476 | 0.4930 | 0.0000 | -1.9982 |
eienmojiki/Anything-XL | animation / anime | EulerAncestralDiscreteScheduler | 1024 x 1024 | 11/03/24 | 5.6000 | 14.8333 | 0.5287 | 0.0000 | -1.9446 |
stabilityai/stable-diffusion-3.5-medium | foundation | FlowMatchEulerDiscreteScheduler | 1024 x 1024 | 29/10/24 | 8.2481 | 14.0455 | 0.5049 | 0.0000 | -1.9393 |
MirageML/dreambooth-nike | photo realism | PNDMScheduler | 512 x 512 | 01/11/22 | 3.3402 | 14.4048 | 0.5206 | 0.0000 | -1.9323 |
dataautogpt3/ProteusV0.3 | foundation | EulerDiscreteScheduler | 1024 x 1024 | 13/02/24 | 7.3949 | 14.5833 | 0.5324 | 0.0000 | -1.9196 |
stablediffusionapi/juggernaut-reborn | foundation | PNDMScheduler | 512 x 512 | 21/01/24 | 5.0168 | 13.9783 | 0.5073 | 0.0167 | -1.9129 |
: | : | : | : | : | : | : | : | : | : |
stabilityai/stable-diffusion-3.5-large | foundation | FlowMatchEulerDiscreteScheduler | 1024 x 1024 | 22/10/24 | 9.2260 | 11.5769 | 0.4939 | 0.0000 | -1.7680 |
: | : | : | : | : | : | : | : | : | : |
stabilityai/stable-diffusion-2-1 | foundation | DDIMScheduler | 512 x 512 | 07/12/22 | 10.5860 | 11.7963 | 0.5349 | 0.0100 | -1.6921 |
: | : | : | : | : | : | : | : | : | : |
CompVis/stable-diffusion-v1-4 | foundation | PNDMScheduler | 512 x 512 | 20/08/22 | 10.7885 | 11.7258 | 0.5621 | 0.0000 | -1.6360 |
: | : | : | : | : | : | : | : | : | : |
lemon2431/toonify_v20 | animation / anime | PNDMScheduler | 512 x 512 | 16/10/23 | 4.1148 | 10.9412 | 0.5469 | 0.0033 | -1.5976 |
SG161222/RealVisXL_V4.0 | photo realism | EulerDiscreteScheduler | 1024 x 1024 | 13/02/24 | 8.6080 | 10.6250 | 0.5405 | 0.0000 | -1.5856 |
ckpt/anything-v4.5 | art | PNDMScheduler | 512 x 512 | 19/01/23 | 5.8262 | 11.5968 | 0.5784 | 0.0033 | -1.5837 |
emilianJR/chilloutmix_NiPrunedFp32Fix | photo realism | PNDMScheduler | 512 x 512 | 19/04/23 | 6.0849 | 10.7941 | 0.5498 | 0.0000 | -1.5810 |
Kernel/sd-nsfw | photo realism | PNDMScheduler | 512 x 512 | 15/07/23 | 5.5154 | 11.5333 | 0.5828 | 0.0000 | -1.5711 |
Lykon/AAM_XL_AnimeMix | animation / anime | EulerDiscreteScheduler | 1024 x 1024 | 19/01/24 | 6.8090 | 9.0152 | 0.4843 | 0.0000 | -1.5367 |
stablediffusionapi/realistic-stock-photo | photo realism | EulerDiscreteScheduler | 1024 x 1024 | 22/10/23 | 5.1367 | 10.2188 | 0.5451 | 0.0000 | -1.5366 |
GraydientPlatformAPI/comicbabes2 | art | PNDMScheduler | 512 x 512 | 07/01/24 | 4.0530 | 10.0313 | 0.5447 | 0.0033 | -1.5156 |
SG161222/Realistic_Vision_V6.0_B1_noVAE | photo realism | PNDMScheduler | 896 x 896 | 29/11/23 | 7.7733 | 11.1571 | 0.5957 | 0.0000 | -1.5066 |
scenario-labs/juggernaut_reborn | photo realism | DPMSolverMultistepScheduler | 512 x 512 | 29/05/24 | 4.5414 | 10.0294 | 0.5504 | 0.0000 | -1.5061 |
: | : | : | : | : | : | : | : | : | : |
stabilityai/stable-diffusion-xl-base-1.0 | photo realism | EulerDiscreteScheduler | 1024 x 1024 | 25/07/23 | 11.1360 | 8.6515 | 0.5076 | 0.0000 | -1.4492 |
: | : | : | : | : | : | : | : | : | : |
stabilityai/sdxl-turbo | foundation | EulerAncestralDiscreteScheduler | 512 x 512 | 27/11/23 | 10.1726 | 8.0778 | 0.5493 | 0.0000 | -1.2922 |
: | : | : | : | : | : | : | : | : | : |
dataautogpt3/ProteusV0.4-Lightning | foundation | EulerDiscreteScheduler | 1024 x 1024 | 22/02/24 | 6.2324 | 4.6739 | 0.4055 | 0.0000 | -1.0220 |
SG161222/Realistic_Vision_V2.0 | photo realism | PNDMScheduler | 512 x 512 | 21/03/23 | 8.3000 | 6.1154 | 0.5480 | 0.0000 | -1.0167 |
sd-community/sdxl-flash | foundation | DPMSolverSinglestepScheduler | 1024 x 1024 | 19/05/24 | 7.2798 | 4.7826 | 0.4424 | 0.0000 | -0.9809 |
Mitsua/mitsua-diffusion-cc0 | art | PNDMScheduler | 512 x 512 | 22/12/22 | 5.1305 | 10.8947 | 0.7426 | 0.0900 | -0.9368 |
OnomaAIResearch/Illustrious-xl-early-release-v0 | animation / anime | EulerDiscreteScheduler | 1024 x 1024 | 20/09/24 | 7.3851 | 11.6000 | 0.7524 | 0.1700 | -0.8688 |
digiplay/ZHMix-Dramatic-v2.0 | animation / anime | EulerDiscreteScheduler | 768 x 768 | 03/12/23 | 4.7306 | 5.7800 | 0.6506 | 0.0333 | -0.6689 |
DGSpitzer/Cyberpunk-Anime-Diffusion | animation / anime | PNDMScheduler | 704 x 704 | 28/10/22 | 6.7796 | 2.9722 | 0.5845 | 0.0600 | -0.1492 |
Emanon14/NONAMEmix_v1 | animation / anime | EulerAncestralDiscreteScheduler | 1024 x 1024 | 23/11/24 | 5.1821 | 5.3400 | 0.7447 | 0.1700 | -0.1237 |
Onodofthenorth/SD_PixelArt_SpriteSheet_Generator | art | PNDMScheduler | 512 x 512 | 01/11/22 | 6.4996 | 6.5930 | 0.7898 | 0.3367 | 0.0841 |
Niggendar/duchaitenPonyXLNo_ponyNoScoreV40 | art | EulerDiscreteScheduler | 1024 x 1024 | 01/06/24 | 5.1912 | 3.1620 | 0.7458 | 0.1000 | 0.3237 |
lambdalabs/sd-pokemon-diffusers | animation / anime | PNDMScheduler | 512 x 512 | 16/09/22 | 6.3118 | 9.6509 | 0.8639 | 0.6033 | 0.6519 |
Raelina/Raehoshi-illust-XL-3 | animation / anime | EulerDiscreteScheduler | 1024 x 1024 | 11/12/24 | 3.7427 | 4.8772 | 0.8926 | 0.6700 | 1.7549 |
monadical-labs/minecraft-skin-generator-sdxl | animation / anime | EulerDiscreteScheduler | 768 x 768 | 19/02/24 | 5.3317 | 5.9786 | 0.9721 | 0.7933 | 3.3680 |

Figure 2 presents a qualitative overview of bias manifestations in model outputs, using examples from Table 1 to contrast biased and unbiased behaviors. These results align with the quantitative metrics: a higher average $M_G$ will generally demonstrate a greater semantic misalignment (e.g., lambdalabs/sd-pokemon-diffusers). Low-$B_D$ models show constrained diversity or representational bias. Changes in $H_J$ are straightforward, reflecting disparities between input and generated objects.
The varying scales of the three metrics necessitate a logarithmic scale for comparing overall model bias. Each metric uniquely characterizes bias. Table 1 shows that low-bias models (lower $\mathcal{B}$) typically report a high $B_D$, indicating a fairer distribution of generated objects. In contrast, highly biased models (higher $\mathcal{B}$) report a low $B_D$, which suggests outliers or peaks in the output distribution. For $H_J$, T2I models inherently hallucinate due to their semantically rich latent spaces. The average $H_J \approx 0.55$ implies a roughly 45% IoU between prompted and generated objects. Foundation and photo-realism models cluster near the mean, whereas highly biased models exhibit extreme values, with a maximum $H_J = 0.9721$ meaning just 2.79% correlation between the input and output for the most biased model. $M_G$ remains low across most models, with a mean ($\approx 0.03$) near the minimum ($0.0$), indicating valid outputs 97% of the time despite hallucinations. Models with a high $M_G$ exhibit misaligned behavior, which, depending on model design, may be intentional.
Evolution of Biases over Time. The release of the seminal latent diffusion work (Rombach et al., 2022), culminating in the public availability of the popular Stable Diffusion architecture on August 22, 2022, marked a pivotal moment for text-to-image generative models. Its launch on the HuggingFace Hub and subsequent community engagement spurred significant advancements in foundation models and task-specific variants. Accordingly, we use August 2022 as the starting point for our time-based analyses, with the latest evaluated model released in December 2024.
Our evaluation spans 103 models over 28 months, presenting time-based bias analyses by individual metrics (Fig. 3) and model categories (Fig. 4). The timeline ($x$-axis) is consistent across sub-figures, with models grouped by task categories to examine trends. Bias trends, such as the steep increase in art and animation models’ bias over time (Fig. 4(e)), highlight the impact of hobbyists and practitioners embedding stylistic preferences or specific characters into these models. These intentional biases are reflected in their outputs, as supported by observations in Fig. 3.
In comparison, models associated with general tasks, i.e., those belonging to the foundation and photo-realism categories, have maintained consistent, if not lower, bias characteristics over time (see Figs. 3 and 4(e)). The increase in training data sizes and conscious improvements made to human labeling and captioning in general have resulted in wider and denser manifolds with a greater diversity of concept representations. Significantly, comparing the Stable Diffusion v1.4/2.1/3.5 rows of Table 1, we can see that hallucination and distribution bias scores improve with each major version upgrade over time.

Model Type | $P$ | $B_D$ | $H_J$ | $M_G$ | $\mathcal{B}$
---|---|---|---|---|---|
Foundation | 7.2033 (1.670) | 10.7919 (3.195) | 0.5175 (0.038) | 0.0019 (0.005) | -1.5960 (0.308) |
Photo Realism | 6.4354 (1.612) | 11.1097 (2.979) | 0.5392 (0.030) | 0.0004 (0.001) | -1.5949 (0.292) |
Art | 6.2180 (1.341) | 10.0088 (4.135) | 0.6159 (0.101) | 0.0537 (0.107) | -1.1581 (0.786) |
Animation/Anime | 6.0786 (1.367) | 10.4494 (3.147) | 0.5969 (0.120) | 0.0830 (0.201) | -1.1503 (1.134) |
On the Influence of Model Type and Popularity. We conducted an evaluation of biases w.r.t. model categories and their popularity, exploiting Eq. (6) to quantify the latter. We report these findings in Table 2 and observe that foundation and photo-realism models are, on average, the most popular among users. Interestingly, these models also tend to have less biased output representations when we consider the quantitative findings. Additionally, by analyzing the standard deviations in Table 2, we see that foundation and photo-realism model performances are typically more consistent than their art/animation counterparts.
De-noising Scheduler-Dependent Bias Evaluation. Much of the conditional latent diffusion process is predicated on the deployed de-noising scheduler. While similarities exist across different scheduler families and the task remains the same, i.e., using a conditional vector to guide latent de-noising steps to generate an aligned image representation of the input prompt, the mathematical foundations of each scheduler are unique. We report the descriptive statistics of different schedulers in Table 3, highlighting eight scheduler categories. The FlowMatchEulerDiscrete scheduler is deployed in Stable Diffusion 3 variants, which accounts for its high popularity and low $\mathcal{B}$ score. Recently, flow-based de-noising schedulers have gained increased attention in state-of-the-art T2I models like Stable Diffusion 3 and FLUX (Esser et al., 2024; BlackForestLabs, 2024).
In comparison, the EulerDiscrete scheduler (Karras et al., 2022) reports the largest bias and the highest average miss-rate. Incremental improvements in scheduler architectures since the release of EulerDiscrete, along with modern T2I models opting for newer schedulers, are logical reasons why this scheduler reports significantly higher bias scores. Similarly, the EulerAncestralDiscrete scheduler, which contributes “ancestral sampling”, performs consistently with its predecessor. These seminal works have inspired improvements which, as shown through the FlowMatchEulerDiscrete scheduler, have resulted in significant performance gains.
We note that while using quantifiable metrics like those reported here present a step in the right direction, any definitive correlations will require a deeper analysis into the schedulers themselves.
Scheduler | $P$ | $B_D$ | $H_J$ | $M_G$ | $\mathcal{B}$
---|---|---|---|---|---|
FlowMatchEulerDiscrete | 7.5101 (2.181) | 12.2788 (1.541) | 0.4959 (0.008) | 0.0000 (0.000) | -1.8177 (0.106) |
KDPM2AncestralDiscrete | 7.4314 (0.424) | 9.5803 (2.667) | 0.5279 (0.010) | 0.0000 (0.000) | -1.4894 (0.304) |
DDIM/DDPM | 7.3371 (1.973) | 10.1441 (2.548) | 0.5321 (0.039) | 0.0067 (0.014) | -1.5185 (0.261) |
EulerAncestralDiscrete | 7.1003 (1.667) | 9.1328 (3.083) | 0.5447 (0.084) | 0.0213 (0.060) | -1.3332 (0.552) |
DEISMultistep | 6.6288 (0.357) | 12.8746 (3.953) | 0.5713 (0.025) | 0.0067 (0.012) | -1.6710 (0.341) |
PNDM | 6.3172 (1.584) | 11.1714 (3.002) | 0.5694 (0.072) | 0.0302 (0.106) | -1.4658 (0.547) |
EulerDiscrete | 6.2381 (1.392) | 10.3751 (3.573) | 0.5702 (0.124) | 0.0621 (0.190) | -1.2184 (1.175) |
DPMSolverMultistep | 4.9966 (0.603) | 12.1618 (3.654) | 0.5662 (0.046) | 0.0042 (0.008) | -1.6255 (0.360) |
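To attribute metric differences to the scheduler rather than the backbone, the same model can be re-evaluated with only its scheduler swapped, as sketched below with diffusers (the model choice here is illustrative):

```python
from diffusers import (
    DiffusionPipeline,
    DPMSolverMultistepScheduler,
    EulerAncestralDiscreteScheduler,
    EulerDiscreteScheduler,
)

pipe = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5")

for scheduler_cls in (EulerDiscreteScheduler,
                      EulerAncestralDiscreteScheduler,
                      DPMSolverMultistepScheduler):
    # Swap only the scheduler; weights and text encoders remain fixed.
    pipe.scheduler = scheduler_cls.from_config(pipe.scheduler.config)
    # ... regenerate the 300 evaluation images and recompute B_D, H_J, M_G ...
```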
5 Conclusion
We have conducted an extensive evaluation of text-to-image models, utilizing the open HuggingFace Hub to facilitate our analyses of the bias characteristics of 103 unique models. To improve on existing evaluation methodologies, we combine three independent metrics, i.e., (i) distribution bias, (ii) Jaccard hallucination and (iii) generative miss-rate, into a single log-scaled metric. By accounting for various generative model categories and quantifying public engagement, we have presented a comprehensive set of model evaluations. Identifying the fundamental bias characteristics of large, publicly-available text-to-image models is a critical task in a democratized AI environment, considering that the exposure of these models to wider audiences continues to grow over time. The answer to the question “are models more biased now than they were 3 years ago?” really depends on the task. We see that iterative releases of Stable Diffusion models, for example, have resulted in marginal improvements in bias characteristics over time (from SD 1.1 to 3.5). Foundation and photo-realism models have demonstrated significant reductions in hallucination and increases in alignment, which is beneficial for improving reliability for a wider range of audiences. Style-transferred, art and animation models, by contrast, have demonstrated increased bias characteristics - a byproduct of intentionally designing models to achieve specific tasks. We hope this work inspires further research in the field and greater exposure for bias evaluation efforts.
Acknowledgments
This research and Dr. Jordan Vice are supported by the NISDRG project #20100007, funded by the Australian Government. Dr. Naveed Akhtar is a recipient of the ARC Discovery Early Career Researcher Award (project #DE230101058), funded by the Australian Government. Professor Ajmal Mian is the recipient of an ARC Future Fellowship Award (project #FT210100268) funded by the Australian Government.
References
- Abid et al. (2021) Abubakar Abid, Maheen Farooqi, and James Zou. Persistent anti-muslim bias in large language models. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’21, pp. 298–306, 2021. ISBN 9781450384735. doi: 10.1145/3461702.3462624. URL https://doi.org/10.1145/3461702.3462624.
- Arrieta et al. (2020) Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador García, Sergio Gil-López, Daniel Molina, Richard Benjamins, et al. Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai. Information fusion, 58:82–115, 2020.
- Bakr et al. (2023) Eslam Mohamed Bakr, Pengzhan Sun, Xiaoqian Shen, Faizan Farooq Khan, Li Erran Li, and Mohamed Elhoseiny. Hrs-bench: Holistic, reliable and scalable benchmark for text-to-image models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 20041–20053, October 2023.
- Betker et al. (2023) James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science., 2(3):8, 2023.
- Bird & Loper (2009) Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python. O’Reilly Media Inc. https://github.com/nltk/nltk, 2009.
- BlackForestLabs (2024) BlackForestLabs. Flux.1. https://huggingface.co/black-forest-labs/FLUX.1-schnell, 2024.
- Chen et al. (2023) Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023. URL https://arxiv.org/abs/2310.00426.
- Chen et al. (2025) Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-Σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. In European Conference on Computer Vision, pp. 74–91. Springer, 2025.
- Chinchure et al. (2024) Aditya Chinchure, Pushkar Shukla, Gaurav Bhatt, Kiri Salij, Kartik Hosanagar, Leonid Sigal, and Matthew Turk. Tibet: Identifying and evaluating biases in text-to-image generative models. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol (eds.), Computer Vision – ECCV 2024, pp. 429–446, Cham, 2024. Springer Nature Switzerland. ISBN 978-3-031-72986-7.
- Cho et al. (2023) Jaemin Cho, Abhay Zala, and Mohit Bansal. Dall-eval: Probing the reasoning skills and social biases of text-to-image generation models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3043–3054, October 2023.
- D’Incà et al. (2024) Moreno D’Incà, Elia Peruzzo, Massimiliano Mancini, Dejia Xu, Vidit Goel, Xingqian Xu, Zhangyang Wang, Humphrey Shi, and Nicu Sebe. Openbias: Open-set bias detection in text-to-image generative models. arXiv preprint arXiv:2404.07990, 2024.
- Esser et al. (2024) Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URL https://arxiv.org/abs/2403.03206.
- Ferrara (2023) Emilio Ferrara. Should chatgpt be biased? challenges and risks of bias in large language models. arXiv preprint arXiv:2304.03738, 2023.
- Gal et al. (2022) Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion, 2022. URL https://arxiv.org/abs/2208.01618.
- Gandikota et al. (2024) Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzyńska, and David Bau. Unified concept editing in diffusion models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 5111–5120, January 2024.
- Garcia et al. (2023) Noa Garcia, Yusuke Hirota, Yankun Wu, and Yuta Nakashima. Uncurated image-text datasets: Shedding light on demographic bias. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6957–6966, June 2023.
- Gunjal et al. (2023) Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. arXiv preprint arXiv:2308.06394, 2023.
- Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685.
- Hu et al. (2023) Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A. Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 20406–20417, October 2023.
- Huang et al. (2024) Yihao Huang, Felix Juefei-Xu, Qing Guo, Jie Zhang, Yutong Wu, Ming Hu, Tianlin Li, Geguang Pu, and Yang Liu. Personalization as a shortcut for few-shot backdoor attack against text-to-image diffusion models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(19):21169–21178, Mar. 2024. doi: 10.1609/aaai.v38i19.30110. URL https://ojs.aaai.org/index.php/AAAI/article/view/30110.
- Ji et al. (2023) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, mar 2023. ISSN 0360-0300. doi: 10.1145/3571730. URL https://doi.org/10.1145/3571730.
- Karras et al. (2022) Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models, 2022. URL https://arxiv.org/abs/2206.00364.
- Li et al. (2022a) Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 12888–12900. PMLR, 17–23 Jul 2022a.
- Li et al. (2022b) Yi Li, Rameswar Panda, Yoon Kim, Chun-Fu Richard Chen, Rogerio S Feris, David Cox, and Nuno Vasconcelos. Valhalla: Visual hallucination for machine translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5216–5226, 2022b.
- Luccioni et al. (2023) Sasha Luccioni, Christopher Akiki, Margaret Mitchell, and Yacine Jernite. Stable bias: Evaluating societal representations in diffusion models. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, pp. 1–14, 2023.
- Luo et al. (2024) Hanjun Luo, Ziye Deng, Ruizhe Chen, and Zuozhu Liu. Faintbench: A holistic and precise benchmark for bias evaluation in text-to-image models, 2024. URL https://arxiv.org/abs/2405.17814.
- Mehrabi et al. (2021) Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning. ACM Computing Surveys, 54(6):1–35, 2021.
- Naik & Nushi (2023) Ranjita Naik and Besmira Nushi. Social biases through the text-to-image generation lens. arXiv preprint arXiv:2304.06034, 2023.
- Ni et al. (2021) Jianmo Ni, Gustavo Hernández Ábrego, Noah Constant, Ji Ma, Keith B. Hall, Daniel Cer, and Yinfei Yang. Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models, 2021. URL https://arxiv.org/abs/2108.08877.
- Podell et al. (2023) Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023. URL https://arxiv.org/abs/2307.01952.
- Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, et al. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 8748–8763. PMLR, 18–24 Jul 2021.
- Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
- Rawte et al. (2023) Vipula Rawte, Amit Sheth, and Amitava Das. A survey of hallucination in large foundation models, 2023. URL https://arxiv.org/abs/2309.05922.
- Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, June 2022.
- Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi (eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pp. 234–241, Cham, 2015. Springer International Publishing.
- Ruiz et al. (2023) Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 22500–22510, June 2023.
- Saharia et al. (2022) Chitwan Saharia, William Chan, Saurabh Saxena, et al. Photorealistic text-to-image diffusion models with deep language understanding. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 36479–36494, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/ec795aeadae0b7d230fa35cbaf04c041-Paper-Conference.pdf.
- Schramowski et al. (2023) Patrick Schramowski, Manuel Brack, Björn Deiseroth, and Kristian Kersting. Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 22522–22531, June 2023.
- Seshadri et al. (2023) Preethi Seshadri, Sameer Singh, and Yanai Elazar. The bias amplification paradox in text-to-image generation. arXiv preprint arXiv:2308.00755, 2023.
- Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp. 2256–2265. PMLR, 2015.
- Song et al. (2020a) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a.
- Song et al. (2020b) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.
- Teo et al. (2024) Christopher Teo, Milad Abdollahzadeh, and Ngai-Man Cheung. On measuring fairness in generative models. Advances in Neural Information Processing Systems, 36, 2024.
- Vice et al. (2023) Jordan Vice, Naveed Akhtar, Richard Hartley, and Ajmal Mian. Quantifying bias in text-to-image generative models. arXiv preprint arXiv:2312.13053, 2023.
- Xiao et al. (2020) Qingguo Xiao, Guangyao Li, and Qiaochuan Chen. Image outpainting: Hallucinating beyond the image. IEEE Access, 8:173576–173583, 2020.
- Zhang et al. (2023) Cheng Zhang, Xuanbai Chen, Siqi Chai, Chen Henry Wu, Dmitry Lagun, Thabo Beeler, and Fernando De la Torre. Iti-gen: Inclusive text-to-image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3969–3980, October 2023.
Appendices
Appendix A: Full bias evaluation results of 103 text-to-image generative models. Evaluations are reported in ascending order of $\mathcal{B}$. Truncated results in Table 1 of the main manuscript are a sub-set of the full results presented here.
Model | Task Category | Denoiser | Resolution | Release (dd/mm/yy) | $P$ | $B_D$ | $H_J$ | $M_G$ | $\mathcal{B}$
---|---|---|---|---|---|---|---|---|---|
Envvi/Inkpunk-Diffusion | art | PNDMScheduler | 512 x 512 | 25/11/22 | 7.2323 | 18.9000 | 0.5346 | 0.0033 | -2.1711 |
Yntec/beLIEve | photo realism | DPMSolverMultistepScheduler | 768 x 768 | 01/08/24 | 5.2547 | 17.6176 | 0.5083 | 0.0000 | -2.1589 |
segmind/SSD-1B | photo realism | EulerDiscreteScheduler | 1024 x 1024 | 19/10/23 | 6.7116 | 15.7000 | 0.4747 | 0.0000 | -2.1098 |
RunDiffusion/Juggernaut-X-v10 | foundation | EulerDiscreteScheduler | 1024 x 1024 | 20/04/24 | 6.8125 | 16.3571 | 0.4992 | 0.0000 | -2.1031 |
prompthero/openjourney-v4 | photo realism | PNDMScheduler | 512 x 512 | 12/12/22 | 8.4414 | 15.9211 | 0.4881 | 0.0000 | -2.0981 |
Lykon/dreamshaper-8 | photo realism | DEISMultistepScheduler | 512 x 512 | 27/08/23 | 6.8769 | 17.3947 | 0.5467 | 0.0000 | -2.0649 |
RunDiffusion/Juggernaut-XL-v9 | foundation | DDPMScheduler | 1024 x 1024 | 19/02/24 | 8.6025 | 14.4048 | 0.4847 | 0.0000 | -2.0046 |
stabilityai/sd-turbo | foundation | EulerDiscreteScheduler | 512 x 512 | 28/11/23 | 7.9498 | 14.5476 | 0.4930 | 0.0000 | -1.9982 |
eienmojiki/Anything-XL | animation / anime | EulerAncestralDiscreteScheduler | 1024 x 1024 | 11/03/24 | 5.6000 | 14.8333 | 0.5287 | 0.0000 | -1.9446 |
stabilityai/stable-diffusion-3.5-medium | foundation | FlowMatchEulerDiscreteScheduler | 1024 x 1024 | 29/10/24 | 8.2481 | 14.0455 | 0.5049 | 0.0000 | -1.9393 |
MirageML/dreambooth-nike | photo realism | PNDMScheduler | 512 x 512 | 01/11/22 | 3.3402 | 14.4048 | 0.5206 | 0.0000 | -1.9323 |
dataautogpt3/ProteusV0.3 | foundation | EulerDiscreteScheduler | 1024 x 1024 | 13/02/24 | 7.3949 | 14.5833 | 0.5324 | 0.0000 | -1.9196 |
stablediffusionapi/juggernaut-reborn | foundation | PNDMScheduler | 512 x 512 | 21/01/24 | 5.0168 | 13.9783 | 0.5073 | 0.0167 | -1.9129 |
hongdthaui/ManmaruMix_v30 | animation / anime | PNDMScheduler | 512 x 512 | 19/01/24 | 4.4554 | 16.2391 | 0.5857 | 0.0033 | -1.9029 |
digiplay/Photon_v1 | photo realism | EulerDiscreteScheduler | 768 x 768 | 09/06/23 | 6.6038 | 13.4583 | 0.5195 | 0.0000 | -1.8667 |
lambdalabs/miniSD-diffusers | foundation | PNDMScheduler | 512 x 512 | 24/11/22 | 5.5457 | 14.6600 | 0.5640 | 0.0000 | -1.8550 |
nitrosocke/mo-di-diffusion | animation / anime | PNDMScheduler | 512 x 512 | 28/10/22 | 7.4843 | 14.3500 | 0.5696 | 0.0033 | -1.8175 |
Lykon/DreamShaper | photo realism | PNDMScheduler | 512 x 512 | 17/03/23 | 9.2373 | 13.7000 | 0.5521 | 0.0000 | -1.8142 |
ItsJayQz/GTA5_Artwork_Diffusion | animation / anime | PNDMScheduler | 512 x 512 | 13/12/22 | 6.9009 | 12.9348 | 0.5257 | 0.0033 | -1.8106 |
digiplay/PerfectDeliberate-Anime_v2 | animation / anime | EulerDiscreteScheduler | 768 x 768 | 07/04/24 | 5.9243 | 14.6923 | 0.5863 | 0.0000 | -1.8048 |
RunDiffusion/Juggernaut-XL-v6 | foundation | EulerDiscreteScheduler | 1024 x 1024 | 22/02/24 | 5.9995 | 14.1296 | 0.5766 | 0.0000 | -1.7889 |
Lykon/AbsoluteReality | photo realism | PNDMScheduler | 512 x 512 | 01/06/23 | 5.3458 | 12.6923 | 0.5297 | 0.0000 | -1.7866 |
segmind/Segmind-Vega | photo realism | EulerDiscreteScheduler | 1024 x 1024 | 01/12/23 | 6.1051 | 12.8462 | 0.5427 | 0.0000 | -1.7706 |
nitrosocke/redshift-diffusion | animation / anime | PNDMScheduler | 512 x 512 | 07/11/22 | 6.5239 | 12.0769 | 0.5139 | 0.0000 | -1.7699 |
stabilityai/stable-diffusion-3.5-large | foundation | FlowMatchEulerDiscreteScheduler | 1024 x 1024 | 22/10/24 | 9.2260 | 11.5769 | 0.4939 | 0.0000 | -1.7680 |
circulus/canvers-real-v3.9.1 | photo realism | PNDMScheduler | 512 x 512 | 05/05/24 | 3.8623 | 12.4615 | 0.5308 | 0.0000 | -1.7658 |
digiplay/AbsoluteReality_v1.8.1 | photo realism | DDIMScheduler | 768 x 768 | 04/08/23 | 6.5521 | 11.6481 | 0.5034 | 0.0000 | -1.7552 |
Yntec/YiffyMix | animation / anime | PNDMScheduler | 512 x 512 | 24/10/23 | 5.9138 | 12.1296 | 0.5259 | 0.0000 | -1.7492 |
Yntec/RealLife | photo realism | EulerDiscreteScheduler | 768 x 768 | 04/01/24 | 5.5369 | 11.9828 | 0.5216 | 0.0000 | -1.7461 |
digiplay/MilkyWonderland_v1 | animation / anime | EulerDiscreteScheduler | 768 x 768 | 30/09/23 | 4.8211 | 12.6538 | 0.5394 | 0.0167 | -1.7458 |
aipicasso/emi-3 | animation / anime | FlowMatchEulerDiscreteScheduler | 1024 x 1024 | 05/12/24 | 5.0561 | 11.2140 | 0.4891 | 0.0000 | -1.7457 |
WarriorMama777/AbyssOrangeMix2 | animation / anime | PNDMScheduler | 512 x 512 | 30/01/23 | 7.8778 | 12.3400 | 0.5353 | 0.0067 | -1.7397 |
WarriorMama777/AbyssOrangeMix | animation / anime | PNDMScheduler | 512 x 512 | 30/01/23 | 7.8316 | 11.5000 | 0.5096 | 0.0000 | -1.7299 |
liamhvn/disney-pixar-cartoon-b | animation / anime | PNDMScheduler | 512 x 512 | 12/07/23 | 4.9147 | 12.8548 | 0.5567 | 0.0167 | -1.7235 |
openart-custom/CrystalClearXL | photo realism | EulerDiscreteScheduler | 1024 x 1024 | 13/08/24 | 5.5520 | 12.2308 | 0.5464 | 0.0000 | -1.7134 |
stablediffusionapi/sdxl-unstable-diffusers-y | foundation | EulerDiscreteScheduler | 1024 x 1024 | 08/10/23 | 5.1028 | 11.3462 | 0.5151 | 0.0000 | -1.7052 |
dataautogpt3/ProteusV0.2 | foundation | KDPM2AncestralDiscreteScheduler | 1024 x 1024 | 19/01/24 | 7.1317 | 11.4655 | 0.5207 | 0.0000 | -1.7040 |
stabilityai/stable-diffusion-2-1 | foundation | DDIMScheduler | 512 x 512 | 07/12/22 | 10.5860 | 11.7963 | 0.5349 | 0.0100 | -1.6921 |
nitrosocke/Arcane-Diffusion | animation / anime | LMSDiscreteScheduler | 512 x 512 | 02/10/22 | 7.1673 | 10.9074 | 0.5004 | 0.0067 | -1.6887 |
cagliostrolab/animagine-xl-3.1 | animation / anime | EulerAncestralDiscreteScheduler | 1024 x 1024 | 13/03/24 | 8.9596 | 11.2143 | 0.5179 | 0.0000 | -1.6876 |
nuigurumi/basil_mix | art | PNDMScheduler | 512 x 512 | 04/01/23 | 7.0493 | 12.1800 | 0.5598 | 0.0000 | -1.6792 |
fluently/Fluently-XL-Final | foundation | EulerAncestralDiscreteScheduler | 1024 x 1024 | 06/06/24 | 6.6590 | 10.8226 | 0.5147 | 0.0000 | -1.6586 |
CompVis/stable-diffusion-v1-4 | foundation | PNDMScheduler | 512 x 512 | 20/08/22 | 10.7885 | 11.7258 | 0.5621 | 0.0000 | -1.6360 |
SPO-Diffusion-Models/SPO-SDXL_4k-p_10ep | foundation | EulerDiscreteScheduler | 1024 x 1024 | 07/06/24 | 6.2646 | 11.1667 | 0.5426 | 0.0000 | -1.6307 |
krnl/realisticVisionV51_v51VAE | photo realism | PNDMScheduler | 512 x 512 | 12/01/24 | 5.3375 | 11.1667 | 0.5510 | 0.0000 | -1.6122 |
openart-custom/AlbedoBase | foundation | EulerDiscreteScheduler | 1024 x 1024 | 13/09/24 | 5.5861 | 11.0161 | 0.5462 | 0.0000 | -1.6093 |
xyn-ai/anything-v4.0 | animation / anime | PNDMScheduler | 512 x 512 | 23/03/23 | 6.4597 | 11.6250 | 0.5692 | 0.0067 | -1.6044 |
lemon2431/toonify_v20 | animation / anime | PNDMScheduler | 512 x 512 | 16/10/23 | 4.1148 | 10.9412 | 0.5469 | 0.0033 | -1.5976 |
SG161222/RealVisXL_V4.0 | photo realism | EulerDiscreteScheduler | 1024 x 1024 | 13/02/24 | 8.6080 | 10.6250 | 0.5405 | 0.0000 | -1.5856 |
ckpt/anything-v4.5 | art | PNDMScheduler | 512 x 512 | 19/01/23 | 5.8262 | 11.5968 | 0.5784 | 0.0033 | -1.5837 |
emilianJR/chilloutmix_NiPrunedFp32Fix | photo realism | PNDMScheduler | 512 x 512 | 19/04/23 | 6.0849 | 10.7941 | 0.5498 | 0.0000 | -1.5810 |
Kernel/sd-nsfw | photo realism | PNDMScheduler | 512 x 512 | 15/07/23 | 5.5154 | 11.5333 | 0.5828 | 0.0000 | -1.5711 |
Lykon/AAM_XL_AnimeMix | animation / anime | EulerDiscreteScheduler | 1024 x 1024 | 19/01/24 | 6.8090 | 9.0152 | 0.4843 | 0.0000 | -1.5367 |
stablediffusionapi/realistic-stock-photo | photo realism | EulerDiscreteScheduler | 1024 x 1024 | 22/10/23 | 5.1367 | 10.2188 | 0.5451 | 0.0000 | -1.5366 |
GraydientPlatformAPI/comicbabes2 | art | PNDMScheduler | 512 x 512 | 07/01/24 | 4.0530 | 10.0313 | 0.5447 | 0.0033 | -1.5156 |
SG161222/Realistic_Vision_V6.0_B1_noVAE | photo realism | PNDMScheduler | 896 x 896 | 29/11/23 | 7.7733 | 11.1571 | 0.5957 | 0.0000 | -1.5066 |
scenario-labs/juggernaut_reborn | photo realism | DPMSolverMultistepScheduler | 512 x 512 | 29/05/24 | 4.5414 | 10.0294 | 0.5504 | 0.0000 | -1.5061 |
SG161222/RealVisXL_V5.0 | photo realism | DDIMScheduler | 1024 x 1024 | 05/08/24 | 6.6694 | 9.9412 | 0.5482 | 0.0000 | -1.5022 |
stable-diffusion-v1-5/stable-diffusion-v1-5 | foundation | PNDMScheduler | 512 x 512 | 07/09/24 | 9.2497 | 9.6429 | 0.5361 | 0.0000 | -1.4981 |
Corcelio/openvision | foundation | EulerDiscreteScheduler | 1024 x 1024 | 13/05/24 | 5.7229 | 9.2576 | 0.5161 | 0.0033 | -1.4963 |
gsdf/Counterfeit-V2.5 | animation / anime | DDIMScheduler | 512 x 512 | 02/02/23 | 8.8241 | 12.2059 | 0.6204 | 0.0433 | -1.4890 |
fluently/Fluently-XL-v4 | foundation | EulerAncestralDiscreteScheduler | 1024 x 1024 | 02/05/24 | 6.8337 | 9.2222 | 0.5205 | 0.0000 | -1.4866 |
Ojimi/anime-kawai-diffusion | animation / anime | DEISMultistepScheduler | 512 x 512 | 24/03/23 | 6.2192 | 11.1667 | 0.5968 | 0.0200 | -1.4843 |
tilake/China-Chic-illustration | art | PNDMScheduler | 512 x 512 | 15/01/23 | 5.7534 | 10.0294 | 0.5655 | 0.0000 | -1.4720 |
digiplay/FormCleansingMix_v1 | animation / anime | DPMSolverMultistepScheduler | 768 x 768 | 20/06/23 | 4.4630 | 10.8243 | 0.5936 | 0.0167 | -1.4647 |
Lykon/dreamshaper-7 | photo realism | DEISMultistepScheduler | 512 x 512 | 27/08/23 | 6.7903 | 10.0625 | 0.5704 | 0.0000 | -1.4638 |
gligen/diffusers-generation-text-box | photo realism | PNDMScheduler | 512 x 512 | 11/03/23 | 5.5617 | 10.0294 | 0.5719 | 0.0000 | -1.4570 |
stabilityai/stable-diffusion-xl-base-1.0 | photo realism | EulerDiscreteScheduler | 1024 x 1024 | 25/07/23 | 11.1360 | 8.6515 | 0.5076 | 0.0000 | -1.4492 |
pt-sk/stable-diffusion-1.5 | foundation | PNDMScheduler | 512 x 512 | 02/03/24 | 5.0628 | 9.5556 | 0.5584 | 0.0000 | -1.4397 |
playgroundai/playground-v2.5-1024px-aesthetic | art | EDMDPMSolverMultistepScheduler | 1024 x 1024 | 17/02/24 | 8.8552 | 9.5270 | 0.5649 | 0.0000 | -1.4219 |
danbrown/RevAnimated-v1-2-2 | animation / anime | DDIMScheduler | 256 x 256 | 01/05/23 | 5.0277 | 8.5000 | 0.5134 | 0.0000 | -1.4198 |
redstonehero/animesh_prunedv21 | animation / anime | PNDMScheduler | 512 x 512 | 17/08/23 | 4.0369 | 9.6622 | 0.5705 | 0.0033 | -1.4197 |
dreamlike-art/dreamlike-photoreal-2.0 | photo realism | DDIMScheduler | 768 x 768 | 04/01/23 | 8.6776 | 8.8415 | 0.5451 | 0.0033 | -1.3885 |
emilianJR/epiCRealism | photo realism | PNDMScheduler | 512 x 512 | 25/06/23 | 6.9505 | 8.8846 | 0.5499 | 0.0067 | -1.3793 |
naclbit/trinart_characters_19.2m_stable_diffusion_v1 | animation / anime | PNDMScheduler | 512 x 512 | 15/10/22 | 6.2754 | 12.3750 | 0.6536 | 0.0767 | -1.3757 |
segmind/tiny-sd | photo realism | DPMSolverMultistepScheduler | 512 x 512 | 28/07/23 | 5.7272 | 10.1757 | 0.6124 | 0.0000 | -1.3722 |
yodayo-ai/kivotos-xl-2.0 | animation / anime | EulerAncestralDiscreteScheduler | 1024 x 1024 | 02/06/24 | 6.5836 | 7.1410 | 0.4717 | 0.0000 | -1.3277 |
ZB-Tech/Text-to-Image | photo realism | PNDMScheduler | 512 x 512 | 10/03/24 | 6.4625 | 8.7162 | 0.5697 | 0.0000 | -1.3220 |
digiplay/aurorafantasy_v1 | animation / anime | EulerDiscreteScheduler | 768 x 768 | 06/04/24 | 5.2524 | 8.8864 | 0.5813 | 0.0133 | -1.3005 |
stablediffusionapi/mklan-xxx-nsfw-pony | photo realism | EulerDiscreteScheduler | 1024 x 1024 | 29/05/24 | 6.5259 | 7.4744 | 0.5100 | 0.0000 | -1.2981 |
stabilityai/sdxl-turbo | foundation | EulerAncestralDiscreteScheduler | 512 x 512 | 27/11/23 | 10.1726 | 8.0778 | 0.5493 | 0.0000 | -1.2922 |
Lykon/dreamshaper-xl-v2-turbo | foundation | EulerDiscreteScheduler | 1024 x 1024 | 08/02/24 | 6.6775 | 6.7381 | 0.4679 | 0.0000 | -1.2769 |
dataautogpt3/OpenDalleV1.1 | foundation | KDPM2AncestralDiscreteScheduler | 1024 x 1024 | 22/12/23 | 7.7311 | 7.6951 | 0.5351 | 0.0000 | -1.2747 |
Corcelio/mobius | foundation | EulerDiscreteScheduler | 1024 x 1024 | 13/05/24 | 6.0584 | 7.2273 | 0.5280 | 0.0000 | -1.2271 |
kandinsky-community/kandinsky-2-1 | art | DDIMScheduler | 512 x 512 | 24/05/23 | 6.5896 | 7.1735 | 0.5332 | 0.0000 | -1.2086 |
friedrichor/stable-diffusion-2-1-realistic | photo realism | DDIMScheduler | 512 x 512 | 04/06/23 | 4.5053 | 6.7857 | 0.5061 | 0.0033 | -1.2060 |
CompVis/stable-diffusion-v1-1 | foundation | PNDMScheduler | 512 x 512 | 19/08/22 | 6.5538 | 6.8864 | 0.5218 | 0.0200 | -1.1716 |
stablediffusionapi/realistic-vision-51 | photo realism | PNDMScheduler | 512 x 512 | 07/08/23 | 5.8350 | 6.9490 | 0.5456 | 0.0000 | -1.1498 |
UnfilteredAI/NSFW-gen-v2 | photo realism | EulerAncestralDiscreteScheduler | 1024 x 1024 | 15/04/24 | 6.8121 | 6.4111 | 0.5102 | 0.0000 | -1.1442 |
nitrosocke/Ghibli-Diffusion | animation / anime | PNDMScheduler | 704 x 512 | 18/11/22 | 7.6340 | 6.3478 | 0.5523 | 0.0000 | -1.0444 |
dataautogpt3/ProteusV0.4-Lightning | foundation | EulerDiscreteScheduler | 1024 x 1024 | 22/02/24 | 6.2324 | 4.6739 | 0.4055 | 0.0000 | -1.0220 |
SG161222/Realistic_Vision_V2.0 | photo realism | PNDMScheduler | 512 x 512 | 21/03/23 | 8.3000 | 6.1154 | 0.5480 | 0.0000 | -1.0167 |
sd-community/sdxl-flash | foundation | DPMSolverSinglestepScheduler | 1024 x 1024 | 19/05/24 | 7.2798 | 4.7826 | 0.4424 | 0.0000 | -0.9809 |
Mitsua/mitsua-diffusion-cc0 | art | PNDMScheduler | 512 x 512 | 22/12/22 | 5.1305 | 10.8947 | 0.7426 | 0.0900 | -0.9368 |
OnomaAIResearch/Illustrious-xl-early-release-v0 | animation / anime | EulerDiscreteScheduler | 1024 x 1024 | 20/09/24 | 7.3851 | 11.6000 | 0.7524 | 0.1700 | -0.8688 |
digiplay/ZHMix-Dramatic-v2.0 | animation / anime | EulerDiscreteScheduler | 768 x 768 | 03/12/23 | 4.7306 | 5.7800 | 0.6506 | 0.0333 | -0.6689 |
DGSpitzer/Cyberpunk-Anime-Diffusion | animation / anime | PNDMScheduler | 704 x 704 | 28/10/22 | 6.7796 | 2.9722 | 0.5845 | 0.0600 | -0.1492 |
Emanon14/NONAMEmix_v1 | animation / anime | EulerAncestralDiscreteScheduler | 1024 x 1024 | 23/11/24 | 5.1821 | 5.3400 | 0.7447 | 0.1700 | -0.1237 |
Onodofthenorth/SD_PixelArt_SpriteSheet_Generator | art | PNDMScheduler | 512 x 512 | 01/11/22 | 6.4996 | 6.5930 | 0.7898 | 0.3367 | 0.0841 |
Niggendar/duchaitenPonyXLNo_ponyNoScoreV40 | art | EulerDiscreteScheduler | 1024 x 1024 | 01/06/24 | 5.1912 | 3.1620 | 0.7458 | 0.1000 | 0.3237 |
lambdalabs/sd-pokemon-diffusers | animation / anime | PNDMScheduler | 512 x 512 | 16/09/22 | 6.3118 | 9.6509 | 0.8639 | 0.6033 | 0.6519 |
Raelina/Raehoshi-illust-XL-3 | animation / anime | EulerDiscreteScheduler | 1024 x 1024 | 11/12/24 | 3.7427 | 4.8772 | 0.8926 | 0.6700 | 1.7549 |
monadical-labs/minecraft-skin-generator-sdxl | animation / anime | EulerDiscreteScheduler | 768 x 768 | 19/02/24 | 5.3317 | 5.9786 | 0.9721 | 0.7933 | 3.3680 |
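For readers who wish to recompute per-task aggregates from the table above, the following minimal Python sketch parses the pipe-delimited rows and averages the final numeric column for each generative task. The column assignments here are our assumption: the rows are sorted in ascending order of their last field, which we take to be the overall log-based bias score, and the second field is the generative task. Only two sample rows are embedded for brevity; the remaining rows follow the same layout.

    # Minimal sketch, assuming the last field of each row is the overall
    # log-based bias score and the second field is the generative task.
    from collections import defaultdict

    rows = """\
    nitrosocke/mo-di-diffusion | animation / anime | PNDMScheduler | 512 x 512 | 28/10/22 | 7.4843 | 14.3500 | 0.5696 | 0.0033 | -1.8175 |
    monadical-labs/minecraft-skin-generator-sdxl | animation / anime | EulerDiscreteScheduler | 768 x 768 | 19/02/24 | 5.3317 | 5.9786 | 0.9721 | 0.7933 | 3.3680 |
    """.strip().splitlines()

    scores = defaultdict(list)
    for row in rows:
        # Split on the pipe delimiter; drop the empty field left by the
        # trailing "|" that closes each row.
        fields = [f.strip() for f in row.split("|") if f.strip()]
        task = fields[1]                      # e.g. "foundation", "art"
        scores[task].append(float(fields[-1]))

    for task, vals in sorted(scores.items()):
        print(f"{task}: mean bias score = {sum(vals) / len(vals):+.4f} (n={len(vals)})")

Extending the embedded string to all rows of the table yields per-task means directly comparable across the four task categories, with more positive values indicating stronger measured bias under this score.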