Exploring Bias in over 100 Text-to-Image
Generative Models
Abstract
We investigate bias trends in text-to-image generative models over time, focusing on the increasing availability of models through open platforms like Hugging Face. While these platforms democratize AI, they also facilitate the spread of inherently biased models, often shaped by task-specific fine-tuning. Ensuring ethical and transparent AI deployment requires robust evaluation frameworks and quantifiable bias metrics. To this end, we assess bias across three key dimensions: (i) distribution bias, (ii) generative hallucination, and (iii) generative miss-rate. Analyzing over 100 models, we reveal how bias patterns evolve over time and across generative tasks. Our findings indicate that artistic and style-transferred models exhibit significant bias, whereas foundation models, benefiting from broader training distributions, are becoming progressively less biased. By identifying these systemic trends, we contribute a large-scale evaluation corpus to inform bias research and mitigation strategies, fostering more responsible AI development.
1 Introduction
Text-to-image (T2I) generative models, while capable of high-fidelity image synthesis, inherently reflect the biases present in their training data (Garcia et al., 2023; Mehrabi et al., 2021; Zhang et al., 2023). The wide accessibility of training, fine-tuning and deployment resources has resulted in a plethora of T2I models being published by AI practitioners and hobbyists alike. While the biased nature of these models is widely debated, there is no concrete evidence of how the community is responding in terms of accounting for bias in T2I generative models, particularly in light of the volume of models that continue to be released. Hence, we conduct this crucial research.
The abundance of publicly available data and models democratizes AI development, but also underscores the need for responsible usage (Arrieta et al., 2020; Bakr et al., 2023; Teo et al., 2024) and comprehensive evaluation tools that characterize bias characteristics of these black box models (Bakr et al., 2023; Chinchure et al., 2024; D’Incà et al., 2024; Hu et al., 2023; Luo et al., 2024; Vice et al., 2023). The ability to develop unsafe, inappropriate or biased models presents a significant challenge and evaluating fundamental bias characteristics is a crucial step in the right direction.
Biased representations in generated images stem from factors such as class imbalances in training data, human labeling biases, and hyperparameter choices during model training and fine-tuning (Garcia et al., 2023; Mehrabi et al., 2021; Zhang et al., 2023). Theoretically, generative model biases are not confined to a single concept or direction. Analyzing a model’s overall bias provides a more comprehensive understanding of its learned representations and underlying manifold structure. For instance, when generating generic images of “animals,” a model may disproportionately favor certain species or environments. While social biases (e.g., those related to age, race, or gender) are particularly consequential in public-facing applications (Abid et al., 2021; Luccioni et al., 2023; Naik & Nushi, 2023; Seshadri et al., 2023), they are manifestations of broader model biases, observed from a specific viewpoint. Since biases extend beyond social domains, it is essential to first characterize the general bias properties of learned concepts to better understand their implications.
In this work, we perform an extensive analysis of publicly available T2I models to examine how bias characteristics have evolved over time and across different generative tasks. We construct a comprehensive evaluation framework that considers: (i) distribution bias, (ii) Jaccard hallucination, (iii) generative miss-rate, (iv) log-based bias scores, (v) model popularity, and (vi) metadata features such as the intended generative task and timestamp.
Repositories such as the HuggingFace Hub offer a vast array of fine-tuned models, including approximately 56,240 text-to-image (T2I) models (as of the time of writing this manuscript). This extensive collection enables our comprehensive evaluations. The field of conditional image generation has evolved significantly, from the widely-used Stable Diffusion architecture (Rombach et al., 2022) (spanning versions v1 to v3/XL) to the latest rectified-flow transformer (FLUX)-based models (BlackForestLabs, 2024). To capture this progression, we conduct extensive evaluations across more than 100 unique models, varying in artistic style, generative task, and release date.
To quantify bias along the distribution bias ‘$B_D$’, Jaccard hallucination ‘$H_J$’ and generative miss-rate ‘$M_G$’ dimensions, we utilize the open-source “Try Before You Bias” (TBYB) evaluation code (Vice et al., 2023), which aligns well with models hosted on HuggingFace. We introduce a log-based bias score ‘$\mathcal{B}$’ that integrates these metrics into a single, interpretable value, computable in black-box settings. This approach provides a unified framework for evaluating and comparing model biases.
Our evaluations offer valuable insights into the bias characteristics of various categories of generative models, revealing a trade-off between artistic style transfer and perceived bias. We also observe that modern foundation models and photo-realism models have benefited from larger datasets, improved architectures, and careful curation efforts, leading to a positive trend in bias mitigation over time. By analyzing model popularity, we further explore whether user engagement is influenced by bias. This study represents a significant step forward in understanding how the community responds to biases in T2I models, particularly in light of the rapid proliferation of diverse models.
Through this work we contribute:

1. an extensive evaluation of bias trends in generative text-to-image models over time, uncovering key observations across three dimensions: distribution bias, hallucination and generative miss-rate.

2. a singular, log-based bias evaluation score that advances existing methodologies. This score enables end-to-end bias assessments in black-box settings, eliminating the need for normalization relative to a corpus of evaluated models.

3. a categorization and analysis of bias characteristics across several classes of trained and fine-tuned text-to-image models, namely: foundation, photo realism, animation and art. Additionally, we provide a quantifiable measure of model popularity, offering insights into how bias may influence user engagement and adoption.
2 Background and Related Work
Generative Text-to-Image Models have gained significant attention among AI practitioners and the wider general public. These models, composed of tokenizers, text encoders, denoising networks, and schedulers, enable users to generate unique images from conditional prompts. The foundational de-noising process proposed by Sohl-Dickstein et al. (2015) inspired many of the underlying generative capabilities of modern T2I models. Subsequent advancements include denoising diffusion probabilistic models (DDPMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020), denoising diffusion implicit models (DDIMs) (Song et al., 2020a), and stochastic differential equation (SDE)-based approaches (Song et al., 2020b). Rectified flow-based de-noising paradigms have recently gained prominence, as seen in Stable Diffusion 3 (Esser et al., 2024), FLUX (BlackForestLabs, 2024) and PixArt-α/Σ (Chen et al., 2023; 2025).
These models often use a modified, conditional U-Net (Ronneberger et al., 2015) for latent denoising. Conditional generative models integrate a network to convert user inputs into guidance vectors, steering the denoising process to match input prompts. In T2I models, Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) and T5 encoders (Ni et al., 2021) are commonly used to map textual inputs into semantically rich embedding spaces. Larger models often combine multiple text encoders to enhance performance (Esser et al., 2024; BlackForestLabs, 2024).
By combining embedded denoising networks and text encoders, various T2I foundation models have been developed and released to the public. Notable examples include Stable Diffusion (v1 to v3.5/XL variants) (Rombach et al., 2022; Esser et al., 2024; Podell et al., 2023), DALL-E 2/3 (Ramesh et al., 2022; Betker et al., 2023), and Imagen (Saharia et al., 2022). Through cost-effective fine-tuning techniques like DreamBooth (Ruiz et al., 2023), Low-Rank Adaptation (LoRA) (Hu et al., 2021), and Textual Inversion (Gal et al., 2022), AI practitioners and hobbyists can create custom T2I models with tailored representations of learned concepts. However, these models are often shared on platforms like the HuggingFace Hub without sufficient acknowledgment of their potential biases, raising concerns about their responsible dissemination.
Bias and Ethical AI Evaluation Frameworks. Modern foundation models are trained on large, uncurated internet datasets, which often contain harmful, inaccurate, or biased representations that can manifest in generated outputs (Ferrara, 2023; Mehrabi et al., 2021). Unlike biased classification systems, bias in generative models is subtler and harder to detect due to their expansive input/output spaces and complex semantic relationships arising from massive training datasets. Without proper mitigation or quantification, these biases can lead to the proliferation of harmful stereotypes and misinformation. Compounded training and fine-tuning processes can thereby exacerbate or shift a model’s bias characteristics, raising ethical concerns, especially in front-facing applications. This underscores the critical need for bias quantification to address ethical AI considerations.
Several ethical AI evaluation frameworks have manifested as a result of these open research questions (Cho et al., 2023; Luccioni et al., 2023; Luo et al., 2024; Chinchure et al., 2024; Vice et al., 2023; Bakr et al., 2023; Hu et al., 2023; Teo et al., 2024; Gandikota et al., 2024; Huang et al., 2024; Schramowski et al., 2023; Seshadri et al., 2023; Naik & Nushi, 2023; D’Incà et al., 2024), addressing issues of fairness, bias, reliability and safety. While this work focuses primarily on biases, it is important to consider the synergy that exists across these four ethical AI dimensions. To conduct these evaluations, many works deploy auxiliary captioning or VLM/VQA models to facilitate the extraction of descriptive metrics.
The TIFA method introduced by Hu et al. (2023) defines a comprehensive list of quantifiable T2I statistics, leveraging a VQA model to provide extensive evaluation results on generated image and model characteristics. In a similar vein, the HRS benchmark proposed by Bakr et al. (2023) also considers a wide range of T2I model characteristics beyond the bias dimension, as it considers image quality and semantic alignment (scene composition). The StableBias (Luccioni et al., 2023) and DALL-Eval (Cho et al., 2023) methods have been proposed to assess reasoning skills and social biases (including gender/ethnicity) of text-to-image models, deploying captioning and VQA models for their analyses. Similarly, frameworks like FAIntbench (Luo et al., 2024), TIBET (Chinchure et al., 2024) and OpenBias (D’Incà et al., 2024) each consider the recognition of biases along several dimensions, proposing a wider definition of biases, all incorporating LLM and/or VQA models in their evaluation frameworks. FAIntbench considers four dimensions of bias, i.e., manifestation, visibility and acquired/protected attributes (Luo et al., 2024). In comparison, the TIBET framework identifies relevant, potential biases w.r.t. the input prompt (Chinchure et al., 2024). The ‘Try Before You Bias’ (TBYB) evaluation tool encompasses the evaluation methodology proposed by Vice et al. (2023), characterizing bias through: hallucination, distribution bias and generative miss-rate.
While evaluation frameworks are extensive, large-scale bias analysis of open-source, community-driven models remains limited. Existing efforts often focus on narrow subsets of models, leaving a critical need for a systematic, scalable approach. We bridge this gap with a comprehensive evaluation of over 100 models, utilizing the TBYB tool for its compatibility with the HuggingFace Hub.
3 Methodology
In this work, we conduct comprehensive bias evaluations of 103 unique T2I models released from August 2022 to December 2024. To identify general bias characteristics, we employ the general bias evaluation methodology defined in Vice et al. (2023) to generate images of 100 random objects (3 images/prompt = 300 images per evaluated model). This allows us to infer diverse, fundamental bias characteristics of each model.
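To make the evaluation protocol concrete, the following is a minimal sketch of the generation loop, assuming a diffusers-compatible model; the `OBJECTS` list and the prompt template are hypothetical stand-ins for the 100 random object prompts used by TBYB.

```python
import torch
from diffusers import DiffusionPipeline

# Hypothetical stand-in for the 100 random objects used in the protocol.
OBJECTS = ["apple", "bicycle", "lighthouse"]  # ... 100 objects in total
IMAGES_PER_PROMPT = 3

pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

generated = []
for obj in OBJECTS:
    prompt = f"a picture of a {obj}"  # assumed prompt template
    # 3 images per prompt -> 300 images per model for 100 objects.
    images = pipe(prompt, num_images_per_prompt=IMAGES_PER_PROMPT).images
    generated.extend((obj, img) for img in images)
```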
3.1 Evaluation Metrics
Data biases can propagate into T2I models, leading to skewed representations in their outputs. Furthermore, compounded training and fine-tuning of large foundation models can fundamentally alter their bias characteristics. Regardless of intent, the severity of these biases must be quantifiable and must capture the diverse ways in which bias can manifest. To address these requirements, we employ three metrics for quantifying bias, motivated by fundamental examples that illustrate their relevance and applicability in evaluating model behavior.
(i) When prompted with “a picture of an apple”, a text-to-image model may generate an apple hanging off a tree. While semantically logical, one could argue that generating the tree evidences a hallucinated object in the scene (by addition), as it was not explicitly requested in the prompt. Alternatively, the model may generate an apple tree with no apples, omitting the object in the prompt. To account for both cases, we compute Jaccard hallucination ‘$H_J$’, derived from the Intersection over Union (IoU).
(ii) A hypothetical Nation-X commissions the development of a generative model for producing tourism content with blended national-flag iconography. The distribution of generated content would reflect the intentional skew by showing peaks in the number of occurrences of concepts relating to Nation-X. Thus, we consider distribution bias ‘$B_D$’ as a quantifiable means of evaluating this phenomenon.
(iii) A T2I model has been fine-tuned on an intentionally-biased dataset that relabels images of ‘car’ as ‘person’. This results in an intentionally-biased and misaligned output space that would cause misclassification w.r.t. the label provided by the input prompt. This justifies the need for quantifying generative miss-rate ‘$M_G$’.
Covering the underlying motivations of the above examples, we use $B_D$, $H_J$ and $M_G$ to analyze model bias. We also combine them into a single, log-based bias evaluation score ‘$\mathcal{B}$’ to characterize the overall bias behavior, which is useful for independently ranking different models. We visualize our bias evaluation framework in Fig. 1.

Jaccard Hallucination - $H_J$. While usually discussed in the context of language models (Gunjal et al., 2023; Ji et al., 2023), hallucinations are a common side effect in many foundation models (Rawte et al., 2023). They have been proposed as a vehicle for image out-painting (Xiao et al., 2020) and generative model improvement (Li et al., 2022b; Xiao et al., 2020) tasks. When drawing representations of objects and classes from a learned distribution, it is logical that the semantically-rich manifolds may cause a model to also generate semantically-relevant objects as a result.
Here, $H_J$ considers two hallucination perspectives, i.e.: (i) the addition of unspecified objects in the output and (ii) the omission of objects specified in the input. For a set of $N$ output images $I_i$, generated from input prompts $p_i$,
$$H_J = \frac{1}{N}\sum_{i=1}^{N}\left(1 - \frac{|\mathcal{X}_i \cap \mathcal{Y}_i|}{|\mathcal{X}_i \cup \mathcal{Y}_i|}\right) \qquad (1)$$
where ‘$\mathcal{X}_i$’ defines the input objects extracted from prompt $p_i$ and ‘$\mathcal{Y}_i$’ defines the objects detected in the output image $I_i$, extracted from a generated caption. A lower $H_J$ indicates a smaller discrepancy between the input and output objects/concepts and thus, demonstrates less hallucinatory (biased) behavior.
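A minimal sketch of Eq. (1), assuming the prompt and caption have already been reduced to sets of object tokens:

```python
def jaccard_hallucination(input_objects, output_objects):
    """H_J for one image: 1 - IoU between prompted and detected objects."""
    x, y = set(input_objects), set(output_objects)
    if not x and not y:
        return 0.0  # nothing prompted, nothing detected
    return 1.0 - len(x & y) / len(x | y)

def mean_jaccard_hallucination(pairs):
    """Average H_J over all N (input, output) object-set pairs (Eq. 1)."""
    return sum(jaccard_hallucination(x, y) for x, y in pairs) / len(pairs)

# The "apple" example above: an apple is prompted; an apple and a tree
# are detected in the output, so the tree is hallucinated by addition.
assert abs(jaccard_hallucination({"apple"}, {"apple", "tree"}) - 0.5) < 1e-9
```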
Distribution Bias - $B_D$ is derived from the area under the curve (AuC) of detected objects, capturing the frequency of objects/concepts that appear in generated images (that were not specified in the prompt) (Vice et al., 2023). After generating images and filtering objects, an object token dictionary ‘$\mathcal{T}$’ is constructed, containing concept (word) ‘$w_j$’ and number-of-occurrences ‘$c_j$’ pairs. The distribution bias can be calculated through the AuC, after sorting (high to low) and applying min-max normalization:
$$\hat{c}_j = \frac{c_j - \min(c)}{\max(c) - \min(c)}, \quad j = 1, \ldots, |\mathcal{T}| \qquad (2)$$

$$B_D = \mathrm{AuC}(\hat{c}) \approx \sum_{j=1}^{|\mathcal{T}|-1} \frac{\hat{c}_j + \hat{c}_{j+1}}{2} \qquad (3)$$
Peaks in the generated object distribution may indicate that significant attention is being applied along a specific bias direction; thus, $B_D$ represents another avenue in which bias can manifest itself.
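A minimal sketch of Eqs. (2)-(3); the trapezoidal estimator for the AuC is an assumption here:

```python
def distribution_bias(counts):
    """B_D: AuC of the sorted, min-max-normalized occurrence curve (Eqs. 2-3).

    `counts` maps each detected concept w_j to its occurrence count c_j.
    """
    c = sorted(counts.values(), reverse=True)  # sort high to low
    lo, hi = min(c), max(c)
    norm = [(v - lo) / (hi - lo) for v in c] if hi > lo else [1.0] * len(c)
    # Trapezoidal area under the normalized curve.
    return sum((norm[j] + norm[j + 1]) / 2 for j in range(len(norm) - 1))

# A flat distribution (fairer) yields a larger AuC than a peaked one.
print(distribution_bias({"tree": 40, "table": 39, "grass": 38, "sky": 37}))  # 1.5
print(distribution_bias({"tree": 90, "table": 3, "grass": 2, "sky": 1}))     # ~0.53
```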
Generative Miss Rate - $M_G$. Bias can affect model performance, particularly if it shifts the output representations in a way that causes significant misalignment (Vice et al., 2023). As visualized in Fig. 1, a separate vision transformer (ViT) is deployed to classify generated images and determine $M_G$. Generally, model alignment should be high and thus, the miss-rate should demonstrate a low variance across models. A significantly high $M_G$ may indicate that a model’s learned biases are shifting output representations away from the expected output (as governed by the prompt). For models trained to complete specific tasks (like generating a particular art style), we may find that the miss rate is much higher, potentially by design.
Given a prompt (classifier target label) ‘$y_i$’ and generated image $I_i$, the deployed ViT outputs a prediction ‘$\hat{y}_i$’, measuring the alignment of the image $I_i$ to the label $y_i$. For $N$ generated images,
$$M_G = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i \neq y_i\right] \qquad (4)$$
where $y_i$ represents the target class. If the classifier fails to detect the generated image as a valid representation of $y_i$, then $M_G$ increases. A higher $M_G$ indicates a greater misalignment with input prompts, which may be (a) a symptom of a biased output space and/or (b) the result of a task that causes significant changes in output representations. We visualize how $H_J$ and $M_G$ manifest in the output representations of these models in Fig. 2.
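A hedged sketch of the miss-rate computation; the specific ViT checkpoint and the label-matching rule are assumptions here, not necessarily those used by TBYB:

```python
from transformers import pipeline

# Any ImageNet-style ViT classifier works as a stand-in here.
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")

def generative_miss_rate(pairs):
    """M_G: fraction of images whose predictions miss the target label (Eq. 4).

    `pairs` is a list of (target_label, PIL.Image) tuples.
    """
    misses = 0
    for label, image in pairs:
        preds = classifier(image, top_k=5)
        # Loose match: a hit if the target appears in any top-5 label.
        if not any(label.lower() in p["label"].lower() for p in preds):
            misses += 1
    return misses / len(pairs)
```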
The Try Before You Bias (TBYB) Tool is a publicly available, practical software implementation of the three-dimensional bias evaluation framework discussed above. The TBYB interface allows users to evaluate T2I models hosted on the HuggingFace Hub in a black-box evaluation set-up, provided repositories contain a model_index.json file. The BLIP (Li et al., 2022a) model is deployed for image captioning. Synonym detection functions in the NLTK (Bird & Loper, 2009) package are deployed to mitigate natural language discrepancies between the input prompt and the generated caption.
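The sketch below approximates this captioning-and-matching pipeline; the non-noun filtering and the similarity threshold for near-synonym matching (see Section 3.2) are assumed values:

```python
import nltk
from nltk.corpus import wordnet as wn
from transformers import pipeline

nltk.download("wordnet", quiet=True)
# BLIP captioner for describing generated images.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def caption_objects(image):
    """Extract candidate object tokens from a generated caption."""
    caption = captioner(image)[0]["generated_text"]
    return set(caption.lower().split())  # TBYB additionally filters non-nouns

def are_similar(word_a, word_b, threshold=0.8):
    """Similarity-score matching for near-synonyms, e.g. 'sneakers'/'shoes'.

    WordNet path_similarity lies in (0, 1]; the 0.8 threshold is assumed.
    """
    scores = [
        sa.path_similarity(sb)
        for sa in wn.synsets(word_a)
        for sb in wn.synsets(word_b)
    ]
    return max((s for s in scores if s is not None), default=0.0) >= threshold
```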

3.2 Systematic Bias Evaluation Strategy
Based on the generated outputs and model metadata, we identify each model’s type as one of {foundation, photo realism, art (non-anime), animation/anime}. We define foundation models as those designed for general purposes, encompassing a wider range of tasks. Photo-realism models are those fine-tuned for higher-fidelity, photo-realistic generation tasks. Art-based models are those designed for style-transfer tasks in which non-anime artistic styles are the target. Animation/anime-tuned models are designed to replicate anime-inspired art styles, a common application of models hosted on HuggingFace.
For time-based evaluations, we construct a timeline spanning from August 2022 to December 2024, analyzing trends across various model types. We then extrapolate these trends to understand how different categories of models are evolving. As part of this analysis, we investigate whether larger, more sophisticated foundation models, such as Stable Diffusion 3/XL, have achieved better alignment, reduced hallucinations, and fairer distributions of generated objects. Additionally, we provide a detailed analysis of each model type and explore the relationship between model popularity and bias statistics. Finally, we conduct bias evaluations across different noise schedulers to identify potential bias behaviors associated with their deployment.
In this work, we improve on the similarity detection function in (Vice et al., 2023) by incorporating a similarity-score-based approach to handle similar concepts, e.g., ‘sneakers’ vs. ‘shoes’. Additionally, we omit commonly occurring primary (red, blue, yellow), secondary (green, orange, purple) and neutral colors (black, white, brown, grey) from generated captions, as our analyses found that color descriptions are not a reliable symptom of hallucination and can adversely skew results in many cases. Furthermore, we propose combining the three metrics into a single bias score, using a log scale to account for varied metric ranges, such that:
$$\mathcal{B} = \ln\!\left(\frac{1}{B_D\,(1 - H_J)\,(1 - M_G)}\right) = -\ln(B_D) - \ln(1 - H_J) - \ln(1 - M_G) \qquad (5)$$
where a proportional relationship exists between observed model bias and $\mathcal{B}$. This allows for the calculation of biases for a single model, in a black-box setup, without relying on normalized relationships to a set of evaluated models as initially proposed in (Vice et al., 2023).
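As a consistency check, Eq. (5) can be evaluated directly against the rows of Table 1; a minimal sketch:

```python
import math

def bias_score(b_d, h_j, m_g):
    """Combined log-scaled bias score (Eq. 5); larger values = more biased."""
    return -math.log(b_d) - math.log(1.0 - h_j) - math.log(1.0 - m_g)

# Reproduces (up to rounding) the CompVis/stable-diffusion-v1-4 row of
# Table 1: B_D = 11.7258, H_J = 0.5621, M_G = 0 -> -1.6360.
print(round(bias_score(11.7258, 0.5621, 0.0), 4))
```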
Model Popularity. As part of our analysis, we aim to analyze the relationship (if any) between model popularity and bias. To quantify model popularity, we designed a quantifiable score ‘$P$’, leveraging reported engagement information on the HuggingFace Hub, i.e., the number of likes (historical) ‘$n_L$’ and the number of downloads ‘$n_D$’ in the last month (recent engagement). Given that the number of likes is generally far lower than the number of downloads, we apply logarithmic scaling and proportional scaling factors ‘$\alpha$’ and ‘$\beta$’ to account for the importance of continued engagement ($n_D$) and mitigate spikes in $P$ associated with recency bias. Thus, we define:
$$P = \alpha\,\ln(n_L) + \beta\,\ln(n_D) \qquad (6)$$
where we deploy $\alpha = 0.6$ and $\beta = 0.4$ in our experiments to account slightly more for historical influence while managing recency bias, such that $\alpha + \beta = 1$.
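A sketch of Eq. (6) using live engagement statistics from the HuggingFace Hub API; note that the `downloads` field covers roughly the last 30 days, matching the recent-engagement term above, and the guard against zero counts is an added assumption:

```python
import math
from huggingface_hub import HfApi

def popularity(model_id, alpha=0.6, beta=0.4):
    """Popularity score P (Eq. 6) from HuggingFace Hub engagement data."""
    info = HfApi().model_info(model_id)
    n_likes = max(info.likes or 0, 1)          # historical engagement, n_L
    n_downloads = max(info.downloads or 0, 1)  # ~last-month downloads, n_D
    return alpha * math.log(n_likes) + beta * math.log(n_downloads)

print(popularity("CompVis/stable-diffusion-v1-4"))
```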
4 Results and Discussion
Our appraisal of the general bias characteristics of text-to-image models allows us to conduct a suite of evaluation studies to explore and formalize relationships between observed biases and model characteristics. Temporal-, categorical- and popularity-based analyses allow us to identify how bias characteristics: (i) have evolved over time, (ii) change with respect to different generative tasks or embedded de-noising schedulers and, (iii) impact how users engage with these models.
High-level Observations of General Bias Characteristics. We report a truncated list of evaluation results in Table 1, highlighting models that exhibit high, low and median bias behavior. Along with these, we also report results for highly popular foundation models like the various Stable Diffusion versions. Analyzing Table 1 and Figs. 2, 3, we observe that photo-realism and foundation models tend to generate relatively unbiased representations, which is expected given that these models are designed for general user inference tasks and, in the case of photo-realism models, improvements in generative fidelity. In comparison, at the bottom of Table 1, we observe that many animation and art-tuned models report relatively more biased behavior, resulting from their task-oriented fine-tuning. Observing the outputs of these models, we found that the tendency to focus on generating specific characters or art styles irrespective of the prompt resulted in high levels of hallucination and misalignment (see Figs. 2, 3).
Model | Task Category | Denoiser | Resolution | Release (dd/mm/yy) | $P$ | $B_D$ | $H_J$ | $M_G$ | $\mathcal{B}$
---|---|---|---|---|---|---|---|---|---|
Envvi/Inkpunk-Diffusion | art | PNDMScheduler | 512 x 512 | 25/11/22 | 7.2323 | 18.9000 | 0.5346 | 0.0033 | -2.1711 |
Yntec/beLIEve | photo realism | DPMSolverMultistepScheduler | 768 x 768 | 01/08/24 | 5.2547 | 17.6176 | 0.5083 | 0.0000 | -2.1589 |
segmind/SSD-1B | photo realism | EulerDiscreteScheduler | 1024 x 1024 | 19/10/23 | 6.7116 | 15.7000 | 0.4747 | 0.0000 | -2.1098 |
RunDiffusion/Juggernaut-X-v10 | foundation | EulerDiscreteScheduler | 1024 x 1024 | 20/04/24 | 6.8125 | 16.3571 | 0.4992 | 0.0000 | -2.1031 |
prompthero/openjourney-v4 | photo realism | PNDMScheduler | 512 x 512 | 12/12/22 | 8.4414 | 15.9211 | 0.4881 | 0.0000 | -2.0981 |
Lykon/dreamshaper-8 | photo realism | DEISMultistepScheduler | 512 x 512 | 27/08/23 | 6.8769 | 17.3947 | 0.5467 | 0.0000 | -2.0649 |
RunDiffusion/Juggernaut-XL-v9 | foundation | DDPMScheduler | 1024 x 1024 | 19/02/24 | 8.6025 | 14.4048 | 0.4847 | 0.0000 | -2.0046 |
stabilityai/sd-turbo | foundation | EulerDiscreteScheduler | 512 x 512 | 28/11/23 | 7.9498 | 14.5476 | 0.4930 | 0.0000 | -1.9982 |
eienmojiki/Anything-XL | animation / anime | EulerAncestralDiscreteScheduler | 1024 x 1024 | 11/03/24 | 5.6000 | 14.8333 | 0.5287 | 0.0000 | -1.9446 |
stabilityai/stable-diffusion-3.5-medium | foundation | FlowMatchEulerDiscreteScheduler | 1024 x 1024 | 29/10/24 | 8.2481 | 14.0455 | 0.5049 | 0.0000 | -1.9393 |
MirageML/dreambooth-nike | photo realism | PNDMScheduler | 512 x 512 | 01/11/22 | 3.3402 | 14.4048 | 0.5206 | 0.0000 | -1.9323 |
dataautogpt3/ProteusV0.3 | foundation | EulerDiscreteScheduler | 1024 x 1024 | 13/02/24 | 7.3949 | 14.5833 | 0.5324 | 0.0000 | -1.9196 |
stablediffusionapi/juggernaut-reborn | foundation | PNDMScheduler | 512 x 512 | 21/01/24 | 5.0168 | 13.9783 | 0.5073 | 0.0167 | -1.9129 |
: | : | : | : | : | : | : | : | : | : |
stabilityai/stable-diffusion-3.5-large | foundation | FlowMatchEulerDiscreteScheduler | 1024 x 1024 | 22/10/24 | 9.2260 | 11.5769 | 0.4939 | 0.0000 | -1.7680 |
: | : | : | : | : | : | : | : | : | : |
stabilityai/stable-diffusion-2-1 | foundation | DDIMScheduler | 512 x 512 | 07/12/22 | 10.5860 | 11.7963 | 0.5349 | 0.0100 | -1.6921 |
: | : | : | : | : | : | : | : | : | : |
CompVis/stable-diffusion-v1-4 | foundation | PNDMScheduler | 512 x 512 | 20/08/22 | 10.7885 | 11.7258 | 0.5621 | 0.0000 | -1.6360 |
: | : | : | : | : | : | : | : | : | : |
lemon2431/toonify_v20 | animation / anime | PNDMScheduler | 512 x 512 | 16/10/23 | 4.1148 | 10.9412 | 0.5469 | 0.0033 | -1.5976 |
SG161222/RealVisXL_V4.0 | photo realism | EulerDiscreteScheduler | 1024 x 1024 | 13/02/24 | 8.6080 | 10.6250 | 0.5405 | 0.0000 | -1.5856 |
ckpt/anything-v4.5 | art | PNDMScheduler | 512 x 512 | 19/01/23 | 5.8262 | 11.5968 | 0.5784 | 0.0033 | -1.5837 |
emilianJR/chilloutmix_NiPrunedFp32Fix | photo realism | PNDMScheduler | 512 x 512 | 19/04/23 | 6.0849 | 10.7941 | 0.5498 | 0.0000 | -1.5810 |
Kernel/sd-nsfw | photo realism | PNDMScheduler | 512 x 512 | 15/07/23 | 5.5154 | 11.5333 | 0.5828 | 0.0000 | -1.5711 |
Lykon/AAM_XL_AnimeMix | animation / anime | EulerDiscreteScheduler | 1024 x 1024 | 19/01/24 | 6.8090 | 9.0152 | 0.4843 | 0.0000 | -1.5367 |
stablediffusionapi/realistic-stock-photo | photo realism | EulerDiscreteScheduler | 1024 x 1024 | 22/10/23 | 5.1367 | 10.2188 | 0.5451 | 0.0000 | -1.5366 |
GraydientPlatformAPI/comicbabes2 | art | PNDMScheduler | 512 x 512 | 07/01/24 | 4.0530 | 10.0313 | 0.5447 | 0.0033 | -1.5156 |
SG161222/Realistic_Vision_V6.0_B1_noVAE | photo realism | PNDMScheduler | 896 x 896 | 29/11/23 | 7.7733 | 11.1571 | 0.5957 | 0.0000 | -1.5066 |
scenario-labs/juggernaut_reborn | photo realism | DPMSolverMultistepScheduler | 512 x 512 | 29/05/24 | 4.5414 | 10.0294 | 0.5504 | 0.0000 | -1.5061 |
: | : | : | : | : | : | : | : | : | : |
stabilityai/stable-diffusion-xl-base-1.0 | photo realism | EulerDiscreteScheduler | 1024 x 1024 | 25/07/23 | 11.1360 | 8.6515 | 0.5076 | 0.0000 | -1.4492 |
: | : | : | : | : | : | : | : | : | : |
stabilityai/sdxl-turbo | foundation | EulerAncestralDiscreteScheduler | 512 x 512 | 27/11/23 | 10.1726 | 8.0778 | 0.5493 | 0.0000 | -1.2922 |
: | : | : | : | : | : | : | : | : | : |
dataautogpt3/ProteusV0.4-Lightning | foundation | EulerDiscreteScheduler | 1024 x 1024 | 22/02/24 | 6.2324 | 4.6739 | 0.4055 | 0.0000 | -1.0220 |
SG161222/Realistic_Vision_V2.0 | photo realism | PNDMScheduler | 512 x 512 | 21/03/23 | 8.3000 | 6.1154 | 0.5480 | 0.0000 | -1.0167 |
sd-community/sdxl-flash | foundation | DPMSolverSinglestepScheduler | 1024 x 1024 | 19/05/24 | 7.2798 | 4.7826 | 0.4424 | 0.0000 | -0.9809 |
Mitsua/mitsua-diffusion-cc0 | art | PNDMScheduler | 512 x 512 | 22/12/22 | 5.1305 | 10.8947 | 0.7426 | 0.0900 | -0.9368 |
OnomaAIResearch/Illustrious-xl-early-release-v0 | animation / anime | EulerDiscreteScheduler | 1024 x 1024 | 20/09/24 | 7.3851 | 11.6000 | 0.7524 | 0.1700 | -0.8688 |
digiplay/ZHMix-Dramatic-v2.0 | animation / anime | EulerDiscreteScheduler | 768 x 768 | 03/12/23 | 4.7306 | 5.7800 | 0.6506 | 0.0333 | -0.6689 |
DGSpitzer/Cyberpunk-Anime-Diffusion | animation / anime | PNDMScheduler | 704 x 704 | 28/10/22 | 6.7796 | 2.9722 | 0.5845 | 0.0600 | -0.1492 |
Emanon14/NONAMEmix_v1 | animation / anime | EulerAncestralDiscreteScheduler | 1024 x 1024 | 23/11/24 | 5.1821 | 5.3400 | 0.7447 | 0.1700 | -0.1237 |
Onodofthenorth/SD_PixelArt_SpriteSheet_Generator | art | PNDMScheduler | 512 x 512 | 01/11/22 | 6.4996 | 6.5930 | 0.7898 | 0.3367 | 0.0841 |
Niggendar/duchaitenPonyXLNo_ponyNoScoreV40 | art | EulerDiscreteScheduler | 1024 x 1024 | 01/06/24 | 5.1912 | 3.1620 | 0.7458 | 0.1000 | 0.3237 |
lambdalabs/sd-pokemon-diffusers | animation / anime | PNDMScheduler | 512 x 512 | 16/09/22 | 6.3118 | 9.6509 | 0.8639 | 0.6033 | 0.6519 |
Raelina/Raehoshi-illust-XL-3 | animation / anime | EulerDiscreteScheduler | 1024 x 1024 | 11/12/24 | 3.7427 | 4.8772 | 0.8926 | 0.6700 | 1.7549 |
monadical-labs/minecraft-skin-generator-sdxl | animation / anime | EulerDiscreteScheduler | 768 x 768 | 19/02/24 | 5.3317 | 5.9786 | 0.9721 | 0.7933 | 3.3680 |

Figure 2 presents a qualitative overview of bias manifestations in model outputs, using examples from Table 1 to contrast biased and unbiased behaviors. These results align with the quantitative metrics: a higher average $M_G$ will generally demonstrate a greater semantic misalignment (e.g., lambdalabs/sd-pokemon-diffusers). Low-$B_D$ models show constrained diversity or representational bias. Changes in $H_J$ are straightforward, reflecting disparities between input and generated objects.
The varying scales of the three metrics necessitate a logarithmic scale for comparing overall model bias. Each metric uniquely characterizes bias. Table 1 shows that low-bias models (lower $\mathcal{B}$) typically report a high $B_D$, indicating a fairer distribution of generated objects. In contrast, highly biased models (higher $\mathcal{B}$) report a low $B_D$, which suggests outliers or peaks in the output distribution. For $H_J$, T2I models inherently hallucinate due to their semantically rich latent spaces. The average $H_J \approx 0.55$ implies a roughly 45% IoU between prompted and generated objects. Foundation and photo-realism models cluster near the mean, whereas highly biased models exhibit extreme values, with a maximum $H_J = 0.9721$ meaning just 2.79% correlation between the input and output for the most biased model. $M_G$ remains low across most models, with a mean ($\approx 0.03$) near the minimum ($0.0$), indicating valid outputs 97% of the time despite hallucinations. Models with a high $M_G$ exhibit misaligned behavior, which, depending on model design, may be intentional.
Evolution of Biases over Time. The release of the seminal latent diffusion work (Rombach et al., 2022), culminating in the public availability of the popular Stable Diffusion architecture on August 22, 2022, marked a pivotal moment for text-to-image generative models. Its launch on the HuggingFace Hub and subsequent community engagement spurred significant advancements in foundation models and task-specific variants. Accordingly, we use August 2022 as the starting point for our time-based analyses, with the latest evaluated model released in December 2024.
Our evaluation spans 103 models over 28 months, presenting time-based bias analyses by individual metrics (Fig. 3) and model categories (Fig. 4). The timeline ($x$-axis) is consistent across sub-figures, with models grouped by task categories to examine trends. Bias trends, such as the steep increase in art and animation models’ bias over time (Fig. 4(e)), highlight the impact of hobbyists and practitioners embedding stylistic preferences or specific characters into these models. These intentional biases are reflected in their outputs, as supported by observations in Fig. 3.
In comparison, models associated with general tasks, i.e., those belonging to the foundation and photo-realism categories, have maintained consistent, if not lower, bias characteristics over time (see Figs. 3 and 4(e)). The increase in training data sizes and conscious improvements made to human labeling and captioning in general have resulted in wider and denser manifolds with a greater diversity of concept representations. Significantly, comparing the Stable Diffusion v1.4/2.1/3.5 rows of Table 1, we can see that hallucination and distribution bias scores improve with each major version upgrade over time.

Model Type | $P$ | $B_D$ | $H_J$ | $M_G$ | $\mathcal{B}$
---|---|---|---|---|---|
Foundation | 7.2033 (1.670) | 10.7919 (3.195) | 0.5175 (0.038) | 0.0019 (0.005) | -1.5960 (0.308) |
Photo Realism | 6.4354 (1.612) | 11.1097 (2.979) | 0.5392 (0.030) | 0.0004 (0.001) | -1.5949 (0.292) |
Art | 6.2180 (1.341) | 10.0088 (4.135) | 0.6159 (0.101) | 0.0537 (0.107) | -1.1581 (0.786) |
Animation/Anime | 6.0786 (1.367) | 10.4494 (3.147) | 0.5969 (0.120) | 0.0830 (0.201) | -1.1503 (1.134) |
On the Influence of Model Type and Popularity. We conducted an evaluation of biases w.r.t. model categories and their popularity, exploiting Eq. (6) to quantify the latter. We report these findings in Table 2 and observe that foundation and photo-realism models are, on average, the most popular among users. Interestingly, these models also tend to have less biased output representations when we consider the quantitative findings. Additionally, by analyzing the standard deviations in Table 2, we see that foundation and photo-realism model performances are typically more consistent than their art/animation counterparts.
De-noising Scheduler-Dependent Bias Evaluation. Much of the conditional latent diffusion process is predicated on the deployed de-noising scheduler. While similarities exist across different scheduler families and the task remains the same, i.e., using a conditional vector to guide latent de-noising steps to generate an aligned image representation of the input prompt, the mathematical foundations of each scheduler are unique. We report the descriptive statistics of different schedulers in Table 3, highlighting eight scheduler categories. The FlowMatchEulerDiscrete scheduler is deployed in Stable Diffusion 3 variants, which accounts for its high popularity and low $\mathcal{B}$ score. Recently, flow-based de-noising schedulers have gained increased attention in state-of-the-art T2I models like Stable Diffusion 3 and FLUX (Esser et al., 2024; BlackForestLabs, 2024).
In comparison, the EulerDiscrete scheduler (Karras et al., 2022) reports the largest bias and the highest average miss-rate. Incremental improvements in scheduler architectures since the release of EulerDiscrete, along with modern T2I models opting for newer schedulers, are logical reasons why this scheduler reports significantly higher bias scores. Similarly, the EulerAncestralDiscrete scheduler, which contributes “ancestral sampling”, performs consistently with its predecessor. These seminal works have inspired improvements which, as shown through the FlowMatchEulerDiscrete scheduler, have resulted in significant performance gains.
We note that while using quantifiable metrics like those reported here present a step in the right direction, any definitive correlations will require a deeper analysis into the schedulers themselves.
Scheduler | $P$ | $B_D$ | $H_J$ | $M_G$ | $\mathcal{B}$
---|---|---|---|---|---|
FlowMatchEulerDiscrete | 7.5101 (2.181) | 12.2788 (1.541) | 0.4959 (0.008) | 0.0000 (0.000) | -1.8177 (0.106) |
KDPM2AncestralDiscrete | 7.4314 (0.424) | 9.5803 (2.667) | 0.5279 (0.010) | 0.0000 (0.000) | -1.4894 (0.304) |
DDIM/DDPM | 7.3371 (1.973) | 10.1441 (2.548) | 0.5321 (0.039) | 0.0067 (0.014) | -1.5185 (0.261) |
EulerAncestralDiscrete | 7.1003 (1.667) | 9.1328 (3.083) | 0.5447 (0.084) | 0.0213 (0.060) | -1.3332 (0.552) |
DEISMultistep | 6.6288 (0.357) | 12.8746 (3.953) | 0.5713 (0.025) | 0.0067 (0.012) | -1.6710 (0.341) |
PNDM | 6.3172 (1.584) | 11.1714 (3.002) | 0.5694 (0.072) | 0.0302 (0.106) | -1.4658 (0.547) |
EulerDiscrete | 6.2381 (1.392) | 10.3751 (3.573) | 0.5702 (0.124) | 0.0621 (0.190) | -1.2184 (1.175) |
DPMSolverMultistep | 4.9966 (0.603) | 12.1618 (3.654) | 0.5662 (0.046) | 0.0042 (0.008) | -1.6255 (0.360) |
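To attribute metric differences to the scheduler rather than the backbone, the same model can be re-evaluated with only its scheduler swapped, as sketched below with diffusers (the model choice here is illustrative):

```python
from diffusers import (
    DiffusionPipeline,
    DPMSolverMultistepScheduler,
    EulerAncestralDiscreteScheduler,
    EulerDiscreteScheduler,
)

pipe = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5")

for scheduler_cls in (EulerDiscreteScheduler,
                      EulerAncestralDiscreteScheduler,
                      DPMSolverMultistepScheduler):
    # Swap only the scheduler; weights and text encoders remain fixed.
    pipe.scheduler = scheduler_cls.from_config(pipe.scheduler.config)
    # ... regenerate the 300 evaluation images and recompute B_D, H_J, M_G ...
```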
5 Conclusion
We have conducted an extensive evaluation of text-to-image models, utilizing the open HuggingFace Hub to facilitate our analyses of the bias characteristics of 103 unique models. To improve on existing evaluation methodologies, we combine three independent metrics, i.e., (i) distribution bias, (ii) Jaccard hallucination and (iii) generative miss-rate, into a single log-scaled metric. By accounting for various generative model categories and quantifying public engagement, we have presented a comprehensive set of model evaluations. Identifying the fundamental bias characteristics of large, publicly-available text-to-image models is a critical task in a democratized AI environment, considering that the exposure of these models to wider audiences continues to grow over time. The answer to the question “are models more biased now than they were 3 years ago?” really depends on the task. We see that iterative releases of Stable Diffusion models, for example, have resulted in marginal improvements in bias characteristics over time (from SD 1.1 to 3.5). Foundation and photo-realism models have demonstrated significant reductions in hallucination and increases in alignment, which is beneficial for improving reliability for a wider range of audiences. Style-transferred, art and animation models, by contrast, have demonstrated increased bias characteristics - a byproduct of intentionally designing models to achieve specific tasks. We hope this work inspires further research in the field and greater exposure for bias evaluation efforts.
Acknowledgments
This research and Dr. Jordan Vice are supported by the NISDRG project #20100007, funded by the Australian Government. Dr. Naveed Akhtar is a recipient of the ARC Discovery Early Career Researcher Award (project #DE230101058), funded by the Australian Government. Professor Ajmal Mian is the recipient of an ARC Future Fellowship Award (project #FT210100268) funded by the Australian Government.
References
- Abid et al. (2021) Abubakar Abid, Maheen Farooqi, and James Zou. Persistent anti-muslim bias in large language models. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’21, pp. 298–306, 2021. ISBN 9781450384735. doi: 10.1145/3461702.3462624. URL https://doi.org/10.1145/3461702.3462624.
- Arrieta et al. (2020) Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador García, Sergio Gil-López, Daniel Molina, Richard Benjamins, et al. Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai. Information fusion, 58:82–115, 2020.
- Bakr et al. (2023) Eslam Mohamed Bakr, Pengzhan Sun, Xiaoqian Shen, Faizan Farooq Khan, Li Erran Li, and Mohamed Elhoseiny. Hrs-bench: Holistic, reliable and scalable benchmark for text-to-image models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 20041–20053, October 2023.
- Betker et al. (2023) James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science., 2(3):8, 2023.
- Bird & Loper (2009) Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python. O’Reilly Media Inc. https://github.com/nltk/nltk, 2009.
- BlackForestLabs (2024) BlackForestLabs. Flux.1. https://huggingface.co/black-forest-labs/FLUX.1-schnell, 2024.
- Chen et al. (2023) Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023. URL https://arxiv.org/abs/2310.00426.
- Chen et al. (2025) Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-Σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. In European Conference on Computer Vision, pp. 74–91. Springer, 2025.
- Chinchure et al. (2024) Aditya Chinchure, Pushkar Shukla, Gaurav Bhatt, Kiri Salij, Kartik Hosanagar, Leonid Sigal, and Matthew Turk. Tibet: Identifying and evaluating biases in text-to-image generative models. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol (eds.), Computer Vision – ECCV 2024, pp. 429–446, Cham, 2024. Springer Nature Switzerland. ISBN 978-3-031-72986-7.
- Cho et al. (2023) Jaemin Cho, Abhay Zala, and Mohit Bansal. Dall-eval: Probing the reasoning skills and social biases of text-to-image generation models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3043–3054, October 2023.
- D’Incà et al. (2024) Moreno D’Incà, Elia Peruzzo, Massimiliano Mancini, Dejia Xu, Vidit Goel, Xingqian Xu, Zhangyang Wang, Humphrey Shi, and Nicu Sebe. Openbias: Open-set bias detection in text-to-image generative models. arXiv preprint arXiv:2404.07990, 2024.
- Esser et al. (2024) Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URL https://arxiv.org/abs/2403.03206.
- Ferrara (2023) Emilio Ferrara. Should chatgpt be biased? challenges and risks of bias in large language models. arXiv preprint arXiv:2304.03738, 2023.
- Gal et al. (2022) Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion, 2022. URL https://arxiv.org/abs/2208.01618.
- Gandikota et al. (2024) Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzyńska, and David Bau. Unified concept editing in diffusion models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 5111–5120, January 2024.
- Garcia et al. (2023) Noa Garcia, Yusuke Hirota, Yankun Wu, and Yuta Nakashima. Uncurated image-text datasets: Shedding light on demographic bias. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6957–6966, June 2023.
- Gunjal et al. (2023) Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. arXiv preprint arXiv:2308.06394, 2023.
- Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685.
- Hu et al. (2023) Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A. Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 20406–20417, October 2023.
- Huang et al. (2024) Yihao Huang, Felix Juefei-Xu, Qing Guo, Jie Zhang, Yutong Wu, Ming Hu, Tianlin Li, Geguang Pu, and Yang Liu. Personalization as a shortcut for few-shot backdoor attack against text-to-image diffusion models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(19):21169–21178, Mar. 2024. doi: 10.1609/aaai.v38i19.30110. URL https://ojs.aaai.org/index.php/AAAI/article/view/30110.
- Ji et al. (2023) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, mar 2023. ISSN 0360-0300. doi: 10.1145/3571730. URL https://doi.org/10.1145/3571730.
- Karras et al. (2022) Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models, 2022. URL https://arxiv.org/abs/2206.00364.
- Li et al. (2022a) Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 12888–12900. PMLR, 17–23 Jul 2022a.
- Li et al. (2022b) Yi Li, Rameswar Panda, Yoon Kim, Chun-Fu Richard Chen, Rogerio S Feris, David Cox, and Nuno Vasconcelos. Valhalla: Visual hallucination for machine translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5216–5226, 2022b.
- Luccioni et al. (2023) Sasha Luccioni, Christopher Akiki, Margaret Mitchell, and Yacine Jernite. Stable bias: Evaluating societal representations in diffusion models. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, pp. 1–14, 2023.
- Luo et al. (2024) Hanjun Luo, Ziye Deng, Ruizhe Chen, and Zuozhu Liu. Faintbench: A holistic and precise benchmark for bias evaluation in text-to-image models, 2024. URL https://arxiv.org/abs/2405.17814.
- Mehrabi et al. (2021) Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning. ACM Computing Surveys, 54(6):1–35, 2021.
- Naik & Nushi (2023) Ranjita Naik and Besmira Nushi. Social biases through the text-to-image generation lens. arXiv preprint arXiv:2304.06034, 2023.
- Ni et al. (2021) Jianmo Ni, Gustavo Hernández Ábrego, Noah Constant, Ji Ma, Keith B. Hall, Daniel Cer, and Yinfei Yang. Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models, 2021. URL https://arxiv.org/abs/2108.08877.
- Podell et al. (2023) Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023. URL https://arxiv.org/abs/2307.01952.
- Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, et al. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 8748–8763. PMLR, 18–24 Jul 2021.
- Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
- Rawte et al. (2023) Vipula Rawte, Amit Sheth, and Amitava Das. A survey of hallucination in large foundation models, 2023. URL https://arxiv.org/abs/2309.05922.
- Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, June 2022.
- Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi (eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pp. 234–241, Cham, 2015. Springer International Publishing.
- Ruiz et al. (2023) Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 22500–22510, June 2023.
- Saharia et al. (2022) Chitwan Saharia, William Chan, Saurabh Saxena, et al. Photorealistic text-to-image diffusion models with deep language understanding. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 36479–36494, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/ec795aeadae0b7d230fa35cbaf04c041-Paper-Conference.pdf.
- Schramowski et al. (2023) Patrick Schramowski, Manuel Brack, Björn Deiseroth, and Kristian Kersting. Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 22522–22531, June 2023.
- Seshadri et al. (2023) Preethi Seshadri, Sameer Singh, and Yanai Elazar. The bias amplification paradox in text-to-image generation. arXiv preprint arXiv:2308.00755, 2023.
- Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp. 2256–2265. PMLR, 2015.
- Song et al. (2020a) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a.
- Song et al. (2020b) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.
- Teo et al. (2024) Christopher Teo, Milad Abdollahzadeh, and Ngai-Man Cheung. On measuring fairness in generative models. Advances in Neural Information Processing Systems, 36, 2024.
- Vice et al. (2023) Jordan Vice, Naveed Akhtar, Richard Hartley, and Ajmal Mian. Quantifying bias in text-to-image generative models. arXiv preprint arXiv:2312.13053, 2023.
- Xiao et al. (2020) Qingguo Xiao, Guangyao Li, and Qiaochuan Chen. Image outpainting: Hallucinating beyond the image. IEEE Access, 8:173576–173583, 2020.
- Zhang et al. (2023) Cheng Zhang, Xuanbai Chen, Siqi Chai, Chen Henry Wu, Dmitry Lagun, Thabo Beeler, and Fernando De la Torre. Iti-gen: Inclusive text-to-image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3969–3980, October 2023.
Appendices
Appendix A: Full bias evaluation results of 103 text-to-image generative models. Evaluations are reported in ascending order of $\mathcal{B}$. Truncated results in Table 1 of the main manuscript are a sub-set of the full results presented here.
Model | Task Category | Denoiser | Resolution | Release (dd/mm/yy) | $P$ | $B_D$ | $H_J$ | $M_G$ | $\mathcal{B}$
---|---|---|---|---|---|---|---|---|---|
Envvi/Inkpunk-Diffusion | art | PNDMScheduler | 512 x 512 | 25/11/22 | 7.2323 | 18.9000 | 0.5346 | 0.0033 | -2.1711 |
Yntec/beLIEve | photo realism | DPMSolverMultistepScheduler | 768 x 768 | 01/08/24 | 5.2547 | 17.6176 | 0.5083 | 0.0000 | -2.1589 |
segmind/SSD-1B | photo realism | EulerDiscreteScheduler | 1024 x 1024 | 19/10/23 | 6.7116 | 15.7000 | 0.4747 | 0.0000 | -2.1098 |
RunDiffusion/Juggernaut-X-v10 | foundation | EulerDiscreteScheduler | 1024 x 1024 | 20/04/24 | 6.8125 | 16.3571 | 0.4992 | 0.0000 | -2.1031 |
prompthero/openjourney-v4 | photo realism | PNDMScheduler | 512 x 512 | 12/12/22 | 8.4414 | 15.9211 | 0.4881 | 0.0000 | -2.0981 |
Lykon/dreamshaper-8 | photo realism | DEISMultistepScheduler | 512 x 512 | 27/08/23 | 6.8769 | 17.3947 | 0.5467 | 0.0000 | -2.0649 |
RunDiffusion/Juggernaut-XL-v9 | foundation | DDPMScheduler | 1024 x 1024 | 19/02/24 | 8.6025 | 14.4048 | 0.4847 | 0.0000 | -2.0046 |
stabilityai/sd-turbo | foundation | EulerDiscreteScheduler | 512 x 512 | 28/11/23 | 7.9498 | 14.5476 | 0.4930 | 0.0000 | -1.9982 |
eienmojiki/Anything-XL | animation / anime | EulerAncestralDiscreteScheduler | 1024 x 1024 | 11/03/24 | 5.6000 | 14.8333 | 0.5287 | 0.0000 | -1.9446 |
stabilityai/stable-diffusion-3.5-medium | foundation | FlowMatchEulerDiscreteScheduler | 1024 x 1024 | 29/10/24 | 8.2481 | 14.0455 | 0.5049 | 0.0000 | -1.9393 |
MirageML/dreambooth-nike | photo realism | PNDMScheduler | 512 x 512 | 01/11/22 | 3.3402 | 14.4048 | 0.5206 | 0.0000 | -1.9323 |
dataautogpt3/ProteusV0.3 | foundation | EulerDiscreteScheduler | 1024 x 1024 | 13/02/24 | 7.3949 | 14.5833 | 0.5324 | 0.0000 | -1.9196 |
stablediffusionapi/juggernaut-reborn | foundation | PNDMScheduler | 512 x 512 | 21/01/24 | 5.0168 | 13.9783 | 0.5073 | 0.0167 | -1.9129 |
hongdthaui/ManmaruMix_v30 | animation / anime | PNDMScheduler | 512 x 512 | 19/01/24 | 4.4554 | 16.2391 | 0.5857 | 0.0033 | -1.9029 |
digiplay/Photon_v1 | photo realism | EulerDiscreteScheduler | 768 x 768 | 09/06/23 | 6.6038 | 13.4583 | 0.5195 | 0.0000 | -1.8667 |
lambdalabs/miniSD-diffusers | foundation | PNDMScheduler | 512 x 512 | 24/11/22 | 5.5457 | 14.6600 | 0.5640 | 0.0000 | -1.8550 |
nitrosocke/mo-di-diffusion | animation / anime | PNDMScheduler | 512 x 512 | 28/10/22 | 7.4843 | 14.3500 | 0.5696 | 0.0033 | -1.8175 |
Lykon/DreamShaper | photo realism | PNDMScheduler | 512 x 512 | 17/03/23 | 9.2373 | 13.7000 | 0.5521 | 0.0000 | -1.8142 |
ItsJayQz/GTA5_Artwork_Diffusion | animation / anime | PNDMScheduler | 512 x 512 | 13/12/22 | 6.9009 | 12.9348 | 0.5257 | 0.0033 | -1.8106 |
digiplay/PerfectDeliberate-Anime_v2 | animation / anime | EulerDiscreteScheduler | 768 x 768 | 07/04/24 | 5.9243 | 14.6923 | 0.5863 | 0.0000 | -1.8048 |
RunDiffusion/Juggernaut-XL-v6 | foundation | EulerDiscreteScheduler | 1024 x 1024 | 22/02/24 | 5.9995 | 14.1296 | 0.5766 | 0.0000 | -1.7889 |
Lykon/AbsoluteReality | photo realism | PNDMScheduler | 512 x 512 | 01/06/23 | 5.3458 | 12.6923 | 0.5297 | 0.0000 | -1.7866 |
segmind/Segmind-Vega | photo realism | EulerDiscreteScheduler | 1024 x 1024 | 01/12/23 | 6.1051 | 12.8462 | 0.5427 | 0.0000 | -1.7706 |
nitrosocke/redshift-diffusion | animation / anime | PNDMScheduler | 512 x 512 | 07/11/22 | 6.5239 | 12.0769 | 0.5139 | 0.0000 | -1.7699 |
stabilityai/stable-diffusion-3.5-large | foundation | FlowMatchEulerDiscreteScheduler | 1024 x 1024 | 22/10/24 | 9.2260 | 11.5769 | 0.4939 | 0.0000 | -1.7680 |
circulus/canvers-real-v3.9.1 | photo realism | PNDMScheduler | 512 x 512 | 05/05/24 | 3.8623 | 12.4615 | 0.5308 | 0.0000 | -1.7658 |
digiplay/AbsoluteReality_v1.8.1 | photo realism | DDIMScheduler | 768 x 768 | 04/08/23 | 6.5521 | 11.6481 | 0.5034 | 0.0000 | -1.7552 |
Yntec/YiffyMix | animation / anime | PNDMScheduler | 512 x 512 | 24/10/23 | 5.9138 | 12.1296 | 0.5259 | 0.0000 | -1.7492 |
Yntec/RealLife | photo realism | EulerDiscreteScheduler | 768 x 768 | 04/01/24 | 5.5369 | 11.9828 | 0.5216 | 0.0000 | -1.7461 |
digiplay/MilkyWonderland_v1 | animation / anime | EulerDiscreteScheduler | 768 x 768 | 30/09/23 | 4.8211 | 12.6538 | 0.5394 | 0.0167 | -1.7458 |
aipicasso/emi-3 | animation / anime | FlowMatchEulerDiscreteScheduler | 1024 x 1024 | 05/12/24 | 5.0561 | 11.2140 | 0.4891 | 0.0000 | -1.7457 |
WarriorMama777/AbyssOrangeMix2 | animation / anime | PNDMScheduler | 512 x 512 | 30/01/23 | 7.8778 | 12.3400 | 0.5353 | 0.0067 | -1.7397 |
WarriorMama777/AbyssOrangeMix | animation / anime | PNDMScheduler | 512 x 512 | 30/01/23 | 7.8316 | 11.5000 | 0.5096 | 0.0000 | -1.7299 |
liamhvn/disney-pixar-cartoon-b | animation / anime | PNDMScheduler | 512 x 512 | 12/07/23 | 4.9147 | 12.8548 | 0.5567 | 0.0167 | -1.7235 |
openart-custom/CrystalClearXL | photo realism | EulerDiscreteScheduler | 1024 x 1024 | 13/08/24 | 5.5520 | 12.2308 | 0.5464 | 0.0000 | -1.7134 |
stablediffusionapi/sdxl-unstable-diffusers-y | foundation | EulerDiscreteScheduler | 1024 x 1024 | 08/10/23 | 5.1028 | 11.3462 | 0.5151 | 0.0000 | -1.7052 |
dataautogpt3/ProteusV0.2 | foundation | KDPM2AncestralDiscreteScheduler | 1024 x 1024 | 19/01/24 | 7.1317 | 11.4655 | 0.5207 | 0.0000 | -1.7040 |
stabilityai/stable-diffusion-2-1 | foundation | DDIMScheduler | 512 x 512 | 07/12/22 | 10.5860 | 11.7963 | 0.5349 | 0.0100 | -1.6921 |
nitrosocke/Arcane-Diffusion | animation / anime | LMSDiscreteScheduler | 512 x 512 | 02/10/22 | 7.1673 | 10.9074 | 0.5004 | 0.0067 | -1.6887 |
cagliostrolab/animagine-xl-3.1 | animation / anime | EulerAncestralDiscreteScheduler | 1024 x 1024 | 13/03/24 | 8.9596 | 11.2143 | 0.5179 | 0.0000 | -1.6876 |
nuigurumi/basil_mix | art | PNDMScheduler | 512 x 512 | 04/01/23 | 7.0493 | 12.1800 | 0.5598 | 0.0000 | -1.6792 |
fluently/Fluently-XL-Final | foundation | EulerAncestralDiscreteScheduler | 1024 x 1024 | 06/06/24 | 6.6590 | 10.8226 | 0.5147 | 0.0000 | -1.6586 |
CompVis/stable-diffusion-v1-4 | foundation | PNDMScheduler | 512 x 512 | 20/08/22 | 10.7885 | 11.7258 | 0.5621 | 0.0000 | -1.6360 |
SPO-Diffusion-Models/SPO-SDXL_4k-p_10ep | foundation | EulerDiscreteScheduler | 1024 x 1024 | 07/06/24 | 6.2646 | 11.1667 | 0.5426 | 0.0000 | -1.6307 |
krnl/realisticVisionV51_v51VAE | photo realism | PNDMScheduler | 512 x 512 | 12/01/24 | 5.3375 | 11.1667 | 0.5510 | 0.0000 | -1.6122 |
openart-custom/AlbedoBase | foundation | EulerDiscreteScheduler | 1024 x 1024 | 13/09/24 | 5.5861 | 11.0161 | 0.5462 | 0.0000 | -1.6093 |
xyn-ai/anything-v4.0 | animation / anime | PNDMScheduler | 512 x 512 | 23/03/23 | 6.4597 | 11.6250 | 0.5692 | 0.0067 | -1.6044 |
lemon2431/toonify_v20 | animation / anime | PNDMScheduler | 512 x 512 | 16/10/23 | 4.1148 | 10.9412 | 0.5469 | 0.0033 | -1.5976 |
SG161222/RealVisXL_V4.0 | photo realism | EulerDiscreteScheduler | 1024 x 1024 | 13/02/24 | 8.6080 | 10.6250 | 0.5405 | 0.0000 | -1.5856 |
ckpt/anything-v4.5 | art | PNDMScheduler | 512 x 512 | 19/01/23 | 5.8262 | 11.5968 | 0.5784 | 0.0033 | -1.5837 |
emilianJR/chilloutmix_NiPrunedFp32Fix | photo realism | PNDMScheduler | 512 x 512 | 19/04/23 | 6.0849 | 10.7941 | 0.5498 | 0.0000 | -1.5810 |
Kernel/sd-nsfw | photo realism | PNDMScheduler | 512 x 512 | 15/07/23 | 5.5154 | 11.5333 | 0.5828 | 0.0000 | -1.5711 |
Lykon/AAM_XL_AnimeMix | animation / anime | EulerDiscreteScheduler | 1024 x 1024 | 19/01/24 | 6.8090 | 9.0152 | 0.4843 | 0.0000 | -1.5367 |
stablediffusionapi/realistic-stock-photo | photo realism | EulerDiscreteScheduler | 1024 x 1024 | 22/10/23 | 5.1367 | 10.2188 | 0.5451 | 0.0000 | -1.5366 |
GraydientPlatformAPI/comicbabes2 | art | PNDMScheduler | 512 x 512 | 07/01/24 | 4.0530 | 10.0313 | 0.5447 | 0.0033 | -1.5156 |
SG161222/Realistic_Vision_V6.0_B1_noVAE | photo realism | PNDMScheduler | 896 x 896 | 29/11/23 | 7.7733 | 11.1571 | 0.5957 | 0.0000 | -1.5066 |
scenario-labs/juggernaut_reborn | photo realism | DPMSolverMultistepScheduler | 512 x 512 | 29/05/24 | 4.5414 | 10.0294 | 0.5504 | 0.0000 | -1.5061 |
SG161222/RealVisXL_V5.0 | photo realism | DDIMScheduler | 1024 x 1024 | 05/08/24 | 6.6694 | 9.9412 | 0.5482 | 0.0000 | -1.5022 |
stable-diffusion-v1-5/stable-diffusion-v1-5 | foundation | PNDMScheduler | 512 x 512 | 07/09/24 | 9.2497 | 9.6429 | 0.5361 | 0.0000 | -1.4981 |
Corcelio/openvision | foundation | EulerDiscreteScheduler | 1024 x 1024 | 13/05/24 | 5.7229 | 9.2576 | 0.5161 | 0.0033 | -1.4963 |
gsdf/Counterfeit-V2.5 | animation / anime | DDIMScheduler | 512 x 512 | 02/02/23 | 8.8241 | 12.2059 | 0.6204 | 0.0433 | -1.4890 |
fluently/Fluently-XL-v4 | foundation | EulerAncestralDiscreteScheduler | 1024 x 1024 | 02/05/24 | 6.8337 | 9.2222 | 0.5205 | 0.0000 | -1.4866 |
Ojimi/anime-kawai-diffusion | animation / anime | DEISMultistepScheduler | 512 x 512 | 24/03/23 | 6.2192 | 11.1667 | 0.5968 | 0.0200 | -1.4843 |
tilake/China-Chic-illustration | art | PNDMScheduler | 512 x 512 | 15/01/23 | 5.7534 | 10.0294 | 0.5655 | 0.0000 | -1.4720 |
digiplay/FormCleansingMix_v1 | animation / anime | DPMSolverMultistepScheduler | 768 x 768 | 20/06/23 | 4.4630 | 10.8243 | 0.5936 | 0.0167 | -1.4647 |
Lykon/dreamshaper-7 | photo realism | DEISMultistepScheduler | 512 x 512 | 27/08/23 | 6.7903 | 10.0625 | 0.5704 | 0.0000 | -1.4638 |
gligen/diffusers-generation-text-box | photo realism | PNDMScheduler | 512 x 512 | 11/03/23 | 5.5617 | 10.0294 | 0.5719 | 0.0000 | -1.4570 |
stabilityai/stable-diffusion-xl-base-1.0 | photo realism | EulerDiscreteScheduler | 1024 x 1024 | 25/07/23 | 11.1360 | 8.6515 | 0.5076 | 0.0000 | -1.4492 |
pt-sk/stable-diffusion-1.5 | foundation | PNDMScheduler | 512 x 512 | 02/03/24 | 5.0628 | 9.5556 | 0.5584 | 0.0000 | -1.4397 |
playgroundai/playground-v2.5-1024px-aesthetic | art | EDMDPMSolverMultistepScheduler | 1024 x 1024 | 17/02/24 | 8.8552 | 9.5270 | 0.5649 | 0.0000 | -1.4219 |
danbrown/RevAnimated-v1-2-2 | animation / anime | DDIMScheduler | 256 x 256 | 01/05/23 | 5.0277 | 8.5000 | 0.5134 | 0.0000 | -1.4198 |
redstonehero/animesh_prunedv21 | animation / anime | PNDMScheduler | 512 x 512 | 17/08/23 | 4.0369 | 9.6622 | 0.5705 | 0.0033 | -1.4197 |
dreamlike-art/dreamlike-photoreal-2.0 | photo realism | DDIMScheduler | 768 x 768 | 04/01/23 | 8.6776 | 8.8415 | 0.5451 | 0.0033 | -1.3885 |
emilianJR/epiCRealism | photo realism | PNDMScheduler | 512 x 512 | 25/06/23 | 6.9505 | 8.8846 | 0.5499 | 0.0067 | -1.3793 |
naclbit/trinart_characters_19.2m_stable_diffusion_v1 | animation / anime | PNDMScheduler | 512 x 512 | 15/10/22 | 6.2754 | 12.3750 | 0.6536 | 0.0767 | -1.3757 |
segmind/tiny-sd | photo realism | DPMSolverMultistepScheduler | 512 x 512 | 28/07/23 | 5.7272 | 10.1757 | 0.6124 | 0.0000 | -1.3722 |
yodayo-ai/kivotos-xl-2.0 | animation / anime | EulerAncestralDiscreteScheduler | 1024 x 1024 | 02/06/24 | 6.5836 | 7.1410 | 0.4717 | 0.0000 | -1.3277 |
ZB-Tech/Text-to-Image | photo realism | PNDMScheduler | 512 x 512 | 10/03/24 | 6.4625 | 8.7162 | 0.5697 | 0.0000 | -1.3220 |
digiplay/aurorafantasy_v1 | animation / anime | EulerDiscreteScheduler | 768 x 768 | 06/04/24 | 5.2524 | 8.8864 | 0.5813 | 0.0133 | -1.3005 |
stablediffusionapi/mklan-xxx-nsfw-pony | photo realism | EulerDiscreteScheduler | 1024 x 1024 | 29/05/24 | 6.5259 | 7.4744 | 0.5100 | 0.0000 | -1.2981 |
stabilityai/sdxl-turbo | foundation | EulerAncestralDiscreteScheduler | 512 x 512 | 27/11/23 | 10.1726 | 8.0778 | 0.5493 | 0.0000 | -1.2922 |
Lykon/dreamshaper-xl-v2-turbo | foundation | EulerDiscreteScheduler | 1024 x 1024 | 08/02/24 | 6.6775 | 6.7381 | 0.4679 | 0.0000 | -1.2769 |
dataautogpt3/OpenDalleV1.1 | foundation | KDPM2AncestralDiscreteScheduler | 1024 x 1024 | 22/12/23 | 7.7311 | 7.6951 | 0.5351 | 0.0000 | -1.2747 |
Corcelio/mobius | foundation | EulerDiscreteScheduler | 1024 x 1024 | 13/05/24 | 6.0584 | 7.2273 | 0.5280 | 0.0000 | -1.2271 |
kandinsky-community/kandinsky-2-1 | art | DDIMScheduler | 512 x 512 | 24/05/23 | 6.5896 | 7.1735 | 0.5332 | 0.0000 | -1.2086 |
friedrichor/stable-diffusion-2-1-realistic | photo realism | DDIMScheduler | 512 x 512 | 04/06/23 | 4.5053 | 6.7857 | 0.5061 | 0.0033 | -1.2060 |
CompVis/stable-diffusion-v1-1 | foundation | PNDMScheduler | 512 x 512 | 19/08/22 | 6.5538 | 6.8864 | 0.5218 | 0.0200 | -1.1716 |
stablediffusionapi/realistic-vision-51 | photo realism | PNDMScheduler | 512 x 512 | 07/08/23 | 5.8350 | 6.9490 | 0.5456 | 0.0000 | -1.1498 |
UnfilteredAI/NSFW-gen-v2 | photo realism | EulerAncestralDiscreteScheduler | 1024 x 1024 | 15/04/24 | 6.8121 | 6.4111 | 0.5102 | 0.0000 | -1.1442 |
nitrosocke/Ghibli-Diffusion | animation / anime | PNDMScheduler | 704 x 512 | 18/11/22 | 7.6340 | 6.3478 | 0.5523 | 0.0000 | -1.0444 |
dataautogpt3/ProteusV0.4-Lightning | foundation | EulerDiscreteScheduler | 1024 x 1024 | 22/02/24 | 6.2324 | 4.6739 | 0.4055 | 0.0000 | -1.0220 |
SG161222/Realistic_Vision_V2.0 | photo realism | PNDMScheduler | 512 x 512 | 21/03/23 | 8.3000 | 6.1154 | 0.5480 | 0.0000 | -1.0167 |
sd-community/sdxl-flash | foundation | DPMSolverSinglestepScheduler | 1024 x 1024 | 19/05/24 | 7.2798 | 4.7826 | 0.4424 | 0.0000 | -0.9809 |
Mitsua/mitsua-diffusion-cc0 | art | PNDMScheduler | 512 x 512 | 22/12/22 | 5.1305 | 10.8947 | 0.7426 | 0.0900 | -0.9368 |
OnomaAIResearch/Illustrious-xl-early-release-v0 | animation / anime | EulerDiscreteScheduler | 1024 x 1024 | 20/09/24 | 7.3851 | 11.6000 | 0.7524 | 0.1700 | -0.8688 |
digiplay/ZHMix-Dramatic-v2.0 | animation / anime | EulerDiscreteScheduler | 768 x 768 | 03/12/23 | 4.7306 | 5.7800 | 0.6506 | 0.0333 | -0.6689 |
DGSpitzer/Cyberpunk-Anime-Diffusion | animation / anime | PNDMScheduler | 704 x 704 | 28/10/22 | 6.7796 | 2.9722 | 0.5845 | 0.0600 | -0.1492 |
Emanon14/NONAMEmix_v1 | animation / anime | EulerAncestralDiscreteScheduler | 1024 x 1024 | 23/11/24 | 5.1821 | 5.3400 | 0.7447 | 0.1700 | -0.1237 |
Onodofthenorth/SD_PixelArt_SpriteSheet_Generator | art | PNDMScheduler | 512 x 512 | 01/11/22 | 6.4996 | 6.5930 | 0.7898 | 0.3367 | 0.0841 |
Niggendar/duchaitenPonyXLNo_ponyNoScoreV40 | art | EulerDiscreteScheduler | 1024 x 1024 | 01/06/24 | 5.1912 | 3.1620 | 0.7458 | 0.1000 | 0.3237 |
lambdalabs/sd-pokemon-diffusers | animation / anime | PNDMScheduler | 512 x 512 | 16/09/22 | 6.3118 | 9.6509 | 0.8639 | 0.6033 | 0.6519 |
Raelina/Raehoshi-illust-XL-3 | animation / anime | EulerDiscreteScheduler | 1024 x 1024 | 11/12/24 | 3.7427 | 4.8772 | 0.8926 | 0.6700 | 1.7549 |
monadical-labs/minecraft-skin-generator-sdxl | animation / anime | EulerDiscreteScheduler | 768 x 768 | 19/02/24 | 5.3317 | 5.9786 | 0.9721 | 0.7933 | 3.3680 |
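For readers who wish to recompute per-task aggregates from the table above, the following minimal Python sketch parses the pipe-delimited rows and averages the final numeric column for each generative task. The column assignments here are our assumption: the rows are sorted in ascending order of their last field, which we take to be the overall log-based bias score, and the second field is the generative task. Only two sample rows are embedded for brevity; the remaining rows follow the same layout.

    # Minimal sketch, assuming the last field of each row is the overall
    # log-based bias score and the second field is the generative task.
    from collections import defaultdict

    rows = """\
    nitrosocke/mo-di-diffusion | animation / anime | PNDMScheduler | 512 x 512 | 28/10/22 | 7.4843 | 14.3500 | 0.5696 | 0.0033 | -1.8175 |
    monadical-labs/minecraft-skin-generator-sdxl | animation / anime | EulerDiscreteScheduler | 768 x 768 | 19/02/24 | 5.3317 | 5.9786 | 0.9721 | 0.7933 | 3.3680 |
    """.strip().splitlines()

    scores = defaultdict(list)
    for row in rows:
        # Split on the pipe delimiter; drop the empty field left by the
        # trailing "|" that closes each row.
        fields = [f.strip() for f in row.split("|") if f.strip()]
        task = fields[1]                      # e.g. "foundation", "art"
        scores[task].append(float(fields[-1]))

    for task, vals in sorted(scores.items()):
        print(f"{task}: mean bias score = {sum(vals) / len(vals):+.4f} (n={len(vals)})")

Extending the embedded string to all rows of the table yields per-task means directly comparable across the four task categories, with more positive values indicating stronger measured bias under this score.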