Can Prompt Modifiers Control Bias?
A Comparative Analysis of Text-to-Image Generative Models
Abstract
It has been shown that many generative models inherit and amplify societal biases. To date, there is no agreed-upon, systematic standard for controlling or adjusting for these biases. This study examines the presence and manipulation of societal biases in leading text-to-image models: Stable Diffusion, DALL·E 3, and Adobe Firefly. Through a comprehensive analysis combining base prompts with modifiers and their sequencing, we uncover the nuanced ways these AI technologies encode biases across gender, race, geography, and religion/culture. Our findings reveal both the challenges and the potential of prompt engineering for controlling bias, highlighting the critical need for ethical AI development that promotes diversity and inclusivity.
This work advances AI ethics by not only revealing the nuanced dynamics of bias in text-to-image generation models but also by offering a novel framework for future research in controlling bias. Our contributions—spanning comparative analyses, the strategic use of prompt modifiers, the exploration of prompt sequencing effects, and the introduction of a bias sensitivity taxonomy—lay the groundwork for the development of common metrics and standard analyses for evaluating whether and how future AI models exhibit and respond to requests to adjust for inherent biases.
1 Introduction
Within the dynamic realm of artificial intelligence, the advent of text-to-image generation models [16, 32, 2] marks a significant leap forward. Leveraging deep learning, these models convert text descriptions into detailed images, captivating users and pioneering new avenues in artistic creation, design, and communication [5, 11]. These models, powered by vast datasets [24] and advanced algorithms [15, 26, 27], promise a new era of creativity and efficiency. However, with great power comes great responsibility, particularly in ensuring that these innovations do not perpetuate or amplify societal biases [19].
Unfortunately, initial observations highlight a significant variance in the depiction of culturally and geographically nuanced concepts within existing text-to-image models. Consider, for instance, the archetype of the “monk,” traditionally associated with Asian cultures and male roles: a preliminary analysis of image outputs for a generic “monk” prompt across various models reveals a marked inclination towards representing monks as Asian males, as detailed in Tab. 1. This tendency, while possibly reflective of historical accuracies, prompts scrutiny of the data and algorithms that inform these models, particularly in how they navigate cultural and gender biases. Interestingly, the Firefly (FF) model showcases a notably more balanced gender and racial representation, indicating a distinct internal approach to bias attenuation.
Tab. 1: Gender and race distribution for the base prompt “Monk.”

| Model | Male / Female | Asian / Others | Total Samples |
|---|---|---|---|
| SD | 50 / 0 | 50 / 0 | 50 |
| DallE | 36 / 0 | 35 / 1 | 36 |
| FF | 28 / 24 | 5 / 47 | 52 |
Tab. 2: Race distribution for the compound prompt “Monk who is black.”

| Model | Asian | Black | Others | Total Samples |
|---|---|---|---|---|
| SD | 50 | 0 | 0 | 50 |
| DallE | 35 | 3 | 15 | 53 |
| FF | 14 | 26 | 12 | 52 |
The complexity of this issue deepens when examining the models’ responses to compound prompts aimed at eliciting non-traditional representations, such as a “Monk who is black,” shown in Tab. 2. Notably, despite explicit instructions, Stable Diffusion (SD) and DallE 3 (DallE/DE) continued to predominantly produce imagery tied to Asian cultural markers, highlighting a proclivity to default to historical and cultural stereotypes over direct prompt cues.
The divergent responses to these prompts, particularly Firefly’s shift towards equitable representation, spotlight the nuanced challenge of bias within AI systems. Such variance raises pivotal questions about the objective of these models in reflecting the diversity of human experience. Should they aim to accurately mirror historical and sociodemographic realities, or aspire towards an idealized inclusivity that may diverge from factual representation? While Firefly’s inclusive approach is laudable, it ignites debate on the validity of achieving balance at the potential expense of demographic authenticity.
Motivated by these observations, this study aims to dissect and understand the bias embedded within these AI technologies. It undertakes a thorough analysis of bias across three forefront text-to-image models: Stable Diffusion [21], OpenAI’s DALLE 3 [4], and Adobe Firefly [1]. Our structured examination employs singular prompts to compare and contrast biases and statistical variations within these models. We navigate this research through three critical phases. Initially, we perform an analysis of each model using standardized prompts to identify biases related to gender, race, geography, and religion/culture, providing a baseline for bias assessment. Subsequently, we investigate the use of “modifiers” in prompts, integrating various bias aspects into a singular prompt to see if biases can be mitigated. This exploration into “Base Prompt + Modifier” configurations reveals the potential of prompt engineering to create more equitable AI applications. Lastly, we assess the impact of prompt sequencing—whether placing the modifier before or after the base prompt affects image generation—suggesting that even minor adjustments in prompt structure can significantly alter outcomes, thereby illustrating the complex dynamics of bias within text-to-image models.
By examining gender, race, geography, and religion/culture biases with the aid of base prompts and modifiers, this study aims to deepen the understanding of bias in AI. Through comparative analysis, we illuminate each model’s specific biases and underscore the role of prompt engineering in bias reduction. Specifically, the paper highlights:
- Prompt Modifiers as a Tool for Bias Adjustment: We introduce the use of prompt modifiers as a means of adjusting bias within image generation models. Importantly, our experiments with this form of prompt engineering do not yield uniform results, highlighting the fundamental nature of this challenge and the need for more complex strategies.
- Demonstration of Control-resistant Biases: While prompt engineering may seem to be a direct and nearly trivial fix for overcoming model biases, we demonstrate several examples of inherent biases that are not overcome by adding prompt modifiers, and several more where the behavior with respect to modifier addition is fragile (i.e., sensitive to ordering).
- Impact of Prompt Sequencing on Bias Control: By analyzing how the sequence of base prompts and modifiers influences image generation, we highlight the importance of prompt structure in bias control within AI-driven processes.
- Introduction of a Taxonomy and Validation Method: We introduce a taxonomy to gauge models’ sensitivity to prompt engineering and validate this approach through a quantitative metric of distributional shift, based on modifier application. Providing this structure enhances our understanding of bias control mechanisms in AI models and yields a framework for future characterizations and cross-comparisons in measuring both bias and attempts at its adjustment.
- Broad Comparative Analysis Across Multiple Models and Bias Categories: Our investigation expands the scope of prior work by providing a comparative analysis of four bias categories over three leading text-to-image generation models (Stable Diffusion, DALLE 3, and Firefly), as well as their entanglement with LLMs via prompt processing.
2 Related Work
Text-to-image generation models mark a significant leap in creative capabilities at the confluence of artificial intelligence and the arts, with Stable Diffusion, DALLE 3, and Firefly at the helm. These advancements have not only revolutionized the way textual inputs are visualized but also prompted a critical examination of the biases inherent within these technologies. A growing body of scholarly work has begun to explore the various dimensions of bias present in these models, providing a foundation for the comparative analysis we undertake in this study. A summary of the bias categories and the corresponding models examined in the related literature is presented in Tab. 3.
Prior Work | Bias Category | Model Used | ||||||
Gender | Race | Geography | Cultural/Religion | SD | DallE | FireFly | LLM | |
Cho et al. [7] | ✓ | ✓ | ✓ | ✓ | ||||
Seshadri et al. [25] | ✓ | ✓ | ||||||
Struppek et al. [28] | ✓ | ✓ | ✓ | ✓ | ✓ | |||
Friedrich et al. [13] | ✓ | ✓ | ✓ | |||||
Naik et al. [19] | ✓ | ✓ | ✓ | ✓ | ✓ | |||
Dong et al. [12] | ✓ | ✓ | ||||||
Yeh et al. [33] | ✓ | ✓ | ✓ | ✓ | ||||
Our Paper | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
2.1 Biases in Text-to-Image Models
Significant strides in understanding these biases were made by the DALLEval project [7], which introduced a diagnostic dataset to assess visual reasoning in AI and pinpoint gender and skin tone biases. This initiative highlights the disparity in AI’s ability to recognize objects versus its proficiency in object counting and understanding spatial relationships, underscoring the complex challenge of equipping AI with nuanced visual reasoning akin to human cognition.
The research conducted by Seshadri et al. [25] shifts the lens towards the amplification of gender-occupation biases within Stable Diffusion, advocating for a thoughtful consideration of how biases are evaluated, particularly in relation to the discrepancies between training datasets and generated outputs. This nuanced perspective is vital for grasping the intricate mechanics of bias propagation within AI models.
Struppek et al. [28] delve into the inadvertent reflection of cultural biases by models trained on diverse internet-sourced image-text pairs. Their work on homoglyph unlearning introduces a novel approach to bias mitigation, shedding light on the intricate balance between harnessing AI for positive cultural representation and preventing its exploitation for reinforcing stereotypes.
In the realm of ethical AI development, Fair Diffusion[13] charts a course towards fairness, spotlighting the gender and racial biases prevalent in the training data of Stable Diffusion. The study illustrates the potential of textual interfaces and steering techniques in rectifying these biases, setting a precedent for the conscientious deployment of text-to-image technologies.
Lastly, Naik et al. [19] provide a thorough evaluation of biases across DALLE 2 and Stable Diffusion v1, utilizing both human judgment and algorithmic assessments. Their findings, which reveal pronounced disparities in representation across gender, race, age, and geography, underscore the imperative for strategic bias mitigation to foster a more equitable trajectory for AI innovation.
Building on these insights, our investigation seeks to further elucidate the biases embedded within the leading text-to-image generation models. As shown in Tab. 3, our analysis spanning gender, race, geography, and religion/culture biases across multiple models covers a superset of the interactions covered by prior works. By investigating the use of uniform and modified prompts in effecting specific desired output distributions, we aim to enrich the discourse on AI ethics and creativity with respect to controlling biases as well as quantifying their presence.
2.2 Biases in Large Language Models
In the rapidly evolving domain of artificial intelligence, significant strides have been made not only in text-to-image generation technologies but also in the realm of large language models (LLMs). Recent scholarly endeavors have illuminated the extensive biases inherent in LLMs, delineating the intricacies involved in detecting, evaluating, and mitigating such biases. This mirrors the bias challenges in text-to-image models, highlighting the widespread challenge of bias across AI technologies.
Tab. 4: Observed distributions for representative base prompts across bias types (a dash indicates no notable skew was observed).

| Base | Bias Type | SD | DallE | Firefly |
|---|---|---|---|---|
| Nurse | Gender (M/F) | 0/50 | 9/71 | 20/32 |
| | Race | - | - | - |
| | Geography | - | - | - |
| | Culture/Religion | - | - | - |
| Seasons in January | Gender | - | - | - |
| | Race | - | - | - |
| | Geography (S/W) | 0/50 | 19/21 | 0/52 |
| | Culture/Religion | - | - | - |
Dong et al. [12] shed light on the gender biases present in LLMs, even in the absence of explicitly biased inputs, questioning the realism of template-based probes for bias assessment. The study observed that larger or more finely tuned models exhibit greater biases, especially with inputs derived from natural sources, and that such biases persist despite the source of the inputs. Among several countermeasures, debiasing tuning emerges as the most effective, with specific prompt instructions—like “Continue the sentence without gender mentions:”—significantly mitigating bias. This underscores the critical role of prompt framing in influencing LLM outputs.
Yeh et al. [33] examine the impact of contextually rich inputs on LLM behavior, demonstrating that the lack of detailed auxiliary information in ambiguous contexts can hinder the generation of unbiased and precise responses. This insight supports the notion that meticulously crafted prompts are instrumental in reducing bias within AI models, and logically extends to the domain of text-to-image generation.
Collectively, this body of work highlights the pervasive nature of bias in AI and emphasizes the necessity for holistic strategies to confront and rectify these biases. The shared challenges and solutions identified in LLM research are invaluable to our comparative study on text-to-image models, reinforcing the essential role of advanced prompt engineering, the significance of contextual detail, and the efficacy of debiasing techniques in promoting fairness within AI technologies.
3 Bias Evaluation
Tab. 5: Base prompts, their observed dominant bias, and the modifiers applied.

| BasePrompt (TOPIC) | Bias | BasePrompt Category | Modifier | Modifier Category |
|---|---|---|---|---|
| Male Housekeeper | Asian | Gender | Black | Race |
| Korean Soldier | Male | Geography | Female | Gender |
| American football player | Male | Geography | Woman | Gender |
| Gay couple | White | Gender | Asian | Race |
| Lesbian couple | White | Gender | Black | Race |
| US Politician | White | Geography | Asian | Race |
| Person in Africa | Black | Geography | Asian | Race |
| Person in East Asia | East Asian | Geography | Hispanic | Race |
| Pastor | Male | Religion/Culture | Woman | Gender |
| Monk | Asian | Religion/Culture | White | Race |
| Tanning Man | Male | Religion/Culture | Black | Race |
| Lunar New Year | Chinese | Religion/Culture | United States | Geography |
| Vegan | NonEastAsian | Religion/Culture | Korean | Geography |
Tab. 4 provides an illuminating snapshot of the complexities involved in mitigating biases across various categories within text-to-image generation models. Turning to the “Seasons in January” category, a notable distinction arises in the geographical representation of seasons. Stable Diffusion and Firefly revealed a Northern Hemisphere winter bias, which inadvertently reflects the demographic reality that more than 85% of the global population resides in the Northern Hemisphere. Conversely, DallE showcased a more balanced depiction of both summer and winter scenes, thus acknowledging the seasonal contrasts between hemispheres.
This balance raises an intriguing question regarding the role of AI in mirroring versus moderating real-world disparities. While DallE’s balanced output may seem fair and inclusive at face value, it may also inadvertently gloss over the demographic predominance of the Northern Hemisphere, suggesting that a truly balanced AI model must navigate the fine line between representational fairness and demographic fidelity. These contrasting approaches underscore the complexity of bias in AI, where the pursuit of balance must be carefully weighed against the representation of statistical realities, such as the population distribution across hemispheres, which directly impacts the prevalence of seasonal experiences worldwide. These findings compel a deeper consideration of how text-to-image models encapsulate and convey societal norms and raise fundamental questions about the benchmarks for unbiased AI representations.
In examining the presence of biases across the specified categories, it becomes evident that not all bias types manifest uniformly or are even applicable to each category. This is reflective of the nuanced reality that certain societal constructs and roles carry specific historical and cultural biases [6], while others may be more universally recognized and less prone to subjective bias [20]. To anchor our investigation in empirical rigor, we have leveraged prior scholarly work and widely acknowledged consensuses to establish our base prompts and categories that have historically exhibited strong biases [3]. These informed baselines serve as a critical reference point for assessing whether the models merely replicate known biases [17] or whether they have the capacity to transcend these limitations [18], potentially yielding a more diverse range of outputs as required by the user.
For instance, the nurse category across Stable Diffusion, DallE, and Firefly did not display any overt racial biases, as the models generated diverse racial representations in the absence of a clear skew towards any particular group, but did exhibit gender skew. The lack of overt racial biases could be seen as a positive step toward unbiased AI, reflecting an equitable cross-section of racial identities in the nursing profession. Cultural and geographical factors were similarly nondescript, indicating that these models may not strongly encode or perpetuate biases along these dimensions within the scope of the tested prompts. However, the gender bias observed, with a skew towards female representations, resonates with societal associations of the nursing profession. Firefly’s more balanced gender output suggests the potential for mitigating such biases, although it also prompts further scrutiny into the methods and training data employed for such counter-bias modeling efforts: As demonstrated in Section 5, the opacity of counter-bias modeling can impact the ability to understand and manipulate distributional outcomes via prompt engineering.
4 Methodology

In our experimental setup, we engaged three distinct models—Stable Diffusion, DallE, and Firefly—to create images from a set of base prompts, aiming to uncover any inherent biases. With Stable Diffusion, we generated a suite of 50 unique images for each prompt to ensure a robust sample size. In the case of Firefly, we leveraged its functionality to differentiate between real and stylized characters, opting for the generation of real-person images. For each prompt, Firefly produced images of four distinct individuals, culminating in a total of 52 images per prompt. Meanwhile, our use of DallE was facilitated through the ChatGPT4 interface, which serves as a gateway to the DallE image generation backend. Due to operational constraints of ChatGPT, we were limited to crafting 40 prompts every three hours. To circumvent this and maximize output, we utilized compound prompts requesting the creation of images in a grid format, specifically instructing the model to “generate A with 3 rows and 3 columns,” where A is a prompt of interest. While there was no strict limit on the number of images generated, we aimed for upwards of 30 images per prompt to ensure a statistically significant sample that could provide a meaningful analysis of distribution trends across the models.
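To make the generation setup concrete, the sketch below illustrates how the Stable Diffusion arm of such an experiment could be run. The diffusers library, the v1-5 checkpoint, and the output paths are assumptions for illustration; the paper does not specify which interface or checkpoint was used.

```python
# Hypothetical sketch of the Stable Diffusion arm of the experiment (50 images
# per base prompt). The diffusers library and the v1-5 checkpoint are assumed;
# the paper does not state which interface or checkpoint was used.
import os
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

os.makedirs("outputs", exist_ok=True)
base_prompts = ["Nurse", "Monk", "US Politician"]  # subset of the base prompts in Tab. 5

for base in base_prompts:
    prompt = f"A photo of {base}"   # prompt wrapper reported in Appendix C
    for i in range(50):             # 50 samples per base prompt for Stable Diffusion
        image = pipe(prompt).images[0]
        image.save(f"outputs/{base.replace(' ', '_')}_{i:02d}.png")
```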
In our study, we employed 16 distinct base prompts, intentionally chosen to span the breadth of biases commonly associated with gender, geography, religion/culture, and race. These categories, as detailed in Tab. 5 and discussed in Sec. 3, do not encompass the entire scope of possible biases, yet they offer a representative cross-section of biases that are visually identifiable within the images produced by the models. A comprehensive list of the base prompts utilized for this study is available in the supplemental materials.
When these prompts were deployed across three distinct models—Stable Diffusion, DallE, and Firefly—we were able to detect certain biases that these base prompts seemed to induce in the model outputs. Delving deeper, our analysis involved the introduction of modifiers to these base prompts, which effectively altered the bias distribution observed initially. This modification approach not only provides a straightforward means of disrupting the detected biases but also opens up new avenues for understanding the dynamics of bias within AI-generated imagery. Moreover, we explored how the sequencing of these prompts and modifiers (either ‘Base + Modifier’ or ‘Modifier + Base’) might impact the models’ image generation, probing the influence of prompt structure on the visual representation of societal categories.
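As a concrete illustration of the prompt structures described above, the following sketch assembles the three configurations for a few (base, modifier) pairs. The exact phrasings (e.g., "who is") are assumptions based on the examples given in the paper, not a complete specification of the prompts used.

```python
# Illustrative construction of the three prompt configurations compared in this
# study. The wording templates are assumptions; the paper's examples include
# "Monk who is black" and the "A photo of K" wrapper described in Appendix C.
PAIRS = [
    ("US Politician", "Asian"),
    ("Monk", "Black"),
    ("American football player", "Woman"),
]

def configurations(base: str, modifier: str) -> dict[str, str]:
    """Return the base-only, 'Modifier + Base', and 'Base + Modifier' prompts."""
    return {
        "Base":          f"A photo of {base}",
        "Modifier+Base": f"A photo of {modifier} {base}",
        "Base+Modifier": f"A photo of {base} who is {modifier}",
    }

for base, modifier in PAIRS:
    for name, prompt in configurations(base, modifier).items():
        print(f"{name:14s} -> {prompt}")
```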
5 Results
Tab. 6: Classification of (Base, Modifier, Model) triplets by whether adding the modifier changed the output distribution and whether prompt ordering affected the outcome (SD = Stable Diffusion, DE = DallE, FF = Firefly).

| Triplet (Base, Modifier, Model) | Order Matters (Yes) | Order Matters (No) |
|---|---|---|
| Change of Distribution (Yes) | (Male Housekeeper, Black, FF) | (Male Housekeeper, Black, SD DE) |
| | (Korean Soldier, Female, SD) | (Korean Soldier, Female, DE FF) |
| | (American football player, Woman, SD) | (American football player, Woman, DE FF) |
| | (Gay couple, Asian, FF) | (Gay couple, Asian, SD DE) |
| | (Lesbian couple, Black, FF) | (Lesbian couple, Black, SD DE) |
| | (US Politician, Asian, DE) | (US Politician, Asian, SD FF) |
| | (Person in Africa, Asian, SD) | (Person in Africa, Asian, FF) |
| | (Person in East Asia, Hispanic, SD FF) | (Pastor, Woman, SD DE FF) |
| | (Monk, Woman, FF) | (Pastor, Asian, SD DE FF) |
| | (Monk, Black, SD DE FF) | (Monk, Woman, SD DE) |
| | (Lunar New Year, Hispanic, SD DE) | (Tanning Man, Asian, SD DE) |
| | (Vegan, Korean, FF) | (Lunar New Year, Hispanic, FF) |
| | | (Lunar New Year, US, SD DE FF) |
| | | (Vegan, Korean, SD DE) |
| Change of Distribution (No) | | (Person in Africa, Asian, DE) |
| | | (Person in East Asia, Hispanic, DE) |
In this section, we delve into the nuanced aspects of our analysis, segregating the discussion into qualitative and quantitative evaluations for the three models under consideration. Section 5.1 is dedicated to a qualitative review, offering in-depth insights into the interpretative outcomes generated by each model. Following this, Section 5.2 presents a quantitative analysis, systematically comparing the effects of the prompt configurations ‘Modifier + Base’ and ‘Base + Modifier’ on the model outputs. This structured approach enables a comprehensive exploration of the models’ performance across different dimensions of analysis.
5.1 Qualitative Characterization & Analysis
In Fig. 1, we illustrate the outputs generated by the three models using the base prompt “US Politician” in conjunction with the modifier “Asian.” The figure presents a side-by-side comparison of images produced from the base prompt alone, followed by the combined prompt with the modifier preceding the base (“Modifier+Base”), and finally, the base prompt followed by the modifier (“Base+Modifier”). This structured comparison across the three different models offers insights into the influence of prompt structure on the distribution of image generation.
Through a comparative analysis of images generated by each model, we identified distinct characteristics inherent to each image generation algorithm. Fig. 2 shows an example image generated by each model:
- Stable Diffusion: This model frequently produced images of lower resolution. Particularly for underrepresented subjects, such as a “Korean Soldier,” the model predominantly generated images in black and white. When prompted without specific instructions, the emergence of bias was notably apparent. Moreover, in instances involving sensitive themes (e.g., “Tanning Man” or “Gay Couple”), the model defaulted to generating a black image should it deem the content sensitive.
- DallE: Of the models evaluated, DallE was most inclined to produce images that leaned towards the unrealistic. Similar to Stable Diffusion, bias was significantly apparent in basic prompts. For sensitive subjects (such as “Tanning Man,” “Gay Couple,” and “Lesbian Couple”), it either abstained from generating images or produced representations more reminiscent of artistic drawings than realistic depictions.
- Firefly: This model was observed to generate the highest quality images, showcasing the least amount of bias when prompted without modifications. For instance, when analyzing the output of each model in generating images of U.S. Politicians (referenced in Fig. 1), Firefly displayed a commendable diversity in ethnicity and a balanced gender representation. However, it exhibited a strict refusal to generate content for topics even mildly sensitive, such as “Tanning Man.”

In the investigation of our combined prompt experiment, results were consolidated in Tab. 6, focusing on the alteration in distribution from the base prompt when modified (denoted as “Change of Distribution (Yes/No)”) and the impact of prompt sequencing on outcomes (“Order Matters (Yes/No)”). This analysis substantiated our hypothesis that incorporating a modifier within the prompt could significantly mitigate the biases observed in base prompt scenarios. For ease of comprehensive visualization, the applicability of each model to the test scenarios is denoted using abbreviations and color codes.
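The two axes of Tab. 6 were assigned by the authors' inspection of the generated images. As a hedged illustration only, the sketch below shows one way such a classification could be automated by comparing attribute-label counts with a chi-square test; the test, the significance threshold, and the example counts are assumptions, not the procedure used in this study.

```python
# One possible (assumed) operationalization of the Tab. 6 taxonomy: a triplet
# shows a "change of distribution" if base-prompt and modified-prompt label
# counts differ significantly, and "order matters" if the two prompt orderings
# differ significantly from each other.
from scipy.stats import chi2_contingency

def differs(counts_a: list[int], counts_b: list[int], alpha: float = 0.05) -> bool:
    """Chi-square test on label counts (e.g., [Asian, Black, Other]) from two conditions."""
    _, p_value, _, _ = chi2_contingency([counts_a, counts_b])
    return p_value < alpha

base_counts     = [40, 5, 5]    # hypothetical counts for the base prompt
mod_base_counts = [12, 30, 8]   # hypothetical counts for 'Modifier + Base'
base_mod_counts = [15, 28, 7]   # hypothetical counts for 'Base + Modifier'

change_of_distribution = differs(base_counts, mod_base_counts) or differs(base_counts, base_mod_counts)
order_matters = differs(mod_base_counts, base_mod_counts)
print(change_of_distribution, order_matters)
```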
In examining images generated from prompts specifying ‘Asian,’ we observed a predominance of East Asian imagery, sidelining the vast diversity within Asia, such as South Asian representations. This trend is evident in experiments like ‘Asian US Politician,’ highlighted in Fig. 1. Notably, Firefly exhibited a broader interpretation of ‘Asian,’ attempting to diversify beyond East Asian characteristics. This disparity underscores the necessity for AI models to encompass a more comprehensive understanding of Asian diversity, reflecting the true range of cultures and identities within the continent.
For instance, the experiment employing the base prompt “US Politician” with the modifier “Asian” indicated a shift in the distribution of generated images across all three models. Interestingly, the sequence of the prompt notably influenced the results with DallE, whereas such an effect was not pronounced in the other models. Specifically, as depicted in Fig. 1, both Stable Diffusion and Firefly maintained a consistent proportion of images depicting Asians, irrespective of the prompt sequence. Conversely, DallE demonstrated a higher propensity to generate images of individuals from diverse ethnic backgrounds when the modifier “Asian” preceded the base prompt. This phenomenon, however, was relatively rare, with DallE’s results being affected by prompt ordering in merely three out of twelve tested scenarios, including that involving US Politicians, contrasting with the more frequent influence observed in the other models.

A notable observation about DallE pertains to scenarios classified under “Change of Distribution (No),” such as (Person in Africa, Asian, DE) and (Person in East Asia, Hispanic, DE). These cases aimed to modify the distribution to favor images matching the modifier, thereby addressing the bias inherent in the base prompt. Despite this intent, the desired shift towards images corresponding to the modifiers was not achieved significantly in these instances, with DallE producing a substantial number of ambiguous images. Despite efforts to categorize these images, many were found too complex for clear ethnic identification. Yet, when generating images independently rather than in a grid, the model’s outputs, though detailed, were more discernible in terms of racial representation. Fig. 3 shows an example of a generated image by DallE. In contrast, the other models favored simplicity, focusing on a singular, easily identifiable subject against a symbolic background, thereby aligning more closely with the expectations set by the base and modifier prompts. Given these observations, incorporating sample images for this analysis might be beneficial for clarity.
5.2 Quantitative Analysis
In this quantitative analysis, we scrutinized the standard deviation across two prompt configurations, ‘Base+Modifier’ and ‘Modifier+Base’, for three distinct models: Stable Diffusion (SD), DALL·E (DE), and Firefly (FF). With modified prompts, which are designed to specify and limit the distribution, the expected outcomes are predetermined.
Consider the example prompt “A Female American Football Player,” where we anticipate that generated imagery conforming to the requested prompt will prominently feature a female figure, equating the expected outcome to a 100%/0% (F/M) gender distribution. Similar logic applies to our other prompt+modifier pairs and their expected outcomes. Utilizing our dataset, we calculated variances for each category and then computed an average variance across the 16 base prompts, as shown in Tab. 5. This process yielded the average standard deviations for these prompts (range: 0 to 1), which are summarized in Tab. 7. In this table, lower values indicate closer conformity with the expected distribution.
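The description above leaves the exact formula implicit. As one plausible reading, the sketch below treats each modified prompt's expected distribution as placing all mass on the requested attribute, measures the squared shortfall of the observed proportion, averages across prompts, and reports the square root. The function names and example counts are illustrative assumptions, not the paper's exact computation.

```python
import math

def prompt_variance(observed: dict[str, int], expected_label: str) -> float:
    """Squared deviation of the observed proportion of `expected_label` from the expected 1.0."""
    total = sum(observed.values())
    p_observed = observed.get(expected_label, 0) / total
    return (1.0 - p_observed) ** 2

def average_std(per_prompt: list[tuple[dict[str, int], str]]) -> float:
    """Average the per-prompt variances and report the square root (range 0 to 1)."""
    variances = [prompt_variance(obs, label) for obs, label in per_prompt]
    return math.sqrt(sum(variances) / len(variances))

# Hypothetical label counts for two 'Base + Modifier' prompts on one model:
observations = [
    ({"female": 38, "male": 12}, "female"),  # e.g., "American football player" + "Woman"
    ({"asian": 20, "other": 30}, "asian"),   # e.g., "US Politician" + "Asian"
]
print(round(average_std(observations), 4))   # -> 0.4569
```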
Determining expected values for base prompts presents a significant challenge, as illustrated by the example prompt “Pastor.” Specifically, the ambiguity in expected gender distribution for this prompt highlights the complexity of establishing a clear expectation. Three potential scenarios emerge: a gender parity assumption (50:50), alignment with the actual demographic distribution of males and females (50.4:49.6) [29], or adherence to the real-world ratio of males to females within the pastoral occupation (80:20) [9]. This variance underscores the difficulty in defining a singular expectation for gender representation. Extending this dilemma to all 16 prompts, it becomes evident that establishing universally applicable expected values is fraught with challenges, reflecting the broader difficulty in applying a consistent expectation framework across diverse contexts.
Tab. 7: Average standard deviation from the expected distribution (lower values indicate closer conformity).

| | SD | DE | FF |
|---|---|---|---|
| B+M | 0.6498 | 0.5067 | 0.5602 |
| M+B | 0.2597 | 0.4129 | 0.3577 |
Our analysis revealed that the ‘Modifier+Base’ configuration generally yielded more consistent results than the ‘Base+Modifier’ approach. We posit this could be due to the modifier’s enhanced emphasis when positioned at the start of the prompt. Notably, the variance among standard deviations was minimal for DALL·E, suggesting this model’s resilience to prompt order. However, DALL·E’s performance dipped notably with the Modifier+Base setup, attributed to ChatGPT4’s expansion of the prompts, which sometimes resulted in a focus on background elements over the main subject, leading to ambiguous outcomes. This phenomenon, as discussed in Section 4, could also be linked to generating image grids rather than individual images per prompt when using ChatGPT4.
6 Discussion
Bias is an inherent characteristic of models trained on real-world data, which inevitably contain biases. Our approach—utilizing modifiers as a form of prompt engineering to influence bias distribution—represents an unexplored method of bias adjustment within the field. This preliminary strategy did not yield consistently effective results, indicating that simplistic applications of modifiers are insufficient. This finding points to the necessity for a more nuanced approach, potentially involving a larger-scale, subjective analysis to tailor bias distribution when the intent is to generate data points from the extremes of a distribution.
Reflecting on the challenges faced by the Gemini case [8, 10], we recognize that any attempt to correct biases in models is fraught with complexity. Gemini’s failures—oversights in presenting a diverse range of individuals and an overly cautious response to benign prompts—exemplify the difficulties in achieving balance [14]. The question of whether to align model outputs with geographical or demographic realities remains open. More concerning, however, is the presence of unacknowledged biases within models, since biases that go unrecognized cannot be addressed. This underscores the imperative for thorough, in-depth studies of bias correction, an area ripe for future research.
6.1 Limitations
In our investigation, a limited number of images were produced and analyzed. The images were generated through the ChatGPT interface rather than directly using DallE’s API, and a comprehensive evaluation of the biases present in the resulting images is detailed in the supplemental section, where raw data is also available for independent verification or further study.
The assessment of model-generated images was carried out solely by the authors, constrained by resources and foregoing external human studies. To maintain analytical rigor, the authors collectively verified each evaluation to reach a unanimous agreement. Our investigation rigorously evaluated quantitative metrics such as Image Text Alignment [31, 30] and Image Quality [22, 23] and determined that they do not adequately measure the specific tasks we are examining. Additionally, we attempted to apply the DallEval [7] framework to our generated data, but the visual reasoning metrics utilized by DallEval were not appropriate for our analysis. The framework’s focus on visual reasoning, along with its qualitative approach to assessing skintone and gender biases, failed to provide the necessary quantitative grounding to effectively gauge bias in our study.
6.2 Opportunities and Insights
The study demonstrates that the Large Language Model (LLM) frontend, as utilized in this context, exhibits robustness against manipulation attempts through prompt engineering, irrespective of prompt ordering. This stability suggests that the LLM frontend effectively mitigates the risk of generation failures that might arise from the sequence of the prompt components.
Furthermore, we establish a framework for subsequent research focused on refining models to address and control rare yet impactful biases that risk distorting data representation. This work highlights a crucial discourse on the reconciliation of biases—whether models should be aligned with an idealized vision of inclusivity or adhere to factual representations drawn from demographic and historical contexts. This remains an open question, signaling a collective endeavor for the research community to establish consensus on strategies for effective bias resolution.
7 Conclusion
This study explores biases in text-to-image models, revealing how societal biases are embedded and can be mitigated within these AI systems. Our characterization experiments showed that while Stable Diffusion and DallE often reproduce biases from their training data, Firefly shows the potential for less biased outputs, pointing to differences in data handling and model design. Meanwhile, our study of prompt modification highlights the uneven success of using modifiers for bias adjustment and the importance of prompt structure in shaping outputs, demonstrating that direct approaches to prompt engineering are not sufficient to reliably overcome intrinsic model biases in all cases.
The observed complexity in model responses to even these relatively straightforward adjustments in stimuli underscores the ethical imperative for AI developers to balance innovation with sensitivity, advocating for transparency and inclusivity in AI development to prevent the reinforcement of societal inequalities. This work introduces a taxonomy for categorizing model robustness to prompt modification and a quantitative, expectation-based metric for conformity with supplied prompt modifiers that can be utilized by future work for similar cross-comparative studies. Both the limitations and opportunities highlighted by this research point to the necessity for ongoing efforts to understand and correct biases in AI, suggesting future exploration into more effective bias-controlling strategies and diverse AI development approaches.
References
- Adobe Systems Incorporated [2023] Adobe Systems Incorporated. Adobe firefly: Generative ai for creative processes. https://firefly.adobe.com, 2023. Accessed: 2024-03-24.
- Avrahami et al. [2023] Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman, Devi Parikh, Dani Lischinski, Ohad Fried, and Xi Yin. Spatext: Spatio-textual representation for controllable image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18370–18380, 2023.
- Barocas et al. [2019] Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and machine learning. fairmlbook.org, 2019.
- Betker et al. [2023] Jason Betker, Greg Goh, Li Jing, Tim Brooks, Jianyu Wang, Liang Li, Lucy Ouyang, Jie Zhuang, Jason Lee, Yuxuan Guo, Waseem Manassra, Prafulla Dhariwal, Chenxi Chu, and Yong Jiao. Improving image generation with better captions. OpenAI Blog, 2023.
- Brooks et al. [2023] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
- Buolamwini and Gebru [2018] Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency, pages 77–91. PMLR, 2018.
- Cho et al. [2023] Jaemin Cho, Abhay Zala, and Mohit Bansal. Dall-eval: Probing the reasoning skills and social biases of text-to-image generation models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3043–3054, 2023.
- CNBC [2024] CNBC. Google pauses gemini ai image generator after it created inaccurate historical pictures. https://www.cnbc.com/2024/02/22/google-pauses-gemini-ai-image-generator-after-inaccuracies.html, 2024. Accessed: 2024-03-29.
- CNN [2023] CNN. Women and church leadership in the united states. CNN, 2023. Available online.
- CNN [2024] CNN. Google halts ai tool’s ability to produce images of people after backlash. https://www.cnn.com/2024/02/22/tech/google-gemini-ai-image-generator/index.html, 2024. Accessed: 2024-03-29.
- Couairon et al. [2023] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. Diffedit: Diffusion-based semantic image editing with mask guidance. In ICLR 2023 (Eleventh International Conference on Learning Representations), 2023.
- Dong et al. [2024] Xiangjue Dong, Yibo Wang, Philip S Yu, and James Caverlee. Disclosure and mitigation of gender bias in llms. arXiv preprint arXiv:2402.11190, 2024.
- Friedrich et al. [2023] Felix Friedrich, Manuel Brack, Lukas Struppek, Dominik Hintersdorf, Patrick Schramowski, Sasha Luccioni, and Kristian Kersting. Fair diffusion: Instructing text-to-image generation models on fairness. arXiv preprint arXiv:2302.10893, 2023.
- Google [2024] Google. Gemini image generation got it wrong. we’ll do better. https://blog.google/products/gemini/gemini-image-generation-issue/, 2024. Accessed: 2024-03-29.
- Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- Li et al. [2023] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22511–22521, 2023.
- Mehrabi et al. [2021] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning. ACM computing surveys (CSUR), 54(6):1–35, 2021.
- Mitchell et al. [2019] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the conference on fairness, accountability, and transparency, pages 220–229, 2019.
- Naik and Nushi [2023] Ranjita Naik and Besmira Nushi. Social biases through the text-to-image generation lens. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pages 786–808, 2023.
- Noble [2018] Safiya Umoja Noble. Algorithms of oppression: How search engines reinforce racism. In Algorithms of oppression. New York university press, 2018.
- Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
- Salimans et al. [2016a] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016a.
- Salimans et al. [2016b] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016b.
- Schuhmann et al. [2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text models. In Advances in Neural Information Processing Systems, pages 25278–25294. Curran Associates, Inc., 2022.
- Seshadri et al. [2023] Preethi Seshadri, Sameer Singh, and Yanai Elazar. The bias amplification paradox in text-to-image generation. arXiv preprint arXiv:2308.00755, 2023.
- Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
- Song et al. [2021] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations, 2021.
- Struppek et al. [2023] Lukas Struppek, Dom Hintersdorf, Felix Friedrich, Patrick Schramowski, Kristian Kersting, et al. Exploiting cultural biases via homoglyphs in text-to-image synthesis. Journal of Artificial Intelligence Research, 78:1017–1068, 2023.
- United Nations, Department of Economic and Social Affairs, Population Division [2022] United Nations, Department of Economic and Social Affairs, Population Division. World population prospects 2022. https://population.un.org/wpp/, 2022. Accessed: 2024-04-03.
- Xu et al. [2018a] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1316–1324, 2018a.
- Xu et al. [2018b] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1316–1324, 2018b.
- Yang et al. [2023] Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Linjie Li, Kevin Lin, Chenfei Wu, Nan Duan, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Reco: Region-controlled text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14246–14255, 2023.
- Yeh et al. [2023] Kai-Ching Yeh, Jou-An Chi, Da-Chen Lian, and Shu-Kai Hsieh. Evaluating interfaced llm bias. In Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023), pages 292–299, 2023.
Supplementary Material
Appendix A Comprehensive List
In the Appendix, we provide a detailed list of bias analyses as outlined in Section 4. For clarity throughout this section, we refer to Stable Diffusion as Model 1, ChatGPT4/DallE as Model 2, and Firefly as Model 3. It is important to note that certain analyses, such as those related to the professions of Farmer, Nurse, and Engineer, have been excluded from the main body of the text. This decision was made after considering the overlap in the occupation category with the analyses of Housekeeper, which were deemed to sufficiently represent the category without redundancy. This choice reflects our aim to focus on the core aspects of bias within the models. Despite this, we acknowledge that the scope for a more exhaustive examination exists and that further detailed studies could build upon the foundational analyses presented here, potentially uncovering additional layers of bias inherent in AI-generated content.
Appendix B Image Generated for Different Model
In compliance with copyright regulations and to ensure that the rights of the model creators are respected, the images generated from the three distinct models discussed in this study will not be directly included in the paper. It is important to note that the copyright of the content generated by Stable Diffusion, ChatGPT4/DallE, and Firefly remains with the respective companies or authors who developed these models. This approach allows us to share our findings while adhering to copyright laws and honoring the intellectual property rights of the technologies utilized in our research. The Google Drive link will be provided upon request, ensuring that those interested in examining the visual data can do so under the condition that they acknowledge and respect the copyright stipulations of the involved parties.
Appendix C Prompt for Image Generation Model
For image generation with Stable Diffusion and Firefly, we used prompts of the form “A photo of K,” where K is a prompt shown in the second column of the table on the following pages. For ChatGPT/DallE, we used the prompt “Generate an image of K” and, due to the operational constraints of ChatGPT mentioned in Sec. 4, followed the initial generation with the consecutive prompt “generate K with 3 rows and 3 columns.” This did not reliably yield nine photos; instead, it produced grids with a varying number of panels. We therefore aimed for upwards of 30 images per base prompt to support a meaningful analysis.
Appendix D Evaluation of Generated Images
The evaluation of images generated by the models was conducted exclusively by the authors, without the incorporation of external human studies mentioned in Sec. 6. This approach was necessitated by resource constraints, but to ensure reliability and objectivity in our analysis, all evaluations were cross-checked among the authors to achieve consensus. Recognizing the limitations inherent in this methodology, we acknowledge the value of large-scale, human-centric subjective studies for a more nuanced and comprehensive assessment of bias within AI-generated content. As such, the pursuit of extensive human studies to evaluate bias more accurately is identified as a vital avenue for future research.
![Uncaptioned image](https://cdn.awesomepapers.org/papers/fef8240f-526a-442b-ad20-8ff4a3bb9b41/Supplement1.png)
![Uncaptioned image](https://cdn.awesomepapers.org/papers/fef8240f-526a-442b-ad20-8ff4a3bb9b41/Supplement2.png)
![Uncaptioned image](https://cdn.awesomepapers.org/papers/fef8240f-526a-442b-ad20-8ff4a3bb9b41/Supplement3.png)