ConstitutionMaker: Interactively Critiquing Large Language Models by Converting Feedback into Principles
Abstract.
Large language model (LLM) prompting is a promising new approach for users to create and customize their own chatbots. However, current methods for steering a chatbot’s outputs, such as prompt engineering and fine-tuning, do not support users in converting their natural feedback on the model’s outputs to changes in the prompt or model. In this work, we explore how to enable users to interactively refine model outputs through their feedback, by helping them convert their feedback into a set of principles (i.e. a constitution) that dictate the model’s behavior. From a formative study, we (1) found that users needed support converting their feedback into principles for the chatbot and (2) classified the different principle types desired by users. Inspired by these findings, we developed ConstitutionMaker, an interactive tool for converting user feedback into principles, to steer LLM-based chatbots. With ConstitutionMaker, users can provide either positive or negative feedback in natural language, select auto-generated feedback, or rewrite the chatbot’s response; each mode of feedback automatically generates a principle that is inserted into the chatbot’s prompt. In a user study with 14 participants, we compare ConstitutionMaker to an ablated version, where users write their own principles. With ConstitutionMaker, participants felt that their principles could better guide the chatbot, that they could more easily convert their feedback into principles, and that they could write principles more efficiently, with less mental demand. ConstitutionMaker helped users identify ways to improve the chatbot, formulate their intuitive responses to the model into feedback, and convert this feedback into specific and clear principles. Together, these findings inform future tools that support the interactive critiquing of LLM outputs.

1. Introduction

Large language models (LLMs) can be applied to a wide range of problems, ranging from creative writing assistance (Yuan et al., 2022; Gero et al., 2022; Petridis et al., 2023; Wang et al., 2023a) to code synthesis (Liu et al., 2023a; Jiang et al., 2022b, 2021). Users currently customize these models to specific tasks through strategies such as prompt engineering (Brown et al., 2020), parameter-efficient tuning (Lester et al., 2021), and fine-tuning (Howard and Ruder, 2018).
In addition to these common methods for customizing LLMs, recent work has shown that users would also like to directly steer these models with natural language feedback (Figure 2A). More specifically, some users want to be able to critique the model’s outputs to specify how they should be different (Bubeck et al., 2023). We call this customization strategy interactive critique.
When interacting with a chatbot like ChatGPT (https://chat.openai.com/) (Radford et al., 2019) or Bard (https://bard.google.com), interactive critique will often alter the chatbot’s subsequent responses to conform to the critique. However, these changes are not persistent: users must repeat these instructions during each new interaction with the model. Users must also be aware that they can actually alter the model’s behavior in this way, and must formulate their critique in a way that is likely to lead to changes in the model’s future responses. Given the potential value of this mode of customization, there is an opportunity to provide first-class support for empowering users to customize LLMs via natural language critique.
In the context of model customization, Constitutional AI (Bai et al., 2022) offers a specific customization strategy involving natural language principles. A principle can be thought of as a rule that the language model should follow, such as, “Do not create harmful, sexist, or racist content”. Given a set of principles, a Constitutional AI system will 1) rewrite model responses that violate principles and 2) fine-tune the model with the rewritten responses. Returning to the notion of interactive critique, one can imagine deriving new or refined Constitutional AI principles from users’ critiques. These derived principles could then be used to alter an LLM’s prompt (Figure 2B) or to generate new training data, as in the original Constitutional AI work.
While this recent work has shown that principles can be an explainable and effective strategy for customizing an LLM, little is known about the human process of writing these principles from one's feedback on a model's outputs. From a formative study, we discovered that there are many cognitive challenges involved in converting critiques into principles. To address these challenges, we present ConstitutionMaker, an interactive critique system that transforms users’ model critiques into principles that refine the model’s behavior. ConstitutionMaker generates three candidate responses at each conversational turn. In addition to these three candidate responses, ConstitutionMaker provides three principle-elicitation features: 1) kudos, where users can provide positive feedback for a response, 2) critique, where users can provide negative feedback for a response, and 3) rewrite, where users can rewrite a given response. From this feedback, ConstitutionMaker infers a principle, which is incorporated into the chatbot’s prompt.
To evaluate how well ConstitutionMaker helps users write principles, we conducted a within-subjects user study with 14 industry professionals familiar with prompting. Participants used ConstitutionMaker and an ablated version that lacked the multiple candidate responses and the principle-elicitation features. In both cases, their goal was to write principles to customize two chatbots. From the study, we found that the two different versions yielded very different workflows. With the ablated version, participants only wrote principles when the bot deviated quite a bit from their expectations, resulting in significantly fewer principles being written, in total. In contrast, in the ConstitutionMaker condition, participants engaged in a workflow where they scanned the multiple candidate responses and gave kudos to their favorite response, leading to more principles overall. These different workflows also yielded condition-specific challenges in writing principles. With the ablated version, users would often under-specify principles; whereas, with ConstitutionMaker, users sometimes overspecified their principles, though this occurred less often. Finally, both conditions would sometimes lead to an issue where two or more of the principles were in conflict with one another.
Overall, with ConstitutionMaker, participants felt that their principles could better guide the chatbot, that they could more easily convert their feedback into principles, and that they could write principles more efficiently, with less mental demand. ConstitutionMaker also supported their thought processes as they wrote principles by helping participants 1) recognize ways responses could be better through the multiple candidate responses, 2) convert their intuition on why they liked or disliked a response into verbal feedback, and 3) phrase this feedback as a specific principle.
Collectively, this work makes the following contributions:
• A classification of the kinds of principles participants want to write to steer chatbot behavior.
• The design of ConstitutionMaker, an interactive tool for converting user feedback into principles to steer chatbot behavior. ConstitutionMaker introduces three novel principle-elicitation features: kudos, critique, and rewrite, each of which generates a principle that is inserted into the chatbot’s prompt.
• Findings from a 14-participant user study, where participants felt that ConstitutionMaker enabled them to 1) write principles that better guide the chatbot, 2) convert their feedback into principles more easily, and 3) write principles more efficiently, with less mental demand.
• A description of how ConstitutionMaker supported participants’ thought processes, including helping them identify ways to improve responses, convert their intuition into natural language feedback, and phrase their feedback as specific principles, as well as how the different workflows enabled by the two systems led to different challenges in writing principles and revealed the limits of principles.
Together, these findings inform future tools for interactively refining LLM outputs via interactive critique.
2. Related Work
2.1. Designing Chatbot Behavior
There are a few methods of creating and customizing chatbots. Earlier chatbots employed rule-based approaches to construct a dialogue flow (Ramesh et al., 2017; Wu et al., 2017; Jia, 2009), where the user’s input would be matched to a pre-canned response written by the chatbot designer. Later on, supervised machine learning approaches (Xu et al., 2017; Zhou et al., 2018) became popular, where chatbot designers constructed datasets consisting of ideal conversational flows. Both of these approaches, while fairly effective, require a significant amount of time and labor, either to construct an expansive rule set that determines the chatbot’s behavior or to build a large dataset of ideal conversational flows.
More recently, large language model prompting has shown promise for enabling easier chatbot design. Large, pre-trained models like ChatGPT (Radford et al., 2019) can hold sophisticated conversations out of the box, and these models are already being used to create custom chatbots in a number of domains, including medicine (Lee et al., 2023). There are a few ways of customizing an LLM-based chatbot, including prompt engineering and fine-tuning. Prompt engineering involves providing instructions or conversational examples in the prompt to steer the chatbot’s behavior (Brown et al., 2020). To more robustly steer the model, users can also fine-tune (Lester et al., 2021) the LLM with a larger set of conversational examples. Recent work has shown that users would also like to steer LLMs by interactively critiquing their outputs; during the conversation, they refine the model’s outputs by providing follow-up instructions and feedback (Bubeck et al., 2023). In this work, we explore how to support users with this type of model steering: naturally customizing the LLM’s behavior through feedback as they interact with it.
A new approach to steering LLM-based chatbots (and LLMs in general), called Constitutional AI (Bai et al., 2022), involves writing natural language principles to direct the model. These principles are essentially rules, such as: “Do not create harmful, sexist, or racist content”. Given a set of principles, the Constitutional AI approach involves rewriting LLM responses that violate these principles, and then using these tuples of original and rewritten responses to fine-tune the LLM. Writing principles could be a viable and intuitive way for users to steer LLM-based chatbot behavior, with the added benefit of being able to use these principles later to fine-tune the model. However, relatively little is known about the kinds of principles users want to write, and how we might support users in converting their natural feedback on the model’s outputs into principles. In this work, we evaluate three principle-elicitation features that help users convert their feedback into principles to steer chatbot behavior.
2.2. Helping Users Design LLM Prompts
While LLM prompting has democratized and dramatically sped up AI prototyping (Jiang et al., 2022a), it is still a difficult and ambiguous process for users (Zamfirescu-Pereira et al., 2023b, a; Reynolds and McDonell, 2021); they have challenges with finding the right phrasing for a prompt, choosing good demonstrative examples, experimenting with different parameters, and evaluating how well their prompt is performing (Jiang et al., 2022a). Accordingly, a number of tools have been developed to support prompt writing along these lines.
To help users find a better phrasing for their prompt, automatic approaches have been developed that search the LLM’s training data for a more effective phrasing (Pryzant et al., 2023; Shin et al., 2020). In the text-to-image domain, researchers have employed LLMs to generate better prompt phrasings or keywords for generative image models (Liu et al., 2022, 2023b; Brade et al., 2023; Wang et al., 2023b). Next, to support users in sourcing good examples for their prompt, ScatterShot (Wu et al., 2023) suggests underrepresented data to include in the prompt from a dataset, and enables users to iteratively evaluate their prompt with these examples. Similar systems help users source diverse and representative examples via techniques like clustering (Chang et al., 2021) or graph-based search (Su et al., 2022). To support easy exploration of prompt parameters, Cells, Generators, and Lenses (Kim et al., 2023) enables users to flexibly test different inputs with instantiations of models with different parameters. In addition to improving the performance of a single prompt, recent work has also investigated the benefits of chaining multiple prompts together to improve performance on more complicated tasks (Wu et al., 2022b, a). Finally, tools like PromptIDE (Strobelt et al., 2023), PromptAid (Mishra et al., 2023), and LinguisticLens (Reif et al., 2023) support users in evaluating their prompts by visualizing either the data a prompt produces or its performance in comparison to other prompt variations.
This work explores a novel, more natural way of customizing a prompt’s behavior through interactive critique. ConstitutionMaker enables users to provide natural language feedback on a prompt’s outputs, and this feedback is converted into principles that are then incorporated back into the prompt. We illustrate the value of helping users update a prompt via their feedback, and we introduce three novel mechanisms for converting users’ natural feedback into principles for the prompt.
2.3. Interactive Model Refinement via Feedback
Finally, ConstitutionMaker is broadly related to systems that enable users to customize model outputs via limited or underspecified feedback. For example, programming-by-example tools enable users to provide input-output examples, for which the system generates a function that fits them (Chen et al., 2020; Verbruggen et al., 2021; Zhang et al., 2020). Input-output examples are inherently ambiguous, potentially mapping to multiple functions, and these systems employ a number of methods to specify and clarify the function with the user. In a similar process, ConstitutionMaker takes ambiguous natural language feedback on the model’s output and generates a more specific principle for the user to inspect and edit. Next, recommender systems (Bostandjiev et al., 2012; Petridis et al., 2022; Kunkel et al., 2017) also enable users to provide limited feedback to steer model outputs. One such system (Kunkel et al., 2017) projects movie recommendations onto a 2D plane, portions of which users can interactively raise or lower to affect a list of recommendations; in response to these changes, the system provides representative movies for each raised portion to demonstrate how it has interpreted the user’s feedback. Overall, in contrast to these systems, ConstitutionMaker leverages LLMs to enable users to provide natural language feedback and critique the model in the same way they would provide feedback to another person.
3. Formative Study
To understand how to support users with writing principles for chatbots, we conducted a one-hour formative study, where we observed eight industry professionals write principles for chatbots of their choice. These participants all had prompting experience. Two participants were designers and six were software engineers, all at a large technology company. During the workshop, participants used an early version of ConstitutionMaker, without principle-elicitation features. They spent 25 minutes writing principles for their chatbot. Afterwards, we discussed the difficulties they faced while writing principles. Finally, we collected the principles they wrote and classified them to understand the kinds of principles they wanted to write.
3.1. Design Goals
In this section, we summarize the three design goals for ConstitutionMaker that we established from the formative workshop and subsequent think-alouds.
D.1 Help users recognize ways to improve the chatbot’s responses by showing alternative chatbot responses. Today’s LLMs are quite sophisticated, and even with just a preamble describing how the bot should behave, the chatbot can hold a convincing conversation. Because of this, participants mentioned that it was sometimes hard to imagine how the chatbot’s responses could be improved. This did not mean, however, that they thought the chatbot’s response was perfect, but instead passable and without any glaring errors. Therefore, to help participants recognize better kinds of responses to steer the chatbot to, our first design goal was to provide multiple candidate responses from the chatbot at each conversational turn. This way, participants can compare them and recognize components they like more than others.
D.2 Help convert user feedback into specific principles to make principle writing easier. One piece of feedback we got from participants was that writing principles involves a difficult two-step process of first (1) articulating one’s feedback on the model’s current output, and then (2) converting this feedback into a principle for the LLM to follow. Often, one’s initial reaction to the model’s output is intuitive, and converting that intuition into a principle for the chatbot to follow can be challenging. In addition, once participants had a particular bit of feedback in mind (e.g., “I don’t like how the chatbot didn’t introduce itself”), they were unsure how to phrase their principle. However, in line with prior research (Zamfirescu-Pereira et al., 2023b, a), they found that more concrete principles that specified what should happen and when (e.g., “Introduce yourself at the start of the conversation, and state what you can help with”) generally led to better results. Thus, our second design goal was to help users go from their initial reaction to the model’s output to a specific, clearly written principle to steer the model.
D.3 Enable easier testing of principles to help users understand how well their principles are steering the chatbot’s behavior. As participants wrote more principles, they wanted ways to test these principles to make sure they worked. The early version of ConstitutionMaker only let users restart the conversation, and did not let users enable or disable principles. Users wanted to test individual principles on certain portions of the conversation, to see if the model was generating the correct content. And so, our last design goal was to enable easier testing of principles.
3.2. Principle Classification
From the formative workshop and follow-up sessions, we collected 79 principles in total and classified them to understand the kinds of principles users wanted to write. These principles correspond to a number of very different chatbots, including a show recommender, a chemistry tutor, a role-playing game manager, a travel agent, and more. We describe common types of principles below.
Principles can be either unconditional or conditional. Unconditional principles are those that apply at every conversational turn. Examples include: (1) those that define a consistent personality of the bot (e.g., “Act grumpy all the time” or “Speak informally and in the first person”), (2) those that place guardrails on the conversational content (e.g., “Don’t talk about anything but planning a vacation”), and (3) those that establish a consistent form for the bot’s responses (e.g., “Limit responses to 20 words”). Meanwhile, a conditional principle only applies when a certain condition is met. For example, “Generate an itinerary after all the information has been collected,” only applies to the conversation when some set of information has been acquired. Writing a conditional principle essentially defines a computational interaction; users establish a set of criteria that make the principle applicable to the conversation, and once that set of criteria is met, the principle is executed (e.g., an itinerary is generated).
Conditional principles can depend on the entire conversation history, the user’s latest response, or the action the bot is about to take. For example, “Generate an itinerary after all the information has been collected” depends on the entire conversation history to determine if all of the requisite information has been collected. Similarly, the following principle written for a machine learning tutor, “After verifying a user’s goal, provide help to solve their problem,” depends on the conversation history to identify if the user’s goal has been verified. Meanwhile, the principle “When the user says they had a particular cuisine the night before, recommend a different cuisine,” written for a food recommender, pertains just to the latest response by the user. Finally, the condition can depend on the action the bot is about to take, like “When providing a list of suggestions, use free text rather than bullet points,” which applies to any situation in which the bot thinks it is appropriate to make suggestions.
Conditional principles can be fulfilled in a single conversational turn or across multiple turns. For example, the principle “At the start of the conversation, introduce yourself and ask a fun question to kick off the conversation” is fulfilled in a single conversational turn, in which the bot introduces itself. Similarly, “Before recommending a restaurant, ask the user for their location” is also fulfilled in a single turn. Meanwhile, for a role-playing game (RPG) bot that guides the user through an adventure, a participant wrote the following principle: “When the user tries to do something, put up small obstacles. Don’t let them succeed on the first attempt.” This principle requires the bot to act over multiple turns before it is fulfilled (e.g., by first putting up a small obstacle and then subsequently letting the user succeed). Similarly, for a travel agent bot, a user wrote “Ask questions one-by-one to get an idea of their preferences,” which also requires multiple conversational turns prior to fulfillment.
In summary, principles can either be conditional, where they apply when a certain condition is met, or unconditional, where they apply at every conversational step. Conditional principles further break down into those that depend on the entire conversation history, the user’s last response, or the action the bot is about to take. And finally, conditional principles are either fulfilled in a single turn or multiple conversational turns.
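To make this classification concrete, a principle can be viewed as a small piece of structured data: an instruction, an optional trigger condition, the scope that condition depends on, and whether fulfillment spans multiple turns. The sketch below is ours rather than part of ConstitutionMaker; the field names and example values are illustrative only.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class ConditionScope(Enum):
    """What a conditional principle's trigger depends on."""
    CONVERSATION_HISTORY = auto()  # e.g., "after all the information has been collected"
    LATEST_USER_MESSAGE = auto()   # e.g., "when the user says they had a particular cuisine"
    PENDING_BOT_ACTION = auto()    # e.g., "when providing a list of suggestions"


@dataclass
class Principle:
    text: str                               # the natural language instruction
    conditional: bool                       # False = applies at every conversational turn
    scope: Optional[ConditionScope] = None  # only meaningful for conditional principles
    multi_turn: bool = False                # True if fulfillment spans several turns


# Examples drawn from the classification above.
examples = [
    Principle("Limit responses to 20 words.", conditional=False),
    Principle("Generate an itinerary after all the information has been collected.",
              conditional=True, scope=ConditionScope.CONVERSATION_HISTORY),
    Principle("When the user tries to do something, put up small obstacles. "
              "Don't let them succeed on the first attempt.",
              conditional=True, scope=ConditionScope.CONVERSATION_HISTORY, multi_turn=True),
]
```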
4. ConstitutionMaker
Inspired by our findings from the formative studies and workshop, we built ConstitutionMaker, an interactive web tool that supports users in converting their feedback into principles to steer a chatbot’s behavior. ConstitutionMaker enables users to define a chatbot, converse with it, and within the conversation, interactively provide feedback to steer the chatbot’s behavior.
4.1. Interface and Walk Through
To illustrate how ConstitutionMaker works, let us consider a music enthusiast, Penelope, who would like to design a chatbot, called MusicBot, that helps users learn about and find new music. She starts by entering the name of her bot and roughly describing its purpose in the “Capabilities” section of the interface (Figure 1A). She then starts a conversation with MusicBot, and after MusicBot’s introductory message, she asks to learn about punk music (Figure 1B). Fulfilling our first design goal (help users recognize ways to improve the bot’s responses), ConstitutionMaker provides three candidate responses from the bot at each conversational turn (Figure 1D) that the user can compare and provide feedback on. Penelope peruses these candidate responses, and of the three, she likes the first one, as it invites the user to continue the conversation with a question at the end. She now wants to write a principle to help ensure that the chatbot will continue to do this in future conversations.
Fulfilling D.2 (help convert user feedback into principles), ConstitutionMaker provides three principle-elicitation features to support users in converting their feedback into principles: kudos, critique, and rewrite. Since Penelope likes the response, she selects kudos underneath it (Figure 1D), which reveals a menu with three automatically generated rationales on why the response is good, as well as a text field for Penelope to enter her own reason. After scanning the rationales, she selects the second, as it closely matches her own feedback, and a principle is automatically generated from that rationale (Figure 1C). The critique (Figure 1F) and rewrite (Figure 1G) principle-elicitation features work similarly: Penelope can select a negative rationale or rewrite the model’s response, respectively, to generate a principle. She then inspects the generated principle, decides that it captures her intention well, and decides not to edit it.
Fulfilling D.3 (enable easier testing of principles), she can then test whether the chatbot is following her principle by rewinding the conversation (Figure 1H) to get a new set of candidate responses from the model. Ultimately, she decides to continue conversing with MusicBot, exploring different user journeys and using the principle-elicitation features to create a comprehensive set of principles.
5. Implementation
ConstitutionMaker is a web application and utilizes an LLM (anonymized for peer review) that is promptable in the same way as GPT-3 (https://openai.com/api/) or PaLM (https://developers.generativeai.google/). In the following section, we go through the implementation of ConstitutionMaker’s key features.

5.1. Facilitating the Conversation
To generate the chatbot’s response, ConstitutionMaker builds a dialogue prompt (Figure 3A) behind the scenes. The dialogue prompt consists of (1) a description of the bot’s capabilities, entered by the user (Figure 1A), (2) the current set of principles, and (3) the conversation history, ending with the user’s latest input. The prompt then generates the bot’s next response, for which we choose the top-3 completions outputted by the LLM to display to users (Figure 3B). When the conversation is restarted or rewound, the conversation history within the dialogue prompt is modified; in the case of restarting, the entire history is deleted, whereas for rewinding, everything after the rewind point is deleted. And finally, if the conversation gets too long for the prompt context window, we remove the oldest conversational turns until it fits.
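As a rough illustration of this assembly and truncation logic, the sketch below builds the dialogue prompt from the capabilities description, the current principles, and the conversation history, dropping the oldest turns when the prompt grows too long. The function and the commented `llm.generate` call are hypothetical stand-ins rather than ConstitutionMaker's actual code, and the character budget is a simplification of a real token limit.

```python
def build_dialogue_prompt(capabilities, principles, history, max_chars=6000):
    """Assemble the dialogue prompt: capabilities, principles, then the conversation.

    `history` is a list of (speaker, text) pairs ending with the user's latest input.
    If the prompt exceeds the budget, the oldest conversational turns are dropped first.
    """
    header = (
        f"You are a chatbot with the following capabilities:\n{capabilities}\n\n"
        "Follow these principles:\n"
        + "\n".join(f"- {p}" for p in principles)
        + "\n\nConversation so far:\n"
    )
    turns = [f"{speaker}: {text}" for speaker, text in history]
    while turns and len(header) + len("\n".join(turns)) > max_chars:
        turns.pop(0)  # remove the oldest turn until the prompt fits the context window
    return header + "\n".join(turns) + "\nBot:"


# Three candidate responses would then be sampled from the LLM, e.g. (hypothetical API):
# candidates = llm.generate(build_dialogue_prompt(capabilities, principles, history),
#                           num_candidates=3)
```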
5.2. Three Principle Elicitation Features
All three principle-elicitation features output a principle that is then incorporated back into the dialogue prompt (Figure 3A) to influence future conversational turns. Giving kudos to and critiquing a bot’s response follow a similar process. For both, the selected bot output is fed into a few-shot prompt that generates rationales, either positive (Figure 3C) or negative (Figure 3D). The user’s selected rationale (or their own written rationale) is then sent to a few-shot prompt that converts this rationale into a principle (Figure 3F and 3G). This few-shot prompt leverages the conversation history to create a specific, conditional principle. For example, for MusicBot, if the critique is “The bot did not ask questions about the user’s preferences,” a specific, conditional principle might be “Prior to giving a music recommendation, ask the user what genres or artists they currently listen to.” Next, for critiques, after the principle is inserted into the dialogue prompt, new outputs are generated to show to the user (Figure 3G). Finally, for rewriting the bot’s response, we leverage a chain-of-thought (Wei et al., 2022) style prompt that first generates a “thought,” which reasons about how the original and rewritten outputs differ from each other, and then generates a specific principle based on that reasoning. Constructing the prompt with a “thought” portion led to principles that captured the difference between the two outputs better than our earlier versions without it.
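The sketch below shows the general shape of these three prompts. The wording is illustrative and the few-shot examples are elided; these are not the exact prompts used by ConstitutionMaker.

```python
def rationale_prompt(bot_response, polarity):
    """Few-shot-style prompt (examples elided) that proposes rationales for kudos or critique."""
    label = "good" if polarity == "kudos" else "bad"
    return (
        f"The chatbot replied:\n{bot_response}\n\n"
        f"List three distinct reasons why this response is {label}:"
    )


def principle_from_rationale_prompt(rationale, history):
    """Convert a selected or user-written rationale into a specific, conditional principle."""
    return (
        f"Conversation so far:\n{history}\n\n"
        f"Feedback on the bot's last response: {rationale}\n\n"
        "Rewrite this feedback as one specific principle stating what the bot should do "
        "and under what condition:"
    )


def principle_from_rewrite_prompt(original, rewritten):
    """Chain-of-thought-style prompt: reason about the difference, then state a principle."""
    return (
        f"Original bot response:\n{original}\n\n"
        f"User's rewritten response:\n{rewritten}\n\n"
        "Thought: describe how the rewritten response differs from the original.\n"
        "Principle: based on that difference, state one specific principle the bot should follow."
    )
```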
6. User Study
To understand (1) if the principle elicitation features help users write principles to steer LLM outputs and (2) what other kinds of feedback they wanted to give, we conducted a 14-participant within-subjects user study, comparing ConstitutionMaker to an ablated version without the principle elicitation features. This ablated version still offered users the ability to rewind parts of the conversation, but participants could only see one chatbot output at a time and had to write their own principles.
Table 1. Post-study questionnaire statements.

| Measure | Statement (7-point Likert scale) |
|---|---|
| Effectively Guide | Q1. With {Tool A/B}, I feel like I was able to write rules that can effectively guide the bot to produce my desired outcomes. |
| Diversity | Q2. With {Tool A/B}, I feel like I can think of more diverse rules that can guide the bot in a number of different ways and situations. |
| Ease | Q3. With {Tool A/B}, I felt like it was easy to convert my thoughts and feedback on the bot’s behavior into rules for the bot to follow. |
| Efficiency | Q4. With {Tool A/B}, I felt like I could quickly and efficiently write rules for the bot. |
| Mental Demand | Q5. With {Tool A/B}, I had to work very hard (mentally) to think of and write rules. |
6.1. Procedure
The overall outline of this study is as follows: (1) Participants spent 40 minutes writing principles for two separate chatbots, one with ConstitutionMaker (20 minutes) and the other with the baseline version (20 minutes), while thinking aloud. Condition order was counterbalanced, with chatbot assignment per condition also balanced. (2) After writing principles for both chatbots, participants completed a post-study questionnaire, which compared the process of writing principles with each tool. (3) Finally, in a semi-structured interview, participants described the positives and negatives of each tool and their workflow. The total time commitment of the study was 1 hour.
The two chatbots participants wrote principles for were VacationBot, an assistant that helps users plan and explore different vacation options, and FoodBot, an assistant that helps users plan their meals and figure out what to eat. These two chatbots were chosen because they support tasks that most people are experienced with, so that participants could have opinions on their outputs and write principles. For both chatbots, participants were given the name and capabilities (Figure 1A), so as to focus predominantly on principle writing. Also, we balanced which chatbot went with each condition, so half of the participants used ConstitutionMaker to write principles for VacationBot, and the other half used the baseline for VacationBot. Finally, prior to using each version, participants watched a short video showing that respective tool’s features.
To situate the task, participants were asked to pretend to be a chatbot designer and that they were writing principles to dictate the chatbot’s behavior so that it performs better for users. We wanted to observe their process for writing principles and see if the tools impacted how many principles they could write, so we encouraged participants to write at least 10 principles for each chatbot, to give them a concrete goal and to motivate them to write more principles. However, we emphasized that this was only to encourage them to write principles and that they should only write a principle if they thought it would be useful to future users.
6.2. Measurements and Analysis
6.2.1. Questionnaire
We wanted to understand if and how well ConstitutionMaker’s principle-elicitation features help users write principles. Our questionnaire (Table 1) probes several aspects of principle writing, including participants’ perception of (1) how well the output principles effectively guide the bot, (2) the diversity of the output principles, (3) how easy it was to convert their feedback into principles, (4) the efficiency of their principle writing process, and (5) the requisite mental demand (Hart and Staveland, 1988) for writing principles with each tool. To compare the two conditions, we conducted paired sample Wilcoxon tests with full Bonferroni correction, since the study was within-subjects and the questionnaire data was ordinal.
6.2.2. Feature Usage Metrics
To shed further light on which tool helped participants write principles more, we recorded the number of principles written for each condition. Moreover, to understand which of the principle-elicitation features was most helpful, we recorded how often each was used during the experimental (full ConstitutionMaker) condition. To compare the average number of principles collected across the two conditions, we conducted a paired t-test.
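For reference, the sketch below shows how these two comparisons could be run with SciPy, assuming each participant's ratings and principle counts are stored as paired lists; the function names are ours, and the Bonferroni correction simply multiplies each p-value by the number of questionnaire measures.

```python
from scipy import stats

N_MEASURES = 5  # Effectively Guide, Diversity, Ease, Efficiency, Mental Demand


def compare_likert(cm_ratings, baseline_ratings):
    """Paired-sample Wilcoxon test on one measure, with full Bonferroni correction."""
    statistic, p_value = stats.wilcoxon(cm_ratings, baseline_ratings)
    return statistic, min(p_value * N_MEASURES, 1.0)


def compare_principle_counts(cm_counts, baseline_counts):
    """Paired t-test on the number of principles each participant wrote per condition."""
    return stats.ttest_rel(cm_counts, baseline_counts)
```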
6.3. Participants
We recruited 14 industry professionals at a large technology company (average age = 32, 6 female and 8 male) via an email call for participation and word of mouth. These industry professionals included UX designers, software engineers, data scientists, and UX researchers. Eligible participants were those who had written at least a few LLM prompts in the past. The interviews were conducted remotely. Participants received a $25 gift card for their time.
7. Findings

Figure 4. Questionnaire results (7-point Likert scale) comparing ConstitutionMaker to the baseline condition across all measures. ConstitutionMaker scored higher than the baseline in every category; the difference was significant (marked with asterisks) for Effectively Guide, Ease, Efficiency, and Mental Demand.
Quantitative Findings. From the exit interviews, 12 of 14 participants preferred ConstitutionMaker to the baseline version. The results from the questionnaire are summarized in Figure 4. We found that ConstitutionMaker was perceived to be more helpful for writing rules that effectively guided the bot (Z = 8, p = .007), scoring on average 5.79 (σ = 1.01) whereas the baseline scored 4.0 (σ = 1.51). When participants rewound parts of the conversation, they thought the ConstitutionMaker principles were followed by the bot better. Next, participants felt it was significantly easier (Z = 10.5, p = .006) to convert their feedback into principles with ConstitutionMaker (μ = 5.86, σ = 0.91) than with the baseline (μ = 3.93, σ = 1.39). The automatic conversion of kudos and critiques into principles eased the process of converting intuitive feedback into clear criteria for the bot to follow. Participants perceived that they were significantly more efficient (Z = 5, p = .004) writing rules with ConstitutionMaker (μ = 5.86, σ = 1.51) than with the baseline (μ = 3.64, σ = 1.34). Finally, participants also felt that writing principles with ConstitutionMaker (μ = 3.0, σ = 1.41) required significantly less mental demand (Z = 1.5, p = .002) than with the baseline (μ = 5.21, σ = 1.78). There was no statistically significant difference for diversity (Z = 19, p = .06); participants felt that they exercised their creativity and wrote relatively diverse principles in the baseline.
Next, regarding the feature usage metrics, participants wrote significantly more principles (t(13) = 4.73, p < .001) with ConstitutionMaker than with the baseline; participants wrote on average 6.78 (σ = 2.11) principles per chatbot with ConstitutionMaker and 4.42 (σ = 1.24) principles with the baseline. Of the 95 principles written in the ConstitutionMaker condition, 40 (42.1%) came from kudos, where 37 were selected from the generated rationales and 3 were written; 28 (29.5%) came from critique, where 8 were selected and 20 were written; 13 (13.7%) came from the rewrite feature; and 14 (14.7%) were manually written. Participants found rewriting a bit cumbersome and preferred the less intensive workflow of describing what they liked or did not like to generate principles. In the following sections, we provide further context to these findings.
7.1. Participants’ Workflows for Writing Principles
The two conditions led to quite different workflows for writing principles. In the ConstitutionMaker condition, participants commonly scanned the three candidate outputs from the chatbot, identified the output they liked the most, and then gave kudos to that output if they thought it had a quality that was not currently reflected in their principles. For example, while P1 was working on FoodBot, he was asking for easy-to-make vegetarian dishes, and in one of the bot’s candidate outputs, each suggested dish had a short description and explanation on why it was easy. He appreciated this, skimmed the kudos, and then selected one that conveyed this positive feature of the bot’s output. However, if participants disliked all of the responses, they would then switch to critiquing one of the outputs. Accordingly, this kudos-first workflow helps to explain how most of the principles participants wrote came from that principle elicitation feature.
Meanwhile, in the baseline condition, participants generally wrote principles when the bot deviated quite a bit (in a negative way) from what they expected. P8 explained, “Here it feels like what I more naturally do is write corrective rules, to guardrail anything that goes weird…If it’s already doing the right thing it doesn’t need a rule from me. I wouldn’t feel the need to write those.” In the baseline condition, participants only see one candidate output from the chatbot, and this might deemphasize the stochastic nature of the chatbot’s outputs. As a result, when participants are okay with a response, they could feel that they do not need to write a principle to encourage that kind of response further. Overall, with the baseline, participants predominantly wrote principles to steer the LLM away from less optimal behavior, while with ConstitutionMaker, participants used kudos the most, to encourage behavior they liked.
7.2. ConstitutionMaker Supported Users’ Thought Processes
In the following section, we discuss how ConstitutionMaker supported participants’ thought processes, from (1) forming an intuition on ways the chatbot could be improved, to (2) expressing this intuition as feedback, to (3) converting this feedback into a specific and clear principle.
7.2.1. Multiple chatbot outputs helped participants form an intuition on how the model could be steered.
As P5 was using the baseline after using ConstitutionMaker, she explained how she wished she could see multiple outputs again: “Sometimes, I don’t know what I’m missing [in the baseline]. I’m thinking of the Denali hiking example [which occurred when she wrote principles for VacationBot with ConstitutionMaker]. Two of the responses didn’t mention that Denali was good for young children. But one did, and I was able to pull that out as a positive principle.” While she was writing principles for VacationBot with ConstitutionMaker, P5 started off the conversation saying she was looking for suggestions for her family, which included two young children. As the conversation progressed and a general location was established, P5 then asked for hiking recommendations, for which the bot gave some, but only one of its responses highlighted that the hikes it was recommending were good for young children. P5 gave kudos to that response and created the following principle: “Consider information previously inputted by the user when providing recommendations.” Ultimately, it can be hard to form opinions on responses without getting exposed to alternatives, so by providing multiple chatbot outputs, ConstitutionMaker supported participants in forming an intuition on how the model might be steered.
7.2.2. Automatically providing kudos and critique rationales helped participants formulate their intuitive feedback.
Upon seeing a candidate response from the chatbot, participants could intuitively tell if they liked or disliked it, but struggled to articulate their thoughts. The automatically generated kudos and critiques helped participants recognize and formulate this feedback. For example, while working on FoodBot with ConstitutionMaker, P9 asked the bot to identify the pizzeria with the best thin crust from a list of restaurants provided in a prior turn. The bot responded with, “Pizzaiolo has the best thin crust pies.” P9 knew he did not like the response, so he went to critique it and selected the following generated option: “This response is bad because it does not provide any information about the other pizza places that the user asked about.” The following principle was generated: “If the user asks about a specific attribute of a list of items, provide information about all of the items in the list that have that attribute,” which then produced a set of revised responses that compared the qualities of each pizzeria’s crusts. Reflecting on this process, P9 stated, “I didn’t like that last answer [from FoodBot], but I didn’t have a concrete reason why yet…I didn’t really know how to put it into words yet…but reading the suggestions gave me at least one of many potential reasons on why I didn’t like the response.” Thus, ConstitutionMaker helped participants transition from fast thinking (Kahneman, 2011), that is, their intuitive and unconscious responses to the bot’s outputs, to slow thinking, a more deliberate, conscious formulation of their feedback.
7.2.3. Generating principles from feedback helped users write clear, specific principles.
Sometimes the generated kudos and critique rationales did not capture participants’ particular feedback on the chatbot, and so they would then write their own. Their feedback was often under-specified, and ConstitutionMaker helped convert this feedback into a clear and specific principle. For example, P4 was writing principles for VacationBot using ConstitutionMaker. During the conversation, she had told VacationBot that she was planning a week-long vacation to Japan, to which the bot immediately responded with a comprehensive 7-day itinerary. P4 then wrote in her own critique: “This response is bad because it does not take into account the user’s interests.” The resulting principle was, “When the user mentions a location, ask them questions about what they are interested in before providing an itinerary.” This principle was aligned with what P4 had in mind, and reflecting on her experience using ConstitutionMaker, she stated, “When I would critique or kudos, it would give examples of principles that were putting it into words a little bit better than I could about like what exactly I was trying to narrow down to here.” Finally, even when the resulting principle was not exactly what they had in mind, participants appreciated the useful starting point it provided. Along these lines, P11 explained, “It was easier to say yes-and with Tool A [ConstitutionMaker]. Where it [the generated principle] wasn’t all the way there, but I think it’s 50% of the way there, and I can get it to where I want to go.” Overall, ConstitutionMaker helped participants specify their feedback into clear principles.
7.3. Participants’ Workflows Introduced Challenges with Writing Principles
Participants struggled to find the right level of granularity for their principles, and the two conditions led to different problems in this regard. Both workflows had participants switch roles from end-user, where participants experimented with different user journeys, to bot designer, where they evaluated the bot’s responses to write principles. The more conversation-forward interface of the baseline blurred the distinction between these two roles. P3 explained that without the multiple bot outputs and principle elicitation features, “you can simulate yourself as the user a lot better in this mode [the baseline].” And by leaning further into this user role, participants wrote principles that were more conversational, but under-specified. For example, while writing principles for FoodBot with the baseline, P11 wrote the principle “Be cognizant of the user’s dietary preferences.” What P11 really had in mind was a principle that specified that the bot should ask the user for their preferences and allergies prior to generating a meal plan. These underspecified principles often did not impact the bot’s responses and would frustrate participants while they used the baseline.
Meanwhile, while using ConstitutionMaker, the opposite problem occurred, where users’ workflows led to principles that were over-specified. For example, while working on VacationBot, P7 asked the model to help him narrow down a few vacation options, and the model proceeded to ask him questions (without any principle written specifying so). Appreciating that the model was gathering context, he selected a kudos that praised the model for asking about the user’s budget constraints prior to recommending a vacation destination. The resulting principle was, “Ask the user their budget before providing vacation options.” However, once this principle came into effect, the model’s behavior anchored specifically to only asking for budget prior to making a recommendation. And so, this workflow of providing feedback at every conversational step, instead of on entire conversations, led to a principle that was too specific and impacted the bot’s performance negatively. While users generally appreciated ConstitutionMaker’s ability to form specific principles from their feedback, there were rare instances where the principles were too specific.
Finally, in both conditions, by switching back and forth between the end-user and bot designer roles, participants would sometimes write principles that conflicted with each other. For example, while P2 was working on VacationBot with the baseline, he asked the bot for dog-friendly hotel recommendations in the Bay Area, and VacationBot responded with three recommendations. P2 wanted more recommendations and wrote a principle to “Provide >= 10 recommendations.” Later on in the conversation, P2 had a list of dog-friendly hotels, with their requisite costs, and he asked VacationBot which it recommends, to which it responded by listing positive attributes of all the hotel options. P2, who now wanted a decisive, single response, wrote the following principle: “If I ask for a recommendation, give *1* recommendation only.” VacationBot, now with two conflicting principles on the number of recommendations to provide, alternated between the two. Ultimately, by providing feedback on individual conversational turns, participants ended up with conflicting principles. P8 imagined a different workflow, where he would experiment with full user journeys and then write principles: “I think it might help me to actually go through the [whole] user flow and then analyze it as a piece instead of switching…it would allow me to inhabit one mindset [either bot designer or user] for a period of time and then switch mindsets.” In summary, one’s workflow as they probe and test the model impacts the types of principles they produce and the challenges they face.
7.4. The Limits of Principles
Some participants questioned whether writing natural language principles was the optimal way to steer all aspects of a bot’s behavior. While writing a principle to shorten the length of the chatbot’s responses, P13 reflected, “It feels a little weird to use natural language to generate the principles…it doesn’t feel efficient, and I’m not sure how it’s going to interpret it.” They imagined that aspects like the form of the model’s responses would be better customized with UI elements such as sliders to adjust the length of the bot’s responses, or by exemplifying the structure of the bot’s response (e.g., an indented, numbered list for recommendations) for the model to follow, instead of describing these requests in natural language. In a similar vein, P14 noticed that her principles pertained to different parts of the conversation, and as a list, they seemed hard to relate to each other. She wanted to structure and provide feedback on higher-level “conversational arcs,” visually illustrating the flow and “forks in the road” of the conversation (e.g., “If the user does X, do Y. Otherwise, do Z”). Principles are computational in a sense, as they dictate the ways the conversation can unfold; there might be better ways to let users author this flow, other than with individual principles.
8. Discussion
8.1. Supporting Users in Clarifying and Iterating Over Principles
Finding the right granularity for a principle was sometimes challenging for participants; they created under-specified principles that did not impact the conversation, as well as over-specified principles that applied too strictly to certain portions of the conversation. One way to support users in finding the right granularity could be generating questions to help them reflect on their principle. For example, if an abstract principle is written (e.g., “Ask follow-up questions to understand the user’s preferences”), an LLM prompt can be used to pose clarifying questions, such as, “What kind of follow-up questions should be asked?” or, “Should I do anything differently, depending on the user’s answers?” Users could then answer these questions as best they can, and the principle could then be updated automatically. Alternatively, another way to help users reflect on their principles might be to engage in a side conversation with a chatbot to help clarify them. This chatbot could pose similar questions to those suggested above, but it might also provide examples of chat snippets that adhere to or violate the principle. As they converse, the chatbot might continue to pose clarifying questions, while the principle is updated on the fly. Thus, future work could examine supporting users in interactively reflecting upon and clarifying their principles.
8.2. Organizing Principles and Supporting Multiple Principle Writers
As participants accumulated principles, it was increasingly likely that there was a conflict between two of them, and it was harder for them to get an overview of how their principles were affecting the conversation. One way to prevent potential principle conflicts is to leverage an LLM to conduct a pairwise comparison of principles to assess if any two are at odds, and then suggest a solution. This kind of conflict resolution, while useful for a single principle writer, would be crucial in cases when multiple individuals are writing principles to improve a model. Multiple principle writers would be useful for gathering feedback to improve the overall performance of the model, but with many generated principles, it is increasingly important to understand how they might impact the conversation. Perhaps a better way to prevent conflicts is to organize principles in a way that summarizes their impact on the conversation. Principles are small bits of computation; there are conditions when they are applicable, and depending on those conditions, the bot’s behavior might branch in separate directions. One way to organize principles is to construct a state diagram, which illustrates the potential set of supported user flows and bot responses. With this overview, users could be made better aware of the overall impact of their principles, and they could then easily revise it to prevent conflicts. Therefore, another rich vein of future work is developing mechanisms to resolve conflicts in larger sets of principles, as well as organizing them into an easily digestible overview.
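As a rough sketch of the pairwise check described above (the prompt wording and the `llm` callable are hypothetical, not an implemented feature), one could iterate over all pairs of principles and ask an LLM to flag contradictions:

```python
from itertools import combinations

CONFLICT_PROMPT = (
    "Principle A: {a}\n"
    "Principle B: {b}\n"
    "Could following both principles lead to contradictory chatbot behavior? "
    "Answer YES or NO, explain briefly, and if YES, suggest a single merged principle."
)


def find_conflicts(principles, llm):
    """Ask an LLM (hypothetical callable returning text) to judge every pair of principles."""
    conflicts = []
    for a, b in combinations(principles, 2):
        verdict = llm(CONFLICT_PROMPT.format(a=a, b=b))
        if verdict.strip().upper().startswith("YES"):
            conflicts.append((a, b, verdict))
    return conflicts
```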
8.3. Automatically Ideating Multiple User Journeys to Support Principle Writing
To test the chatbot and identify instances where the model could be improved, participants employed a strategy where they went through different user journeys with the chatbot. Often, their choice of user journeys was biased toward what they were interested in, or they only tested the most common journeys. One way to enable more robust testing of chatbots could be to generate potential user personas and journeys to inspire principle writers. Going further, these generated user personas could then be used to simulate conversations with the chatbot being tested (Park et al., 2023). For example, for VacationBot, one user persona might be a parent looking for nearby, family-friendly vacations. A dialogue prompt could be generated for this persona, and then VacationBot and this test persona could converse for a predefined number of conversational turns. Afterwards, users could inspect the conversation, and edit or critique VacationBot’s responses to generate principles. This kind of workflow could sidestep the challenge of repeatedly shifting from an end-user’s perspective to a bot-designer’s perspective, which exists in the current workflow. At the same time, users would be able to evaluate fuller conversational arcs, as opposed to single conversational turns. Thus, another line of future work is supporting users in exploring diverse user journeys with their chatbot, as well as exploring workflows that require less perspective switching.
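A minimal sketch of such a persona-driven simulation loop is shown below; `persona_turn` and `bot_turn` are hypothetical callables (e.g., wrappers around persona and chatbot dialogue prompts), and the fixed turn count stands in for a real stopping criterion.

```python
def simulate_conversation(persona, persona_turn, bot_turn, n_turns=6):
    """Simulate a short conversation between a generated user persona and the chatbot under test.

    `persona_turn(history, persona)` and `bot_turn(history)` each return the next message
    for their side, given the history as a list of (speaker, text) pairs.
    """
    history = []
    for _ in range(n_turns):
        history.append(("user", persona_turn(history, persona)))
        history.append(("bot", bot_turn(history)))
    # The designer can then inspect, edit, or critique the bot's turns to derive principles.
    return history
```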
8.4. Limitations and Future Work
A set of well-written principles is often not enough to robustly steer an LLM’s outputs. As more principles are written, an LLM might “forget” to apply older principles (Zamfirescu-Pereira et al., 2023a). This work focuses on helping participants convert their intuitive feedback into clear principles, and we illustrate that the principle-elicitation features help with that process. However, in line with the original Constitutional AI workflow (Bai et al., 2022), future work can focus on using these principles to generate a fine-tuning dataset, so that the model robustly follows them.
Next, while we selected chatbots for two very common use cases for the study (vacation and food), participants might not have been very knowledgeable or opinionated in these areas. Future work can explore how these principle-elicitation features help users when writing principles for chatbot use cases that they are experts in. That being said, it was necessary to choose two chatbot use cases for the study to enable a fair comparison across the two conditions.
9. Conclusion
This paper presents ConstitutionMaker, a tool for interactively refining LLM outputs by converting users’ intuitive feedback into principles. ConstitutionMaker’s design is informed by a formative study, where we also collected and classified the types of principles users wanted to write. ConstitutionMaker incorporates three principle-elicitation features: kudos, critique, and rewrite. In a user study with 14 industry professionals, participants felt that ConstitutionMaker helped them (1) write principles that effectively guided the chatbot, (2) convert their feedback into principles more easily, and (3) write principles more efficiently, with (4) less mental demand than the baseline. This was due to ConstitutionMaker supporting their thought processes, including helping them to: identify ways to improve the bot’s responses, convert their intuition into verbal feedback, and phrase their feedback as specific principles. There are many avenues of future work, including supporting users in iterating on and clarifying their principles, organizing larger sets of principles and supporting multiple writers, and helping users test chatbots across multiple user journeys. Together, these findings inform future tools that support interactively customizing LLM outputs.
References
- Bai et al. (2022) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. 2022. Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073 [cs.CL]
- Bostandjiev et al. (2012) Svetlin Bostandjiev, John O’Donovan, and Tobias Höllerer. 2012. TasteWeights: A Visual Interactive Hybrid Recommender System. In Proceedings of the Sixth ACM Conference on Recommender Systems (Dublin, Ireland) (RecSys ’12). Association for Computing Machinery, New York, NY, USA, 35–42. https://doi.org/10.1145/2365952.2365964
- Brade et al. (2023) Stephen Brade, Bryan Wang, Mauricio Sousa, Sageev Oore, and Tovi Grossman. 2023. Promptify: Text-to-Image Generation through Interactive Prompt Exploration with Large Language Models. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST ’23). Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3586183.3606725 arXiv:2304.09337 [cs.HC]
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877–1901. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
- Bubeck et al. (2023) Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. 2023. Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv:2303.12712 [cs.CL]
- Chang et al. (2021) Ernie Chang, Xiaoyu Shen, Hui-Syuan Yeh, and Vera Demberg. 2021. On Training Instance Selection for Few-Shot Neural Text Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Association for Computational Linguistics, Online, 8–13. https://doi.org/10.18653/v1/2021.acl-short.2
- Chen et al. (2020) Qiaochu Chen, Xinyu Wang, Xi Ye, Greg Durrett, and Isil Dillig. 2020. Multi-Modal Synthesis of Regular Expressions. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (London, UK) (PLDI 2020). Association for Computing Machinery, New York, NY, USA, 487–502. https://doi.org/10.1145/3385412.3385988
- Gero et al. (2022) Katy Ilonka Gero, Vivian Liu, and Lydia Chilton. 2022. Sparks: Inspiration for Science Writing Using Language Models. In Proceedings of the 2022 ACM Designing Interactive Systems Conference (Virtual Event, Australia) (DIS ’22). Association for Computing Machinery, New York, NY, USA, 1002–1019. https://doi.org/10.1145/3532106.3533533
- Hart and Staveland (1988) Sandra G. Hart and Lowell E. Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of Empirical and Theoretical Research. In Human Mental Workload, Peter A. Hancock and Najmedin Meshkati (Eds.). Advances in Psychology, Vol. 52. North-Holland, 139–183. https://doi.org/10.1016/S0166-4115(08)62386-9
- Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, 328–339. https://doi.org/10.18653/v1/P18-1031
- Jia (2009) Jiyou Jia. 2009. CSIEC: A computer assisted English learning chatbot based on textual knowledge and reasoning. Knowledge-Based Systems 22, 4 (2009), 249–255. https://doi.org/10.1016/j.knosys.2008.09.001 Artificial Intelligence (AI) in Blended Learning.
- Jiang et al. (2022a) Ellen Jiang, Kristen Olson, Edwin Toh, Alejandra Molina, Aaron Donsbach, Michael Terry, and Carrie J Cai. 2022a. PromptMaker: Prompt-Based Prototyping with Large Language Models. In Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, USA) (CHI EA ’22). Association for Computing Machinery, New York, NY, USA, Article 35, 8 pages. https://doi.org/10.1145/3491101.3503564
- Jiang et al. (2021) Ellen Jiang, Edwin Toh, Alejandra Molina, Aaron Donsbach, Carrie J Cai, and Michael Terry. 2021. GenLine and GenForm: Two Tools for Interacting with Generative Language Models in a Code Editor. In Adjunct Proceedings of the 34th Annual ACM Symposium on User Interface Software and Technology (Virtual Event, USA) (UIST ’21 Adjunct). Association for Computing Machinery, New York, NY, USA, 145–147. https://doi.org/10.1145/3474349.3480209
- Jiang et al. (2022b) Ellen Jiang, Edwin Toh, Alejandra Molina, Kristen Olson, Claire Kayacik, Aaron Donsbach, Carrie J Cai, and Michael Terry. 2022b. Discovering the Syntax and Strategies of Natural Language Programming with Generative Language Models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, USA) (CHI ’22). Association for Computing Machinery, New York, NY, USA, Article 386, 19 pages. https://doi.org/10.1145/3491102.3501870
- Kahneman (2011) Daniel Kahneman. 2011. Thinking, Fast and Slow. Macmillan.
- Kim et al. (2023) Tae Soo Kim, Yoonjoo Lee, Minsuk Chang, and Juho Kim. 2023. Cells, Generators, and Lenses: Design Framework for Object-Oriented Interaction with Large Language Models. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST ’23). Association for Computing Machinery, New York, NY, USA, 1–18.
- Kunkel et al. (2017) Johannes Kunkel, Benedikt Loepp, and Jürgen Ziegler. 2017. A 3D Item Space Visualization for Presenting and Manipulating User Preferences in Collaborative Filtering. In Proceedings of the 22nd International Conference on Intelligent User Interfaces (Limassol, Cyprus) (IUI ’17). Association for Computing Machinery, New York, NY, USA, 3–15. https://doi.org/10.1145/3025171.3025189
- Lee et al. (2023) Peter Lee, Sebastien Bubeck, and Joseph Petro. 2023. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. New England Journal of Medicine 388, 13 (2023), 1233–1239. https://doi.org/10.1056/NEJMsr2214184 PMID: 36988602.
- Lester et al. (2021) Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 3045–3059. https://doi.org/10.18653/v1/2021.emnlp-main.243
- Liu et al. (2023a) Michael Xieyang Liu, Advait Sarkar, Carina Negreanu, Benjamin Zorn, Jack Williams, Neil Toronto, and Andrew D. Gordon. 2023a. “What It Wants Me To Say”: Bridging the Abstraction Gap Between End-User Programmers and Code-Generating Large Language Models. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 598, 31 pages. https://doi.org/10.1145/3544548.3580817
- Liu et al. (2022) Vivian Liu, Han Qiao, and Lydia Chilton. 2022. Opal: Multimodal Image Generation for News Illustration. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology (Bend, OR, USA) (UIST ’22). Association for Computing Machinery, New York, NY, USA, Article 73, 17 pages. https://doi.org/10.1145/3526113.3545621
- Liu et al. (2023b) Vivian Liu, Jo Vermeulen, George Fitzmaurice, and Justin Matejka. 2023b. 3DALL-E: Integrating Text-to-Image AI in 3D Design Workflows. In Proceedings of the 2023 ACM Designing Interactive Systems Conference (Pittsburgh, PA, USA) (DIS ’23). Association for Computing Machinery, New York, NY, USA, 1955–1977. https://doi.org/10.1145/3563657.3596098
- Mishra et al. (2023) Aditi Mishra, Utkarsh Soni, Anjana Arunkumar, Jinbin Huang, Bum Chul Kwon, and Chris Bryan. 2023. PromptAid: Prompt Exploration, Perturbation, Testing and Iteration using Visual Analytics for Large Language Models. arXiv:2304.01964 [cs.HC]
- Park et al. (2023) Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative Agents: Interactive Simulacra of Human Behavior. arXiv:2304.03442 [cs.HC]
- Petridis et al. (2022) Savvas Petridis, Nediyana Daskalova, Sarah Mennicken, Samuel F Way, Paul Lamere, and Jennifer Thom. 2022. TastePaths: Enabling Deeper Exploration and Understanding of Personal Preferences in Recommender Systems. In 27th International Conference on Intelligent User Interfaces (Helsinki, Finland) (IUI ’22). Association for Computing Machinery, New York, NY, USA, 120–133. https://doi.org/10.1145/3490099.3511156
- Petridis et al. (2023) Savvas Petridis, Nicholas Diakopoulos, Kevin Crowston, Mark Hansen, Keren Henderson, Stan Jastrzebski, Jeffrey V Nickerson, and Lydia B Chilton. 2023. AngleKindling: Supporting Journalistic Angle Ideation with Large Language Models. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 225, 16 pages. https://doi.org/10.1145/3544548.3580907
- Pryzant et al. (2023) Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. 2023. Automatic Prompt Optimization with “Gradient Descent” and Beam Search. arXiv:2305.03495 [cs.CL]
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
- Ramesh et al. (2017) Kiran Ramesh, Surya Ravishankaran, Abhishek Joshi, and K. Chandrasekaran. 2017. A Survey of Design Techniques for Conversational Agents. In Information, Communication and Computing Technology, Saroj Kaushik, Daya Gupta, Latika Kharb, and Deepak Chahal (Eds.). Springer Singapore, Singapore, 336–350.
- Reif et al. (2023) Emily Reif, Minsuk Kahng, and Savvas Petridis. 2023. Visualizing Linguistic Diversity of Text Datasets Synthesized by Large Language Models. arXiv:2305.11364 [cs.CL]
- Reynolds and McDonell (2021) Laria Reynolds and Kyle McDonell. 2021. Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI EA ’21). Association for Computing Machinery, New York, NY, USA, Article 314, 7 pages. https://doi.org/10.1145/3411763.3451760
- Shin et al. (2020) Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 4222–4235. https://doi.org/10.18653/v1/2020.emnlp-main.346
- Strobelt et al. (2023) Hendrik Strobelt, Albert Webson, Victor Sanh, Benjamin Hoover, Johanna Beyer, Hanspeter Pfister, and Alexander M. Rush. 2023. Interactive and Visual Prompt Engineering for Ad-hoc Task Adaptation with Large Language Models. IEEE Transactions on Visualization and Computer Graphics 29, 1 (2023), 1146–1156. https://doi.org/10.1109/TVCG.2022.3209479
- Su et al. (2022) Hongjin Su, Jungo Kasai, Chen Henry Wu, Weijia Shi, Tianlu Wang, Jiayi Xin, Rui Zhang, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Tao Yu. 2022. Selective Annotation Makes Language Models Better Few-Shot Learners. arXiv:2209.01975 [cs.CL]
- Verbruggen et al. (2021) Gust Verbruggen, Vu Le, and Sumit Gulwani. 2021. Semantic Programming by Example with Pre-Trained Models. Proc. ACM Program. Lang. 5, OOPSLA, Article 100 (Oct. 2021), 25 pages. https://doi.org/10.1145/3485477
- Wang et al. (2023a) Sitong Wang, Savvas Petridis, Taeahn Kwon, Xiaojuan Ma, and Lydia B Chilton. 2023a. PopBlends: Strategies for Conceptual Blending with Large Language Models. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 435, 19 pages. https://doi.org/10.1145/3544548.3580948
- Wang et al. (2023b) Yunlong Wang, Shuyuan Shen, and Brian Y Lim. 2023b. RePrompt: Automatic Prompt Editing to Refine AI-Generative Art Towards Precise Expressions. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 22, 29 pages. https://doi.org/10.1145/3544548.3581402
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 24824–24837. https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf
- Wu et al. (2023) Sherry Wu, Hua Shen, Daniel S Weld, Jeffrey Heer, and Marco Tulio Ribeiro. 2023. ScatterShot: Interactive In-Context Example Curation for Text Transformation. In Proceedings of the 28th International Conference on Intelligent User Interfaces (Sydney, NSW, Australia) (IUI ’23). Association for Computing Machinery, New York, NY, USA, 353–367. https://doi.org/10.1145/3581641.3584059
- Wu et al. (2022a) Tongshuang Wu, Ellen Jiang, Aaron Donsbach, Jeff Gray, Alejandra Molina, Michael Terry, and Carrie J Cai. 2022a. PromptChainer: Chaining Large Language Model Prompts through Visual Programming. In Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, USA) (CHI EA ’22). Association for Computing Machinery, New York, NY, USA, Article 359, 10 pages. https://doi.org/10.1145/3491101.3519729
- Wu et al. (2022b) Tongshuang Wu, Michael Terry, and Carrie Jun Cai. 2022b. AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, USA) (CHI ’22). Association for Computing Machinery, New York, NY, USA, Article 385, 22 pages. https://doi.org/10.1145/3491102.3517582
- Wu et al. (2017) Yu Wu, Wei Wu, Chen Xing, Ming Zhou, and Zhoujun Li. 2017. Sequential Matching Network: A New Architecture for Multi-turn Response Selection in Retrieval-Based Chatbots. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vancouver, Canada, 496–505. https://doi.org/10.18653/v1/P17-1046
- Xu et al. (2017) Anbang Xu, Zhe Liu, Yufan Guo, Vibha Sinha, and Rama Akkiraju. 2017. A New Chatbot for Customer Service on Social Media. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (Denver, Colorado, USA) (CHI ’17). Association for Computing Machinery, New York, NY, USA, 3506–3510. https://doi.org/10.1145/3025453.3025496
- Yuan et al. (2022) Ann Yuan, Andy Coenen, Emily Reif, and Daphne Ippolito. 2022. Wordcraft: Story Writing With Large Language Models. In 27th International Conference on Intelligent User Interfaces (Helsinki, Finland) (IUI ’22). Association for Computing Machinery, New York, NY, USA, 841–852. https://doi.org/10.1145/3490099.3511105
- Zamfirescu-Pereira et al. (2023a) J.D. Zamfirescu-Pereira, Heather Wei, Amy Xiao, Kitty Gu, Grace Jung, Matthew G Lee, Bjoern Hartmann, and Qian Yang. 2023a. Herding AI Cats: Lessons from Designing a Chatbot by Prompting GPT-3. In Proceedings of the 2023 ACM Designing Interactive Systems Conference (Pittsburgh, PA, USA) (DIS ’23). Association for Computing Machinery, New York, NY, USA, 2206–2220. https://doi.org/10.1145/3563657.3596138
- Zamfirescu-Pereira et al. (2023b) J.D. Zamfirescu-Pereira, Richmond Y. Wong, Bjoern Hartmann, and Qian Yang. 2023b. Why Johnny Can’t Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 437, 21 pages. https://doi.org/10.1145/3544548.3581388
- Zhang et al. (2020) Tianyi Zhang, London Lowmanstone, Xinyu Wang, and Elena L. Glassman. 2020. Interactive Program Synthesis by Augmented Examples. In Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology (Virtual Event, USA) (UIST ’20). Association for Computing Machinery, New York, NY, USA, 627–648. https://doi.org/10.1145/3379337.3415900
- Zhou et al. (2018) Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. 2018. Emotional Chatting Machine: Emotional Conversation Generation with Internal and External Memory. arXiv:1704.01074 [cs.CL]