Debugging Without Error Messages
Abstract.
Making errors is part of the programming process, even for the most seasoned professionals. Novices in particular are bound to make many errors while learning. It is well known that traditional (compiler/interpreter) programming error messages have been less than helpful for many novices: they can be frustrating, contain confusing jargon, and be downright misleading. Recent work has found that large language models (LLMs) can generate excellent error explanations, but that the effectiveness of these explanations heavily depends on whether the LLM has been provided with context, typically the original source code where the problem occurred. Knowing that programming error messages can be misleading and/or contain jargon that serves little-to-no use (particularly for novices), we explore the reverse: what happens when GPT-3.5 is prompted for error explanations on just the erroneous source code itself, with the original compiler/interpreter-produced error message excluded? We utilized various strategies to elicit more effective error explanations, including one-shot prompting and fine-tuning. We report baseline results for how effective the error explanations are at providing feedback, as well as how various prompting strategies might improve the explanations’ effectiveness. Our results can help educators understand how LLMs respond to the kinds of prompts novices are bound to make, and hopefully lead to more effective use of Generative AI in the classroom.
1. Introduction
Students and professionals alike frequently make mistakes when writing code. Some of these mistakes completely grind the programming process to a halt, resulting in diagnostics called programming error messages (PEMs). Much ink has been spilled lamenting the quality of PEMs throughout the decades (Becker et al., 2019)—often labeled as “terse”, “cryptic”, and “misleading”—but recently, a new alternative to traditional error messages has been made available.
Generative AI tools such as ChatGPT have suddenly shaken up multiple fields including software development and computing education. Prior work has shown that large language models (LLMs), like the one used in ChatGPT, can provide acceptable explanations for programming error messages (Leinonen et al., 2023; Santos et al., 2023; Widjojo and Treude, 2023). A common theme in these works is that adding source code context drastically improves the quality of generated explanations.
The present work asks a curious question: How effective is the feedback generated by LLMs when the original programming error message is omitted entirely? This work not only establishes a baseline for feedback without error messages, but attempts to improve on this baseline using a variety of prompting strategies. We compare one-shot prompting—providing an example of the desired feedback in the prompt—with fine-tuning—a more involved process that requires modifying the language model’s weights using additional training examples.
This work hopes to offer insights to educators, both with regards to the role of feedback when resolving programming errors, and to the use of generative AI in the classroom.
1.1. Research questions
We are guided by the following research questions:
- RQ1: How effective are LLM-generated error message explanations that omit the original error message from the prompt?
- RQ2: How does prompting strategy affect various aspects of LLM-generated error explanations, including…
  - …the accuracy of error explanations?
  - …the relevancy of error explanations?
  - …the verbosity of error explanations?
- RQ3: What trade-offs are there between different prompting strategies in the context of generating error explanations?
1.2. Contributions
With this work, we provide empirical evidence that:
- Prompting GPT-3.5 for feedback without the original error message produces roughly 2–3 useful responses for every misleading explanation generated.
- Alternative prompting strategies (one-shot and fine-tuning) do not appreciably increase the accuracy of LLM-generated error explanations.
- One-shot prompting and fine-tuning produce fewer instances of distracting, extraneous information in the generated error explanations.
- Fine-tuning generates error explanations that are more concise and on-topic than the other prompting strategies.
2. Background and Related Work
Although the present work proposes to remove programming error messages from the process of debugging, it is useful to understand why such an idea would seem reasonable in the first place.
2.1. Programming Error Messages
Programming error messages (PEMs) are the (usually textual) diagnostic messages that are generated when an unrecoverable error is detected while programming, either at compile-time or while the program is running. They are one of the primary forms of feedback that novice and professional programmers alike receive while coding. PEMs have had a long and unfortunate history of being perceived as unhelpful (Becker et al., 2019). They often contain technical jargon unfamiliar to novices and have a penchant for making misleading suggestions (Dy and Rodrigo, 2010; Kohn, 2019; Marceau et al., 2011). PEMs have such a bad reputation that the present work proposes removing them altogether; however, prior work has found that any error message is better than no error message at all (Shneiderman, 1982).
Recent work has tried to understand the factors that make PEMs understandable. These factors suggest that appropriate feedback should be short and succinct, use simple vocabulary without any jargon, and be written in clear sentences (Denny et al., 2021). Unfortunately, these guidelines are difficult to act on within compilers because producing succinct and precise feedback requires some level of understanding of the programmer’s intent. Programming errors and programming error messages are not the same thing (McCall and Kölling, 2014); producing useful feedback, especially for novices, requires more careful consideration of which mistakes are likely and which are difficult to overcome while programming (McCall and Kölling, 2019). That said, statistical modeling has shown that programmers’ intent is largely predictable (Hindle et al., 2012), enabling efforts to further use statistical modeling to generate feedback as an alternative to conventional programming error messages (Campbell et al., 2014; Santos et al., 2018).
2.2. Generative AI and Large Language Models
The world has been rocked by the flood of easily accessible Generative AI applications. Approachable chatbot interfaces like ChatGPT allow laypeople to interact with the world’s largest AI models using the same social scripts that they would use in human-human interaction (Nass and Moon, 2000). Large language models (LLMs) such as GPT-3.5 (the LLM introduced with ChatGPT in November 2022), have had a profound impact on computing education in a relatively short time. Early work found that LLMs can solve most CS1 (Finnie-Ansley et al., 2022) and CS2 problems (Finnie-Ansley et al., 2023), raising concerns that automated assessment could soon be trivially circumvented. Indeed, the computing education community is struggling to integrate generative AI in teaching (Lau and Guo, 2023).
2.3. Using LLMs to generate programming error feedback
Prior work has used LLMs to tackle the problem of frustrating programming error messages. One study, using an early, code-oriented LLM, found that it could produce correct explanations for error messages in 48% of cases, but only 33% of all generated explanations had correct fixes. Later work, using more capable models like GPT-4, found that LLMs could generate responses with up to 99% correct explanations and 83% correct fixes (Santos et al., 2023), and up to 100% “useful” responses (Widjojo and Treude, 2023). Most notably, all of the previous studies showed that programming feedback was most effective when it included the original erroneous source code as part of the prompt. These studies have also shown that providing only the programming error message to LLMs often produces vague, “technically correct” responses instead of targeted, actionable feedback. Other work has used LLMs in clever ways to generate feedback for syntax errors without the use of the original programming error message (Phung et al., 2023).
3. Methodology
To compare the three different prompting strategies (baseline, one-shot, fine-tuned), we first collected 100 erroneous student programs from the TigerJython dataset. We manually wrote error explanations for 60 programs (Section 3.2), holding out the remaining 40 for evaluation. We prompted GPT-3.5 (Section 3.3), incorporating the manual explanations into our one-shot and fine-tuned prompting strategies. Finally, we manually evaluated the three prompting strategies using the remaining 40 programs, rating based on criteria derived from prior work (Section 3.4).
3.1. Collecting erroneous student programs
We sampled 100 erroneous student programs from the TigerJython 2022 ProgSnap2 database (Kohn and Manaris, 2020). TigerJython is an introductory programming environment aimed at secondary school students. Its programming language is a modified version of Jython, itself a dialect of Python 2. Though it resembles Python 2, TigerJython features a few extended syntactic constructs that are not present in either Python 2 or Python 3. The fact that TigerJython code resembles, but is not identical to, Python 3 became an issue when obtaining responses from GPT-3.5 (see Section 5.4).
To find suitable erroneous code, we used the following criteria:
- Error fixed: The erroneous program must originate from a pair of events where the code raised an error in one event, and then ran successfully in the subsequent event.
- 20 lines: The erroneous program consists of at most 20 lines of code.
- Python 2: The fixed program uses valid Python 2.7 syntax.
- 1 error: The erroneous program has exactly one programming error that results in an error message being emitted, either at compile-time or at runtime.
The first three criteria were used to automatically filter the dataset. The last criterion (exactly one programming error) required manual analysis. As such, the first author randomly sampled from the filtered dataset until 100 programs were found to have exactly one programming error. Of the 100 erroneous programs selected, 60 programs were used to write manual error explanations (Section 3.2), while the remaining 40 were held out for the evaluation (Section 3.4).
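The automated portion of this filtering can be expressed as a short script. The sketch below is illustrative only: it assumes a simplified, hypothetical representation of the event log (chronologically ordered per-student records with `code` and `raised_error` fields) rather than the actual ProgSnap2 schema, and it shells out to a `python2` interpreter to check Python 2.7 syntax.

```python
import subprocess


def is_valid_python27(code: str) -> bool:
    """Hypothetical helper: check Python 2.7 syntax by compiling with a python2 interpreter.

    Assumes a `python2` executable is available on PATH.
    """
    result = subprocess.run(
        ["python2", "-c",
         "import sys; compile(sys.stdin.read(), '<submission>', 'exec')"],
        input=code, text=True, capture_output=True,
    )
    return result.returncode == 0


def candidate_erroneous_programs(events):
    """Yield (erroneous_code, fixed_code) pairs satisfying the first three criteria.

    `events` is assumed to be one student's submissions in chronological order,
    each with hypothetical fields `code` (str) and `raised_error` (bool).
    The fourth criterion (exactly one programming error) still requires
    manual inspection.
    """
    for before, after in zip(events, events[1:]):
        if not (before.raised_error and not after.raised_error):
            continue  # "error fixed": an error event followed by a successful run
        if len(before.code.splitlines()) > 20:
            continue  # "20 lines": at most 20 lines of code
        if not is_valid_python27(after.code):
            continue  # "Python 2": the *fixed* program uses valid Python 2.7 syntax
        yield before.code, after.code
```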
Censored strings
For privacy reasons, programs in the dataset had “censored” string literals such that some text would be replaced with a series of X’s. For example, setColor("Yellow") would appear in the dataset as setColor("Xxxxxx"). In preliminary testing, we found that GPT-3.5 would suggest changing the string to a valid color name, which was not the intended programming error to correct. However, the strings were only censored in the source code; the error messages stored in the dataset (inadvertently) revealed the uncensored string literals. In these cases, we manually changed the censored string literal back to a “reasonable” string literal (e.g., "Xxxxxx" back to "Yellow") to prevent the LLM from suggesting this as the error.
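We performed this replacement by hand; purely as an illustration, the same substitution could be scripted as follows. The regular expression and the `RECOVERED` mapping are assumptions for the sketch, not part of our tooling.

```python
import re

# Hypothetical mapping from a censored literal to the string recovered from the
# dataset's (uncensored) error messages, e.g. "Xxxxxx" -> "Yellow".
RECOVERED = {'"Xxxxxx"': '"Yellow"'}

# A censored literal: a double-quoted run of X/x characters.
CENSORED = re.compile(r'"[Xx]+"')


def uncensor(source: str) -> str:
    """Swap censored string literals for a 'reasonable' replacement."""
    return CENSORED.sub(lambda m: RECOVERED.get(m.group(0), m.group(0)), source)


print(uncensor('setColor("Xxxxxx")'))  # prints: setColor("Yellow")
```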
3.2. Creating manual error explanations
We manually created error explanations for 60 erroneous programs. Two authors split the task, writing 30 explanations each. To better understand the students’ intent, authors examined both the erroneous program and the student’s fix to the problem. Using this information, we wrote concise error explanations in the following consistent format, featuring:
1. One or two complete sentences explaining the problem.
2. One or two complete sentences explaining the fix.
3. A minimal example of correct source code.
For example:
Running the provided code results in an error because the forward() function needs to include a numerical value. To fix the problem, give forward() a value. For example, forward(30).
In cases where the fix contained the example source code in its entirety, we omitted the (now redundant) example to make each message more succinct. For example:
Running the provided code results in an error because the maketurtle() function needs to have a capital T. To fix the problem, change it to makeTurtle().
3.3. Models and prompting strategies
We compared three different prompting strategies: the baseline, prompting a model with one example of the desired error explanation (one-shot), and prompting a model fine-tuned on manual error explanations. In all cases, we used OpenAI’s gpt-3.5-turbo-1106 model. As of this writing, GPT-3.5 is the latest model family from OpenAI that is generally available for fine-tuning (https://platform.openai.com/docs/models/gpt-3-5-turbo). We prompted using the OpenAI Playground (https://platform.openai.com/playground/chat?models=gpt-3.5-turbo-1106), which allowed us to directly control hyperparameters. In particular, we used a temperature of 0.0 when obtaining responses, as this was found to be the most effective in prior work (Leinonen et al., 2023). For all three prompting strategies, we provided the full, unabridged erroneous source code as the user message.
3.3.1. Baseline
The baseline (control) was to prompt gpt-3.5-turbo-1106 using the following system message:
Provide a plain English explanation of why running the Python 2 code causes an error and how to fix the problem. Do not output the entire fixed source code.
This message is based on prompt #1 from Leinonen et al. (2023), with a few notable changes: we explicitly specify that the code is Python 2 because we found in preliminary testing that the models would suggest fixing syntactic elements to make them compatible with Python 3 rather than explaining the true programming error. In addition, OpenAI’s language models seem to consistently reproduce the entire source code with the problem fixed when providing suggestions (Santos et al., 2023). For our purposes, this is unnecessary, so we added wording to the system message instructing the model not to output the entire fixed program.
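We obtained responses through the OpenAI Playground; for reference, an equivalent call through the OpenAI Python SDK (version 1.x is assumed here) might look like this sketch:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_MESSAGE = (
    "Provide a plain English explanation of why running the Python 2 code "
    "causes an error and how to fix the problem. "
    "Do not output the entire fixed source code."
)


def baseline_explanation(erroneous_code: str) -> str:
    """Prompt gpt-3.5-turbo-1106 with the system message and the full erroneous program."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        temperature=0.0,  # most effective in prior work (Leinonen et al., 2023)
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": erroneous_code},
        ],
    )
    return response.choices[0].message.content
```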
3.3.2. One-shot prompting
A way to improve the output from an LLM is to provide one or more examples of the desired output in the prompt. This is called few-shot prompting (OpenAI, 2024b); the baseline (not using examples in the prompt) is also known as zero-shot learning. This is a relatively low-effort method to potentially increase the effectiveness of an LLM’s output. In our case, we evaluated the case where exactly one example is added to the prompt (i.e., one-shot prompting). For each erroneous program, we augmented the baseline’s system message by concatenating it with the phrase “For example:” followed by an error explanation randomly sampled from our set of 60 manually written explanations (Section 3.2).
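A minimal sketch of this prompt construction follows; the exact joining of the strings is an assumption, and the sampled explanation is drawn from the 60 manual explanations described in Section 3.2.

```python
import random


def one_shot_system_message(base_system_message: str,
                            manual_explanations: list[str]) -> str:
    """Append a randomly sampled example explanation to the baseline system message."""
    example = random.choice(manual_explanations)
    return f"{base_system_message} For example: {example}"
```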
3.3.3. Fine-tuned model
A more involved method of improving model output is taking a pretrained model (the ‘P’ in ‘GPT’) and adjusting some of the model’s weights with further training examples. This process, known as fine-tuning, turns a general-purpose model into a specialized model. A fine-tuned model has the same architecture as the base model, but has more targeted or specialized capabilities. Research has indicated that large language models require relatively few samples for fine-tuning, requiring as few as 100 examples to match the performance of a model trained from scratch on more data (Howard and Ruder, 2018). OpenAI recommends 50 to 100 training examples for fine-tuning GPT-3.5, but notes that “the right number varies greatly based on the exact use case” (OpenAI, 2024a).
Using the OpenAI Playground, we took the baseline model (Section 3.3.1) and fine-tuned it with all 60 handwritten explanations (Section 3.2) paired with their respective erroneous programs. For each training example, we provided our manually written explanation as the assistant message, leaving both the system message and user message as they would be in the baseline. We let the number of epochs, batch size, and LR multiplier all be set automatically, which resulted in 3 epochs, a batch size of 1, and an LR multiplier of 2. In all, fine-tuning took just under 8 minutes. During the evaluation, the fine-tuned model was prompted in exactly the same way as the baseline model.
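We ran the fine-tuning job through the OpenAI Playground; a roughly equivalent sketch using the OpenAI Python SDK is shown below. The file name and helper are illustrative, and the training-record format follows OpenAI’s chat fine-tuning documentation.

```python
import json

from openai import OpenAI

client = OpenAI()

SYSTEM_MESSAGE = (
    "Provide a plain English explanation of why running the Python 2 code "
    "causes an error and how to fix the problem. "
    "Do not output the entire fixed source code."
)


def write_training_file(examples, path="error-explanations.jsonl"):
    """examples: iterable of (erroneous_code, manual_explanation) pairs."""
    with open(path, "w") as f:
        for code, explanation in examples:
            record = {
                "messages": [
                    {"role": "system", "content": SYSTEM_MESSAGE},
                    {"role": "user", "content": code},              # same as the baseline prompt
                    {"role": "assistant", "content": explanation},  # desired output
                ]
            }
            f.write(json.dumps(record) + "\n")


# Upload the 60 training examples and start the fine-tuning job;
# epochs, batch size, and LR multiplier are left to OpenAI's automatic defaults.
uploaded = client.files.create(
    file=open("error-explanations.jsonl", "rb"), purpose="fine-tune"
)
job = client.fine_tuning.jobs.create(
    training_file=uploaded.id, model="gpt-3.5-turbo-1106"
)
```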
3.4. Evaluation
Three authors rated the LLMs’ responses on two axes: feedback quality and extraneous information.
Labels for feedback quality were adapted from Mahajan et al. (2020) via Widjojo and Treude (2023). We partially quote Table 3 from the latter work here:
- Instrumental (I): Response perfectly targets underlying cause of error and provides a clear action on how to fix.
- Helpful (H): Response provides general but not exact help. […]
- Misleading (M): Response does not provide a clear direction on how to fix the issue and/or causes confusion. […]
A critique of Widjojo and Treude (2023) is that, in the original formulation, “misleading” seems to cover two axes: that the primary feedback is misleading, or that the response contains extra information that may further mislead. Indeed, preliminary work revealed that GPT would occasionally produce correct responses with extraneous information that did not directly address the error. This extraneous information ranged from technically correct (“Additionally, you may need to make some adjustments to the function calls and syntax to ensure compatibility with Python 3.”) to incorrect and misleading (“To fix the problem, you can use Python 3 instead, as the gturtle library is designed for Python 3”, said of code that uses gturtle correctly in TigerJython). As a result, we assessed each message on an additional axis: whether the response contained extraneous information. For this, the first three authors collectively agreed on codes for labeling extraneous information:
- No extraneous information: The entirety of the response is relevant to diagnosing and fixing the one programming error.
- Correct, but extraneous information: The response contains irrelevant information, but it is factually accurate or useful.
- Incorrect extraneous information: The response contains irrelevant information that may further mislead.
Evaluation metrics
As in Widjojo and Treude (2023), we report summary metrics (Table 1) that are intended to give an idea of the overall feedback quality of each prompting strategy.
| Metric | Definition | Description |
|---|---|---|
| I-Score | I / (I + H + M) | What proportion of all responses are instrumental (highest quality)? |
| IH-Score | (I + H) / (I + H + M) | What proportion of all responses are useful (instrumental or helpful)? |
| M-Score | M / (I + H + M) | What proportion of all responses are misleading? |
| IH:M ratio | (I + H) : M | How many instrumental/helpful responses are there for every misleading response produced? |
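The summary metrics follow directly from the label counts; a small helper, assuming I-Score = I / (I + H + M) and so on as in the table above, could be:

```python
def summary_metrics(i: int, h: int, m: int) -> dict:
    """Compute the Table 1 metrics from counts of Instrumental/Helpful/Misleading labels."""
    total = i + h + m
    return {
        "I-Score": i / total,
        "IH-Score": (i + h) / total,
        "M-Score": m / total,
        "IH:M": (i + h) / m if m else float("inf"),
    }


# Note: the percentages in Table 2 are computed from the raw ratings of all
# three raters, so plugging in the majority-vote counts gives slightly
# different values.
print(summary_metrics(17, 11, 12))
```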
Protocol
Before rating commenced, the raters convened in person and collectively agreed on the interpretation of all labels. Once all 120 LLM responses were generated (40 erroneous programs × 3 prompting strategies), the raters provided an initial rating of every response, with respect to the erroneous program and the student’s intention. After this initial rating, the authors reconvened and discussed special and marginal cases. There was only one case in which all three authors completely disagreed. After brief discussion, the ratings were revised such that there was a majority opinion for each response.
Table 2. Feedback quality (I, H, M, I-Score, IH-Score, M-Score, IH:M) and extraneous information (None, Correct, Incorrect) for each prompting strategy.

| Strategy | I | H | M | I-Score | IH-Score | M-Score | IH:M | None | Correct | Incorrect |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 17 | 11 | 12 | 45.0% | 70.8% | 29.2% | 2.43:1 | 33 | 6 | 1 |
| One-shot | 21 | 9 | 10 | 50.8% | 76.7% | 23.3% | 3.29:1 | 37 | 2 | 1 |
| Fine-tuned | 21 | 8 | 11 | 52.5% | 71.7% | 28.3% | 2.53:1 | 40 | 0 | 0 |
4. Results
Table 2 summarizes the results from the three raters. We report the raters’ majority opinion for Instrumental/Helpful/Misleading and for the presence of extraneous information, such that these totals add up to 40 (the number of erroneous programs); however, the other measures are calculated on the raw frequencies from all raters. Fleiss’ κ for the three raters was calculated on the two axes separately; the values for feedback quality and for extraneous information may be interpreted as substantial and almost-perfect agreement, respectively (Landis and Koch, 1977).
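As an illustration of the agreement calculation, Fleiss’ κ can be computed from a subjects-by-raters matrix of labels; the sketch below uses statsmodels and made-up ratings.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# One row per LLM response, one column per rater; these labels are made up.
ratings = np.array([
    ["I", "I", "H"],
    ["M", "M", "M"],
    ["H", "I", "I"],
    ["I", "I", "I"],
])

# Convert label assignments into a subjects x categories count table,
# then compute Fleiss' kappa over it.
table, _categories = aggregate_raters(ratings)
print(fleiss_kappa(table, method="fleiss"))
```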
Feedback quality
The feedback quality obtained with the baseline is remarkable: without providing the original programming error message to GPT-3.5, 70.8% of the baseline’s responses were rated useful and 45.0% perfectly explained and fixed the error. Compare this to the 11% perfect fix rate reported in prior work that prompted a more capable model (GPT-4) with only the programming error message (Santos et al., 2023).
We conducted a test to determine whether feedback quality is affected by prompting strategy, and found no statistically significant difference. Thus, we conclude that prompting strategy, whether using an ordinary prompt, a prompt with an example of desired output, or even a model fine-tuned with desired feedback, does not affect the proportion of instrumental, helpful, or misleading explanations produced by GPT-3.5. One can expect between 2.43–3.29 useful explanations for every misleading explanation obtained.
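The text does not name the statistical test; one plausible reconstruction is a chi-squared test of independence on the strategy-by-label contingency table, sketched below with the majority-vote counts from Table 2 purely as an illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: baseline, one-shot, fine-tuned; columns: Instrumental, Helpful, Misleading.
# Counts are the majority-vote labels from Table 2 (illustrative; the actual test
# may have been run on the raw ratings from all three raters instead).
feedback_quality = np.array([
    [17, 11, 12],
    [21,  9, 10],
    [21,  8, 11],
])

chi2, p, dof, _expected = chi2_contingency(feedback_quality)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```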
Extraneous information
A test revealed a statistically significant effect of prompting strategy on the presence of extraneous information in the response. Notably, while the baseline and one-shot strategies produced some explanations that were judged as containing extraneous information, all three raters unanimously agreed that every response from the fine-tuned model was completely devoid of extraneous information. That is, the fine-tuned model’s explanations were 100% on topic.
Figure 1. Stacked histograms of the number of words per explanation, grouped by prompting strategy (top to bottom: baseline, one-shot, fine-tuned). The baseline distribution is roughly normal but right-tailed (25–95 words, mode 45–55); the one-shot distribution is mildly bimodal (15–95 words, modes at 35–45 and 65–75); the fine-tuned distribution spans a notably smaller range (15–75 words, mode 25–35).
Message length
We counted the number of words in each message, where a “word” is defined by calling Python’s str.split() method on each message and counting the length of the resultant list. Figure 1 shows the distributions of word counts, grouped by prompting strategy. A Kruskal-Wallis rank sum test reveals that the differences between the distributions are statistically significant. Post hoc paired Wilcoxon rank sum tests show that statistically significant differences exist between the word-count distribution of the fine-tuned messages and those of both the baseline and the one-shot prompting strategy; further one-tailed tests reveal that the fine-tuned messages are shorter than both the baseline and the one-shot strategy. We conclude that explanations obtained from the fine-tuned model are less loquacious than explanations obtained from the alternatives. However, we found no statistically significant difference between the word-count distributions of the baseline and one-shot strategies.
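A sketch of this length analysis follows, using the word-count definition above and SciPy’s rank-based tests. The post hoc step is a reconstruction: the text describes paired Wilcoxon rank sum tests, while the sketch uses SciPy’s rank-sum (Mann-Whitney U) test for the one-tailed comparison.

```python
from scipy.stats import kruskal, mannwhitneyu


def word_count(message: str) -> int:
    """A 'word' is whatever str.split() produces (whitespace-separated tokens)."""
    return len(message.split())


def compare_lengths(baseline_msgs, one_shot_msgs, fine_tuned_msgs):
    """Kruskal-Wallis across the three strategies, plus a one-tailed post hoc test
    of whether fine-tuned explanations are shorter than the baseline's."""
    groups = [
        [word_count(m) for m in msgs]
        for msgs in (baseline_msgs, one_shot_msgs, fine_tuned_msgs)
    ]
    _h_stat, p_overall = kruskal(*groups)
    _u, p_fine_tuned_shorter = mannwhitneyu(groups[2], groups[0], alternative="less")
    return p_overall, p_fine_tuned_shorter
```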
5. Discussion
5.1. RQ1: How effective are LLM-generated error message explanations that omit the original error message from the prompt?
Our results show that one can expect roughly 2–3 useful error explanations for every misleading error explanation generated by GPT-3.5 when the original programming error message is completely absent from the prompt. This is not too surprising, as prior work has shown that the presence of source code context dramatically improves an LLM’s “perfect fix” rate for error explanations from 11% to 79% (Santos et al., 2023). Indeed, early work on applying various automated approaches to fixing students’ syntax errors did not use the error message at all (Santos et al., 2018; Gupta et al., 2017; Campbell et al., 2014). When a later work finally incorporated Java’s compiler error messages into their AI tool, they saw only a modest 2.7% improvement in fix rates (Ahmed et al., 2022). They also claimed that their model learned to ignore the fix hinted at in the message. This calls into question the utility of traditional programming error messages as a debugging tool, at least for novice programmers.
5.2. RQ2: How does prompting strategy affect various aspects of LLM-generated error explanations?
Prompting strategy had an effect on the presence of extraneous information and, in the case of the fine-tuned messages, on the length of the response. However, prompting strategy did not seem to affect the veracity of the error explanations produced. It is possible that using a larger model trained on more data would be more effective, as has been tried in other studies (Mozannar et al., 2024; Santos et al., 2023). However, we note that prior work has found that, when it comes to using LLMs to solve programming tasks, “further gains in benchmark performance do not necessarily translate into equivalent gains in human productivity” (Mozannar et al., 2024). The aspects that make an LLM’s response actionable may be more complex than whether the response simply contains the correct answer within it.
GPT-3.5’s responses did not contain much extraneous information overall, but prompting strategy did seem to lower it—especially using the fine-tuned model. A lack of extraneous information does not necessarily imply a shorter response, as the one-shot strategy’s responses were not any shorter than the baseline’s messages overall. Providing just one example of the desired LLM response may not be enough to produce shorter, more relevant responses; however, if producing shorter, more relevant responses is a priority, then fine-tuning is more effective.
5.3. RQ3: What trade-offs are there between different prompting strategies in the context of generating error explanations?
The three prompting strategies require varying amounts of effort (from both students and instructors), increasing from the baseline strategy, to one-shot/few-shot prompting, to the most time-consuming: fine-tuning. Since the veracity of the feedback does not see any appreciable improvement with the prompting strategies that require more effort, it may be wiser to use a more capable model, if available.
However, the usability of an error explanation may not just be a function of whether it gives a correct answer, but also of whether it is concise. For this, it seems that fine-tuning is the most effective strategy overall. From our experience, fine-tuning is not an overly onerous task: it took two authors less than one working day to create 30 error explanations each (an embarrassingly parallel task), and only 8 minutes to fine-tune the model proper. We consider this to be a relatively small time investment for worthwhile results.
The real challenge in fine-tuning is finding diverse examples of students’ programming errors. To properly train models, it is imperative that training data have plentiful and varied examples, and not feature a class imbalance where some categories of programming errors occur disproportionately. Such an imbalance is, however, practically unavoidable, since students’ programming errors follow a long-tail distribution (McCall and Kölling, 2014). That is, the most common programming errors are extremely frequent, and the least common programming errors are vanishingly rare. In our sample of 60 programming errors, we found 15 cases (25%) where the error was a misspelling of a variable or function name, and 6 cases (10%) where the mistake was forgetting to initialize the environment. These imbalances may have caused our fine-tuned model to overfit and thus not improve in its feedback quality rating. It is worthwhile to note that programming errors are not the same as programming error messages: comparing the distributions of both reflects this (McCall and Kölling, 2014). Thus, care must be taken when creating datasets of programming errors for the purpose of fine-tuning models, such that there are examples of broad and diverse programming errors, not programming error messages.
5.4. Implications for pedagogy
Minimizing extraneous information in error explanations may be invaluable because it reduces extraneous cognitive load for the student. This is especially true in the context of novice programmers using pedagogically-oriented language dialects, as is the case with TigerJython. Since general-purpose large language models like GPT-3.5 have been trained on code from a wide variety of domains, their extra hints may be unhelpful or even distracting when used with highly-specialized teaching languages like TigerJython. We noticed that GPT-3.5 had a fixation on making the code compatible with Python 3. Of the 10 cases of extraneous information that we labeled, 8 were messages either stating that the code was not compatible with Python 3, or suggesting how to change the code to be compatible with Python 3. None of these suggestions would be relevant to students who were debugging TigerJython (Python 2) code. It seems that LLMs put students who use pedagogically-oriented languages at a disadvantage because the LLMs’ output skews heavily towards the norms of mainstream programming languages and professional software development practices.
An opportunity that this work presents is an exercise in which students reflect on what kind of feedback they require when debugging. Before debugging with an LLM such as ChatGPT, students could be asked to craft examples of the structured feedback that they would like to receive, much like the template defined in Section 3.2. Students would then use their example explanations as part of a one-shot or few-shot prompt to fix a novel programming error, and reflect on the effectiveness of the resultant error explanation.
5.5. Limitations
As with other studies that evaluate LLMs on synthetic benchmarks, the true test of how helpful LLMs are is demonstrated in how students actually use the output in practice. Related work has shown that promising results in benchmarks do not necessarily translate to promising results in practice (Mozannar et al., 2024; Prather et al., 2024; Nguyen et al., 2024; Vaithilingam et al., 2022; Santos and Becker, 2024).
Another limitation is our dataset, which we trimmed down for practical purposes: evaluation was less onerous for the raters when the programs were short (20 lines or fewer) and when it was relatively simple to assess whether the one programming error was fixed. In reality, students’ programs routinely contain several programming errors simultaneously, and students are not limited to 20-line programs. Additionally, our “uncensoring” of string literals means that some of the erroneous programs we used were not, strictly speaking, identical to the ones that students actually wrote.
6. Conclusion
We have demonstrated that, when prompting GPT-3.5 without programming error messages, one can conservatively expect 2–3 useful error explanations for every misleading response. Additionally, prompting strategy does not appreciably change the accuracy of the generated error explanations, but it may at least make the explanations shorter and more focused. This work adds to the growing body of evidence that suitable programming error feedback relies more on the erroneous source code context than on the resultant error message. More broadly, we hope instructors focus on the causes of, and resolutions to, underlying programming errors rather than on programming error messages, regardless of their use of generative AI. If GenAI tools such as ChatGPT are introduced in the classroom, we suggest that it is a better use of time to focus on explaining the underlying programming errors than to prompt chatbots to explain programming error messages in isolation. We hope this work better equips educators to effectively utilize LLMs in the classroom and helps establish realistic expectations regarding the capabilities of the now ubiquitous generative AI tools.
Acknowledgements.
We are indebted to Tobias Kohn for providing the TigerJython dataset and for his kind advice.

References
- Ahmed et al. (2022) Toufique Ahmed, Noah Rose Ledesma, and Premkumar Devanbu. 2022. SynShine: Improved Fixing of Syntax Errors. IEEE Transactions on Software Engineering 49, 4 (2022), 2169–2181.
- Becker et al. (2019) Brett A. Becker, Paul Denny, Raymond Pettit, Durell Bouchard, Dennis J. Bouvier, Brian Harrington, Amir Kamil, Amey Karkare, Chris McDonald, Peter-Michael Osera, Janice L. Pearce, and James Prather. 2019. Compiler Error Messages Considered Unhelpful: The Landscape of Text-Based Programming Error Message Research. In Proceedings of the Working Group Reports on Innovation and Technology in Computer Science Education (Aberdeen, Scotland, UK) (ITiCSE-WGR ’19). ACM, NY, NY, USA, 177–210. https://doi.org/10.1145/3344429.3372508
- Campbell et al. (2014) Hazel Victoria Campbell, Abram Hindle, and José Nelson Amaral. 2014. Syntax Errors Just Aren’t Natural: Improving Error Reporting with Language Models. In Proceedings of the 11th Working Conference on Mining Software Repositories (Hyderabad, India) (MSR 2014). Association for Computing Machinery, New York, NY, USA, 252–261. https://doi.org/10.1145/2597073.2597102
- Denny et al. (2021) Paul Denny, James Prather, Brett A. Becker, Catherine Mooney, John Homer, Zachary C Albrecht, and Garrett B. Powell. 2021. On Designing Programming Error Messages for Novices: Readability and Its Constituent Factors. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI ’21). ACM, NY, NY, USA, Article 55, 15 pages. https://doi.org/10.1145/3411764.3445696
- Dy and Rodrigo (2010) Thomas Dy and Ma. Mercedes Rodrigo. 2010. A Detector for Non-Literal Java Errors. In Proceedings of the 10th Koli Calling International Conference on Computing Education Research (Koli, Finland) (Koli Calling ’10). ACM, NY, NY, USA, 118–122. https://doi.org/10.1145/1930464.1930485
- Finnie-Ansley et al. (2022) James Finnie-Ansley, Paul Denny, Brett A. Becker, Andrew Luxton-Reilly, and James Prather. 2022. The Robots Are Coming: Exploring the Implications of OpenAI Codex on Introductory Programming. In Proceedings of the 24th Australasian Computing Education Conference (Virtual Event, Australia) (ACE ’22). ACM, New York, NY, USA, 10–19. https://doi.org/10.1145/3511861.3511863
- Finnie-Ansley et al. (2023) James Finnie-Ansley, Paul Denny, Andrew Luxton-Reilly, Eddie Antonio Santos, James Prather, and Brett A. Becker. 2023. My AI Wants to Know if this will be on the Exam: Testing OpenAI’s Codex on CS2 Programming Exercises. In Australasian Computing Education Conference (Melbourne, VIC, Australia) (ACE ’23). ACM, NY, NY, USA, 8 pages. https://doi.org/10.1145/3576123.3576134
- Gupta et al. (2017) Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. 2017. DeepFix: Fixing Common C Language Errors by Deep Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31. Association for the Advancement of Artificial Intelligence, San Francisco, CA, USA, 1345–1351. https://doi.org/10.1609/aaai.v31i1.10742
- Hindle et al. (2012) Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the Naturalness of Software. In 34th International Conference on Software Engineering (ICSE). IEEE, Zurich, CH, 837–847. https://doi.org/10.1109/ICSE.2012.6227135 ISSN: 1558-1225.
- Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. arXiv:1801.06146 [cs.CL] https://arxiv.org/abs/1801.06146
- Kohn (2019) Tobias Kohn. 2019. The Error Behind the Message: Finding the Cause of Error Messages in Python. In Proceedings of the 50th ACM Technical Symposium on Computer Science Education. ACM, NY, NY, USA, 524–530.
- Kohn and Manaris (2020) Tobias Kohn and Bill Manaris. 2020. Tell Me What’s Wrong: A Python IDE with Error Messages. In Proceedings of the 51st ACM Technical Symposium on Computer Science Education. ACM, NY, NY, USA, 1054–1060. http://doi.org/10.1145/3328778.3366920
- Landis and Koch (1977) J. Richard Landis and Gary G. Koch. 1977. The Measurement of Observer Agreement for Categorical Data. Biometrics 33, 1 (Mar 1977), 159–174.
- Lau and Guo (2023) Sam Lau and Philip Guo. 2023. From “Ban It Till We Understand It” to “Resistance is Futile”: How University Programming Instructors Plan to Adapt as More Students Use AI Code Generation and Explanation Tools such as ChatGPT and GitHub Copilot. In Proceedings of the 2023 ACM Conference on International Computing Education Research - Volume 1 (Chicago, IL, USA) (ICER ’23). Association for Computing Machinery, New York, NY, USA, 106–121. https://doi.org/10.1145/3568813.3600138
- Leinonen et al. (2023) Juho Leinonen, Arto Hellas, Sami Sarsa, Brent Reeves, Paul Denny, James Prather, and Brett A. Becker. 2023. Using Large Language Models to Enhance Programming Error Messages. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1 (Toronto ON, Canada) (SIGCSE 2023). ACM, NY, NY, USA, 563–569. https://doi.org/10.1145/3545945.3569770
- Mahajan et al. (2020) Sonal Mahajan, Negarsadat Abolhassani, and Mukul R. Prasad. 2020. Recommending Stack Overflow Posts for Fixing Runtime Exceptions Using Failure Scenario Matching. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Virtual Event, USA) (ESEC/FSE 2020). Association for Computing Machinery, New York, NY, USA, 1052–1064. https://doi.org/10.1145/3368089.3409764
- Marceau et al. (2011) Guillaume Marceau, Kathi Fisler, and Shriram Krishnamurthi. 2011. Mind Your Language: On Novices’ Interactions with Error Messages. In Proceedings of the 10th SIGPLAN Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Portland, Oregon, USA) (Onward! 2011). ACM, NY, NY, USA, 3–18. https://doi.org/10.1145/2048237.2048241
- McCall and Kölling (2014) Davin McCall and Michael Kölling. 2014. Meaningful Categorisation of Novice Programmer Errors. In 2014 IEEE Frontiers in Education Conference (FIE) Proceedings. IEEE, Madrid, Spain, 1–8.
- McCall and Kölling (2019) Davin McCall and Michael Kölling. 2019. A New Look at Novice Programmer Errors. ACM Transactions on Computing Education 19, 4 (July 2019), 38:1–38:30. https://doi.org/10.1145/3335814
- Mozannar et al. (2024) Hussein Mozannar, Valerie Chen, Mohammed Alsobay, Subhro Das, Sebastian Zhao, Dennis Wei, Manish Nagireddy, Prasanna Sattigeri, Ameet Talwalkar, and David Sontag. 2024. The RealHumanEval: Evaluating Large Language Models’ Abilities to Support Programmers. arXiv:2404.02806 [cs.SE]
- Nass and Moon (2000) Clifford Nass and Youngme Moon. 2000. Machines and Mindlessness: Social Responses to Computers. Journal of Social Issues 56, 1 (2000), 81–103. https://doi.org/10.1111/0022-4537.00153
- Nguyen et al. (2024) Sydney Nguyen, Hannah McLean Babe, Yangtian Zi, Arjun Guha, Carolyn Jane Anderson, and Molly Q Feldman. 2024. How Beginning Programmers and Code LLMs (Mis)read Each Other. arXiv:2401.15232 [cs.HC]
- OpenAI (2024a) OpenAI. 2024a. Fine-tuning - OpenAI API. OpenAI. Retrieved 2024-07-21 from https://platform.openai.com/docs/guides/fine-tuning/example-count-recommendations
- OpenAI (2024b) OpenAI. 2024b. Prompt engineering - OpenAI API. OpenAI. Retrieved 2024-07-21 from https://platform.openai.com/docs/guides/prompt-engineering/tactic-provide-examples
- Phung et al. (2023) Tung Phung, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, and Gustavo Soares. 2023. Generating High-Precision Feedback for Programming Syntax Errors using Large Language Models. arXiv:2302.04662 [cs.PL]
- Prather et al. (2024) James Prather, Brent Reeves, Juho Leinonen, Stephen MacNeil, Arisoa S. Randrianasolo, Brett Becker, Bailey Kimmel, Jared Wright, and Ben Briggs. 2024. The Widening Gap: The Benefits and Harms of Generative AI for Novice Programmers. arXiv:2405.17739 [cs.AI] https://arxiv.org/abs/2405.17739
- Santos and Becker (2024) Eddie Antonio Santos and Brett A. Becker. 2024. Not the Silver Bullet: LLM-enhanced Programming Error Messages are Ineffective in Practice. In Proceedings of the 2024 Conference on United Kingdom & Ireland Computing Education Research (Manchester, United Kingdom) (UKICER ’24). ACM, New York, NY, USA, Article 5, 7 pages. https://doi.org/10.1145/3689535.3689554
- Santos et al. (2018) Eddie Antonio Santos, Hazel Victoria Campbell, Dhvani Patel, Abram Hindle, and José Nelson Amaral. 2018. Syntax and Sensibility: Using Language Models to Detect and Correct Syntax Errors. In 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, Campobasso, Italy, 311–322.
- Santos et al. (2023) Eddie Antonio Santos, Prajish Prasad, and Brett A. Becker. 2023. Always Provide Context: The Effects of Code Context on Programming Error Message Enhancement. In Proceedings of the ACM Conference on Global Computing Education Vol 1 (Hyderabad, India) (CompEd 2023). ACM, New York, NY, USA, 147–153. https://doi.org/10.1145/3576882.3617909
- Shneiderman (1982) Ben Shneiderman. 1982. System Message Design: Guidelines and Experimental Results. In Directions in Human/Computer Interaction, Albert Badre and Ben Shneiderman (Eds.). Ablex Publishing Company, Norwood, NJ, Chapter 3, 55–77.
- Vaithilingam et al. (2022) Priyan Vaithilingam, Tianyi Zhang, and Elena L. Glassman. 2022. Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models. In CHI Conference on Human Factors in Computing Systems Extended Abstracts. ACM, NY NY, USA, 1–7.
- Widjojo and Treude (2023) Patricia Widjojo and Christoph Treude. 2023. Addressing Compiler Errors: Stack Overflow or Large Language Models? arXiv:2307.10793 [cs.SE]