
Division of AI Data Convergence, Hankuk University of Foreign Studies, Seoul, South Korea
jaekeol.choi@hufs.ac.kr

Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models

Jaekeol Choi
Abstract

Relevance evaluation of a query and a passage is essential in Information Retrieval (IR). Recently, numerous studies have been conducted on tasks related to relevance judgment using Large Language Models (LLMs) such as GPT-4, demonstrating significant improvements. However, the efficacy of LLMs is considerably influenced by the design of the prompt. The purpose of this paper is to identify which specific terms in prompts positively or negatively impact relevance evaluation with LLMs. We employed two types of prompts: those used in previous research and those generated automatically by LLMs. By comparing the performance of these prompts in both few-shot and zero-shot settings, we analyze the influence of specific terms in the prompts. We have observed two main findings from our study. First, we discovered that prompts using the term ‘answer’ lead to more effective relevance evaluations than those using ‘relevant.’ This indicates that a more direct approach, focusing on answering the query, tends to enhance performance. Second, we noted the importance of appropriately balancing the scope of ‘relevance.’ While the term ‘relevant’ can extend the scope too broadly, resulting in less precise evaluations, an optimal balance in defining relevance is crucial for accurate assessments. The inclusion of few-shot examples helps in defining this balance more precisely. By providing clearer contexts for the term ‘relevance,’ few-shot examples contribute to refining the relevance criteria. In conclusion, our study highlights the significance of carefully selecting terms in prompts for relevance evaluation with LLMs.

Keywords:
ChatGPT, GPT-3.5, GPT-4, Information Retrieval, Large Language Models (LLMs), relevance evaluation, prompt engineering, passage ranking.

1 Introduction

Ranking models are foundational in the domain of Information Retrieval (IR). Their success relies heavily on relevance judgment sets that are used as standards during both training and testing stages. Traditionally, crowd-sourced human assessors have been used for relevance judgment, as indicated by several studies [1, 3]. However, this method is often time-consuming, expensive, and can yield inconsistent results due to the inherent subjectivity of human judgment [20, 21].

As technology keeps advancing, diverse machine learning techniques have stepped into the realm of relevance judgment [29, 1, 5, 7]. Driven by sophisticated algorithms, these methods tried to replicate or even enhance the human ability to discern relevance within vast information collections. Despite their potential, there remains skepticism among researchers about whether these techniques can match human accuracy and reliability in relevance judgment.

The major change came about with the advent of LLMs, notably GPT-3 and GPT-4. With their large architectures and extensive training datasets, these LLMs brought the possibility of automated relevance judgments. The performance of these models across diverse natural language processing tasks has fostered a renewed belief in the ability of machines to evaluate passage relevance accurately. Encouraged by this paradigm shift, several relevance judgment approaches [8, 10] and ranking models [30] rooted in GPT architectures have been proposed. These models have demonstrated exceptional performance, often equaling or surpassing traditional methods.

However, the accuracy and robustness of relevance assessment using LLMs are significantly influenced by the prompts employed during the evaluation [18, 31]. These prompts serve as critical guides, aligning the model’s responses with the user’s intent. Consequently, prompt formulation becomes a pivotal component, demanding careful design and optimization.

In this paper, we primarily focus on the prompts used for relevance evaluation in GPT models, particularly examining which terms in the prompts are beneficial or detrimental to performance. We investigate how the performance of LLMs varies with the use of different types of prompts: those utilized in previous research and those generated by LLMs. Our aim is to identify which terms in the prompts improve or impair the performance in relevance assessment tasks. To provide a comprehensive understanding, we conduct these experiments in both few-shot and zero-shot settings.

This study concludes that the term ‘answer’ in prompt design is notably more effective than ‘relevant’ for relevance evaluation tasks using LLMs. This finding emphasizes the importance of a well-calibrated approach to defining relevance. While ‘relevant’ broadly encompasses various aspects of the query-passage relationship, ‘answer’ more directly targets the core of the query, leading to more precise and effective evaluations. Therefore, balancing the scope of ‘relevance’ in prompt design is crucial for enhancing the efficiency and accuracy of LLMs in relevance assessment.

The rest of this paper is organized as follows: ‘2 Related Work’ delves into the background and previous studies. ‘3 Methodology’ outlines the methods and approaches used in our study, including the details of the LLMs and the dataset. ‘4 Experimental Results’ presents the findings from our experiments, providing a comprehensive analysis of the performance of different prompts. ‘5 Discussion’ explores the implications of our findings. Finally, ‘6 Conclusions’ summarizes the key insights from our study.

2 Related Work

The field of IR has seen a significant evolution with the advent of advanced machine learning models and techniques. This section reviews the relevant literature, focusing on the development of relevance judgment methods in IR and the role of prompt engineering in the effective utilization of LLMs.

2.1 Relevance Judgment in Information Retrieval

The relevance evaluation between a query and a passage has been a fundamental task since the inception of ranking systems. This assessment has historically been conducted in a binary manner, categorizing results as either relevant or non-relevant, but has evolved to include graded relevance scales offering more detailed evaluations.

In the realm of traditional IR, the reliance on human assessors for relevance judgment has been extensively documented [1, 3]. Despite their ability to provide nuanced evaluations, this approach has been criticized for its time and cost inefficiencies, as well as the subjective variability in results it can produce [20, 21].

The advancement of machine learning and its integration into IR has marked a transition towards automated relevance judgment. This area, particularly the use of transformer-based models like BERT, has been the focus of recent research [7]. The challenge, however, lies in achieving a balance between the precision offered by human assessment and the scalability of automated methods.

The introduction of LLMs, especially GPT-3 and GPT-4, has further transformed the landscape of relevance judgment. Initial studies, such as those by [32] and [8], explored the use of GPT-3 in annotation tasks, including relevance judgment. Sun et al. [30]’s research extends this to examining GPT-3’s broader capabilities in data annotation. In a distinct approach, MacAvaney and Soldaini [19] investigated the use of LLMs for evaluating unassessed documents, aiming to improve the consistency and trustworthiness of these evaluations. Complementing this, Thomas et al. [31] delved into the integration of LLMs for comprehensive relevance tagging, highlighting their comparable precision to human annotators. In contrast, Faggioli et al. [10] have presented theoretical concerns regarding the exclusive use of GPT models for independent relevance judgment.

While extensive research has been conducted in this field, the specific influence of terms within a prompt on relevance evaluation remains unexplored. This study seeks to bridge this gap by investigating the impact of individual terms used in prompts.

2.2 Few-shot and Zero-shot Approaches

Recent advancements in LLMs have emphasized their capability for in-context learning, classified as either few-shot or zero-shot based on the presence of in-context examples. Few-shot learning, where a model is given a limited set of examples, has historically shown superior performance over zero-shot learning, which relies on instructions without examples, as highlighted by Brown et al. [4].

The “pre-train and prompt” paradigm emphasizes the distinction between few-shot prompts (conditioned on task examples) and zero-shot prompts (template-only). While few-shot learning was traditionally favored, recent studies, including those on GPT-4, suggest that zero-shot approaches can sometimes outperform few-shot methods, particularly in specific domains [13, 22].

In our study, to investigate the terms in prompts, we conduct experiments using both few-shot and zero-shot settings and compare their outcomes.

2.3 Advances in Prompt Engineering

Prompt engineering has emerged as a critical factor in harnessing the full potential of LLMs across various natural language processing applications. The formulation of a prompt is instrumental in guiding an LLM’s output, significantly influencing its performance in diverse tasks [27, 4]. The art of crafting effective prompts involves meticulous design and strategic engineering, ensuring that prompts are precise and contextually relevant [25, 11, 28].

The increasing complexity of LLMs has spurred interest in developing sophisticated prompt tuning methods. These methods often utilize gradient-based approaches to optimize prompts over a continuous space, aiming for maximal efficiency and efficacy [17, 24]. However, the practical application of these methods can be limited due to constraints such as restricted access to the models’ gradients, particularly when using API-based models. This challenge has led to the exploration of discrete prompt search techniques, including prompt generation [2], scoring [33], and paraphrasing [12].

In the broader context of prompt-learning, or “prompting,” the approach is increasingly recognized as a frontier in natural language processing, seamlessly bridging the gap between the pre-training and fine-tuning phases of model development [14, 9]. This technique is particularly valuable in low-data environments, where conventional training methods may be less effective [26, 15, 23].

Within the realm of prompt-learning, two primary strategies are employed: few-shot and zero-shot learning. Liang et al. [16] demonstrated a few-shot technique for generating relevance, while studies like those by Sun et al. [30] and Dai et al. [6] have successfully applied few-shot learning in various scenarios. Conversely, Ding et al. [9] suggested that with an appropriate template, zero-shot prompt-learning could yield results surpassing those of extensive fine-tuning, emphasizing the power and flexibility of well-engineered prompts.

So far, existing research has paid little attention to the individual terms within a prompt. This study is important because even small changes in a prompt can lead to different results. Our research, which concentrates on individual terms, can be considered a form of micro-level prompt engineering.

3 Methodology

Figure 1: A prompt example for relevance evaluation. This example utilizes 2-shot examples.

Prompts for relevance evaluation, as shown in Figure 1, include an instruction to guide the LLM, in-context few-shot examples for clarity, and an input as the target task. Using these elements, LLMs generate the corresponding output. We apply this template in conducting our experiments to determine which terms in prompts have an impact on performance.
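As an illustration of this structure, the sketch below (our own illustration; the helper name and the abridged example texts are assumptions, not code from the paper) assembles a prompt from an instruction, optional in-context examples, and the target query-passage pair.

# A minimal sketch of the prompt structure in Figure 1: instruction,
# optional few-shot examples, then the target query-passage input.
FEW_SHOT_EXAMPLES = [
    # (query, passage, label) triples; passages abridged here for brevity
    ("how many eye drops per ml", "Its 25 drops per ml, you guys are all wrong. ...", "Yes"),
    ("how many eye drops per ml", "RE: How many eyedrops are there in a 10 ml bottle of Cosopt? ...", "No"),
]

def build_prompt(instruction: str, query: str, passage: str, examples=FEW_SHOT_EXAMPLES) -> str:
    """Concatenate the instruction, few-shot examples, and the target input."""
    parts = [instruction]
    for ex_query, ex_passage, ex_label in examples:
        parts.append(f"Query: {ex_query}\nPassage: {ex_passage}\nAnswer: {ex_label}")
    parts.append(f"Query: {query}\nPassage: {passage}\nAnswer:")
    return "\n\n".join(parts)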

3.1 Evaluation method

To evaluate the effectiveness of each prompt in the relevance evaluation task, an objective metric is required. For this purpose, we decided to use the similarity between the evaluations conducted by humans and those conducted by the LLM using the prompt. To measure the similarity between the two sets of evaluations, we utilize Cohen’s kappa (κ) coefficient, a statistical measure for inter-rater reliability that accounts for chance agreement. This measure compares the agreement between relevance labels generated by the LLM and human judgments, reflecting the quality of the prompt. Higher kappa values indicate a stronger alignment between the LLM and human evaluations. The Cohen’s kappa coefficient is calculated using the following formula:

κ = (P_o - P_e) / (1 - P_e)   (1)

In this equation, P_o represents the observed agreement between the two sets of evaluations, and P_e is the expected agreement by chance. The kappa value ranges from -1 to 1, where 1 indicates perfect agreement, 0 indicates no agreement beyond chance, and -1 indicates total disagreement. A higher kappa value suggests that the LLM’s relevance evaluations are more closely aligned with human assessments, indicating a higher quality of the prompt in guiding the LLM to make evaluations similar to those of human judges.
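For concreteness, the following sketch (our own illustration; the toy labels are invented) computes Eq. (1) directly and cross-checks it against scikit-learn's implementation.

# A direct implementation of Eq. (1), cross-checked with scikit-learn.
from sklearn.metrics import cohen_kappa_score

def manual_kappa(llm_labels, human_labels):
    n = len(llm_labels)
    categories = set(llm_labels) | set(human_labels)
    p_o = sum(a == b for a, b in zip(llm_labels, human_labels)) / n        # observed agreement
    p_e = sum((llm_labels.count(c) / n) * (human_labels.count(c) / n)      # chance agreement
              for c in categories)
    return (p_o - p_e) / (1 - p_e)

llm_labels = ["Yes", "No", "Yes", "Yes", "No", "No"]     # toy LLM judgments
human_labels = ["Yes", "No", "No", "Yes", "No", "Yes"]   # toy human judgments
print(manual_kappa(llm_labels, human_labels))            # 0.333...
print(cohen_kappa_score(llm_labels, human_labels))       # identical value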

Table 1: Templates used for prompt generation and prompt analysis with LLMs.

Usage        Template
Generation   Instruction: When given a query, a passage, and a few examples, generate a prompt that can make an output from the given input.
             Example 1 - Input:[query,passage]\nOutput:[Yes/No]
             Example 2 - Input:[query,passage]\nOutput:[Yes/No]
             Generate prompt:
Analysis     Instruction: Which terms are common in these prompts that have a key role to evaluate relevance?
             Prompt 1: [Prompt]
             Prompt 2: [Prompt]
             Find terms:

3.2 Prompts and Few-shot Examples

We utilize two types of prompts, as shown in Table 7 of Appendix B. The first type consists of prompts named with an ‘M’, sourced from previous research [16, 30, 10]. The second type includes prompts generated using the template in Table 1, which are named with a ‘G’. After assessing the performance of both prompt types, we aim to determine which prompts perform better. Following the experiments, we will analyze whether there are any terms common to the more effective prompts. If common terms are identified, it would suggest that these terms play a crucial role in the effectiveness of the prompt.

We conduct the experiments under both zero-shot and few-shot settings. Few-shot examples, derived from [10], are illustrated in Table 6 of Appendix 0.A. These few-shot examples consist of four instances: two are positive examples, and the other two are negative ones. To ensure a fair comparison, we apply the same set of few-shot examples across all prompts.
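The overall evaluation procedure can be sketched as follows (our own illustration; 'PROMPTS', 'samples', and the 'llm_judge' helper are hypothetical placeholders rather than code from the paper).

# A sketch of the evaluation loop: every prompt is run in a zero-shot and a
# few-shot setting over the same sampled qrels and scored with Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

def evaluate_prompt(instruction, samples, examples):
    """samples: list of (query, passage, human_label) tuples; labels are 'Yes'/'No'."""
    llm_labels, human_labels = [], []
    for query, passage, human_label in samples:
        prompt = build_prompt(instruction, query, passage, examples)
        llm_labels.append(llm_judge(prompt))      # hypothetical LLM call; one possible version is sketched in Sect. 4.1.1
        human_labels.append(human_label)
    return cohen_kappa_score(llm_labels, human_labels)

results = {}
for name, instruction in PROMPTS.items():         # e.g. {"M1": "Does the passage answer the query? ...", ...}
    results[(name, "zero-shot")] = evaluate_prompt(instruction, samples, examples=[])
    results[(name, "few-shot")] = evaluate_prompt(instruction, samples, FEW_SHOT_EXAMPLES)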

3.3 Analysis

We analyze which terms are beneficial for relevance evaluation. Initially, we compare the performance of the prompts illustrated in Table 7. We then categorize the prompts into those with high performance and those with lower performance and look for distinguishing characteristics in each group. To identify the specific terms that play a role, we utilize the analysis prompts provided in Table 1. Furthermore, we compare how the results of each group vary depending on the presence or absence of few-shot examples.

We advance our analysis by constructing confusion matrices for the prompts, allowing for a more in-depth evaluation of their impact. Through the examination of precision and recall values derived from these matrices, we gain insights into the roles played by different terms within the context of relevance evaluation.
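A sketch of this matrix-based analysis (our own illustration), treating ‘Yes’ (relevant) as the positive class, is shown below.

# Precision and recall derived from the binary confusion matrix, with the
# "Yes" (relevant) label as the positive class. Note that scikit-learn places
# human labels on the rows and LLM predictions on the columns, i.e. transposed
# relative to the layout used later in Table 5.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

def matrix_and_scores(llm_labels, human_labels):
    cm = confusion_matrix(human_labels, llm_labels, labels=["Yes", "No"])
    precision = precision_score(human_labels, llm_labels, pos_label="Yes")
    recall = recall_score(human_labels, llm_labels, pos_label="Yes")
    return cm, precision, recall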

4 Experimental Results

This section presents the results of our experimental investigation into the effectiveness of various prompts in relevance evaluation tasks using LLMs. We detail the experimental setup, including the models and datasets used, and then delve into the outcomes of our experiments. These results provide crucial insights into how different prompt designs and key terms influence the performance of LLMs in relevance judgment tasks.

4.1 Experimental Setup

4.1.1 Large Language Models

For our experiments, we utilize GPT-3.5-turbo and GPT-4, both accessed via OpenAI’s APIs. GPT-3.5-turbo, with its 178 billion parameters, enhances user interaction by providing clearer and more precise answers. As the most advanced in the series, GPT-4 has 1.76 trillion parameters and outperforms its predecessors in processing and contextual understanding.
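A minimal sketch of how a single judgment might be obtained through the API is shown below (our own illustration; the model names match those above, but the message format, temperature setting, and answer parsing are assumptions, not settings reported in this paper).

# One possible wrapper around the OpenAI chat API for a single relevance judgment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_judge(prompt_text: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,                       # or "gpt-3.5-turbo"
        messages=[{"role": "user", "content": prompt_text}],
        temperature=0,                     # deterministic decoding is an assumption
    )
    answer = response.choices[0].message.content.strip()
    return "Yes" if answer.lower().startswith("yes") else "No"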

4.1.2 Dataset

Table 2: Overview of the TREC DL Passage datasets utilized in the study. The datasets from 2019 to 2021 are used for evaluating the performance of prompts. The table details the year of the dataset, the number of queries, the total number of query relevance judgments (qrels), and the number of sampled qrels used in the study.

Usage        TREC DL year   Number of queries   Number of qrels   Number of sampled qrels
Evaluation   2019           43                  9,260             200
Evaluation   2020           54                  11,386            200
Evaluation   2021           53                  10,828            200

For our experiments, we utilize the test sets from the MS MARCO TREC DL Passage datasets spanning three years (https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019, https://microsoft.github.io/msmarco/TREC-Deep-Learning-2020, https://microsoft.github.io/msmarco/TREC-Deep-Learning-2021). As depicted in Table 2, we randomly sampled 200 data points from each year’s test dataset, ensuring that every query in the full set is included. These sampled datasets are then used to evaluate the prompts.
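The paper does not detail the exact sampling procedure; one straightforward way to draw 200 qrels per year while guaranteeing that every query appears at least once is sketched below (an assumption for illustration only, not the authors' documented method).

# Sample k qrels per year: first one judgment per query, then top up at random.
import random

def sample_qrels(qrels, k=200, seed=0):
    """qrels: list of (query_id, passage_id, grade) tuples."""
    rng = random.Random(seed)
    by_query = {}
    for row in qrels:
        by_query.setdefault(row[0], []).append(row)
    sampled = [rng.choice(rows) for rows in by_query.values()]   # guarantees query coverage
    remaining = [row for row in qrels if row not in sampled]
    sampled += rng.sample(remaining, k - len(sampled))           # fill up to k judgments
    return sampled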

Relevance in these datasets is rated on a 4-point scale: “Perfectly relevant,” “Highly relevant,” “Related,” and “Irrelevant.”

For binary classification tasks, we simplify this 4-point relevance scale to a binary “Yes” or “No” judgment. Specifically, the categories of “Perfectly relevant” and “Highly relevant” are consolidated into a “Yes” category to indicate relevance, while “Related” and “Irrelevant” are classified as “No.”
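Assuming the standard TREC DL grade numbering (3 = Perfectly relevant, 2 = Highly relevant, 1 = Related, 0 = Irrelevant), this collapsing step reduces to a simple threshold.

# Collapse the 4-point TREC DL grades into the binary labels used here.
def to_binary(grade: int) -> str:
    return "Yes" if grade >= 2 else "No"

assert to_binary(3) == "Yes"   # Perfectly relevant
assert to_binary(2) == "Yes"   # Highly relevant
assert to_binary(1) == "No"    # Related
assert to_binary(0) == "No"    # Irrelevant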

4.2 Relevance Evaluation Result of Prompts

Table 3: Comparative Results of Relevance Evaluation in Zero-shot and Few-shot Settings: This table presents the performance of various prompts under zero-shot and few-shot scenarios. The top five performing prompts are highlighted in bold, while the bottom five are underlined. We provide the respective average performances for these groups in both GPT-3.5-turbo and GPT-4 models. A ‘*’ symbol denotes a significant difference at the 95% confidence level.

Type        Name    Zero-shot GPT-3.5-turbo   Zero-shot GPT-4   Few-shot GPT-3.5-turbo   Few-shot GPT-4
Manual      M1      0.389 (±0.115)            0.450 (±0.090)    0.339 (±0.059)           0.471 (±0.041)
            M2      0.326 (±0.032)            0.426 (±0.061)    0.274 (±0.064)           0.437 (±0.046)
            M3      0.319 (±0.033)            0.396 (±0.086)    0.330 (±0.025)           0.460 (±0.046)
            M4      0.204 (±0.019)            0.344 (±0.073)    0.310 (±0.041)           0.433 (±0.028)
Generated   G1      0.301 (±0.046)            0.209 (±0.116)    0.309 (±0.052)           0.408 (±0.029)
            G2      0.356 (±0.064)            0.384 (±0.099)    0.315 (±0.033)           0.425 (±0.050)
            G3      0.279 (±0.044)            0.424 (±0.060)    0.303 (±0.026)           0.427 (±0.067)
            G4      0.268 (±0.053)            0.426 (±0.082)    0.312 (±0.017)           0.432 (±0.054)
            G5      0.342 (±0.007)            0.429 (±0.101)    0.257 (±0.031)           0.461 (±0.071)
            G6      0.363 (±0.085)            0.462 (±0.073)    0.333 (±0.073)           0.472 (±0.046)
            G7      0.393 (±0.074)            0.450 (±0.066)    0.379 (±0.042)           0.464 (±0.051)
            G8      0.382 (±0.075)            0.455 (±0.084)    0.349 (±0.066)           0.463 (±0.039)
            G9      0.398 (±0.089)            0.443 (±0.074)    0.351 (±0.078)           0.468 (±0.046)
            G10     0.366 (±0.086)            0.442 (±0.074)    0.327 (±0.050)           0.445 (±0.055)
Top-5 average       0.386 (±0.013)            0.452 (±0.007)    0.352 (±0.018)           0.468 (±0.004)
Bottom-5 average    0.274 (±0.044)            0.351 (±0.084)    0.291 (±0.024)           0.425 (±0.010)

The evaluation of prompt efficacy in relevance assessments, as outlined in Table 3, reveals notable trends. A key observation is the significant performance variation among semantically similar prompts, highlighting the impact of subtle differences in prompt design on evaluation outcomes. For example, although M3 and G3 are similar prompts asking if the query and passage are ‘relevant,’ they yield different results. Moreover, despite all prompts addressing the relevance between the query and passage, their outcomes vary substantially.

Comparing results between GPT-3.5-turbo and GPT-4 across both few-shot and zero-shot settings, prompts M1, G7, G8, and G9 consistently rank in the top five, indicating their inherent effectiveness. Conversely, certain prompts consistently underperform in both models. Specifically, prompts M4, G1, and G3 are found in the bottom five, underscoring elements that may detract from the efficacy of relevance evaluations.

Examining the performance of individual models reveals distinct characteristics in response to the prompts. Each model demonstrates unique preferences in prompt efficacy, illustrating that LLMs may respond differently to the same prompt structures. Certain prompts show high efficacy in GPT-3.5-turbo, while others perform better in GPT-4. Notably, GPT-4 generally exhibits superior performance compared to GPT-3.5-turbo across a range of prompts. A particular case of interest is prompt G1 in the zero-shot setting, where GPT-4’s performance is the only instance of falling behind GPT-3.5-turbo. Aside from this case, GPT-4’s performance is generally superior to that of GPT-3.5-turbo.

Further statistical analysis, involving a paired t-test on the averages of the top five and bottom five prompts, reinforces these findings. Specifically, the top five prompts in GPT-3.5-turbo had an average performance of 0.386, while in GPT-4, this average was higher at 0.452. Conversely, the bottom five prompts averaged 0.274 in GPT-3.5-turbo and 0.351 in GPT-4. These results indicate a statistically significant difference in performance at a 95% confidence level, emphasizing the pivotal role of prompt design in influencing the effectiveness of relevance evaluations in LLMs.
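For illustration, such a test can be run with SciPy; the numbers below are the GPT-4 zero-shot kappa values from Table 3, and pairing the prompts by rank is our own assumption about the test design rather than a detail reported in the paper.

# Paired t-test comparing the top-5 and bottom-5 prompt groups (GPT-4, zero-shot).
from scipy.stats import ttest_rel

top5 = [0.462, 0.455, 0.450, 0.450, 0.443]      # G6, G8, M1, G7, G9
bottom5 = [0.209, 0.344, 0.384, 0.396, 0.424]   # G1, M4, G2, M3, G3
t_stat, p_value = ttest_rel(top5, bottom5)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")   # p < 0.05 indicates a significant gap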

4.3 Analysis of Terms in Prompts

Table 4: Key terms that play a crucial role. In prompts demonstrating good performance, the term ‘answer’ is commonly used, whereas in prompts showing low performance, the term ‘relevant’ is commonly used.

Efficacy   Key Term   Prompt
High       Answer     G9: … if the passage provides a direct answer to …
                      G7: … the passage contains the answer to the query …
                      M1: Does the passage answer the query? …
                      G10: Determine if the passage correctly answers to …
Low        Relevant   G1: Do the query and passage relate to the same topic …
                      M4: 2 = highly relevant, very helpful for …
                      M3: Indicate if the passage is relevant for the query? …
                      G3: In the context of the query, is the passage relevant?

In our analysis, we utilized the template from Table 1 to identify key terms in prompts that play a significant role in relevance evaluation using LLMs. The findings are summarized in Table 4.

We observed that prompts demonstrating top performance commonly used the term ‘answer’ or its variations. For instance, in M1, the prompt asks if the passage ‘answers’ the query. Similarly, G7 and G9 emphasize whether the passage contains or directly ‘answers’ the query. This pattern is also evident in G10, where the prompt focuses on whether the passage ‘correctly answers’ the query.

On the other hand, prompts associated with lower performance frequently included the term ‘relevant’ or related terms. For example, M3’s prompt requires indicating if the passage is ‘relevant’ for the query, while G1 asks if the query and passage ‘relate’ to the same topic. This trend continues in M4 and G3, where the term ‘relevant’ is central to the prompt’s structure.

These findings indicate that the choice of key terms in prompts significantly impacts the performance of LLMs in relevance evaluation tasks. Terms like ‘answer’ seem to guide the LLM towards more effective evaluation, while the use of ‘relevant’ appears to be less conducive for this purpose.

4.4 Analysis of Zero-shot and Few-shot Results

Figure 2: Average Cohen’s kappa values for top-5 and bottom-5 prompts in GPT-3.5-turbo and GPT-4 across few-shot and zero-shot settings.

The differences in performance between zero-shot and few-shot models for GPT-3.5-turbo and GPT-4 are illustrated in Figure 2, which presents the average results for each approach. From this analysis, we can discern two interesting observations.

Firstly, there is a notable variation in performance across the top and bottom five performers between the two model versions. In the case of GPT-3.5-turbo, while there is an improvement in the performance of the bottom five prompts (from an average of 0.274 in zero-shot to 0.291 in few-shot), the top five prompts exhibit a decrease in performance (from 0.386 in zero-shot to 0.352 in few-shot). This indicates that while few-shot examples enhance GPT’s ability to handle previously lower-performing prompts, they might detrimentally affect the performance of the highest-performing prompts.

In contrast, GPT-4 shows a consistent improvement in both the top and bottom performers with few-shot examples. The top five prompts improve from an average of 0.452 in zero-shot to 0.468 in few-shot, and the bottom five improve from 0.351 to 0.425. This shows that few-shot examples enhance the overall performance in evaluation tasks with GPT-4.

Secondly, both models demonstrate a reduction in the performance gap between the top and bottom five prompts with few-shot learning. This convergence is more pronounced in GPT-4, which sees a more significant increase in performance for the bottom five prompts. It suggests that few-shot examples are particularly effective in refining the model’s responses to less optimal prompts, leading to a more consistent performance across different types of prompts.

Given the role of few-shot examples in providing clearer instructions and context, these results suggest that GPT-4 is more adept at adapting to varied prompt structures and content than GPT-3.5-turbo.

5 Discussion

This section offers an analysis of our experimental results, focusing on the impact of specific prompt terms on the performance of LLMs in relevance evaluation. We also discuss the potential and challenges of using LLMs as direct rankers in IR, compared to their current role in generating relevance judgments.

Table 5: Confusion matrices for three prompts using the TREC DL 2021 test set in a zero-shot setting. This table includes Cohen’s kappa values, along with calculated precision and recall. The analysis focuses on G6 with the highest performance, G1 with the lowest, and G10, which has the narrowest definition through its use of the term ‘correctly.’

Prompt   Prediction    Human: Relevant   Human: Irrelevant   Cohen’s κ   Precision   Recall
G6       Relevant      43                24                  0.528       0.641       0.716
         Irrelevant    17                116
G1       Relevant      59                84                  0.275       0.413       0.983
         Irrelevant    1                 56
G10      Relevant      38                20                  0.495       0.655       0.633
         Irrelevant    22                120

G6  : Given a query and a passage, determine if the passage provides an answer to the query. …
G1  : Do the query and passage relate to the same topic? …
G10 : Determine if the passage correctly answers a given query. …

5.1 Why ‘Answer’ Is Better Than ‘Relevant’

The analysis of confusion matrices in Table 5 provides key insights into the effectiveness of different prompt types in relevance evaluation. This analysis highlights G6, which had the highest performance, G1 with the lowest performance, and G10, known for its use of the term ‘correctly.’

G6, achieving the highest performance, questions if the passage provides ‘an answer’ to the query. This prompt led to a significant agreement between LLM predictions and human assessors, as evidenced by a high Cohen’s kappa value of 0.528, along with strong precision and recall. The high number of true positives (43) and true negatives (116) in G6’s matrix suggests that focusing on ‘answering’ is highly effective in evaluating the relevance of the passage to the query.
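As a quick sanity check (the intermediate quantities below are computed here from Table 5 and are not reported in the paper), this kappa value follows directly from G6’s confusion matrix with n = 200:

P_o = (43 + 116) / 200 = 0.795
P_e = (67/200)(60/200) + (133/200)(140/200) = 0.1005 + 0.4655 = 0.566
κ = (0.795 - 0.566) / (1 - 0.566) ≈ 0.528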

Conversely, G1, which demonstrated the lowest performance, focuses on whether the query and passage ‘relate’ to the same topic. Despite its high recall, this prompt yielded a lower Cohen’s kappa value of 0.275. The comparatively fewer true negatives (56) against G6 indicate that a broader ‘relevance’ focus may lead to less precise evaluations.

G10, with its emphasis on whether the passage ‘correctly answers’ the query, shows a distinct performance, marked by a Cohen’s kappa value of 0.495. Its precision is notably high, but the recall is somewhat limited, suggesting that while it is effective in identifying specific relevant answers, it may overlook some broader aspects of relevance.

This comparison underlines the varying effectiveness of prompts based on their focus in the context of information retrieval. Prompts like G6, with an ‘answering’ focus, tend to lead to more accurate and precise evaluations, while ‘relevance’-focused prompts like G1 might not capture the entire scope of the query-passage relationship. G10’s specific focus on ‘correctly answering’ demonstrates a particular effectiveness in identifying precise answers but at the potential expense of broader relevance. Therefore, the choice of key terms and their emphasis is crucial in designing prompts for efficient retrieval and ranking in LLMs.

5.2 Balancing the Definition of ‘Relevance’

As discussed in the previous section, defining ‘relevance’ in the context of LLM prompts varies significantly in its scope. G10’s approach, using the term ‘correctly answers’, tends to give a slightly narrow definition in relevance evaluation. It focuses on whether the passage precisely addresses the query, potentially overlooking broader aspects of relevance.

On the other hand, we explored a more balanced approach with G6’s prompt. This prompt, focusing on whether the passage provides ‘an answer’ to the query, strikes a middle ground. It covers not just the direct answer but also the broader context, leading to a more comprehensive consideration of relevance.

Conversely, G1’s prompt offers the broadest definition of relevance by asking if the query and passage ‘relate’ to the same topic. This wide approach, while inclusive, risks being too expansive. As reflected in the confusion matrix for G1 in Table 5, this broad definition results in high recall but at the cost of lower precision, as it captures a wide net of potentially relevant information, including false positives.

This analysis highlights the need for a balanced definition of relevance in prompt design. While G1’s broad approach increases recall, its precision suffers. G10’s narrow focus may miss broader relevance aspects. In contrast, G6’s approach appears to offer a more optimal balance. It captures a wide array of relevant information without being overly narrow or overly broad, leading to more accurate and balanced performance in relevance evaluations. These findings are pivotal for crafting prompts that precisely measure the relevance of information in LLM-based retrieval and ranking systems.

5.3 Influence of Few-shot Examples

As can be seen in Figure 2, in GPT-3.5-turbo, the performance of zero-shot is slightly higher than that of few-shot. In contrast, in GPT-4, the performance of few-shot exceeds that of zero-shot. This variation indicates that a conclusive determination of the relative impacts of few-shot and zero-shot approaches is complex and model-dependent.

However, there is a characteristic that appears consistently in both models: the use of few-shot examples reduces the performance gap between the top-5 and bottom-5 groups. In GPT-3.5-turbo, the gap decreased from 0.112 to 0.061, and in GPT-4, it nearly halved from 0.101 to 0.043. These results suggest that few-shot examples help in defining unclear aspects in the bottom-5 instructions. For instance, consider the case of the G1 prompt. In the zero-shot setting, GPT-4 shows a low performance of 0.209, but when few-shot examples are used, the performance dramatically increases to 0.408. This could indicate that while the term ‘relate’ in G1 has a broad meaning, the use of few-shot examples helps in clarifying its interpretation.

5.4 Direct Ranking vs. Relevance Judgment Using LLMs

An emerging area of interest is the potential for using LLMs directly as rankers in IR, rather than just for generating relevance judgments. However, the practical application of LLMs as direct rankers faces significant challenges, primarily due to efficiency concerns. Directly ranking with LLMs, especially when reliant on API calls, can be slow and costly, as it requires repeated, resource-intensive interactions with the model for each ranking task. This approach, therefore, becomes impractical for large-scale or real-time ranking applications.

Given these constraints, future research in this domain should consider the development and utilization of downloadable, standalone LLMs. Such models, once sufficiently advanced, could potentially be integrated directly into ranking systems, offering a more efficient and cost-effective solution compared to API-dependent models. This shift would allow for the direct application of LLMs in ranking tasks, potentially overcoming the limitations currently posed by API reliance. However, this path also necessitates further advancements in LLM technology to ensure these models can operate effectively and reliably in a standalone capacity.

6 Conclusions

In this paper, we have examined the nuances of prompt design in relevance evaluation tasks using Large Language Models such as GPT-3.5-turbo and GPT-4. Our research reveals the profound impact that specific terms within prompts have on the effectiveness of these models. Contrary to initial expectations, our findings indicate that prompts focusing on ‘answering’ the query are more effective than those emphasizing broader concepts of ‘relevance.’ This highlights the importance of precision in relevance assessments, where a direct answer often more closely aligns with the intended query-passage relationship.

Furthermore, our investigations into few-shot and zero-shot scenarios revealed contrasting impacts on model performance. We found that few-shot examples tend to enhance the performance of LLMs, particularly in GPT-4, by bridging performance gaps between differently functioning prompts.

Our study also underscores the need for a well-balanced definition of ‘relevance’ in prompt design. We observed that overly broad definitions, while helpful in increasing recall, can compromise precision. Conversely, narrowly defined prompts, though precise, risk missing broader relevance aspects, failing to capture a comprehensive relevance assessment. Therefore, striking the right balance in prompt design is crucial for enhancing the efficiency and accuracy of LLMs in relevance evaluation tasks.

In summary, this paper contributes to the field by providing new insights into optimizing LLMs for relevance evaluation tasks. These insights offer crucial guidelines for creating effective prompts, ensuring that LLM outputs align more accurately with nuanced, human-like relevance judgments. As LLM technology continues to evolve, understanding the subtleties of prompt design becomes increasingly important in natural language processing and information retrieval applications.

Acknowledgment

This work was supported by Hankuk University of Foreign Studies Research Fund of 2024.

Appendix 0.A Few-shot Examples

We utilize four few-shot examples for our experiments.

Table 6: Four few-shot examples
#   Few-shot examples
1   Query: how many eye drops per ml
    Passage: Its 25 drops per ml, you guys are all wrong. If it is water, the standard was changed 15 - 20 years ago to make 20 drops = 1mL. The viscosity of most things is temperature dependent, so this would be at room temperature. Hope this helps.
    Answer: Yes
2   Query: how many eye drops per ml
    Passage: RE: How many eyedrops are there in a 10 ml bottle of Cosopt? My Kaiser pharmacy insists that 2 bottles should last me 100 days but I run out way before that time when I am using 4 drops per day. In the past other pharmacies have given me 3 10-ml bottles for 100 days. E: How many eyedrops are there in a 10 ml bottle of Cosopt? My Kaiser pharmacy insists that 2 bottles should last me 100 days but I run out way before that time when I am using 4 drops per day.
    Answer: No
3   Query: can you open a wells fargo account online
    Passage: You can transfer money to your checking account from other Wells Fargo. accounts through Wells Fargo Mobile Banking with the mobile app, online, at any. Wells Fargo ATM, or at a Wells Fargo branch. 1 Money in — deposits.
    Answer: No
4   Query: can you open a wells fargo account online
    Passage: You can open a Wells Fargo banking account from your home or even online. It is really easy to do, provided you have all of the appropriate documentation. Wells Fargo has so many bank account options that you will be sure to find one that works for you. They offer free checking accounts with free online banking.
    Answer: Yes

Appendix 0.B Prompts

We utilize 14 prompts for our experiments.

Table 7: List of 14 prompts used in the experiments, detailing their names and instructions.
Type        Name   Prompt instruction
Manual      M1     Does the passage answer the query? Respond with ‘Yes’ or ‘No’.
            M2     Given a passage and a query, predict whether the passage includes an answer to the query by producing either “Yes” or “No”.
            M3     Indicate if the passage is relevant for the query. Respond with “Yes” or “No”.
            M4     You are a search quality rater evaluating the relevance of passages. Given a query and a passages, you must provide a score on an integer scale of 0 to 2 with the following meanings:
                   2 = highly relevant, very helpful for this query
                   1 = relevant, may be partly helpful but might contain other irrelevant content
                   0 = not relevant, should never be shown for this query
Generated   G1     Do the query and passage relate to the same topic? Respond with ‘Yes’ or ‘No’.
            G2     Is the passage pertinent to the query? Indicate with ‘Yes’ or ‘No’.
            G3     In the context of the query, is the passage relevant? Reply with ‘Yes’ or ‘No’.
            G4     Would a user find the passage relevant to their query? Respond with ‘Yes’ or ‘No’.
            G5     Does the passage contain information relevant to the query? Answer with ‘Yes’ or ‘No’.
            G6     Given a query and a passage, determine if the passage provides an answer to the query. If the passage answers the query, respond with “Yes”. If the passage does not answer the query, respond with “No”.
            G7     Your task is to determine whether the passage contains the answer to the query or not. If the passage contains the answer to the query, your response should be ‘Yes’. If the passage does not contain the answer, your response should be ‘No’
            G8     Given a query and a passage, determine if the passage provides a satisfactory answer to the query. Respond with ‘Yes’ or ‘No’.
            G9     Given a query and a passage, determine if the passage provides a direct answer to the query. Answer with ‘Yes’ or ‘No’
            G10    Determine if the passage correctly answers a given query. Respond with ‘Yes’ or ‘No’

References

  • Alonso et al. [2009] Alonso, O., Mizzaro, S., et al.: Can we get rid of trec assessors? using mechanical turk for relevance assessment. In: Proceedings of the SIGIR 2009 Workshop on the Future of IR Evaluation, vol. 15, p. 16 (2009)
  • Ben-David et al. [2021] Ben-David, E., Oved, N., Reichart, R.: Pada: A prompt-based autoregressive approach for adaptation to unseen domains. arXiv preprint arXiv:2102.12206 3 (2021)
  • Blanco et al. [2011] Blanco, R., Halpin, H., Herzig, D.M., Mika, P., Pound, J., Thompson, H.S., Tran Duc, T.: Repeatable and reliable search system evaluation using crowdsourcing. In: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pp. 923–932 (2011)
  • Brown et al. [2020] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are few-shot learners (2020)
  • Carterette et al. [2006] Carterette, B., Allan, J., Sitaraman, R.: Minimal test collections for retrieval evaluation. In: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 268–275 (2006)
  • Dai et al. [2022] Dai, Z., Zhao, V.Y., Ma, J., Luan, Y., Ni, J., Lu, J., Bakalov, A., Guu, K., Hall, K.B., Chang, M.W.: Promptagator: Few-shot dense retrieval from 8 examples (2022)
  • Dietz et al. [2022] Dietz, L., Chatterjee, S., Lennox, C., Kashyapi, S., Oza, P., Gamari, B.: Wikimarks: Harvesting relevance benchmarks from wikipedia. In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3003–3012 (2022)
  • Ding et al. [2022] Ding, B., Qin, C., Liu, L., Bing, L., Joty, S., Li, B.: Is gpt-3 a good data annotator? arXiv preprint arXiv:2212.10450 (2022)
  • Ding et al. [2021] Ding, N., Hu, S., Zhao, W., Chen, Y., Liu, Z., Zheng, H.T., Sun, M.: Openprompt: An open-source framework for prompt-learning. arXiv preprint arXiv:2111.01998 (2021)
  • Faggioli et al. [2023] Faggioli, G., Dietz, L., Clarke, C., Demartini, G., Hagen, M., Hauff, C., Kando, N., Kanoulas, E., Potthast, M., Stein, B., Wachsmuth, H.: Perspectives on large language models for relevance judgment (2023)
  • Gao et al. [2020] Gao, T., Fisch, A., Chen, D.: Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723 (2020)
  • Jiang et al. [2020] Jiang, Z., Xu, F.F., Araki, J., Neubig, G.: How can we know what language models know? Transactions of the Association for Computational Linguistics 8, 423–438 (2020)
  • Kojima et al. [2022] Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners. Advances in neural information processing systems 35, 22199–22213 (2022)
  • Lester et al. [2021] Lester, B., Al-Rfou, R., Constant, N.: The power of scale for parameter-efficient prompt tuning (2021)
  • Li et al. [2021] Li, C., Gao, F., Bu, J., Xu, L., Chen, X., Gu, Y., Shao, Z., Zheng, Q., Zhang, N., Wang, Y., et al.: Sentiprompt: Sentiment knowledge enhanced prompt-tuning for aspect-based sentiment analysis. arXiv preprint arXiv:2109.08306 (2021)
  • Liang et al. [2022] Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., et al.: Holistic evaluation of language models. arXiv preprint arXiv:2211.09110 (2022)
  • Liu et al. [2023] Liu, X., Zheng, Y., Du, Z., Ding, M., Qian, Y., Yang, Z., Tang, J.: Gpt understands, too. AI Open (2023)
  • Lu et al. [2021] Lu, Y., Bartolo, M., Moore, A., Riedel, S., Stenetorp, P.: Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786 (2021)
  • MacAvaney and Soldaini [2023] MacAvaney, S., Soldaini, L.: One-shot labeling for automatic relevance estimation. arXiv preprint arXiv:2302.11266 (2023)
  • Maddalena et al. [2016] Maddalena, E., Basaldella, M., De Nart, D., Degl’Innocenti, D., Mizzaro, S., Demartini, G.: Crowdsourcing relevance assessments: The unexpected benefits of limiting the time to judge. In: Proceedings of the AAAI conference on human computation and crowdsourcing, vol. 4, pp. 129–138 (2016)
  • Nouri et al. [2020] Nouri, Z., Wachsmuth, H., Engels, G.: Mining crowdsourcing problems from discussion forums of workers. In: Proceedings of the 28th International Conference on Computational Linguistics, pp. 6264–6276 (2020)
  • OpenAI [2023] OpenAI: Gpt-4 technical report (2023)
  • Qin and Joty [2021] Qin, C., Joty, S.: Lfpt5: A unified framework for lifelong few-shot language learning based on prompt tuning of t5. arXiv preprint arXiv:2110.07298 (2021)
  • Qin and Eisner [2021] Qin, G., Eisner, J.: Learning how to ask: Querying lms with mixtures of soft prompts. arXiv preprint arXiv:2104.06599 (2021)
  • Reynolds and McDonell [2021] Reynolds, L., McDonell, K.: Prompt programming for large language models: Beyond the few-shot paradigm. In: Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–7 (2021)
  • Scao and Rush [2021] Scao, T.L., Rush, A.M.: How many data points is a prompt worth? arXiv preprint arXiv:2103.08493 (2021)
  • Schick and Schütze [2021] Schick, T., Schütze, H.: Few-shot text generation with natural language instructions. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 390–402 (2021)
  • Shin et al. [2020] Shin, T., Razeghi, Y., Logan IV, R.L., Wallace, E., Singh, S.: Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980 (2020)
  • Soboroff et al. [2001] Soboroff, I., Nicholas, C., Cahan, P.: Ranking retrieval systems without relevance judgments. In: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 66–73 (2001)
  • Sun et al. [2023] Sun, W., Yan, L., Ma, X., Ren, P., Yin, D., Ren, Z.: Is chatgpt good at search? investigating large language models as re-ranking agent (2023)
  • Thomas et al. [2023] Thomas, P., Spielman, S., Craswell, N., Mitra, B.: Large language models can accurately predict searcher preferences (2023)
  • Wang et al. [2021] Wang, S., Liu, Y., Xu, Y., Zhu, C., Zeng, M.: Want to reduce labeling cost? gpt-3 can help. arXiv preprint arXiv:2108.13487 (2021)
  • Yuan et al. [2021] Yuan, W., Neubig, G., Liu, P.: Bartscore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems 34, 27263–27277 (2021)