SE Factual Knowledge in Frozen Giant Code Model: A Study on FQN and its Retrieval

Qing Huang, Dianshu Liao, Zhenchang Xing, Zhiqiang Yuan, Qinghua Lu, Xiwei Xu, Jiaxing Lu Q. Huang, D. Liao, Z. Yuan, J. Lu are with School of Computer Information Engineering, Jiangxi Normal University, China.Q. Huang and D. Liao are co-first authors, Z. Yuan is the corresponding author(yuanzhiq@jxnu.edu.cn) Z. Xing is with Data61 Australia and Australian National University Australia. Q. Lu and X. Xu are with Data61 Australia.

(December 2022)

Abstract

Pre-trained giant code models (PCMs) start coming into the developers’ daily practices. Understanding what types of and how much software knowledge is packed into PCMs is the foundation for incorporating PCMs into software engineering (SE) tasks and fully releasing their potential. In this work, we conduct the first systematic study on the SE factual knowledge in the state-of-the-art PCM CoPilot, focusing on APIs’ Fully Qualified Names (FQNs), the fundamental knowledge for effective code analysis, search and reuse. Driven by FQNs’ data distribution properties, we design a novel lightweight in-context learning on Copilot for FQN inference, which does not require code compilation as traditional methods or gradient update by recent FQN prompt-tuning. We systematically experiment with five in-context-learning design factors to identify the best in-context learning configuration that developers can adopt in practice. With this best configuration, we investigate the effects of amount of example prompts and FQN data properties on Copilot’s FQN inference capability. Our results confirm that Copilot stores diverse FQN knowledge and can be applied for the FQN inference due to its high inference accuracy and non-reliance on code analysis. Based on our experience interacting with Copilot, we discuss various opportunities to improve human-CoPilot interaction in the FQN inference task.

Index Terms:

In-context Learning, Frozen Giant Code Model, FQN Inference, GitHub CoPilot, Prompt Design.

1 INTRODUCTION

Software engineering (SE) is knowledge-intensive. The knowledge includes, but not limited to: 1) factual knowledge, such as the API’s fully-qualified name (FQN), user credentials, text resources. 2) structural knowledge, such as Abstract Syntax Tree (AST) and control flow graph (CFG), reflecting the code’s syntax and execution flow. 3) semantic knowledge, such as code patterns, code semantics and API constraints, usually lead to bugs or vulnerabilities if they are overlooked.

In the Natural Language Processing (NLP) community, many studies [1, 2, 3, 4, 5, 6] show that the pre-trained language models (PLMs) have powerful capabilities for capturing linguistic and semantic knowledge in text. These capabilities enable PLMs to significantly improve the state-of-the-art in NLP tasks. In the SE community, code naturalness shows that code can be understood and manipulated in the same way as natural language text [7, 8]. This spawn a series of pre-trained code models (PCMs), e.g., CodeBERT [9], CodeT5 [10], CoPilot [11], modeling code simply as text. Studies have shown that syntactic or semantic knowledge packed in the PCMs can be transferred to the downstream SE tasks and benefit these tasks [12, 13, 14, 15, 16, 17, 18, 19]. Giant PCMs (e.g., CoPilot [11]) start coming into the everyday software development practices.

However, there has been no work on probing and using SE factual knowledge in the PCMs. In this work, we study a particular type of SE factual knowledge, i.e., API’s FQNs. An API FQN identifies which class, function or field a simple name refers to in a code context. It is a fundamental knowledge for program analysis [20, 21, 22, 23, 24, 25] and code search [26, 27]. However, the FQNs of simple names in partial code (see Fig. 1) usually cannot be resolved which hinders the partial code reuse and analysis [28, 29, 30]. Existing work [31, 32, 33] for FQN inference in partial code depends on a symbolic knowledge base of APIs and their usage and partial program analysis [34]. However, building such symbolic knowledge bases requires project compilation, and suffers from out-of-vocabulary (OOV) issues [35]. Our previous work [35] proposed to treat code as text and infer FQNs as a text fill-in-blank task based on an FQN-prompt-tuned CodeBERT (see an illustration in Fig. 1), showing the promise of PCMs for modeling FQN knowledge in code.

Supervised fine-tuning [12, 35, 36] updates the model weights. Although the fine-tuned PLMs perform well in the downstream tasks, it does not mean that the vanilla PLMs without fine-tuning effectively store relevant knowledge. In fact, our study (see RQ4 in Section 4.4) shows that CodeBERT (125M parameters) [9] (the backbone PCM fine-tuned in [35]) performs poorly on FQN inference without fine-tuning. A recent study by Google [37] shows that model size matters and emergent abilities will not appear until a critical threshold of scale is reached. Recent NLP studies [6, 38, 39] demonstrate that giant PLMs (e.g., GPT3 with 175B parameters) have strong ability to store factual and commonsense knowledge as neural knowledge base and the stored knowledge can be accurately recalled through in-context learning (see Fig. 1 for the illustration of in-context learning). In these studies, giant PLMs are frozen (i.e., without updating model parameters) such that people can understand to what extent certain factual knowledge is present in the intact models.

A recent study [40] identifies three data distribution properties (temporally bursty, Zipfian distribution, and dynamic meaning) empowers the in-context learning ability of frozen giant PLMs. FQN data manifests these three properties. First, the temporally bursty is reflected in API evolution [41, 42, 43, 44], during which new APIs are introduced and existing APIs are removed, changed or deprecated. Furthermore, new libraries emerge and existing ones become obsolete [41]. Second, the Zipfian (i.e. power law) distribution is reflected in the FQN lengths and usage. In our experimental dataset of six libraries (Android SDK, JDK, Hibernate, GWT, Xstream, Joda Time), about 12% (or 27%) FQNs have been used more than 10k (or 1k) times respectively, followed by a long tail of infrequently used FQNs (see Table III). Third, the dynamic meaning is reflected in the 1:N and N:1 cardinalities between simple names and FQNs (i.e., polysemy and synonymy ambiguity in NLP). For example, all variables reader, br and buffRead can be of the type java.io.BufferedReader. The simple name Date can be java.util.Date, java.sql.Date or sun.util.calendar.Gregorian.Date.

We hypothesize that frozen giant PCMs (e.g., CoPilot [11] extended from GPT3) can serve as a neural knowledge base of SE factual knowledge (e.g., FQNs) because they are pre-trained on a giant corpus of source code. To validate this hypothesis, we conduct a series of experiments, focusing on two questions: 1) how much FQN knowledge is packed in a frozen giant PCM (CoPilot in this study)? 2) how can we effectively retrieve the FQN knowledge in the frozen CoPilot? Inspired by the alignment of FQN data distribution properties with the findings in [40], we set up in-context learning tasks on CoPilot for inferring FQNs for cannot-be-resolved simple names in partial code snippets (see Fig. 1). We experiment a wide range of learning configurations: zero/one/few-shot learning and in-context learning design factors (code context, task description, prompt template, example prompt order and identifier format). We analyze the CoPilot’s capability in inferring FQNs with diverse data distribution properties (FQN lengths, FQN usage times, simplename-FQN and FQN-simplename cardinalities).

As CoPilot can only be manually invoked, we sample a set of 1,440 representative and diverse methods from the Github dataset of six libraries in [35]. These methods use 4,697 distinct APIs from 850 packages. Different from supervised FQN-prompt tuning [35], our FQN inference stands on the shoulder of frozen giant CoPilot. With the lightweight in-context learning, CoPilot can be quickly adapted to the FQN inference task with zero or a few examples of simplename-FQN mappings. On the Stack Overflow dataset (496 code snippets using 791 distinct APIs from 66 packages), our method achieves very close or better inference accuracy than supervised prompt tuning of CodeBERT with 11,776 library source files in [35]. Our study also unveils several interesting interactions with CoPilot, including micro-level sensitivity, human-CoPilot colearning and priori knowledge for in-context learning. These interaction experiences call for further studies on human-PCM collaboration design to effectively integrate human intelligence and AI in software engineering tasks.

The main contributions of this paper are as follows:

•

Conceptually, we conduct the first systematic study on a fundamental SE factual knowledge (FQN) in the state-of-the-art giant PCM (CoPilot). Our methodology can be extended to other SE factual knowledge in giant PCMs (e.g., privacy and proprietary information in code).
•

Technically, we design the first lightweight in-context learning-based method for FQN inference, standing on the shoulder of frozen giant PCM. Our method removes the reliance on code compilation and special model tuning and deployment. Developers can easily adopt our method when reusing and parsing partial code by writing a few lines of code comments demonstrating the FQN inference task to CoPilot and then requesting the CoPilot’s completion.
•

Empirically, we systematically experiment a wide range of in-context learning configurations. Our results reveal the extent and characteristics of FQN knowledge stored in CoPilot and confirm the practicality of using CoPilot for FQN inference in partial code. We also identify effective in-context learning configurations for different priori FQN knowledge and data properties. Our data package can be found here ¹¹1https://github.com/SE-qinghuang/SE-Factual-Knowledge-in-Frozen-Giant-Code-Model. Code will be released upon paper acceptance.

2 In-Context Learning for FQN Inference

Refer to caption — Figure 1: In-Context Learning for FQN Inference. The bold text is the explanation, not part of task input. The model’s task input parts are highlighted, such as the code context in purple block and the prompts in gray block. A task input concatenates a purple block and a gray block.

We design a lightweight, easy-to-deploy in-context learning method for FQN inference in partial code. This method allows us to probe the FQN knowledge in the frozen giant PCMs.

2.1 Supervised Fine-Tuning vs. In-Context Learning

2.1.1 Supervised Fine-Tuning

PCMs have been adopted in many downstream SE tasks through supervised fine-tuning [9, 45, 46]. Fine tuning adapts task-agnostic PLMs to downstream tasks by gradient-updating the weights of a PLM on a supervised dataset specific to the downstream task. Downstream tasks are generally heterogeneous from the PLM pretraining. To alleviate this heterogeneity, prompt tuning has been proposed [47, 6, 48]. Prompts are templates that convert downstream tasks into the form of pre-training tasks which makes pre-training and prompt tuning share homogeneous learning objective. “Pre-training, prompt-tuning and prediction” has become a new NLP paradigm and has demonstrated strong capability in zero-, one- or few-shot scenarios in code summarization [49], fault prediction [50], code translation [51] and requirement classification [17]. Recently, we. [35] applied FQN-prompt-tuning to CodeBERT for FQN inference. However, as fine-tuning updates model weights, we cannot know how much FQN knowledge a frozen PLM captures.

2.1.2 In-context Learning

In-context learning is a form of prompt learning without gradient update of PLMs so that we can learn to what extent a frozen PLM captures certain knowledge. In-context learning uses the text input to a PLM to specify the downstream task: the PLM is conditioned on a task description and/or a few task demonstrations (i.e., example prompts) and is then asked to complete further instances of the task (i.e., to-be-complete prompt) by generating what comes next. Fig. 1 presents our task input for FQN inference. Although a PCM may have seen many simple names and FQNs during pre-training, this form of the FQN inference task has never been seen during pre-training. By analyzing the capability of a PCM in generating the FQN for the given simple name in a code context in the in-context learning setting, we probe the FQN knowledge that this PCM captures.

2.2 Our In-Context Learning Design

To develop a comprehensive understanding of the FQN knowledge in a frozen, giant PCM, we consider two main perspectives in designing in-context learning tasks.

2.2.1 Amount of Example Prompts

Example prompts are the most important part for in-context learning. As shown in Fig. 1, example prompts follow the task description and appear before the to-be-complete prompt. They not only inform the model priori-known FQNs but also “teach” the model how it should complete the task unseen during pre-training. Given a code snippet, the developer may know the FQNs of some simple names. Priori-known FQNs would help to infer other unknown FQNs. To simulate different levels of priori-known FQNs in the FQN inference task, we design five different shot settings (zero-shot, two one-shot and two few-shot).

The zero-shot provides no example prompts. This simulates the situation when the developer does not know any FQN (no matter for the simple names in the code context or any other FQNs). As shown in Zero-Shot in Fig. 1, the model predicts the FQN of the simple name given only the task description. Given this minimum task input, the model may not clearly know what task to solve and thus generate some irrelevant results (e.g., generating a piece of code instead of an FQN, generating the value of a variable instead of its FQN).

The one-shot-ENIC (example not in code context) provides one example prompt, but the simple name and corresponding FQN does not appear in the code context. This setting assumes that even if the developer does not know the FQNs of any simple names in the code context, he or she may still know some general FQNs. As shown in One-Shot-ENIC (Example Not in code context) in Fig. 1, the example prompt shows the simple name Object and its corresponding FQN java.lang.Object, even though Object does not appear in the code context. Although this example prompt does not provide any priori-known FQNs relevant to the code context, it still demonstrates the task completion format, which may reduce the chance of generating the text irrelevant to FQN. However, this not-in-code-context simplename-FQN example may cause the model to misunderstand the scope of FQN inference and generate an FQN irrelevant to the code context.

The one-shot provides one example prompt for a simple name (selected randomly or by FQN usage times) appearing in the code context. As shown in One-Shot in Fig. 1, the prompt shows the simple name List $<>$ and its corresponding FQN java.util.List $<>$ , and the simple name List $<>$ appears in the code context. This setting is the lower bound of priori-known FQN in the code context and demonstrates the task completion format. However, providing only one example may not be sufficient to adapt the model to the FQN inference task.

The few-shot-REP (random example prompts) provides the example prompts for 2 to $n-2$ randomly selected simple names in the code context. $n$ is the number of all unique cannot-be-resolved simple names. In the example of Few-Shot-REP (Random Example Prompts) in Fig. 1, there are three example prompts, each for one simple name in the code context. This simulates the situations when the developer knows the FQNs for some simple names but not others. The more example prompts the model sees, the more likely it can adapt to specific code contexts and infer relevant FQNs in the correct format.

The few-shot-LOO (leave one out) provides the example prompts for all $n-1$ simple names in the code context, except for one simple name left out for inference, as illustrated in the example of Few-shot-LOO (Leave One Out) in Fig. 1. This setting is the upper bound of priori-known FQNs, which would maximize the model’s capability in inferring the FQN for the left-out simple name. This upper bound allows us to estimate the extent of FQN knowledge in a giant PCM.

2.2.2 Prompt Engineering

In addition to task demonstrations by example prompts, our in-context learning considers the following five factors when preparing the task input.

Code Context. The code context is a code snippet (the purple block in Fig. 1). We put the code context at the beginning of the task input, which specifies the context where the FQN inference occurs. If the code context is not given, we essentially probe the preferred FQN that the model memorizes for a simple name from pre-training.

Task Description. The task description (the green block in Fig. 1) follows the code context and appears before the example prompts. It tells the model what the task is about. We consider three types of task description: no description, concise (i.e., “type inference”), or verbose (i.e., “parse simple name to fully qualified name”).

Prompt Template. All example prompts and to-be-complete prompt use the same prompt template so that the model can learn from the example prompts how to complete the to-be-complete prompt. We experiment two types of prompt templates: description or symbol. The description template is “the fully qualified name of simplename is FQN”, as shown in Fig. 1. The symbol template uses a symbol ( $\rightarrow$ in our experiments) to indicate mapping a simple name to an FQN (e.g., File $\rightarrow$ java.io.File).

Order of Example Prompts. To understand if the order of example prompts affects the model’s learning, we design three orders: random, frequent first, and infrequent first. We count the FQN usage times in all the methods in our dataset. Frequent-first order means the example prompts of the more frequently used FQNs appear before the less frequently used FQNs. Infrequent-first order is the opposite to Frequent-first order. Random order means randomly ordering example prompts without considering the FQN usage times.

Identifier Format. Annotating a word in a sentence can help the model distinguish it from other words. We experiment the identifier format with and without special annotations. Without annotations means the words in the prompts are treated equally. With annotations, we add “” to simple names and FQNs in the prompts, for example, the fully qualified name of “File” is “java.io.File” in Fig. 1.

2.3 FQN Inference by In-Context Learning on CoPilot

In this study, we probe the FQN knowledge in CoPilot with 175B parameters. We choose CoPilot as it has commercial product quality and provides the IDE plugin to invoke the model. We use giant PCM rather than small models like CodeBERT [9] as the study [37] shows giant models exhibit emergent abilities that small/medium models do not have. To apply in-context learning on CoPilot for FQN inference, developers can write a task input comprising one code snippet as text (if provided), followed by one task description, example prompts (if any) (one per line) and one to-be-complete prompt at the end, as shown in Fig. 1. Each part starts on a new line. Task description and prompts starts with // as code comments. The entire task input is plain text, and CoPilot can be invoked on the to-be-complete prompt line to complete it.

A to-be-complete prompt is to infer the FQN for a cannot-be-resolved simple name in the code context. As in [35], we consider three types of cannot-be-resolved simple names: 1) data type of variable declaration (e.g., “File” in Fig. 1); 2) type name of class instantiation and array creation (e.g., “List $<$ String $>$ ”); 3) the object or the type on which a method is invoked or a field is accessed (e.g., “br”). In addition, we consider the method/field name of this method invocation or field access (e.g., readLine()) to test the CoPilot’s inference capability for methods and fields. However, we do not consider the chained method calls or field accesses (e.g., “br.readLine().toLowerCase()”). Once “br” or “readLine()” is resolved to an FQN, the receiving type of “toLowerCase()” can be obtained from the return type of “br.readLine()”.

As shown in Fig. 1, each prompt (example or to-be-complete) contains only one simplename-FQN mapping. For example, for “List $<$ String $>$ ”, there will be two separate prompts, one for “List $<>$ ” and the other for “String”. Furthermore, we treat simple names with same base but different forms (e.g., “List”, “List $<>$ ”, “List[]”, and “List()”) as different simple names, because we want to probe the CoPilot’s capability in understanding different program syntax (i.e., general type, generic type, array type, constructor).

3 Experiments Setup

Our experiments have two-fold objectives: 1) investigate how much FQN knowledge is packed in a frozen giant PCM like CoPilot; and 2) identify the effective ways to retrieve the FQN knowledge in CoPilot by in-context learning. To achieve these objectives, we conduct a series of experiments to investigate the four research questions:

•

RQ1 - How sensitive is the FQN inference on CoPilot to five prompt engineering factors?
•

RQ2 - How do amount of example prompts (zero/one/few-shot) affect the FQN inference on CoPilot?
•

RQ3 - How do FQN data distribution properties affect the FQN inference on CoPilot?
•

RQ4 - How well does the FQN inference on CoPilot perform on real-world partial code, compared with the state-of-the-art prompt-tuning based FQN inference method [35]?

Next, we describe our datasets, prompt-engineering configurations, preparation of task inputs for large-scale experiments, our experiment environment and process.

3.1 Datasets

To obtain accurate and practical answers to our research questions, we construct two datasets: one Github code dataset from the six libraries (Android SDK, JDK, GWT, Hibernate, Joda Time and Xstream) and one SO dataset from Stack Overflow posts discussing the API usage of these six libraries.

3.1.1 Github Dataset From Six Libraries

We download the source code of the six libraries from the replication package provided by [35]. The original dataset was downloaded from the library’s Github repository and was used to evaluate the prompt-tuned FQN inference method in [35].

We extract all methods (not just public methods) from each source file, because all methods can be used as the FQN usage context no matter their visibility. The body of each method is considered as a code snippet in our experiments. As the source code of these libraries is compilable, we collect unique $<$ simplename, FQN $>$ pairs used in the library methods using the Spoon [52] tool. These $<$ simplename, FQN $>$ pairs provide the ground-truth for the large-scale experiments of the CoPilot’s FQN inference capability in RQ1, RQ2 and RQ3. A long method generally includes many $<$ simplename, FQN $>$ pairs, which will generate too many experiment instances through the combination of different in-context learning factors. Through a pilot study of manual effort, we decide to keep only the methods with less than 30 lines of code (LOC), which account for 93.42% of all the methods in the six libraries.

Because CoPilot can only be invoked manually in the IDE editor, we cannot perform automatic experiments as in [35] due to the prohibitive manual effort required. Therefore, we sample a subset of diverse and representative methods from the original dataset. First, we randomly sample a package and randomly sample a method in this package and add the method to the sample dataset. Then, we randomly sample the methods in the other not-yet-sampled packages one at a time as follows. We randomly sample a not-yet-sampled package and collect as candidates all methods in the package whose similarity is less than 0.9 with any of the methods in the sample dataset. The method similarity is measured by the code embedding method [53]. We add the least-similar candidate method with more than three $<$ simplename, FQN $>$ pairs to the sample dataset, as those with too few $<$ simplename, FQN $>$ pairs are not sufficient for certain learning configurations (e.g., few-shot (random example prompts). This sampling process continues until all the library packages are iterated.

We obtain 1,440 methods as our Github dataset, which have 8,258 $<$ simplename, FQN $>$ pairs (including 3,871 unique simple names and 4,697 unique FQNs from 850 packages). We confirm that the sampled methods are representative in terms of code LOCs, FQN lengths, usage times and simplename-FQN cardinalities, and also diverse in terms of low pair-wise code similarities and FQN-set Jaccard coefficients. Details are reported in the Section I-A and Section I-B in the supplementary document.

3.1.2 Dataset From Stack Overflow Posts

In RQ4, we would like to evaluate CoPilot’s FQN inference capability in real-world partial code developers write. To that end, we use the two partial code datasets (Stat-Type-SO and Short-SO) collected from the Stack Overflow posts discussing the API usage of the six libraries. In our previous work [35], we manually labelled the ground-truth FQNs for the simple names in the code snippets. We double-check and confirm the label correctness. The dataset Stat-Type-SO has been used in several FQN inference studies [54, 32, 55, 35]. It contains 381 code snippets (LOCs from 3 to 30) which use 685 distinct APIs of the library APIs from 35 distinct packages. The Short-SO was created by [35]. It contains 115 short partial code snippets (LOC $\leq$ 10) which use 205 distinct APIs from 35 distinct packages. In total, the two SO datasets have 2,320 $<$ simplename, FQN $>$ pairs (including 684 unique simple names, 791 unique FQNs in 496 code snippets.).

TABLE I: Two Configurations of Prompt Engineering

Prompt Factor	Basic	Best
Code Context	Provided	Provided
Task Description	Verbose	Concise
Prompt Template	Description	Description
Example Prompt Order	Random Order	Infrequent First
Identifier Format	With Quote	With Quote

3.2 Configurations of Prompt Engineering

Before our experiments, we design a basic configuration (the Basic column in Table I) based on our intuition of which options would be most effective for FQN inference. The basic configuration provides the code context, uses verbose task description (i.e., “parse simple name to fully qualified name”), provides example prompts in random order, and adopts description-style prompt template and annotates simple names and FQNs with quotes (i.e., the fully qualified name of “simplename” is “FQN”). To study the impact of one prompt design factor on the FQN inference, we vary the concerned factor with different variants but leave other factors the same as the basic configuration. We validate our intuition and identify the best configuration in RQ1 4.1 (the Best column in Table I). The best configuration is consistent with the basic configuration except for using concise task description (i.e., “type inference”) and providing the infrequent FQN examples first. This best configuration is used in RQ2/3/4. In all experiments, we apply a prompt-engineering configuration in the five different shot settings.

3.3 Composing Task Inputs for Large-Scale Experiments

To conduct large-scale experiments, it is impractical to ask developers to manually write the task inputs in the IDE. Therefore, given a code snippet in our datasets, we automatically compose the task inputs according to a learning configuration. The composed task inputs are saved in .java files and stored in a folder structure corresponding to the factor options. In this work, we assume the same simple name appearing at different places in the code context refers to the same FQN. That is, there is no variable shadowing or name masking. For each unique $<$ simplename, FQN $>$ pair in the code context, we produce a to-be-complete prompt for the simplename with the FQN as the ground truth. Then, we prepare the example prompts with the rest $<$ simplename, FQN $>$ pairs for five different shot settings. The composed example prompts will be ordered according to the example prompt order configuration. The Github dataset has 8,258 $<$ simplename, FQN $>$ pairs, so we compose 41,290 (5*8258) task inputs for a learning configuration. RQ1, RQ2 and RQ3 involve 9 configurations so we compose in total 371,610 task inputs. The SO dataset has 2,320 $<$ simplename, FQN $>$ pairs, and we compose 11,600 task inputs for the best configuration in RQ4.

3.4 Experiment Environment and Process

We prepare nine computers for crowd workers to manually request the completion of CoPilot. We install JetBrain’s IDEA IDE version 2022.2 on each computer and install the CoPilot plugin version 1.1.28.1744 in the IDEA environment. As CoPilot treats source code as text, there is no need for compiling the task inputs. We recruit nine undergraduate students from our school to complete the tasks. We divide the task inputs roughly evenly across the nine workers. The workers are offered a small financial incentive for their work.

The workers open our stored task input files, obtain the predicted FQN by pressing tab on the to-be-complete prompt line, and save the file. We write a program to automatically extract the text after the to-be-complete prompt in the input files. If the content is empty, the input files are missed by the workers, as CoPilot will output “No completions were found” if its generation fails. We remind the workers to complete the missed inputs. The process of completing all task inputs took 1,512 man-hours (on average 14.2 seconds per input). Then, we extract the predicted FQNs for all model inputs. If the content shows “No completions were found”, the predicted FQN is marked as “…” which does not match any ground-truth FQNs. If the predicted FQN contains brackets (e.g., (), [], $<>$ ), we keep the brackets but remove the contents inside the brackets. If the delimiter between two tokens is a special symbol (e.g., #, $) but not the dot (.), we replace the symbols with the dot. As CoPilot is interactively used by developers, we believe developers can easily recognize and fix these trivial errors. After the post-processing, we consider a predicted FQN correct only if it is identical to the ground-truth FQN.

4 Experiment Results

4.1 Sensitivity to Prompt Engineering (RQ1)

TABLE II: Results of Sensitivity to Prompt Engineering (+/-value Against the Basic Configuration)

PE Factor	Variant	Zero-Shot	One-Shot-ENIC	One-Shot	Few-Shot-REP	Few-Shot-LOO
	Basic Configuration	49.00%	49.72%	61.18%	74.10%	77.55%
	Best Configuration	+1.07%	+0.54%	+4.54%	+2.30%	+1.79%
Code Context	Not Provided	-9.00%	-4.87%	-4.33%	-5.24%	-5.03%
Task Description	Concise	+1.07%	+0.54%	+0.94%	+1.38%	+1.10%
Task Description	No	-1.90%	+1.44%	+1.34%	+0.47%	+0.93%
Prompt Template	Symbol	-1.93%	-0.19%	-2.25%	-1.47%	+4.47%
Example Prompt Order	Frequent First	-	-	-7.18%	-3.65%	-1.56%
Example Prompt Order	Infrequent First	-	-	+7.69%	+2.95%	+0.06%
Identifier Format	Without Quote	-6.89%	-2.81%	-2.06%	-1.06%	-0.50%

4.1.1 Motivation

In-context learning prompts CoPilot how to complete new tasks with examples. The model’s task completion capability can be sensitive to the design of prompts (so-called prompt engineering) [47, 56, 57]. Our in-context learning involves five prompt-engineering factors (see Section 2.2.2), and each factor has some variants. This RQ aims to investigate the CoPilot’s sensitivity to prompt engineering and validate our intuition of basic configuration.

4.1.2 Methodology

We use the Github dataset in this RQ. We create variant configurations by varying one of the five factors (code context, task description, prompt template, example prompt order or identifier format) and keeping the other four factors the same as the basic configuration. In addition to the basic configuration, we obtain 7 more configurations by factor ablation, and identify the best configuration of factor settings. We experiment each configuration with five different shot settings which produce the final 45 variant configurations, with the total 371,610 model inputs to complete.

4.1.3 Result Analysis

The results of with/without code context confirms our intuition. Code context is critical for making context-sensitive inference. Without code context, CoPilot essentially generates FQNs based on its memory of the correlations between simple names and FQNs [58], and suffers the largest accuracy drop. However, the inference accuracy without code context is still acceptable (the upper bound 72.52% at Few-Shot-LOO). This does indicate that CoPilot memorizes many FQNs. On the other hand, simply relying on model memorization may bias the FQN inference by the data distribution for model pre-training. For example, without code context, CoPilot always generates java.util.Date for Date, while given the SQL processing code, it will generate java.sql.Date. As java.util.Date is more frequently used than java.sql.Date, CoPilot prefers the former over the latter when no code context is provided.

The results of task description variants do not follow our intuition. Verbose task description is only better than no task description at Zero-Shot (i.e., no example prompt provided). For all other cases, verbose task description is worse (around 1%) than concise and no task description. The default example prompt template (“the fully qualified name of … is …”) carries similar information as verbose task description, which may reduce the importance of verbose task description.

The results of two prompt templates are somewhat surprising. The description template (“the fully qualified name of … is …”) is better (around 2%) than the symbol template (“simplename $\rightarrow$ FQN”), except for Few-Shot-LOO where the symbol template is 4.5% better than the description template. This suggests that unless there are sufficient example prompts, natural language examples are more beneficial for conditioning the model on the task. However, when example prompts are sufficient, the model can derive the meaning of the symbol in the task context and complete the task more correctly.

The results of example prompt order are very interesting. Frequent-first (i.e., more-frequently-used FQNs appear at the beginning) is always the worst, random-order is in the middle, and infrequent-first (less-frequently-used FQNs appear at the beginning) is always the best. The accuracy (77.05%) of infrequent-first at Few-Shot-REP is even better than the accuracy (75.99%) of frequent-first at Few-Shot-LOO. That is, letting the model see more challenging (even relatively) examples first is beneficial for the model to learn better and faster. The variants of example prompt order is not applicable to Zero-Shot (no example prompt) and One-Shot-ENIC (the same $<$ simplename, FQN $>$ not in the code context).

The results of identifier format confirm our intuition. Using a special symbol (“” in this work) to annotate simple names and FQNs results in better inference accuracy, especially for Zero-Shot, One-Shot-ENIC and One-Shot. The effect of special symbol diminishes as the example prompts increases. These results suggest that special symbol let the model attend to the key information (simple name and FQN) when there is limited information. However, when there are several example prompts, the model can distinguish the key information (different across examples) from other repeating information even the key information is not specially annotated.

Based on the prompt engineering results, we define the “best” configuration: with code context, concise task description, description-style prompt template, infrequent-first example prompt order, identifier with quote. We carry out the experiments with the best configuration. The best configuration is always better than the basic configuration across the five shot settings. It is 1.07%, 0.54%, 4.54%, 2.30%, 1.79% more accurate in zero-shot, one-shot-EXIC, one-shot, few-shot-REP, and few-shot-LOO, respectively. The best configuration is also generally better than the basic configuration with only one factor variant, except for four configuration variants (no task description at One-Shot-ENIC, infrequent-first at One-Shot, infrequent-first at Few-Shot-REP, and symbol prompt template at Few-Shot-LOO). This suggests that prompt factors have complex interactions which may cancel the effects of others.

Although studies [47, 56, 57] show PLMs are sensitive to prompt engineering, CoPilot remains overall stable in inferring FQN knowledge in face of variant prompt-engineering factors. Our intuition of the effectiveness of factor variants largely holds, except for task description and example prompt order. This indicates the necessity to combine intuition and empirical evidences in prompt engineering. Combining the best individual variants results in an overall-balanced best configuration.

4.2 Impact of Amount of Example Prompts (RQ2)

TABLE III: The FQN Inference Accuracy For FQNs with Different Data Distribution Properties (the Closer to 1, the Brighter the Color)

	Range	FQN Percentage(%)	Zero-Shot	One-Shot-ENIC	One-Shot	Few-Shot-REP	Few-Shot-LOO
Best Configuration	all	100%	50.07%	50.26%	65.72%	76.40%	79.34%
FQN Length	$2-4$	58.04%	76.59%	77.49%	78.40%	86.42%	88.29%
	$5-7$	28.00%	18.59%	18.19%	50.73%	64.25%	68.92%
	$8-10$	13.35%	3.08%	1.40%	43.19%	59.33%	63.25%
	$\geq 11$	0.61%	0.00%	0.00%	40.82%	55.10%	59.18%
FQN Usage Time	$\geq 10k$	12.58%	99.42%	99.71%	99.52%	99.42%	99.62%
	$[1k,10k)$	14.99%	79.75%	85.98%	82.99%	89.71%	92.53%
	$[10,1k)$	42.20%	51.87%	50.56%	61.48%	75.56%	79.18%
	$[1,10)$	30.23%	11.51%	10.72%	48.49%	61.04%	64.27%
SN:FQN	$1:1$	74.80%	53.54%	54.45%	70.48%	79.64%	82.52%
	$1:2$	9.53%	45.23%	47.19%	60.00%	74.64%	76.47%
	$1:3$	3.88%	36.22%	34.29%	51.92%	68.27%	73.40%
	$1:\geq 4$	11.79%	36.53%	31.36%	44.67%	59.97%	63.46%
FQN:SN	$1:1$	70.36%	54.59%	55.39%	68.57%	76.27%	78.48%
	$1:2$	12.95%	26.73%	25.38%	57.60%	75.87%	81.15%
	$1:3$	3.91%	28.03%	25.16%	56.69%	77.07%	79.94%
	$1:\geq 4$	12.78%	55.56%	54.87%	61.01%	77.49%	82.07%

4.2.1 Motivation

In-context learning relies on example prompts to adapt the model to the new tasks unseen during pre-training. From the practical point of view, example prompts correspond to priori FQN knowledge developers have which may help the model complete the FQN inference. This RQ aims to investigate the CoPilot’s FQN inference capability when the developer can provide different amount of priori-known FQNs. The results help us evaluate the practicality of CoPilot for FQN inference, and estimate the extent of FQN knowledge stored in CoPilot.

4.2.2 Methodology

We use the Github dataset in this RQ. We use the best configuration of prompt factors (see Section 4.1) and experiment five different amount of example prompts (see Section 2.2.1). For each $<$ $simplename$ , $FQN$ $>$ pair in a code snippet, we generate five model inputs, corresponding to the five shot settings respectively. We obtain 41,290 model inputs (8,258 for each shot setting) for this RQ.

4.2.3 Result Analysis

The Best Configuration in Table III shows our results. At Zero-Shot, the inference performs poorly (50.07% accuracy). A common mistake CoPilot makes at zero-shot is to generate some FQN-irrelevant text. For example, it generates a path name “Data.Exported_Datas.BasicConfig-zero-shot.prompt_files.Ticket” for the simple name “Ticket”, which is the file that stores the task input. The poor zero-shot accuracy suggests that only a task description is insufficient for the model to determine what task it needs to solve.

One-shot-ENIC (example not in code) shows the model one example prompt. Although this example is not for the simple name in the code context, it helps to reduce the cases where the model generates FQN-irrelevant text. However, the overall accuracy (50.26%) of One-Shot-ENIC does not improve much over that of Zero-Shot, because the model often generates some FQNs irrelevant to the code context at One-Shot-ENIC, which could be misled by the non-in-the-code example.

Providing an example prompt for the simple name and FQN in the code context helps the model correct many irrelevant-to-context FQN inference errors, which leads to significant accuracy improvement (overall from 50.26% at One-Shot-ENIC to 65.72% at One-Shot). Further increasing example prompts can further boost the inference accuracy (overall 76.40% Few-Shot-REP and 79.34% Few-Shot-LOO). At Few-Shot-LOO, the model gives the upper bound of the model’s inference capability: it correctly inferences 3,224 distinct FQNs.

CoPilot stores rich FQN knowledge. Its FQN knowledge can be reasonably recalled even when only one simplename-FQN example in the code context is provided. As the number of examples increases, more and more FQN knowledge can be accurately recalled, with the maximum accuracy at about 80%. Therefore, it is practical to use CoPilot with in-context learning for FQN inference.

4.3 Impact of FQN Data Distribution Properties (RQ3)

4.3.1 Motivation

FQN data exhibits Zipfian distribution and dynamic meanings which have been reported to be influential in the in-context learning ability of PLMs [40]. This RQ aims to reveal the correlations between the FQN properties, different shot settings and the CoPilot’s FQN inference ability. The results help us understand the characteristics of FQN knowledge stored in CoPilot, and the conditions to effectively retrieve FQNs with different properties.

4.3.2 Methodology

We consider four FQN properties: lengths, usage times, and simplename-FQN (SN:FQN) and FQN-simplename (FQN:SN) cardinalities (i.e., polysemy and synonymy ambiguity respectively). We obtain the statistics of these FQN properties in the original Github dataset (see Appendix Section I in our supplementary document). The experiment setting is the same as RQ2. We calculate the accuracy for four FQN length ranges (2-4, 5-7, 8-10 and $\geq$ 11), four FQN usage time ranges [1,10), [10-1k), [1k-10k) and $\geq$ 10k), four SN:FQN cardinalities (1:1, 1:2, 1:3, 1: $\geq$ 4), and four FQN:SN cardinalities (1:1, 1:2, 1:3, 1: $\geq$ 4). SN:FQN 1:1 can be different from FQN:SN 1:1 as they are indexed by unique simple names and unique FQNs respectively. FQN percentage column in Table III shows the percentage of each range in our sample dataset.

4.3.3 Results Analysis

Table III shows the results. Looking at the inference accuracy at different FQN length ranges, it is unsurprising that the accuracy degrades as the FQNs become longer. For short FQNs (2-4 tokens), the model can make fairly accurate inference even without any example prompts (76.59% at Zero-Shot), and achieves the maximum accuracy 88.29% at Few-Shot-LOO. As the FQNs become longer (5 or more tokens), the inference accuracy drops over 58% to below 18.59% at Zero-Shot and below 18.19% at One-Shot-ENIC. For FQNs with 8 or more tokens, the accuracy is close to 0%. However, with only one example in the code context, the accuracy dramatically jumps to 50.73% for FQN length 5-7 and over 40% for FQN length $\geq$ 8. Increasing the number of examples can further boost the accuracy. At Few-Shot-REP and Few-Shot-LOO, the accuracy for the most challenging FQN length $\geq$ 11 is 55.10% and 59.18%, respectively.

For the most frequently-used FQNs ( $\geq$ 10k), CoPilot achieves almost perfect accuracy ( $\geq$ 99%) even at Zero-Shot. For the FQN usage times [1k, 10k), it also performs very well (79.75% at Zero-Shot and 90% or above at Few-Shot). For less frequently-used FQNs ([10, 1k), CoPilot still maintains a reasonable inference accuracy (about 50%-60% at Zero-Shot and One-Shot), unlike the close-to-0 accuracy for long FQNs ( $\geq$ 8 tokens). For the FQNs in the ([10, 1k) range, providing some FQN examples improves the accuracy to 75.56% at Few-Shot-REP and maximizes the accuracy to 79.18% at Few-Shot-LOO. For the lest frequently-used FQNs ( $<$ 10), CoPilot has to see some FQN examples to make the inference with reasonable accuracy (above 61% at Few-Shot).

The overall trend of accuracy changes for SN:FQN (1:N) across the five shot settings is similar to that for FQN lengths and usage times. One difference is that CoPilot still has a certain level of inference capability for the challenging SN:FQN (1: $\geq$ 3) (31%-36% accuracy at Zero-Shot and One-Shot-ENIC), rather than the catastrophic failures for FQN length $\geq$ 8 and FQN usage times $<$ 10. The other difference is that the accuracy gaps between the easy and the more challenging SN:FQN cases are smaller than those between different FQN length ranges and FQN usage-time ranges (visible from the color differences across the ranges). An interesting result is that CoPilot may make mistakes for SN:FQN (1:1). For example, the simple name Cookies, CoPilot sometimes predicts the FQN com.google.gwt.http.client.Cookies while the ground-truth is com.google.gwt.user.client.Cookies. We attribute this to the probabilistic nature of the neural network. Note that even for the dictionary-lookup in a symbolic FQN knowledge base [59, 54, 32], the accuracy at SN:FQN (1:1) is not 100% either, because the symbolic knowledge base suffers from the OOV issue limited by code compilation [35]. In contrast, CoPilot does not require any code compilation or analysis which is much more convenient to deploy in practice.

The trend of increasing accuracy from Zero-Shot to Few-Shot-LOO for FQN:SN is the same as that for the other three properties. However, at a particular shot setting (One-Shot, Few-Shot-REP or Few-Shot-LOO), the accuracy at different FQN:SN ranges has much smaller differences, compared with the accuracy differences between different ranges of the other three properties. That is, the number of different variable names referring to the same FQN (i.e., synonymy ambiguity) do not affect much the inference of this FQN once one FQN example in the code context is provided. This is because the variables are used to invoke the methods or access the fields (e.g., br.readLine()) and the method/field name provides good usage context for inferring the type of the variable name. Another interesting difference from the other three properties is that the middle FQN:SN ranges (1:2 and 1:3) are much worse than FQN:SN 1:1 and 1: $\geq$ 4 at Zero-Shot and One-Shot-ENIC, and the accuracy at FQN:SN 1: $\geq$ 4 is in par with that at FQN 1:1 for all shot settings. FQN:SN 1: $\geq$ 4 actually means more usage of an FQN although the FQN is referred by many different variable names. The benefit of the higher usage times outweighs the challenge incurred by different variable names.

The FQN knowledge in CoPilot is diverse in FQN lengths, usage times, SN:FQN and FQN:SN cardinalities. CoPilot can accurately infers short, frequently-used, less ambiguous FQNs, without the need of many task demonstrations. Providing more task demonstrations helps CoPilot better infer longer, less frequently-used, or more ambiguous FQNs. FQN usage time is the most influential on inference accuracy, followed by FQN length, and then name ambiguity. Synonymy (FQN:SN 1: $\geq$ 2) is the least influential as it indicates higher FQN usage which is beneficial for FQN inference.

4.4 Inference for Real-World Partial Code (RQ4)

TABLE IV: Comparison between PCM-based FQN Inferences

Method	Test Strategy	Stat-Type-SO	Short-SO	Overall
Pre-trained CodeBERT MLM	Individuals	18.86%	10.89%	18.16%
	Majority Win	14.73%	10.75%	14.07%
	Any-correct	18.20%	12.28%	17.22%
Prompt-tuned CodeBERT MLM	Individuals	88.25%	80.54%	87.56%
	Majority Win	88.95%	82.90%	87.94%
	Any-correct	89.76%	82.90%	88.61%
Copilot with In-Context Learning	Zero-Shot	76.09%	78.76%	76.98%
	One-Shot-ENIC	73.82%	79.35%	75.73%
	One-Shot	83.89%	88.79%	84.70%
	Few-Shot-REP	88.66%	92.04%	88.88%
	Few-Shot-LOO	89.01%	91.15%	89.31%

4.4.1 Motivation

RQ2 and RQ3 use the methods in library source code as code snippets. As the library source code is compilable, we can automatically collect the ground-truth FQNs for the large-scale experiments in RQ2 and RQ3. Although CoPilot performs well on library methods, we want to further confirm its FQN inference capability for real-world partial code. Furthermore, we want to compare CoPilot’s capability with that of small-size PCMs and the capability of in-context learning versus supervised prompt tuning.

4.4.2 Methodology

We use the two Stack Overflow datasets Stat-Type-SO and Short-SO (see Section 3.1.2). We use the best configuration identified in RQ1 (see Table I). The two SO datasets contain 496 partial code snippets and 2,320 $<$ simplename, FQN $>$ pairs, and we obtain 11,600 task inputs.

We consider two baselines: 1) pre-trained CodeBERT [9] without fine-tuning (i.e., zero-shot); 2) pre-trained CodeBERT with FQN prompt fine-tuning as [35]. We use the best FQN-prompt-tuned CodeBERT model in [35] (tuned with 11,776 source code files of the six libraries). CodeBERT is a MLM and has 125M parameters. We formulate the FQN inference on CodeBERT as a text fill-in-blank task as in [35].

As illustrated in Fig. 1, CoPilot makes one inference for each unique simple name in the code context. However, CodeBERT-based methods make one inference for each individual simple name in the code context. It may infer different FQNs for the same simple name at different locations. We consider three accuracy variants for CodeBERT-based methods (individual instance, majority-win or any-correct). Individual instance is the accuracy for all individual simple names. Majority-win calculates the accuracy based on the majority of inferred FQNs for each unique simple name. If there is a tie, an FQN is randomly picked. Any-correct means if any of the inferred FQNs for a unique simple name is correct, we consider the model makes the correct inference.

4.4.3 Result Analysis

CodeBERT without fine-tuning performs the worst, with an accuracy of less than 19% at best. This suggests that small-size PCMs do not capture much FQN knowledge as giant CoPilot. In contrast, standing on the shoulder of giant CoPilot which stores rich FQN knowledge, in-context learning achieves above 76% accuracy at Zero-Shot on the two SO datasets. Furthermore, the accuracy on the SO partial code at Zero-Shot is much higher than that on library methods at Zero-Shot (about 50%). This is because the FQNs in the partial code on Stack Overflow are mostly frequently-used APIs, which are easy to make accurate inference as shown in RQ3.

The FQN prompt-tuning significantly boosts CodeBERT’s inference accuracy to above 80%. We see small fluctuations between the three accuracy variants (individual-instance, majority-win and any-correct), which suggests that CodeBERT with prompt tuning generally makes consistent FQN inferences for the same simple name at different locations. Considering SO partial code uses many commonly used APIs, it is reasonable to assume developers would know some of the used APIs. Therefore, One-Shot and Few-Shot-REP will reflect CoPilot’s capability in practice. At these two shot settings, CoPilot achieves 8% and 12% higher accuracy than Zero-Shot, and the accuracy is close to or better than that of CodeBERT with prompt tuning. Maximizing priori-known FQNs at Few-Shot-LOO only slightly improves the accuracy. We analyze the accuracy for the code of six libraries individually, and find that in-context learning is more stable than the prompt-tuning method. Due to the space limitation, we put the result details in Section II in the supplementary materials.

CoPilot exhibits emergent FQN inference ability which does not exist in small PCMs. Reusing online code snippets is a common practice [28, 29, 30]. CoPilot can facilitate this code reuse by accurately infer FQNs for cannot-be-resoled simple names in uncompilable partial code. It achieves the accuracy with only a few examples of the FQN inference task, in par with CodeBERT with FQN-prompt tuning [35]

5 Discussions

We now discuss our exploratory interaction experiences with CoPilot, the differences between fine-tuning and in-context learning, and potential threats to the validity of our study.

5.1 Human-CoPilot Interaction

During our experiments, we perform some exploratory interactions with CoPilot inspired by its outputs. The results of such exploratory interactions are not counted in our RQs, but they inspire ways to help CoPilot better serve the SE tasks and call for further research in human-CoPilot interaction.

5.1.1 Micro-Level Sensitivity

Although CoPilot remains stable in face of prompt variants (RQ1), it is sometimes sensitive to micro-level input changes. For example, when CoPilot returns “no completion were found”, we may trigger a successful completion by slightly changing the to-be-complete prompt, for example, delete the ending “is” , append a “:”, or append several spaces. Furthermore, CoPilot sometimes generates an FQN followed by a code snippet. We find that inserting an empty line between code context and task description often forces CoPilot to generate only FQNs. Such micro-level sensitivities of CoPilot seem inevitable, but knowing them will enhance pragmatic use of CoPilot in the SE tasks.

5.1.2 Error Correction as Task Demonstration

In our experiments, we complete the inference tasks without human intervention. However, We find that many of CoPilot’s errors can be easily recognized and fixed by human. For example, CoPilot generates only a partial FQN (e.g., org.apache.xalan.xsltc.compiler for the simple name ClassGenerator). When pressing tab once more to request a further completion, CoPilot generates the full correct FQN. As another example, CoPilot sometimes generates some noisy characters in the FQNs, especially when the simple name has symbols ([], $<>$ or ()). For example, it generates [Ljava.lang.String; for String[]. The outputs of CoPilot can never be perfect, and we may not need it to be perfect [60]. But we may teach CoPilot to avoid common errors by human feedback. For example, human may tell CoPilot the generation is incomplete or remove the noise characters as further task demonstrations.

5.1.3 Various Forms of Priori Knowledge

In our experiments, a task input includes fixed priori-known FQNs and only one simple name for inference. In practice, developers can iteratively infer FQNs for multiple simple names and provide feedback during the process, for example by confirming the correctly-inferred FQNs or correcting the wrongly-inferred FQNs, which may help subsequent inferences. Other types of priori knowledge could also be useful. For example, the developer may know the partial package name of the APIs. Even providing just some beginning words such as com.android, it not only shortens the FQN generation, but also conditions CoPilot on what to generate. Or the developer may forget some parts of an FQN, she may specify the forgotten parts of the FQN by masks (e.g., “_”). For example, CoPilot generates com.android.layoutlib.bridge.impl.binding.AdapterItem based on the to-be-complete-prompt “complete the FQN com.android._._._._.AdapterItem : ”.

5.2 In-context Learning vs. Supervised Fine-Tuning

The supervised fine-tuning method has some drawbacks when compared to the lightweight and easily applicable in-context learning method.

First, fine-tuning typically necessitates the collection and processing of datasets. Our previous work [35], for example, collect the library’s source code, remove the noise in codes, and then build the dataset for tuning the pre-trained CodeBERT. In contrast, the in-context learning method only needs a few task demonstrations to teach the frozen giant code model, and does not need to go through the data process.

Second, fine-tuning necessitates gradient update, which usually requires testing a variety of different hyper-parameters, resulting in a significant amount of work. In contrast, the in-context learning method directly activates the frozen giant code model without a gradient update.

Third, the fine-tuned model is applicable to a domain-specific task. Assume that developers discover a new and distinct downstream task. In that case, they must repeat the model tuning processes to obtain a new domain-specific model. In contrast, if a new scenario arises, the in-context learning method only requires reusing the prompt design to construct the demonstrations for the new task.

So, instead of gathering new data and fine-tuning the model, the SE task shifts to designing/reusing a prompt for extracting knowledge from frozen pre-trained LMs, which is an entirely new paradigm for leveraging the superpower of giant models for previously unseen tasks.

5.3 Threats to Validity

Due to the prohibitive manual effort required, we sampled the datasets from the six libraries of Java. It is impossible to achieve the exact same distribution as the original code, but we made our best effort to ensure the distributions of sampled code are as close as possible to those of the original code and share the same overall trend. To ensure the diversity of the sampled dataset, we chose the 0.9 threshold for removing clone codes. We experimented the threshold from 0.8-0.95 and found 0.9 produces the code samples with the most diverse distributions (e.g., code LOCs, FQN lengths, usage times and simplename-FQN cardinalities) which are close to the original dataset.

Due to the prohibitive manual effort, we experiment 9 prompt configurations through factor ablation, among which we identify a “best” configuration. We will experiment more configurations (e.g., verbose task description with symbol-style prompt) in the future. Our experiments show the practicality of CoPilot for FQN inference, but this study focuses on the large-scale evaluation, not human-CoPilot interaction. Future work needs to explore the challenges and opportunities in human-CoPilot interaction discussed above. Our experiments require crowd workers to manually request the CoPilot’s completion. To minimize human errors, we train crowd workers and ensure they can use CoPilot in the IDE successfully. Furthermore, we automatically generate task inputs so crowd workers only need to perform minimum action (only need to press tab to request completion). We also have automatic check to detect the missed task inputs.

Our prompt design and evaluation methodology is generic. This study uses only Java code due to the dataset availability and the high effort to execute over 383K prompts. Our follow-up work is to apply and evaluate the FQN prompts from this study on other libraries and programming languages such as Python, C++, and C#. As CoPilot is pre-trained with vast GitHub code in many programming languages, our follow-up work does not involve any gathering new code for model fine-tuning.

6 Related Work

Many PLMs have been proposed in recent years, such as Bert [61], T5 [62], GPT3 [6], to name a few popular ones. Researchers have investigated when and why PLMs achieve the superior performance in NLP tasks. Wei et al. [12] show that model scaling is crucial for emergent abilities of giant PLMs. Chen et al. [40] identify three data distribution properties that drive emergent in-context learning capability in PLMs. These findings inspire our adoption of in-context learning on CoPilot for FQN inference.

PLMs have been extended to source code, following software naturalness [7, 8], which produces pre-trained code models (PCMs), such as CodeBERT [9], CodeT5 [10], Codex [46] and CoPilot [11]. These PCMs have significantly improve many SE tasks, such as code summarization [49], fault prediction [50], vulnerability detection [63], code translation [51], FQN inference [35]. There are two ways to transfer PLMs in the downstream tasks, supervised fine-tuning versus in-context learning (see Section 2.1). All existing work on using PCMs adopts supervised fine-tuning, while our work is the first to adopt in-context learning in the SE task.

Many NLP studies [64, 65, 66, 67] show PLMs can serve as neural knowledge bases, as opposed to symbolic knowledge bases. SE researchers also attempt to probe different knowledge captured in PCMs, for example, structural knowledge like AST [68, 69], semantic knowledge like code weakness [70, 71]. Some recent works investigate CoPilot’s capability for code generation [72, 70] and code translation [60], but they use only a small number of coding tasks. Our study is the first to investigate the SE factual knowledge in PCMs at large scale. In addition to FQNs, there are other types of SE factual knowledge often embedded in code, such as user credentials, text resources. Memorizing these facts may lead to serious security issues [73, 74], as attackers may probe user credentials or commercial secrets in PLMs. Although this is a real concern in adopting the models like CoPilot [75], no systematic study has been done to probe security-related factual knowledge in CoPilot. Our research methodology can be extended to this aspect.

Although PLMs demonstrate strong language capabilities, they can never be perfect due to their probabilistic nature [72, 70, 60]. Vaithilingam et al. [72] evaluate the usability of CoPilot in 3 simple programming tasks. Our interaction experience with CoPilot echoes their findings, but we conduct large-scale experiments on a specific type of SE factual knowledge. The imperfection of PLMs calls for innovative human-AI collaboration. The tools like CoAuthor [76] and WordCraft [77] support human-GPT collaborative writing. The quality of such collaborative writing is subjective. However, human-CoPilot collaboration in SE tasks would demand objective outcomes (e.g., FQN correctness in our study), which demand innovative interaction designs for human-AI co-learning [60, 76].

A promising paradigm for human-AI interaction is PLM recursion [78] and AI chain [72]. This paradigm adopts a divide-and-conquer strategy, in which the same PLM can be adapted to play different roles, and these roles and human roles can be linked to perform complex tasks. Our FQN inference on CoPilot plays a specific role in code reuse. We encourage the community to design other roles on the shoulder of PCMs and incorporate these roles to support complex coding tasks.

7 Conclusion

This paper presents a lightweight in-context learning method on CoPilot for FQN inference in partial code, and designs a research methodology to evaluate the extent and characteristics of FQN knowledge in CoPilot and identify effective prompt designs and conditions to retrieve the FQN knowledge. Our experiments confirm that CoPilot stores a large amount of priori knowledge of FQNs which can be accurately retrieved through zero or a few examples of task demonstrations. Furthermore, in-context learning on CoPilot for FQN inference has no technical barrier (except for the CoPilot account cost) to deploy, as it does not require any code parsing or model tuning. Our work demonstrates the benefits and practicality of standing on the shoulder of frozen giant PCMs and using the SE factual knowledge these models store in the SE tasks. In the future, we will extend our FQN inference approach to more programming languages, and extend our research methodology to other SE factual knowledge (e.g., privacy and proprietary information in code). We will investigate novel human-CoPilot interaction design to enhance human-AI collaboration in the SE tasks.

8 Acknowledgements

The work is partly supported by the National Nature Science Foundation of China under Grant (Nos. 62262031, 61902162, 61862033), the Nature Science Foundation of Jiangxi Province (20202BAB202015), the Graduate Innovative Special Fund Projects of Jiangxi Province (YC2021-S308, YC2022-S258).

References

[1] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in NAACL, 2018.
[2] A. Radford and K. Narasimhan, “Improving language understanding by generative pre-training,” in ., 2018.
[3] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” ArXiv, vol. abs/1810.04805, 2019.
[4] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language understanding,” in NeurIPS, 2019.
[5] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” ArXiv, vol. abs/1907.11692, 2019.
[6] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. J. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” ArXiv, vol. abs/2005.14165, 2020.
[7] P. T. Devanbu, “On the naturalness of software,” 2012 34th International Conference on Software Engineering (ICSE), pp. 837–847, 2012.
[8] M. Allamanis, E. T. Barr, P. T. Devanbu, and C. Sutton, “A survey of machine learning for big code and naturalness,” ACM Computing Surveys (CSUR), vol. 51, pp. 1 – 37, 2018.
[9] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou, “Codebert: A pre-trained model for programming and natural languages,” ArXiv, vol. abs/2002.08155, 2020.
[10] Y. Wang, W. Wang, S. Joty, and S. C. Hoi, “Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,” arXiv preprint arXiv:2109.00859, 2021.
[11] GitHub, “”github copilot. your ai pair programmer” [online]. available:https://copilot.github.com/,” GitHub CoPilot, 2021.
[12] Y. Wan, W. Zhao, H. Zhang, Y. Sui, G. Xu, and H. Jin, “What do they capture? - a structural analysis of pre-trained language models for source code,” 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), pp. 2377–2388, 2022.
[13] L. Buratti, S. Pujar, M. A. Bornea, S. McCarley, Y. Zheng, G. Rossiello, A. Morari, J. Laredo, V. Thost, Y. Zhuang, and G. Domeniconi, “Exploring software naturalness through neural language models,” ArXiv, vol. abs/2006.12641, 2020.
[14] X. Yuan, G. Lin, Y. Tai, and J. Zhang, “Deep neural embedding for software vulnerability discovery: Comparison and optimization,” Security and Communication Networks, vol. 2022, 2022.
[15] D. Wang, Z. Jia, S. Li, Y. Yu, Y. Xiong, W. Dong, and X. Liao, “Bridging pre-trained models and downstream tasks for source code understanding,” 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), pp. 287–298, 2022.
[16] A. Karmakar and R. Robbes, “What do pre-trained code models know about code?” 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1332–1336, 2021.
[17] C. Wang, Y. Yang, C. Gao, Y. Peng, H. Zhang, and M. R. Lyu, “No more fine-tuning? an experimental evaluation of prompt tuning in code intelligence,” arXiv preprint arXiv:2207.11680, 2022.
[18] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, “Pre-trained models for natural language processing: A survey,” ArXiv, vol. abs/2003.08271, 2020.
[19] X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo, J. Qiu, L. Zhang, W. Han, M. Huang, Q. Jin, Y. Lan, Y. Liu, Z. Liu, Z. Lu, X. Qiu, R. Song, J. Tang, J. rong Wen, J. Yuan, W. X. Zhao, and J. Zhu, “Pre-trained models: Past, present and future,” AI Open, vol. 2, pp. 225–250, 2021.
[20] P. Godefroid, P. de Halleux, A. V. Nori, S. K. Rajamani, W. Schulte, N. Tillmann, and M. Y. Levin, “Automating software testing using program analysis,” IEEE software, vol. 25, no. 5, pp. 30–37, 2008.
[21] Y. Zhou, S. Liu, J. Siow, X. Du, and Y. Liu, “Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks,” ArXiv, vol. abs/1909.03496, 2019.
[22] X. Ren, X. Ye, Z. Xing, X. Xia, X. Xu, L. Zhu, and J. Sun, “Api-misuse detection driven by fine-grained api-constraint knowledge graph,” 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 461–472, 2020.
[23] L. T. C. Melo, R. G. Ribeiro, B. C. F. Guimarães, and F. M. Q. a. Pereira, “Type inference for c: Applications to the static analysis of incomplete programs,” ACM Trans. Program. Lang. Syst., 2020.
[24] T. Zhang, G. Upadhyaya, A. Reinhardt, H. Rajan, and M. Kim, “Are code examples on an online q&a forum reliable?: A study of api misuse on stack overflow,” 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), pp. 886–896, 2018.
[25] L. Piccolboni, G. D. Guglielmo, L. P. Carloni, and S. Sethumadhavan, “Crylogger: Detecting crypto misuses dynamically,” 2021 IEEE Symposium on Security and Privacy (SP), pp. 1972–1989, 2021.
[26] S. Maji, S. S. Rout, and S. Choudhary, “Dcom: A deep column mapper for semantic data type detection,” 2021. [Online]. Available: https://arxiv.org/abs/2106.12871
[27] S. Thummalapenta and T. Xie, “Parseweb: a programmer assistant for reusing open source code on the web,” in Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering, 2007, pp. 204–213.
[28] M. M. Rahman, J. Barson, S. Paul, J. Kayani, F. A. Lois, S. F. Quezada, C. Parnin, K. T. Stolee, and B. Ray, “Evaluating how developers use general-purpose web-search for code retrieval,” in Proceedings of the 15th International Conference on Mining Software Repositories, 2018, pp. 465–475.
[29] D. Gopstein, H. H. Zhou, P. Frankl, and J. Cappos, “Prevalence of confusing code in software projects: Atoms of confusion in the wild,” in Proceedings of the 15th International Conference on Mining Software Repositories, 2018, pp. 281–291.
[30] P. Yin, B. Deng, E. Chen, B. Vasilescu, and G. Neubig, “Learning to mine aligned code and natural language pairs from stack overflow,” in 2018 IEEE/ACM 15th international conference on mining software repositories (MSR). IEEE, 2018, pp. 476–486.
[31] S. Subramanian, L. Inozemtseva, and R. Holmes, “Live api documentation,” International Conference on Software Engineering (ICSE), 2014.
[32] C. M. K. Saifullah, M. Asaduzzaman, and C. K. Roy, “Learning from examples to find fully qualified names of api elements in code snippets,” in 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2019, pp. 243–254.
[33] Z. Su, G. Zhang, F. Yue, L. Chang, J. Jiang, and X. Yao, “Snr-constrained heuristics for optimizing the scaling parameter of robust audio watermarking,” IEEE Transactions on Multimedia, vol. 20, pp. 2631–2644, 2018.
[34] B. Dagenais and L. J. Hendren, “Enabling static analysis for partial java programs,” Proceedings of the 23rd ACM SIGPLAN conference on Object-oriented programming systems languages and applications, 2008.
[35] Q. Huang, Z. Yuan, Z. Xing, X. Xu, L. Zhu, and Q. Lu, “Prompt-tuned code language model as a neural knowledge base for type inference in statically-typed partial code,” International Conference on Automated Software Engineering(ASE), 2022.
[36] G. Shi, J. Chen, W. Zhang, L.-M. Zhan, and X.-M. Wu, “Overcoming catastrophic forgetting in incremental few-shot learning by finding flat minima,” Advances in Neural Information Processing Systems, vol. 34, pp. 6747–6761, 2021.
[37] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, and W. Fedus, “Emergent abilities of large language models,” ArXiv, vol. abs/2206.07682, 2022.
[38] F. Petroni, T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. Miller, and S. Riedel, “Language models as knowledge bases?” ArXiv, vol. abs/1909.01066, 2019.
[39] B. Heinzerling and K. Inui, “Language models as knowledge bases: On entity representations, storage capacity, and paraphrased queries,” ArXiv, vol. abs/2008.09036, 2021.
[40] S. C. Y. Chan, A. Santoro, A. K. Lampinen, J. X. Wang, A. Singh, P. H. Richemond, J. Mcclelland, and F. Hill, “Data distributional properties drive emergent in-context learning in transformers,” ArXiv, vol. abs/2205.05055, 2022.
[41] Z. Xing and E. Stroulia, “Api-evolution support with diff-catchup,” IEEE Transactions on Software Engineering, vol. 33, no. 12, pp. 818–836, 2007.
[42] M. Lamothe, Y.-G. Guéhéneuc, and W. Shang, “A systematic review of api evolution literature,” ACM Computing Surveys (CSUR), vol. 54, no. 8, pp. 1–36, 2021.
[43] L. Li, J. Gao, T. F. Bissyandé, L. Ma, X. Xia, and J. Klein, “Cda: Characterising deprecated android apis,” Empirical Software Engineering, vol. 25, pp. 2058–2098, 2020.
[44] R. Robbes, M. Lungu, and D. Röthlisberger, “How do developers react to api deprecation?: the case of a smalltalk ecosystem,” in SIGSOFT FSE, 2012.
[45] J. Zhang, H. Zhang, C. Xia, and L. Sun, “Graph-bert: Only attention is needed for learning graph representations,” arXiv preprint arXiv:2001.05140, 2020.
[46] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
[47] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,” arXiv preprint arXiv:2107.13586, 2021.
[48] T. Schick, H. Schmid, and H. Schütze, “Automatically identifying words that can serve as labels for few-shot text classification,” arXiv preprint arXiv:2010.13641, 2020.
[49] W. Sun, C. Fang, Y. Chen, Q. Zhang, G. Tao, T. Han, Y. Ge, Y. You, and B. Luo, “An extractive-and-abstractive framework for source code summarization,” arXiv preprint arXiv:2206.07245, 2022.
[50] S. S. Rathore and S. Kumar, “A study on software fault prediction techniques,” Artificial Intelligence Review, vol. 51, no. 2, pp. 255–327, 2019.
[51] M.-A. Lachaux, B. Roziere, L. Chanussot, and G. Lample, “Unsupervised translation of programming languages,” arXiv preprint arXiv:2006.03511, 2020.
[52] R. Pawlak, M. Monperrus, N. Petitprez, C. Noguera, and L. Seinturier, “Spoon: A Library for Implementing Analyses and Transformations of Java Source Code,” Software: Practice and Experience, vol. 46, pp. 1155–1179, 2015. [Online]. Available: https://hal.archives-ouvertes.fr/hal-01078532/document
[53] X. Ye, H. Shen, X. Ma, R. C. Bunescu, and C. Liu, “From word embeddings to document similarities for improved information retrieval in software engineering,” 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), pp. 404–415, 2016.
[54] H. D. Phan, H. A. Nguyen, N. M. Tran, L.-H. Truong, A. T. Nguyen, and T. N. Nguyen, “Statistical learning of api fully qualified names in code snippets of online forums,” 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), pp. 632–642, 2018.
[55] Y. Dong, T. Gu, Y. Tian, and C. Sun, “Snr: Constraint-based type inference for incomplete java code snippets,” 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), pp. 1982–1993, 2022.
[56] N. Ding, Y. Chen, X. Han, G. Xu, P. Xie, H.-T. Zheng, Z. Liu, J. Li, and H.-G. Kim, “Prompt-learning for fine-grained entity typing,” arXiv preprint arXiv:2108.10604, 2021.
[57] Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh, “Calibrate before use: Improving few-shot performance of language models,” in International Conference on Machine Learning. PMLR, 2021, pp. 12 697–12 706.
[58] Y. Elazar, N. Kassner, S. Ravfogel, A. Feder, A. Ravichander, M. Mosbach, Y. Belinkov, H. Schutze, and Y. Goldberg, “Measuring causal effects of data statistics on language model’s ’factual’ predictions,” ArXiv, vol. abs/2207.14251, 2022.
[59] Y. Dong, T. Gu, Y. Tian, and C. Sun, “Snr: Constraint based type inference for incomplete java code snippets,” International Conference on Software Engineering (ICSE), 2022.
[60] J. D. Weisz, M. J. Muller, S. Houde, J. T. Richards, S. I. Ross, F. Martinez, M. Agarwal, and K. Talamadupula, “Perfection not required? human-ai partnerships in code translation,” 26th International Conference on Intelligent User Interfaces, 2021.
[61] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[62] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu et al., “Exploring the limits of transfer learning with a unified text-to-text transformer.” J. Mach. Learn. Res., vol. 21, no. 140, pp. 1–67, 2020.
[63] Z. Li, D. Zou, S. Xu, H. Jin, Y. Zhu, and Z. Chen, “Sysevr: A framework for using deep learning to detect software vulnerabilities,” IEEE Transactions on Dependable and Secure Computing, 2021.
[64] A. Roberts, C. Raffel, and N. M. Shazeer, “How much knowledge can you pack into the parameters of a language model?” ArXiv, vol. abs/2002.08910, 2020.
[65] F. Petroni, T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. Miller, and S. Riedel, “Language models as knowledge bases?” arXiv preprint arXiv:1909.01066, 2019.
[66] Z. Jiang, F. F. Xu, J. Araki, and G. Neubig, “How can we know what language models know?” Transactions of the Association for Computational Linguistics, vol. 8, pp. 423–438, 2020.
[67] B. Heinzerling and K. Inui, “Language models as knowledge bases: On entity representations, storage capacity, and paraphrased queries,” arXiv preprint arXiv:2008.09036, 2020.
[68] J. Zhang, X. Wang, H. Zhang, H. Sun, K. Wang, and X. Liu, “A novel neural source code representation based on abstract syntax tree,” in 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 2019, pp. 783–794.
[69] H. Diel, “Language representation based on abstract sytax,” in GI — 6. Jahrestagung, E. J. Neuhold, Ed. Berlin, Heidelberg: Springer Berlin Heidelberg, 1976, pp. 133–147.
[70] H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri, “Asleep at the keyboard? assessing the security of github copilot’s code contributions,” in 2022 IEEE Symposium on Security and Privacy (SP). IEEE, 2022, pp. 754–768.
[71] W. Li, C. J. Mitchell, and T. Chen, “Your code is my code: Exploiting a common weakness in oauth 2.0 implementations,” in Cambridge International Workshop on Security Protocols. Springer, 2018, pp. 24–41.
[72] P. Vaithilingam, T. Zhang, and E. L. Glassman, “Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models,” in CHI Conference on Human Factors in Computing Systems Extended Abstracts, 2022, pp. 1–7.
[73] N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson et al., “Extracting training data from large language models,” in 30th USENIX Security Symposium (USENIX Security 21), 2021, pp. 2633–2650.
[74] J. Conrad, The secret sharer. phonereader, 2006.
[75] H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri, “An empirical cybersecurity evaluation of github copilot’s code contributions,” arXiv e-prints, pp. arXiv–2108, 2021.
[76] M. Lee, P. Liang, and Q. Yang, “Coauthor: Designing a human-ai collaborative writing dataset for exploring language model capabilities,” in CHI Conference on Human Factors in Computing Systems, 2022, pp. 1–19.
[77] A. Coenen, L. Davis, D. Ippolito, E. Reif, and A. Yuan, “Wordcraft: A human-ai collaborative editor for story writing,” arXiv preprint arXiv:2107.07430, 2021.
[78] T. Wu, M. Terry, and C. J. Cai, “Ai chains: Transparent and controllable human-ai interaction by chaining large language model prompts,” in CHI Conference on Human Factors in Computing Systems, 2022, pp. 1–22.

QING HUANG received the M.S degree in computer application and technology from Nanchang University, in 2009, and the PH.D. degree in computer software and theory from Wuhan University, in 2018. He is currently an Assistant Professor with the School of Computer and Information Engineering, Jiangxi Normal University, China. His research interests include information security, software engineering and knowledge graph.

Dianshu Liao is a fourth-year undergraduate student at the School of Computer and Information Engineering, Jiangxi Normal University, China. His research interests include software engineering and knowledge graph.

Zhenchang Xing is a Senior Research Scientist with Data61, CSIRO, Eveleigh, NSW, Australia. In addition, he is an Associate Professor in the Research School of Computer Science, Australian National University. Previously, he was an Assistant Professor in the School of Computer Science and Engineering, Nanyang Technological University, Singapore, from 2012-2016. His main research areas are software engineering, applied data analytics, and human-computer interaction.

Zhiqiang Yuan is a second-year graduate student in the School of Computer and Information Engineering, Jiangxi Normal University. His research interests are software engineering and knowledge graph.

Qinghua Lu is a Senior Research Scientist with Data61, CSIRO, Eveleigh, NSW, Australia. Before joining Data61, she was an Associate professor at China University of Petroleum, and she worked as a researcher at National Information and Communications Technology Australia. She has published more than 100 academic papers in international journals and conferences. Her research interests include the software architecture, blockchain, software engineering for AI, and AI ethics.

Xiwei Xu is a Senior Research Scientist with Architecture& Analytics Platforms Team, Data61, CSIRO. She is also a Conjoint Lecturer with UNSW. She started working on blockchain since 2015. She is doing research on blockchain from software architecture perspective, for example, tradeoff analysis, and decision making and evaluation framework. Her main research interest is software architecture. She also does research in the areas of service computing, business process, and cloud computing and dependability.

Jiaxing Lu received his Master’s degree from Jiangxi Normal University, China, in 2004. He is currently a Lecturer in the School of Computer and Information Engineering of Jiangxi Normal University. His research interest is mainly in the areas of algorithms of artificial intelligence and computing theory. He has participated in multiple projects and published several research papers in scholarly journals in the above research areas.