
SmartGSN: a generative AI-powered online tool for the management of assurance cases
We thank the Mitacs GRI program for funding SmartGSN development.

Oluwafemi Odu1, Daniel Méndez Beltrán1,2, Emiliano Berrones Gutiérrez1,2, Alvine B. Belle1, Melika Sherafat1
1Lassonde School of Engineering at York University, Toronto, Canada
2Monterrey Institute of Technology and Higher Education, Mexico
olufemi2@yorku.ca, a01770621@tec.mx, a01369091@tec.mx, alvine.belle@lassonde.yorku.ca, meddeta@my.yorku.ca
Abstract

Developing industry-wide standards and making sure producers of mission-critical systems comply with them is crucial to foster consumer acceptance. Producers of such systems can rely on assurance cases to demonstrate to regulatory authorities how they have complied with such standards to help prevent system failure, which could result in fatalities and environmental damage. In this paper, we introduce SmartGSN, an innovative online tool that relies on Large Language Models to (semi-)automate the management of assurance cases complying with GSN, a very popular notation to graphically represent assurance cases. The evaluation of SmartGSN demonstrates its strong capability to detect assurance case patterns within the assurance cases manually created for five systems spanning several application domains. SmartGSN is accessible online at [https://smartgsn.vercel.app], and a demonstration video can be viewed at [https://youtu.be/qLrTHf-SZbM].

Index Terms:
requirement engineering, assurance cases, assurance case patterns, large language models, generative AI, GPT.

I Introduction

The producers of mission-critical systems designed to perform essential tasks must usually demonstrate that their systems effectively support their intended non-functional properties (e.g., safety, security). This allows certifying that these systems comply with specific industrial standards and allows regulators to authorize their deployment [1]. System assurance supports that demonstration by relying on assurance cases (ACs) [1]. An assurance case is a well-established, structured, reasoned, and auditable set of arguments supported by a body of evidence and designed to demonstrate that the non-functional requirements of a system have been correctly implemented [2]. There are different types of assurance cases (e.g., safety cases, security cases), each of them targeting a specific non-functional requirement. The use of assurance cases is common across various application domains (e.g., aerospace, automotive) since it allows preventing system failure. The latter could result in fatalities and financial losses.

Assurance cases can be represented using either textual notations (e.g., structured prose) or graphical notations [3]. The most widely used graphical notation to represent assurance cases is the Goal Structuring Notation (GSN) [4]. Assurance cases represented in GSN are depicted in a tree-like structure called the goal structure [4]. The latter provides a diagrammatic representation of assurance cases. Assurance case patterns (ACPs) are reusable templates that facilitate the creation of assurance cases from recurring assurance case structures [4].

System assurance management activities include the construction of ACs, the instantiation of ACs, their assessment, their update, their formalization, and their formal verification [5, 6, 7]. However, manually creating and maintaining these assurance cases can be tedious, error-prone, and time-consuming, especially since assurance cases are usually very large documents that may consist of several hundred pages [5, 8]. Thus, several tools (e.g., MMINT-A [9], DevCase [7], Trusta [10], SPIRIT [11], ExplicitCase [12], and Advocate [13]) support the management of assurance cases. These tools range from simple graphic editors for drawing assurance cases to comprehensive platforms that offer advanced features, such as traceability between assurance case elements, support for multiple assurance case notations (e.g., SACM, CAE, and GSN), and type checking to validate the content and structure of assurance cases. Maksimov et al. [5] surveyed some of these tools with an emphasis on assurance case assessment.

To the best of our knowledge, there is currently no online tool that leverages Generative Artificial Intelligence (AI) through Large Language Models (LLMs) for the semi-automatic management of assurance cases. This hinders the automation of the assurance case development process, forcing users to depend on manual, tedious, and time-consuming methods. Without natural language processing capabilities, assurance case developers continue to face increased workloads as they must manually extract system artifacts from system models and replace abstract information within patterns to construct system-specific assurance cases. To bridge that gap, we propose SmartGSN, a novel online tool relying on LLMs to support the semi-automatic management of assurance cases. Its envisioned users are researchers and practitioners (e.g., regulatory authorities, corporate safety/security analysts).

The contributions of this paper are three-fold: 1) we present an online tool that relies on LLMs to manage GSN-compliant assurance cases; 2) we describe the technological stack we used to implement and deploy that tool; and 3) we report an evaluation of the proposed tool. Results from that evaluation showed that our tool performs well in detecting patterns within assurance cases.

II Description of SmartGSN

We propose a novel tool called SmartGSN. It leverages Generative AI through LLMs to semi-automate the management of assurance cases. SmartGSN aims to expedite the creation of initial drafts of assurance cases, enabling argument developers to focus on refining and enhancing the quality of these assurance cases. Thus, it is possible to interact with the tool to refine its results.

II-A Core Features of SmartGSN

II-A1 Functional Requirements

The features of our tool are:

  • Instantiation of assurance cases from assurance case patterns: We introduced the pattern instantiation approach our tool supports in previous work [14], where we used different prompting techniques to craft LLM prompts that instantiate assurance cases from assurance case patterns. Our tool offers users the choice between two LLMs (i.e., GPT-4o and GPT-4 Turbo) and lets them provide prompts that contain domain knowledge of a specific system and a given pattern in a formalized format [14]. The tool then uses that information to prompt the selected LLM to generate a system assurance case in compliance with the specified pattern.

  • Detection of assurance case patterns within assurance cases: This feature aids in understanding the underlying structure of assurance cases, ensuring consistency and compliance with established patterns. Detecting these patterns also facilitates the analysis, validation, and refactoring of assurance cases, enhancing their reliability. To detect a given assurance case pattern within an assurance case, each LLM at hand utilizes a prompt that leverages a metric-based heuristic. The latter aggregates n metrics that each compare the similarity between an assurance case pattern and an assurance case. The textbox below illustrates that heuristic. Based on that heuristic, if the value of at least one of the n metrics meets or exceeds its threshold, the LLM at hand concludes that the pattern is detected; otherwise, it concludes that the pattern is not detected.

  • Conversion of assurance cases from textual format to graphical format: To account for the widespread use of LLMs for content and text generation, our tool automatically converts LLM-generated assurance cases from textual format (structured prose) to GSN diagrams following the guidelines of the GSN standard [4].

  • Creation (editing) of assurance cases: Similar to most assurance case editors, our tool features a graphical interface to facilitate the creation of assurance cases complying with GSN.

If the value of metric_1 is greater than or equal to threshold_metric_1, OR if the value of metric_2 is greater than or equal to threshold_metric_2, …, OR if the value of metric_n is greater than or equal to threshold_metric_n, then conclude that the formalized assurance case pattern has been detected in the formalized assurance case. Otherwise, conclude that the formalized assurance case pattern has not been detected in the formalized assurance case.
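The decision rule above can be sketched in a few lines of Python. The function name and argument shapes below are illustrative assumptions, not taken from the SmartGSN codebase:

```python
def pattern_detected(metric_values, thresholds):
    """OR-based heuristic: the assurance case pattern counts as detected
    as soon as at least one similarity metric meets or exceeds its
    corresponding threshold."""
    return any(value >= threshold
               for value, threshold in zip(metric_values, thresholds))

# Example with n = 2 metrics (e.g., BLEU score and semantic similarity):
# the second metric clears its threshold, so the pattern is detected.
print(pattern_detected([0.30, 0.72], [0.40, 0.60]))  # True
```

Because the rule is a disjunction, lowering any single threshold can only make detection more permissive, which matches the threshold sensitivity discussed in the evaluation section.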

II-A2 Non-Functional Requirements

Our tool supports several non-functional requirements including the following:

  • Security: To support security, we implemented authorization with Firebase Authentication, which allows secure user authentication, password management, and account creation. Credentials are managed securely on Firebase’s backend, protecting sensitive data like passwords.

  • Accessibility: To support accessibility, we have deployed our tool online, allowing users to access and utilize its features from any location with an internet connection. We have also implemented responsive web design, allowing users to use the tool from different devices.

  • Cross-platform compatibility: Our tool is available online, which makes it operable across various operating systems, hardware platforms, and browsers.

II-B Architecture of SmartGSN

To support the aforementioned features within SmartGSN, we applied the 3-tier architectural pattern to obtain the architecture of SmartGSN. Hence, the architecture of that tool comprises three tiers: the presentation, business, and data tiers. Figure 1 depicts the core technologies we used to implement and deploy the SmartGSN tiers. We describe them below.

Figure 1: SmartGSN core supporting technologies

II-B1 Technologies used to develop the presentation tier

The presentation tier is responsible for managing the application’s user interface and experience. It allows displaying information to users and collecting their input. We used the following main technologies to implement that tier:

  1. React JS: The most popular JavaScript library for building front-end webpages. It is the core of the tool. We implemented React Hooks to manage performance and handle functions properly.

  2. React Flow: A library for visualizing the nodes in the graphical format of GSN. It uses ’index’ files for the nodes and edges, which translate into nodes with connections and positions.

  3. Material UI: A popular React component library that implements Google’s Material Design guidelines. It provides pre-built, customizable components to help build modern, responsive, and consistent user interfaces.

II-B2 Technologies used to develop the business tier

The business tier manages the core functionality and logic of our tool. It includes various APIs, rules, and functions designed to process user inputs and perform specific tasks, such as pattern detection and the automatic instantiation of assurances. The following technologies allowed us to implement that tier:

  1. Dagre: A library that lays out the nodes depending on their connections. It uses various algorithms to identify the level of each node and then portrays the elements as a tree structure.

  2. LLM (GPT and its parameters and values): The OpenAI API provides developers with access to powerful language models, such as GPT-4, for tasks involving natural language processing [15]. We relied on that API to connect the tool to the two LLMs at hand, namely GPT-4 Turbo and GPT-4 Omni. After receiving a user’s request containing the desired parameters and values, each LLM provides a suitable response (e.g., a GSN-compliant assurance case).
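As a rough sketch of how the business tier could assemble such a request, the snippet below builds the parameter payload that would be passed to OpenAI’s chat completions endpoint (e.g., via `client.chat.completions.create(**request)` with the official `openai` Python client). The helper name and prompt wording are hypothetical, not SmartGSN’s actual code:

```python
def build_llm_request(model, system_description, formalized_pattern,
                      temperature=1, max_tokens=4096):
    # Assemble the parameters a chat-completion call would receive.
    # The prompt combines the user's domain knowledge with the
    # formalized assurance case pattern, as described in Section II-A1.
    prompt = (
        "Using the domain knowledge below, instantiate the formalized "
        "assurance case pattern into a GSN-compliant assurance case.\n\n"
        f"Domain knowledge:\n{system_description}\n\n"
        f"Formalized pattern:\n{formalized_pattern}"
    )
    return {
        "model": model,  # e.g., "gpt-4o" or "gpt-4-turbo"
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

request = build_llm_request(
    "gpt-4o",
    "An unmanned underwater vehicle operating in open water.",
    "Goal G1: {System X} is acceptably safe to operate.",
)
```

The `temperature` and `max_tokens` defaults mirror the LLM settings reported in the evaluation section (temperature 1, at most 4096 generated tokens).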

II-B3 Technologies used to develop the data tier

The data tier is responsible for data storage within our tool (e.g., users’ credentials). It supports the execution of CRUD (Create, Read, Update, Delete) operations on the stored data. To implement that tier, we relied on Google Firebase, a set of tools that manage the information accessed in the back-end of the tool. Thus, SmartGSN webpages use Firebase Authentication and the Firestore Database to manage users and passwords.

II-B4 Technologies used to manage the versions of the tool

To manage the versions of our tool, we relied on a version control platform called GitHub.

II-B5 Technologies used to deploy the tool

We used two technologies to deploy SmartGSN on the cloud:

  1. Vercel: For the deployment of the SmartGSN front-end, we used Vercel, a platform designed for the deployment and hosting of web applications. Vercel automatically deploys the project to its cloud infrastructure once the build process is finished. We integrated Vercel with our GitHub repository to make updates easier to implement into the tool.

  2. Heroku: We deployed the SmartGSN back-end on Heroku, a cloud platform that allows us to build, run, and scale applications. Thus, we used Heroku to deploy the code that interacts with the OpenAI API.

II-C Illustration of the Tool

Figure 2: Partial Assurance Case Pattern Detection Screen

Figure 2 provides a partial illustration of the pattern detection screen. More specifically, it depicts the collapsed interface that allows users to input values required to support pattern detection. The LLM selected through that interface uses the supplied input values to detect a pattern in an assurance case.

III Empirical Evaluation

In [14], we validated the pattern instantiation technique that we have used to implement SmartGSN. In this paper, our evaluation focuses on the SmartGSN pattern detection feature.

III-A Experiment settings

III-A1 Research question

Our evaluation aims at answering the following research question (RQ): Can SmartGSN correctly detect assurance case patterns in assurance cases?

III-A2 Dataset description

The dataset utilized in our experiment is a collection of assurance case patterns and assurance cases manually developed from these patterns. Specifically, our dataset comprises six assurance case patterns and five assurance cases related to the following systems: ACAS XU, BLUEROV2, GPCA, IM Software, and DEEPMIND. We extracted the ACPs, ACs, and additional domain information about these systems from several studies focusing on instantiating assurance cases from patterns: see previous work [14] for more details. These ACPs and ACs were developed for systems across a range of application domains (including aviation, automotive, medical, and computing) to support the assurance of non-functional requirements such as safety and security. It is important to note that, in our dataset, all assurance cases except one are derived from a single assurance case pattern; only the BlueROV2 assurance case is derived from a combination of two patterns.

III-A3 LLM settings

Our experiment focuses on two LLMs: GPT-4o and GPT-4 Turbo. To evaluate the generative AI-based pattern detection feature of SmartGSN, we performed that experiment five times (K=5) to account for the non-deterministic nature of the LLMs at hand. We set each LLM at hand to its default parameters. Thus, we set the temperature to the default value of 1. Additionally, we set the maximum number of tokens each LLM can generate to 4096.

III-A4 Metrics used to craft pattern detection prompts

In the context of this paper, we focus on a set of two well-known metrics (i.e., n=2) that allow computing text similarity. These metrics are BLEU Score and Semantic Similarity [16, 17]. To support pattern detection, SmartGSN uses these metrics to compute the similarity between the assurance case pattern and the assurance case, both formalized in an advanced structured prose format we introduced in previous work [14]. SmartGSN computes these two metrics by relying on the following Python libraries: Sacrebleu and scikit-learn. For brevity’s sake, we will only report the experiment results we obtained with the following metric thresholds: 0.4, 0.6, and 0.8.
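To illustrate the kind of similarity score these metrics produce, the dependency-free sketch below computes a bag-of-words cosine similarity between a formalized pattern fragment and an assurance case fragment. This is a simplified stand-in for the BLEU and semantic similarity computations SmartGSN performs with Sacrebleu and scikit-learn, and the example texts are hypothetical:

```python
import math
from collections import Counter

def bag_of_words_cosine(text_a, text_b):
    """Cosine similarity over word-count vectors: a simplified stand-in
    for the BLEU and semantic similarity metrics used by SmartGSN."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

pattern = "Goal: the {system} is acceptably safe to operate"
case = "Goal: the BlueROV2 is acceptably safe to operate"
score = bag_of_words_cosine(pattern, case)  # high, since only one word differs
```

A score like this would then be compared against its threshold (0.4, 0.6, or 0.8) by the OR-based detection heuristic described in Section II-A1.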

III-A5 Experiments description

We rely on a combination of two prompting strategies to perform our experiment [18, 19]: Zero-shot + CoT. With these techniques, we do not include any examples in our prompt. Still, we provide a series of intermediate logical steps needed to improve our tool’s ability to detect a given pattern within an assurance case. Thus, to validate the pattern detection feature of SmartGSN, we used SmartGSN to perform an experiment in which we ran each LLM five times with the Zero-shot prompting technique together with CoT to detect patterns in assurance cases.
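A Zero-shot + CoT detection prompt along these lines can be sketched as a template: no worked examples are included (zero-shot), but the intermediate reasoning steps are spelled out (chain-of-thought). The wording below is a hypothetical reconstruction, not the exact prompt SmartGSN uses:

```python
# Illustrative Zero-shot + Chain-of-Thought prompt template for
# pattern detection (hypothetical wording).
COT_DETECTION_TEMPLATE = """You are an assurance case expert.
Reason step by step:
1. Read the formalized assurance case pattern below.
2. Read the formalized assurance case below.
3. Compare their structures using the provided similarity metric values.
4. If at least one metric value meets or exceeds its threshold, conclude
   that the pattern has been detected; otherwise, conclude it has not.

Formalized pattern:
{pattern}

Formalized assurance case:
{case}

Metric values and thresholds:
{metrics}
"""

prompt = COT_DETECTION_TEMPLATE.format(
    pattern="Goal G1: {System X} is acceptably safe to operate.",
    case="Goal G1: BlueROV2 is acceptably safe to operate.",
    metrics="BLEU = 0.45 (threshold 0.40); "
            "semantic similarity = 0.55 (threshold 0.60)",
)
```

Step 4 of the template encodes the same OR-based heuristic shown in the textbox of Section II-A1, so the LLM's reasoning is anchored to the metric thresholds rather than left open-ended.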

III-A6 Metrics used to assess experiment results

To assess the results, we relied on well-known metrics: Precision, Recall, and F1-Measure.
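These three metrics follow their standard definitions, sketched below. The example counts are chosen to be consistent with the BlueROV2 row of Table I at the 0.4 threshold (one of its two constituent patterns detected, no false positives):

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions: precision = TP/(TP+FP), recall = TP/(TP+FN),
    and F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# One of two expected patterns detected, no spurious detections:
# precision = 1, recall = 0.5, F1 ≈ 0.67.
p, r, f1 = precision_recall_f1(tp=1, fp=0, fn=1)
```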

III-B Results analysis: using SmartGSN for pattern detection

System       Mod | Threshold 0.4     | Threshold 0.6     | Threshold 0.8
                 | R     P     F1    | R     P     F1    | R     P     F1
ACAS XU      4o  | 1     1     1     | 1     1     1     | 0     0     0
             4T  | 1     1     1     | 1     1     1     | 0     0     0
BlueROV2     4o  | 0.5   1     0.67  | 0.5   1     0.67  | 0     0     0
             4T  | 0.5   1     0.67  | 0.5   1     0.67  | 0     0     0
GPCA         4o  | 1     1     1     | 0     0     0     | 0     0     0
             4T  | 1     1     1     | 0     0     0     | 0     0     0
IM Software  4o  | 1     1     1     | 1     1     1     | 0     0     0
             4T  | 1     1     1     | 1     1     1     | 0     0     0
DeepMind     4o  | 1     1     1     | 0     0     0     | 0     0     0
             4T  | 1     1     1     | 0     0     0     | 0     0     0
TABLE I: Recall (R), Precision (P), and F1-Score (F1) results

Table I reports the average recall (R), precision (P), and F1-Measure (F1) for the five systems in our dataset using two models across various thresholds. We compute each average over five runs, i.e., for K=5. The "Mod" header denotes either GPT-4o (i.e., "4o") or GPT-4 Turbo (i.e., "4T"). For each system, both models achieve perfect scores at a threshold of 0.4, with the exception of BLUEROV2, which has an average recall of 0.5 and an average F1-Measure of 0.67. At a threshold of 0.6, both ACAS XU and IM Software maintain perfect scores, whereas BLUEROV2 still yields the same results. The DeepMind and GPCA metric scores drop to zero at this threshold. At a threshold of 0.8, all systems and models achieve zero scores across all metrics, indicating that SmartGSN is unable to detect patterns within an assurance case at this threshold. Threshold values affect the pattern detection process; lower thresholds like 0.4 result in detecting more patterns, while at higher thresholds (e.g., 0.8), performance drastically declines. Hence, the optimal threshold range for effective detection of patterns within assurance cases in our dataset likely falls within the [0.4, 0.6) interval. More experiments will help confirm that observation.

IV Conclusion and future work

In this paper, we presented a novel tool called SmartGSN. The latter leverages large language models (LLMs) to support the semi-automatic management of assurance cases.

Future work will focus on refining the pattern detection approach our tool supports by using more advanced heuristics. Future work will also focus on assurance case refactoring using assurance case patterns. In addition, future work will focus on implementing a collaboration feature that will support the concurrent editing of the same assurance case when using SmartGSN. We are also planning to carry out a user study to gauge the usability of our tool in various application domains.

Acknowledgment

We thank Kimya Khakzad Shahandashti for suggesting the use of React Flow for node-based UIs implementation.

References

  • [1] R. Hawkins, T. Richardson, T. Kelly, Using process models in system assurance, in: Computer Safety, Reliability, and Security: 35th International Conference, SAFECOMP 2016, Trondheim, Norway, September 21-23, 2016, Proceedings 35, Springer, 2016, pp. 27–38.
  • [2] A. B. Belle, T. C. Lethbridge, S. Kpodjedo, O. O. Adesina, M. A. Garzón, A novel approach to measure confidence and uncertainty in assurance cases, in: Requirements Engineering Conference Workshops, IEEE, 2019, pp. 24–33.
  • [3] C. M. Holloway, Safety case notations: Alternatives for the non-graphically inclined?, in: 2008 3rd IET International Conference on System Safety, IET, 2008, pp. 1–6.
  • [4] GSN Standard Working Group, GSN Community Standard (Version 3), accessed on Sept. 15, 2024 (2021).
    URL https://scsc.uk/gsn
  • [5] M. Maksimov, S. Kokaly, M. Chechik, A survey of tool-supported assurance case assessment techniques, ACM Computing Surveys 52 (5) (2019) 1–34.
  • [6] F. U. Muram, M. A. Javed, Attest: Automating the review and update of assurance case arguments, Journal of systems architecture 134 (2023) 102781.
  • [7] Y. Wang, M. Sivakumar, A. B. Belle, O. Odu, K. K. Shahandashti, Devcase: design and implementation of a novel web-based graphical editor for safety cases complying with the gsn.
  • [8] M. Sivakumar, A. B. Belle, J. Shan, K. K. Shahandashti, Prompting gpt–4 to support automatic safety case generation, Expert Systems with Applications 255 (2024) 124653.
  • [9] A. Di Sandro, L. Murphy, T. Viger, M. Chechik, Mmint-a: A framework for model-based safety assurance, Science of Computer Programming 231 (2024) 103004.
  • [10] Z. Chen, Y. Deng, W. Du, Trusta: Reasoning about assurance cases with formal methods and large language models, arXiv preprint arXiv:2309.12941 (2023).
  • [11] C.-L. Lin, W. Shen, R. Hawkins, Support for safety case generation via model transformation, ACM SIGBED Review 14 (2) (2017) 44–52.
  • [12] C. Cârlan, S. Barner, A. Diewald, A. Tsalidis, S. Voss, Explicitcase: integrated model-based development of system and safety cases, in: SAFECOMP Workshops, Springer, 2017, pp. 52–63.
  • [13] E. Denney, G. Pai, Tool support for assurance case development, Automated Software Engineering 25 (3) (2018) 435–499.
  • [14] O. Odu, A. B. Belle, S. Wang, S. Kpodjedo, T. C. Lethbridge, H. Hemmati, Automatic instantiation of assurance cases from patterns using large language models (2024). arXiv:2410.05488.
    URL https://doi.org/10.48550/arXiv.2410.05488
  • [15] OpenAI, Openai documentation, https://platform.openai.com/ (October 2024).
  • [16] X. Hou, Y. Zhao, D. Lo, J. Grundy, H. Wang, Large language models for software engineering: A systematic literature review, TOSEM (2023).
  • [17] K. K. Shahandashti, A. B. Belle, M. M. Mohajer, O. Odu, T. C. Lethbridge, H. Hemmati, S. Wang, Using gpt-4 turbo to automatically identify defeaters in assurance cases, in: Requirements Engineering Conference Workshops, IEEE, 2024, pp. 46–56.
  • [18] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, NeurIPS 35 (2022) 24824–24837.
  • [19] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, et al., Language models are few-shot learners (2020). arXiv:2005.14165.
    URL https://arxiv.org/abs/2005.14165