MMAC-Copilot: Multi-modal Agent Collaboration Operating System Copilot
Abstract
Autonomous virtual agents are often limited to a single mode of interaction with real-world environments, restricting their versatility. To address this, we propose MMAC-Copilot, a Multi-Modal Agent Collaboration framework that draws on the collective expertise of diverse agents to enhance interaction with operating systems. The framework introduces a team collaboration chain that lets each participating agent contribute insights grounded in its specific domain knowledge, effectively reducing the hallucination associated with knowledge-domain gaps. To evaluate MMAC-Copilot, we conducted experiments on both the GAIA benchmark and our newly introduced Visual Interaction Benchmark (VIBench). VIBench focuses on non-API-interactable applications across various domains, including 3D gaming, recreation, and office scenarios. MMAC-Copilot achieved exceptional performance on GAIA, with an average improvement of 6.8% over existing leading systems, and demonstrated remarkable capability on VIBench, particularly in managing diverse methods of interaction within systems and applications. These results underscore MMAC-Copilot's potential to advance autonomous virtual agents through its approach to agent collaboration.
1 Introduction

Large Language Models (LLMs) [1, 2, 3, 12, 27, 37, 38] have demonstrated unprecedented capabilities in tackling complex tasks that require human-like reasoning, decision-making, and collaboration [11, 16, 28, 29, 45]. Large Vision Models (LVMs) [7, 21, 23, 24, 25, 54] soon followed, extending beyond traditional text-based processing by integrating a visual dimension into LLMs. LVMs have reached diverse domains, with one notable example being their pivotal role in autonomous virtual agents [44].
Autonomous virtual agents typically require a unified interaction interface and a high degree of generalization to adapt to the challenges posed by the many different applications within an operating system. Although previous works [35, 44, 51] that rely on scripts and the planning abilities of LLMs have demonstrated success in some applications, two challenges remain. First, acquiring real-world environmental information through a single text modality restricts the range of supported applications. Second, hallucinations caused by the knowledge-domain gap can lead to incorrect planning by virtual agents, with potentially severe consequences.
To address the first challenge, we introduce MMAC-Copilot, a framework designed to promote information exchange and collaboration among multi-modal agents. Specifically, we create five key agents: Planner, an expert in planning, strategic thinking, and resource allocation, ensuring that the collaborative efforts of all agents are directed toward the target; Librarian, proficient in question-answering tasks and in using APIs to retrieve information; Programmer, specializing in coding tasks and particularly skilled at executing Bash commands and Python programs; Viewer, focused on understanding image content and executing click interactions; and Video Analyst, dedicated to analyzing video content and extracting key events. In addition, there is one auxiliary agent, the Mentor, which supervises system operations and agent interactions to ensure smooth functionality and integrity. When receiving a user query, Planner formulates an initial plan based on the query's requirements. It then engages with the other agents through a team collaboration chain, allowing a dynamic exchange of information and insights. As the plan progresses, it is continuously adapted based on the insights of each specialized agent.
To resolve the second dilemma, we introduce the team collaboration chain, which allows participating agents to adapt the initial plan based on their domain expertise. Consider the scenario depicted in Figure 3, where the user request is "Open the Discord application and send Hi to Dylan Li". Planner first delineates five subtasks: "Open Discord", "Navigate to friends", "Search for Dylan Li", "Open chat with Dylan Li", and "Send a message", all of which are assigned to Programmer. Upon completion of the first subtask, Mentor observes that Discord has been successfully launched and notes that "Dylan is already present in the application." Planner, upon receiving this feedback, revises the plan to "Open chat with Dylan" and "Send the message", and reassigns it to Programmer. In this instance, Planner overlooks the fact that Programmer would need to integrate Discord's proprietary API to accomplish these tasks; adapting to the proprietary interfaces of thousands of applications to achieve universality is clearly impractical. Furthermore, unforeseen events such as pop-up notifications or advertisements may disrupt task execution. Programmer fails to execute and returns the message "Sorry, I don't know how to do this." Upon receiving this message, Planner reallocates the task to Viewer. Viewer decomposes the coarse-grained task into finer-grained actions: "click on Dylan", "type Hi", and "return the message", which are then successfully executed. Each subtask completion is reviewed by Mentor, and upon approval the process advances to the next round. The process concludes after all rounds are completed.
We assess MMAC-Copilot on two benchmarks: GAIA [26] and our newly introduced Visual Interaction Benchmark (VIBench). GAIA is designed to test the abilities of general AI assistants. MMAC-Copilot scored 45.16, 20.75, and 6.12 on Level 1, Level 2, and Level 3 tasks, respectively, outperforming previous methods by 6.8% on average. VIBench, developed specifically for this study, tests the framework's performance on non-API-interactable applications across diverse domains such as 3D gaming, recreation, and office environments. The results exhibit outstanding adaptability: MMAC-Copilot handles interfaces with a high degree of accuracy, enhancing interaction within digital environments.
In summary, our main contributions are as follows:
• We propose MMAC-Copilot, a collaborative framework of specialized agents designed to navigate and perform general operating system tasks. MMAC-Copilot enhances the interaction capabilities of autonomous virtual agents with operating systems by leveraging multiple modalities when processing tasks. The framework consists of distinct modal agents: "Librarian" for question answering and information retrieval, "Programmer" for executing code and Bash commands, "Viewer" for interpreting visual information, and "Video Analyst" for analyzing video content. By leveraging the collaboration of these modal agents, MMAC-Copilot can handle a diverse range of tasks with enhanced accuracy.
• To mitigate incorrect planning caused by LLM hallucination, we introduce into MMAC-Copilot the team collaboration chain, which allows participating agents to adjust the initial plan based on their domain expertise. It has demonstrated adaptability across a variety of cases.
• We develop the Visual Interaction Benchmark (VIBench), specifically designed to assess a system's performance on non-API-interactable applications across diverse domains such as 3D gaming, recreation, and office scenarios. VIBench complements existing benchmarks by focusing on agents' ability to interact with dynamic and visually complex interfaces, providing a stricter test of their practical interaction capabilities.
2 Related work
In this section, we review related work, focusing on LLM-based agents and autonomous virtual agents.

2.1 LLM-based agents
The emerging field of LLM-based agents, highlighted by projects such as AutoGPT [33] and TaskWeaver [30], marks a significant milestone in LLM research. These agents adopt a human-like approach to problem solving by decomposing complex tasks into manageable subtasks. This capability demonstrates a profound advance in planning, observation, and responsive action, closely mirroring human cognitive processes [38, 40, 45].
The LangChain Agent [36] further extends these capabilities by enabling LLMs to sequence actions, thus expanding the potential for automation across various domains such as customer service, logistics, and content creation. This strategic utilization of LLMs underscores a move towards more sophisticated, context-aware AI systems.
Developments in multi-agent systems, such as AutoGen [43], MetaGPT [17], and AutoAgents [8], introduce a novel dimension to the field by enabling collaboration and competition among agents. This architecture leverages the unique strengths of individual agents, enhancing system-wide efficiency and adaptability in complex scenarios. Such collaborative frameworks are proving essential for large-scale applications in sectors like healthcare [10, 20], finance [14], and urban planning [4].
The versatility and efficacy of these LLM-based agents are highlighted in a range of application domains: In robotics, agents assist in both navigational tasks and complex manipulations, contributing to more autonomous and adaptive robotic systems [12]. Web manipulation and data extraction have seen significant enhancements, enabling better data-driven decision-making and user interaction models [49, 55]. The gaming industry benefits from more dynamic AI opponents and game management systems, which improve user engagement and game design [13, 35]. Automated data analysis tools powered by LLM agents offer sophisticated insights and predictions, facilitating more effective data handling in business and research [39, 53].
2.2 Autonomous virtual agents
The application of LVM systems for interfacing with Graphical User Interfaces (GUIs) in digital applications has been a key area of research. Wonderland [46] harnesses the capabilities of GPT-4V, as delineated by Yang et al. [47], to navigate mobile application interfaces by analyzing screenshots. Concurrently, AppAgent [48] utilizes GPT-4V to simulate the actions of mobile users, facilitating the execution of tasks in mobile applications through the analysis of device snapshots. MobileAgent incorporates Optical Character Recognition (OCR) technology to enhance the capabilities of GPT-4V within a similar mobile agent architecture; this enhancement allows MobileAgent to attain task completion rates on par with human operators. In contrast, CogAgent [18] constructs an LVM specifically engineered for interpreting and navigating GUIs. UFO [51] uses the Windows inspection tool to acquire component and control information from system applications, providing it as context to GPT-4V to fully leverage its planning capabilities.
3 MMAC-Copilot Framework
The workflow of MMAC-Copilot is designed to leverage the specialized capabilities of each agent in a sequential and collaborative manner; an algorithmic overview is given in Algorithm 1. In Section 3.1, we discuss the specialization of agents within the framework, each equipped with distinct roles and tools. In Section 3.2, we describe the communication protocols that bolster accuracy in agent collaboration. In Section 3.3, we introduce the team collaboration chain, a dynamic mechanism that uses continuous feedback and iterative improvement to adapt the framework's strategies in real time.
3.1 Specialization of agents
Empirical studies suggest that diversity within human groups contributes a variety of perspectives, which in turn enhances group performance across various tasks [6, 41, 42]. Contemporary research likewise indicates that assigning specific roles to autonomous agents can improve their effectiveness [22, 29, 31]. Thus, in our framework, we define six agents: Planner, Librarian, Programmer, Viewer, Video Analyst, and Mentor, as shown in Figure 2. The following are detailed introductions to each agent:
• Planner: Employs GPT-4 to strategically manage and allocate tasks among agents, optimizing workflow efficiency. In contrast to other frameworks, our Planner is not required to decompose tasks into indivisible atomic subtasks. We recognize that a Planner operating solely in the textual modality may not effectively integrate visual information such as UI elements, potentially leading to hallucination. Therefore, when Planner identifies a subtask that necessitates visual information, it provides only a coarse outline of that subtask. This allows each agent to leverage its own expertise, enhancing the adaptability of the overall framework.
• Librarian: Utilizes GPT-4 and APIs for information retrieval, enabling it to answer queries and provide foundational knowledge. It can rapidly process large amounts of information, supporting user requests and providing related context to other agents.
• Programmer: Responsible for writing and executing scripts, including Python programs and Bash commands, to interact with software environments directly. The Programmer's role is crucial for implementing precise operating system interactions that adapt to diverse requirements. Key to the Programmer's efficacy is its ability to refine existing code through a mechanism designed to diagnose and resolve issues effectively. When code errors are identified, Programmer analyzes the source of failure, applies the necessary modifications, and preserves the original code structure as much as possible. This refinement process includes a detailed explanation of the identified issues, improving the clarity and functionality of the code without introducing new exceptions and while improving the handling of existing ones. Furthermore, Programmer can generate accurate Python invocation statements by understanding the context of the provided Python class, analyzing the task description, and interpreting method parameter details, producing a syntactically correct Python statement that calls the methods with appropriate arguments. Lastly, Programmer's capabilities are complemented by a critical evaluation mechanism that assesses the refined and executed code against the user's specified task requirements, evaluating the code's functionality, its alignment with the task, and the feedback provided by the environment, culminating in a comprehensive reasoning process. The interactions within the Programmer's mechanism can be represented as follows (a minimal sketch of this loop is given after this list):
$$C' = \mathcal{R}(C, E), \qquad \mathrm{Out} = \mathcal{X}(C', \mathrm{Env}), \qquad (\mathrm{Judge}, \mathrm{Score}) = \mathcal{V}(C', \mathrm{Out}, T)$$
where:
- $C$ is the initial code.
- $E$ represents the error analysis and feedback.
- $\mathcal{R}$ is the refinement mechanism.
- $C'$ is the modified code.
- $\mathrm{Env}$ includes environmental variables such as the current working directory or system configurations.
- $\mathcal{X}$ represents the execution function.
- $\mathrm{Out}$ includes the runtime output and any runtime errors.
- $T$ is the original task requirements.
- $\mathcal{V}$ represents the evaluation mechanism.
- $\mathrm{Judge}$ is a boolean indicating whether the task is accomplished.
- $\mathrm{Score}$ is an integer rating the code's generality and robustness; if the score is above the specified threshold, the code for this run is retained.
• Viewer: Integrates GPT-4V for interpreting complex visual information from screenshots. Coupled with the SeeClick [9] model, Viewer translates visual analysis into actionable commands, such as clicking on identified UI elements, bridging the gap between visual understanding and operating system interaction.
• Video Analyst: Uses Gemini Vision to process and analyze video content, extracting critical visual data. It is significant for subtasks that depend on video-based information, enhancing the framework's ability to generalize across modalities. Video Analyst interprets queries, retrieves relevant details from video streams, and delivers accurate responses. These capabilities are crucial for providing contextual insights and assisting other agents in understanding the current state. In summary, Video Analyst is adept at handling complex scenarios where video data is key.
• Mentor: Powered by GPT-4V, Mentor provides strategic oversight and troubleshooting support after execution. Specifically, Mentor first analyzes screenshots of the current application window to provide a detailed description of the visible state. It then determines whether the intended changes or commands have affected the application as expected. Additionally, Mentor generates insights derived from a deep analysis of the outcomes, structured as feedback to be incorporated into the next round.
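As a concrete illustration of the Programmer's refine-execute-evaluate loop described above, the following Python sketch shows one possible realization. It is only a sketch under assumptions: `llm_refine`, `llm_evaluate`, and `SCORE_THRESHOLD` are illustrative placeholders standing in for the GPT-4 calls and the retention threshold, not the released implementation.

```python
import subprocess
import tempfile

SCORE_THRESHOLD = 7  # hypothetical retention threshold


def llm_refine(code: str, feedback: str) -> str:
    """Placeholder for the GPT-4 refinement call R(C, E); returns the code unchanged here."""
    return code


def llm_evaluate(code: str, out: str, task: str) -> tuple[bool, int]:
    """Placeholder for the GPT-4 evaluation call V(C', Out, T) -> (Judge, Score)."""
    return ("error" not in out.lower(), 5)


def execute(code: str, env: dict) -> str:
    """Execution function X: run the candidate code and capture output and errors."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        ["python", path], capture_output=True, text=True,
        cwd=env.get("cwd", "."), timeout=60,
    )
    return result.stdout + result.stderr


def programmer_round(code: str, feedback: str, env: dict, task: str):
    """One refine-execute-evaluate round of the Programmer mechanism."""
    refined = llm_refine(code, feedback)             # C' = R(C, E)
    out = execute(refined, env)                      # Out = X(C', Env)
    judge, score = llm_evaluate(refined, out, task)  # (Judge, Score) = V(C', Out, T)
    keep = judge and score >= SCORE_THRESHOLD        # retain code only above threshold
    return refined, out, judge, keep
```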
3.2 Communication Protocols
Following established multi-agent frameworks [17, 22, 28, 52, 56], MMAC-Copilot adopts structured communication interfaces, avoiding the pitfalls of unconstrained natural language communication. Each agent communicates through a schema defined for its role.
Specifically, we define the output format of the agents as JSON and stipulate the keys each agent should output according to its role. Each agent receives only the content of the keys it requires. For example, the "observation" key is output exclusively by Mentor and Viewer, as it is specific to their role of reporting visual observations of screenshots. Apart from these two agents, only the Planner receives the content of the "observation" key, both when formulating initial plans and when revising them. These protocols are designed to prevent common issues such as data ambiguity in complex multi-agent interactions. By standardizing the communication format and defining clear rules for message acknowledgment, MMAC-Copilot minimizes the risk of miscommunication and ensures that tasks are executed reliably. Additionally, the use of JSON allows easy integration of new agents as the system scales, supporting the framework's adaptability and expansion.
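To make this protocol concrete, the following Python sketch shows one possible realization of role-keyed JSON messages and key routing. Only the "observation" key and its routing to the Planner come from the description above; the remaining key names and the `route_message` helper are assumptions made for illustration.

```python
import json

# Keys each agent is allowed to emit. Only "observation" (Mentor, Viewer) is
# specified in the text; the other key names are illustrative assumptions.
OUTPUT_KEYS = {
    "Planner": ["plan", "assignment"],
    "Librarian": ["answer", "sources"],
    "Programmer": ["code", "execution_result"],
    "Viewer": ["observation", "action"],
    "Video Analyst": ["events", "summary"],
    "Mentor": ["observation", "judgement", "feedback"],
}

# Keys each agent is allowed to receive; e.g. "observation" is routed only to Planner.
INPUT_KEYS = {
    "Planner": ["observation", "feedback", "execution_result", "answer"],
    "Programmer": ["assignment", "feedback"],
    "Viewer": ["assignment"],
}


def route_message(sender: str, raw_json: str, receiver: str) -> dict:
    """Validate the sender's JSON output against its schema and strip any keys
    the receiver is not entitled to see."""
    message = json.loads(raw_json)
    undeclared = set(message) - set(OUTPUT_KEYS[sender])
    if undeclared:
        raise ValueError(f"{sender} emitted undeclared keys: {undeclared}")
    allowed = set(INPUT_KEYS.get(receiver, []))
    return {key: value for key, value in message.items() if key in allowed}


# Example: Mentor's observation reaches Planner, but would be dropped for Programmer.
planner_view = route_message(
    "Mentor",
    '{"observation": "Discord is open", "judgement": true, "feedback": "proceed"}',
    "Planner",
)
```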

3.3 Team Collaboration Chain
Team collaboration chains leverage the strengths of dispersed management structures to facilitate enhanced cooperation among team members across different domains. By distributing responsibilities and decision-making power, these chains enable a more agile and responsive organizational environment. Such chains have been successfully implemented across multiple domains [5, 19, 32, 50].
In the MMAC-Copilot framework, the team collaboration chain is essential for managing difficult tasks. This section delves into the collaborative mechanisms that enable dynamic planning and the iterative refinement of tasks, ensuring effective execution and continuous improvement. A concrete demonstration of the collaboration chain in operation is shown in Figure 3.
3.3.1 Dynamic Planning and Subtask Refinement
During the planning phase, Planner uses input from the Viewer to formulate an initial plan. Initial plans often outline the subtasks in a coarse manner, reflecting the dynamic nature of visual environments and the inherent limitations of planning based on textual information alone. This initial planning stage sets the foundation for task execution but requires further refinement to address the specifics of the visual state.
As the system transitions to the execution phase, agents equipped with specialized modalities, such as the Viewer and Video Analyst, assume pivotal roles. These agents reassess the current visual context and refine the coarsely defined subtasks into detailed, atomic subtasks. This step is crucial for adapting to real-time changes in the visual interface, ensuring that each task is executed with precision based on the latest visual information. This process showcases the strength of team collaboration, where different agents contribute uniquely to the refinement and execution of tasks.
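As an illustration of this refinement step, the sketch below shows a coarse Planner-level subtask being grounded into atomic GUI actions, mirroring the Discord example from the Introduction. It is a hedged example: the action vocabulary and the `refine_subtask` helper are assumptions, not the framework's exact schema.

```python
from dataclasses import dataclass


@dataclass
class AtomicAction:
    kind: str    # e.g. "click", "type", or "report" (illustrative action vocabulary)
    target: str  # UI element to act on, or text payload to type


def refine_subtask(coarse_subtask: str, screenshot) -> list[AtomicAction]:
    """Viewer-style refinement: ground a coarse subtask in the current visual
    context and emit atomic actions. The real system queries GPT-4V + SeeClick;
    here a canned response stands in for that call."""
    if coarse_subtask == "Send Hi to Dylan Li":
        return [
            AtomicAction("click", "Dylan Li"),
            AtomicAction("type", "Hi"),
            AtomicAction("report", "message sent"),
        ]
    return [AtomicAction("report", f"cannot ground subtask: {coarse_subtask}")]
```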
3.3.2 Continuous Feedback and Iterative Improvement
Mentor plays a fundamental role in the team collaboration chain by establishing a continuous feedback loop with the Planner. After the execution of subtasks, Mentor evaluates the current state of the system through detailed analysis of screenshots and the visible state of the application. This evaluation determines whether the intended changes or commands have successfully impacted the application as expected.
If discrepancies or incomplete tasks are identified, Mentor generates insights and structured feedback based on a deep analysis of the outcomes. This feedback is crucial for revising the current plans and is incorporated into the next round of planning. This feedback loop between the Mentor and the Planner not only ensures high-quality task completion but also facilitates adaptive learning, allowing the MMAC-Copilot system to evolve its strategies based on real-world interactions and outcomes.
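The overall loop can be summarized in the following sketch, which parallels Algorithm 1 only at a high level; the collaborator callables (`make_plan`, `assign`, `execute`, `review`, `revise`) are injected placeholders for the Planner, the executing agents, and the Mentor, and the 30-round cap mirrors the limit used in our VIBench evaluation.

```python
def collaboration_chain(user_request, make_plan, assign, execute, review, revise,
                        max_rounds: int = 30) -> bool:
    """Drive one user request through the team collaboration chain:
    plan, execute a subtask, let Mentor review, revise the plan, and repeat."""
    plan = make_plan(user_request)
    for _ in range(max_rounds):
        subtask, agent = assign(plan)                  # Planner picks the next agent
        result = execute(agent, subtask)               # Programmer / Viewer / Video Analyst
        observation, done, feedback = review(result)   # Mentor: screenshot-based check
        if done:
            return True
        plan = revise(plan, observation, feedback)     # Planner incorporates the feedback
    return False  # unfinished after the round limit counts as a failure
```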
4 Experiments

4.1 Benchmark and setting
We evaluate the performance of MMAC-Copilot using two distinct benchmarks designed to challenge and assess the capabilities of AI assistants in diverse interaction environments. The first is the General AI Assistant Benchmark (GAIA) [26], which includes 466 rigorous question-answering tasks. GAIA primarily tests AI assistants in a question-answering (QA) format, focusing on their ability to utilize API calls for information retrieval and task execution. However, GAIA restricts the evaluation to textual interactions and API-based actions, which do not fully capture the broader range of GUI interactions that are prevalent in real-world applications.
To address these limitations and to evaluate the agents’ capabilities in a more complex and visually oriented environment, we introduced the Visual Interaction Benchmark (VIBench). VIBench comprises cases from three categories of applications: 3D gaming, recreation, and office environments. Unlike GAIA, VIBench emphasizes the necessity for agents to interact with applications through graphical user interfaces (GUIs) that cannot typically be controlled via standard APIs such as Win32 APIs.
Mathematical Representation of VIBench
The core interaction process in VIBench can be described as follows. Let $s_0$ represent the initial system state, which includes the current GUI appearance and available actions. Given a user request $q$, the system must transition through a series of states $s_1, s_2, \dots, s_n$ via actions $a_1, a_2, \dots, a_n$, ultimately reaching a final state $s_n$ that satisfies $q$. This process is represented by the sequence:
$$s_0 \xrightarrow{a_1} s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_3} \cdots \xrightarrow{a_n} s_n$$
The primary goal of VIBench is not merely to evaluate the agent's ability to execute each step efficiently but to assess whether the final state $s_n$ satisfies the initial user request $q$. This objective emphasizes the effectiveness of the final outcome over the process, reflecting the practical focus of real-world applications:
$$\mathrm{Success}(s_n, q) = \begin{cases} 1 & \text{if } s_n \text{ satisfies } q \\ 0 & \text{otherwise} \end{cases}$$
VIBench thus complements GAIA by adding a layer of complexity and realism, challenging agents not only to respond correctly in a QA format but also to navigate and interact effectively within software environments that mirror everyday user challenges. Examples are shown in Figure 4.
Setting
To ensure a fair comparison with previous work, we employed the same versions of the ChatGPT API used in prior studies. Specifically, we utilized gpt-4-turbo-1106 and gpt-4-vision-preview. For the Video Analyst, we used gemini-1.5-pro-vision.
4.2 Evaluation Method
For GAIA, each question calls for a specific type of response: a string, a number, or a float. Evaluation is conducted through an exact match between the model's answer and the ground truth.
For VIBench, we adopt human evaluation following the SIMA evaluation method [34], recognizing that in many cases task success cannot be inferred automatically. We engage human judges with expert-level familiarity with the specific applications, defined as at least 20 hours of application usage. These experts assess recorded videos to determine whether the user requests have been executed accurately. During the assessment, the experts are instructed to disregard any irrelevant actions performed by the agents; their judgments are based solely on whether the stated objectives of the tasks are achieved, without considering intermediate steps or unrelated actions. To minimize subjectivity, we impose a maximum of 30 execution rounds per task; if a task is not completed within 30 rounds, it is judged a failure.
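The scoring logic reduces to the following sketch. The string normalization in `gaia_exact_match` is an assumption of this sketch (the official GAIA scorer defines the exact matching rules), and VIBench success is ultimately a human judgment, represented here as a boolean input.

```python
MAX_ROUNDS = 30  # execution-round cap used in our VIBench protocol


def gaia_exact_match(prediction: str, ground_truth: str) -> bool:
    """GAIA: exact match between the model's answer and the ground truth."""
    return prediction.strip().lower() == ground_truth.strip().lower()


def vibench_success(rounds_used: int, judged_success: bool) -> bool:
    """VIBench: success only if the expert judge confirms the objective was met
    and the agent finished within the round limit."""
    return judged_success and rounds_used <= MAX_ROUNDS
```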
4.3 Baseline
For GAIA, we report the performance of GPT-3.5 and GPT-4 with and without manually selected plugins, along with results from AutoGPT, Chamomile, and FRIDAY, all of which use GPT-4 as their backend. The selection of plugins for GPT-4 is conducted manually, with choices made per user query. For comparative purposes, we also include human performance sourced from GAIA [26].
For VIBench, we report the performance of UFO and FRIDAY. Since FRIDAY does not natively support the Windows platform, we modified FRIDAY's code to adapt it to Windows. We also set UFO's vision model to the same one used in our framework.
4.4 Results
The results of our experiments provide a comprehensive assessment of MMAC-Copilot's capabilities in handling complex, multi-modal tasks within digital environments. In this section, we present and discuss the findings from our evaluations on both the GAIA benchmark and our newly developed Visual Interaction Benchmark (VIBench).
Table 1: Comparison on the GAIA benchmark.

| Model | Level 1 | Level 2 | Level 3 | Average |
|---|---|---|---|---|
| Human* | 93.90 | 91.80 | 87.30 | 91.00 |
| GPT-3.5 [1] | 4.30 | 1.89 | 2.08 | 2.67 |
| GPT-4 [1] | 9.68 | 1.89 | 0.00 | 4.00 |
| AutoGPT4 [33] | 15.05 | 0.63 | 0.00 | 5.00 |
| GPT-4 Turbo [1] | 9.68 | 6.92 | 0.00 | 6.67 |
| GPT-4 Plugins [1] | 30.30 | 9.70 | 0.00 | 14.60 |
| Chamomile [15] | 16.13 | 17.61 | 2.08 | 14.67 |
| FRIDAY [44] | 40.86 | 20.13 | 6.12 | 24.25 |
| Ours | 45.16 | 20.75 | 6.12 | 25.91 (+6.8%) |
MMAC-Copilot shows significant performance improvements across all levels of GAIA. As illustrated in Table 1, MMAC-Copilot achieved the highest scores on Level 1 (45.16) and Level 2 (20.75) tasks and matched the highest score on Level 3 tasks (6.12). On average, it outperformed the closest competing system, FRIDAY, by 6.8% in overall task performance. Given that GAIA tasks span a range of AI-assistant capabilities, from basic information retrieval to interactive tasks requiring multiple steps and the integration of external data sources, these results indicate broad competence across task types.
Table 2: Success rates (%) on VIBench.

| Model | 3D Gaming | Recreation | Office | Average |
|---|---|---|---|---|
| UFO [51] | 0.00 | 28.57 | 15.38 | 14.65 |
| FRIDAY [44] | 31.58 | 42.86 | 30.77 | 35.07 |
| Ours | 63.16 | 69.23 | 78.57 | 70.32 (+35.25) |
On VIBench, as shown in Table 2, MMAC-Copilot outperformed the other systems across all application categories (3D gaming, recreation, and office scenarios). Our framework achieved success rates of 63.16% in 3D gaming, 69.23% in recreation, and 78.57% in office applications, with an overall average of 70.32%. This indicates MMAC-Copilot's robust capability in navigating and interacting within complex GUI environments where direct API interaction is not possible. Its superior performance can be attributed to its effective use of image and video analysis, enabling precise navigation and interaction through the GUIs of different software.
5 Conclusion
This paper presents MMAC-Copilot, a novel framework designed to enhance the interaction capabilities of autonomous virtual agents with operating systems. Our approach capitalizes on a team collaboration chain that harnesses the collective expertise of each agent, including "Planner", "Librarian", "Programmer", "Viewer", "Video Analyst", and "Mentor". Each agent brings a unique set of skills and perspectives, enabling adaptive interaction with complex digital environments. MMAC-Copilot also mitigates hallucination among the agents: by facilitating dynamic collaboration, the framework allows continuous adaptation and refinement of plans based on environmental changes. The performance of MMAC-Copilot has been tested on the GAIA benchmark and our newly introduced VIBench, showing significant improvements over existing state-of-the-art systems.
6 Discussion
We acknowledge that while MMAC-Copilot achieves a degree of generality through the integration of GUI models for visual detection, it remains limited by the GUI model's ability to understand UI components. Consequently, MMAC-Copilot struggles with interpreting complex UI interfaces and performing multi-step operations that depend on such complexity. Additionally, MMAC-Copilot's capabilities in real-time 3D gaming environments reveal gaps in spatial understanding, which is crucial for tasks such as navigation, aiming, and shooting in 3D environments. The framework's longer inference time also impairs its ability to perform tasks with high temporal sensitivity.
To address these limitations, we suggest integrating external knowledge databases via online search engines, providing additional context for decision-making. This approach would help not only in understanding complex UI layouts but also in recognizing user intentions and environmental cues in real-time games.
References
- Achiam et al. [2023] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Anil et al. [2023] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, E. Chu, J. H. Clark, L. E. Shafey, Y. Huang, K. Meier-Hellstern, G. Mishra, E. Moreira, M. Omernick, K. Robinson, S. Ruder, Y. Tay, K. Xiao, Y. Xu, Y. Zhang, G. H. Abrego, J. Ahn, J. Austin, P. Barham, J. Botha, J. Bradbury, S. Brahma, K. Brooks, M. Catasta, Y. Cheng, C. Cherry, C. A. Choquette-Choo, A. Chowdhery, C. Crepy, S. Dave, M. Dehghani, S. Dev, J. Devlin, M. Díaz, N. Du, E. Dyer, V. Feinberg, F. Feng, V. Fienber, M. Freitag, X. Garcia, S. Gehrmann, L. Gonzalez, G. Gur-Ari, S. Hand, H. Hashemi, L. Hou, J. Howland, A. Hu, J. Hui, J. Hurwitz, M. Isard, A. Ittycheriah, M. Jagielski, W. Jia, K. Kenealy, M. Krikun, S. Kudugunta, C. Lan, K. Lee, B. Lee, E. Li, M. Li, W. Li, Y. Li, J. Li, H. Lim, H. Lin, Z. Liu, F. Liu, M. Maggioni, A. Mahendru, J. Maynez, V. Misra, M. Moussalem, Z. Nado, J. Nham, E. Ni, A. Nystrom, A. Parrish, M. Pellat, M. Polacek, A. Polozov, R. Pope, S. Qiao, E. Reif, B. Richter, P. Riley, A. C. Ros, A. Roy, B. Saeta, R. Samuel, R. Shelby, A. Slone, D. Smilkov, D. R. So, D. Sohn, S. Tokumine, D. Valter, V. Vasudevan, K. Vodrahalli, X. Wang, P. Wang, Z. Wang, T. Wang, J. Wieting, Y. Wu, K. Xu, Y. Xu, L. Xue, P. Yin, J. Yu, Q. Zhang, S. Zheng, C. Zheng, W. Zhou, D. Zhou, S. Petrov, and Y. Wu. Palm 2 technical report, 2023.
- Anthropic [2023] Anthropic. Model card and evaluations for claude models, 2023. URL https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf.
- Bae et al. [2022] K.-H. Bae, N. Mustafee, S. Lazarova-Molnar, and L. Zheng. Hybrid modeling of collaborative freight transportation planning using agent-based simulation, auction-based mechanisms, and optimization. Simulation, 98(9):753–771, 2022.
- Bang and Dalsgaard [2005] J. Bang and C. Dalsgaard. Samarbejde-kooperation eller kollaboration? Tidsskrift for Universiteternes Efter-og Videreuddannelse (UNEV), 3(5), 2005.
- Bransford and Stein [1993] J. D. Bransford and B. S. Stein. The ideal problem solver. 1993.
- Cai et al. [2023] R. Cai, Z. Song, D. Guan, Z. Chen, X. Luo, C. Yi, and A. Kot. Benchlmm: Benchmarking cross-style visual capability of large multimodal models. arXiv preprint arXiv:2312.02896, 2023.
- Chen et al. [2023] G. Chen, S. Dong, Y. Shu, G. Zhang, J. Sesay, B. F. Karlsson, J. Fu, and Y. Shi. Autoagents: A framework for automatic agent generation. arXiv preprint arXiv:2309.17288, 2023.
- Cheng et al. [2024] K. Cheng, Q. Sun, Y. Chu, F. Xu, Y. Li, J. Zhang, and Z. Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. arXiv preprint arXiv:2401.10935, 2024.
- Dhatterwal et al. [2023] J. S. Dhatterwal, M. S. Naruka, and K. S. Kaswan. Multi-agent system based medical diagnosis using particle swarm optimization in healthcare. In 2023 International Conference on Artificial Intelligence and Smart Communication (AISC), pages 889–893. IEEE, 2023.
- Ding et al. [2023] R. Ding, C. Zhang, L. Wang, Y. Xu, M. Ma, W. Zhang, S. Qin, S. Rajmohan, Q. Lin, and D. Zhang. Everything of thoughts: Defying the law of penrose triangle for thought generation. arXiv preprint arXiv:2311.04254, 2023.
- Driess et al. [2023] D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y. Chebotar, P. Sermanet, D. Duckworth, S. Levine, V. Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence. Palm-e: An embodied multimodal language model, 2023.
- Fan et al. [2022] L. Fan, G. Wang, Y. Jiang, A. Mandlekar, Y. Yang, H. Zhu, A. Tang, D.-A. Huang, Y. Zhu, and A. Anandkumar. Minedojo: Building open-ended embodied agents with internet-scale knowledge. Advances in Neural Information Processing Systems, 35:18343–18362, 2022.
- Gautier et al. [2023] A. Gautier, M. Rigter, B. Lacerda, N. Hawes, and M. Wooldridge. Risk-constrained planning for multi-agent systems with shared resources. 2023.
- Guillot [2023] P. Guillot. Create audio plugins with pure data patches. https://github.com/pierreguillot/Camomile, 2023.
- Hao et al. [2023] S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, and Z. Hu. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992, 2023.
- Hong et al. [2023a] S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023a.
- Hong et al. [2023b] W. Hong, W. Wang, Q. Lv, J. Xu, W. Yu, J. Ji, Y. Wang, Z. Wang, Y. Dong, M. Ding, et al. Cogagent: A visual language model for gui agents. arXiv preprint arXiv:2312.08914, 2023b.
- Huh et al. [2019] S. Huh, S. Muralidharan, H. Ko, and B. Yoo. Xr collaboration architecture based on decentralized web. In Proceedings of the 24th International Conference on 3D Web Technology, pages 1–9, 2019.
- Jones et al. [2022] L. Jones, K. Armit, A. Haynes, and P. Lees. Role of medical leaders in integrated care systems: what can be learnt from previous research? BMJ Leader, 7:133 – 136, 2022. 10.1136/leader-2022-000655.
- Li et al. [2024] C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, and J. Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36, 2024.
- Li et al. [2023] G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. 2023.
- Liu et al. [2023] H. Liu, C. Li, Y. Li, and Y. J. Lee. Improved baselines with visual instruction tuning, 2023.
- Liu et al. [2024] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
- Luo et al. [2023] Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin, and D. Jiang. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568, 2023.
- Mialon et al. [2023] G. Mialon, C. Fourrier, C. Swift, T. Wolf, Y. LeCun, and T. Scialom. Gaia: a benchmark for general ai assistants, 2023.
- Ouyang et al. [2022] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
- Park et al. [2023] J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023.
- Qian et al. [2023] C. Qian, X. Cong, C. Yang, W. Chen, Y. Su, J. Xu, Z. Liu, and M. Sun. Communicative agents for software development. arXiv preprint arXiv:2307.07924, 2023.
- Qiao et al. [2023] B. Qiao, L. Li, X. Zhang, S. He, Y. Kang, C. Zhang, F. Yang, H. Dong, J. Zhang, L. Wang, et al. Taskweaver: A code-first agent framework. arXiv preprint arXiv:2311.17541, 2023.
- Salewski et al. [2024] L. Salewski, S. Alaniz, I. Rio-Torto, E. Schulz, and Z. Akata. In-context impersonation reveals large language models’ strengths and biases. Advances in Neural Information Processing Systems, 36, 2024.
- Selvanathan and Poravi [2018] N. Selvanathan and G. Poravi. Comparative study on decentralized cloud collaboration (dcc). In 2018 3rd International Conference for Convergence in Technology (I2CT), pages 1–6. IEEE, 2018.
- Significant-Gravitas [2023] Significant-Gravitas. Autogpt: build & use ai agents. https://github.com/Significant-Gravitas/AutoGPT, 2023.
- SIMA Team [2024] SIMA Team. Scaling instructable agents across many simulated worlds. 2024.
- Tan et al. [2024] W. Tan, Z. Ding, W. Zhang, B. Li, B. Zhou, J. Yue, H. Xia, J. Jiang, L. Zheng, X. Xu, et al. Towards general computer control: A multimodal agent for red dead redemption ii as a case study. arXiv preprint arXiv:2403.03186, 2024.
- Topsakal and Akinci [2023] O. Topsakal and T. C. Akinci. Creating large language model applications utilizing langchain: A primer on developing llm apps fast. In International Conference on Applied Engineering and Natural Sciences, volume 1, pages 1050–1056, 2023.
- Touvron et al. [2023a] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023a.
- Touvron et al. [2023b] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Wang et al. [2023] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
- Wang et al. [2024] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):1–26, 2024.
- Williams and O’Reilly III [1998] K. Y. Williams and C. A. O’Reilly III. Demography and diversity in organizations: A review of 40 years of research. Research in organizational behavior, 20:77–140, 1998.
- Woolley et al. [2015] A. W. Woolley, I. Aggarwal, and T. W. Malone. Collective intelligence and group performance. Current Directions in Psychological Science, 24(6):420–424, 2015.
- Wu et al. [2023] Q. Wu, G. Bansal, J. Zhang, Y. Wu, S. Zhang, E. Zhu, B. Li, L. Jiang, X. Zhang, and C. Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023.
- Wu et al. [2024] Z. Wu, C. Han, Z. Ding, Z. Weng, Z. Liu, S. Yao, T. Yu, and L. Kong. Os-copilot: Towards generalist computer agents with self-improvement, 2024.
- Xi et al. [2023] Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou, et al. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864, 2023.
- Yan et al. [2023] A. Yan, Z. Yang, W. Zhu, K. Lin, L. Li, J. Wang, J. Yang, Y. Zhong, J. McAuley, J. Gao, et al. Gpt-4v in wonderland: Large multimodal models for zero-shot smartphone gui navigation. arXiv preprint arXiv:2311.07562, 2023.
- Yang et al. [2023a] Z. Yang, L. Li, K. Lin, J. Wang, C.-C. Lin, Z. Liu, and L. Wang. The dawn of lmms: Preliminary explorations with gpt-4v (ision). arXiv preprint arXiv:2309.17421, 9(1):1, 2023a.
- Yang et al. [2023b] Z. Yang, J. Liu, Y. Han, X. Chen, Z. Huang, B. Fu, and G. Yu. Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771, 2023b.
- Yao et al. [2022] S. Yao, H. Chen, J. Yang, and K. Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022.
- Yatsuka et al. [2020] T. Yatsuka, A. Ishigaki, S. M. Gupta, Y. Kinoshita, T. Yamada, and M. Inoue. Collaboration strategy for a decentralized supply chain using linear physical programming. International Journal of Automation Technology, 14(5):723–733, 2020.
- Zhang et al. [2024] C. Zhang, L. Li, S. He, X. Zhang, B. Qiao, S. Qin, M. Ma, Y. Kang, Q. Lin, S. Rajmohan, D. Zhang, and Q. Zhang. Ufo: A ui-focused agent for windows os interaction, 2024.
- Zhang et al. [2023a] H. Zhang, W. Du, J. Shan, Q. Zhou, Y. Du, J. B. Tenenbaum, T. Shu, and C. Gan. Building cooperative embodied agents modularly with large language models. arXiv preprint arXiv:2307.02485, 2023a.
- Zhang et al. [2023b] W. Zhang, Y. Shen, W. Lu, and Y. Zhuang. Data-copilot: Bridging billions of data and humans with autonomous workflow. arXiv preprint arXiv:2306.07209, 2023b.
- Zhou et al. [2023a] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, S. Zhang, G. Ghosh, M. Lewis, L. Zettlemoyer, and O. Levy. Lima: Less is more for alignment, 2023a.
- Zhou et al. [2023b] S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, Y. Bisk, D. Fried, U. Alon, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023b.
- Zhuge et al. [2023] M. Zhuge, H. Liu, F. Faccio, D. R. Ashley, R. Csordás, A. Gopalakrishnan, A. Hamdi, H. A. A. K. Hammoud, V. Herrmann, K. Irie, et al. Mindstorms in natural language-based societies of mind. arXiv preprint arXiv:2305.17066, 2023.