🤖 AI Summary
Large language model (LLM) agents face significant challenges when interacting with PC applications—especially those lacking API interfaces—due to their reliance on unimodal (text-only) input and limited capability for real-world visual interaction, leading to hallucinations and poor domain adaptability. To address this, we propose a multimodal agent collaboration framework featuring: (1) a novel team-based orchestration chain that integrates heterogeneous expert agents; (2) VIBench—the first benchmark dedicated to vision-based interactive tasks, covering realistic scenarios including 3D gaming and office productivity applications; and (3) a unified architecture combining vision–language perception, modular role specialization, dynamic task scheduling, and cross-modal collaborative reasoning. Experiments demonstrate a 6.8% average improvement on the GAIA benchmark over existing leading systems, plus substantial gains on VIBench, validating the framework’s strong generalization and robust interaction capability with non-API desktop applications.
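The team-based orchestration chain described above can be pictured as expert agents contributing insights in sequence, each seeing its predecessors' output. The following is a minimal illustrative sketch only; all names (`Agent`, `run_chain`, the toy roles) are hypothetical stand-ins, not the paper's actual implementation.

```python
# Hypothetical sketch of a team-collaboration chain: each expert agent
# contributes an insight from its domain, and later agents can build on
# earlier ones. Names and roles are illustrative, not from the paper.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Agent:
    name: str
    # Takes the task and all prior insights, returns this agent's insight.
    handle: Callable[[str, List[str]], str]


def run_chain(agents: List[Agent], task: str) -> List[str]:
    """Pass the task through each agent in order; later agents see earlier insights."""
    insights: List[str] = []
    for agent in agents:
        insights.append(f"{agent.name}: {agent.handle(task, insights)}")
    return insights


# Toy agents standing in for heterogeneous experts (vision, planning, control).
perceiver = Agent("Perceiver", lambda t, _: f"screen shows UI for '{t}'")
planner = Agent("Planner", lambda t, prior: f"plan built from {len(prior)} insight(s)")
operator = Agent("Operator", lambda t, prior: "execute click/keyboard actions")

log = run_chain([perceiver, planner, operator], "open the settings menu")
for line in log:
    print(line)
```

The sequential hand-off is the key design point: because each agent only answers within its domain while seeing the accumulated context, knowledge-domain gaps (a common source of hallucination) are narrowed at each step.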
📝 Abstract
Large language model agents that interact with PC applications often face limitations due to their singular mode of interaction with real-world environments, leading to restricted versatility and frequent hallucinations. To address this, we propose the Multi-Modal Agent Collaboration framework (MMAC-Copilot), a framework that leverages the collective expertise of diverse agents to enhance interaction with applications. The framework introduces a team collaboration chain, enabling each participating agent to contribute insights based on its specific domain knowledge, effectively reducing the hallucinations associated with knowledge-domain gaps. We evaluate MMAC-Copilot on the GAIA benchmark and our newly introduced Visual Interaction Benchmark (VIBench). MMAC-Copilot achieved exceptional performance on GAIA, with an average improvement of 6.8% over existing leading systems. VIBench focuses on non-API-interactable applications across various domains, including 3D gaming, recreation, and office scenarios, and MMAC-Copilot also demonstrated remarkable capability on it. We hope this work inspires further research in this field and provides a more comprehensive assessment of autonomous agents. The anonymized code is available at https://anonymous.4open.science/r/ComputerAgentWithVision-3C12