🤖 AI Summary
Large language model (LLM) agents face significant challenges when interacting with PC applications—especially those lacking API interfaces—due to their reliance on unimodal (text-only) input and limited capability for real-world visual interaction, leading to hallucinations and poor domain adaptability. To address this, we propose a multimodal agent collaboration framework featuring: (1) a novel team-based orchestration chain that integrates heterogeneous expert agents; (2) VIBench—the first benchmark dedicated to vision-based interactive tasks, covering realistic scenarios including 3D gaming and office productivity applications; and (3) a unified architecture combining vision–language perception, modular role specialization, dynamic task scheduling, and cross-modal collaborative reasoning. Experiments demonstrate a 6.8% average improvement on the GAIA benchmark over existing leading systems, plus substantial gains on VIBench, validating the framework’s strong generalization and robust interaction capability with non-API desktop applications.
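The team-based orchestration chain described above can be pictured as expert agents contributing insights in sequence, each seeing its predecessors' output. The following is a minimal illustrative sketch only; all names (`Agent`, `run_chain`, the toy roles) are hypothetical stand-ins, not the paper's actual implementation.

```python
# Hypothetical sketch of a team-collaboration chain: each expert agent
# contributes an insight from its domain, and later agents can build on
# earlier ones. Names and roles are illustrative, not from the paper.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Agent:
    name: str
    # Takes the task and all prior insights, returns this agent's insight.
    handle: Callable[[str, List[str]], str]


def run_chain(agents: List[Agent], task: str) -> List[str]:
    """Pass the task through each agent in order; later agents see earlier insights."""
    insights: List[str] = []
    for agent in agents:
        insights.append(f"{agent.name}: {agent.handle(task, insights)}")
    return insights


# Toy agents standing in for heterogeneous experts (vision, planning, control).
perceiver = Agent("Perceiver", lambda t, _: f"screen shows UI for '{t}'")
planner = Agent("Planner", lambda t, prior: f"plan built from {len(prior)} insight(s)")
operator = Agent("Operator", lambda t, prior: "execute click/keyboard actions")

log = run_chain([perceiver, planner, operator], "open the settings menu")
for line in log:
    print(line)
```

The sequential hand-off is the key design point: because each agent only answers within its domain while seeing the accumulated context, knowledge-domain gaps (a common source of hallucination) are narrowed at each step.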
📝 Abstract
Large language model agents that interact with PC applications often face limitations due to their singular mode of interaction with real-world environments, leading to restricted versatility and frequent hallucinations. To address this, we propose the Multi-Modal Agent Collaboration framework (MMAC-Copilot), a framework that leverages the collective expertise of diverse agents to enhance interaction with applications. The framework introduces a team collaboration chain, enabling each participating agent to contribute insights based on its specific domain knowledge, effectively reducing the hallucinations associated with knowledge-domain gaps. We evaluate MMAC-Copilot on the GAIA benchmark and our newly introduced Visual Interaction Benchmark (VIBench). MMAC-Copilot achieved exceptional performance on GAIA, with an average improvement of 6.8% over existing leading systems. VIBench focuses on non-API-interactable applications across various domains, including 3D gaming, recreation, and office scenarios, and MMAC-Copilot also demonstrated remarkable capability on it. We hope this work inspires further research in this field and provides a more comprehensive assessment of autonomous agents. The anonymized code is available at https://anonymous.4open.science/r/ComputerAgentWithVision-3C12