VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) struggle to support embodied agents with diverse morphologies and multi-view perception, hindering effective collaboration among heterogeneous agents in dynamic environments. Method: This paper introduces VIKI-Bench, the first hierarchical, vision-driven evaluation benchmark for embodied collaboration, and proposes VIKI-R, a two-stage framework: (1) fine-tuning on chain-of-thought (CoT)-annotated demonstrations to enhance visual grounding and reasoning, followed by (2) reinforcement learning under multi-level reward signals for end-to-end collaborative planning and trajectory generation. Contribution/Results: (1) VIKI-Bench is the first benchmark to support multi-granularity tasks and heterogeneous agent coordination; (2) VIKI-R achieves the first end-to-end, reinforcement-based collaborative control of morphologically diverse agents under multi-view visual observation; (3) VIKI-R significantly outperforms baselines across all hierarchical levels of VIKI-Bench, demonstrating that enhanced visual grounding substantially improves collaboration at the activation, planning, and perception levels.

📝 Abstract
Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, only a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained vision-language model (VLM) using Chain-of-Thought annotated demonstrations, followed by reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baseline methods across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, visual-driven cooperation in embodied AI systems.
Problem

Research questions and friction points this paper is trying to address.

Coordinating multiple embodied agents in dynamic environments
Supporting diverse embodiment types in multi-agent cooperation
Advancing visual-driven cooperation for heterogeneous agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical benchmark for multi-agent cooperation
Two-stage VLM fine-tuning with reinforcement learning
Multi-level reward signals for diverse embodiments
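The two-stage recipe (supervised fine-tuning on CoT-annotated demonstrations, then reinforcement learning under multi-level rewards) can be sketched on a toy softmax policy. Everything below is an illustrative assumption, not the paper's implementation: `ToyPolicy`, the reward weights, and the three reward terms (standing in for VIKI-Bench's activation, planning, and trajectory levels) are hypothetical names chosen for this sketch.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

class ToyPolicy:
    """Tiny softmax policy standing in for a fine-tuned VLM planner (hypothetical)."""

    def __init__(self, n_actions, lr=0.5):
        self.logits = [0.0] * n_actions
        self.lr = lr

    def probs(self):
        return softmax(self.logits)

    def sft_step(self, demo_action):
        # Stage 1: supervised step toward a CoT-annotated demonstration
        # (cross-entropy gradient: p - one_hot(demo_action)).
        p = self.probs()
        for a in range(len(self.logits)):
            grad = p[a] - (1.0 if a == demo_action else 0.0)
            self.logits[a] -= self.lr * grad

    def rl_step(self, action, reward, baseline=0.0):
        # Stage 2: REINFORCE-style update weighted by the multi-level reward.
        p = self.probs()
        adv = reward - baseline
        for a in range(len(self.logits)):
            grad = (1.0 if a == action else 0.0) - p[a]
            self.logits[a] += self.lr * adv * grad

def multi_level_reward(activation_ok, plan_ok, traj_err, w=(1.0, 1.0, 0.5)):
    # Hypothetical weighted combination of the three benchmark levels:
    # agent activation, task planning, and trajectory quality.
    return w[0] * activation_ok + w[1] * plan_ok - w[2] * traj_err

if __name__ == "__main__":
    random.seed(0)
    policy = ToyPolicy(n_actions=3)
    # Stage 1: fine-tune on demonstrations that always pick action 2.
    for _ in range(20):
        policy.sft_step(demo_action=2)
    # Stage 2: refine with reward signals; action 2 also scores best here.
    for _ in range(50):
        a = random.choices(range(3), weights=policy.probs())[0]
        r = multi_level_reward(activation_ok=1.0,
                               plan_ok=1.0 if a == 2 else 0.0,
                               traj_err=0.0 if a == 2 else 1.0)
        policy.rl_step(a, r, baseline=0.5)
    print(policy.probs())
```

The sketch only shows the training structure (imitation first, reward-driven refinement second); the actual method operates on VLM token sequences rather than a three-way action head.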
Authors
Li Kang
Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory
Xiufeng Song
Shanghai Jiao Tong University
Computer Vision · Embodied Intelligence
Heng Zhou
Jiangnan University
Multi-modal Learning · Image Processing · Computer Vision · Remote Sensing
Yiran Qin
Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong, Shenzhen
Jie Yang
The Chinese University of Hong Kong, Shenzhen
Xiaohong Liu
Shanghai Jiao Tong University
Philip Torr
Professor, University of Oxford, Department of Engineering
Lei Bai
Shanghai AI Laboratory
Foundation Model · Science Intelligence · Multi-Agent System · Autonomous Discovery
Zhenfei Yin
University of Oxford
Deep Learning · Multimodal · AI Agent · Robotics