🤖 AI Summary
Existing vision-language models (VLMs) struggle to support embodied agents with diverse morphologies and multi-view perception, hindering effective collaboration among heterogeneous agents in dynamic environments.
Method: This paper introduces VIKI-Bench, the first hierarchical, vision-driven evaluation benchmark for embodied collaboration, and proposes VIKI-R, a two-stage framework: (1) supervised fine-tuning on chain-of-thought (CoT)-annotated demonstrations to strengthen visual grounding and reasoning, followed by (2) reinforcement learning under multi-level reward signals for end-to-end collaborative planning and trajectory generation (a minimal sketch of this recipe appears after the summary).
Contribution/Results: (1) VIKI-Bench is the first benchmark to evaluate multi-granularity tasks and heterogeneous-agent coordination; (2) VIKI-R achieves the first end-to-end, reinforcement-learning-based collaborative control of morphologically diverse agents under multi-view visual observation; (3) VIKI-R significantly outperforms baselines across all hierarchical levels of VIKI-Bench, demonstrating that enhanced visual grounding substantially improves activation-, planning-, and perception-level collaboration.
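To make the two-stage recipe concrete, here is a minimal Python sketch of it. All names (`CoTSample`, `sft_step`, `rollout`, the reward terms) are hypothetical illustrations assumed for exposition, not the authors' released code, and the RL stage is reduced to reward computation with the actual policy-gradient update (e.g., a PPO/GRPO-style step) elided.

```python
# Hypothetical sketch of VIKI-R's two-stage recipe: (1) SFT on CoT-annotated
# demonstrations, then (2) RL under a multi-level reward. Names and signatures
# are illustrative assumptions, not the paper's actual code.
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class CoTSample:
    views: List[str]          # multi-view observation image paths
    instruction: str          # natural-language task prompt
    chain_of_thought: str     # annotated reasoning trace
    answer: str               # target plan or trajectory


def stage1_sft(sft_step: Callable[[str, CoTSample], float],
               data: Sequence[CoTSample]) -> None:
    """Stage 1: supervised fine-tuning on CoT-annotated demonstrations."""
    for sample in data:
        # Supervise the full reasoning trace plus the final answer.
        target = f"<think>{sample.chain_of_thought}</think>{sample.answer}"
        sft_step(target, sample)


def multi_level_reward(response: str, sample: CoTSample,
                       weights=(0.2, 0.8)) -> float:
    """Stage 2 reward: weighted sum of a format term (is the reasoning
    well-formed?) and a task term (does the answer match supervision?)."""
    w_fmt, w_task = weights
    format_ok = response.startswith("<think>") and "</think>" in response
    answer = response.split("</think>")[-1].strip()
    task_ok = answer == sample.answer
    return w_fmt * float(format_ok) + w_task * float(task_ok)


def stage2_rl(rollout: Callable[[CoTSample], str],
              update: Callable[[str, float], None],
              data: Sequence[CoTSample]) -> None:
    """Stage 2: sample responses and reinforce high-reward ones.
    The concrete policy-gradient update is intentionally left abstract."""
    for sample in data:
        response = rollout(sample)
        update(response, multi_level_reward(response, sample))
```

On this reading, the format term keeps the policy emitting parseable reasoning while the task term drives correctness; the paper's exact reward design and weighting may differ.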
📝 Abstract
Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, only a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained VLM on Chain-of-Thought-annotated demonstrations and then applies reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baseline methods across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, vision-driven cooperation in embodied AI systems.
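To make the benchmark's hierarchy concrete, the sketch below encodes the three evaluation levels and one plausible level-specific metric for each. The schema and metrics (exact-set match for activation, exact plan match for planning, a distance-based trajectory score) are assumptions for exposition, not VIKI-Bench's actual data format or evaluation code.

```python
# Illustrative encoding of VIKI-Bench's three evaluation levels; field names
# and metrics are assumptions for exposition, not the benchmark's schema.
import math
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, Dict, List


class Level(Enum):
    AGENT_ACTIVATION = auto()       # which agents should act
    TASK_PLANNING = auto()          # what each activated agent should do
    TRAJECTORY_PERCEPTION = auto()  # where agents move, grounded in views


@dataclass
class VikiTask:
    level: Level
    views: List[str]         # multi-view observation image paths
    embodiments: List[str]   # e.g. ["humanoid", "wheeled", "quadruped"]
    instruction: str
    target: Dict[str, Any]   # structured supervision for this level


def score(task: VikiTask, prediction: Any) -> float:
    """Dispatch a level-appropriate metric (illustrative choices)."""
    if task.level is Level.AGENT_ACTIVATION:
        # Exact-set match over the predicted subset of activated agents.
        return float(set(prediction) == set(task.target["agents"]))
    if task.level is Level.TASK_PLANNING:
        # Exact match of the ordered per-agent plan.
        return float(prediction == task.target["plan"])
    # Trajectory perception: mean point-wise error, mapped into (0, 1].
    errs = [math.dist(p, t)
            for p, t in zip(prediction, task.target["trajectory"])]
    return 1.0 / (1.0 + sum(errs) / max(len(errs), 1))
```

One metric per level mirrors the abstract's "structured supervision signals": each level has a checkable target, so a single model can be scored at every granularity of the cooperation problem.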