🤖 AI Summary
Existing vision-language models (VLMs) struggle to support embodied agents with diverse morphologies and multi-view perception, hindering effective collaboration among heterogeneous agents in dynamic environments.
Method: This paper introduces VIKI-Bench, the first hierarchical, vision-driven evaluation benchmark for embodied collaboration, and proposes VIKI-R, a two-stage framework: (1) supervised fine-tuning on chain-of-thought (CoT)-annotated demonstrations to strengthen visual grounding and reasoning, followed by (2) reinforcement learning under multi-level reward signals for end-to-end collaborative planning and trajectory generation (a minimal sketch of this recipe appears after the summary).
Contribution/Results: (1) VIKI-Bench is the first benchmark to evaluate multi-granularity tasks and heterogeneous-agent coordination; (2) VIKI-R achieves the first end-to-end, reinforcement-learning-based collaborative control of morphologically diverse agents under multi-view visual observation; (3) VIKI-R significantly outperforms baselines across all hierarchical levels of VIKI-Bench, demonstrating that enhanced visual grounding substantially improves activation-, planning-, and perception-level collaboration.
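To make the two-stage recipe concrete, here is a minimal Python sketch of it. All names (`CoTSample`, `sft_step`, `rollout`, the reward terms) are hypothetical illustrations assumed for exposition, not the authors' released code, and the RL stage is reduced to reward computation with the actual policy-gradient update (e.g., a PPO/GRPO-style step) elided.

```python
# Hypothetical sketch of VIKI-R's two-stage recipe: (1) SFT on CoT-annotated
# demonstrations, then (2) RL under a multi-level reward. Names and signatures
# are illustrative assumptions, not the paper's actual code.
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class CoTSample:
    views: List[str]          # multi-view observation image paths
    instruction: str          # natural-language task prompt
    chain_of_thought: str     # annotated reasoning trace
    answer: str               # target plan or trajectory


def stage1_sft(sft_step: Callable[[str, CoTSample], float],
               data: Sequence[CoTSample]) -> None:
    """Stage 1: supervised fine-tuning on CoT-annotated demonstrations."""
    for sample in data:
        # Supervise the full reasoning trace plus the final answer.
        target = f"<think>{sample.chain_of_thought}</think>{sample.answer}"
        sft_step(target, sample)


def multi_level_reward(response: str, sample: CoTSample,
                       weights=(0.2, 0.8)) -> float:
    """Stage 2 reward: weighted sum of a format term (is the reasoning
    well-formed?) and a task term (does the answer match supervision?)."""
    w_fmt, w_task = weights
    format_ok = response.startswith("<think>") and "</think>" in response
    answer = response.split("</think>")[-1].strip()
    task_ok = answer == sample.answer
    return w_fmt * float(format_ok) + w_task * float(task_ok)


def stage2_rl(rollout: Callable[[CoTSample], str],
              update: Callable[[str, float], None],
              data: Sequence[CoTSample]) -> None:
    """Stage 2: sample responses and reinforce high-reward ones.
    The concrete policy-gradient update is intentionally left abstract."""
    for sample in data:
        response = rollout(sample)
        update(response, multi_level_reward(response, sample))
```

On this reading, the format term keeps the policy emitting parseable reasoning while the task term drives correctness; the paper's exact reward design and weighting may differ.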
📝 Abstract
Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, only a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained VLM on Chain-of-Thought-annotated demonstrations and then applies reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baseline methods across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, vision-driven cooperation in embodied AI systems.
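To make the benchmark's hierarchy concrete, the sketch below encodes the three evaluation levels and one plausible level-specific metric for each. The schema and metrics (exact-set match for activation, exact plan match for planning, a distance-based trajectory score) are assumptions for exposition, not VIKI-Bench's actual data format or evaluation code.

```python
# Illustrative encoding of VIKI-Bench's three evaluation levels; field names
# and metrics are assumptions for exposition, not the benchmark's schema.
import math
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, Dict, List


class Level(Enum):
    AGENT_ACTIVATION = auto()       # which agents should act
    TASK_PLANNING = auto()          # what each activated agent should do
    TRAJECTORY_PERCEPTION = auto()  # where agents move, grounded in views


@dataclass
class VikiTask:
    level: Level
    views: List[str]         # multi-view observation image paths
    embodiments: List[str]   # e.g. ["humanoid", "wheeled", "quadruped"]
    instruction: str
    target: Dict[str, Any]   # structured supervision for this level


def score(task: VikiTask, prediction: Any) -> float:
    """Dispatch a level-appropriate metric (illustrative choices)."""
    if task.level is Level.AGENT_ACTIVATION:
        # Exact-set match over the predicted subset of activated agents.
        return float(set(prediction) == set(task.target["agents"]))
    if task.level is Level.TASK_PLANNING:
        # Exact match of the ordered per-agent plan.
        return float(prediction == task.target["plan"])
    # Trajectory perception: mean point-wise error, mapped into (0, 1].
    errs = [math.dist(p, t)
            for p, t in zip(prediction, task.target["trajectory"])]
    return 1.0 / (1.0 + sum(errs) / max(len(errs), 1))
```

One metric per level mirrors the abstract's "structured supervision signals": each level has a checkable target, so a single model can be scored at every granularity of the cooperation problem.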