Transforming Monolithic Foundation Models into Embodied Multi-Agent Architectures for Human-Robot Collaboration

📅 2025-11-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current foundation models struggle to simultaneously satisfy service robots’ requirements for distributed perception, geometrically reliable manipulation, and proactive human-robot collaboration; scaling model size alone does not ensure autonomy in human environments. This paper proposes InteractGen, an embodied multi-agent architecture in which a large language model (LLM) serves as the coordination hub and perception, planning, actuation, and human delegation are decoupled into modular, interoperable components that form a closed-loop system. It introduces three key innovations that together improve geometric grounding and social adaptability: (1) perception-dependent hierarchical planning, (2) failure-driven reflective reasoning, and (3) dynamic human delegation strategies. In a three-month open deployment across heterogeneous robotic platforms, the approach improves task success rate, environmental generalization, and collaborative efficiency. The results support multi-agent orchestration, rather than monolithic LLM scaling, as a practical, scalable pathway toward deployable embodied intelligence.

📝 Abstract

Foundation models have become central to unifying perception and planning in robotics, yet real-world deployment exposes a mismatch between their monolithic assumption that a single model can handle all cognitive functions and the distributed, dynamic nature of practical service workflows. Vision-language models offer strong semantic understanding but lack embodiment-aware action capabilities while relying on hand-crafted skills. Vision-Language-Action policies enable reactive manipulation but remain brittle across embodiments, weak in geometric grounding, and devoid of proactive collaboration mechanisms. These limitations indicate that scaling a single model alone cannot deliver reliable autonomy for service robots operating in human-populated settings. To address this gap, we present InteractGen, an LLM-powered multi-agent framework that decomposes robot intelligence into specialized agents for continuous perception, dependency-aware planning, decision and verification, failure reflection, and dynamic human delegation, treating foundation models as regulated components within a closed-loop collective. Deployed on a heterogeneous robot team and evaluated in a three-month open-use study, InteractGen improves task success, adaptability, and human-robot collaboration, providing evidence that multi-agent orchestration offers a more feasible path toward socially grounded service autonomy than further scaling standalone models.
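
The abstract names dependency-aware planning as one agent's responsibility but gives no implementation details. As a minimal sketch of the general idea, assuming a plan can be represented as subtasks with explicit prerequisites, a topological sort yields an execution order that respects those prerequisites. The subtask names and the `plan_order` helper are hypothetical illustrations, not taken from InteractGen.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def plan_order(dependencies):
    """Return subtasks in an order where every prerequisite comes first.

    `dependencies` maps each subtask to the set of subtasks it depends on.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical subtasks for a "bring the cup to the user" request.
deps = {
    "locate_cup": set(),
    "grasp_cup": {"locate_cup"},
    "navigate_to_user": {"grasp_cup"},
    "hand_over_cup": {"navigate_to_user"},
}

order = plan_order(deps)
print(order)
```

Because this example is a simple chain, only one valid order exists; with branching dependencies, several orders may be valid and a planner could choose among them using perception results.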

Problem

Research questions and friction points this paper is trying to address.

Monolithic foundation models lack distributed capabilities for real-world service workflows

Vision-language models lack embodiment-aware action, and Vision-Language-Action policies lack proactive collaboration mechanisms

Scaling a single model fails to deliver reliable autonomy for robots in human-populated settings

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposes robot intelligence into specialized agents within a multi-agent framework

Uses LLM-powered agents for perception, planning, and verification

Treats foundation models as regulated components in a closed-loop system
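
The closed-loop idea in the points above can be sketched as a small control loop: execute a step, verify the outcome, reflect on a failure to produce a hint for the retry, and delegate to a human once retries are exhausted. This is an illustrative sketch under stated assumptions, not the paper's implementation. The `run_closed_loop` function, the `StepResult` type, the retry limit, and the stub agents are all hypothetical; in a real system each callable would wrap a model or robot skill.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StepResult:
    step: str
    succeeded: bool
    handled_by: str  # "robot" or "human"
    attempts: int    # number of robot attempts made


def run_closed_loop(
    steps: List[str],
    execute: Callable[[str, Optional[str]], str],
    verify: Callable[[str, str], bool],
    reflect: Callable[[str, str], str],
    ask_human: Callable[[str], bool],
    max_attempts: int = 2,
) -> List[StepResult]:
    """Run each step with verification, reflection on failure, and human fallback."""
    results = []
    for step in steps:
        hint = None
        done = False
        attempts = 0
        while attempts < max_attempts and not done:
            attempts += 1
            outcome = execute(step, hint)
            done = verify(step, outcome)
            if not done:
                # Turn the failed outcome into a hint for the next attempt.
                hint = reflect(step, outcome)
        if done:
            results.append(StepResult(step, True, "robot", attempts))
        else:
            # Retries exhausted: delegate the step to a person.
            results.append(StepResult(step, ask_human(step), "human", attempts))
    return results


# Stub agents (assumptions for the demo only).
def execute(step, hint):
    if step == "open_door":
        return "blocked"   # the robot never manages this step
    if step == "grasp_cup" and hint is None:
        return "slipped"   # first grasp fails, retry with a hint succeeds
    return "ok"


def verify(step, outcome):
    return outcome == "ok"


def reflect(step, outcome):
    return f"retry {step} after {outcome}"


def ask_human(step):
    return True  # assume the person completes the delegated step


results = run_closed_loop(
    ["locate_cup", "grasp_cup", "open_door"], execute, verify, reflect, ask_human
)
for r in results:
    print(r)
```

In this demo the first step succeeds immediately, the second succeeds on the retry that follows reflection, and the third is handed to a human after two failed attempts.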
