Transforming Monolithic Foundation Models into Embodied Multi-Agent Architectures for Human-Robot Collaboration

📅 2025-11-30
🤖 AI Summary
Current foundation models struggle to simultaneously satisfy service robots’ requirements for distributed perception, geometrically reliable manipulation, and proactive human-robot collaboration; scaling model size alone does not ensure autonomy in human environments. This paper proposes an embodied multi-agent architecture in which a large language model (LLM) serves as the coordination hub, decoupling perception, planning, actuation, and human delegation into modular, interoperable components that form a closed-loop system. It introduces three key innovations: (1) perception-dependent hierarchical planning, (2) failure-driven reflective reasoning, and (3) dynamic human delegation strategies, which together enhance geometric grounding and social adaptability. Over a three-month open deployment across heterogeneous robotic platforms, the approach improves task success rate, environmental generalization, and collaborative efficiency. The results support multi-agent orchestration, rather than monolithic LLM scaling, as a practical and scalable pathway toward deployable embodied intelligence.

📝 Abstract
Foundation models have become central to unifying perception and planning in robotics, yet real-world deployment exposes a mismatch between their monolithic assumption that a single model can handle all cognitive functions and the distributed, dynamic nature of practical service workflows. Vision-language models offer strong semantic understanding but lack embodiment-aware action capabilities while relying on hand-crafted skills. Vision-Language-Action policies enable reactive manipulation but remain brittle across embodiments, weak in geometric grounding, and devoid of proactive collaboration mechanisms. These limitations indicate that scaling a single model alone cannot deliver reliable autonomy for service robots operating in human-populated settings. To address this gap, we present InteractGen, an LLM-powered multi-agent framework that decomposes robot intelligence into specialized agents for continuous perception, dependency-aware planning, decision and verification, failure reflection, and dynamic human delegation, treating foundation models as regulated components within a closed-loop collective. Deployed on a heterogeneous robot team and evaluated in a three-month open-use study, InteractGen improves task success, adaptability, and human-robot collaboration, providing evidence that multi-agent orchestration offers a more feasible path toward socially grounded service autonomy than further scaling standalone models.
Problem

Research questions and friction points this paper is trying to address.

Monolithic foundation models lack distributed capabilities for real-world service workflows
Vision-language models lack embodiment-aware action capabilities, while Vision-Language-Action policies lack proactive collaboration mechanisms
Scaling a single model alone fails to deliver reliable autonomy for robots in human-populated settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposes robot intelligence into a multi-agent framework of specialized agents
Uses LLM-powered agents for perception, planning, and verification
Treats foundation models as regulated components in a closed-loop system
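
The closed loop described above (dependency-aware planning, verification, failure reflection, and human delegation) can be pictured with a small sketch. This is not InteractGen's implementation: every name here (`Step`, `plan`, `run_task`, `demo_execute`, `DEMO_STEPS`) is hypothetical, the "agents" are plain functions rather than LLM calls, and the retry-then-delegate rule is an assumed policy chosen only to make the control flow concrete.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, List, Set, Tuple


@dataclass
class Step:
    """A hypothetical subtask with the names of the steps it depends on."""
    name: str
    depends_on: Set[str] = field(default_factory=set)


def plan(steps: List[Step]) -> List[str]:
    """Dependency-aware planning: order steps so prerequisites come first."""
    graph = {s.name: s.depends_on for s in steps}
    return list(TopologicalSorter(graph).static_order())


def run_task(
    steps: List[Step],
    execute: Callable[[str, List[str]], bool],
    max_retries: int = 1,
) -> List[Tuple[str, str]]:
    """Plan, act, verify, reflect on failure, and delegate as a last resort.

    `execute` stands in for the actuation and verification agents: it
    receives the step name plus reflection notes from earlier failed
    attempts and returns True if the step was verified as done.
    """
    log: List[Tuple[str, str]] = []
    for name in plan(steps):
        notes: List[str] = []  # failure reflections fed into the next attempt
        for attempt in range(max_retries + 1):
            if execute(name, notes):
                log.append((name, "robot"))
                break
            notes.append(f"attempt {attempt} of {name} failed")
        else:
            # Assumed policy: once retries are exhausted, hand the step to a human.
            log.append((name, "human"))
    return log


# Hypothetical scenario: the grasp succeeds only after one reflection,
# and the door can never be opened by the robot.
DEMO_STEPS = [
    Step("deliver_cup", {"open_door"}),
    Step("open_door", {"grasp_cup"}),
    Step("grasp_cup", {"locate_cup"}),
    Step("locate_cup"),
]


def demo_execute(name: str, notes: List[str]) -> bool:
    if name == "open_door":
        return False
    if name == "grasp_cup":
        return len(notes) > 0
    return True


if __name__ == "__main__":
    for step_name, actor in run_task(DEMO_STEPS, demo_execute):
        print(f"{step_name}: done by {actor}")
```

In this toy run the robot completes three steps itself and delegates `open_door` to a human after its retries are exhausted. A real system would replace `demo_execute` with perception-grounded execution and checks.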
Nan Sun
University of New South Wales
Cybersecurity, Artificial Intelligence Applications
Bo Mao
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, 100876, China
Yongchang Li
Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Chenxu Wang
Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Di Guo
Xiamen University of Technology
signal processing, sensor networks, wireless communication, sparse representation, compressed sensing
Huaping Liu
Professor of Electrical Engineering, Oregon State University
Communication theory, wireless communications, signal processing, sensor networks, information security