🤖 AI Summary
This work addresses the challenges of long-term object management and high-level command execution for embodied robots in home environments. We propose a fine-tuning-free, LLM-driven multi-agent collaborative architecture comprising three specialized agents: a routing agent, a task planning agent, and a knowledge base agent. Leveraging in-context learning and retrieval-augmented generation (RAG), the architecture enables memory-enhanced task planning, cross-turn object state tracking, and semantic scene understanding. The end-to-end embodied intelligence system integrates several foundation models (Grounded SAM, LLaMA3.2-Vision, Qwen2.5, and LLaMA3.1) to jointly perceive, reason, and act. Experiments across three household task categories demonstrate high task planning accuracy, and RAG improves long-term memory recall. Qwen2.5 performs best for the specialized agents, while LLaMA3.1 performs best at routing. To our knowledge, this is the first memory-augmented multi-agent paradigm tailored to domestic settings, offering a scalable, fine-tuning-free architectural pathway for embodied AI.
📝 Abstract
We present an embodied robotic system with an LLM-driven agent-orchestration architecture for autonomous household object management. The system integrates memory-augmented task planning, enabling robots to execute high-level user commands while tracking past actions. It employs three specialized agents: a routing agent, a task planning agent, and a knowledge base agent, each powered by a task-specific LLM. By leveraging in-context learning, our system avoids the need for explicit model training. Retrieval-augmented generation (RAG) enables the system to retrieve context from past interactions, enhancing long-term object tracking. A combination of Grounded SAM and LLaMA3.2-Vision provides robust object detection, facilitating semantic scene understanding for task planning. Evaluation across three household scenarios demonstrates high task planning accuracy and improved memory recall due to RAG. Specifically, Qwen2.5 yields the best performance for the specialized agents, while LLaMA3.1 excels at routing. The source code is available at: https://github.com/marc1198/chat-hsr.
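The control flow among the three agents can be illustrated with a minimal sketch. It is not taken from the linked repository: every name here (`Memory`, `route`, `plan_task`, `answer_from_memory`, `handle`) is hypothetical, the LLM calls are replaced by simple rules, and retrieval uses word overlap as a stand-in for whatever RAG retriever the system actually uses.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple


def tokenize(text: str) -> set:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class Memory:
    """Hypothetical store of past actions; word overlap stands in for RAG retrieval."""
    records: List[str] = field(default_factory=list)

    def add(self, record: str) -> None:
        self.records.append(record)

    def retrieve(self, query: str, k: int = 2) -> List[str]:
        query_words = tokenize(query)
        scored = [(len(query_words & tokenize(r)), r) for r in self.records]
        # Keep only records sharing at least one word with the query.
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [record for _, record in scored[:k]]


def route(command: str) -> str:
    """Stand-in for the routing agent (assumed rule: questions go to the knowledge base)."""
    text = command.strip().lower()
    if text.endswith("?") or text.startswith(("where", "what", "which")):
        return "knowledge_base"
    return "task_planning"


def plan_task(command: str, memory: Memory) -> str:
    """Stand-in for the task planning agent: retrieve context, plan, then log the action."""
    context = memory.retrieve(command)
    plan = f"PLAN for '{command}' using {len(context)} memory record(s)"
    memory.add(f"executed: {command}")
    return plan


def answer_from_memory(command: str, memory: Memory) -> str:
    """Stand-in for the knowledge base agent: answer from the best-matching past action."""
    context = memory.retrieve(command)
    return context[0] if context else "no matching record"


def handle(command: str, memory: Memory) -> Tuple[str, str]:
    """Dispatch a user command to the agent chosen by the router."""
    agent = route(command)
    if agent == "knowledge_base":
        return agent, answer_from_memory(command, memory)
    return agent, plan_task(command, memory)


if __name__ == "__main__":
    memory = Memory()
    print(handle("put the mug in the kitchen cabinet", memory))
    print(handle("where is the mug?", memory))
```

In this sketch the second command is answered from the record written by the first, which mirrors the role the abstract assigns to retrieval over past interactions for long-term object tracking. In a real system each stand-in function would presumably wrap a prompt to the corresponding LLM.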