ManiAgent: An Agentic Framework for General Robotic Manipulation

📅 2025-10-13
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Existing vision-language-action (VLA) models face limitations in complex reasoning and long-horizon task planning due to data scarcity and insufficient modeling capacity. To address this, we propose a general-purpose robotic manipulation framework based on multi-agent collaboration. The framework decouples task understanding, environment perception, sub-task decomposition, and action generation into specialized, interoperable agents, enabling end-to-end reasoning and planning via structured inter-agent communication. It further supports autonomous closed-loop data collection. By integrating VLA models with a multi-agent architecture, our approach achieves an 86.8% success rate on the SimplerEnv benchmark and 95.8% on real-world pick-and-place tasks. Notably, VLA models trained solely on autonomously collected data match the performance of those trained on human-annotated datasets, substantially alleviating the data-dependency bottleneck.
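The summary above describes a pipeline of specialized agents without giving any implementation. As a rough illustration only, here is a minimal Python sketch of how such a decoupled pipeline could be wired together; every class, method, and value below (PerceptionAgent, PlannerAgent, ActionAgent, the 7-DoF placeholder action) is a hypothetical stand-in, not the authors' code.

```python
from dataclasses import dataclass

# Hypothetical sketch of a decoupled agent pipeline in the spirit of the
# summary above; none of these names or values come from the paper.

@dataclass
class Observation:
    rgb: bytes   # camera frame (placeholder)
    task: str    # natural-language task description

@dataclass
class SceneDescription:
    objects: list[str]  # detected objects, e.g. ["apple", "drawer"]

class PerceptionAgent:
    """Environment perception: observation -> structured scene description."""
    def perceive(self, obs: Observation) -> SceneDescription:
        # A real system would call a vision model here; stubbed for the sketch.
        return SceneDescription(objects=["apple", "drawer"])

class PlannerAgent:
    """Task understanding and sub-task decomposition."""
    def decompose(self, task: str, scene: SceneDescription) -> list[str]:
        # A real system would query an LLM; stubbed with a fixed two-step plan.
        return [f"pick up the {scene.objects[0]}",
                f"place it in the {scene.objects[1]}"]

class ActionAgent:
    """Action generation: (sub-task, observation) -> motor command."""
    def act(self, subtask: str, obs: Observation) -> list[float]:
        return [0.0] * 7  # placeholder 7-DoF action

def run_pipeline(obs: Observation) -> None:
    scene = PerceptionAgent().perceive(obs)
    for subtask in PlannerAgent().decompose(obs.task, scene):
        print(subtask, "->", ActionAgent().act(subtask, obs))

if __name__ == "__main__":
    run_pipeline(Observation(rgb=b"", task="put the apple in the drawer"))
```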

📝 Abstract
While Vision-Language-Action (VLA) models have demonstrated impressive capabilities in robotic manipulation, their performance in complex reasoning and long-horizon task planning is limited by data scarcity and model capacity. To address this, we introduce ManiAgent, an agentic architecture for general manipulation tasks that achieves end-to-end output from task descriptions and environmental inputs to robotic manipulation actions. In this framework, multiple agents engage in inter-agent communication to perform environmental perception, sub-task decomposition, and action generation, enabling efficient handling of complex manipulation scenarios. Evaluations show ManiAgent achieves an 86.8% success rate on the SimplerEnv benchmark and 95.8% on real-world pick-and-place tasks, enabling efficient data collection that yields VLA models with performance comparable to those trained on human-annotated datasets. The project webpage is available at https://yi-yang929.github.io/ManiAgent/.
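The abstract's claim that the framework "enabl[es] efficient data collection" points at a closed loop in which the agent pipeline generates its own training trajectories. Below is a minimal sketch of that idea, under the assumption that only successful rollouts are kept as VLA training data; run_episode, the success check, and all constants are hypothetical, not the paper's procedure.

```python
import random

# Hypothetical sketch of autonomous closed-loop data collection: the
# agentic pipeline executes tasks on its own, and only successful
# trajectories are kept as training data for a downstream VLA policy.
# All names and numbers here are illustrative, not from the paper.

def run_episode(task: str) -> tuple[list[dict], bool]:
    """Run the agent pipeline on one task; return (trajectory, success)."""
    trajectory = [{"task": task, "step": t, "action": [0.0] * 7}
                  for t in range(10)]                # stand-in rollout
    success = random.random() < 0.9                  # stand-in success check
    return trajectory, success

def collect_dataset(tasks: list[str], episodes_per_task: int) -> list[dict]:
    dataset: list[dict] = []
    for task in tasks:
        for _ in range(episodes_per_task):
            traj, ok = run_episode(task)
            if ok:                                   # keep successful rollouts only
                dataset.extend(traj)
    return dataset

if __name__ == "__main__":
    data = collect_dataset(["put the apple in the drawer"], episodes_per_task=5)
    print(f"collected {len(data)} transitions")
```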
Problem

Research questions and friction points this paper is trying to address.

Addresses VLA models' limitations in complex reasoning and long-horizon task planning
Develops a multi-agent framework for decomposing complex manipulation tasks
Enables end-to-end action generation from task descriptions and environmental inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic framework coordinates multiple specialized agents for robotic manipulation
Structured inter-agent communication handles perception and sub-task decomposition (a message-schema sketch follows this list)
End-to-end architecture generates actions from task descriptions and environmental inputs
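The structured inter-agent communication named above could, for instance, be realized as typed messages routed between the perception, planning, and action agents. The schema below is a hypothetical illustration under that assumption, not the paper's actual protocol.

```python
from dataclasses import dataclass, field
from typing import Literal

# Hypothetical message schema for structured inter-agent communication.
# Roles and field names are assumptions for illustration only.

Role = Literal["perception", "planner", "action"]

@dataclass
class AgentMessage:
    sender: Role
    receiver: Role
    content: dict = field(default_factory=dict)

# Example: the perception agent reports the scene to the planner,
# which replies with a sub-task for the action agent.
scene_report = AgentMessage(
    sender="perception",
    receiver="planner",
    content={"objects": ["apple", "drawer"]},
)
subtask_msg = AgentMessage(
    sender="planner",
    receiver="action",
    content={"subtask": "pick up the apple"},
)
print(scene_report)
print(subtask_msg)
```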
👥 Authors

Yi Yang
Beijing University of Technology

Kefan Gu
Nanjing University

Yuqing Wen
University of Science and Technology of China

Hebei Li
PhD, University of Science and Technology of China
Event camera · Neuromorphic · 3D

Yucheng Zhao
MEGVII Technology
Robot · Large Language Model · Video Generation

Tiancai Wang
Dexmal
Computer Vision · Embodied AI

Xudong Liu
Beijing University of Technology