Agentic Robot: A Brain-Inspired Framework for Vision-Language-Action Models in Embodied Agents

📅 2025-05-29
🤖 AI Summary
Long-horizon robotic manipulation suffers from error accumulation, a lack of verification during execution, and limited robustness. To address these problems, the paper proposes Agentic Robot, a brain-inspired framework built on Standardized Action Procedures (SAP), a coordination protocol that organizes a closed-loop "plan, execute, verify" architecture: (1) a large reasoning model decomposes high-level instructions into executable subgoals; (2) a vision-language-action model generates low-level control commands from real-time visual inputs; and (3) a temporal verifier assesses execution introspectively, enabling autonomous progression and error recovery. SAP draws on Standardized Operating Procedures (SOPs) in human organizations and supports dynamic self-verification without external supervision. On the LIBERO benchmark, Agentic Robot achieves an average success rate of 79.6% and outperforms SpatialVLA by 6.1% and OpenVLA by 7.4% on long-horizon tasks, which the authors report as state-of-the-art. The authors also argue that SAP-driven coordination improves interpretability and reliability.

📝 Abstract
Long-horizon robotic manipulation poses significant challenges for autonomous systems, requiring extended reasoning, precise execution, and robust error recovery across complex sequential tasks. Current approaches, whether based on static planning or end-to-end visuomotor policies, suffer from error accumulation and lack effective verification mechanisms during execution, limiting their reliability in real-world scenarios. We present Agentic Robot, a brain-inspired framework that addresses these limitations through Standardized Action Procedures (SAP)--a novel coordination protocol governing component interactions throughout manipulation tasks. Drawing inspiration from Standardized Operating Procedures (SOPs) in human organizations, SAP establishes structured workflows for planning, execution, and verification phases. Our architecture comprises three specialized components: (1) a large reasoning model that decomposes high-level instructions into semantically coherent subgoals, (2) a vision-language-action executor that generates continuous control commands from real-time visual inputs, and (3) a temporal verifier that enables autonomous progression and error recovery through introspective assessment. This SAP-driven closed-loop design supports dynamic self-verification without external supervision. On the LIBERO benchmark, Agentic Robot achieves state-of-the-art performance with an average success rate of 79.6%, outperforming SpatialVLA by 6.1% and OpenVLA by 7.4% on long-horizon tasks. These results demonstrate that SAP-driven coordination between specialized components enhances both performance and interpretability in sequential manipulation, suggesting significant potential for reliable autonomous systems. Project Github: https://agentic-robot.github.io.
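The closed loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `run_task`, `TaskResult`, `max_retries`, and `steps_per_check` are hypothetical, the three components are passed in as plain callables, and "error recovery" is simplified here to re-attempting the failed subgoal a bounded number of times, which is an assumption rather than something the abstract specifies.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical interfaces, chosen for illustration only.
Planner = Callable[[str], List[str]]   # instruction -> ordered subgoals
Executor = Callable[[str, int], None]  # (subgoal, n_steps) -> act in the environment
Verifier = Callable[[str], bool]       # subgoal -> True if judged complete


@dataclass
class TaskResult:
    success: bool
    completed: List[str] = field(default_factory=list)
    failed_subgoal: Optional[str] = None


def run_task(
    instruction: str,
    plan: Planner,
    execute: Executor,
    verify: Verifier,
    max_retries: int = 2,       # assumed recovery budget per subgoal
    steps_per_check: int = 20,  # assumed number of control steps between checks
) -> TaskResult:
    """Plan once, then execute and verify each subgoal in order."""
    completed: List[str] = []
    for subgoal in plan(instruction):
        for _attempt in range(max_retries + 1):
            execute(subgoal, steps_per_check)  # executor issues low-level actions
            if verify(subgoal):                # verifier inspects the outcome
                completed.append(subgoal)
                break                          # progress to the next subgoal
        else:
            # Retry budget exhausted: stop and report where the task failed.
            return TaskResult(False, completed, subgoal)
    return TaskResult(True, completed, None)


if __name__ == "__main__":
    # Stub components: the "mug" subgoal only succeeds on its second attempt.
    attempts: dict = {}

    def plan(instruction: str) -> List[str]:
        return [s.strip() for s in instruction.split(", then ")]

    def execute(subgoal: str, n_steps: int) -> None:
        attempts[subgoal] = attempts.get(subgoal, 0) + 1

    def verify(subgoal: str) -> bool:
        return attempts[subgoal] >= (2 if "mug" in subgoal else 1)

    print(run_task("open the drawer, then pick up the mug", plan, execute, verify))
```

Keeping the planner, executor, and verifier behind narrow interfaces means each can be replaced independently, and the returned `failed_subgoal` records where a run stopped, which is one way to read the interpretability benefit the abstract mentions.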
Problem

Research questions and friction points this paper is trying to address.

Error accumulation across long-horizon robotic manipulation tasks
Lack of effective verification mechanisms during execution
Limited reliability in real-world sequential tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Brain-inspired framework for vision-language-action models
Standardized Action Procedures for component coordination
Specialized components for reasoning, execution, verification
Authors
Zhejian Yang
Jilin University
Yongchao Chen
Harvard University, Massachusetts Institute of Technology
Robot Planning, Foundation Models, Formal Methods, Mechanics, AI for Science
Xueyang Zhou
Huazhong University of Science and Technology
Jiangyue Yan
Southern University of Science and Technology
Dingjie Song
Lehigh University; CUHK-Shenzhen; Nanjing University
Multimodal Learning, Large Language Models
Yinuo Liu
Huazhong University of Science and Technology
AI security, Multimodal LLM
Yuting Li
Shanghai Jiao Tong University
Yu Zhang
Southern University of Science and Technology
Pan Zhou
Huazhong University of Science and Technology
Hechang Chen
School of Artificial Intelligence, Jilin University, China
Machine Learning, Data Mining, Deep Reinforcement Learning, Complex Network Analysis, Knowledge Graph
Lichao Sun
Lehigh University