RoboClaw: An Agentic Framework for Scalable Long-Horizon Robotic Tasks

📅 2026-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing vision–language–action (VLA) systems in long-horizon robotic tasks, which suffer from fragmented data collection, policy learning, and deployment pipelines, heavy reliance on manual resets, and fragile multi-policy execution. To overcome these challenges, the authors propose RoboClaw, a framework that unifies perception, decision-making, and control within a single vision–language model (VLM). RoboClaw introduces the novel entangled action pair (EAP) mechanism, enabling a self-resetting loop for continuous online policy refinement and end-to-end semantically consistent task execution. Experiments on physical robots demonstrate that this approach improves long-horizon task success rates by 25%, reduces human labor by 53.7%, and significantly enhances system robustness, stability, and scalability.

📝 Abstract
Vision-Language-Action (VLA) systems have shown strong potential for language-driven robotic manipulation. However, scaling them to long-horizon tasks remains challenging. Existing pipelines typically separate data collection, policy learning, and deployment, resulting in heavy reliance on manual environment resets and brittle multi-policy execution. We present RoboClaw, an agentic robotics framework that unifies data collection, policy learning, and task execution under a single VLM-driven controller. At the policy level, RoboClaw introduces Entangled Action Pairs (EAP), which couple forward manipulation behaviors with inverse recovery actions to form self-resetting loops for autonomous data collection. This mechanism enables continuous on-policy data acquisition and iterative policy refinement with minimal human intervention. During deployment, the same agent performs high-level reasoning and dynamically orchestrates learned policy primitives to accomplish long-horizon tasks. By maintaining consistent contextual semantics across collection and execution, RoboClaw reduces mismatch between the two phases and improves multi-policy robustness. Experiments in real-world manipulation tasks demonstrate improved stability and scalability compared to conventional open-loop pipelines, while significantly reducing human effort throughout the robot lifecycle, achieving a 25% improvement in success rate over baseline methods on long-horizon tasks and reducing human time investment by 53.7%.
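The core idea of Entangled Action Pairs — coupling a forward manipulation skill with an inverse recovery skill so that the scene resets itself between episodes — can be sketched in a few lines. This is a minimal illustrative sketch only; the names (`EntangledActionPair`, `collect_self_resetting`, the toy pick/place skills) are assumptions for exposition, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical state/trajectory types for illustration.
State = str
Trajectory = List[Tuple[State, str]]  # (state, action) pairs

@dataclass
class EntangledActionPair:
    """A forward skill coupled with its inverse recovery skill.

    Executing forward() and then inverse() returns the scene to
    (roughly) its initial state, forming a self-resetting loop
    for autonomous data collection.
    """
    forward: Callable[[State], Tuple[State, Trajectory]]
    inverse: Callable[[State], Tuple[State, Trajectory]]

def collect_self_resetting(pair: EntangledActionPair,
                           initial_state: State,
                           episodes: int):
    """Run forward/inverse cycles, logging both trajectories.

    Because each cycle ends in (approximately) the start state,
    no manual environment reset is needed between episodes.
    """
    data = []
    state = initial_state
    for _ in range(episodes):
        state, fwd_traj = pair.forward(state)   # e.g. pick the object up
        data.append(("forward", fwd_traj))
        state, inv_traj = pair.inverse(state)   # e.g. place it back down
        data.append(("inverse", inv_traj))
    return data, state

# Toy skills: "pick" moves the object to the gripper, "place" restores it.
def pick(state: State):
    return "object_in_gripper", [(state, "pick")]

def place(state: State):
    return "object_on_table", [(state, "place")]

pair = EntangledActionPair(forward=pick, inverse=place)
data, final_state = collect_self_resetting(pair, "object_on_table", episodes=3)
print(len(data), final_state)  # 6 logged trajectories, scene back in start state
```

Both the forward and inverse trajectories are logged, so each unattended cycle yields training data for two skills while leaving the workspace ready for the next episode.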
Problem

Research questions and friction points this paper is trying to address.

long-horizon tasks
robotic manipulation
Vision-Language-Action systems
policy learning
autonomous data collection
Innovation

Methods, ideas, or system contributions that make the work stand out.

RoboClaw
Entangled Action Pairs
Vision-Language-Action
self-resetting loops
long-horizon robotic tasks
Ruiying Li
AgiBot, China; National University of Singapore
Yunlang Zhou
AgiBot, China; Shanghai Jiao Tong University, Shanghai 200240, China
YuYao Zhu
AgiBot, China; Shanghai Jiao Tong University, Shanghai 200240, China
Kylin Chen
AgiBot, China
Jingyuan Wang
AgiBot, China
Sukai Wang
AgiBot, China
Kongtao Hu
AgiBot, China
Minhui Yu
AgiBot, China
Bowen Jiang
University of Pennsylvania; Microsoft Corporation
Artificial Intelligence · Post-training · Personalization · Multimodality
Zhan Su
University of Montreal; MILA
PEFT approaches · LLMs · Information retrieval
Jiayao Ma
AgiBot, China
Xin He
AgiBot, China
Yongjian Shen
AgiBot, China
Yangyang
AgiBot, China
Guanghui Ren
AgiBot, China
Maoqing Yao
Google
Wenhao Wang
AgiBot, China
Yao Mu
AgiBot, China; MoE Key Lab of Artificial Intelligence, AI Institute, SJTU