🤖 AI Summary
This work addresses the limitations of existing vision–language–action (VLA) systems in long-horizon robotic tasks, which suffer from fragmented data-collection, policy-learning, and deployment pipelines; heavy reliance on manual resets; and fragile multi-policy execution. To overcome these challenges, the authors propose RoboClaw, a framework that unifies perception, decision-making, and control within a single vision–language model (VLM). RoboClaw introduces the novel entangled action pair (EAP) mechanism, enabling a self-resetting loop for continuous online policy refinement and end-to-end semantically consistent task execution. Experiments on physical robots demonstrate that this approach improves long-horizon task success rates by 25%, reduces human labor by 53.7%, and significantly enhances system robustness, stability, and scalability.
📝 Abstract
Vision-Language-Action (VLA) systems have shown strong potential for language-driven robotic manipulation. However, scaling them to long-horizon tasks remains challenging. Existing pipelines typically separate data collection, policy learning, and deployment, resulting in heavy reliance on manual environment resets and brittle multi-policy execution. We present RoboClaw, an agentic robotics framework that unifies data collection, policy learning, and task execution under a single VLM-driven controller. At the policy level, RoboClaw introduces Entangled Action Pairs (EAP), which couple forward manipulation behaviors with inverse recovery actions to form self-resetting loops for autonomous data collection. This mechanism enables continuous on-policy data acquisition and iterative policy refinement with minimal human intervention. During deployment, the same agent performs high-level reasoning and dynamically orchestrates learned policy primitives to accomplish long-horizon tasks. By maintaining consistent contextual semantics across collection and execution, RoboClaw reduces the mismatch between the two phases and improves multi-policy robustness. Experiments on real-world manipulation tasks demonstrate improved stability and scalability compared to conventional open-loop pipelines, while significantly reducing human effort throughout the robot lifecycle: RoboClaw achieves a 25% improvement in success rate over baseline methods on long-horizon tasks and reduces human time investment by 53.7%.
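The self-resetting loop behind Entangled Action Pairs can be illustrated with a minimal sketch: a forward behavior is paired with an inverse recovery action, and running them back-to-back restores the scene so the next rollout needs no manual reset. All class and function names below are hypothetical illustrations, not RoboClaw's actual API, and the "environment" is a toy dictionary standing in for a real robot scene.

```python
# Hypothetical sketch of an Entangled Action Pair (EAP) self-resetting
# data-collection loop, as described in the abstract. Names and structure
# are illustrative assumptions, not RoboClaw's real interface.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class EntangledActionPair:
    """Couples a forward manipulation behavior with its inverse recovery action."""
    forward: Callable[[Dict], Dict]   # e.g. "open the drawer"
    inverse: Callable[[Dict], Dict]   # e.g. "close the drawer" (restores the scene)


@dataclass
class Trajectory:
    steps: List[Dict] = field(default_factory=list)


def collect_self_resetting(eap: EntangledActionPair, state: Dict, episodes: int):
    """Run forward/inverse rollouts back-to-back: each inverse rollout resets
    the scene for the next forward rollout, so data collection continues
    without manual environment resets."""
    dataset: List[Trajectory] = []
    for _ in range(episodes):
        traj = Trajectory()
        state = eap.forward(state)       # forward rollout (logged for training)
        traj.steps.append(dict(state))
        state = eap.inverse(state)       # inverse rollout restores the scene
        traj.steps.append(dict(state))
        dataset.append(traj)
    return dataset, state


# Toy example: a drawer that the pair opens (forward) and closes (inverse).
eap = EntangledActionPair(
    forward=lambda s: {**s, "drawer": "open"},
    inverse=lambda s: {**s, "drawer": "closed"},
)
data, final_state = collect_self_resetting(eap, {"drawer": "closed"}, episodes=3)
# After every episode the scene is back in its initial state, ready for the next.
```

In a real system the forward and inverse callables would be learned policy primitives, and the collected trajectories would feed the iterative on-policy refinement the paper describes.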