🤖 AI Summary
Vision-language-action (VLA) models for bimanual robot manipulation depend heavily on task-specific human demonstrations, which limits generalization and makes data collection costly. Method: This paper proposes a task-agnostic action learning paradigm that decouples action execution from task semantics. We introduce ATARA, a self-supervised framework that automates the collection of task-agnostic random-action data safely and at scale, and AnyPos, an inverse dynamics model equipped with Arm-Decoupled Estimation and a Direction-Aware Decoder that learns from this data and transfers zero-shot across tasks. A video-conditioned action validation module further verifies the feasibility of the learned policies. Contribution/Results: Experiments demonstrate a 51% improvement in test accuracy over human teleoperation-based baselines and 30–40% higher success rates on downstream tasks such as lifting, pick-and-place, and clicking. The approach establishes a new paradigm for bimanual robotic learning with low data dependency and strong generalization capability.
📝 Abstract
Vision-language-action (VLA) models have shown promise on task-conditioned control in complex settings such as bimanual manipulation. However, the heavy reliance on task-specific human demonstrations limits their generalization and incurs high data acquisition costs. In this work, we present a new notion of a task-agnostic action paradigm that decouples action execution from task-specific conditioning, enhancing scalability, efficiency, and cost-effectiveness. To address the data collection challenges posed by this paradigm -- such as low coverage density, behavioral redundancy, and safety risks -- we introduce ATARA (Automated Task-Agnostic Random Actions), a scalable self-supervised framework that accelerates collection by over $30\times$ compared to human teleoperation. To further enable effective learning from task-agnostic data, which often suffers from distribution mismatch and irrelevant trajectories, we propose AnyPos, an inverse dynamics model equipped with Arm-Decoupled Estimation and a Direction-Aware Decoder (DAD). We additionally integrate a video-conditioned action validation module to verify the feasibility of learned policies across diverse manipulation tasks. Extensive experiments show that the AnyPos-ATARA pipeline yields a 51% improvement in test accuracy and achieves 30–40% higher success rates in downstream tasks such as lifting, pick-and-place, and clicking, using replay-based video validation. Project Page: https://embodiedfoundation.github.io/vidar_anypos
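The abstract describes AnyPos only at a high level; the sketch below illustrates what an image-to-joint-position inverse dynamics model with per-arm decoupled decoder heads could look like. The architecture, class names, backbone choice, and joint dimensions are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of an arm-decoupled inverse dynamics model.
# All names and dimensions are assumptions for illustration only;
# this is not the AnyPos implementation from the paper.
import torch
import torch.nn as nn
import torchvision.models as models


class ArmDecoupledInverseDynamics(nn.Module):
    """Predicts per-arm joint positions from a single RGB observation."""

    def __init__(self, joints_per_arm: int = 7):
        super().__init__()
        # Shared visual backbone (assumed: a ResNet-18 feature extractor).
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()
        self.encoder = backbone
        feat_dim = 512

        # Decoupled heads: each arm gets its own decoder so errors in one
        # arm's estimate do not leak into the other's prediction.
        def make_head() -> nn.Module:
            return nn.Sequential(
                nn.Linear(feat_dim, 256),
                nn.ReLU(),
                nn.Linear(256, joints_per_arm),
            )

        self.left_head = make_head()
        self.right_head = make_head()

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feat = self.encoder(image)               # (B, 512) image features
        left = self.left_head(feat)              # (B, joints_per_arm)
        right = self.right_head(feat)            # (B, joints_per_arm)
        return torch.cat([left, right], dim=-1)  # (B, 2 * joints_per_arm)


if __name__ == "__main__":
    model = ArmDecoupledInverseDynamics()
    obs = torch.randn(4, 3, 224, 224)                  # batch of RGB observations
    target = torch.randn(4, 14)                        # ground-truth joint positions
    loss = nn.functional.mse_loss(model(obs), target)  # simple regression objective
    loss.backward()
```

In this reading, training reduces to supervised regression on the automatically collected random-action data: the model sees an observation and regresses the joint configuration that produced it, so no task labels or human demonstrations are required.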