🤖 AI Summary
Vision-language-action (VLA) models for bimanual robot manipulation depend heavily on task-specific human demonstrations, which limits generalization and makes data collection costly. Method: This paper proposes a task-agnostic action learning paradigm that decouples action execution from task semantics. We introduce ATARA, a self-supervised framework that automates the collection of task-agnostic random-action data safely and at scale, and AnyPos, an inverse dynamics model equipped with Arm-Decoupled Estimation and a Direction-Aware Decoder that learns from this data and transfers zero-shot across tasks. A video-conditioned action validation module further verifies the feasibility of the learned policies. Contribution/Results: Experiments demonstrate a 51% improvement in test accuracy over human teleoperation-based baselines and 30–40% higher success rates on downstream tasks such as lifting, pick-and-place, and clicking. The approach establishes a new paradigm for bimanual robotic learning with low data dependency and strong generalization capability.
📝 Abstract
Vision-language-action (VLA) models have shown promise on task-conditioned control in complex settings such as bimanual manipulation. However, the heavy reliance on task-specific human demonstrations limits their generalization and incurs high data acquisition costs. In this work, we present a new notion of a task-agnostic action paradigm that decouples action execution from task-specific conditioning, enhancing scalability, efficiency, and cost-effectiveness. To address the data collection challenges posed by this paradigm -- such as low coverage density, behavioral redundancy, and safety risks -- we introduce ATARA (Automated Task-Agnostic Random Actions), a scalable self-supervised framework that accelerates collection by over $30\times$ compared to human teleoperation. To further enable effective learning from task-agnostic data, which often suffers from distribution mismatch and irrelevant trajectories, we propose AnyPos, an inverse dynamics model equipped with Arm-Decoupled Estimation and a Direction-Aware Decoder (DAD). We additionally integrate a video-conditioned action validation module to verify the feasibility of learned policies across diverse manipulation tasks. Extensive experiments show that the AnyPos-ATARA pipeline yields a 51% improvement in test accuracy and achieves 30–40% higher success rates in downstream tasks such as lifting, pick-and-place, and clicking, using replay-based video validation. Project Page: https://embodiedfoundation.github.io/vidar_anypos
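The abstract describes AnyPos only at a high level; the sketch below illustrates what an image-to-joint-position inverse dynamics model with per-arm decoupled decoder heads could look like. The architecture, class names, backbone choice, and joint dimensions are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of an arm-decoupled inverse dynamics model.
# All names and dimensions are assumptions for illustration only;
# this is not the AnyPos implementation from the paper.
import torch
import torch.nn as nn
import torchvision.models as models


class ArmDecoupledInverseDynamics(nn.Module):
    """Predicts per-arm joint positions from a single RGB observation."""

    def __init__(self, joints_per_arm: int = 7):
        super().__init__()
        # Shared visual backbone (assumed: a ResNet-18 feature extractor).
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()
        self.encoder = backbone
        feat_dim = 512

        # Decoupled heads: each arm gets its own decoder so errors in one
        # arm's estimate do not leak into the other's prediction.
        def make_head() -> nn.Module:
            return nn.Sequential(
                nn.Linear(feat_dim, 256),
                nn.ReLU(),
                nn.Linear(256, joints_per_arm),
            )

        self.left_head = make_head()
        self.right_head = make_head()

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feat = self.encoder(image)               # (B, 512) image features
        left = self.left_head(feat)              # (B, joints_per_arm)
        right = self.right_head(feat)            # (B, joints_per_arm)
        return torch.cat([left, right], dim=-1)  # (B, 2 * joints_per_arm)


if __name__ == "__main__":
    model = ArmDecoupledInverseDynamics()
    obs = torch.randn(4, 3, 224, 224)                  # batch of RGB observations
    target = torch.randn(4, 14)                        # ground-truth joint positions
    loss = nn.functional.mse_loss(model(obs), target)  # simple regression objective
    loss.backward()
```

In this reading, training reduces to supervised regression on the automatically collected random-action data: the model sees an observation and regresses the joint configuration that produced it, so no task labels or human demonstrations are required.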