Parse-Augment-Distill: Learning Generalizable Bimanual Visuomotor Policies from Single Human Video

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key bottlenecks in dual-arm robotic policy learning—namely, heavy reliance on large-scale teleoperation data, poor generalization, and the Sim2Real gap—by proposing a novel paradigm that learns highly generalizable visuomotor policies from just a single human demonstration video. Methodologically, it introduces the PAD unified framework: (1) parsing bimanual hand keypoint trajectories directly from human videos; (2) performing task-semantic demonstration augmentation without simulation; and (3) distilling knowledge via a keypoint-conditioned policy network. PAD is the first approach to holistically integrate trajectory parsing, task-level augmentation, and policy distillation while entirely bypassing simulation modeling and domain shift. Evaluated on six real-world dual-arm tasks—including pouring and debris clearing—it achieves zero-shot transfer and one-shot deployment, significantly outperforming end-to-end image-based policies and simulation-augmented baselines. The method demonstrates exceptional sample efficiency and engineering practicality.

📝 Abstract
Learning visuomotor policies from expert demonstrations is an important frontier in modern robotics research; however, most popular methods require copious effort for collecting teleoperation data and struggle to generalize out-of-distribution. Scaling data collection has been explored through leveraging human videos, as well as demonstration augmentation techniques. The latter approach typically requires expensive simulation rollouts and trains policies with synthetic image data, therefore introducing a sim-to-real gap. In parallel, alternative state representations such as keypoints have shown great promise for category-level generalization. In this work, we bring these avenues together in a unified framework: PAD (Parse-Augment-Distill), for learning generalizable bimanual policies from a single human video. Our method relies on three steps: (a) parsing a human video demo into a robot-executable keypoint-action trajectory, (b) employing bimanual task-and-motion-planning to augment the demonstration at scale without simulators, and (c) distilling the augmented trajectories into a keypoint-conditioned policy. Empirically, we showcase that PAD outperforms state-of-the-art bimanual demonstration augmentation works relying on image policies with simulation rollouts, both in terms of success rate and sample/cost efficiency. We deploy our framework on six diverse real-world bimanual tasks such as pouring drinks, cleaning trash and opening containers, producing one-shot policies that generalize to unseen spatial arrangements, object instances and background distractors. Supplementary material can be found on the project webpage: https://gtziafas.github.io/PAD_project/.
Problem

Research questions and friction points this paper is trying to address.

Learning generalizable bimanual policies from single human videos
Overcoming limitations of teleoperation data collection methods
Addressing sim-to-real gap in visuomotor policy training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parsing human videos into robot-executable keypoint trajectories
Augmenting demonstrations with bimanual task-and-motion planning
Distilling augmented trajectories into keypoint-conditioned policies
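
The three stages above form a single data pipeline. A minimal sketch of that flow is below; all names and data structures here are hypothetical illustrations, not the paper's actual implementation (a real system would run hand-keypoint detection on video and a bimanual task-and-motion planner for augmentation):

```python
from dataclasses import dataclass

# Hypothetical representation of one demonstration: per-step keypoints
# paired with robot-executable actions (here, end-effector deltas).
@dataclass
class Demo:
    keypoints: list  # one 3D keypoint per step
    actions: list    # one action per step

def parse(video_frames):
    """Step (a): parse a human video into a keypoint-action trajectory.
    Stub: a real system detects bimanual hand keypoints per frame."""
    kps = [(float(i), 0.0, 0.0) for i in range(len(video_frames))]
    acts = [(0.01, 0.0, 0.0)] * len(video_frames)
    return Demo(kps, acts)

def augment(demo, offsets):
    """Step (b): task-level augmentation without simulation, e.g. by
    re-planning the trajectory under translated object placements."""
    out = []
    for dx, dy in offsets:
        kps = [(x + dx, y + dy, z) for x, y, z in demo.keypoints]
        out.append(Demo(kps, demo.actions))
    return out

def distill(demos):
    """Step (c): flatten augmented demos into (keypoint, action) pairs
    for training a keypoint-conditioned policy."""
    return [(kp, a) for d in demos for kp, a in zip(d.keypoints, d.actions)]

demo = parse(["frame"] * 3)                               # single human video
dataset = distill(augment(demo, [(0.0, 0.0), (0.1, -0.1)]))
print(len(dataset))  # 2 augmented demos x 3 steps = 6 training pairs
```

The key property this sketch mirrors is that augmentation operates on keypoint trajectories rather than images, so no simulator renders are needed and no sim-to-real image gap is introduced.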
Georgios Tziafas
Department of Artificial Intelligence, University of Groningen, the Netherlands
Jiayun Zhang
Department of Artificial Intelligence, University of Groningen, the Netherlands
Hamidreza Kasaei
Associate Professor, Department of Artificial Intelligence, University of Groningen
Robotics · Machine Learning · Machine Vision · Artificial Intelligence