ViViDex: Learning Vision-based Dexterous Manipulation from Human Videos

📅 2024-04-24
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
Learning generalizable visual manipulation policies for multifingered dexterous hands directly from human demonstration videos, without privileged information such as ground-truth object states, remains a significant challenge. Method: We propose a two-stage visual policy learning framework: trajectory-guided reinforcement learning first distills each human video into visually natural, physically plausible state-based trajectories, and a unified visual policy is then trained on rollouts from these state-based policies, comparing behavior cloning and diffusion-based policy modeling over coordinate-transformation-enhanced point cloud representations. Contribution/Results: To our knowledge, this is the first work to train robust, generalizable, purely vision-based dexterous manipulation policies solely from raw human demonstration videos. Evaluated on three dexterous manipulation tasks in both simulation and real-robot settings, our approach significantly outperforms state-of-the-art methods. At deployment it requires only monocular RGB input, with no additional sensors or explicit state feedback, enabling practical, sensor-light robotic manipulation.

📝 Abstract
In this work, we aim to learn a unified vision-based policy for multi-fingered robot hands to manipulate a variety of objects in diverse poses. Though prior work has shown the benefits of using human videos for policy learning, performance gains have been limited by the noise in estimated trajectories. Moreover, reliance on privileged object information such as ground-truth object states further limits applicability in realistic scenarios. To address these limitations, we propose a new framework, ViViDex, to improve vision-based policy learning from human videos. It first uses reinforcement learning with trajectory-guided rewards to train state-based policies for each video, obtaining trajectories that are both visually natural and physically plausible. We then roll out successful episodes from the state-based policies and train a unified visual policy without using any privileged information. We propose a coordinate transformation to further enhance the visual point cloud representation, and compare behavior cloning and a diffusion policy for visual policy training. Experiments both in simulation and on the real robot demonstrate that ViViDex outperforms state-of-the-art approaches on three dexterous manipulation tasks.
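To make the first stage concrete, here is a minimal sketch of what a trajectory-guided reward could look like: at each timestep the agent is rewarded for keeping the simulated hand and object close to the reference trajectory recovered from the human video. The function name, error scales, and weights are illustrative assumptions, not the paper's exact reward terms.

```python
import numpy as np

def trajectory_guided_reward(hand_pos, obj_pos, ref_hand_pos, ref_obj_pos,
                             w_hand=1.0, w_obj=2.0, scale=10.0):
    """Shaped reward for tracking a reference trajectory extracted from a
    human video at the current timestep (hypothetical form; the paper's
    exact terms and weights may differ)."""
    hand_err = np.linalg.norm(hand_pos - ref_hand_pos)
    obj_err = np.linalg.norm(obj_pos - ref_obj_pos)
    # Exponential shaping keeps each term bounded in (0, 1] and gives a
    # smooth gradient toward the reference poses.
    return w_hand * np.exp(-scale * hand_err) + w_obj * np.exp(-scale * obj_err)
```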
Problem

Research questions and friction points this paper is trying to address.

Learn a unified vision-based policy for multi-fingered robot hands
Overcome noise in human video trajectory estimation
Eliminate reliance on privileged object information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning with trajectory-guided rewards (sketched after the abstract above)
Coordinate transformation enhances the visual point cloud representation (see the first sketch below)
Comparison of behavior cloning and a diffusion policy for visual policy training (see the second sketch below)
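A minimal sketch of the coordinate-transformation enhancement, assuming the idea is to express the observed point cloud in several task-relevant frames (world, hand, goal) and concatenate the results; the frame choices and function names here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def to_local_frame(points, frame_pos, frame_rot):
    """Express a world-frame point cloud (N, 3) in a local frame whose
    origin is frame_pos and whose axes are the columns of frame_rot
    (a 3x3 world-from-frame rotation): p_local = R^T (p_world - t)."""
    return (points - frame_pos) @ frame_rot

def enhanced_point_cloud(points_world, hand_pos, hand_rot, goal_pos, goal_rot):
    """Concatenate the same cloud expressed in world, hand, and goal
    frames along the feature axis, giving an (N, 9) representation.
    Hypothetical stand-in for the paper's coordinate transformation."""
    in_hand = to_local_frame(points_world, hand_pos, hand_rot)
    in_goal = to_local_frame(points_world, goal_pos, goal_rot)
    return np.concatenate([points_world, in_hand, in_goal], axis=-1)
```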
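For the second stage, the two visual policy variants being compared differ in their training objective: behavior cloning regresses expert actions directly, while a diffusion policy learns to denoise action sequences conditioned on the same observations. Below is a minimal behavior-cloning sketch in PyTorch; the architecture and dimensions are illustrative assumptions, not the paper's network.

```python
import torch.nn as nn
import torch.nn.functional as F

class BCPolicy(nn.Module):
    """Toy MLP policy trained by behavior cloning on rollouts from the
    state-based policies (illustrative architecture, not the paper's)."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs):
        return self.net(obs)

def bc_loss(policy, obs, expert_actions):
    # Mean-squared error to the expert action; a diffusion policy would
    # instead learn to predict the noise added to expert action sequences.
    return F.mse_loss(policy(obs), expert_actions)
```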