🤖 AI Summary
Existing hand-object interaction video generation methods suffer from inconsistent viewpoints and misaligned interactions, limiting their utility for high-fidelity robotic imitation learning. To address this, we propose TASTE-Rob: the first large-scale (100K+), first-person, viewpoint-consistent, and language-aligned hand-object interaction video dataset. We design a three-stage hand pose refinement pipeline that significantly improves the physical plausibility and spatiotemporal coherence of grasping poses. Our method builds upon video diffusion models, integrating multi-stage hand keypoint optimization, language-action alignment modeling, and egocentric video synthesis. Experiments demonstrate that our generated videos achieve superior physical plausibility and task executability compared to baselines. Moreover, downstream robotic manipulation policies trained on our data generalize better to unseen objects and deploy more reliably in real-world settings.
📝 Abstract
We address key limitations in existing datasets and models for task-oriented hand-object interaction video generation, a critical approach for generating video demonstrations for robotic imitation learning. Current datasets, such as Ego4D, often suffer from inconsistent view perspectives and misaligned interactions, leading to reduced video quality and limiting their applicability for precise imitation learning tasks. To this end, we introduce TASTE-Rob -- a pioneering large-scale dataset of 100,856 egocentric hand-object interaction videos. Each video is meticulously aligned with language instructions and recorded from a consistent camera viewpoint to ensure interaction clarity. By fine-tuning a Video Diffusion Model (VDM) on TASTE-Rob, we achieve realistic object interactions, though we observe occasional inconsistencies in hand grasping postures. To enhance realism, we introduce a three-stage pose-refinement pipeline that improves hand posture accuracy in generated videos. Our curated dataset, coupled with the specialized pose-refinement framework, provides notable performance gains in generating high-quality, task-oriented hand-object interaction videos, ultimately enabling more generalizable robotic manipulation. The TASTE-Rob dataset will be made publicly available upon publication to foster further advancements in the field.