VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation

📅 2025-03-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses zero-shot robotic manipulation by proposing a paradigm for learning generalizable 3D manipulation skills from monocular internet videos of human actions, without requiring any real-robot interaction data. Methodologically, it introduces an affordance modeling framework that combines coarse-grained action recognition with fine-grained diffusion-based generation, integrating depth foundation models, structure-from-motion (SfM), and a two-stage diffusion architecture augmented with test-time constraint-guided planning. It also establishes a 3D hand trajectory reconstruction pipeline that ensures metric-scale consistency and temporal coherence. Evaluated on 13 manipulation tasks, the approach achieves significant zero-shot gains over state-of-the-art methods and is successfully deployed on UR5 and Franka robots, demonstrating strong cross-scene and cross-platform generalization.
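The metric-scale reconstruction step described above can be illustrated with a minimal sketch: given relative depths from a monocular depth foundation model and sparse metric depths at the same pixels from SfM, a closed-form least-squares fit recovers a scale and shift that aligns the two. The function name and the exact alignment model (affine, per-frame) are illustrative assumptions, not details from the paper.

```python
def fit_scale_shift(rel_depth, sfm_depth):
    """Least-squares fit of scale s and shift t so that
    s * rel_depth + t approximates sfm_depth at sparse SfM points.
    (Hypothetical helper; the paper's exact alignment may differ.)"""
    n = len(rel_depth)
    sum_d = sum(rel_depth)
    sum_z = sum(sfm_depth)
    sum_dd = sum(d * d for d in rel_depth)
    sum_dz = sum(d * z for d, z in zip(rel_depth, sfm_depth))
    # Normal equations for 1-D linear regression z = s*d + t
    denom = n * sum_dd - sum_d * sum_d
    s = (n * sum_dz - sum_d * sum_z) / denom
    t = (sum_z - s * sum_d) / n
    return s, t

# Toy example: metric depth happens to be exactly 2*d + 0.5
s, t = fit_scale_shift([1.0, 2.0, 3.0], [2.5, 4.5, 6.5])
# s -> 2.0, t -> 0.5
```

Applying the fitted `(s, t)` to the full relative depth map yields metric-scale depth consistent with the SfM reconstruction, which is what makes the extracted hand trajectories usable across embodiments.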

๐Ÿ“ Abstract
Future robots are envisioned as versatile systems capable of performing a variety of household tasks. The key question is how to bridge the embodiment gap while minimizing physical robot learning, which fundamentally does not scale well. We argue that learning from in-the-wild human videos offers a promising solution for robotic manipulation, as vast amounts of relevant data already exist on the internet. In this work, we present VidBot, a framework enabling zero-shot robotic manipulation using 3D affordances learned from in-the-wild monocular RGB-only human videos. VidBot uses a pipeline to extract explicit representations, namely 3D hand trajectories, from these videos, combining a depth foundation model with structure-from-motion techniques to reconstruct temporally consistent, metric-scale 3D affordance representations that are agnostic to embodiment. We introduce a coarse-to-fine affordance learning model that first identifies coarse actions in pixel space and then generates fine-grained interaction trajectories with a diffusion model, conditioned on the coarse actions and guided by test-time constraints for context-aware interaction planning, enabling substantial generalization to novel scenes and embodiments. Extensive experiments demonstrate the efficacy of VidBot, which significantly outperforms counterparts across 13 manipulation tasks in zero-shot settings and can be seamlessly deployed across robot systems in real-world environments. VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
Problem

Research questions and friction points this paper is trying to address.

Bridging embodiment gap in robotics using human videos
Learning 3D affordance from 2D videos for manipulation
Enabling zero-shot robotic manipulation in novel scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extracts 3D hand trajectories from 2D videos
Combines depth model with structure-from-motion
Uses diffusion model for fine-grained interaction planning
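The test-time constraint guidance mentioned above can be sketched in miniature: after a trajectory is sampled, a differentiable cost (here, a hypothetical squared distance from the final waypoint to a goal point) is minimized by gradient steps that nudge the sample toward satisfying the constraint. This is an illustrative stand-in for the paper's guidance scheme, not its actual implementation; the cost, step size, and function name are assumptions.

```python
def guide_trajectory(traj, target, step=0.1, iters=50):
    """Refine a sampled trajectory so its final waypoint approaches a
    target point, via gradient descent on a squared-distance cost.
    (Illustrative sketch of constraint-guided test-time planning.)"""
    traj = [list(p) for p in traj]  # copy so the input is untouched
    for _ in range(iters):
        last = traj[-1]
        # gradient of ||last - target||^2 w.r.t. last is 2*(last - target)
        for k in range(len(target)):
            last[k] -= step * 2.0 * (last[k] - target[k])
    return traj

# The endpoint converges toward the goal; earlier waypoints are untouched.
traj = guide_trajectory([[0.0, 0.0], [1.0, 1.0]], target=[2.0, 3.0])
```

In the full method, the analogous cost terms would encode context-aware constraints (e.g. collision avoidance or goal reaching) and the gradient would be applied during diffusion sampling rather than as a post-hoc refinement.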