🤖 AI Summary
To address the challenges of heavy reliance on manual annotation, poor task generalization, and difficult sim-to-real transfer in robot manipulation policy training, this paper proposes a novel paradigm for generating trainable simulation tasks directly from open-domain Internet RGB videos. Methodologically, it adopts a two-stage framework: first, parsing videos into physically executable simulated manipulation tasks; second, iteratively refining in-context LLM-generated reward functions to guide model-agnostic reinforcement learning (PPO/SAC). Grounding task generation in real videos rather than in free-form LLM imagination avoids task hallucination and keeps the generated tasks faithful to everyday human behavior. Key contributions include: (1) a Real2Sim2Real task generation pipeline that requires neither digital twins nor human annotation; (2) an LLM-in-the-loop reward modeling mechanism; and (3) the reconstruction of 9 manipulation task categories (including dynamic, high-difficulty tasks such as throwing) from over 100 videos in the Something-Something-v2 dataset, with the learned policies validated on a physical robot.
📝 Abstract
Simulation offers a promising approach for cheaply scaling training data for generalist policies. To scalably generate data from diverse and realistic tasks, existing algorithms either rely on large language models (LLMs), which may hallucinate tasks that are uninteresting for robotics, or on digital twins, which require careful real-to-sim alignment and are hard to scale. To address these challenges, we introduce Video2Policy, a novel framework that leverages internet RGB videos to reconstruct tasks based on everyday human behavior. Our approach comprises two phases: (1) task generation in simulation from videos; and (2) reinforcement learning using iteratively refined, in-context LLM-generated reward functions. We demonstrate the efficacy of Video2Policy by reconstructing over 100 videos from the Something-Something-v2 (SSv2) dataset, which depict diverse and complex human behaviors across 9 different tasks. Our method can successfully train RL policies on such tasks, including complex and challenging ones such as throwing. Finally, we show that the generated simulation data can be scaled up to train a general policy, which can then be transferred back to a real robot in a Real2Sim2Real manner.
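The two-phase loop described in the abstract can be sketched in miniature. Everything below is an illustrative assumption rather than the paper's implementation: `parse_video_to_task`, `llm_propose_reward`, and `train_policy` are hypothetical stand-ins, the "task" is a toy 1-D move-to-position problem, and coarse search replaces PPO/SAC. The point it shows is the iteration structure: a proposed reward that gives no learning signal triggers an in-context refinement, after which training succeeds.

```python
def parse_video_to_task(video):
    """Phase 1 (stub): map a video into a simulated task spec.
    The real system parses RGB videos into simulator scenes."""
    return {"object": video["object"], "goal_pos": video["goal_pos"]}

def llm_propose_reward(task, round_idx):
    """Phase 2a (stub): stand-in for an in-context LLM reward proposal.
    Round 0 emits an overly sparse reward; after failure feedback,
    a dense shaped reward is proposed instead."""
    if round_idx == 0:
        return lambda p: 1.0 if abs(p - task["goal_pos"]) < 1e-3 else 0.0
    return lambda p: -abs(p - task["goal_pos"])  # dense shaping

def train_policy(reward_fn):
    """Phase 2b (stub): stand-in for PPO/SAC training; here just a
    coarse search over a 1-D 'move object to position' action."""
    candidates = [i * 0.05 for i in range(41)]  # actions 0.0 .. 2.0
    return max(candidates, key=reward_fn)

def video2policy(video, rounds=3, tol=0.05):
    """Iterate reward proposal -> training until the task succeeds."""
    task = parse_video_to_task(video)
    for i in range(rounds):
        action = train_policy(llm_propose_reward(task, i))
        if abs(action - task["goal_pos"]) < tol:  # success check
            return action, i
    return action, rounds - 1

action, rnd = video2policy({"object": "cup", "goal_pos": 1.37})
```

With the sparse round-0 reward every candidate action scores 0.0, so "training" returns a failing action; the refined dense reward in round 1 then guides the search to within tolerance of the goal, mirroring the iterative reward-refinement loop the framework relies on.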