Solving New Tasks by Adapting Internet Video Knowledge

📅 2025-04-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address zero-shot, text-conditioned generalization to novel robotic tasks, this paper proposes Inverse Probabilistic Adaptation (IPA), a method that integrates an internet-scale pretrained video diffusion model with a small set of in-domain robot demonstration videos. IPA consistently achieves strong generalization across robotic tasks and settings, and is notably robust to the quality of the adaptation data, solving novel tasks even when only suboptimal in-domain demonstrations are available. Evaluated across diverse robotic environments, IPA executes unseen natural-language instructions using only a small number of demonstration videos. The work positions adaptation of pretrained video models as a scalable, data-efficient route to text-conditioned behavioral generalization in robotics.

📝 Abstract
Video generative models demonstrate great promise in robotics by serving as visual planners or as policy supervisors. When pretrained on internet-scale data, such video models develop a detailed understanding of alignment with natural language, and can thus facilitate generalization to novel downstream behavior through text-conditioning. However, they may not be sensitive to the specificities of the particular environment the agent inhabits. Conversely, training video models on in-domain examples of robotic behavior naturally encodes environment-specific intricacies, but the scale of available demonstrations may not be sufficient to support generalization to unseen tasks via natural-language specification. In this work, we investigate adaptation techniques that integrate in-domain information with large-scale pretrained video models, and explore the extent to which they enable novel text-conditioned generalization for robotic tasks, while also considering their respective data and resource requirements. We demonstrate across robotic environments that adapting powerful video models with small amounts of example data can facilitate generalization to novel behaviors. In particular, we present a novel adaptation strategy, termed Inverse Probabilistic Adaptation, that not only consistently achieves strong generalization performance across robotic tasks and settings, but also exhibits robustness to the quality of the adaptation data, successfully solving novel tasks even when only suboptimal in-domain demonstrations are available.
Problem

Research questions and friction points this paper is trying to address.

Adapting internet video models for robotic task generalization
Balancing large-scale pretraining with domain-specific data needs
Enhancing generalization to novel text-conditioned robotic behaviors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapting pretrained video models with in-domain data
Introducing Inverse Probabilistic Adaptation strategy
Enabling generalization with minimal example data
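The adaptation idea above can be illustrated with a toy sketch. One common formulation in the probabilistic-adaptation family composes the score (gradient of log-density) of a broad pretrained model with that of a small in-domain model, so the in-domain data reshapes, rather than replaces, the pretrained prior. This is an illustrative assumption about that general family, not the paper's exact IPA algorithm; all function names and distributions below are hypothetical.

```python
# Toy product-of-experts score composition: a broad "pretrained" Gaussian
# prior is adapted by a narrow "in-domain" Gaussian. Scores of a product of
# densities simply add, and we sample the composed density with unadjusted
# Langevin dynamics. Purely illustrative; not the paper's IPA algorithm.
import numpy as np

def score_pretrained(x):
    # Score of a broad prior, N(0, 4): d/dx log p(x) = -x / 4
    return -x / 4.0

def score_in_domain(x):
    # Score of a narrow in-domain density, N(2, 1)
    return -(x - 2.0)

def adapted_score(x, weight=1.0):
    # log p_adapted = log p_pretrained + weight * log p_in_domain (+ const),
    # so the scores add.
    return score_pretrained(x) + weight * score_in_domain(x)

def langevin_sample(score_fn, x0, step=0.1, n_steps=500, seed=0):
    # Unadjusted Langevin: x <- x + step * score(x) + sqrt(2 * step) * noise
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + step * score_fn(x) + np.sqrt(2 * step) * rng.standard_normal(x.shape)
    return x

samples = np.array([langevin_sample(adapted_score, [0.0], seed=s) for s in range(200)])
# The product of N(0, 4) and N(2, 1) is N(1.6, 0.8), so the sample mean
# should land near 1.6: pulled toward the in-domain data, tempered by the prior.
print(samples.mean())
```

The `weight` knob is the interesting part: at 0 the agent samples the unmodified pretrained prior, and increasing it trades general knowledge for environment-specific fidelity, which is the balance the paper's adaptation study is about.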