Video In-context Learning

📅 2024-07-10
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Problem: Large language models struggle to perform unseen tasks by merely observing demonstration videos, hindering zero-shot cross-scenario semantic imitation.
Method: We propose an autoregressive Transformer architecture pretrained via video-based self-supervised learning, enabling in-context understanding and reproduction of semantic content from demonstration videos—without task-specific fine-tuning.
Contribution/Results: This work introduces the first video-driven zero-shot cross-scenario imitation framework. We discover that in-context imitation capability emerges spontaneously during self-supervised pretraining and follows scaling laws. The method demonstrates strong generalization across diverse embodied tasks—including manipulation and navigation—and generates videos with precise semantic alignment. Comprehensive objective and subjective evaluations validate its effectiveness. Code and pretrained models are publicly released.

📝 Abstract
People interact with the real world largely through visual signals, which are ubiquitous and convey detailed demonstrations. In this paper, we explore utilizing visual signals as a new interface for models to interact with the environment. Specifically, we choose videos as a representative visual signal. By training autoregressive Transformers on video datasets with a self-supervised objective, we find that the model develops an emergent zero-shot capability to infer the semantics of a demonstration video and imitate those semantics in an unseen scenario. This allows the models to perform unseen tasks by watching a demonstration video in an in-context manner, without further fine-tuning. To validate this imitation capacity, we design various evaluation metrics, including both objective and subjective measures. The results show that our models can generate high-quality video clips that accurately align with the semantic guidance provided by the demonstration videos, and we also show that the imitation capacity follows a scaling law. Code and models have been open-sourced.
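The self-supervised objective the abstract describes is standard autoregressive next-token prediction over discretized video tokens. A minimal sketch of how such training pairs are formed is below; the function name and token values are illustrative assumptions, not the authors' released code, and real video tokens would come from a learned visual tokenizer.

```python
# Hypothetical sketch of the self-supervised autoregressive objective:
# a video is quantized into discrete tokens and the model is trained to
# predict each token from the preceding ones.

def make_next_token_pairs(tokens):
    """Shift a token sequence by one to form (input, target) pairs,
    the standard next-token prediction objective."""
    return list(zip(tokens[:-1], tokens[1:]))

# A toy "video": frames quantized into discrete tokens and flattened
# into one sequence (frame delimiters omitted for brevity).
video_tokens = [101, 7, 42, 7, 101, 42]
pairs = make_next_token_pairs(video_tokens)
# Each pair asks the model: given this token (and its prefix), predict the next.
```

In practice the prediction is conditioned on the full prefix via a causal Transformer rather than on a single token, but the (input, target) shift shown here is the same.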
Problem

Research questions and friction points this paper is trying to address.

Can autoregressive Transformers learn zero-shot video imitation?
Can models infer semantics from demonstration videos to perform unseen tasks?
Does self-supervised training enable in-context video understanding?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive Transformers for video learning
Zero-shot video imitation without fine-tuning
Self-supervised training on video datasets
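At inference time, the zero-shot imitation listed above amounts to prompting: the demonstration video's tokens are placed before the query scene's initial frames, and the pretrained model autoregressively continues the query, carrying over the demonstrated semantics without fine-tuning. The sketch below shows only this prompt layout; the function, separator token, and token values are hypothetical illustrations, not the paper's implementation.

```python
# Hypothetical sketch of in-context video imitation: the prompt is the
# demonstration-video tokens followed by the query scene's first frames;
# the model then generates the rest of the query video.

def build_incontext_prompt(demo_tokens, query_prefix_tokens, sep_token=0):
    """Concatenate demonstration tokens and the query's initial-frame
    tokens into one sequence for autoregressive continuation."""
    return demo_tokens + [sep_token] + query_prefix_tokens

demo = [101, 7, 42]        # tokens of the demonstration video
query_start = [55, 9]      # tokens of the unseen scene's first frames
prompt = build_incontext_prompt(demo, query_start)
# The model would continue `prompt`, imitating the demo's semantics
# in the new scene — no gradient updates involved.
```

This mirrors in-context learning in language models: the demonstration plays the role of the few-shot prompt, and generation plays the role of the answer.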
🔎 Similar Papers
2024-02-20 · International Conference on Machine Learning · Citations: 30