From Generation to Generalization: Emergent Few-Shot Learning in Video Diffusion Models

📅 2025-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video diffusion models (VDMs) remain primarily generative; their utility as general-purpose visual foundation models—especially under few-shot transfer—is unproven. Method: To address weak few-shot adaptation, we propose a lightweight fine-tuning paradigm that freezes the VDM backbone and preserves the original generative interface. Leveraging LoRA adapters, we reformulate downstream tasks as end-to-end mappings from task-specific prompts to visual token sequences, capitalizing on the structured latent representations and implicit world knowledge emergent during VDM pretraining. Contribution/Results: Our approach achieves strong generalization across heterogeneous vision tasks—including semantic segmentation, human pose estimation, and abstract reasoning (ARC-AGI)—using only 2–5 annotated samples per task. This is the first empirical demonstration that VDMs can serve as unified visual learners under extreme data scarcity, opening a new pathway toward diffusion-based general visual foundation models.


📝 Abstract
Video Diffusion Models (VDMs) have emerged as powerful generative tools, capable of synthesizing high-quality spatiotemporal content. Yet their potential goes far beyond mere video generation. We argue that the training dynamics of VDMs, driven by the need to model coherent sequences, naturally push them to internalize structured representations and an implicit understanding of the visual world. To probe the extent of this internal knowledge, we introduce a few-shot fine-tuning framework that repurposes VDMs for new tasks using only a handful of examples. Our method transforms each task into a visual transition, enabling the training of LoRA weights on short input-output sequences without altering the generative interface of a frozen VDM. Despite minimal supervision, the model exhibits strong generalization across diverse tasks, from low-level vision (for example, segmentation and pose estimation) to high-level reasoning (for example, on ARC-AGI). These results reframe VDMs as more than generative engines. They are adaptable visual learners with the potential to serve as the backbone for future foundation models in vision.
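The paper does not include code here, but the core mechanism it describes (a frozen backbone whose weights are untouched, with small trainable low-rank adapters added alongside them) can be sketched in a few lines. The snippet below is a minimal NumPy illustration of a LoRA-adapted linear layer, not the authors' implementation; the class name, ranks, and initialization scheme are illustrative assumptions. The key property it demonstrates is that the adapter path `B @ A` is zero-initialized, so before any fine-tuning the layer is exactly the frozen pretrained layer, preserving the original generative interface.

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update scaled by alpha/r.

    Illustrative sketch of the LoRA idea; hyperparameters are placeholders,
    not values from the paper.
    """

    def __init__(self, d_in, d_out, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        # Stand-in for a pretrained weight: frozen, never updated.
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
        # Trainable low-rank factors. B starts at zero, so the adapter
        # initially contributes nothing and the layer matches the base model.
        self.A = 0.01 * rng.standard_normal((r, d_in))
        self.B = np.zeros((d_out, r))
        self.scale = alpha / r

    def __call__(self, x):
        # Frozen path + scaled low-rank adapter path.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

During few-shot adaptation only `A` and `B` (a tiny fraction of the parameters) would receive gradients, which is what makes training on 2-5 examples per task feasible without destroying the pretrained representations.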
Problem

Research questions and friction points this paper is trying to address.

Probes the internal knowledge of Video Diffusion Models (VDMs)
Introduces few-shot fine-tuning for new visual tasks
Demonstrates VDM generalization from low-level to high-level vision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Few-shot fine-tuning for video diffusion models
LoRA weight training on short input-output sequences
Adaptable visual learners for diverse tasks