🤖 AI Summary
Video diffusion models (VDMs) remain primarily generative; their utility as general-purpose visual foundation models—especially under few-shot transfer—is unproven.
Method: To probe few-shot adaptation, we propose a lightweight fine-tuning paradigm that freezes the VDM backbone and preserves the original generative interface. Using LoRA adapters, we reformulate downstream tasks as end-to-end mappings from task-specific prompts to visual token sequences, capitalizing on the structured latent representations and implicit world knowledge that emerge during VDM pretraining.
Contribution/Results: Our approach achieves strong generalization across heterogeneous vision tasks—including semantic segmentation, human pose estimation, and abstract reasoning (ARC-AGI)—using only 2–5 annotated samples per task. This is the first empirical demonstration that VDMs can serve as unified visual learners under extreme data scarcity, opening a new pathway toward diffusion-based general visual foundation models.
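The adaptation recipe above, a frozen backbone plus trainable low-rank adapters, reduces to simple linear algebra: the pretrained weight W is kept fixed and only a low-rank update (alpha/r)·B·A is learned. A minimal sketch in pure Python (all names, shapes, and initial values are illustrative, not taken from the paper):

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, W, r, alpha):
        self.W = W                                    # frozen pretrained weight (d_out x d_in)
        d_out, d_in = len(W), len(W[0])
        self.A = [[0.01] * d_in for _ in range(r)]    # trainable down-projection (r x d_in)
        self.B = [[0.0] * r for _ in range(d_out)]    # trainable up-projection, zero-init
        self.scale = alpha / r

    def forward(self, x):
        # With x a column vector: y = W x + scale * B (A x)
        base = matmul(self.W, x)
        delta = matmul(self.B, matmul(self.A, x))
        return add(base, [[self.scale * v for v in row] for row in delta])
```

Because B is zero-initialized, the adapted layer initially reproduces the frozen model exactly, which is what lets the original generative interface be preserved while only the adapter parameters receive gradients.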
📝 Abstract
Video Diffusion Models (VDMs) have emerged as powerful generative tools, capable of synthesizing high-quality spatiotemporal content. Yet their potential goes far beyond mere video generation. We argue that the training dynamics of VDMs, driven by the need to model coherent sequences, naturally push them to internalize structured representations and an implicit understanding of the visual world. To probe the extent of this internal knowledge, we introduce a few-shot fine-tuning framework that repurposes VDMs for new tasks using only a handful of examples. Our method transforms each task into a visual transition, enabling the training of LoRA weights on short input-output sequences without altering the generative interface of a frozen VDM. Despite minimal supervision, the model exhibits strong generalization across diverse tasks, from dense perception (for example, segmentation and pose estimation) to high-level reasoning (for example, on ARC-AGI). These results reframe VDMs as more than generative engines: they are adaptable visual learners with the potential to serve as the backbone for future foundation models in vision.
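The "visual transition" framing in the abstract can be made concrete: each few-shot example becomes a short clip whose first frame is the task input and whose later frame is the target, so the VDM's usual frame-prediction interface carries the supervision unchanged. A schematic sketch (the frame representation and function names are ours, not the paper's):

```python
def make_transition(input_frame, target_frame):
    """Pack one (input, target) pair into a two-frame clip.
    Frames are 2D grids (lists of rows); any pixel or token encoding works."""
    assert len(input_frame) == len(target_frame), "frames must share height"
    return [input_frame, target_frame]

def build_fewshot_set(pairs):
    """Turn a handful (e.g. 2-5) of annotated examples into short clips
    suitable as LoRA training sequences for a frozen VDM."""
    return [make_transition(x, y) for x, y in pairs]

# Toy ARC-style example: an input grid mapped to a target grid.
x = [[0, 1], [1, 0]]
y = [[2, 2], [2, 2]]
clips = build_fewshot_set([(x, y)])
```

The same packing applies uniformly whether the target frame is a segmentation mask, a rendered pose map, or an ARC-AGI output grid, which is what makes heterogeneous tasks share one training interface.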