UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing vision foundation models typically require task-specific pretraining or architectural modifications to handle diverse image and video tasks.

Method: This work proposes a unified vision foundation model framework that leverages a single, task-agnostic pretrained video generative model—specifically, a video diffusion Transformer—without any labeled data or task-specific design. It reformulates diverse vision tasks as reversible "visual sentences," where task type (understanding vs. generation) and input/output modalities (image ↔ video) are controlled solely by reordering instruction and modality tokens. This enables zero-shot transfer across modalities and data sources.

Contribution/Results: Trained exclusively on unlabeled natural videos, the model achieves strong generalization across multimodal reasoning and cross-source tasks—including classification, detection, inpainting, and video extrapolation—demonstrating the feasibility and scalability of pretrained video generative models as unified visual foundation models.
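To make the "visual sentence" idea concrete, here is a minimal sketch of how such a sequence might be assembled. The token types, field names, and `build_visual_sentence` helper are all hypothetical illustrations, not the paper's actual implementation or API; the point is only that the ordering of context and target slots alone specifies the task and the output modality.

```python
from dataclasses import dataclass

# Hypothetical token type for illustration; the paper's real token
# vocabulary and latent representation are not described in this summary.
@dataclass(frozen=True)
class VisualToken:
    kind: str     # "image" or "video"
    role: str     # "context" (input prompt) or "target" (expected output)
    payload: str  # stand-in for actual pixel/latent content

def build_visual_sentence(context, target):
    """Concatenate context tokens and a target slot into one sequence.

    No task flag is passed anywhere: which task to perform and which
    modality to emit is implied entirely by the sequence ordering.
    """
    tokens = [VisualToken(kind, "context", payload) for kind, payload in context]
    tokens.append(VisualToken(target[0], "target", target[1]))
    return tokens

# Example: video extrapolation -- observed frames as context, video as target.
sentence = build_visual_sentence(
    context=[("video", "clip_frames_0..15")],
    target=("video", "<to be generated>"),
)
print([t.role for t in sentence])  # ['context', 'target']
```

A cross-modal task such as image classification would differ only in the kinds and ordering of the tokens, which is what lets one model cover many tasks without architectural changes.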

📝 Abstract
Large language models, trained on extensive corpora, successfully unify diverse linguistic tasks within a single generative framework. Inspired by this, recent works like Large Vision Model (LVM) extend this paradigm to vision by organizing tasks into sequential visual sentences, where visual prompts serve as the context to guide outputs. However, such modeling requires task-specific pre-training across modalities and sources, which is costly and limits scalability to unseen tasks. Given that pre-trained video generation models inherently capture temporal sequence dependencies, we explore a more unified and scalable alternative: can a pre-trained video generation model adapt to diverse image and video tasks? To answer this, we propose UniVid, a framework that fine-tunes a video diffusion transformer to handle various vision tasks without task-specific modifications. Tasks are represented as visual sentences, where the context sequence defines both the task and the expected output modality. We evaluate the generalization of UniVid from two perspectives: (1) cross-modal inference with contexts composed of both images and videos, extending beyond LVM's uni-modal setting; (2) cross-source tasks from natural to annotated data, without multi-source pre-training. Despite being trained solely on natural video data, UniVid generalizes well in both settings. Notably, understanding and generation tasks can easily switch by simply reversing the visual sentence order in this paradigm. These findings highlight the potential of pre-trained video generation models to serve as a scalable and unified foundation for vision modeling. Our code will be released at https://github.com/CUC-MIPG/UniVid.
Problem

Research questions and friction points this paper is trying to address.

Adapting pre-trained video generation models for diverse vision tasks
Overcoming costly task-specific pre-training across different modalities
Enabling cross-modal and cross-source generalization without multi-source training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes a pre-trained video diffusion transformer for diverse vision tasks without task-specific modifications
Represents tasks as visual sentences, where the context sequence defines both the task and the output modality
Generalizes across modalities and data sources despite training solely on natural videos
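The abstract notes that understanding and generation tasks "can easily switch by simply reversing the visual sentence order." A toy sketch of that symmetry, using assumed dict-based tokens rather than anything from the paper's codebase:

```python
def reverse_visual_sentence(sentence):
    """Reverse a visual sentence and flip context/target roles.

    This turns, e.g., image -> label (understanding) into
    label -> image (generation). Illustrative only; the token
    structure here is an assumption, not the paper's API.
    """
    return [
        {**tok, "role": "target" if tok["role"] == "context" else "context"}
        for tok in reversed(sentence)
    ]

# Understanding: photo as context, segmentation map as target.
understanding = [
    {"modality": "image", "role": "context", "payload": "photo"},
    {"modality": "image", "role": "target", "payload": "segmentation map"},
]

# Reversed, the segmentation map becomes the prompt and the photo the output.
generation = reverse_visual_sentence(understanding)
print(generation[0]["payload"])  # 'segmentation map'
```

The same trained model serves both directions, which is the scalability argument the paper makes for video generators as unified vision backbones.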