UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing cascaded video super-resolution (VSR) methods rely solely on text conditioning, limiting fidelity and cross-modal consistency in multimodal generation. To address this, we propose the first unified cascaded framework supporting hybrid conditioning—text, image, and video—built upon a latent video diffusion model. Our approach introduces a multimodal conditional injection mechanism, a stage-wise collaborative training strategy, and a cross-modal data fusion method. It is the first to enable multimodal-guided 4K VSR generation, overcoming the fundamental limitations of single-text-conditioned approaches. Extensive experiments demonstrate significant improvements over state-of-the-art methods in generation quality, temporal coherence, and multimodal alignment. Moreover, our framework seamlessly integrates with mainstream foundation models, enabling high-fidelity 4K video synthesis.

📝 Abstract
Cascaded video super-resolution has emerged as a promising technique for decoupling the computational burden associated with generating high-resolution videos using large foundation models. Existing studies, however, are largely confined to text-to-video tasks and fail to leverage additional generative conditions beyond text, which are crucial for ensuring fidelity in multi-modal video generation. We address this limitation by presenting UniMMVSR, the first unified generative video super-resolution framework to incorporate hybrid-modal conditions, including text, images, and videos. We conduct a comprehensive exploration of condition injection strategies, training schemes, and data mixture techniques within a latent video diffusion model. A key challenge was designing distinct data construction and condition utilization methods to enable the model to precisely utilize all condition types, given their varied correlations with the target video. Our experiments demonstrate that UniMMVSR significantly outperforms existing methods, producing videos with superior detail and a higher degree of conformity to multi-modal conditions. We also validate the feasibility of combining UniMMVSR with a base model to achieve multi-modal guided generation of 4K video, a feat previously unattainable with existing techniques.
Problem

Research questions and friction points this paper is trying to address.

Existing cascaded VSR methods are confined to text conditioning and cannot exploit other generative conditions
Fidelity and cross-modal consistency degrade in multi-modal video generation without image and video guidance
Multi-modal guided 4K video generation was previously unattainable with cascaded techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework integrates hybrid-modal conditions
Explores condition injection strategies and training schemes
Enables multi-modal guided generation of 4K video
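The paper does not publish its injection mechanism in this listing, but the idea of feeding a denoiser hybrid visual conditions can be illustrated with a minimal, hypothetical sketch: channel-concatenate the upsampled low-resolution video latent with whichever optional image/video condition latents are present, zero-filling absent slots so the backbone always sees a fixed input width. All names, shapes, and the concatenation scheme here are illustrative assumptions, not the authors' actual design.

```python
import numpy as np

# Illustrative shapes only; real latent dimensions depend on the VAE/backbone.
T, C, H, W = 8, 4, 64, 64  # frames, latent channels, latent height/width

def inject_conditions(lr_latent, image_latent=None, video_latent=None):
    """Hypothetical hybrid-condition injection: stack the low-resolution
    latent with optional visual condition latents along the channel axis,
    zero-filling absent conditions to keep the input width constant."""
    parts = [lr_latent]
    for cond in (image_latent, video_latent):
        parts.append(cond if cond is not None else np.zeros_like(lr_latent))
    return np.concatenate(parts, axis=1)  # concatenate along channels

lr = np.random.randn(T, C, H, W).astype(np.float32)
img = np.random.randn(T, C, H, W).astype(np.float32)  # e.g. a reference image broadcast over frames

x = inject_conditions(lr, image_latent=img)  # video condition absent
print(x.shape)  # (8, 12, 64, 64): 3 condition slots x 4 channels each
```

Zero-filling missing modalities is one common way to let a single network serve text-, image-, and video-conditioned tasks; the actual framework may instead use learned tokens, cross-attention, or other mechanisms.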
Shian Du
Tsinghua University
Menghan Xia
Huazhong University of Science and Technology
Chang Liu
Tsinghua University
Quande Liu
Kling Team, Kuaishou Technology
Computer Vision, Generative AI, Multimodal
Xintao Wang
Kling Team, Kuaishou Technology
Pengfei Wan
Head of Kling Video Generation Models, Kuaishou Technology
Generative Models, Computer Vision, Multimodal AI, Computer Graphics
Xiangyang Ji
Tsinghua University