PatchVSR: Breaking Video Diffusion Resolution Limits with Patch-wise Video Super-Resolution

πŸ“… 2025-09-30
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Current video diffusion models face two key bottlenecks in full-resolution video super-resolution (VSR): prohibitive computational cost from global attention and inherent limitations on output resolution. To address these, we propose PatchVSRβ€”the first framework to integrate pretrained video diffusion priors into a patch-based VSR architecture. Methodologically, we design a dual-stream adapter that jointly encodes local details and global context; introduce position-aware feature injection and multi-patch joint modulation to ensure cross-patch spatiotemporal consistency; and enable efficient 4K video generation using only a 512Γ—512-pretrained diffusion model. Experiments demonstrate that PatchVSR significantly reduces computational overhead while producing high-fidelity, temporally coherent super-resolved videos, achieving state-of-the-art performance across multiple benchmarks. Our core contributions are: (i) the first adaptation of video diffusion models to patch-based VSR, and (ii) a lightweight, scalable conditional modulation architecture for consistent, high-resolution video synthesis.

Technology Category

Application Category

πŸ“ Abstract
Pre-trained video generation models hold great potential for generative video super-resolution (VSR). However, adapting them for full-size VSR, as most existing methods do, suffers from unnecessary intensive full-attention computation and fixed output resolution. To overcome these limitations, we make the first exploration into utilizing video diffusion priors for patch-wise VSR. This is non-trivial because pre-trained video diffusion models are not native for patch-level detail generation. To mitigate this challenge, we propose an innovative approach, called PatchVSR, which integrates a dual-stream adapter for conditional guidance. The patch branch extracts features from input patches to maintain content fidelity while the global branch extracts context features from the resized full video to bridge the generation gap caused by incomplete semantics of patches. Particularly, we also inject the patch's location information into the model to better contextualize patch synthesis within the global video frame. Experiments demonstrate that our method can synthesize high-fidelity, high-resolution details at the patch level. A tailor-made multi-patch joint modulation is proposed to ensure visual consistency across individually enhanced patches. Due to the flexibility of our patch-based paradigm, we can achieve highly competitive 4K VSR based on a 512x512 resolution base model, with extremely high efficiency.
Problem

Research questions and friction points this paper is trying to address.

Overcoming computational limitations in video super-resolution diffusion models
Enhancing patch-level detail generation with incomplete semantic information
Achieving high-resolution video upscaling while maintaining cross-patch visual consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Patch-wise video super-resolution with dual-stream adapter
Injecting location information for contextual patch synthesis
Multi-patch joint modulation ensuring visual consistency
πŸ”Ž Similar Papers
No similar papers found.
Shian Du
Shian Du
Tsinghua University
Video Generation
M
Menghan Xia
Kling Team, Kuaishou Technology
C
Chang Liu
Tsinghua University
X
Xintao Wang
Kling Team, Kuaishou Technology
J
Jing Wang
Beijing Institute of Technology
Pengfei Wan
Pengfei Wan
Head of Kling Video Generation Models, Kuaishou Technology
Generative ModelsComputer VisionMultimodal AIComputer Graphics
D
Di Zhang
Kling Team, Kuaishou Technology
X
Xiangyang Ji
Tsinghua University