RealisVSR: Detail-enhanced Diffusion for Real-World 4K Video Super-Resolution

📅 2025-07-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses three key challenges in real-world 4K video super-resolution (VSR): inconsistent temporal modeling, insufficient high-frequency detail recovery, and the absence of high-quality evaluation benchmarks. Methodologically, we propose a detail-enhanced diffusion model featuring: (i) a consistency-preserving ControlNet architecture to strengthen inter-frame temporal modeling; (ii) a high-frequency rectified diffusion loss, coupled with a higher-order moment loss, that combines wavelet decomposition with HOG feature constraints; and (iii) RealisVideo-4K, the first publicly available 4K VSR benchmark. By building on the Wan2.1 video diffusion framework and its spatio-temporal guidance, together with a lightweight training strategy, the method trains with only 5–25% of the data volume used by existing approaches. Extensive experiments demonstrate state-of-the-art performance across multiple mainstream VSR datasets, particularly in 4K upscaling and texture restoration, which supports the practical deployment of real-world video restoration.

📝 Abstract

Video Super-Resolution (VSR) has achieved significant progress through diffusion models, effectively addressing the over-smoothing issues inherent in GAN-based methods. Despite recent advances, three critical challenges persist in the VSR community: 1) inconsistent modeling of temporal dynamics in foundational models; 2) limited high-frequency detail recovery under complex real-world degradations; and 3) insufficient evaluation of detail enhancement and 4K super-resolution, as current methods primarily rely on 720P datasets with inadequate details. To address these challenges, we propose RealisVSR, a high-frequency detail-enhanced video diffusion model with three core innovations: 1) a Consistency Preserved ControlNet (CPC) architecture integrated with the Wan2.1 video diffusion model to capture both smooth and complex motion and suppress artifacts; 2) a High-Frequency Rectified Diffusion Loss (HR-Loss) combining wavelet decomposition and HOG feature constraints for texture restoration; 3) RealisVideo-4K, the first public 4K VSR benchmark, containing 1,000 high-definition video-text pairs. Leveraging the advanced spatio-temporal guidance of Wan2.1, our method requires only 5-25% of the training data volume compared to existing approaches. Extensive experiments on VSR benchmarks (REDS, SPMCS, UDM10, YouTube-HQ, VideoLQ, RealisVideo-720P) demonstrate the superiority of our method, particularly in ultra-high-resolution scenarios.
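
The abstract names wavelet decomposition as one ingredient of HR-Loss but does not give its formula. As a rough illustration of the general idea only, the sketch below penalizes differences between the high-frequency (detail) sub-bands of a predicted and a target frame, using a single-level 2D Haar transform. It is not the paper's HR-Loss: the choice of Haar wavelets, the L1 penalty, the function names `haar_dwt2` and `high_freq_l1`, and applying it to a single grayscale frame are all assumptions, and the HOG constraint is omitted. The code is dependency-free for readability; a real training loss would use a tensor library.

```python
def haar_dwt2(img):
    """Single-level orthonormal 2D Haar transform of an even-sized 2D list.

    Returns (approx, horiz, vert, diag); each sub-band is half the input size.
    """
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    approx, horiz, vert, diag = [], [], [], []
    for i in range(0, h, 2):
        rows = ([], [], [], [])
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]          # top-left, top-right
            c, d = img[i + 1][j], img[i + 1][j + 1]  # bottom-left, bottom-right
            rows[0].append((a + b + c + d) / 2.0)    # low-pass average
            rows[1].append((a - b + c - d) / 2.0)    # left-right differences
            rows[2].append((a + b - c - d) / 2.0)    # top-bottom differences
            rows[3].append((a - b - c + d) / 2.0)    # diagonal differences
        for band, row in zip((approx, horiz, vert, diag), rows):
            band.append(row)
    return approx, horiz, vert, diag


def high_freq_l1(pred, target):
    """Mean absolute error over the three detail sub-bands (assumed penalty)."""
    pred_bands = haar_dwt2(pred)[1:]      # drop the low-pass band
    target_bands = haar_dwt2(target)[1:]
    total, count = 0.0, 0
    for pb, tb in zip(pred_bands, target_bands):
        for p_row, t_row in zip(pb, tb):
            for p, t in zip(p_row, t_row):
                total += abs(p - t)
                count += 1
    return total / count


if __name__ == "__main__":
    flat_a = [[1.0, 1.0], [1.0, 1.0]]
    flat_b = [[3.0, 3.0], [3.0, 3.0]]
    # Flat frames differ only in the low-pass band, so the detail loss is zero.
    print(high_freq_l1(flat_a, flat_b))
```

Because the low-pass band is excluded, a term like this ignores overall brightness offsets and responds only to missing or spurious texture, which is why it would normally be added to a standard reconstruction or diffusion objective rather than used alone.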

Problem

Research questions and friction points this paper is trying to address.

Inconsistent temporal dynamics modeling in video super-resolution

Limited high-frequency detail recovery in real-world degradations

Insufficient evaluation of 4K super-resolution and detail enhancement

Innovation

Methods, ideas, or system contributions that make the work stand out.

Consistency Preserved ControlNet for motion modeling

High-Frequency Rectified Diffusion Loss for texture

RealisVideo-4K benchmark for 4K VSR evaluation
