Self-supervised ControlNet with Spatio-Temporal Mamba for Real-world Video Super-resolution

📅 2025-06-01
🤖 AI Summary
Diffusion models for video super-resolution (VSR) suffer from severe artifacts and temporal inconsistency due to inherent stochasticity. To address this without paired data, we propose a self-supervised Mamba-enhanced framework. Our method introduces: (1) the first self-supervised ControlNet-guided mechanism for degradation-agnostic feature disentanglement; (2) a 3D Selective Scan-driven Video State-Space Module to model long-range spatiotemporal dependencies; and (3) a three-stage hybrid high-resolution/low-resolution training strategy that jointly optimizes latent diffusion priors. Evaluated on real-world VSR benchmarks, our approach significantly outperforms state-of-the-art methods, achieving substantial gains in PSNR (+1.27 dB) and SSIM (+0.021), while generating videos with superior perceptual quality, enhanced inter-frame coherence, and markedly reduced artifacts.
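For readers unfamiliar with the PSNR figure quoted above: PSNR is conventionally defined from the mean squared error (MSE) as 10·log10(MAX² / MSE). The following is a minimal sketch of that standard definition, assuming 8-bit pixel values supplied as flat sequences. The function name `psnr` and its parameters are illustrative and are not taken from the paper.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences.

    Illustrative helper (not from the paper); assumes flat sequences of numbers.
    """
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)

# A PSNR gain of g dB corresponds to dividing the MSE by 10 ** (g / 10).
mse_ratio_for_gain = 10 ** (1.27 / 10)
```

Under this definition, a gain of 1.27 dB corresponds to dividing the MSE by about 1.34, that is, roughly 25% lower MSE, all else being equal.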

📝 Abstract
Existing diffusion-based video super-resolution (VSR) methods are susceptible to introducing complex degradations and noticeable artifacts into high-resolution videos due to their inherent randomness. In this paper, we propose a noise-robust real-world VSR framework by incorporating self-supervised learning and Mamba into pre-trained latent diffusion models. To ensure content consistency across adjacent frames, we enhance the diffusion model with a global spatio-temporal attention mechanism using the Video State-Space block with a 3D Selective Scan module, which reinforces coherence at an affordable computational cost. To further reduce artifacts in generated details, we introduce a self-supervised ControlNet that leverages HR features as guidance and employs contrastive learning to extract degradation-insensitive features from LR videos. Finally, a three-stage training strategy based on a mixture of HR-LR videos is proposed to stabilize VSR training. The proposed Self-supervised ControlNet with Spatio-Temporal Continuous Mamba based VSR algorithm achieves better perceptual quality than state-of-the-art methods on real-world VSR benchmark datasets, validating the effectiveness of the proposed model design and training strategies.
Problem

Research questions and friction points this paper is trying to address.

Addressing complex degradations in diffusion-based video super-resolution
Enhancing content consistency with spatio-temporal attention mechanisms
Reducing artifacts via self-supervised ControlNet and contrastive learning
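
The abstract says contrastive learning is used to extract degradation-insensitive features from LR videos, but this page does not state the exact loss. As a rough illustration only, the sketch below uses InfoNCE, one common contrastive objective, and assumes that the positive is a feature of the same content under a different degradation and that the negatives are features of other content. All names (`cosine`, `info_nce`, `temperature`) are hypothetical and not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor (an assumed stand-in, not the paper's loss).

    The loss is small when the anchor is closer to the positive than to
    every negative, and large otherwise.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Same content under two degradations (aligned) vs. different content (orthogonal).
loss_aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
loss_misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing a loss of this kind pulls together features of the same content regardless of degradation, which is one way "degradation-insensitive" features can be encouraged.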
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised ControlNet for artifact reduction
Spatio-temporal Mamba with 3D Selective Scan
Three-stage training with HR-LR video mixture
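
This page does not describe the scan paths used by the 3D Selective Scan module. As a hedged illustration of the general idea, selective-scan (Mamba-style) vision models commonly flatten a token grid into several 1D sequences with different traversal orders before feeding a state-space model. The sketch below enumerates four assumed orders over a T×H×W video token grid. The function name `scan_orders` and the specific choice of orders are hypothetical, not the paper's design.

```python
def scan_orders(T, H, W):
    """Return illustrative traversal orders over a T x H x W token grid.

    Each order is a list of (t, h, w) index triples. The four orders here
    are an assumption for illustration, not the paper's actual scan paths.
    """
    # Visit every position of a frame before moving to the next frame.
    spatial_first = [(t, h, w) for t in range(T) for h in range(H) for w in range(W)]
    # Visit every frame at one spatial position before moving to the next position.
    temporal_first = [(t, h, w) for h in range(H) for w in range(W) for t in range(T)]
    return {
        "spatial_first": spatial_first,
        "spatial_first_reversed": spatial_first[::-1],
        "temporal_first": temporal_first,
        "temporal_first_reversed": temporal_first[::-1],
    }

orders = scan_orders(T=2, H=2, W=2)
```

Every order visits each token exactly once; only the neighbourhood seen by the 1D sequence model changes, which is why combining several orders can expose both spatial and temporal context.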
Shijun Shi
Jiangnan University
Jing Xu
University of Science and Technology of China
Lijing Lu
Peking University
Zhihang Li
Kwai Inc
Computer Vision, Generative model, video/image generation, LLM
Kai Hu
Jiangnan University