Semantic and Temporal Integration in Latent Diffusion Space for High-Fidelity Video Super-Resolution

📅 2025-08-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video super-resolution faces the fundamental challenge of jointly preserving fine-grained detail fidelity and ensuring inter-frame temporal consistency. To address this, we propose a semantic-spatiotemporal joint guidance framework operating in the latent space of diffusion models. Our method is the first to integrate semantic segmentation priors with explicit spatiotemporal attention mechanisms directly into the diffusion latent space, enabling multi-frame collaborative reconstruction. By jointly modeling semantic structure and inter-frame motion relationships within the latent space, the model maintains geometric alignment with low-resolution inputs while substantially enhancing both perceptual detail realism and temporal coherence. Extensive experiments on standard benchmarks, including Vid4 and REDS4, demonstrate state-of-the-art performance. Notably, the approach achieves significant gains in PSNR, SSIM, and perceptual quality (LPIPS), especially in challenging scenarios involving rapid motion and complex textures.
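To make the core idea concrete, the following is a minimal, illustrative sketch of semantically gated temporal attention over per-frame latents. It is not the paper's implementation: the function names, the scalar `semantic_gate` (a stand-in for segmentation-prior agreement between frames), and the flat latent vectors are all simplifying assumptions for exposition.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def temporal_attention(latents, t, semantic_gate):
    """Fuse latents from all frames into a reconstruction for frame t.

    latents: list of per-frame latent vectors (lists of floats);
             a stand-in for the diffusion latent of each frame.
    t: index of the target frame.
    semantic_gate: per-frame weight in (0, 1] that boosts frames whose
                   semantic content matches frame t (hypothetical
                   simplification of a segmentation prior).
    Returns the fused latent and the attention weights.
    """
    d = len(latents[t])
    q = latents[t]
    scores = []
    for f, k in enumerate(latents):
        # Scaled dot-product similarity between target and frame f,
        # shifted by the (log) semantic gate so semantically matching
        # frames contribute more.
        dot = sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
        scores.append(dot + math.log(max(semantic_gate[f], 1e-8)))
    weights = softmax(scores)
    # Attention-weighted sum of frame latents = temporally fused latent.
    fused = [sum(w * latents[f][i] for f, w in enumerate(weights))
             for i in range(d)]
    return fused, weights
```

In a real model the query/key projections are learned and attention runs per spatial location of the latent feature map; the sketch keeps only the fusion logic that gives neighboring frames a say in reconstructing the target frame.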

📝 Abstract
Recent advancements in video super-resolution (VSR) models have demonstrated impressive results in enhancing low-resolution videos. However, due to limitations in adequately controlling the generation process, achieving high-fidelity alignment with the low-resolution input while maintaining temporal consistency across frames remains a significant challenge. In this work, we propose Semantic and Temporal Guided Video Super-Resolution (SeTe-VSR), a novel approach that incorporates both semantic and spatio-temporal guidance in the latent diffusion space to address these challenges. By incorporating high-level semantic information and integrating spatial and temporal information, our approach achieves a seamless balance between recovering intricate details and ensuring temporal coherence. Our method not only produces highly realistic visual content but also significantly enhances fidelity to the input. Extensive experiments demonstrate that SeTe-VSR outperforms existing methods in terms of detail recovery and perceptual quality, highlighting its effectiveness for complex video super-resolution tasks.
Problem

Research questions and friction points this paper is trying to address.

Achieving high-fidelity alignment with the low-resolution video input
Maintaining temporal consistency across video frames
Balancing detail recovery and temporal coherence in VSR
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic and temporal guidance in diffusion space
Balances detail recovery with temporal coherence
Outperforms existing methods in perceptual quality
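The "guidance in diffusion space" contribution can be pictured with a classifier-free-guidance-style combination of noise predictions, sketched below. This is an assumption about the general mechanism, not the paper's actual formulation: the function name and the weights `w_sem` and `w_temp` are hypothetical.

```python
def guided_noise_estimate(eps_uncond, eps_semantic, eps_temporal,
                          w_sem=1.5, w_temp=1.0):
    """Combine noise predictions from a diffusion denoiser.

    eps_uncond:   unconditional noise prediction for the latent.
    eps_semantic: prediction conditioned on semantic priors.
    eps_temporal: prediction conditioned on neighboring-frame context.
    w_sem, w_temp: illustrative guidance scales (not from the paper).

    The final estimate moves from the unconditional prediction toward
    the semantically and temporally conditioned ones, steering each
    denoising step toward outputs that respect both cues.
    """
    return [e_u + w_sem * (e_s - e_u) + w_temp * (e_t - e_u)
            for e_u, e_s, e_t in zip(eps_uncond, eps_semantic, eps_temporal)]
```

Setting either weight to zero recovers guidance from the other cue alone, which is the usual way such combinations let one trade off detail recovery (semantic cue) against temporal coherence (temporal cue).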
Yiwen Wang
Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai, China
Xinning Chai
Shanghai Jiao Tong University
low-level vision
Yuhong Zhang
Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai, China
Zhengxue Cheng
Assistant Researcher, Shanghai Jiao Tong University
Video and Image Coding, Computer Vision, Image Quality Assessment
Jun Zhao
Tencent, Shanghai, China
Rong Xie
Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai, China
Li Song
Professor of Electronic Engineering, Shanghai Jiao Tong University
Video Coding, Image Processing, Computer Vision