🤖 AI Summary
Video super-resolution faces the fundamental challenge of jointly preserving fine-grained detail and ensuring inter-frame temporal consistency. To address this, we propose a semantic-spatiotemporal joint guidance framework operating in the latent space of diffusion models. Our method is the first to integrate semantic segmentation priors and explicit spatiotemporal attention mechanisms directly into the diffusion latent space, enabling multi-frame collaborative reconstruction. By jointly modeling semantic structure and inter-frame motion within the latent space, the model maintains geometric alignment with low-resolution inputs while substantially improving both perceptual detail realism and temporal coherence. Extensive experiments on standard benchmarks, including Vid4 and REDS4, demonstrate state-of-the-art performance, with significant gains in PSNR, SSIM, and perceptual quality (LPIPS), especially under challenging scenarios involving rapid motion and complex textures.
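To make the guidance idea concrete, the sketch below shows one way such a block could be structured in PyTorch: per-frame latent tokens first cross-attend to semantic segmentation tokens (semantic guidance), then each spatial location self-attends across frames (temporal guidance). All module names, dimensions, and tensor layouts here are illustrative assumptions; the paper does not publish this exact implementation.

```python
# Minimal sketch of a semantic + spatiotemporal guidance block for a
# latent diffusion VSR backbone. Names and shapes are assumptions for
# illustration, not the authors' actual code.
import torch
import torch.nn as nn


class SemanticSpatioTemporalBlock(nn.Module):
    """(1) Cross-attention from latent tokens to semantic segmentation
    tokens, then (2) self-attention across frames, both in latent space."""

    def __init__(self, dim: int = 320, sem_dim: int = 256, heads: int = 8):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, dim)  # project semantic priors to latent dim
        self.sem_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.tmp_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, z: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        # z:   (B, T, C, H, W) noisy video latents
        # sem: (B, T, N, sem_dim) semantic segmentation tokens per frame
        B, T, C, H, W = z.shape
        tokens = z.flatten(3).transpose(2, 3)  # (B, T, H*W, C)

        # 1) Semantic guidance: each frame's latents attend to its
        #    segmentation-derived tokens (cross-attention).
        q = self.norm1(tokens).reshape(B * T, H * W, C)
        kv = self.sem_proj(sem).reshape(B * T, -1, C)
        sem_out, _ = self.sem_attn(q, kv, kv)
        tokens = tokens + sem_out.reshape(B, T, H * W, C)

        # 2) Temporal guidance: each spatial location attends across
        #    frames (self-attention over the time axis).
        t_tok = self.norm2(tokens).permute(0, 2, 1, 3).reshape(B * H * W, T, C)
        tmp_out, _ = self.tmp_attn(t_tok, t_tok, t_tok)
        tokens = tokens + tmp_out.reshape(B, H * W, T, C).permute(0, 2, 1, 3)

        return tokens.transpose(2, 3).reshape(B, T, C, H, W)


# Toy usage: 2 frames of 32x32 latents, 16 semantic tokens per frame.
block = SemanticSpatioTemporalBlock()
z = torch.randn(1, 2, 320, 32, 32)
sem = torch.randn(1, 2, 16, 256)
print(block(z, sem).shape)  # torch.Size([1, 2, 320, 32, 32])
```

Factoring the two attentions this way keeps the per-frame semantic pass and the cross-frame temporal pass cheap relative to full 3D attention over all latent tokens, which is one plausible reason to operate in latent rather than pixel space.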
📝 Abstract
Recent advances in video super-resolution (VSR) models have demonstrated impressive results in enhancing low-resolution videos. However, because the generation process is difficult to control adequately, achieving high-fidelity alignment with the low-resolution input while maintaining temporal consistency across frames remains a significant challenge. In this work, we propose Semantic and Temporal Guided Video Super-Resolution (SeTe-VSR), a novel approach that incorporates both semantic and spatio-temporal guidance in the latent diffusion space to address these challenges. By injecting high-level semantic information and jointly modeling spatial and temporal cues, our approach strikes a balance between recovering intricate details and ensuring temporal coherence. Our method not only produces highly realistic visual content but also significantly improves fidelity to the input. Extensive experiments demonstrate that SeTe-VSR outperforms existing methods in detail recovery and perceptual quality, highlighting its effectiveness for complex video super-resolution tasks.
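For reference, the fidelity and perceptual-quality metrics cited in the summary above (PSNR, SSIM, LPIPS) are typically computed per frame and averaged over each clip. The sketch below shows a standard evaluation routine using the `lpips` and `scikit-image` packages; these are common community tools, not part of SeTe-VSR itself.

```python
# Per-frame evaluation sketch for fidelity (PSNR, SSIM) and perceptual
# quality (LPIPS), as typically reported on VSR benchmarks.
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # AlexNet-based perceptual distance


def evaluate_frame(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred, gt: float32 HxWx3 images in [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    # LPIPS expects NCHW tensors scaled to [-1, 1]; lower is better.
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None] * 2 - 1
    with torch.no_grad():
        lp = lpips_fn(to_t(pred), to_t(gt)).item()
    return {"psnr": psnr, "ssim": ssim, "lpips": lp}


# Toy usage with random frames; a real benchmark averages these values
# over all frames of every test clip (e.g. on Vid4 and REDS4).
pred = np.random.rand(64, 64, 3).astype(np.float32)
gt = np.random.rand(64, 64, 3).astype(np.float32)
print(evaluate_frame(pred, gt))
```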