STCDiT: Spatio-Temporally Consistent Diffusion Transformer for High-Quality Video Super-Resolution

📅 2025-11-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses structural distortion and temporal instability when super-resolving degraded videos under complex camera motion. We propose a super-resolution framework that leverages a pre-trained video diffusion model and introduces two key innovations: (1) a motion-aware segment-wise reconstruction strategy that adaptively partitions a video into segments of uniform motion, easing motion modeling; and (2) an anchor-frame latent guidance mechanism that exploits the rich spatial priors in the VAE encoding of each segment's first frame to constrain generation across the entire segment, significantly enhancing structural fidelity and temporal consistency. The framework integrates a motion-aware VAE, segment-wise latent-space processing, and Transformer-based diffusion generation. Extensive evaluations on multiple benchmarks demonstrate state-of-the-art performance, with particular gains in scenarios with large motion and non-rigid deformation, where the method achieves superior robustness and fine-detail recovery compared to existing approaches.
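The motion-aware segment-wise strategy above can be illustrated with a toy sketch. The paper does not specify its partitioning algorithm; the version below is a hypothetical greedy split that uses mean inter-frame pixel difference as a stand-in motion measure (a real system would likely use optical flow), and the names `partition_by_motion`, `jump_ratio`, `min_len`, and `max_len` are all assumptions for illustration.

```python
import numpy as np

def partition_by_motion(frames, jump_ratio=3.0, min_len=4, max_len=16):
    """Greedily split a video into segments of roughly uniform motion.

    Hypothetical sketch (not the paper's algorithm): mean absolute
    inter-frame difference is used as a cheap proxy for motion magnitude.
    frames: array of shape (T, H, W) with values in [0, 1].
    Returns a list of (start, end) index pairs covering [0, T).
    """
    # diffs[i] approximates the motion between frames i and i+1.
    diffs = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))
    segments, start = [], 0
    for t in range(1, len(frames)):
        seg_len = t - start
        if seg_len < min_len:
            continue
        prev = diffs[start:t - 1]
        base = prev.mean() if prev.size else 0.0
        # Cut when the latest motion deviates sharply from the segment's
        # running average, or when the segment reaches its maximum length.
        if seg_len >= max_len or abs(diffs[t - 1] - base) > jump_ratio * (base + 1e-3):
            segments.append((start, t))
            start = t
    segments.append((start, len(frames)))
    return segments
```

For example, a clip that is static for its first half and pans steadily in its second half would be cut at the motion onset, so each resulting segment has uniform motion characteristics, which is the property the reconstruction stage relies on.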

📝 Abstract
We present STCDiT, a video super-resolution framework built upon a pre-trained video diffusion model, aiming to restore structurally faithful and temporally stable videos from degraded inputs, even under complex camera motion. The main challenges lie in maintaining temporal stability during reconstruction and preserving structural fidelity during generation. To address these challenges, we first develop a motion-aware VAE reconstruction method that performs segment-wise reconstruction, with each segment exhibiting uniform motion characteristics, thereby effectively handling videos with complex camera motion. Moreover, we observe that the first-frame latent extracted by the VAE encoder in each segment, termed the anchor-frame latent, remains unaffected by temporal compression and retains richer spatial structural information than subsequent frame latents. We therefore develop an anchor-frame guidance approach that leverages structural information from anchor frames to constrain the generation process and improve the structural fidelity of the generated video. Coupling these two designs enables the video diffusion model to achieve high-quality video super-resolution. Extensive experiments show that STCDiT outperforms state-of-the-art methods in terms of structural fidelity and temporal consistency.
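The anchor-frame guidance idea in the abstract can be sketched as follows. In the paper, the anchor-frame latent conditions the diffusion model's generation; the toy version below only illustrates the underlying intuition by blending the anchor latent's structure into each subsequent frame latent of a segment. The function name `anchor_guided_latents` and the blending scheme are assumptions, not the authors' mechanism.

```python
import numpy as np

def anchor_guided_latents(latents, guidance_weight=0.3):
    """Hypothetical sketch of anchor-frame latent guidance.

    latents: array of shape (T, C, H, W) of per-frame VAE latents, where
    latents[0] is the anchor-frame latent (temporally uncompressed and
    structurally richest). Each later latent is nudged toward the anchor
    by a convex blend; the real method instead conditions a diffusion
    Transformer on the anchor latent.
    """
    anchor = latents[0]
    guided = latents.copy()
    for t in range(1, len(latents)):
        # Convex combination: keep most of the frame's own content while
        # injecting the anchor's spatial structure.
        guided[t] = (1 - guidance_weight) * latents[t] + guidance_weight * anchor
    return guided
```

The blend leaves the anchor latent itself untouched and pulls every other latent toward it, which is one simple way to see why anchoring can improve both structural fidelity (shared spatial prior) and temporal consistency (all frames reference the same structure).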
Problem

Research questions and friction points this paper is trying to address.

Maintain temporal stability during video reconstruction
Preserve structural fidelity during video generation
Handle complex camera motions in video super-resolution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Motion-aware VAE reconstruction for uniform motion segments
Anchor-frame guidance leveraging structural information
Coupling designs for enhanced video super-resolution