🤖 AI Summary
Latent diffusion models rely solely on latent-space computation, which makes producing high-resolution video output costly. To address this limitation, the paper proposes an efficient cascaded video super-resolution (VSR) framework. Methodologically, it introduces: (1) controllable degradation strategies that emulate the output characteristics of the base diffusion model, thereby bridging semantic generation and detail reconstruction; (2) interleaved temporal units coupled with sparse local attention, which preserve temporal alignment while substantially improving computational efficiency; and (3) training refinements informed by an analysis of timestep sampling and of noise augmentation on low-resolution inputs. Experiments demonstrate state-of-the-art performance across multiple benchmarks, and ablation studies systematically validate the efficacy of each component. The proposed framework establishes a concise, efficient, and scalable baseline for cascaded VSR.
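To make the training-pair construction concrete, below is a minimal sketch of one plausible degradation pipeline: bicubic downsampling plus a randomly sampled amount of Gaussian noise, so the low-resolution inputs look more like imperfect base-model outputs than like clean downsampled frames. The function name `make_training_pair` and the parameters `scale` and `noise_std_range` are illustrative assumptions; the paper's actual controllable degradation strategies may differ.

```python
import torch
import torch.nn.functional as F

def make_training_pair(hr_video, scale=4, noise_std_range=(0.0, 0.1)):
    """Build an (LR, HR) training pair for a cascaded VSR model.

    hr_video: float tensor of shape (T, C, H, W) in [0, 1].
    The degradation below is only a stand-in for the paper's controllable
    degradation strategy, which is designed to mimic base-model outputs.
    """
    t, c, h, w = hr_video.shape
    # Spatially downsample each frame to emulate a low-resolution base-model output.
    lr_video = F.interpolate(hr_video, size=(h // scale, w // scale),
                             mode="bicubic", align_corners=False)
    # Inject a randomly sampled amount of noise so the VSR model sees inputs whose
    # imperfections resemble generated frames rather than merely downsampled ones.
    noise_std = torch.empty(1).uniform_(*noise_std_range).item()
    lr_video = (lr_video + noise_std * torch.randn_like(lr_video)).clamp(0.0, 1.0)
    return lr_video, hr_video
```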
📝 Abstract
Latent diffusion models have emerged as a leading paradigm for efficient video generation. However, as user expectations shift toward higher-resolution outputs, relying solely on latent-space computation becomes inadequate. A promising approach decouples the process into two stages: semantic content generation and detail synthesis. The former employs a computationally intensive base model at lower resolutions, while the latter leverages a lightweight cascaded video super-resolution (VSR) model to achieve high-resolution output. In this work, we focus on key design principles for the latter cascaded VSR models, which remain underexplored. First, we propose two degradation strategies to generate training pairs that better mimic the output characteristics of the base model, ensuring alignment between the VSR model and its upstream generator. Second, we provide critical insights into VSR model behavior through a systematic analysis of (1) timestep sampling strategies and (2) the effects of noise augmentation on low-resolution (LR) inputs. These findings directly inform our architectural and training choices. Finally, we introduce an interleaved temporal unit and sparse local attention to achieve efficient training and inference, drastically reducing computational overhead. Extensive experiments demonstrate the superiority of our framework over existing methods, with ablation studies confirming the efficacy of each design choice. Our work establishes a simple yet effective baseline for cascaded video super-resolution generation and offers practical insights to guide future advances in efficient cascaded synthesis systems.
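As a rough illustration of the sparse-local-attention idea, the sketch below builds a boolean mask that lets each frame attend only to frames within a fixed temporal window. The function name `local_temporal_mask` and the `window` parameter are assumptions for illustration; the paper's sparse local attention and interleaved temporal units may partition and mask frames differently.

```python
import torch

def local_temporal_mask(num_frames, window=4):
    """Boolean mask for sparse local attention over the temporal axis.

    Entry (i, j) is True when frame i may attend to frame j, i.e. |i - j| <= window.
    This is a simplified stand-in for restricting temporal attention to a local
    neighborhood instead of attending over all frames.
    """
    idx = torch.arange(num_frames)
    return (idx[None, :] - idx[:, None]).abs() <= window

# Possible usage with PyTorch's scaled_dot_product_attention, assuming per-frame tokens
# q, k, v of shape (batch, heads, num_frames, head_dim):
# mask = local_temporal_mask(num_frames=16, window=4)
# out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

Restricting attention to a local window makes the temporal attention cost grow linearly with the number of frames rather than quadratically, which is the efficiency motivation the abstract points to.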