LiftVSR: Lifting Image Diffusion to Video Super-Resolution via Hybrid Temporal Modeling with Only 4×RTX 4090s

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video super-resolution (VSR) methods often suffer from limited long-term temporal consistency and prohibitively high computational costs (e.g., requiring over 8 NVIDIA A100-80G GPUs), especially when optimizing perceptual quality. To address these limitations, we propose LiftVSR, a lightweight and efficient VSR framework grounded in the image-wise PixArt-α diffusion prior. Our method introduces a hybrid temporal modeling mechanism: Dynamic Temporal Attention (DTA) performs fine-grained inter-frame alignment within short frame segments, while an Attention Memory Cache (AMC) maintains long-term consistency across segments. An asymmetric sampling strategy further stabilizes cache interaction during inference. By combining the diffusion prior, multi-head token-flow attention, and memory caching, LiftVSR achieves state-of-the-art performance on mainstream VSR benchmarks using only 4×RTX 4090 GPUs (reducing GPU demand by >50%), significantly improving both efficiency and deployment feasibility.

📝 Abstract
Diffusion models have significantly advanced video super-resolution (VSR) by enhancing perceptual quality, largely through elaborately designed temporal modeling to ensure inter-frame consistency. However, existing methods usually suffer from limited temporal coherence and prohibitively high computational costs (e.g., typically requiring over 8 NVIDIA A100-80G GPUs), especially for long videos. In this work, we propose LiftVSR, an efficient VSR framework that leverages and elevates the image-wise diffusion prior from PixArt-α, achieving state-of-the-art results using only 4×RTX 4090 GPUs. To balance long-term consistency and efficiency, we introduce a hybrid temporal modeling mechanism that decomposes temporal learning into two complementary components: (i) Dynamic Temporal Attention (DTA) for fine-grained temporal modeling within short frame segments (i.e., low complexity), and (ii) Attention Memory Cache (AMC) for long-term temporal modeling across segments (i.e., consistency). Specifically, DTA identifies multiple token flows across frames within multi-head query and key tokens to warp inter-frame contexts in the value tokens. AMC adaptively aggregates historical segment information via a cache unit, ensuring long-term coherence with minimal overhead. To further stabilize the cache interaction during inference, we introduce an asymmetric sampling strategy that mitigates feature mismatches arising from different diffusion sampling steps. Extensive experiments on several typical VSR benchmarks have demonstrated that LiftVSR achieves impressive performance with significantly lower computational costs.
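The abstract's DTA component, which finds per-head token flows from query/key similarity and warps the value tokens along them, can be sketched roughly as follows. This is a minimal illustration only: the function name, tensor shapes, and the hard argmax used to pick each flow are assumptions for clarity, not the paper's actual implementation.

```python
import numpy as np

def dynamic_temporal_attention(q, k, v):
    """Illustrative DTA sketch.

    q: (heads, tokens, dim) query tokens of the current frame.
    k, v: (heads, tokens, dim) key/value tokens of a reference frame
    in the same short segment.
    """
    # Per-head similarity between current-frame queries and
    # reference-frame keys: (heads, tokens, tokens).
    sim = np.einsum('htd,hsd->hts', q, k)
    # Token flow: for each head and query token, the index of the
    # best-matching reference token (hard argmax for illustration).
    flow = sim.argmax(axis=-1)  # (heads, tokens)
    # Warp the reference value tokens along the discovered flow,
    # giving per-head inter-frame context aligned to the current frame.
    warped = np.take_along_axis(v, flow[:, :, None], axis=1)
    return warped

rng = np.random.default_rng(0)
heads, tokens, dim = 4, 16, 8
q = rng.standard_normal((heads, tokens, dim))
k = rng.standard_normal((heads, tokens, dim))
v = rng.standard_normal((heads, tokens, dim))
out = dynamic_temporal_attention(q, k, v)
print(out.shape)  # (4, 16, 8)
```

Because the flow is computed independently per head, each head can track a different correspondence across frames, which is the "multiple token flows" idea the abstract describes.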
Problem

Research questions and friction points this paper is trying to address.

Enhance video super-resolution with limited computational resources
Improve inter-frame consistency in long videos efficiently
Balance temporal coherence and computational cost in VSR
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages image-wise diffusion prior from PixArt-α
Hybrid temporal modeling with DTA and AMC
Asymmetric sampling stabilizes cache interaction
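The AMC idea of adaptively aggregating historical segment information via a cache unit can be sketched as a running summary updated once per segment. The class name, the fixed-momentum blend, and the feature shape are hypothetical stand-ins; the paper's cache is learned and interacts with attention.

```python
import numpy as np

class AttentionMemoryCache:
    """Illustrative AMC sketch: keep a compact summary of past
    segments and blend each new segment into it, so later segments
    can attend to long-range history at constant memory cost."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum  # assumed fixed; learned in practice
        self.cache = None

    def update(self, segment_feats):
        # segment_feats: (tokens, dim) summary of the current segment.
        if self.cache is None:
            self.cache = segment_feats.copy()
        else:
            # Exponential moving average as a stand-in for the
            # adaptive aggregation described in the abstract.
            self.cache = (self.momentum * self.cache
                          + (1 - self.momentum) * segment_feats)
        return self.cache

rng = np.random.default_rng(0)
amc = AttentionMemoryCache()
for _ in range(3):          # three consecutive segments
    ctx = amc.update(rng.standard_normal((16, 8)))
print(ctx.shape)  # (16, 8)
```

The cache stays a fixed size regardless of video length, which is what lets long-term consistency come with minimal overhead.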
Xijun Wang
University of Science and Technology of China
Xin Li
University of Science and Technology of China
Bingchen Li
University of Science and Technology of China
Zhibo Chen
University of Science and Technology of China