TurboVSR: Fantastic Video Upscalers and Where to Find Them

📅 2025-06-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion models achieve state-of-the-art performance in video super-resolution (VSR) but suffer from prohibitively slow inference—e.g., tens of minutes for a 2-second 1080p clip—hindering practical deployment. To address this, we propose the first ultra-fast diffusion-based VSR framework. Our approach introduces: (i) a high-compression (32×32×8) spatiotemporal autoencoder to drastically reduce latent dimensionality; (ii) a factorized conditional mechanism that decouples motion and content modeling; and (iii) an efficient distillation strategy converting pre-trained diffusion models into shortcut generative models, significantly reducing sampling steps and computational complexity. The method preserves SOTA perceptual quality (PSNR, SSIM, LPIPS) while accelerating 2-second 1080p VSR to just 7 seconds—over 100× faster than prior diffusion-based methods. Moreover, it unifies single-image and video super-resolution within a single architecture and enables detail-rich 4K reconstruction.
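The impact of the 32×32×8 autoencoder is easiest to see as token arithmetic. The sketch below compares the latent grid it produces against a more typical 8×8×4 video VAE; the 24 fps frame rate and the baseline compression ratio are illustrative assumptions, not figures from the paper.

```python
import math

def latent_tokens(frames, height, width, ct=8, ch=32, cw=32):
    """Number of latent tokens after spatiotemporal compression.

    ct/ch/cw are the temporal/height/width compression factors
    (32x32x8 by default, as in TurboVSR); ceil handles sizes that
    do not divide evenly.
    """
    return (math.ceil(frames / ct)
            * math.ceil(height / ch)
            * math.ceil(width / cw))

# A 2-second 1080p clip at an assumed 24 fps (48 frames):
tokens = latent_tokens(48, 1080, 1920)              # 6 * 34 * 60 = 12,240
# Illustrative 8x8x4 baseline VAE for comparison:
baseline = latent_tokens(48, 1080, 1920, 4, 8, 8)   # 12 * 135 * 240 = 388,800
print(tokens, baseline, round(baseline / tokens, 1))
```

Roughly a 30× reduction in tokens before the diffusion backbone ever runs, which is where most of the speedup headroom comes from.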

Technology Category

Application Category

📝 Abstract
Diffusion-based generative models have demonstrated exceptional promise in the video super-resolution (VSR) task, achieving a substantial advancement in detail generation relative to prior methods. However, these approaches face significant computational efficiency challenges. For instance, current techniques may require tens of minutes to super-resolve a mere 2-second, 1080p video. In this paper, we present TurboVSR, an ultra-efficient diffusion-based video super-resolution model. Our core design comprises three key aspects: (1) We employ an autoencoder with a high compression ratio of 32×32×8 to reduce the number of tokens. (2) Highly compressed latents pose substantial challenges for training. We introduce factorized conditioning to mitigate the learning complexity: we first learn to super-resolve the initial frame; subsequently, we condition the super-resolution of the remaining frames on the high-resolution initial frame and the low-resolution subsequent frames. (3) We convert the pre-trained diffusion model to a shortcut model to enable fewer sampling steps, further accelerating inference. As a result, TurboVSR performs on par with state-of-the-art VSR methods, while being 100+ times faster, taking only 7 seconds to process a 2-second long 1080p video. TurboVSR also supports image super-resolution by treating an image as a one-frame video. Our efficient design makes SR beyond 1080p possible; results on 4K (3648×2048) image SR show surprisingly fine details.
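The factorized conditioning in aspect (2) is a two-stage control flow. The sketch below captures only that structure: `upscale` is a nearest-neighbor stand-in for the actual diffusion super-resolver, and the joint conditioning on the HR first frame plus LR remaining frames is reduced here to per-frame calls; names and shapes are illustrative assumptions.

```python
import numpy as np

def upscale(frame, s=4):
    """Stand-in for the diffusion super-resolver: nearest-neighbor x4."""
    return frame.repeat(s, axis=0).repeat(s, axis=1)

def factorized_vsr(lr_frames, s=4):
    """Factorized conditioning, structurally:
    (1) super-resolve the initial frame alone;
    (2) super-resolve the remaining frames, which in the real model
        are conditioned jointly on the HR initial frame and their LR
        inputs (here simplified to independent per-frame calls).
    """
    hr_first = upscale(lr_frames[0], s)               # stage 1
    hr_rest = [upscale(f, s) for f in lr_frames[1:]]  # stage 2
    return np.stack([hr_first, *hr_rest])

lr = np.random.rand(8, 32, 32, 3).astype(np.float32)  # 8 LR frames
hr = factorized_vsr(lr)
print(hr.shape)  # (8, 128, 128, 3)
```

Splitting the problem this way lets the model learn appearance from a single frame first, then spend the video stages mostly on propagating that appearance, which is what makes the heavily compressed latents trainable.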
Problem

Research questions and friction points this paper is trying to address.

Improve computational efficiency in video super-resolution
Reduce training complexity for highly compressed latents
Enable faster inference with fewer sampling steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

High-compression autoencoder drastically reduces the latent token count
Factorized conditioning reduces training complexity on compressed latents
Shortcut-model conversion enables sampling in far fewer steps
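The shortcut-model idea behind the last point is that the network is conditioned on the step size d, so the same weights can be queried with a large d to jump through the trajectory in very few steps. The toy sampler below illustrates the mechanism only: the "model" is the exact velocity field of a linear flow toward a fixed target (so any step count lands on it), not TurboVSR's distilled network.

```python
import numpy as np

def shortcut_sample(model, x, steps):
    """Generic shortcut-style sampler: the model sees the step size d,
    so few-step and many-step sampling use the same interface."""
    d = 1.0 / steps
    t = 0.0
    for _ in range(steps):
        x = x + d * model(x, t, d)  # one Euler-style jump of size d
        t += d
    return x

# Toy stand-in model: exact velocity of a linear flow toward `target`;
# max(1 - t, d) guards the division on the final step.
target = np.ones(4)
model = lambda x, t, d: (target - x) / max(1.0 - t, d)

x0 = np.zeros(4)
few = shortcut_sample(model, x0.copy(), steps=2)
many = shortcut_sample(model, x0.copy(), steps=50)
print(np.allclose(few, target), np.allclose(many, target))  # True True
```

In the paper's setting, the payoff is that distilling the pre-trained diffusion model into this form cuts sampling from many denoising steps to a handful, which combines multiplicatively with the token reduction from the autoencoder.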