InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior

📅 2025-11-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Video inverse problems—critical for streaming, telepresence, and AR/VR—demand both high-fidelity reconstruction and real-time performance. However, existing diffusion-based approaches either introduce temporal artifacts via image-level time regularization or suffer from prohibitively slow iterative sampling, precluding real-time deployment. This paper proposes the first real-time video reconstruction framework grounded in distilled diffusion priors. We introduce a novel distillation paradigm that transforms a bidirectional video diffusion teacher model into a causal autoregressive student model via teacher-guided spatial regularization—requiring no paired data and eliminating iterative inference. Integrated with LeanVAE, our method enables efficient latent-space modeling. Evaluated on an NVIDIA A100 GPU, it achieves >35 FPS—over 100× faster than iterative diffusion baselines—while matching or exceeding state-of-the-art diffusion models in reconstruction quality. Our approach sets new benchmarks in video deblurring and super-resolution.

Technology Category

Application Category

📝 Abstract
Video inverse problems are fundamental to streaming, telepresence, and AR/VR, where high perceptual quality must coexist with tight latency constraints. Diffusion-based priors currently deliver state-of-the-art reconstructions, but existing approaches either adapt image diffusion models with ad hoc temporal regularizers - leading to temporal artifacts - or rely on native video diffusion models whose iterative posterior sampling is far too slow for real-time use. We introduce InstantViR, an amortized inference framework for ultra-fast video reconstruction powered by a pre-trained video diffusion prior. We distill a powerful bidirectional video diffusion model (teacher) into a causal autoregressive student that maps a degraded video directly to its restored version in a single forward pass, inheriting the teacher's strong temporal modeling while completely removing iterative test-time optimization. The distillation is prior-driven: it only requires the teacher diffusion model and known degradation operators, and does not rely on externally paired clean/noisy video data. To further boost throughput, we replace the video-diffusion backbone VAE with a high-efficiency LeanVAE via an innovative teacher-space regularized distillation scheme, enabling low-latency latent-space processing. Across streaming random inpainting, Gaussian deblurring and super-resolution, InstantViR matches or surpasses the reconstruction quality of diffusion-based baselines while running at over 35 FPS on NVIDIA A100 GPUs, achieving up to 100 times speedups over iterative video diffusion solvers. These results show that diffusion-based video reconstruction is compatible with real-time, interactive, editable, streaming scenarios, turning high-quality video restoration into a practical component of modern vision systems.
Problem

Research questions and friction points this paper is trying to address.

Real-time video restoration under strict latency constraints
Eliminating iterative sampling in diffusion-based video reconstruction
Maintaining high perceptual quality while achieving 100x speedup
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distills bidirectional video diffusion into causal autoregressive model
Uses LeanVAE for high-efficiency latent-space processing
Enables single-pass video reconstruction without iterative optimization
🔎 Similar Papers
W
Weimin Bai
Peking University
S
Suzhe Xu
Huaqiao University
Y
Yiwei Ren
Peking University
Jinhua Hao
Jinhua Hao
Kuaishou Technology
Computer VisionGenerative AIFluid Mechanics
M
Ming Sun
Kuaishou Technology
Wenzheng Chen
Wenzheng Chen
Peking University
Computational Photography3D Vision
H
He Sun
Peking University