FVGen: Accelerating Novel-View Synthesis with Adversarial Video Diffusion Distillation

📅 2025-08-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the slow sampling speed and the severe artifacts in unseen regions that video diffusion models (VDMs) exhibit in sparse-view 3D reconstruction, this paper proposes a knowledge distillation framework tailored for efficient view completion. The method distills a multi-step denoising teacher VDM into a lightweight student model that needs only four sampling steps. Its core innovation is a dual-objective distillation strategy that jointly applies GAN-based discriminative supervision and softened reverse KL-divergence minimization, transferring perceptual and structural knowledge while preserving geometric consistency. Experiments on real-world datasets show that the distilled model reduces sampling time by more than 90% without compromising, and often improving, the visual fidelity of the synthesized views. The densely completed views also mitigate reconstruction artifacts, yielding substantial gains in accuracy and efficiency for downstream 3D reconstruction tasks such as neural radiance fields (NeRF).
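
The page does not include the paper's training code; the PyTorch sketch below is only meant to make the dual-objective idea concrete. TinyDenoiser, TinyDiscriminator, softened_reverse_kl, tau, and lam are hypothetical stand-ins, and the "softening" is approximated here by a temperature-scaled squared error rather than the paper's exact loss.

```python
# Illustrative sketch of the dual-objective distillation step; all module
# names, shapes, and loss forms are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Stand-in for a video diffusion denoiser (teacher or student)."""
    def __init__(self, channels=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(channels, 32, 3, padding=1), nn.SiLU(),
            nn.Conv3d(32, channels, 3, padding=1),
        )

    def forward(self, x, t):
        # x: (batch, channels, frames, H, W); t: per-sample noise level.
        return self.net(x + t.view(-1, 1, 1, 1, 1))

class TinyDiscriminator(nn.Module):
    """Stand-in for the GAN discriminator over generated video clips."""
    def __init__(self, channels=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(channels, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, 1),
        )

    def forward(self, x):
        return self.net(x)

def softened_reverse_kl(student_pred, teacher_pred, tau=0.1):
    # Assumption: under an isotropic-Gaussian model of the denoising output,
    # the reverse KL reduces to a temperature-scaled squared error. The
    # paper's exact "softening" may differ.
    return F.mse_loss(student_pred, teacher_pred.detach()) / (2 * tau ** 2)

teacher, student, disc = TinyDenoiser(), TinyDenoiser(), TinyDiscriminator()
opt_g = torch.optim.Adam(student.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

def distillation_step(clean_video, lam=1.0):
    """One training step combining GAN supervision and softened reverse KL."""
    noise = torch.randn_like(clean_video)
    t = torch.rand(clean_video.shape[0])                 # random noise levels
    noisy = clean_video + t.view(-1, 1, 1, 1, 1) * noise

    student_out = student(noisy, t)                      # few-step student prediction
    with torch.no_grad():
        teacher_out = teacher(noisy, t)                  # frozen teacher prediction

    # Discriminator update: real clips vs. the student's outputs.
    d_loss = (F.softplus(-disc(clean_video)).mean()
              + F.softplus(disc(student_out.detach())).mean())
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Student update: adversarial term plus softened reverse KL to the teacher.
    g_loss = (F.softplus(-disc(student_out)).mean()
              + lam * softened_reverse_kl(student_out, teacher_out))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

# Toy usage: a batch of 2 clips with 8 channels, 4 frames, 16x16 pixels.
d_loss, g_loss = distillation_step(torch.randn(2, 8, 4, 16, 16))
```

In this sketch the discriminator pushes the few-step student toward sharp, realistic frames, while the distillation term keeps it anchored to the teacher's predictions, mirroring the dual objective described in the summary.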

📝 Abstract
Recent progress in 3D reconstruction has enabled realistic 3D models from dense image captures, yet challenges persist with sparse views, often leading to artifacts in unseen areas. Recent works leverage Video Diffusion Models (VDMs) to generate dense observations, filling the gaps when only sparse views are available for 3D reconstruction tasks. A significant limitation of these methods is their slow sampling speed when using VDMs. In this paper, we present FVGen, a novel framework that addresses this challenge by enabling fast novel view synthesis using VDMs in as few as four sampling steps. We propose a novel video diffusion model distillation method that distills a multi-step denoising teacher model into a few-step denoising student model using Generative Adversarial Networks (GANs) and softened reverse KL-divergence minimization. Extensive experiments on real-world datasets show that, compared to previous works, our framework generates the same number of novel views with similar (or even better) visual quality while reducing sampling time by more than 90%. FVGen significantly improves time efficiency for downstream reconstruction tasks, particularly when working with sparse input views (more than 2) where pre-trained VDMs need to be run multiple times to achieve better spatial coverage.
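
As a rough illustration of where the reported >90% saving in sampling time comes from, the sketch below runs the same Euler-style denoising loop with 50 steps versus 4. The stand-in denoiser, the linear sigma schedule, and the step counts are assumptions for illustration, not FVGen's actual sampler or schedule.

```python
# Few-step vs. many-step sampling: cost is dominated by the number of
# denoiser evaluations, so a 4-step student loop is roughly an order of
# magnitude cheaper than a 50-step loop. All names and schedules are illustrative.
import torch
import torch.nn as nn

denoiser = nn.Conv3d(8, 8, 3, padding=1)     # stand-in for a video diffusion model

@torch.no_grad()
def sample(model, shape, steps):
    """Plain Euler-style loop over a linear noise schedule."""
    sigmas = torch.linspace(1.0, 0.0, steps + 1)
    x = torch.randn(shape) * sigmas[0]
    for i in range(steps):
        pred = model(x)                                   # predicted clean video
        d = (x - pred) / max(sigmas[i].item(), 1e-5)      # dx/dsigma
        x = x + d * (sigmas[i + 1] - sigmas[i])           # step toward less noise
    return x

video_shape = (1, 8, 4, 16, 16)                           # (batch, C, frames, H, W)
many_step_views = sample(denoiser, video_shape, steps=50) # teacher-like schedule
few_step_views = sample(denoiser, video_shape, steps=4)   # distilled-student schedule
```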
Problem

Research questions and friction points this paper is trying to address.

Slow sampling speed of video diffusion models when generating dense observations for sparse-view 3D reconstruction
Severe artifacts in unseen regions when reconstructing from sparse input views
High cost of running pre-trained VDMs multiple times to achieve sufficient spatial coverage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fast novel view synthesis with VDMs in as few as four sampling steps
GAN-based video diffusion model distillation
Softened reverse KL-divergence minimization (a plausible formulation is sketched after this list)
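
The "softened reverse KL-divergence" named above is not written out on this page. One plausible reading, assuming a temperature-smoothed reverse KL between the student's and teacher's denoising output distributions combined with the adversarial term, is:

```latex
% Illustrative formulation only. p_S and p_T are the student's and teacher's
% denoising output distributions, tau is a softening temperature, and lambda
% weights the distillation term against the adversarial loss.
\mathcal{L}_{\text{student}}
  = \mathcal{L}_{\text{adv}}
  + \lambda \, D_{\mathrm{KL}}\!\left( p_S^{\tau} \,\middle\|\, p_T^{\tau} \right),
\qquad
p^{\tau}(x) \;\propto\; p(x)^{1/\tau}.
```

Placing the student distribution in the first argument is what makes the divergence the reverse (mode-seeking) direction; one common motivation for such softening is that the temperature flattens both distributions, giving a less brittle gradient signal early in training.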