Generative Neural Video Compression via Video Diffusion Prior

📅 2025-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing perceptual video codecs rely on image-level generative priors and lack explicit temporal modeling, resulting in severe flickering artifacts. To address this, we propose GNVC-VD, the first DiT-based generative neural video compression framework, which introduces native video diffusion priors into the compression paradigm. Methodologically, GNVC-VD employs a video DiT for sequence-level joint denoising, enabling unified spatio-temporal latent representation learning; it further incorporates flow-matching latent refinement and a conditioning adaptor, with refinement initialized from the decoded latents rather than pure noise and the whole pipeline trained end to end. Experiments demonstrate that GNVC-VD achieves state-of-the-art perceptual quality in LPIPS and subjective evaluations, especially at ultra-low bitrates (<0.01 bpp), significantly outperforming both conventional and learned codecs while effectively suppressing flicker and enabling high-fidelity perceptual reconstruction.

📝 Abstract
We present GNVC-VD, the first DiT-based generative neural video compression framework built upon an advanced video generation foundation model, where spatio-temporal latent compression and sequence-level generative refinement are unified within a single codec. Existing perceptual codecs primarily rely on pre-trained image generative priors to restore high-frequency details, but their frame-wise nature lacks temporal modeling and inevitably leads to perceptual flickering. To address this, GNVC-VD introduces a unified flow-matching latent refinement module that leverages a video diffusion transformer to jointly enhance intra- and inter-frame latents through sequence-level denoising, ensuring consistent spatio-temporal details. Instead of denoising from pure Gaussian noise as in video generation, GNVC-VD initializes refinement from decoded spatio-temporal latents and learns a correction term that adapts the diffusion prior to compression-induced degradation. A conditioning adaptor further injects compression-aware cues into intermediate DiT layers, enabling effective artifact removal while maintaining temporal coherence under extreme bitrate constraints. Extensive experiments show that GNVC-VD surpasses both traditional and learned codecs in perceptual quality and significantly reduces the flickering artifacts that persist in prior generative approaches, even below 0.01 bpp, highlighting the promise of integrating video-native generative priors into neural codecs for next-generation perceptual video compression.
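The central mechanism described above, refinement that starts from the decoded latents instead of Gaussian noise and learns a compression-aware correction, maps naturally onto a flow-matching objective. Below is a minimal PyTorch sketch of that idea; the `VideoDiT` stand-in, the `flow_matching_loss` and `refine_latents` names, the linear interpolation path, and the few-step Euler sampler are illustrative assumptions, not the paper's implementation.

```python
import torch

class VideoDiT(torch.nn.Module):
    """Hypothetical stand-in for the paper's video diffusion transformer.

    A real DiT applies attention over spatio-temporal tokens; a tiny MLP is
    enough to illustrate the training and sampling interfaces.
    """
    def __init__(self, dim=16):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim + 1, 128),
            torch.nn.SiLU(),
            torch.nn.Linear(128, dim),
        )

    def forward(self, z, t):
        # z: (B, T, N, C) latent tokens; t: (B, 1, 1, 1) timestep in [0, 1].
        t = t.expand(*z.shape[:-1], 1)        # broadcast t onto every token
        return self.net(torch.cat([z, t], dim=-1))

def flow_matching_loss(model, z_clean, z_decoded):
    """Regress the velocity that transports decoded latents to clean ones.

    Unlike generative flow matching, the source distribution here is the
    codec's decoded latents rather than Gaussian noise, so the learned field
    acts as a compression-aware correction term.
    """
    t = torch.rand(z_clean.shape[0], 1, 1, 1)
    z_t = (1 - t) * z_decoded + t * z_clean   # linear interpolation path
    velocity = z_clean - z_decoded            # d z_t / d t along that path
    return torch.nn.functional.mse_loss(model(z_t, t), velocity)

@torch.no_grad()
def refine_latents(model, z_decoded, steps=4):
    """Few-step Euler integration starting from the decoded latents."""
    z, dt = z_decoded, 1.0 / steps
    for i in range(steps):
        t = torch.full((z.shape[0], 1, 1, 1), i * dt)
        z = z + dt * model(z, t)              # follow the learned correction flow
    return z
```

At decode time, a routine like `refine_latents` would replace the long reverse-diffusion chain of ordinary video generation with a few Euler steps, which is why initializing from decoded latents rather than noise is attractive inside a codec.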
Problem

Research questions and friction points this paper is trying to address.

Develops a generative neural video compression framework built on a video diffusion prior
Addresses flickering artifacts in perceptual codecs by enhancing spatio-temporal consistency
Improves compression quality at low bitrates by integrating video-native generative models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified flow-matching latent refinement via a video diffusion transformer
Adapts the diffusion prior to compression-induced degradation through a learned correction term
Injects compression-aware cues into intermediate DiT layers for artifact removal (see the sketch after this list)
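One plausible form for such a conditioning adaptor is sketched below in PyTorch, assuming a ControlNet-style zero-initialized projection; the class name, the cue features, and the injection points are illustrative assumptions, since the paper's exact design is not reproduced here.

```python
import torch

class ConditioningAdaptor(torch.nn.Module):
    """Sketch of an adaptor adding compression-aware cues to DiT hidden states.

    The zero-initialized projection means training starts from the unmodified
    video diffusion prior, a common trick in adapter-style fine-tuning.
    """
    def __init__(self, cue_dim, hidden_dim):
        super().__init__()
        self.proj = torch.nn.Linear(cue_dim, hidden_dim)
        torch.nn.init.zeros_(self.proj.weight)
        torch.nn.init.zeros_(self.proj.bias)

    def forward(self, hidden, cue):
        # hidden: (B, L, hidden_dim) tokens inside an intermediate DiT block
        # cue:    (B, L, cue_dim) token-aligned features, e.g. projected
        #         decoded latents or bitrate side information
        return hidden + self.proj(cue)

# Usage: attach one adaptor after each selected block of a frozen DiT, so
# only a small parameter set adapts the prior to codec artifacts.
adaptor = ConditioningAdaptor(cue_dim=16, hidden_dim=512)
hidden = torch.randn(2, 8 * 64, 512)   # flattened spatio-temporal tokens
cue = torch.randn(2, 8 * 64, 16)       # hypothetical compression-aware cue
out = adaptor(hidden, cue)             # identical to `hidden` at initialization
```

Keeping the DiT backbone frozen and routing compression information through small zero-initialized adaptors is what lets the correction start from the intact generative prior and specialize gradually to codec degradation.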
Qi Mao
School of Information and Communication Engineering, Communication University of China
Hao Cheng
School of Information and Communication Engineering, Communication University of China
Tinghan Yang
Purdue University
Mobile computing, mobile sensing, and wireless communications
Libiao Jin
School of Information and Communication Engineering, Communication University of China
Siwei Ma
School of Computer Science, Peking University