🤖 AI Summary
This work addresses the challenge of achieving high-quality video compression at extremely low bitrates (<0.05 bpp) by proposing a novel framework that integrates implicit neural representations (INRs) with a pretrained video diffusion model. Instead of conventional keyframes, the method employs INRs as semantic conditioning for the diffusion process and jointly optimizes INR weights alongside lightweight adapters to guide hierarchical reconstruction—from coarse semantic layouts to fine-grained texture details—with minimal parameter overhead. Evaluated on UVG, MCL-JCV, and JVET Class-B datasets, the approach significantly outperforms HEVC, VVC, and existing neural codecs, achieving a BD-LPIPS gain of 0.214 and a BD-FID improvement of 91.14, thereby establishing the first perceptually favorable solution for ultra-low-bitrate video compression.
📝 Abstract
We present a perceptually-driven video compression framework integrating implicit neural representations (INRs) and pre-trained video diffusion models to address the extremely low bitrate regime (<0.05 bpp). Our approach exploits the complementary strengths of INRs, which provide a compact video representation, and diffusion models, which offer rich generative priors learned from large-scale datasets. The INR-based conditioning replaces traditional intra-coded keyframes with bit-efficient neural representations trained to estimate latent features and guide the diffusion process. Our joint optimization of INR weights and parameter-efficient adapters for diffusion models allows the model to learn reliable conditioning signals while encoding video-specific information with minimal parameter overhead. Our experiments on UVG, MCL-JCV, and JVET Class-B benchmarks demonstrate substantial improvements in perceptual metrics (LPIPS, DISTS, and FID) at extremely low bitrates, including improvements on BD-LPIPS up to 0.214 and BD-FID up to 91.14 relative to HEVC, while also outperforming VVC and previous strong state-of-the-art neural and INR-only video codecs. Moreover, our analysis shows that INR-conditioned diffusion-based video compression first composes the scene layout and object identities before refining textural accuracy, exposing the semantic-to-visual hierarchy that enables perceptually faithful compression at extremely low bitrates.