🤖 AI Summary
This work addresses a limitation of existing single-step diffusion-based video compression methods: they generate per-frame latents independently, neglecting inter-frame temporal dependencies and failing to model spatiotemporal correlations. To overcome this, we introduce, for the first time, a multi-scale temporal reference mechanism into the single-step diffusion compression framework. Our approach fuses multi-scale temporal features, jointly optimizes latent generation and latent coding, and employs a linear Diffusion Transformer (DiT) for efficient denoising. This design significantly improves the compactness of the latent representations and the perceptual quality of the reconstructions. Extensive experiments demonstrate that our method consistently outperforms both traditional and deep learning-based baselines across multiple perceptual metrics—including LPIPS, DISTS, FID, and KID—establishing a new state of the art in perceptual video compression.
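The "linear DiT" mentioned above typically refers to replacing softmax attention with linear attention, which reduces complexity from O(N²) to O(N) in the token count — important when attending over multi-frame latent sequences. Below is a minimal, dependency-free sketch of linear attention with a positive feature map; it is an illustrative toy under assumed dimensions, not YODA's actual implementation.

```python
import math

def phi(x):
    # Positive feature map (ELU(v) + 1), applied elementwise; an assumption,
    # not necessarily the kernel used in the paper.
    return [v + 1.0 if v >= 0 else math.exp(v) for v in x]

def linear_attention(Q, K, V):
    """O(N * d_k * d_v) attention: accumulate phi(K)^T V once, then query it,
    instead of forming the N x N attention matrix."""
    d_k, d_v = len(K[0]), len(V[0])
    # S[a][b] = sum_j phi(k_j)[a] * v_j[b];  z[a] = sum_j phi(k_j)[a]
    S = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d_k):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d_k)) or 1.0
        out.append([sum(fq[a] * S[a][b] for a in range(d_k)) / denom
                    for b in range(d_v)])
    return out

# Tiny example: 3 tokens with 2-dim queries/keys and 2-dim values.
Q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = linear_attention(Q, Q, V)
```

Because the per-key summaries `S` and `z` are computed once and reused for every query, sequence length enters the cost only linearly, which is what makes one-step denoising over long spatiotemporal token sequences tractable.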
📝 Abstract
While one-step diffusion models have recently excelled in perceptual image compression, their application to video remains limited. Prior efforts typically rely on pretrained 2D autoencoders that generate per-frame latent representations independently, thereby neglecting temporal dependencies. We present YODA--Yet Another One-step Diffusion-based Video Compressor--which embeds multi-scale features from temporal references into both latent generation and latent coding to better exploit spatiotemporal correlations for more compact representations, and employs a linear Diffusion Transformer (DiT) for efficient one-step denoising. YODA achieves state-of-the-art perceptual performance, consistently outperforming traditional and deep learning-based baselines on LPIPS, DISTS, FID, and KID. Source code will be publicly available at https://github.com/NJUVISION/YODA.