🤖 AI Summary
This work addresses a limitation of existing single-step diffusion-based video compression methods: they generate per-frame latents independently, neglecting inter-frame temporal dependencies and failing to model spatiotemporal correlations. To overcome this, we introduce, for the first time, a multi-scale temporal reference mechanism into the single-step diffusion compression framework. Our approach fuses multi-scale temporal features, jointly optimizes latent generation and latent coding, and employs a linear Diffusion Transformer (DiT) for efficient denoising. This design significantly improves the compactness of the latent representations and the perceptual quality of the reconstructions. Extensive experiments demonstrate that our method consistently outperforms both traditional and deep learning-based baselines across multiple perceptual metrics—including LPIPS, DISTS, FID, and KID—establishing a new state of the art in perceptual video compression.
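The "linear DiT" mentioned above typically refers to replacing softmax attention with linear attention, which reduces complexity from O(N²) to O(N) in the token count — important when attending over multi-frame latent sequences. Below is a minimal, dependency-free sketch of linear attention with a positive feature map; it is an illustrative toy under assumed dimensions, not YODA's actual implementation.

```python
import math

def phi(x):
    # Positive feature map (ELU(v) + 1), applied elementwise; an assumption,
    # not necessarily the kernel used in the paper.
    return [v + 1.0 if v >= 0 else math.exp(v) for v in x]

def linear_attention(Q, K, V):
    """O(N * d_k * d_v) attention: accumulate phi(K)^T V once, then query it,
    instead of forming the N x N attention matrix."""
    d_k, d_v = len(K[0]), len(V[0])
    # S[a][b] = sum_j phi(k_j)[a] * v_j[b];  z[a] = sum_j phi(k_j)[a]
    S = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d_k):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d_k)) or 1.0
        out.append([sum(fq[a] * S[a][b] for a in range(d_k)) / denom
                    for b in range(d_v)])
    return out

# Tiny example: 3 tokens with 2-dim queries/keys and 2-dim values.
Q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = linear_attention(Q, Q, V)
```

Because the per-key summaries `S` and `z` are computed once and reused for every query, sequence length enters the cost only linearly, which is what makes one-step denoising over long spatiotemporal token sequences tractable.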
📝 Abstract
While one-step diffusion models have recently excelled in perceptual image compression, their application to video remains limited. Prior efforts typically rely on pretrained 2D autoencoders that generate per-frame latent representations independently, thereby neglecting temporal dependencies. We present YODA--Yet Another One-step Diffusion-based Video Compressor--which embeds multi-scale features from temporal references into both latent generation and latent coding to better exploit spatiotemporal correlations for more compact representations, and employs a linear Diffusion Transformer (DiT) for efficient one-step denoising. YODA achieves state-of-the-art perceptual performance, consistently outperforming traditional and deep learning-based baselines on LPIPS, DISTS, FID, and KID. Source code will be publicly available at https://github.com/NJUVISION/YODA.