🤖 AI Summary
This work addresses the challenge of video compression at ultra-low bitrates, where conventional methods suffer from poor reconstruction quality and blurry outputs, and existing generative approaches struggle to balance temporal consistency against computational efficiency. To this end, the authors propose Diff-SIT, a novel framework that integrates a Sparse Temporal Encoding Module (STEM) to drastically reduce bitrate with a One-Step Video Diffusion with Frame Type Embedder (ODFTE) that jointly reconstructs the intermediate frame sequence, enhancing both perceptual quality and temporal coherence. The approach is the first to combine sparse information transmission with frame-type-guided diffusion, enabling end-to-end generative video compression. Extensive experiments demonstrate that Diff-SIT achieves state-of-the-art perceptual fidelity and temporal consistency across multiple benchmark datasets under ultra-low-bitrate conditions.
📝 Abstract
Video compression aims to maximize reconstruction quality at minimal bitrates. Beyond standard distortion metrics, perceptual quality and temporal consistency are also critical. However, at ultra-low bitrates, traditional end-to-end compression models tend to produce blurry images of poor perceptual quality. Moreover, existing generative compression methods often treat video frames independently and thus show limitations in temporal coherence and efficiency. To address these challenges, we propose Efficient Video Diffusion with Sparse Information Transmission (Diff-SIT), which comprises a Sparse Temporal Encoding Module (STEM) and a One-Step Video Diffusion with Frame Type Embedder (ODFTE). STEM sparsely encodes the original frame sequence into an information-rich intermediate sequence, achieving significant bitrate savings. ODFTE then processes this intermediate sequence as a whole, exploiting the temporal correlation across frames. During this process, our proposed Frame Type Embedder (FTE) guides the diffusion model to adapt its reconstruction to each frame type, optimizing overall quality. Extensive experiments on multiple datasets demonstrate that Diff-SIT establishes a new state of the art in perceptual quality and temporal consistency, particularly in the challenging ultra-low-bitrate regime. Code is released at https://github.com/MingdeZhou/Diff-SIT.
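To make the division of labor concrete, the following toy sketch illustrates the kind of frame-type bookkeeping such a pipeline implies: a few sparsely selected frames are transmitted, while the remaining frames are labeled for generative reconstruction. The keyframe interval, function names, and integer type labels here are illustrative assumptions, not the paper's actual STEM/ODFTE design (which consists of learned networks).

```python
def assign_frame_types(num_frames: int, keyframe_interval: int = 4):
    """Label each frame in a group of pictures:
    0 = sparsely transmitted frame, 1 = frame left for the
    diffusion model to reconstruct (labels are hypothetical)."""
    return [0 if i % keyframe_interval == 0 else 1 for i in range(num_frames)]

def transmitted_indices(frame_types):
    """Only the sparse frames are encoded and sent,
    which is where the bitrate saving comes from."""
    return [i for i, t in enumerate(frame_types) if t == 0]

types = assign_frame_types(9, keyframe_interval=4)
# With this toy interval, frames 0, 4, and 8 would be transmitted
# and the diffusion model would generate the six frames in between,
# conditioned on a per-frame type embedding like `types`.
```

In the actual method, such a type label would be mapped to a learned embedding that conditions the one-step diffusion model, so reconstruction can adapt per frame type; this sketch only shows the indexing logic.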