🤖 AI Summary
This work addresses the challenge of achieving high-fidelity, causal, and real-time video communication under ultra-low-bitrate conditions (<0.0003 bpp), where existing methods struggle to balance these competing requirements. The authors propose a modular diffusion architecture that combines lossy semantic encoding with a stream of extremely low-resolution frames. Causal video generation is enabled through a semantic control mechanism together with dedicated restoration and temporal adapters. An efficient temporal distillation strategy further reduces trainable parameters and training overhead substantially. The proposed method maintains real-time inference while consistently outperforming both conventional and neural baselines in perceptual quality, semantic fidelity, and temporal consistency, establishing a new paradigm for ultra-low-bitrate video communication.
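To make the <0.0003 bpp figure concrete, a quick back-of-the-envelope calculation shows the implied bit budget. The resolution and frame rate below are illustrative assumptions, not values taken from the paper:

```python
# Bit budget at the paper's ultra-low-bitrate bound of 0.0003 bpp.
# 1280x720 @ 30 fps is an assumed example configuration.
width, height, fps = 1280, 720, 30
bpp = 0.0003  # bits per pixel (the paper's upper bound)

bits_per_frame = bpp * width * height        # ~276 bits per frame
bitrate_kbps = bits_per_frame * fps / 1000   # ~8.3 kbps total

print(f"{bits_per_frame:.0f} bits/frame, {bitrate_kbps:.1f} kbps")
```

At a budget of a few hundred bits per frame, conventional per-frame residual coding is infeasible, which motivates transmitting only semantics plus extremely low-resolution frames and regenerating texture at the receiver.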
📝 Abstract
We introduce a video diffusion model for high-fidelity, causal, and real-time video generation under ultra-low-bitrate semantic communication constraints. Our approach uses lossy semantic video coding to transmit the semantic scene structure, complemented by a stream of highly compressed, low-resolution frames that carry enough texture information to preserve fidelity. Building on these inputs, we design a modular video diffusion model comprising Semantic Control, Restoration Adapter, and Temporal Adapter components. We further propose an efficient temporal distillation procedure that extends the model to real-time, causal synthesis, reducing trainable parameters by 300x and training time by 2x while adhering to the communication constraints. Evaluated across diverse datasets, the framework achieves strong perceptual quality, semantic fidelity, and temporal consistency at ultra-low bitrates (<0.0003 bpp), outperforming classical, neural, and generative baselines in extensive quantitative, qualitative, and subjective evaluations.
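A minimal sketch of the modular conditioning described above, assuming a frozen pretrained denoiser augmented with small trainable residual adapters. The module designs, shapes, and the toy backbone are illustrative assumptions; the abstract does not specify the actual layer architectures:

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Small trainable residual branch attached to frozen backbone features."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, dim, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)  # residual form: stays close to identity

class ConditionedDenoiser(nn.Module):
    """Frozen denoiser plus Semantic Control, Restoration, and Temporal adapters."""
    def __init__(self, backbone: nn.Module, dim: int):
        super().__init__()
        self.backbone = backbone.requires_grad_(False)  # backbone stays frozen
        self.semantic_control = nn.Conv2d(dim, dim, 1)  # maps semantic layout to features
        self.restoration = ResidualAdapter(dim)         # lifts texture from low-res frames
        self.temporal = ResidualAdapter(dim)            # conditions on past frames (causal)

    def forward(self, noisy, semantics, lowres_feat, prev_feat):
        h = noisy + self.semantic_control(semantics)    # inject semantic scene structure
        h = self.restoration(h + lowres_feat)           # restore texture detail
        h = self.temporal(h + prev_feat)                # add causal temporal context
        return self.backbone(h)

# Toy usage: a small conv stands in for the pretrained diffusion denoiser (assumption).
dim = 8
model = ConditionedDenoiser(backbone=nn.Conv2d(dim, dim, 3, padding=1), dim=dim)
x = torch.randn(1, dim, 32, 32)
out = model(x, x, x, x)  # all conditioning tensors share the feature shape here
print(out.shape)  # torch.Size([1, 8, 32, 32])
```

In this sketch only the adapters and the semantic control layer are optimized while the backbone is frozen, which is one plausible way to realize the large reduction in trainable parameters the abstract reports.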