🤖 AI Summary
At low bitrates, conventional and neural video codecs (NVCs) struggle to deliver both high perceptual quality and efficient generation: adversarial and perceptual-loss-based methods often introduce artifacts, while diffusion-based approaches that rely on pre-trained models suffer from prohibitive sampling overhead. To address this, the paper proposes S2VC, a single-step diffusion-based video coding framework. S2VC couples a conditional coding framework with a single-step diffusion U-Net and introduces joint semantic-temporal guidance: contextual semantic features extracted from buffered features replace text prompts to provide fine-grained control over reconstruction, while temporal consistency modeling enforces inter-frame coherence. Experiments show that S2VC achieves an average 52.73% bitrate reduction over prior perceptual codecs on mainstream benchmarks, attains state-of-the-art perceptual quality (lower LPIPS), and cuts diffusion sampling to a single step, substantially improving compression efficiency.
📝 Abstract
While traditional and neural video codecs (NVCs) have achieved remarkable rate-distortion performance, improving perceptual quality at low bitrates remains challenging. Some NVCs incorporate perceptual or adversarial objectives but still suffer from artifacts due to limited generation capacity, whereas others leverage pretrained diffusion models to improve quality at the cost of heavy sampling complexity. To overcome these challenges, we propose S2VC, a Single-Step diffusion-based Video Codec that integrates a conditional coding framework with an efficient single-step diffusion generator, enabling realistic reconstruction at low bitrates with reduced sampling cost. Recognizing the importance of semantic conditioning in single-step diffusion, we introduce Contextual Semantic Guidance to extract frame-adaptive semantics from buffered features. It replaces text captions with efficient, fine-grained conditioning, thereby improving generation realism. In addition, Temporal Consistency Guidance is incorporated into the diffusion U-Net to enforce temporal coherence across frames and ensure stable generation. Extensive experiments show that S2VC delivers state-of-the-art perceptual quality with an average 52.73% bitrate saving over prior perceptual methods, underscoring the promise of single-step diffusion for efficient, high-quality video compression.
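Averaged bitrate savings like the 52.73% reported here are conventionally measured as Bjøntegaard delta rate (BD-rate): fit each codec's rate-quality curve, then average the log-rate gap over the overlapping quality range. The sketch below is the standard BD-rate recipe with cubic interpolation, not the paper's exact evaluation script; the function name and the choice of quality metric are assumptions.

```python
import numpy as np

def bd_rate(rate_anchor, quality_anchor, rate_test, quality_test):
    """Bjontegaard delta rate: average percent bitrate change of the test
    codec vs. the anchor at equal quality. Negative values mean savings."""
    lr_a = np.log(np.asarray(rate_anchor, dtype=float))
    lr_t = np.log(np.asarray(rate_test, dtype=float))
    q_a = np.asarray(quality_anchor, dtype=float)
    q_t = np.asarray(quality_test, dtype=float)

    # Fit cubic polynomials of log-rate as a function of quality.
    p_a = np.polyfit(q_a, lr_a, 3)
    p_t = np.polyfit(q_t, lr_t, 3)

    # Integrate both fits over the overlapping quality interval.
    lo, hi = max(q_a.min(), q_t.min()), min(q_a.max(), q_t.max())
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)

    # Mean log-rate difference, converted back to a percentage.
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0
```

A test codec reaching the same quality points at half the anchor's bitrate yields a BD-rate of about -50%, i.e. a 50% average bitrate saving.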