π€ AI Summary
This work addresses the challenge of achieving high-quality video compression at extremely low bitrates by efficiently guiding the generative process with compact conditional signals. The proposed ActDiff-VC framework innovatively integrates content-adaptive keyframe active sampling, budget-aware sparse point trajectory modeling, and conditional diffusion-based generation, enabling the synthesis of high-fidelity non-keyframes from only a few transmitted keyframes and sparse trajectories. Evaluated on the UVG and MCL-JCV datasets, the method significantly outperforms existing learning-based and diffusion-based baselines, reducing bitrate by 64.6% at the same NIQE score while simultaneously achieving a 64.6% reduction in KID and a 37.7% improvement in FID, thereby jointly enhancing perceptual quality and compression efficiency.
π Abstract
Diffusion models provide a powerful generative prior for perceptual reconstruction at ultra-low bitrates, but effective video compression requires controlling the generative process using highly compact conditioning signals. In this work, we present ActDiff-VC, a diffusion-based video compression framework for the ultra-low-bitrate regime. Our method partitions videos into variable-length segments, transmits keyframes only when needed, and summarizes temporal dynamics using a compact set of tracked point trajectories. Conditioned on these sparse signals, a conditional diffusion decoder synthesizes the remaining frames, enabling perceptually realistic reconstruction under severe rate constraints. To support this design, we introduce two mechanisms: content-adaptive keyframe selection and budget-aware sparse trajectory selection, which together enable compact yet effective conditioning for generative reconstruction. Experiments on the UVG and MCL-JCV benchmarks show that ActDiff-VC achieves up to 64.6\% bitrate reduction at matched NIQE, improves KID by up to 64.6\% and FID by up to 37.7\% at comparable bitrates against strong learned codecs, and delivers favorable perceptual rate--distortion trade-offs relative to learned and diffusion-based baselines in the ultra-low-bitrate regime.