🤖 AI Summary
Existing generative video coding methods suffer from two key limitations at ultra-low bitrates (ULB): domain specificity (e.g., restriction to faces or human bodies) and inaccurate motion modeling; in particular, text-guided generation fails to capture fine-grained motion, leading to reconstruction artifacts. To address these limitations, we propose Trajectory-Guided Generative Video Coding (T-GVC), the first framework to embed semantic-aware sparse motion trajectories into diffusion models. T-GVC introduces a training-free latent-space trajectory alignment constraint that ensures geometric plausibility and semantic consistency in motion synthesis. By jointly leveraging low-level optical flow tracking and high-level semantic importance sampling, it significantly improves motion fidelity without compromising generative quality. Experiments demonstrate that T-GVC consistently outperforms both conventional codecs and end-to-end learned compression methods at ULB. Notably, its motion control accuracy surpasses state-of-the-art text-guided approaches, validating the efficacy of explicit geometric motion guidance in generative video coding.
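The training-free trajectory alignment described above can be pictured as classifier-guidance-style steering of the diffusion latents. The sketch below is a minimal illustration of that idea, assuming a generic PyTorch denoiser and an L2 alignment loss evaluated at the sparse trajectory locations; the function and argument names are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of training-free trajectory-aligned guidance during
# diffusion sampling. All names (denoiser, traj_targets, guidance_scale)
# are illustrative assumptions, not the paper's API.
import torch


def guided_denoise_step(denoiser, z_t, t, traj_targets, traj_index, guidance_scale=1.0):
    """One denoising step with a trajectory-alignment correction in latent space.

    z_t          : noisy video latent at timestep t, shape (B, C, F, H, W)
    traj_targets : target latent values along the sparse trajectories, (B, C, N)
    traj_index   : (frame, y, x) index tensors of the N sparse trajectory points
    """
    z_t = z_t.detach().requires_grad_(True)

    # Predict the clean latent as the base diffusion model would.
    z0_pred = denoiser(z_t, t)

    # Gather latent features at the sparse trajectory locations and compare
    # them with the trajectory-derived targets (an L2 alignment loss here,
    # chosen only for illustration).
    f, y, x = traj_index
    pred_points = z0_pred[:, :, f, y, x]
    align_loss = torch.nn.functional.mse_loss(pred_points, traj_targets)

    # Training-free guidance: nudge the noisy latent against the gradient of
    # the alignment loss, analogous to classifier guidance.
    grad = torch.autograd.grad(align_loss, z_t)[0]
    z_t_guided = z_t - guidance_scale * grad
    return z_t_guided.detach(), z0_pred.detach()
```

Because the correction only shifts the latent along the gradient of an alignment loss, the base generative prior is left untouched, which is how a training-free constraint of this kind can steer motion without fine-tuning the diffusion model.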
📝 Abstract
Recent advances in video generation techniques have given rise to an emerging paradigm of generative video coding, which aims to achieve semantically accurate reconstructions in Ultra-Low Bitrate (ULB) scenarios by leveraging strong generative priors. However, most existing methods are limited by domain specificity (e.g., facial or human videos) or an excessive dependence on high-level text guidance, which often fails to capture motion details and results in unrealistic reconstructions. To address these challenges, we propose a Trajectory-Guided Generative Video Coding framework (dubbed T-GVC). T-GVC employs a semantic-aware sparse motion sampling pipeline that bridges low-level motion tracking and high-level semantic understanding: pixel-wise motion is extracted as sparse trajectory points according to their semantic importance, which significantly reduces the bitrate while preserving critical temporal semantic information. In addition, we introduce a training-free latent-space guidance mechanism that incorporates trajectory-aligned loss constraints into the diffusion process, ensuring physically plausible motion patterns without sacrificing the inherent capabilities of the generative model. Experimental results demonstrate that our framework outperforms both traditional codecs and state-of-the-art end-to-end video compression methods under ULB conditions. Additional experiments further confirm that our approach achieves more precise motion control than existing text-guided methods, paving the way toward generative video coding guided by geometric motion modeling.
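As a rough picture of the semantic-aware sparse motion sampling step, the sketch below selects a handful of trajectory seed points by weighting dense optical-flow magnitude with a per-pixel semantic importance map. The importance source (e.g., a saliency or segmentation network) and the top-k selection rule are assumptions made for illustration, not the paper's exact pipeline.

```python
# Illustrative sketch of semantic-aware sparse motion sampling: dense optical
# flow is reduced to a few trajectory points, keeping pixels that are both
# fast-moving and semantically important. Names and the selection heuristic
# are hypothetical.
import numpy as np


def sample_sparse_trajectory_points(flow, semantic_importance, num_points=32):
    """Select sparse trajectory seed points from dense optical flow.

    flow                : (H, W, 2) dense flow between consecutive frames
    semantic_importance : (H, W) per-pixel importance in [0, 1]
    returns             : (num_points, 2) array of (y, x) coordinates
    """
    # Low-level cue: motion strength per pixel.
    magnitude = np.linalg.norm(flow, axis=-1)

    # Combine low-level motion with high-level semantic importance.
    score = magnitude * semantic_importance

    # Keep the highest-scoring pixels as sparse trajectory anchors.
    flat_idx = np.argpartition(score.ravel(), -num_points)[-num_points:]
    ys, xs = np.unravel_index(flat_idx, score.shape)
    return np.stack([ys, xs], axis=1)
```

Only the selected coordinates (and their motion over time) would then need to be signaled, which is what keeps the side information sparse enough for ULB operation while still carrying the semantically important motion.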