🤖 AI Summary
Current generative video models suffer from insufficient temporal realism, while mainstream evaluation metrics are largely insensitive to motion modeling. To address this, we propose the first temporal-fidelity assessment framework based on compressed-domain motion vectors (MVs) extracted from H.264/HEVC bitstreams. Leveraging MV statistics, including motion entropy and the spatial structure of MV fields, we quantify discrepancies in dynamic behavior between generated and real videos. We employ KL divergence, Jensen-Shannon divergence, and Wasserstein distance to measure differences in MV distributions, and design MV-RGB fusion mechanisms (channel concatenation, cross-attention, joint embedding, and motion-aware fusion) to strengthen temporal modeling. Evaluated on GenVidBench across eight state-of-the-art generators, our method enables fine-grained assessment, ranking Pika and SVD closest to real motion under entropy-based divergences. When MV features are fused in, ResNet backbones reach up to 97.4% accuracy and I3D reaches 99.0% on binary real-versus-generated classification.
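The motion-entropy statistic mentioned above can be sketched as the Shannon entropy of a joint magnitude/angle histogram of the per-block motion vectors. This is an illustrative sketch only, not the paper's implementation; the function name, bin count, and the assumed `(N, 2)` array of `(dx, dy)` vectors are all hypothetical choices:

```python
import numpy as np

def motion_entropy(mvs, bins=16):
    """Shannon entropy (bits) of the joint (magnitude, angle) MV histogram.

    mvs: hypothetical (N, 2) array of (dx, dy) motion vectors for one clip.
    Low entropy indicates sparse or near-constant motion; high entropy
    indicates diverse, spatially varied motion.
    """
    mag = np.hypot(mvs[:, 0], mvs[:, 1])
    ang = np.arctan2(mvs[:, 1], mvs[:, 0])
    hist, _, _ = np.histogram2d(mag, ang, bins=bins)
    p = hist.ravel() / max(hist.sum(), 1.0)  # normalize to a distribution
    p = p[p > 0]                             # drop empty bins before log
    return float(-(p * np.log2(p)).sum())
```

A clip with a single uniform translation yields entropy 0, while heterogeneous motion spreads mass over many bins and raises the entropy, which is why the statistic separates sparse, piecewise-constant generated flows from real motion.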
📝 Abstract
Temporal realism remains a central weakness of current generative video models, as most evaluation metrics prioritize spatial appearance and offer limited sensitivity to motion. We introduce a scalable, model-agnostic framework that assesses temporal behavior using motion vectors (MVs) extracted directly from compressed video streams. Codec-generated MVs from standards such as H.264 and HEVC provide lightweight, resolution-consistent descriptors of motion dynamics. We quantify realism by computing Kullback-Leibler, Jensen-Shannon, and Wasserstein divergences between MV statistics of real and generated videos. Experiments on the GenVidBench dataset containing videos from eight state-of-the-art generators reveal systematic discrepancies from real motion: entropy-based divergences rank Pika and SVD as closest to real videos, MV-sum statistics favor VC2 and Text2Video-Zero, and CogVideo shows the largest deviations across both measures. Visualizations of MV fields and class-conditional motion heatmaps further reveal center bias, sparse and piecewise constant flows, and grid-like artifacts that frame-level metrics do not capture. Beyond evaluation, we investigate MV-RGB fusion through channel concatenation, cross-attention, joint embedding, and a motion-aware fusion module. Incorporating MVs improves downstream classification across ResNet, I3D, and TSN backbones, with ResNet-18 and ResNet-34 reaching up to 97.4% accuracy and I3D achieving 99.0% accuracy on real-versus-generated discrimination. These findings demonstrate that compressed-domain MVs provide an effective temporal signal for diagnosing motion defects in generative videos and for strengthening temporal reasoning in discriminative models. The implementation is available at: https://github.com/KurbanIntelligenceLab/Motion-Vector-Learning
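The three divergence measures over MV statistics can be sketched in plain NumPy. Everything here is an assumption for illustration (function names, the histogram range, the smoothing constant), not the released code; for 1-D histograms on a shared grid, the Wasserstein-1 distance conveniently reduces to the L1 distance between CDFs:

```python
import numpy as np

def mv_histogram(magnitudes, bins=32, max_mag=16.0):
    """MV-magnitude histogram normalized to a probability distribution.

    A small additive constant smooths empty bins so KL stays finite.
    """
    hist, _ = np.histogram(magnitudes, bins=bins, range=(0.0, max_mag))
    return (hist + 1e-8) / (hist.sum() + bins * 1e-8)

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    """Jensen-Shannon divergence: symmetrized, smoothed KL."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q, bin_width):
    """Wasserstein-1 between histograms on a shared 1-D grid (L1 of CDFs)."""
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * bin_width)
```

For example, magnitudes drawn from a heavy-tailed "real-like" distribution and a near-zero "generated-like" one produce clearly nonzero divergences under all three measures, while identical distributions score zero.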