🤖 AI Summary
Current generative video models suffer from “chronometric hallucination”: generated sequences show ambiguous, unstable, and uncontrollable motion speeds, because training videos with very different real-world speeds are forced into standardized frame rates. This work proposes Visual Chronometer, a predictor that estimates the Physical Frames Per Second (PhyFPS) of a video directly from its visual dynamics rather than from unreliable metadata, and uses it to systematically define and quantify PhyFPS misalignment in video generation. The work contributes a controlled temporal resampling training strategy, a deep learning–based PhyFPS prediction model, and two benchmarks, PhyFPS-Bench-Real and PhyFPS-Bench-Gen. Experiments show that mainstream generative models exhibit severe PhyFPS misalignment and temporal instability, and that applying PhyFPS corrections significantly improves the human-perceived naturalness of generated videos.
📝 Abstract
While recent generative video models have achieved remarkable visual realism and are being explored as world models, true physical simulation requires mastering both space and time. Current models can produce visually smooth kinematics, yet they lack a reliable internal motion pulse to ground these motions in a consistent, real-world time scale. This temporal ambiguity stems from the common practice of indiscriminately training on videos with vastly different real-world speeds, forcing them into standardized frame rates. This leads to what we term chronometric hallucination: generated sequences exhibit ambiguous, unstable, and uncontrollable physical motion speeds. To address this, we propose Visual Chronometer, a predictor that recovers the Physical Frames Per Second (PhyFPS) directly from the visual dynamics of an input video. Trained via controlled temporal resampling, our method estimates the true temporal scale implied by the motion itself, bypassing unreliable metadata. To systematically quantify this issue, we establish two benchmarks, PhyFPS-Bench-Real and PhyFPS-Bench-Gen. Our evaluations reveal a harsh reality: state-of-the-art video generators suffer from severe PhyFPS misalignment and temporal instability. Finally, we demonstrate that applying PhyFPS corrections significantly improves the human-perceived naturalness of AI-generated videos. Our project page is https://xiangbogaobarry.github.io/Visual_Chronometer/.
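To make the idea of controlled temporal resampling and PhyFPS correction concrete, here is a minimal sketch. It is not the authors' implementation: the function names, the stride-based subsampling, and the nearest-frame re-timing are all assumptions made for illustration. It also assumes the source clip was captured at a known, true real-time frame rate, so that keeping every `stride`-th frame yields a clip whose PhyFPS label is `source_fps / stride`.

```python
import numpy as np


def resample_clip(frames, source_fps, stride):
    """Hypothetical helper: keep every `stride`-th frame of a clip.

    Assumes `source_fps` is the true capture rate, so the returned
    label source_fps / stride is the PhyFPS of the subsampled clip.
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return frames[::stride], source_fps / stride


def playback_speed_factor(nominal_fps, phy_fps):
    """Apparent motion speed relative to real time when a clip with
    the given PhyFPS is played back at `nominal_fps`."""
    return nominal_fps / phy_fps


def corrected_frame_indices(num_frames, phy_fps, target_fps):
    """Hypothetical correction: nearest-frame indices that re-time a
    clip with estimated `phy_fps` for real-time playback at `target_fps`."""
    duration = num_frames / phy_fps                 # real-world seconds covered
    n_out = max(1, int(round(duration * target_fps)))
    t = np.arange(n_out) / target_fps               # output timestamps in seconds
    idx = np.round(t * phy_fps).astype(int)         # nearest source frame
    return np.minimum(idx, num_frames - 1)


# Example: a 240 fps source subsampled with stride 4 gives a 60 PhyFPS clip.
frames = np.arange(240)                             # stand-in for one second of frames
clip, phy_fps_label = resample_clip(frames, 240.0, 4)
```

Under these assumptions, a clip with an estimated PhyFPS of 12 that is played at 24 fps has `playback_speed_factor(24, 12) == 2.0`, meaning its motion appears twice as fast as real time. A real pipeline would more likely interpolate frames than pick nearest neighbors.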