🤖 AI Summary
This work addresses the challenge of evaluating the authenticity of generated videos without human annotations or reference videos, which existing methods often require. To this end, we propose 3DSPA, a no-reference spatiotemporal autoencoder that, for the first time, integrates 3D point trajectory modeling, depth cues, and DINO semantic features to detect motion artifacts and violations of physical plausibility. Because it operates without any ground-truth reference video, 3DSPA can identify temporally inconsistent or physically implausible dynamics. Experiments demonstrate that 3DSPA significantly outperforms current approaches across multiple datasets, and its assessments align strongly with human judgments of video realism.
📝 Abstract
AI video generation is evolving rapidly. For video generators to be useful in applications ranging from robotics to film-making, they must consistently produce realistic videos. However, evaluating the realism of generated videos remains a largely manual process, requiring human annotation or bespoke evaluation datasets with restricted scope. Here we develop an automated evaluation framework for video realism that captures both semantics and coherent 3D structure, and that does not require access to a reference video. Our method, 3DSPA, is a 3D spatiotemporal point autoencoder that integrates 3D point trajectories, depth cues, and DINO semantic features into a unified representation for video evaluation. 3DSPA models both how objects move and what is happening in the scene, enabling robust assessments of realism, temporal consistency, and physical plausibility. Experiments show that 3DSPA reliably identifies videos that violate physical laws, is more sensitive to motion artifacts than existing approaches, and aligns more closely with human judgments of video quality and realism across multiple datasets. Our results demonstrate that enriching trajectory-based representations with 3D semantics offers a stronger foundation for benchmarking generative video models and implicitly captures physical-rule violations. The code and pretrained model weights will be available at https://github.com/TheProParadox/3dspa_code.
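To make the no-reference idea concrete, the sketch below illustrates the general pattern the abstract describes: fuse per-point 3D trajectories, depth cues, and semantic features into one representation, then score a video by how well an autoencoder reconstructs its own features. This is a minimal, untrained toy in NumPy under assumed shapes and random stand-in data; the dimensions, the linear bottleneck, and the feature fusion are hypothetical and are not the paper's actual 3DSPA architecture or its real DINO features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-point inputs for T frames and N tracked points
# (random stand-ins; real 3DSPA inputs are not specified in the abstract):
T, N = 16, 32
trajectories = rng.normal(size=(T, N, 3))   # 3D point trajectories (x, y, z)
depth = rng.normal(size=(T, N, 1))          # depth cue per point
semantics = rng.normal(size=(T, N, 8))      # stand-in for DINO semantic features

# Fuse the three cues into one per-point spatiotemporal feature vector.
features = np.concatenate([trajectories, depth, semantics], axis=-1)
x = features.reshape(T * N, -1)             # (T*N, 12) flattened feature matrix

# Untrained linear autoencoder with a bottleneck, for illustration only.
d, k = x.shape[1], 4
W_enc = rng.normal(scale=0.1, size=(d, k))  # encoder weights
W_dec = rng.normal(scale=0.1, size=(k, d))  # decoder weights
recon = (x @ W_enc) @ W_dec                 # reconstruct features from the bottleneck

# No-reference realism score: reconstruction error of the video's own
# features (higher error -> less typical, plausibly less realistic motion).
score = float(np.mean((x - recon) ** 2))
print(score)
```

In a trained version, the autoencoder would be fit on real videos so that physically implausible or temporally inconsistent dynamics reconstruct poorly, which is the intuition behind using reconstruction error as a no-reference signal.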