Quantitative Video World Model Evaluation for Geometric-Consistency

📅 2026-05-14
🤖 AI Summary
Existing evaluations of video generation models lack objective, quantitative measures of geometric consistency, making it difficult to diagnose shortcomings in 3D structure and physically plausible motion. To address this gap, this work proposes PDI-Bench, a quantitative benchmark framework for assessing geometric consistency. It obtains object-centric observations with tools such as SAM 2, MegaSaM, and CoTracker3, lifts them to 3D via monocular reconstruction, and applies projective-geometry residual analysis to evaluate scale-depth alignment, 3D motion consistency, and structural rigidity. Evaluation on the newly curated, diverse PDI-Dataset shows that the method uncovers geometric failure modes in state-of-the-art models that conventional perceptual metrics do not capture, offering a diagnostic tool for physically plausible video generation and world model research.
📝 Abstract
Generative video models are increasingly studied as implicit world models, yet evaluating whether they produce physically plausible 3D structure and motion remains challenging. Most existing video evaluation pipelines rely heavily on human judgment or learned graders, which can be subjective and weakly diagnostic for geometric failures. We introduce PDI-Bench (Perspective Distortion Index), a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, we obtain object-centric observations via segmentation and point tracking (e.g., SAM 2, MegaSaM, and CoTracker3), lift them to 3D world-space coordinates via monocular reconstruction, and compute a set of projective-geometry residuals capturing three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. To support systematic evaluation, we build PDI-Dataset, covering diverse scenarios designed to stress these geometric constraints. Across state-of-the-art video generators, PDI reveals consistent geometry-specific failure modes that are not captured by common perceptual metrics, and provides a diagnostic signal for progress toward physically grounded video generation and physical world models. Our code and dataset can be found at https://pdi-bench.github.io/.
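To make two of the three failure dimensions concrete, here is a minimal sketch of what such residuals could look like. It is an illustration under stated assumptions, not the formulas used in PDI-Bench: the function names `scale_depth_residual` and `rigidity_residual` are hypothetical, the inputs are assumed to be already-extracted per-frame apparent sizes, depths, and 3D point tracks, and a simple pinhole camera is assumed (apparent size s = f·S/Z, so s·Z should stay constant for a rigid object).

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def scale_depth_residual(pixel_sizes, depths):
    """Hypothetical scale-depth check (not the paper's definition).

    Under a pinhole camera, an object of fixed physical size S at depth Z
    has apparent size s = f * S / Z, so the product s * Z should be constant
    across frames. Returns the coefficient of variation of s * Z
    (0.0 means perfectly consistent).
    """
    products = [s * z for s, z in zip(pixel_sizes, depths)]
    return pstdev(products) / mean(products)


def rigidity_residual(tracks):
    """Hypothetical rigidity check (not the paper's definition).

    tracks: list of frames; each frame is a list of (x, y, z) points, with
    the same point order in every frame. For a rigid object, all pairwise
    3D distances are preserved over time. Returns the mean relative
    deviation of pairwise distances from their first-frame values.
    """
    reference = tracks[0]
    pairs = list(combinations(range(len(reference)), 2))
    ref_dists = [math.dist(reference[i], reference[j]) for i, j in pairs]
    errors = []
    for frame in tracks[1:]:
        for (i, j), d0 in zip(pairs, ref_dists):
            errors.append(abs(math.dist(frame[i], frame[j]) - d0) / d0)
    return mean(errors)


# Toy inputs: an object that halves in apparent size each time depth doubles
# is consistent; one that keeps its size while receding is not.
consistent = scale_depth_residual([100, 50, 25], [2, 4, 8])
inconsistent = scale_depth_residual([100, 100, 100], [2, 4, 8])

# A translated triangle stays rigid; a triangle stretched along x does not.
triangle = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
translated = [(1, 2, 3), (2, 2, 3), (1, 3, 3)]
stretched = [(0, 0, 0), (2, 0, 0), (0, 1, 0)]
rigid_score = rigidity_residual([triangle, translated])
deformed_score = rigidity_residual([triangle, stretched])
```

In this toy setup, a geometrically coherent clip yields residuals near zero, while a clip whose object keeps its on-screen size as it recedes, or deforms between frames, yields clearly positive values. A real pipeline would obtain the sizes, depths, and tracks from the segmentation, tracking, and reconstruction stages described above.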
Problem

Research questions and friction points this paper is trying to address.

geometric consistency
video generation
world model
3D structure
quantitative evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

geometric consistency
video world model
monocular reconstruction
projective geometry
quantitative evaluation