🤖 AI Summary
It remains unclear whether video foundation models (VidFMs), trained solely on unlabeled video data, implicitly acquire a global understanding of 3D scene geometry. Existing work offers no model-agnostic, quantitative benchmark for assessing such implicit 3D awareness.
Method: We propose the first model-agnostic framework that probes frozen VidFM features with lightweight, task-specific readout heads to regress multiple 3D properties (depth, surface normals, and geometric consistency), enabling fair cross-model assessment of 3D competency without fine-tuning.
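The summary does not include reference code, so the following minimal PyTorch sketch only illustrates the general probing setup: a shallow depth readout trained on top of a frozen backbone. The backbone stand-in, feature dimensionality, head architecture, and loss are all illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthReadout(nn.Module):
    """Shallow readout head: maps frozen patch features to per-pixel depth."""
    def __init__(self, feat_dim: int, hidden: int = 256, patch: int = 14):
        super().__init__()
        self.patch = patch
        self.head = nn.Sequential(
            nn.Conv2d(feat_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),  # one channel: predicted depth
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, h, w) patch-level features from the frozen backbone
        depth = self.head(feats)
        # upsample to pixel resolution for dense supervision
        return F.interpolate(depth, scale_factor=self.patch,
                             mode="bilinear", align_corners=False)

# Hypothetical stand-in for a frozen VidFM encoder (a real model would be
# loaded from its checkpoint); here, 14x14 patches with 384-dim features.
backbone = nn.Conv2d(3, 384, kernel_size=14, stride=14)
for p in backbone.parameters():
    p.requires_grad_(False)

head = DepthReadout(feat_dim=384)
opt = torch.optim.AdamW(head.parameters(), lr=1e-4)

frames = torch.randn(2, 3, 224, 224)    # toy batch of video frames
gt_depth = torch.rand(2, 1, 224, 224)   # toy ground-truth depth maps

with torch.no_grad():
    feats = backbone(frames)            # (2, 384, 16, 16), no gradients
loss = F.l1_loss(head(feats), gt_depth)
loss.backward()                         # gradients flow only into the head
opt.step()
```

Because the backbone never receives gradients, whatever depth accuracy the head reaches is attributable to information already present in the frozen features, which is what makes the probe a measure of implicit 3D awareness.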
Contribution/Results: Our experiments show, for the first time, that leading VidFMs, and video generative models in particular, exhibit implicit 3D understanding comparable to, and sometimes exceeding, that of dedicated 3D expert models trained with explicit supervision. We establish a standardized 3D perception benchmark and evaluation protocol, suggesting a scalable path to 3D intelligence that requires no explicit 3D annotations.
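The summary names a standardized evaluation protocol without specifying its metrics. As one plausible instance, depth probes are commonly scored with absolute relative error and delta < 1.25 accuracy after scale alignment; the sketch below assumes exactly that and may differ from the paper's actual protocol.

```python
import torch

def depth_metrics(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-6):
    """Score a depth probe with two standard monocular-depth metrics
    (absolute relative error, delta < 1.25 accuracy). The metric choice and
    the median-scaling step are assumptions, not the paper's protocol."""
    pred, gt = pred.flatten(), gt.flatten()
    valid = gt > eps                             # drop invalid / missing pixels
    pred, gt = pred[valid].clamp(min=eps), gt[valid]
    pred = pred * (gt.median() / pred.median())  # align scale to ground truth
    abs_rel = ((pred - gt).abs() / gt).mean().item()
    ratio = torch.max(pred / gt, gt / pred)
    delta1 = (ratio < 1.25).float().mean().item()
    return {"abs_rel": abs_rel, "delta1": delta1}

# Running the same readout protocol over several frozen backbones and
# comparing these numbers is what makes the evaluation model-agnostic.
print(depth_metrics(torch.rand(1, 1, 224, 224), torch.rand(1, 1, 224, 224)))
```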
📝 Abstract
Videos are continuous 2D projections of 3D worlds. After training on large-scale video data, will global 3D understanding naturally emerge? We study this by quantifying the 3D understanding of existing Video Foundation Models (VidFMs) pretrained on vast video data. We propose the first model-agnostic framework that measures the 3D awareness of various VidFMs by estimating multiple 3D properties from their features via shallow readouts. Our study presents meaningful findings regarding the 3D awareness of VidFMs along multiple axes. In particular, we show that state-of-the-art video generation models exhibit a strong understanding of 3D objects and scenes, despite not being trained on any 3D data. Such understanding can even surpass that of large expert models specifically trained for 3D tasks. Our findings, together with the 3D benchmarking of major VidFMs, provide valuable observations for building scalable 3D models.