🤖 AI Summary
It remains unclear whether video foundation models (VidFMs), trained solely on unlabeled video data, implicitly acquire a global understanding of 3D scene geometry. Existing work offers no model-agnostic, quantitative benchmark for assessing such implicit 3D awareness.
Method: We propose the first model-agnostic framework that probes frozen VidFM features with lightweight, task-specific readout heads to regress multiple 3D properties (depth, surface normals, and geometric consistency), enabling fair cross-model assessment of 3D competency without fine-tuning.
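The summary does not include reference code, so the following minimal PyTorch sketch only illustrates the general probing setup: a shallow depth readout trained on top of a frozen backbone. The backbone stand-in, feature dimensionality, head architecture, and loss are all illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthReadout(nn.Module):
    """Shallow readout head: maps frozen patch features to per-pixel depth."""
    def __init__(self, feat_dim: int, hidden: int = 256, patch: int = 14):
        super().__init__()
        self.patch = patch
        self.head = nn.Sequential(
            nn.Conv2d(feat_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),  # one channel: predicted depth
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, h, w) patch-level features from the frozen backbone
        depth = self.head(feats)
        # upsample to pixel resolution for dense supervision
        return F.interpolate(depth, scale_factor=self.patch,
                             mode="bilinear", align_corners=False)

# Hypothetical stand-in for a frozen VidFM encoder (a real model would be
# loaded from its checkpoint); here, 14x14 patches with 384-dim features.
backbone = nn.Conv2d(3, 384, kernel_size=14, stride=14)
for p in backbone.parameters():
    p.requires_grad_(False)

head = DepthReadout(feat_dim=384)
opt = torch.optim.AdamW(head.parameters(), lr=1e-4)

frames = torch.randn(2, 3, 224, 224)    # toy batch of video frames
gt_depth = torch.rand(2, 1, 224, 224)   # toy ground-truth depth maps

with torch.no_grad():
    feats = backbone(frames)            # (2, 384, 16, 16), no gradients
loss = F.l1_loss(head(feats), gt_depth)
loss.backward()                         # gradients flow only into the head
opt.step()
```

Because the backbone never receives gradients, whatever depth accuracy the head reaches is attributable to information already present in the frozen features, which is what makes the probe a measure of implicit 3D awareness.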
Contribution/Results: Our experiments show, for the first time, that leading VidFMs, and video generative models in particular, exhibit implicit 3D understanding comparable to, and sometimes exceeding, that of dedicated 3D expert models trained with explicit supervision. We establish a standardized 3D perception benchmark and evaluation protocol, suggesting a scalable path to 3D intelligence that requires no explicit 3D annotations.
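The summary names a standardized evaluation protocol without specifying its metrics. As one plausible instance, depth probes are commonly scored with absolute relative error and delta < 1.25 accuracy after scale alignment; the sketch below assumes exactly that and may differ from the paper's actual protocol.

```python
import torch

def depth_metrics(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-6):
    """Score a depth probe with two standard monocular-depth metrics
    (absolute relative error, delta < 1.25 accuracy). The metric choice and
    the median-scaling step are assumptions, not the paper's protocol."""
    pred, gt = pred.flatten(), gt.flatten()
    valid = gt > eps                             # drop invalid / missing pixels
    pred, gt = pred[valid].clamp(min=eps), gt[valid]
    pred = pred * (gt.median() / pred.median())  # align scale to ground truth
    abs_rel = ((pred - gt).abs() / gt).mean().item()
    ratio = torch.max(pred / gt, gt / pred)
    delta1 = (ratio < 1.25).float().mean().item()
    return {"abs_rel": abs_rel, "delta1": delta1}

# Running the same readout protocol over several frozen backbones and
# comparing these numbers is what makes the evaluation model-agnostic.
print(depth_metrics(torch.rand(1, 1, 224, 224), torch.rand(1, 1, 224, 224)))
```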
📝 Abstract
Videos are continuous 2D projections of 3D worlds. After training on large-scale video data, will global 3D understanding naturally emerge? We study this by quantifying the 3D understanding of existing Video Foundation Models (VidFMs) pretrained on vast video data. We propose the first model-agnostic framework that measures the 3D awareness of various VidFMs by estimating multiple 3D properties from their features via shallow readouts. Our study presents meaningful findings regarding the 3D awareness of VidFMs along multiple axes. In particular, we show that state-of-the-art video generation models exhibit a strong understanding of 3D objects and scenes, despite not being trained on any 3D data. Such understanding can even surpass that of large expert models specifically trained for 3D tasks. Our findings, together with the 3D benchmarking of major VidFMs, provide valuable observations for building scalable 3D models.