Evaluating Foundation Models' 3D Understanding Through Multi-View Correspondence Analysis

📅 2025-12-12
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing benchmarks for evaluating the 3D understanding of foundation vision models lack protocols that probe dense features without any fine-tuning. Method: We introduce the first fine-tuning-free benchmark for multi-view correspondence analysis, built on MVImgNet. It defines an image segmentation task over novel viewpoints with a four-level difficulty hierarchy driven by inter-view angular disparity. We extend the Hummingbird framework to 3D multi-view settings, integrating multi-view geometry, dense feature matching, and in-context evaluation. Contribution/Results: Our framework systematically evaluates eight state-of-the-art encoders (e.g., DINO, ViT-G). Experiments reveal that DINO-style models are significantly more robust to large viewpoint shifts than 3D-aware models, which require dedicated multi-view adaptation. We open-source a fully reproducible evaluation toolkit, establishing a new paradigm for assessing the intrinsic 3D competence of pre-trained visual encoders.
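The summary describes a four-level difficulty hierarchy (easy, medium, hard, extreme) driven by inter-view angular disparity. A minimal sketch of such a binning is below; the 30/60/90-degree thresholds and the function name `difficulty_level` are illustrative assumptions, not the paper's actual cut-offs.

```python
import numpy as np

def difficulty_level(key_angles_deg, query_angle_deg):
    """Assign a difficulty bin from key-query angular disparity.

    Hedged sketch: the benchmark grades query views into four levels
    by angular disparity to the key views; the thresholds used here
    are illustrative, not the paper's.
    """
    # Smallest circular angle between the query view and any key view
    diffs = np.abs(np.asarray(key_angles_deg, dtype=float) - query_angle_deg) % 360
    disparity = np.minimum(diffs, 360 - diffs).min()
    # Illustrative four-way split (assumed thresholds)
    for thresh, name in [(30, "easy"), (60, "medium"), (90, "hard")]:
        if disparity <= thresh:
            return name
    return "extreme"
```

Note the circular distance: a query at 350° is only 10° from a key at 0°, so it should land in the easiest bin rather than the hardest.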

๐Ÿ“ Abstract
Benchmarking the 3D spatial understanding of foundation models is essential for real-world applications such as robotics and autonomous driving. Existing evaluations often rely on downstream finetuning with linear heads or task-specific decoders, making it difficult to isolate the intrinsic 3D reasoning ability of pretrained encoders. In this work, we introduce a novel benchmark for in-context 3D scene understanding that requires no finetuning and directly probes the quality of dense visual features. Building on the Hummingbird framework, which evaluates in-context 2D scene understanding, we extend the setup to the 3D Multi-View ImageNet (MVImgNet) dataset. Given a set of images of objects at specific angles (keys), we benchmark the performance of segmenting novel views (queries) and report scores in four difficulty categories (easy, medium, hard, extreme) based on the key-query view contrast. We benchmark eight state-of-the-art foundation models and show DINO-based encoders remain competitive across large viewpoint shifts, while 3D-aware models like VGGT require dedicated multi-view adjustments. Our code is publicly available at https://github.com/ToyeshC/open-hummingbird-3d-eval .
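The in-context evaluation the abstract describes transfers segmentation labels from key views to a query view through dense feature matching, with no finetuning. A minimal sketch of this idea is below, assuming cosine-similarity nearest-neighbour label transfer over patch features; the shapes and the function name `segment_query_by_nn` are illustrative, not the authors' implementation.

```python
import numpy as np

def segment_query_by_nn(key_feats, key_labels, query_feats):
    """Label each query patch with the label of its nearest key patch.

    Hedged sketch of Hummingbird-style in-context segmentation:
    dense encoder features from the key views act as a non-parametric
    memory, and query patches are labelled by cosine-similarity
    nearest neighbour (no finetuning of the encoder).

    key_feats:   (N, D) patch features from the key views
    key_labels:  (N,)   per-patch segmentation labels for key views
    query_feats: (M, D) patch features from the novel (query) view
    """
    # L2-normalise so the dot product equals cosine similarity
    k = key_feats / np.linalg.norm(key_feats, axis=1, keepdims=True)
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sim = q @ k.T                # (M, N) similarity matrix
    nn = sim.argmax(axis=1)      # nearest key patch for each query patch
    return key_labels[nn]        # transferred labels
```

Because the encoder is frozen and the transfer is purely nearest-neighbour, scores under this protocol reflect the quality of the dense features themselves rather than any task-specific head.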
Problem

Research questions and friction points this paper is trying to address.

Benchmarking 3D spatial understanding in foundation models
Evaluating intrinsic 3D reasoning without finetuning
Analyzing multi-view correspondence for real-world applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces a novel benchmark for in-context 3D scene understanding
Extends the Hummingbird framework to 3D using Multi-View ImageNet dataset
Evaluates models by segmenting novel views without any finetuning
Valentina Lilova
University of Amsterdam
Toyesh Chakravorty
University of Amsterdam
Julian I. Bibo
University of Amsterdam
Emma Boccaletti
University of Amsterdam
Brandon Li
University of Amsterdam
Lívia Baxová
University of Amsterdam
Cees G. M. Snoek
Professor of Computer Science, University of Amsterdam
Video Understanding: computer vision, multimodal learning, machine learning, artificial intelligence
Mohammadreza Salehi
University of Amsterdam