Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding

📅 2024-09-05
🏛️ Neural Information Processing Systems
📈 Citations: 15
Influential: 2
📄 PDF
🤖 AI Summary
Prior work has not systematically identified the optimal visual encoding strategy for 3D scene understanding, particularly lacking rigorous, task-agnostic comparisons across image-based and emerging multimodal encoders. Method: We introduce the first unified multimodal benchmark framework covering four core 3D vision-language tasks—reasoning, localization, segmentation, and registration—and quantitatively evaluate seven state-of-the-art encoders, including DINOv2, VideoMAE, Point-BERT, and Stable Diffusion feature extractors. Contribution/Results: DINOv2 achieves consistent top-tier performance across all tasks. Video-based encoders yield up to +12.3% gains in localization and segmentation; diffusion-based features reduce registration error by 18.7%; language-pretrained models underperform on language-grounded tasks—challenging prevailing assumptions. All evaluation protocols, metrics, and implementation code are publicly released, establishing an empirical foundation and methodological framework for principled encoder selection in 3D vision-language research.

Technology Category

Application Category

📝 Abstract
Complex 3D scene understanding has gained increasing attention, with scene encoding strategies playing a crucial role in this success. However, the optimal scene encoding strategies for various scenarios remain unclear, particularly compared to their image-based counterparts. To address this issue, we present a comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image-based, video-based, and 3D foundation models. We evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual Grounding, Segmentation, and Registration, each focusing on different aspects of scene understanding. Our evaluations yield key findings: DINOv2 demonstrates superior performance, video models excel in object-level tasks, diffusion models benefit geometric tasks, and language-pretrained models show unexpected limitations in language-related tasks. These insights challenge some conventional understandings, provide novel perspectives on leveraging visual foundation models, and highlight the need for more flexible encoder selection in future vision-language and scene-understanding tasks. Code: https://github.com/YunzeMan/Lexicon3D
Problem

Research questions and friction points this paper is trying to address.

Identifying optimal 3D scene encoding strategies across scenarios
Evaluating vision foundation models for diverse scene understanding tasks
Challenging conventional insights on model performance in language-related tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Probes multiple visual foundation models for 3D scenes
Evaluates seven encoders across four diverse tasks
Identifies DINOv2 and video models as top performers
🔎 Similar Papers
No similar papers found.
Yunze Man
Yunze Man
University of Illinois Urbana-Champaign
RoboticsMachine LearningComputer VisionAutonomous Driving
S
Shuhong Zheng
University of Illinois Urbana-Champaign
Z
Zhipeng Bao
Carnegie Mellon University
M
Martial Hebert
Carnegie Mellon University
Liangyan Gui
Liangyan Gui
University of Illinois Urbana-Champaign
Computer VisionMachine LearningArtificial Intelligence
Y
Yu-Xiong Wang
University of Illinois Urbana-Champaign