Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset

📅 2025-08-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing 3D vision-language datasets are limited to single-view, close-range, and single-object alignment, hindering scene-level understanding involving distant objects, multiple viewpoints, and complex multi-object configurations. To address this, we introduce MV-ScanQA—a novel benchmark requiring cross-view reasoning for 68% of its questions—and TripAlign, a million-scale triplet dataset enabling fine-grained semantic alignment across text, multiple objects, and multiple viewpoints. Building upon 2D vision-language models, we propose LEGO, a cross-modal knowledge transfer framework that facilitates joint 2D–3D–text pretraining. Our approach achieves state-of-the-art performance on MV-ScanQA and demonstrates consistent cross-benchmark improvements on 3D dense captioning and visual question answering tasks. This work establishes the first systematic foundation for fine-grained, multi-object, multi-view 3D scene understanding and significantly advances the frontier of 3D vision-language comprehension.

📝 Abstract
The advancement of 3D vision-language (3D VL) learning is hindered by several limitations in existing 3D VL datasets: they rarely necessitate reasoning beyond a close range of objects in a single viewpoint, and annotations often link instructions to single objects, missing richer contextual alignments between multiple objects. This significantly curtails the development of models capable of deep, multi-view 3D scene understanding over distant objects. To address these challenges, we introduce MV-ScanQA, a novel 3D question answering dataset in which 68% of questions explicitly require integrating information from multiple views (compared to less than 7% in existing datasets), thereby rigorously testing multi-view compositional reasoning. To facilitate the training of models for such demanding scenarios, we present the TripAlign dataset, a large-scale, low-cost 2D-3D-language pre-training corpus containing 1M <2D view, set of 3D objects, text> triplets that explicitly align groups of contextually related objects with text, providing richer, view-grounded multi-object multimodal alignment signals than previous single-object annotations. We further develop LEGO, a baseline method for the multi-view reasoning challenge in MV-ScanQA that transfers knowledge from pre-trained 2D LVLMs to the 3D domain with TripAlign. Empirically, LEGO pre-trained on TripAlign achieves state-of-the-art performance not only on the proposed MV-ScanQA but also on existing benchmarks for 3D dense captioning and question answering. Datasets and code are available at https://matthewdm0816.github.io/tripalign-mvscanqa.
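The <2D view, set of 3D objects, text> triplet format described in the abstract could be represented as follows. This is purely an illustrative sketch: the field names and types are assumptions, not the released TripAlign schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TripAlignTriplet:
    """Illustrative sketch of one <2D view, 3D object set, text> record.

    Field names are hypothetical; consult the released dataset for the
    actual schema.
    """

    view_image_path: str       # the 2D view (an image of the scene)
    object_ids: frozenset[str] # contextually related 3D objects in that view
    text: str                  # language aligned to the whole object group


# Example: a caption grounded in two related objects visible in one view.
triplet = TripAlignTriplet(
    view_image_path="scene0000_00/view_012.jpg",
    object_ids=frozenset({"chair_3", "table_1"}),
    text="the chair closest to the dining table",
)
assert len(triplet.object_ids) == 2
```

The key difference from earlier single-object annotations is that `object_ids` is a set, so one text string is aligned to a multi-object configuration rather than a lone instance.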
Problem

Research questions and friction points this paper is trying to address.

Enhancing multi-view reasoning in 3D scene understanding
Addressing limited contextual alignment in 3D vision-language datasets
Improving distant object comprehension in 3D VL models
Innovation

Methods, ideas, or system contributions that make the work stand out.

MV-ScanQA dataset tests multi-view reasoning
TripAlign provides 2D-3D-language pre-training corpus
LEGO transfers 2D LVLM knowledge to 3D
Wentao Mo
Tsinghua University
Trustworthy Artificial Intelligence, Multimodal Learning
Qingchao Chen
Assistant Professor, Peking University
Transfer Learning, Medical Data Analysis, Multi-modal Human Sensing, Radar Systems
Yuxin Peng
Wangxuan Institute of Computer Technology, Peking University
Siyuan Huang
State Key Laboratory of General Artificial Intelligence, BIGAI
Yang Liu
Wangxuan Institute of Computer Technology, State Key Laboratory of General Artificial Intelligence, Peking University