🤖 AI Summary
Existing 3D vision-language datasets are limited to single-view, close-range, and single-object alignment, hindering scene-level understanding involving distant objects, multiple viewpoints, and complex multi-object configurations. To address this, we introduce MV-ScanQA—a novel benchmark requiring cross-view reasoning for 68% of its questions—and TripAlign, a million-scale triplet dataset enabling fine-grained semantic alignment across text, multiple objects, and multiple viewpoints. Building upon 2D vision-language models, we propose LEGO, a cross-modal knowledge transfer framework that facilitates joint 2D–3D–text pretraining. Our approach achieves state-of-the-art performance on MV-ScanQA and demonstrates consistent cross-benchmark improvements on 3D dense captioning and visual question answering tasks. This work establishes the first systematic foundation for fine-grained, multi-object, multi-view 3D scene understanding and significantly advances the frontier of 3D vision-language comprehension.
📝 Abstract
The advancement of 3D vision-language (3D VL) learning is hindered by several limitations in existing 3D VL datasets: they rarely necessitate reasoning beyond a close range of objects in a single viewpoint, and annotations often link instructions to single objects, missing richer contextual alignments between multiple objects. This significantly curtails the development of models capable of deep, multi-view 3D scene understanding over distant objects. To address these challenges, we introduce MV-ScanQA, a novel 3D question answering dataset in which 68% of questions explicitly require integrating information from multiple views (compared to less than 7% in existing datasets), thereby rigorously testing multi-view compositional reasoning. To facilitate the training of models for such demanding scenarios, we present the TripAlign dataset, a large-scale and low-cost 2D-3D-language pre-training corpus containing 1M <2D view, set of 3D objects, text> triplets that explicitly align groups of contextually related objects with text, providing richer, view-grounded multi-object multimodal alignment signals than previous single-object annotations. We further develop LEGO, a baseline method for the multi-view reasoning challenge in MV-ScanQA that transfers knowledge from pre-trained 2D LVLMs to the 3D domain with TripAlign. Empirically, LEGO pre-trained on TripAlign achieves state-of-the-art performance not only on the proposed MV-ScanQA, but also on existing benchmarks for 3D dense captioning and question answering. Datasets and code are available at https://matthewdm0816.github.io/tripalign-mvscanqa.