🤖 AI Summary
Existing Multimodal Large Language Models (MLLMs) are trained predominantly on 2D data, limiting their ability to model intrinsic 3D spatial structure and thus hindering performance on referential localization, captioning, and visual question answering. To address this, we propose Video-3D LLM, a novel paradigm that represents static 3D scenes as dynamic video sequences. Our approach introduces learnable 3D position-aware video embeddings explicitly aligned with real-world spatial coordinates, and a maximum-coverage sampling strategy for efficient cross-frame inference. Key contributions include: (1) the first unified framework reformulating 3D understanding as video understanding; (2) a differentiable, geometry-aware video representation mechanism encoding 3D positional priors; and (3) a computationally efficient frame-sampling strategy balancing coverage and overhead. We achieve state-of-the-art results across five major 3D vision-language benchmarks, namely ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D, demonstrating substantial gains in 3D spatial reasoning and grounding.
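To make the "3D position-aware video embeddings" idea concrete, here is a minimal sketch of one plausible realization: each video-frame patch is back-projected (via depth and camera pose) to a world coordinate, and a sinusoidal encoding of that 3D position is added to the patch's visual feature. The function names, the sinusoidal form, and the per-axis channel split are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def sinusoidal_3d_encoding(coords, dim):
    """Sinusoidal encoding of (x, y, z) world coordinates.

    coords: (N, 3) array of per-patch 3D positions, e.g. obtained by
    back-projecting image patches with depth and camera pose.
    dim: embedding width; must be divisible by 6 so each axis gets
    dim // 3 channels, split evenly between sin and cos.
    """
    assert dim % 6 == 0, "dim must be divisible by 6"
    d = dim // 3  # channels per spatial axis
    # Geometrically spaced frequencies, as in standard transformer encodings.
    freqs = 1.0 / (10000.0 ** (np.arange(d // 2) / (d // 2)))
    parts = []
    for axis in range(3):
        angles = coords[:, axis:axis + 1] * freqs[None, :]  # (N, d/2)
        parts.append(np.sin(angles))
        parts.append(np.cos(angles))
    return np.concatenate(parts, axis=1)  # (N, dim)

def add_3d_position(frame_features, patch_coords):
    """Fuse 3D positional priors into per-patch visual features
    by simple addition (one common fusion choice; an assumption here)."""
    return frame_features + sinusoidal_3d_encoding(
        patch_coords, frame_features.shape[1]
    )
```

Because the encoding is a fixed differentiable function of real-world coordinates, gradients flow through the fused features during training while the positional prior itself stays anchored to scene geometry.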
📝 Abstract
The rapid advancement of Multimodal Large Language Models (MLLMs) has significantly impacted various multimodal tasks. However, these models face challenges in tasks that require spatial understanding within 3D environments. Efforts to enhance MLLMs, such as incorporating point cloud features, have been made, yet a considerable gap remains between the models' learned representations and the inherent complexity of 3D scenes. This discrepancy largely stems from the training of MLLMs on predominantly 2D data, which restricts their effectiveness in comprehending 3D spaces. To address this issue, we propose a novel generalist model, Video-3D LLM, for 3D scene understanding. By treating 3D scenes as dynamic videos and incorporating 3D position encoding into these representations, our Video-3D LLM aligns video representations with real-world spatial contexts more accurately. In addition, we employ a maximum coverage sampling technique to optimize the trade-off between computational cost and performance. Extensive experiments demonstrate that our model achieves state-of-the-art performance on several 3D scene understanding benchmarks, including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
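The maximum coverage sampling mentioned above can be understood through the classic greedy approximation to the maximum-coverage problem: repeatedly pick the frame that reveals the most not-yet-seen scene content until a frame budget is exhausted. The sketch below is an assumed formulation in which each frame's visible region is summarized as a set of voxel IDs; the representation and function names are illustrative, not the paper's implementation.

```python
def max_coverage_sampling(frame_voxels, budget):
    """Greedily select up to `budget` frames maximizing scene coverage.

    frame_voxels: list of sets, each holding the voxel IDs visible
    in one frame (e.g. from back-projected depth maps).
    Returns the indices of the selected frames, in selection order.
    """
    covered = set()
    selected = []
    remaining = set(range(len(frame_voxels)))
    for _ in range(min(budget, len(frame_voxels))):
        # Pick the frame adding the most uncovered voxels.
        best = max(remaining, key=lambda i: len(frame_voxels[i] - covered))
        gain = frame_voxels[best] - covered
        if not gain:
            break  # no frame adds new coverage; stop early
        selected.append(best)
        covered |= gain
        remaining.remove(best)
    return selected
```

The greedy rule gives the standard (1 - 1/e) approximation guarantee for maximum coverage, which is why a small frame budget can still cover most of a static scene and keep inference cost low.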