🤖 AI Summary
Current 3D large vision-language models typically employ separate 2D and 3D feature encoders, which hinders cross-modal semantic alignment and 3D spatial relationship modeling, leading to incomplete scene representations and low training/inference efficiency. To address this, we propose Inst3D-LMM, a unified instance-aware 3D large multi-modal model with two novel components: a Multi-view Cross-Modal Fusion (MCMF) module and a 3D Instance Spatial Relation (3D-ISR) module. Together they enable joint point cloud–image representation learning and end-to-end multi-task instruction tuning without task-specific adapters, supporting fine-grained 3D instance recognition, spatial relation reasoning, and cross-modal grounding in a single training paradigm. Extensive experiments demonstrate consistent state-of-the-art performance across major 3D understanding benchmarks, with notable gains in complex scene reasoning and grounding tasks.
📝 Abstract
Despite encouraging progress in 3D scene understanding, it remains challenging to develop an effective Large Multi-modal Model (LMM) capable of understanding and reasoning in complex 3D environments. Most previous methods encode 3D point and 2D image features separately, neglecting interactions between 2D semantics and 3D object properties, as well as the spatial relationships within the 3D environment. This limitation not only hinders comprehensive representations of the 3D scene, but also compromises training and inference efficiency. To address these challenges, we propose a unified Instance-aware 3D Large Multi-modal Model (Inst3D-LMM) to handle multiple 3D scene understanding tasks simultaneously. To obtain fine-grained instance-level visual tokens, we first introduce a novel Multi-view Cross-Modal Fusion (MCMF) module that injects multi-view 2D semantics into the corresponding 3D geometric features. For scene-level relation-aware tokens, we further present a 3D Instance Spatial Relation (3D-ISR) module to capture the intricate pairwise spatial relationships among objects. Additionally, we perform end-to-end multi-task instruction tuning without subsequent task-specific fine-tuning. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods across 3D scene understanding, reasoning and grounding tasks. Source code is available at https://github.com/hanxunyu/Inst3D-LMM.
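To make the two components concrete, below is a minimal PyTorch sketch of the ideas the abstract describes: cross-attention that injects per-instance multi-view 2D features into a 3D geometric token (MCMF-style fusion), and pairwise geometric relation features between instance centers as the input to spatial relation modeling (3D-ISR-style). All dimensions, module structure, and function names here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MCMFSketch(nn.Module):
    """Illustrative sketch (not the paper's exact design): each 3D
    instance token cross-attends to its multi-view 2D features."""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, inst3d, views2d):
        # inst3d:  (N, 1, dim) -- one geometric token per instance
        # views2d: (N, V, dim) -- 2D semantics from V views per instance
        fused, _ = self.attn(inst3d, views2d, views2d)
        # Residual fusion yields instance-level visual tokens.
        return self.norm(inst3d + fused)

def pairwise_relations(centers):
    """3D-ISR-style relation input (assumed form): pairwise offsets
    and distances between instance centers, (N, 3) -> (N, N, 4)."""
    offsets = centers[:, None, :] - centers[None, :, :]
    dists = offsets.norm(dim=-1, keepdim=True)
    return torch.cat([offsets, dists], dim=-1)

inst_tokens = MCMFSketch()(torch.randn(8, 1, 256), torch.randn(8, 6, 256))
rel = pairwise_relations(torch.randn(8, 3))
print(inst_tokens.shape, rel.shape)
```

In such a design, the fused instance tokens and the relation-aware scene tokens would then be fed to the LMM for multi-task instruction tuning.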