HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model

📅 2025-11-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing vision-language model (VLM)-based methods for 3D scene understanding rely on implicit feature alignment and are hampered by the scarcity of 3D data and the difficulty of modeling spatial relationships, which leads to suboptimal performance. To address these limitations, this work proposes a hierarchical multimodal representation framework. First, it aligns explicitly with the VLM at the input space through coordinate-guided textual descriptions combined with multi-view imagery, specifically a top-down view plus four directional views (forward, left, right, and backward). Second, it introduces a three-level feature aggregation mechanism, from patch to view to scene, coupled with spatially aware cross-modal alignment. Evaluated on situated and general 3D visual question answering benchmarks, the method achieves significant improvements over state-of-the-art approaches, capturing both local and global contextual cues and modeling complex spatial relations for more accurate and robust 3D scene understanding.
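
A minimal sketch of the explicit input-space alignment described above, assuming a generic object detector output; this is not the authors' code. Detected objects are verbalized with their 3D coordinates, and the scene is covered by a top-down view plus four directional views. `DetectedObject`, `describe_scene`, and `VIEW_NAMES` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# The five camera views fed to the VLM alongside the text description.
VIEW_NAMES = ["top-down", "forward", "left", "right", "backward"]

@dataclass
class DetectedObject:
    label: str                           # e.g. "chair"
    center: Tuple[float, float, float]   # (x, y, z) centroid in scene coordinates

def describe_scene(objects: List[DetectedObject]) -> str:
    """Build a coordinate-guided textual description of the detected objects."""
    parts = [
        f"A {o.label} is located at "
        f"({o.center[0]:.2f}, {o.center[1]:.2f}, {o.center[2]:.2f})."
        for o in objects
    ]
    return " ".join(parts)

# Example with hypothetical detections:
objects = [
    DetectedObject("chair", (1.20, 0.40, 0.00)),
    DetectedObject("table", (0.85, 1.10, 0.00)),
]
print(describe_scene(objects))
# -> "A chair is located at (1.20, 0.40, 0.00). A table is located at (0.85, 1.10, 0.00)."
```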

📝 Abstract
Recent advances in large vision-language models (VLMs) have shown significant promise for 3D scene understanding. Existing VLM-based approaches typically align 3D scene features with the VLM's embedding space. However, this implicit alignment often yields suboptimal performance due to the scarcity of 3D data and the inherent complexity of spatial relationships in 3D environments. To address these limitations, we propose a novel hierarchical multimodal representation for 3D scene reasoning that explicitly aligns with VLMs at the input space by leveraging both multi-view images and text descriptions. The text descriptions capture spatial relationships by referencing the 3D coordinates of detected objects, while the multi-view images include a top-down perspective and four directional views (forward, left, right, and backward), ensuring comprehensive scene coverage. Additionally, we introduce a hierarchical feature representation that aggregates patch-level image features into view-level and scene-level representations, enabling the model to reason over both local and global scene context. Experimental results on both situated 3D Q&A and general 3D Q&A benchmarks demonstrate the effectiveness of our approach.
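
As a rough illustration of the hierarchical representation described in the abstract, the sketch below pools patch-level features into view-level and then scene-level summaries. Mean pooling and the tensor shapes are assumptions made for clarity; the paper's actual aggregation operators may be learned (e.g., attention-based) rather than simple averages.

```python
import torch

def aggregate_hierarchy(patch_feats: torch.Tensor):
    """
    patch_feats: [num_views, num_patches, dim] patch-level features,
                 e.g. from an image encoder applied to each of the 5 views.
    Returns view-level [num_views, dim] and scene-level [dim] features.
    """
    view_feats = patch_feats.mean(dim=1)   # pool patches within each view
    scene_feat = view_feats.mean(dim=0)    # pool views into a scene summary
    return view_feats, scene_feat

# Example with random features: 5 views, 196 patches per view, 768-dim embeddings.
patches = torch.randn(5, 196, 768)
views, scene = aggregate_hierarchy(patches)
print(views.shape, scene.shape)  # torch.Size([5, 768]) torch.Size([768])
```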
Problem

Research questions and friction points this paper is trying to address.

Implicit alignment of 3D scene features with the VLM embedding space yields suboptimal performance
Scarcity of 3D training data limits existing VLM-based 3D scene understanding methods
Complex local and global spatial relationships in 3D environments are difficult to model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Explicit alignment using multi-view images and text
Text descriptions referencing 3D object coordinates
Hierarchical feature aggregation from patch to scene level
🔎 Similar Papers
No similar papers found.
Chen Li
Institute of High-Performance Computing, Agency for Science, Technology and Research, Singapore
Eric Peh
Institute of High-Performance Computing, Agency for Science, Technology and Research, Singapore
Basura Fernando
Scientist at A*STAR Singapore, Assistant Professor at NTU
Visual Reasoning · Action Prediction · Action Recognition · Transfer Learning · Embodied AI