AI Summary
This work addresses the limitations of existing image-to-3D multimodal retrieval methods, which predominantly rely on single-view images and struggle to handle the multi-view observations typical of real-world scenarios. To overcome this, the authors propose FusionBERT, a novel framework that first introduces a multi-view visual aggregator leveraging cross-attention mechanisms to adaptively fuse complementary features from multiple viewpoints. Additionally, a normal-aware 3D encoder is incorporated to jointly encode point coordinates and surface normals, thereby enhancing geometric representation for models lacking texture or suffering from color degradation. Experimental results demonstrate that the proposed method significantly outperforms current state-of-the-art approaches under both single-view and multi-view settings, establishing a strong baseline for image-to-3D cross-modal retrieval.
Abstract
We propose FusionBERT, a novel multi-view visual fusion framework for image-3D multimodal retrieval. Existing image-3D representation learning methods predominantly focus on aligning a single object image with its 3D model, which limits their applicability in realistic scenarios where an object is typically observed and captured from multiple viewpoints. Although multi-view observations naturally provide complementary geometric and appearance cues, existing multimodal large models rarely explore how to fuse such multi-view visual information effectively for cross-modal retrieval. To address this limitation, we introduce FusionBERT, a multi-view image-3D retrieval framework that employs a cross-attention-based multi-view visual aggregator to adaptively integrate features from multiple images of an object. The proposed multi-view visual encoder captures complementary relationships between views and selectively emphasizes informative visual cues across them, producing a more robust fused visual representation for 3D model matching. Furthermore, FusionBERT incorporates a normal-aware 3D model encoder that enriches the geometric representation of an object by jointly encoding point normals and 3D positions, enabling more robust representation learning for textureless or color-degraded 3D models. Extensive image-3D retrieval experiments demonstrate that FusionBERT achieves significantly higher retrieval accuracy than state-of-the-art multimodal large models under both single-view and multi-view settings, establishing a strong baseline for multi-view multimodal retrieval.
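The abstract does not give implementation details for the two components, so the following is only a minimal NumPy sketch of the underlying ideas: attention-style pooling over per-view features with a single query vector (one plausible reading of the cross-attention aggregator), and a simple concatenation of point positions with surface normals as a stand-in for the normal-aware joint encoding. All function names, dimensions, and the single-query design here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(view_feats, query):
    """Fuse per-view features into one vector via attention pooling.

    view_feats: (V, D) array, one embedding per view.
    query: (D,) vector; stands in for a learned query token
    (hypothetical -- the paper's aggregator is not specified at
    this level of detail).
    """
    V, D = view_feats.shape
    scores = view_feats @ query / np.sqrt(D)   # (V,) scaled dot-product scores
    weights = softmax(scores)                  # attention weights over views
    return weights @ view_feats                # (D,) fused visual feature

def encode_point(xyz, normal):
    """Normal-aware point feature: concatenate 3D position and surface
    normal, a simple stand-in for the paper's joint encoding."""
    return np.concatenate([xyz, normal])       # (6,) per-point input feature

rng = np.random.default_rng(0)
views = rng.normal(size=(4, 8))                # 4 views, 8-dim features each
q = rng.normal(size=8)
fused = cross_attention_fuse(views, q)         # single fused (8,) feature
point = encode_point(np.zeros(3), np.ones(3))  # (6,) position + normal
```

The fused vector can then be matched against 3D model embeddings with any similarity measure (e.g. cosine similarity); the attention weights let informative views dominate the fusion, which is the behavior the abstract attributes to the aggregator.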