FusionBERT: Multi-View Image-3D Retrieval via Cross-Attention Visual Fusion and Normal-Aware 3D Encoder

๐Ÿ“… 2026-04-02
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses the limitations of existing image-to-3D multimodal retrieval methods, which predominantly rely on single-view images and struggle to handle the multi-view observations typical of real-world scenarios. To overcome this, the authors propose FusionBERT, a novel framework that first introduces a multi-view visual aggregator leveraging cross-attention mechanisms to adaptively fuse complementary features from multiple viewpoints. Additionally, a normal-aware 3D encoder is incorporated to jointly encode point coordinates and surface normals, thereby enhancing geometric representation for models lacking texture or suffering from color degradation. Experimental results demonstrate that the proposed method significantly outperforms current state-of-the-art approaches under both single-view and multi-view settings, establishing a strong baseline for image-to-3D cross-modal retrieval.
๐Ÿ“ Abstract
We propose FusionBERT, a novel multi-view visual fusion framework for image-3D multimodal retrieval. Existing image-3D representation learning methods predominantly focus on aligning a single object image with its 3D model, limiting their applicability in realistic scenarios where an object is typically observed and captured from multiple viewpoints. Although multi-view observations naturally provide complementary geometric and appearance cues, existing multimodal large models rarely explore how to effectively fuse such multi-view visual information for cross-modal retrieval. To address this limitation, we introduce a multi-view image-3D retrieval framework named FusionBERT, which uses a cross-attention-based multi-view visual aggregator to adaptively integrate features from multi-view images of an object. The proposed multi-view visual encoder models inter-view complementary relationships and selectively emphasizes informative visual cues across views, producing a robust fused visual feature for 3D model matching. Furthermore, FusionBERT introduces a normal-aware 3D model encoder that enhances the geometric representation of an object model by jointly encoding point normals and 3D positions, enabling more robust representation learning for textureless or color-degraded 3D models. Extensive image-3D retrieval experiments demonstrate that FusionBERT achieves significantly higher retrieval accuracy than SOTA multimodal large models under both single-view and multi-view settings, establishing a strong baseline for multi-view multimodal retrieval.
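The paper does not publish implementation details here, but the cross-attention-based multi-view aggregator described in the abstract could be sketched roughly as follows. This is a minimal single-head illustration, assuming a learnable fusion query attending over per-view feature vectors; the names `fuse_multi_view`, `query`, and the projection matrices `w_q`/`w_k`/`w_v` are hypothetical, not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_multi_view(query, view_feats, w_q, w_k, w_v):
    """Fuse per-view features into one descriptor via cross-attention.

    query:      (d,)   learnable fusion query (hypothetical)
    view_feats: (V, d) one feature vector per view
    w_q/w_k/w_v:(d, d) projection matrices (hypothetical)
    """
    q = query @ w_q                      # (d,)  projected query
    k = view_feats @ w_k                 # (V, d) keys, one per view
    v = view_feats @ w_v                 # (V, d) values, one per view
    scores = k @ q / np.sqrt(len(q))     # (V,)  scaled dot-product scores
    weights = softmax(scores)            # attention distribution over views
    return weights @ v                   # (d,)  fused visual feature
```

The attention weights let the model emphasize informative views and down-weight redundant or occluded ones, which matches the "selectively emphasizes informative visual cues" behavior the abstract describes.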
Problem

Research questions and friction points this paper is trying to address.

image-3D retrieval
multi-view fusion
cross-modal retrieval
3D representation learning
visual feature alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-view fusion
cross-attention
normal-aware 3D encoding
image-3D retrieval
visual-geometric representation
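The "normal-aware 3D encoding" contribution listed above (jointly encoding point positions and surface normals) could, under assumptions, look like the following sketch: concatenate each point's coordinates with its unit normal, apply a shared per-point MLP, and max-pool into a global descriptor. The function name `encode_pointcloud` and the weight shapes are illustrative, not the paper's actual architecture.

```python
import numpy as np

def encode_pointcloud(points, normals, w1, w2):
    """Normal-aware encoding: concat xyz + normals, shared MLP, max-pool.

    points:  (N, 3) point coordinates
    normals: (N, 3) unit surface normals
    w1:      (6, h) first-layer weights (hypothetical)
    w2:      (h, d) second-layer weights (hypothetical)
    Returns a (d,) global, permutation-invariant shape descriptor.
    """
    x = np.concatenate([points, normals], axis=1)  # (N, 6) joint geometry input
    h = np.maximum(x @ w1, 0.0)                    # (N, h) per-point ReLU MLP
    feat = h @ w2                                  # (N, d) per-point features
    return feat.max(axis=0)                        # (d,)  symmetric max-pool
```

Because normals capture local surface orientation independently of texture or color, feeding them alongside coordinates is what gives the encoder robustness on textureless or color-degraded models.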
๐Ÿ”Ž Similar Papers
No similar papers found.
Wei Li
Zhejiang University
Vision & Language · Video Understanding · LLMs
Yufan Ren
EPFL
3D Perception and Reconstruction · Diffusion Models · LVLM
Hanqing Jiang
IROOTECH TECHNOLOGY, Wolf 1069 b Lab, Sany Group, Hangzhou, Zhejiang, China
Jianhui Ding
IROOTECH TECHNOLOGY, Wolf 1069 b Lab, Sany Group, Guangzhou, Guangdong, China
Zhen Peng
IROOTECH TECHNOLOGY, Wolf 1069 b Lab, Sany Group, Hangzhou, Zhejiang, China
Leman Feng
IROOTECH TECHNOLOGY, Wolf 1069 b Lab, Sany Group, Guangzhou, Guangdong, China
Yichun Shentu
IROOTECH TECHNOLOGY, Wolf 1069 b Lab, Sany Group, Hangzhou, Zhejiang, China
Guoqiang Xu
Southeast University
Non-Hermitian physics · Diffusion · Metamaterials
Baigui Sun
Wolf 1069 b Lab, Sany Group
Artificial Intelligence · Computer Vision