IPDN: Image-enhanced Prompt Decoding Network for 3D Referring Expression Segmentation

📅 2025-01-09
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This paper addresses two key challenges in 3D referring expression segmentation: (1) ambiguous point cloud features due to acquisition distortion, and (2) insufficient modeling of linguistic intent caused by task-agnostic decoding. To tackle these, we propose a multimodal collaborative framework comprising: (1) a novel Multi-View Semantic Embedding (MSE) module that leverages multi-view 2D image encodings to compensate for spatial information loss in point clouds; (2) a Prompt-Aware Decoder (PAD) that generates task-oriented decoding signals via language–vision cross-modal alignment and prompt-driven attention; and (3) a joint point cloud–image representation learning paradigm. Evaluated on 3D-RES and 3D-GRES benchmarks, our method achieves absolute mIoU improvements of 1.9% and 4.2%, respectively, significantly outperforming state-of-the-art approaches.
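The MSE module's core idea described above — back-projecting multi-view 2D image features onto the point cloud to compensate for lost spatial detail — can be sketched roughly as follows. This is a minimal NumPy illustration assuming known pinhole projection matrices per view; it is not the authors' implementation, and all function names and tensor shapes are illustrative.

```python
import numpy as np

def multi_view_semantic_embedding(points, view_feats, view_projs, H, W):
    """Fuse multi-view 2D image features into per-point 3D features.

    points:     (N, 3) 3D point coordinates.
    view_feats: list of (C, H, W) 2D feature maps, one per view.
    view_projs: list of (3, 4) camera projection matrices, one per view.
    Returns an (N, C) array averaging the 2D features each point projects onto.
    """
    N = points.shape[0]
    C = view_feats[0].shape[0]
    fused = np.zeros((N, C))
    hits = np.zeros(N)
    homog = np.concatenate([points, np.ones((N, 1))], axis=1)  # (N, 4)
    for feats, P in zip(view_feats, view_projs):
        uvw = homog @ P.T                     # project points into this view
        z = uvw[:, 2]
        valid = z > 1e-6                      # points in front of the camera
        u = np.zeros(N, dtype=int)
        v = np.zeros(N, dtype=int)
        u[valid] = np.round(uvw[valid, 0] / z[valid]).astype(int)
        v[valid] = np.round(uvw[valid, 1] / z[valid]).astype(int)
        valid &= (u >= 0) & (u < W) & (v >= 0) & (v < H)
        fused[valid] += feats[:, v[valid], u[valid]].T  # sample 2D features
        hits[valid] += 1
    hits = np.maximum(hits, 1)                # avoid division by zero
    return fused / hits[:, None]              # average over views that saw the point
```

In the actual model the fused 2D features would be injected into the 3D backbone's point features (e.g. by addition or concatenation) rather than returned standalone.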

📝 Abstract
3D Referring Expression Segmentation (3D-RES) aims to segment point cloud scenes based on a given expression. However, existing 3D-RES approaches face two major challenges: feature ambiguity and intent ambiguity. Feature ambiguity arises from information loss or distortion during point cloud acquisition due to limitations such as lighting and viewpoint. Intent ambiguity refers to the model's equal treatment of all queries during the decoding process, lacking top-down task-specific guidance. In this paper, we introduce an Image-enhanced Prompt Decoding Network (IPDN), which leverages multi-view images and task-driven information to enhance the model's reasoning capabilities. To address feature ambiguity, we propose the Multi-view Semantic Embedding (MSE) module, which injects multi-view 2D image information into the 3D scene and compensates for potential spatial information loss. To tackle intent ambiguity, we design a Prompt-Aware Decoder (PAD) that guides the decoding process by deriving task-driven signals from the interaction between the expression and visual features. Comprehensive experiments demonstrate that IPDN outperforms the state-of-the-art by 1.9 and 4.2 points in mIoU on the 3D-RES and 3D-GRES tasks, respectively.
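The PAD described in the abstract derives a task-driven signal from the interaction between the expression and the visual features, then uses it to guide query decoding. A minimal NumPy sketch of that general pattern (a prompt aggregated from word features, biasing query-to-point attention) might look like the following; this is an assumption-laden illustration, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prompt_aware_decode(queries, word_feats, point_feats):
    """One decoding step guided by a task-driven prompt.

    queries:     (Q, D) object queries.
    word_feats:  (L, D) per-word expression features.
    point_feats: (N, D) per-point visual features.
    Returns updated queries (Q, D) and the attention map (Q, N).
    """
    D = queries.shape[1]
    # 1. Derive a prompt: weight each word by its peak relevance to the scene,
    #    then aggregate the words into a single task-driven signal.
    rel = softmax((word_feats @ point_feats.T).max(axis=1))   # (L,)
    prompt = rel @ word_feats                                 # (D,)
    # 2. Prompt-driven attention: the prompt biases how each query
    #    attends over the visual features.
    attn = softmax((queries + prompt) @ point_feats.T / np.sqrt(D), axis=-1)
    updated = attn @ point_feats                              # (Q, D)
    return updated, attn
```

A full decoder would stack several such layers with learned projections and feed-forward blocks; the sketch only shows how a prompt derived from expression–vision interaction can steer the decoding, rather than treating all queries identically.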
Problem

Research questions and friction points this paper is trying to address.

3D referring expression segmentation
feature ambiguity (distorted point cloud features)
intent ambiguity (task-agnostic decoding)
Innovation

Methods, ideas, or system contributions that make the work stand out.

IPDN (Image-enhanced Prompt Decoding Network)
MSE (Multi-view Semantic Embedding)
PAD (Prompt-Aware Decoder)
Qi Chen
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.
Changli Wu
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.
Jiayi Ji
Rutgers University
Yiwei Ma
Stevens Institute of Technology
Danni Yang
Xiamen University
Multimodal Learning · Video Editing
Xiaoshuai Sun
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.