PC-CrossDiff: Point-Cluster Dual-Level Cross-Modal Differential Attention for Unified 3D Referring and Segmentation

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 3D visual grounding methods struggle to interpret implicit linguistic cues and are highly susceptible to interference from co-occurring objects in complex multi-object scenes, leading to degraded performance in referring expression comprehension and segmentation. To address this, the paper proposes PC-CrossDiff, a unified dual-task framework built on a novel two-level cross-modal differential attention mechanism operating at both the point level (PLDA) and the cluster level (CLDA). The PLDA module adaptively extracts implicit localization cues, while the CLDA module dynamically enhances localization-relevant spatial relationships and suppresses distracting signals. The method achieves state-of-the-art performance on the ScanRefer, NR3D, and SR3D benchmarks, notably improving the Overall@0.50 metric by 10.16% on the implicit subset of ScanRefer for the 3D referring expression comprehension (3DREC) task.
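For intuition, below is a minimal PyTorch sketch of one direction of a point-level differential cross-attention block in the spirit of PLDA: two softmax attention maps between query features (e.g., point features) and context features (e.g., text tokens) are computed and subtracted with a learnable weight, so the block can cancel attention mass on distracting tokens. The class, parameter, and dimension names (CrossModalDiffAttention, lambda_init, d_model) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of point-level cross-modal differential attention (PLDA-style).
# Names and shapes are assumptions; this is not the paper's released code.
import torch
import torch.nn as nn

class CrossModalDiffAttention(nn.Module):
    """One direction of differential cross-attention: queries from one modality
    (e.g., point features), keys/values from the other (e.g., text tokens).
    Two attention maps are subtracted with a learnable weight, so attention
    mass placed on distracting tokens can be cancelled."""
    def __init__(self, d_model: int, lambda_init: float = 0.5):
        super().__init__()
        self.q1 = nn.Linear(d_model, d_model)
        self.q2 = nn.Linear(d_model, d_model)
        self.k1 = nn.Linear(d_model, d_model)
        self.k2 = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.lam = nn.Parameter(torch.tensor(lambda_init))  # learnable differential weight
        self.scale = d_model ** -0.5

    def forward(self, query_feats, context_feats):
        # query_feats:   (B, Nq, d)  e.g., point features
        # context_feats: (B, Nc, d)  e.g., text token features
        a1 = torch.softmax(self.q1(query_feats) @ self.k1(context_feats).transpose(1, 2) * self.scale, dim=-1)
        a2 = torch.softmax(self.q2(query_feats) @ self.k2(context_feats).transpose(1, 2) * self.scale, dim=-1)
        attn = a1 - self.lam * a2                               # differential attention map
        return query_feats + attn @ self.v(context_feats)       # residual update

# Bidirectional use, as PLDA is described, would apply the block in both directions:
#   points = CrossModalDiffAttention(256)(points, text)
#   text   = CrossModalDiffAttention(256)(text, points)
```

The subtraction of two attention maps is what distinguishes differential attention from standard cross-attention: the second map acts as a learned estimate of spurious attention to be removed.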

📝 Abstract
3D Visual Grounding (3DVG) aims to localize the referent of natural language referring expressions through two core tasks: Referring Expression Comprehension (3DREC) and Segmentation (3DRES). While existing methods achieve high accuracy in simple, single-object scenes, they suffer from severe performance degradation in complex, multi-object scenes that are common in real-world settings, hindering practical deployment. Existing methods face two key challenges in complex, multi-object scenes: inadequate parsing of implicit localization cues critical for disambiguating visually similar objects, and ineffective suppression of dynamic spatial interference from co-occurring objects, resulting in degraded grounding accuracy. To address these challenges, we propose PC-CrossDiff, a unified dual-task framework with a dual-level cross-modal differential attention architecture for 3DREC and 3DRES. Specifically, the framework introduces: (i) Point-Level Differential Attention (PLDA) modules that apply bidirectional differential attention between text and point clouds, adaptively extracting implicit localization cues via learnable weights to improve discriminative representation; (ii) Cluster-Level Differential Attention (CLDA) modules that establish a hierarchical attention mechanism to adaptively enhance localization-relevant spatial relationships while suppressing ambiguous or irrelevant spatial relations through a localization-aware differential attention block. Our method achieves state-of-the-art performance on the ScanRefer, NR3D, and SR3D benchmarks. Notably, on the Implicit subsets of ScanRefer, it improves the Overall@0.50 score by +10.16% for the 3DREC task, highlighting its strong ability to parse implicit spatial cues.
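To complement the point-level sketch above, the following is a hedged sketch of a cluster-level, localization-aware differential attention step: a text-conditioned relevance gate down-weights clusters the expression is unlikely to refer to before differential self-attention relates the surviving clusters spatially. The gating formulation and all names are assumptions inferred from the abstract, not the released code.

```python
# Hedged sketch of cluster-level, localization-aware differential attention (CLDA-style).
# The gating scheme and names are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

class LocalizationAwareDiffAttention(nn.Module):
    """Differential self-attention over K candidate clusters, where a
    text-conditioned relevance gate enhances spatial relations involving
    localization-relevant clusters and suppresses distracting ones."""
    def __init__(self, d_model: int, lambda_init: float = 0.5):
        super().__init__()
        self.q1, self.q2 = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.k1, self.k2 = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.relevance = nn.Linear(d_model, 1)          # text-conditioned cluster relevance
        self.lam = nn.Parameter(torch.tensor(lambda_init))
        self.scale = d_model ** -0.5

    def forward(self, clusters, sentence):
        # clusters: (B, K, d) candidate object clusters; sentence: (B, d) pooled text feature
        gate = torch.sigmoid(self.relevance(clusters + sentence.unsqueeze(1)))  # (B, K, 1)
        x = clusters * gate                              # down-weight irrelevant clusters
        a1 = torch.softmax(self.q1(x) @ self.k1(x).transpose(1, 2) * self.scale, dim=-1)
        a2 = torch.softmax(self.q2(x) @ self.k2(x).transpose(1, 2) * self.scale, dim=-1)
        attn = a1 - self.lam * a2                        # differential attention over clusters
        return clusters + attn @ self.v(x)               # residual cluster update
```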
Problem

Research questions and friction points this paper is trying to address.

3D Visual Grounding
implicit localization cues
spatial interference
multi-object scenes
referring expression comprehension
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal Differential Attention
Point-Cluster Dual-Level Architecture
Implicit Localization Cues
3D Visual Grounding
Dynamic Spatial Interference Suppression
Wenbin Tan
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, Xiamen, China
Jiawen Lin
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, Xiamen, China
Fangyong Wang
Hanjiang National Laboratory, Wuhan, China
Yuan Xie
Full Professor, School of Computer Science and Technology, East China Normal University
computer vision and image processing
Yong Xie
Department of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, China
Yachao Zhang
Xiamen University, Tsinghua University
3D Computer Vision, Point Cloud Analysis, Understanding of 3D Scenes, Deep Learning
Yanyun Qu
Xiamen University
Computer Vision