🤖 AI Summary
Existing 3D visual grounding methods struggle to interpret implicit linguistic cues and are highly susceptible to interference from co-occurring objects in complex multi-object scenes, leading to degraded performance in referring expression comprehension and segmentation. To address these issues, this work proposes PC-CrossDiff, a unified dual-task framework built on a novel two-level cross-modal differential attention mechanism that operates at both the point level (PLDA) and the cluster level (CLDA). The PLDA module adaptively extracts implicit localization cues, while the CLDA module dynamically enhances localization-relevant spatial relationships and suppresses distracting signals. The proposed method achieves state-of-the-art performance across the ScanRefer, NR3D, and SR3D benchmarks, notably improving the Overall@0.50 metric by 10.16% on the implicit subset of ScanRefer for the 3D referring expression comprehension (3DREC) task.
📝 Abstract
3D Visual Grounding (3DVG) aims to localize the referent of a natural language referring expression through two core tasks: Referring Expression Comprehension (3DREC) and Segmentation (3DRES). While existing methods achieve high accuracy in simple, single-object scenes, they suffer severe performance degradation in the complex, multi-object scenes common in real-world settings, hindering practical deployment. In such scenes, they face two key challenges: inadequate parsing of the implicit localization cues critical for disambiguating visually similar objects, and ineffective suppression of dynamic spatial interference from co-occurring objects, both of which degrade grounding accuracy. To address these challenges, we propose PC-CrossDiff, a unified dual-task framework with a dual-level cross-modal differential attention architecture for 3DREC and 3DRES. Specifically, the framework introduces: (i) Point-Level Differential Attention (PLDA) modules, which apply bidirectional differential attention between text and point clouds and adaptively extract implicit localization cues via learnable weights to improve discriminative representation; and (ii) Cluster-Level Differential Attention (CLDA) modules, which establish a hierarchical attention mechanism that adaptively enhances localization-relevant spatial relationships while suppressing ambiguous or irrelevant spatial relations through a localization-aware differential attention block. Our method achieves state-of-the-art performance on the ScanRefer, NR3D, and SR3D benchmarks. Notably, on the implicit subset of ScanRefer, it improves the Overall@0.50 score by 10.16% on the 3DREC task, highlighting its strong ability to parse implicit spatial cues.