🤖 AI Summary
To address the challenges of verb-visual relationship modeling and poor generalization in natural language referring localization for 3D point clouds under low-data regimes, this paper proposes a data-efficient framework. Methodologically, it introduces (1) the first unsupervised, progressive referring sequence generation mechanism grounded in large language models (LLMs), explicitly modeling the semantic reasoning path for target localization; and (2) an order-aware warm-up training strategy coupled with weakly supervised point cloud-language alignment to strengthen cross-modal learning in few-shot settings. The framework integrates LLMs, stacked object-referring blocks, and order-aware pretraining. Evaluated on the NR3D dataset using only 1% and 10% of the labeled data, the method achieves absolute improvements of +9.3% and +7.6% in localization accuracy over prior state-of-the-art methods, demonstrating substantial gains in low-resource scenarios.
📝 Abstract
3D visual grounding aims to identify the target object within a 3D point cloud scene referred to by a natural language description. Previous works usually require large amounts of point cloud data paired with descriptions to exploit the corresponding complicated verbo-visual relations. In our work, we introduce Vigor, a novel Data-Efficient 3D Visual Grounding framework via Order-aware Referring. Vigor leverages a large language model (LLM) to produce a desirable referential order from the input description for 3D visual grounding. With the proposed stacked object-referring blocks, the predicted anchor objects in this order allow one to locate the target object progressively, without supervision on the identities of anchor objects or the exact relations between anchor and target objects. In addition, we present an order-aware warm-up training strategy, which augments referential orders for pre-training the visual grounding framework. This allows us to better capture the complex verbo-visual relations and supports the desired data-efficient learning scheme. Experimental results on the NR3D and ScanRefer datasets demonstrate our superiority in low-resource scenarios. In particular, Vigor surpasses current state-of-the-art frameworks by 9.3% and 7.6% grounding accuracy under the 1% and 10% data settings on the NR3D dataset, respectively. Our code is publicly available at https://github.com/tony10101105/Vigor.
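The referential-order idea described above can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: the LLM call is mocked with a fixed response, and the response format (a comma-separated object sequence ending at the target) plus the `locate_fn` callback are assumptions made for the example.

```python
# Hypothetical sketch of Vigor's referential-order pipeline.
# An LLM would be prompted to decompose a referring description such as
# "the chair next to the desk by the door" into an ordered anchor sequence
# ending at the target; stacked object-referring blocks then localize each
# object in turn, conditioning on previously found anchors.

def parse_referential_order(llm_response: str) -> list[str]:
    """Parse an assumed comma-separated LLM output, e.g. 'door, desk, chair',
    into an ordered list whose last element is the target object."""
    return [obj.strip() for obj in llm_response.split(",") if obj.strip()]

def progressive_grounding(order, locate_fn):
    """Locate each object in referential order; every step receives the
    anchors found so far (mimicking the stacked object-referring blocks)."""
    found = []
    for obj in order:
        found.append(locate_fn(obj, anchors=list(found)))
    return found[-1]  # the final localization corresponds to the target

if __name__ == "__main__":
    # Mocked LLM response for "the chair next to the desk by the door"
    response = "door, desk, chair"
    order = parse_referential_order(response)
    print(order)  # ['door', 'desk', 'chair']

    # Toy locate function returning a dummy box id per object
    target = progressive_grounding(order, lambda obj, anchors: f"box_of_{obj}")
    print(target)  # box_of_chair
```

The key property this sketch mirrors is that no ground-truth anchor identities are needed: supervision applies only to the final target, while intermediate anchors come solely from the LLM-derived order.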