🤖 AI Summary
Existing 3D visual grounding methods rely on a pre-defined Object Lookup Table (OLT) when querying Vision-Language Models (VLMs), which limits generalization to unseen or unforeseen target categories. This work proposes OpenGround, a zero-shot framework for open-world 3D visual grounding that removes the dependency on a fixed OLT. Its Active Cognition-based Reasoning (ACR) module performs human-like perception of the target through a cognitive task chain, actively reasons about contextually relevant objects, and progressively extends VLM cognition via a dynamically updated OLT, allowing the framework to operate with both pre-defined and open-world categories. The authors also construct OpenTarget, a benchmark of over 7000 object-description pairs for evaluation in open-world scenarios. Experiments show competitive performance on Nr3D, state-of-the-art results on ScanRefer, and a 17.6% improvement on OpenTarget.
📝 Abstract
3D visual grounding aims to locate objects in 3D scenes based on natural language descriptions. Existing methods rely on a pre-defined Object Lookup Table (OLT) to query Vision-Language Models (VLMs) for reasoning about object locations, which limits applicability in scenarios with undefined or unforeseen targets. To address this problem, we present OpenGround, a novel zero-shot framework for open-world 3D visual grounding. Central to OpenGround is the Active Cognition-based Reasoning (ACR) module, which is designed to overcome the fundamental limitation of pre-defined OLTs by progressively augmenting the cognitive scope of VLMs. The ACR module performs human-like perception of the target via a cognitive task chain and actively reasons about contextually relevant objects, thereby extending VLM cognition through a dynamically updated OLT. This allows OpenGround to function with both pre-defined and open-world categories. We also propose a new dataset named OpenTarget, which contains over 7000 object-description pairs for evaluating our method in open-world scenarios. Extensive experiments demonstrate that OpenGround achieves competitive performance on Nr3D, state-of-the-art results on ScanRefer, and a substantial 17.6% improvement on OpenTarget. Project page: [https://why-102.github.io/openground.io/](https://why-102.github.io/openground.io/).
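The abstract's central idea, replacing a fixed OLT with one that the reasoning loop expands on demand, can be illustrated with a minimal sketch. This is not the paper's implementation: the class and function names (`ObjectLookupTable`, `ground_target`, `propose_relevant_categories`) are hypothetical, and the VLM call is stubbed with a plain function so the loop structure is visible.

```python
# Hypothetical sketch of a dynamically updated Object Lookup Table (OLT).
# All names here are illustrative, not from the OpenGround paper.

class ObjectLookupTable:
    """A minimal OLT: maps category names to lists of object instances."""
    def __init__(self, initial_categories=None):
        self.table = {c: [] for c in (initial_categories or [])}

    def has(self, category):
        return category in self.table

    def add_category(self, category, instances=None):
        self.table.setdefault(category, []).extend(instances or [])


def ground_target(description, olt, propose_relevant_categories, max_rounds=3):
    """Active-reasoning-style loop: if the referred target is outside the
    pre-defined OLT, ask a (stubbed) VLM to propose contextually relevant
    categories and expand the OLT before attempting localization."""
    for _ in range(max_rounds):
        # propose_relevant_categories stands in for a VLM query.
        for category in propose_relevant_categories(description, olt):
            if not olt.has(category):
                olt.add_category(category)
        # A real system would localize objects using the expanded OLT;
        # here we just report which OLT categories the description mentions.
        matches = sorted(c for c in olt.table if c in description)
        if matches:
            return matches
    return []


# Usage: the target category is absent from the initial OLT and gets
# added by the (stubbed) proposal step.
olt = ObjectLookupTable(["chair", "table"])
matches = ground_target(
    "the espresso machine next to the sink",
    olt,
    lambda desc, table: ["espresso machine", "sink"],  # stub for a VLM call
)
```

The point of the sketch is only the control flow: cognition is extended by mutating the OLT inside the loop rather than fixing its categories up front.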