🤖 AI Summary
Existing 3D visual grounding methods rely on a pre-defined Object Lookup Table (OLT) when querying Vision-Language Models (VLMs), which limits generalization to unseen or unforeseen target categories. This work proposes OpenGround, a zero-shot framework for open-world 3D visual grounding that removes the dependency on a fixed OLT. Its Active Cognition-based Reasoning (ACR) module performs human-like perception of the target through a cognitive task chain, actively reasons about contextually relevant objects, and progressively extends VLM cognition via a dynamically updated OLT, allowing the framework to operate with both pre-defined and open-world categories. The authors also construct OpenTarget, a benchmark of over 7000 object-description pairs for evaluation in open-world scenarios. Experiments show competitive performance on Nr3D, state-of-the-art results on ScanRefer, and a 17.6% improvement on OpenTarget.
📝 Abstract
3D visual grounding aims to locate objects in 3D scenes based on natural language descriptions. Existing methods rely on a pre-defined Object Lookup Table (OLT) to query Vision-Language Models (VLMs) for reasoning about object locations, which limits applicability in scenarios with undefined or unforeseen targets. To address this problem, we present OpenGround, a novel zero-shot framework for open-world 3D visual grounding. Central to OpenGround is the Active Cognition-based Reasoning (ACR) module, which is designed to overcome the fundamental limitation of pre-defined OLTs by progressively augmenting the cognitive scope of VLMs. The ACR module performs human-like perception of the target via a cognitive task chain and actively reasons about contextually relevant objects, thereby extending VLM cognition through a dynamically updated OLT. This allows OpenGround to function with both pre-defined and open-world categories. We also propose a new dataset named OpenTarget, which contains over 7000 object-description pairs for evaluating our method in open-world scenarios. Extensive experiments demonstrate that OpenGround achieves competitive performance on Nr3D, state-of-the-art results on ScanRefer, and a substantial 17.6% improvement on OpenTarget. Project page: [https://why-102.github.io/openground.io/](https://why-102.github.io/openground.io/).
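The abstract's central idea, replacing a fixed OLT with one that the reasoning loop expands on demand, can be illustrated with a minimal sketch. This is not the paper's implementation: the class and function names (`ObjectLookupTable`, `ground_target`, `propose_relevant_categories`) are hypothetical, and the VLM call is stubbed with a plain function so the loop structure is visible.

```python
# Hypothetical sketch of a dynamically updated Object Lookup Table (OLT).
# All names here are illustrative, not from the OpenGround paper.

class ObjectLookupTable:
    """A minimal OLT: maps category names to lists of object instances."""
    def __init__(self, initial_categories=None):
        self.table = {c: [] for c in (initial_categories or [])}

    def has(self, category):
        return category in self.table

    def add_category(self, category, instances=None):
        self.table.setdefault(category, []).extend(instances or [])


def ground_target(description, olt, propose_relevant_categories, max_rounds=3):
    """Active-reasoning-style loop: if the referred target is outside the
    pre-defined OLT, ask a (stubbed) VLM to propose contextually relevant
    categories and expand the OLT before attempting localization."""
    for _ in range(max_rounds):
        # propose_relevant_categories stands in for a VLM query.
        for category in propose_relevant_categories(description, olt):
            if not olt.has(category):
                olt.add_category(category)
        # A real system would localize objects using the expanded OLT;
        # here we just report which OLT categories the description mentions.
        matches = sorted(c for c in olt.table if c in description)
        if matches:
            return matches
    return []


# Usage: the target category is absent from the initial OLT and gets
# added by the (stubbed) proposal step.
olt = ObjectLookupTable(["chair", "table"])
matches = ground_target(
    "the espresso machine next to the sink",
    olt,
    lambda desc, table: ["espresso machine", "sink"],  # stub for a VLM call
)
```

The point of the sketch is only the control flow: cognition is extended by mutating the OLT inside the loop rather than fixing its categories up front.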