ChangingGrounding: 3D Visual Grounding in Changing Scenes

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 3D visual grounding (3DVG) methods rely on static, reconstructed point clouds and thus struggle in real-world dynamic environments, necessitating frequent re-scanning and incurring high deployment costs. This work reframes 3D visual grounding as a memory-driven active reasoning task, enabling robots to efficiently explore and precisely localize targets by leveraging historical observations. Our contributions are threefold: (1) We introduce ChangingGrounding—the first 3D visual grounding benchmark explicitly designed for dynamic scenes; (2) We propose a zero-shot memory-augmented framework integrating cross-modal retrieval, lightweight multi-view fusion, active exploration policies, and scan-projection techniques; (3) We incorporate a failure-recovery mechanism to enhance robustness. Evaluated on ChangingGrounding, our method achieves state-of-the-art localization accuracy with significantly reduced exploration overhead, empirically validating the effectiveness and practicality of the memory-driven paradigm in realistic dynamic settings.

📝 Abstract
Real-world robots localize objects from natural-language instructions while the scenes around them keep changing. Yet most existing 3D visual grounding (3DVG) methods still assume a reconstructed and up-to-date point cloud, an assumption that forces costly re-scans and hinders deployment. We argue that 3DVG should be formulated as an active, memory-driven problem, and we introduce ChangingGrounding, the first benchmark that explicitly measures how well an agent can exploit past observations, explore only where needed, and still deliver precise 3D boxes in changing scenes. To set a strong reference point, we also propose Mem-ChangingGrounder, a zero-shot method for this task that marries cross-modal retrieval with lightweight multi-view fusion: it identifies the object type implied by the query, retrieves relevant memories to guide actions, then explores the target efficiently in the scene, falls back when previous operations are invalid, performs multi-view scanning of the target, and projects the fused evidence from multi-view scans to get accurate object bounding boxes. We evaluate different baselines on ChangingGrounding, and our Mem-ChangingGrounder achieves the highest localization accuracy while greatly reducing exploration cost. We hope this benchmark and method catalyze a shift toward practical, memory-centric 3DVG research for real-world applications. Project page: https://hm123450.github.io/CGB/
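The memory-first control flow described in the abstract (infer the queried object type, retrieve matching memories, re-scan remembered locations, fall back to active exploration when memories are stale, then fuse multi-view evidence into a 3D box) can be sketched in miniature as below. This is an illustrative sketch only: every name here (`Memory`, `ground_with_memory`, `scan_at`, `explore`, `fuse_views`) is a hypothetical stand-in, not the paper's actual API, and the keyword-match type inference and averaging fusion are toy substitutes for the cross-modal retrieval and multi-view fusion the paper uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Memory:
    object_type: str
    location: Tuple[float, float, float]  # where the object was last seen


@dataclass
class Box3D:
    center: Tuple[float, float, float]
    size: Tuple[float, float, float]


def infer_object_type(query: str, known_types: List[str]) -> Optional[str]:
    # Toy stand-in for the query-parsing step: simple keyword match.
    q = query.lower()
    return next((t for t in known_types if t in q), None)


def fuse_views(views: List[Box3D]) -> Box3D:
    # Toy multi-view fusion: average per-view centers, take max extent.
    n = len(views)
    center = tuple(sum(v.center[i] for v in views) / n for i in range(3))
    size = tuple(max(v.size[i] for v in views) for i in range(3))
    return Box3D(center, size)


def ground_with_memory(
    query: str,
    memories: List[Memory],
    scan_at: Callable[[Tuple[float, float, float]], List[Box3D]],
    explore: Callable[[Optional[str]], List[Box3D]],
) -> Optional[Box3D]:
    """Memory-first grounding: check remembered locations before exploring."""
    obj_type = infer_object_type(query, sorted({m.object_type for m in memories}))
    candidates = [m for m in memories if m.object_type == obj_type]
    for mem in candidates:                  # retrieval-guided action
        views = scan_at(mem.location)       # multi-view scan near the memory
        if views:                           # target re-observed at this spot
            return fuse_views(views)        # project fused evidence to a box
    # Fallback: memories were stale or absent, so actively explore the scene.
    views = explore(obj_type)
    return fuse_views(views) if views else None
```

The key design point the sketch mirrors is ordering: remembered locations are scanned first, so exploration cost is paid only when the scene has changed since the last observation.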
Problem

Research questions and friction points this paper is trying to address.

Localizing objects from language in changing 3D scenes
Reducing costly re-scans for 3D visual grounding
Exploiting past observations to guide active exploration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Memory-driven active 3D visual grounding
Cross-modal retrieval with multi-view fusion
Efficient exploration and accurate bounding boxes