🤖 AI Summary
Existing open-vocabulary 3D object localization methods perform well on simple queries but struggle to ground ambiguous natural-language descriptions involving spatial relations. To address this, the authors propose a deductive reasoning framework for 3D grounding based on scene graphs: (1) constructing metric- and semantic-aware 3D scene graphs; (2) integrating LLM-driven relational reasoning over these graphs; and (3) designing a DINO-powered object-centric mapping scheme coupled with a raycasting algorithm and a 2D vision-language model for node-level visual description generation. The approach significantly improves grounding robustness in scenes containing multiple instances of the same category. It achieves leading zero-shot 3D semantic segmentation on Replica and ScanNet, substantially outperforms prior zero-shot methods in grounding accuracy on the challenging Sr3D+, Nr3D, and ScanRefer benchmarks, and delivers fast inference on a robot's on-board computer.
📝 Abstract
Locating objects described in natural language presents a significant challenge for autonomous agents. Existing CLIP-based open-vocabulary methods successfully perform 3D object grounding with simple (bare) queries, but cannot cope with ambiguous descriptions that demand an understanding of object relations. To tackle this problem, we propose a modular approach called BBQ (Beyond Bare Queries), which constructs a 3D scene graph representation with metric and semantic spatial edges and utilizes a large language model as a human-to-agent interface through our deductive scene reasoning algorithm. BBQ employs robust DINO-powered associations to construct a 3D object-centric map and an advanced raycasting algorithm with a 2D vision-language model to describe map objects as graph nodes. On the Replica and ScanNet datasets, we demonstrate that BBQ achieves leading open-vocabulary 3D semantic segmentation compared to other zero-shot methods. We also show that leveraging spatial relations is especially effective for scenes containing multiple entities of the same semantic class. On the challenging Sr3D+, Nr3D, and ScanRefer benchmarks, our deductive approach delivers a significant improvement, enabling object grounding from complex queries compared to other state-of-the-art methods. The combination of our design choices and software implementation yields high data processing speed in experiments on a robot's on-board computer. This promising performance enables the application of our approach in intelligent robotics projects. The code is publicly available at https://linukc.github.io/BeyondBareQueries/.
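To make the pipeline concrete, here is a minimal, hypothetical sketch of the two ideas the abstract names: a scene graph whose nodes are 3D objects connected by metric/semantic edges, and relational deduction that resolves a query like "the chair near the table" by comparing candidates against an anchor object. All names, thresholds, and the nearest-anchor heuristic are illustrative stand-ins, not BBQ's actual implementation (BBQ delegates the relational reasoning to an LLM).

```python
import math
from dataclasses import dataclass

@dataclass
class Node:
    """A scene-graph node: one mapped 3D object instance."""
    id: int
    label: str                      # semantic class from a 2D VLM description
    centroid: tuple                 # (x, y, z) in meters

def dist(a: Node, b: Node) -> float:
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a.centroid, b.centroid)))

def build_edges(nodes, near_thresh=1.5):
    """Metric edges (pairwise distance) plus a toy semantic relation.
    The 1.5 m 'near' threshold is an assumption for illustration."""
    edges = {}
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            d = dist(a, b)
            edges[(a.id, b.id)] = (d, "near" if d < near_thresh else "far")
    return edges

def ground(nodes, target_label, anchor_label):
    """Deduce the referred target: the instance of target_label closest to
    any anchor_label instance (a stand-in for the LLM's reasoning step)."""
    targets = [n for n in nodes if n.label == target_label]
    anchors = [n for n in nodes if n.label == anchor_label]
    return min(targets, key=lambda t: min(dist(t, a) for a in anchors))

# Two same-class chairs: a bare query "chair" is ambiguous,
# but "the chair near the table" is resolvable via the graph.
nodes = [
    Node(0, "chair", (0.0, 0.0, 0.0)),
    Node(1, "chair", (4.0, 0.0, 0.0)),
    Node(2, "table", (3.5, 0.5, 0.0)),
]
edges = build_edges(nodes)
best = ground(nodes, "chair", "table")
print(best.id)  # → 1 (the chair next to the table)
```

This toy version illustrates why relational edges matter for multi-instance scenes: both chairs match the bare query, and only the spatial relation to the anchor disambiguates them.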