Beyond Bare Queries: Open-Vocabulary Object Grounding with 3D Scene Graph

📅 2024-06-11
📈 Citations: 3
Influential: 0
🤖 AI Summary
Existing open-vocabulary 3D object localization methods perform well on simple queries but struggle to ground ambiguous natural-language descriptions involving spatial relations. To address this, we propose the first deductive reasoning framework for 3D grounding based on scene graphs: (1) constructing metric- and semantic-aware 3D scene graphs; (2) integrating LLM-driven relational reasoning over these graphs; and (3) designing a DINO-enhanced object-centric mapping scheme coupled with CLIP-guided ray-casting for node-level visual description generation. Our approach significantly improves grounding robustness in multi-instance, same-category scenes. It achieves state-of-the-art zero-shot 3D semantic segmentation on Replica and ScanNet, substantially outperforms prior methods in grounding accuracy on the challenging Sr3D+, Nr3D, and ScanRefer benchmarks, and enables real-time inference on resource-constrained robotic embedded platforms.
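As a rough illustration of step (1), the sketch below builds a toy scene graph by connecting object nodes with metric "near" edges based on centroid distance, then resolves a relational query ("the chair near the table") by proximity. All names, thresholds, and data here are hypothetical and do not reflect BBQ's actual implementation.

```python
import math
from dataclasses import dataclass
from itertools import combinations

@dataclass
class ObjectNode:
    id: int
    label: str            # open-vocabulary caption for the node
    center: tuple         # 3D centroid of the object's point cloud

def build_metric_edges(nodes, near_threshold=1.5):
    """Connect every pair of nodes whose centroids lie within
    `near_threshold` meters (an illustrative metric edge)."""
    edges = []
    for a, b in combinations(nodes, 2):
        dist = math.dist(a.center, b.center)
        if dist <= near_threshold:
            edges.append((a.id, b.id, round(dist, 2)))
    return edges

# Toy scene: two same-class chairs, disambiguated by a spatial relation.
nodes = [
    ObjectNode(0, "chair", (0.0, 0.0, 0.0)),
    ObjectNode(1, "chair", (3.0, 0.0, 0.0)),
    ObjectNode(2, "table", (0.5, 0.5, 0.0)),
]
edges = build_metric_edges(nodes)

# Resolve "the chair near the table": pick the chair closest to the table.
table = next(n for n in nodes if n.label == "table")
chairs = [n for n in nodes if n.label == "chair"]
target = min(chairs, key=lambda c: math.dist(c.center, table.center))
```

With this data, only the chair at the origin is linked to the table, so `target` is node 0 — the kind of disambiguation among same-category instances that a bare CLIP query cannot perform.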

📝 Abstract
Locating objects described in natural language presents a significant challenge for autonomous agents. Existing CLIP-based open-vocabulary methods successfully perform 3D object grounding with simple (bare) queries but cannot cope with ambiguous descriptions that demand an understanding of object relations. To tackle this problem, we propose a modular approach called BBQ (Beyond Bare Queries), which constructs a 3D scene graph representation with metric and semantic spatial edges and utilizes a large language model as a human-to-agent interface through our deductive scene reasoning algorithm. BBQ employs robust DINO-powered associations to construct a 3D object-centric map and an advanced raycasting algorithm with a 2D vision-language model to describe the mapped objects as graph nodes. On the Replica and ScanNet datasets, we demonstrate that BBQ takes a leading place in open-vocabulary 3D semantic segmentation among zero-shot methods. We also show that leveraging spatial relations is especially effective for scenes containing multiple entities of the same semantic class. On the challenging Sr3D+, Nr3D, and ScanRefer benchmarks, our deductive approach demonstrates a significant improvement over other state-of-the-art methods, enabling object grounding from complex queries. The combination of our design choices and software implementation achieves high data-processing speed in experiments on a robot's on-board computer. This promising performance enables the application of our approach in intelligent robotics projects. We made the code publicly available at https://linukc.github.io/BeyondBareQueries/.
Problem

Research questions and friction points this paper is trying to address.

Enables 3D object grounding with complex natural language queries
Improves ambiguous object localization using spatial relation understanding
Advances open-vocabulary 3D segmentation for autonomous agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular approach with 3D scene graph representation
Uses large language model for deductive reasoning
Combines DINO-powered associations and raycasting algorithm
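To give a concrete flavor of the "LLM as a human-to-agent interface" idea, the sketch below serializes a toy scene graph into a textual prompt that a language model could reason over deductively. The JSON schema, field names, and prompt wording are invented for illustration and do not reflect BBQ's actual prompt format.

```python
import json

def graph_to_prompt(objects, relations, query):
    """Serialize a scene graph (nodes + spatial edges) into a prompt.
    Schema and wording are illustrative, not the paper's format."""
    scene = {
        "objects": [
            {"id": i, "label": label, "center": center}
            for i, label, center in objects
        ],
        "relations": [
            {"a": a, "b": b, "relation": rel}
            for a, b, rel in relations
        ],
    }
    return (
        "You are given a 3D scene graph as JSON.\n"
        + json.dumps(scene, indent=2)
        + f"\nQuery: {query}\n"
        + "Answer with the id of the referred object."
    )

prompt = graph_to_prompt(
    objects=[
        (0, "chair", [0.0, 0.0, 0.0]),
        (1, "chair", [3.0, 0.0, 0.0]),
        (2, "table", [0.5, 0.5, 0.0]),
    ],
    relations=[(0, 2, "near")],
    query="the chair near the table",
)
```

Passing such a prompt to an LLM turns grounding into a deductive step over explicit relations, rather than a similarity match on the query text alone.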
👥 Authors

S. Linok — MIPT
T. Zemskova — MIPT, AIRI
Svetlana Ladanova — MIPT
Roman Titkov — MIPT
Dmitry A. Yudin — MIPT, AIRI
Maxim Monastyrny
Aleksei Valenkov