🤖 AI Summary
This work addresses the challenge of interactive object search in open-world home environments, where effective modeling of semantic relationships and contextual cues among objects is crucial, yet is hindered by the unreliability of vision-language embeddings and the high computational cost of large language models (LLMs). To overcome these limitations, the authors propose SCOUT, a novel approach that integrates relational semantic reasoning with 3D scene graphs. SCOUT leverages heuristic rules, such as room-object containment and object co-occurrence, to assign utility scores to scene elements, and introduces an offline distillation framework that transfers structured knowledge from LLMs into a lightweight model. An accompanying symbolic benchmark, SymSearch, enables scalable evaluation. Experiments demonstrate that SCOUT significantly outperforms embedding-based methods in both simulated and real robotic environments, achieving performance comparable to LLMs while maintaining efficient inference.
📝 Abstract
Open-world interactive object search in household environments requires understanding semantic relationships between objects and their surrounding context to guide exploration efficiently. Prior methods either rely on vision-language embedding similarity, which does not reliably capture task-relevant relational semantics, or on large language models (LLMs), which are too slow and costly for real-time deployment. We introduce SCOUT: Scene Graph-Based Exploration with Learned Utility for Open-World Interactive Object Search, a novel method that searches directly over 3D scene graphs by assigning utility scores to rooms, frontiers, and objects using relational exploration heuristics such as room-object containment and object-object co-occurrence. To make this practical without sacrificing open-vocabulary generalization, we propose an offline procedural distillation framework that extracts structured relational knowledge from LLMs into lightweight models for on-robot inference. Furthermore, we present SymSearch, a scalable symbolic benchmark for evaluating semantic reasoning in interactive object search tasks. Extensive evaluations across symbolic and simulation environments show that SCOUT outperforms embedding similarity-based methods and matches LLM-level performance while remaining computationally efficient. Finally, real-world experiments demonstrate effective transfer to physical environments, enabling open-world interactive object search under realistic sensing and navigation constraints.
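To make the utility-scoring idea concrete, here is a minimal illustrative sketch of ranking rooms by combining a room-object containment prior with object-object co-occurrence cues. The priors, function names, and additive scoring rule below are hypothetical stand-ins for the distilled relational knowledge the abstract describes, not the authors' actual model.

```python
# Toy priors standing in for distilled LLM knowledge (illustrative values only):
# P(target is contained in room) and a co-occurrence affinity between objects.
ROOM_CONTAINMENT = {
    ("mug", "kitchen"): 0.8,
    ("mug", "bathroom"): 0.1,
}
CO_OCCURRENCE = {
    ("mug", "coffee_maker"): 0.9,
    ("mug", "toothbrush"): 0.05,
}

def utility(target: str, room: str, visible_objects: list[str]) -> float:
    """Score a room for the target by mixing a containment prior with
    co-occurrence cues from objects already observed in that room."""
    score = ROOM_CONTAINMENT.get((target, room), 0.05)  # small default prior
    for obj in visible_objects:
        score += CO_OCCURRENCE.get((target, obj), 0.0)
    return score

# Rank candidate rooms in a tiny scene graph when searching for a mug.
rooms = {
    "kitchen": ["coffee_maker", "sink"],
    "bathroom": ["toothbrush"],
}
ranked = sorted(rooms, key=lambda r: utility("mug", r, rooms[r]), reverse=True)
print(ranked)  # the kitchen outranks the bathroom for a mug
```

In the paper's framing, analogous scores would also be assigned to frontiers and individual objects, and the priors would come from the offline LLM distillation step rather than hand-written tables.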