🤖 AI Summary
Large language models (LLMs) suffer from hallucination and poor alignment between reasoning and retrieval when performing spatial reasoning and task planning over scene graphs.
Method: We propose Schema-Guided Retrieve-while-Reason (SG-RwR), a novel framework introducing the "reasoning-as-retrieval" paradigm. It employs a schema-driven dual-agent architecture (a Reasoner and a Retriever) in which both agents are prompted only with the structural schema of the scene graph, enabling iterative abstract reasoning and programmatic graph querying. Key components include schema-aware prompting, dynamic global graph attention, and interpretable reasoning-trace generation.
Contribution/Results: SG-RwR significantly mitigates hallucination and improves reasoning-retrieval fidelity. Evaluated across multiple simulation environments on numerical question answering and task planning benchmarks, it consistently outperforms existing LLM-based methods. Notably, it enables task-level few-shot transfer without agent-level demonstrations, enhancing both accuracy and robustness.
📝 Abstract
Scene graphs have emerged as a structured and serializable environment representation for grounded spatial reasoning with Large Language Models (LLMs). In this work, we propose SG-RwR, a Schema-Guided Retrieve-while-Reason framework for reasoning and planning with scene graphs. Our approach employs two cooperative, code-writing LLM agents: (1) a Reasoner for task planning and information-query generation, and (2) a Retriever for extracting the corresponding graph information following those queries. The two agents collaborate iteratively, enabling sequential reasoning and adaptive attention to graph information. Unlike prior works, both agents are prompted only with the scene graph schema rather than the full graph data, which reduces hallucination by limiting input tokens and drives the Reasoner to generate abstract reasoning traces. Following each trace, the Retriever programmatically queries the scene graph data based on its schema understanding, allowing dynamic and global attention over the graph that enhances alignment between reasoning and retrieval. Through experiments in multiple simulation environments, we show that our framework surpasses existing LLM-based approaches in numerical Q&A and planning tasks, and can benefit from task-level few-shot examples even in the absence of agent-level demonstrations. Project code will be released.
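To make the retrieve-while-reason loop concrete, here is a minimal, stdlib-only sketch of the control flow the abstract describes: a Reasoner that sees only the schema and emits queries, and a Retriever that executes those queries programmatically against the full graph data. All names (`SCENE_GRAPH`, `SCHEMA`, `reasoner_step`, `retriever_step`) and the toy graph are hypothetical illustrations; in the actual framework both roles are code-writing LLM agents, which are stubbed here with a fixed two-hop query plan.

```python
# Toy scene graph: nodes with types/attributes, edges as (source, relation, target).
SCENE_GRAPH = {
    "nodes": {
        "kitchen": {"type": "room"},
        "apple_1": {"type": "object", "category": "apple"},
        "apple_2": {"type": "object", "category": "apple"},
        "table_1": {"type": "furniture"},
    },
    "edges": [
        ("apple_1", "on", "table_1"),
        ("apple_2", "on", "table_1"),
        ("table_1", "in", "kitchen"),
    ],
}

# Schema: structural patterns only -- this is all the agents are prompted with.
SCHEMA = {
    "node_types": ["room", "object", "furniture"],
    "edge_relations": ["on", "in"],
}

def retriever_step(graph, query):
    """Execute a programmatic (relation, target) query against the full graph data."""
    relation, target = query
    return [src for src, rel, dst in graph["edges"]
            if rel == relation and dst == target]

def reasoner_step(schema, history):
    """Stub for the LLM Reasoner: emits the next query or a final answer.
    Here it follows a fixed two-hop plan for 'how many objects are in the kitchen?'."""
    if not history:
        return ("query", ("in", "kitchen"))      # step 1: find furniture in the kitchen
    if len(history) == 1:
        furniture = history[0][1]
        return ("query", ("on", furniture[0]))   # step 2: find objects on that furniture
    objects = history[1][1]
    return ("answer", len(objects))              # step 3: count the retrieved objects

def retrieve_while_reason(graph, schema, max_steps=5):
    """Iterate Reasoner and Retriever until an answer is produced."""
    history = []
    for _ in range(max_steps):
        kind, payload = reasoner_step(schema, history)
        if kind == "answer":
            return payload
        history.append((payload, retriever_step(graph, payload)))
    return None

print(retrieve_while_reason(SCENE_GRAPH, SCHEMA))  # → 2
```

The key property the sketch mirrors is that the Reasoner never sees the graph data, only the schema and the Retriever's results so far, so each retrieval can attend to any part of the graph rather than a pre-serialized slice of it.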