SAGE: Scene Graph-Aware Guidance and Execution for Long-Horizon Manipulation Tasks

📅 2025-09-26
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
Long-horizon manipulation tasks suffer from a critical disconnect between high-level symbolic planning and low-level continuous control. Method: We propose a scene-graph-aware vision-language coordination framework, where a structured scene graph serves as the semantic hub—unifying object representations, relational semantics, and action logic. By tightly coupling vision-language models (VLMs) and large language models (LLMs), our approach enables long-horizon task planning; decoupled image inpainting and compositional editing further generate controllable subgoal images to support physically plausible state-transition reasoning. Contribution/Results: To our knowledge, this is the first work to explicitly embed scene graphs into an end-to-end vision–motor control loop, enabling deep synergy between semantic reasoning and pixel-level execution. Our method achieves state-of-the-art performance across multiple long-sequence manipulation benchmarks, improving task success rate by 23.6% and significantly enhancing cross-scene generalization.
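
To make the scene-graph hub concrete, here is a minimal, hypothetical sketch of how such a structured scene state and its transitions could be represented; the field names and relation vocabulary are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical sketch of a semantic scene graph as the planning substrate.
# Field names and the relation vocabulary are assumptions for illustration only.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ObjectNode:
    """An object instance in the scene (e.g., detected and named by a VLM)."""
    name: str                                # e.g., "red_block"
    category: str                            # e.g., "block"
    bbox: tuple[float, float, float, float]  # pixel-space box (x1, y1, x2, y2)


@dataclass(frozen=True)
class Relation:
    """A directed semantic relation, e.g., ("red_block", "on", "table")."""
    subject: str
    predicate: str                           # "on", "inside", "left_of", ...
    obj: str


@dataclass
class SceneGraph:
    """Structured scene state: objects as nodes, relations as edges."""
    nodes: dict[str, ObjectNode] = field(default_factory=dict)
    edges: set[Relation] = field(default_factory=set)

    def apply(self, add: set[Relation], remove: set[Relation]) -> "SceneGraph":
        """Return the successor graph after one symbolic state transition."""
        return SceneGraph(nodes=dict(self.nodes), edges=(self.edges - remove) | add)
```

Under this reading, a long-horizon plan amounts to a sequence of sub-goal graphs, each obtained by applying one physically plausible transition to its predecessor.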

📝 Abstract
Successfully solving long-horizon manipulation tasks remains a fundamental challenge. These tasks involve extended action sequences and complex object interactions, presenting a critical gap between high-level symbolic planning and low-level continuous control. Bridging this gap requires two essential capabilities: robust long-horizon task planning and effective goal-conditioned manipulation. Existing task planning methods, both traditional and LLM-based, often exhibit limited generalization or sparse semantic reasoning, while image-conditioned control methods struggle to adapt to unseen tasks. To tackle these problems, we propose SAGE, a novel framework for Scene Graph-Aware Guidance and Execution in Long-Horizon Manipulation Tasks. SAGE utilizes semantic scene graphs as a structured representation of scene states; this representation bridges task-level semantic reasoning and pixel-level visuo-motor control and facilitates the controllable synthesis of accurate, novel sub-goal images. SAGE consists of two key components: (1) a scene graph-based task planner that uses VLMs and LLMs to parse the environment and reason about physically grounded scene state transition sequences, and (2) a decoupled structural image editing pipeline that controllably converts each target sub-goal graph into a corresponding image through image inpainting and composition. Extensive experiments demonstrate that SAGE achieves state-of-the-art performance on distinct long-horizon tasks.
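
As a reading aid for the two components above, the following sketch shows how a planner and a structural image editor could be chained into one perception-plan-execute loop. All of the callables (parse_scene, plan_transitions, render_subgoal_image, goal_conditioned_policy) are hypothetical stand-ins for the VLM parser, LLM planner, inpainting-and-composition editor, and low-level controller; this is an assumed wiring, not the paper's released interface.

```python
# Assumed wiring of SAGE-style components; every callable below is a hypothetical stand-in.
from typing import Callable


def run_episode(
    observation,                        # current RGB observation of the scene
    instruction: str,                   # natural-language task description
    parse_scene: Callable,              # VLM: image -> scene graph
    plan_transitions: Callable,         # LLM: (scene graph, instruction) -> list of sub-goal graphs
    render_subgoal_image: Callable,     # editor: (image, sub-goal graph) -> sub-goal image
    goal_conditioned_policy: Callable,  # controller: (observation, goal image) -> next observation
):
    """Plan in scene-graph space, then execute each sub-goal via image-conditioned control."""
    graph = parse_scene(observation)                 # structured scene state
    subgoals = plan_transitions(graph, instruction)  # physically grounded transitions
    for subgoal_graph in subgoals:
        goal_image = render_subgoal_image(observation, subgoal_graph)   # controllable synthesis
        observation = goal_conditioned_policy(observation, goal_image)  # low-level execution
    return observation
```
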
Problem

Research questions and friction points this paper is trying to address.

Bridging symbolic planning and continuous control for manipulation
Enabling robust long-horizon task planning with scene graphs
Generating accurate sub-goal images for unseen tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses scene graphs for semantic reasoning and control
Combines VLMs and LLMs for task planning
Employs structural image editing for sub-goal generation (see the sketch after this list)
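
The inpaint-then-compose idea behind the last bullet can be illustrated with a short compositing sketch. Here inpaint_region is a hypothetical hook for any off-the-shelf inpainting model, and the box-based pasting is an assumption about the editing step, since the exact pipeline is not spelled out in this summary.

```python
# Hedged illustration of decoupled sub-goal image synthesis: inpaint the vacated region,
# then compose the object crop at its planned location. `inpaint_region` is hypothetical.
from PIL import Image


def compose_subgoal_image(
    current: Image.Image,
    object_crop: Image.Image,
    old_box: tuple[int, int, int, int],  # (x1, y1, x2, y2) where the object is now
    new_box: tuple[int, int, int, int],  # (x1, y1, x2, y2) where the sub-goal graph places it
    inpaint_region,                      # hypothetical: (image, box) -> image with box filled in
) -> Image.Image:
    """Return a plausible sub-goal image for one object relocation."""
    background = inpaint_region(current, old_box)  # 1) erase the old placement
    goal = background.copy()
    width, height = new_box[2] - new_box[0], new_box[3] - new_box[1]
    goal.paste(object_crop.resize((width, height)), (new_box[0], new_box[1]))  # 2) compose
    return goal
```
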
👥 Authors
Jialiang Li (School of Artificial Intelligence, Shanghai Jiao Tong University)
Wenzheng Wu (ShanghaiTech University)
Gaojing Zhang (M.S. student; SLAM, Environment Awareness)
Yifan Han (Institute of Automation, Chinese Academy of Sciences)
Wenzhao Lian (Google X; Robotics, Machine Learning)