SAGE: Scene Graph-Aware Guidance and Execution for Long-Horizon Manipulation Tasks

📅 2025-09-26

📈 Citations: 0

✨ Influential: 0

career value

203K/year

🤖 AI Summary

Long-horizon manipulation tasks suffer from a critical disconnect between high-level symbolic planning and low-level continuous control. Method: We propose a scene-graph-aware vision-language coordination framework, where a structured scene graph serves as the semantic hub—unifying object representations, relational semantics, and action logic. By tightly coupling vision-language models (VLMs) and large language models (LLMs), our approach enables long-horizon task planning; decoupled image inpainting and compositional editing further generate controllable subgoal images to support physically plausible state-transition reasoning. Contribution/Results: To our knowledge, this is the first work to explicitly embed scene graphs into an end-to-end vision–motor control loop, enabling deep synergy between semantic reasoning and pixel-level execution. Our method achieves state-of-the-art performance across multiple long-sequence manipulation benchmarks, improving task success rate by 23.6% and significantly enhancing cross-scene generalization.

Technology Category

Application Category

📝 Abstract

Successfully solving long-horizon manipulation tasks remains a fundamental challenge. These tasks involve extended action sequences and complex object interactions, presenting a critical gap between high-level symbolic planning and low-level continuous control. To bridge this gap, two essential capabilities are required: robust long-horizon task planning and effective goal-conditioned manipulation. Existing task planning methods, including traditional and LLM-based approaches, often exhibit limited generalization or sparse semantic reasoning. Meanwhile, image-conditioned control methods struggle to adapt to unseen tasks. To tackle these problems, we propose SAGE, a novel framework for Scene Graph-Aware Guidance and Execution in Long-Horizon Manipulation Tasks. SAGE utilizes semantic scene graphs as a structural representation for scene states. A structural scene graph enables bridging task-level semantic reasoning and pixel-level visuo-motor control. This also facilitates the controllable synthesis of accurate, novel sub-goal images. SAGE consists of two key components: (1) a scene graph-based task planner that uses VLMs and LLMs to parse the environment and reason about physically-grounded scene state transition sequences, and (2) a decoupled structural image editing pipeline that controllably converts each target sub-goal graph into a corresponding image through image inpainting and composition. Extensive experiments have demonstrated that SAGE achieves state-of-the-art performance on distinct long-horizon tasks.

Problem

Research questions and friction points this paper is trying to address.

Bridging symbolic planning and continuous control for manipulation

Enabling robust long-horizon task planning with scene graphs

Generating accurate sub-goal images for unseen tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses scene graphs for semantic reasoning and control

Combines VLMs and LLMs for task planning

Employs structural image editing for sub-goal generation

🔎 Similar Papers

VICtoR: Learning Hierarchical Vision-Instruction Correlation Rewards for Long-horizon Manipulation

2024-05-26arXiv.orgCitations: 0

What Foundation Models can Bring for Robot Learning in Manipulation : A Survey

2024-04-28arXiv.orgCitations: 15

Learn2Decompose: Learning Problem Decomposition for Efficient Sequential Multi-object Manipulation Planning

2024-08-13Citations: 0