🤖 AI Summary
Long-horizon manipulation tasks suffer from a critical disconnect between high-level symbolic planning and low-level continuous control.
Method: We propose a scene-graph-aware vision-language coordination framework, where a structured scene graph serves as the semantic hub—unifying object representations, relational semantics, and action logic. By tightly coupling vision-language models (VLMs) and large language models (LLMs), our approach enables long-horizon task planning; decoupled image inpainting and compositional editing further generate controllable subgoal images to support physically plausible state-transition reasoning.
Contribution/Results: To our knowledge, this is the first work to explicitly embed scene graphs into an end-to-end vision–motor control loop, enabling deep synergy between semantic reasoning and pixel-level execution. Our method achieves state-of-the-art performance across multiple long-sequence manipulation benchmarks, improving task success rate by 23.6% and significantly enhancing cross-scene generalization.
📝 Abstract
Successfully solving long-horizon manipulation tasks remains a fundamental challenge. These tasks involve extended action sequences and complex object interactions, presenting a critical gap between high-level symbolic planning and low-level continuous control. To bridge this gap, two essential capabilities are required: robust long-horizon task planning and effective goal-conditioned manipulation. Existing task planning methods, including traditional and LLM-based approaches, often exhibit limited generalization or sparse semantic reasoning. Meanwhile, image-conditioned control methods struggle to adapt to unseen tasks. To tackle these problems, we propose SAGE, a novel framework for Scene Graph-Aware Guidance and Execution in Long-Horizon Manipulation Tasks. SAGE utilizes semantic scene graphs as a structural representation for scene states. A structural scene graph enables bridging task-level semantic reasoning and pixel-level visuo-motor control. This also facilitates the controllable synthesis of accurate, novel sub-goal images. SAGE consists of two key components: (1) a scene graph-based task planner that uses VLMs and LLMs to parse the environment and reason about physically-grounded scene state transition sequences, and (2) a decoupled structural image editing pipeline that controllably converts each target sub-goal graph into a corresponding image through image inpainting and composition. Extensive experiments have demonstrated that SAGE achieves state-of-the-art performance on distinct long-horizon tasks.