AI Summary
Vision-language models (VLMs) exhibit insufficient robustness in multi-step visual planning and struggle to detect and correct suboptimal actions. Method: We introduce CoSPlan, the first cross-domain corrective sequential planning benchmark explicitly designed for error detection and step-wise correction, covering maze navigation, block rearrangement, image reconstruction, and object reconfiguration. To address the problem without additional training, we propose Scene Graph Incremental (SGI) updating, a novel method that explicitly models progressive reasoning from initial state → intermediate state → goal state, integrating scene graph representation, chain-of-reasoning, and vision-action joint embedding. Contribution/Results: SGI boosts the average performance of VLMs (e.g., Intern-VLM, Qwen2) by 5.2% on CoSPlan and generalizes effectively to Plan-Bench and VQA tasks, demonstrating its broad applicability for enhancing general-purpose visual planning.
Abstract
Large-scale Vision-Language Models (VLMs) exhibit impressive complex reasoning capabilities but remain largely unexplored in visual sequential planning, i.e., executing multi-step actions towards a goal. Additionally, practical sequential planning often involves non-optimal (erroneous) steps, challenging VLMs to detect and correct such steps. We propose the Corrective Sequential Planning Benchmark (CoSPlan) to evaluate VLMs in error-prone, vision-based sequential planning tasks across 4 domains: maze navigation, block rearrangement, image reconstruction, and object reorganization. CoSPlan assesses two key abilities: Error Detection (identifying non-optimal actions) and Step Completion (correcting and completing action sequences to reach the goal). Despite using state-of-the-art reasoning techniques such as Chain-of-Thought and Scene Graphs, VLMs (e.g., Intern-VLM and Qwen2) struggle on CoSPlan, failing to leverage contextual cues to reach goals. Addressing this, we propose a novel training-free method, Scene Graph Incremental updates (SGI), which introduces intermediate reasoning steps between the initial and goal states. SGI helps VLMs reason about sequences, yielding an average performance gain of 5.2%. In addition to enhancing reliability in corrective sequential planning, SGI generalizes to traditional planning tasks such as Plan-Bench and VQA.
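The core idea of incrementally updating a scene graph between the initial and goal states can be illustrated with a minimal symbolic sketch. This is a hypothetical toy implementation, not the paper's actual pipeline: SGI operates on VLM-generated scene descriptions, whereas here scene graphs are hand-coded sets of (subject, relation, object) triples, and "non-optimal step" is approximated as any action that increases a simple graph distance to the goal.

```python
# Hypothetical sketch of incremental scene-graph updating for error detection.
# Scene graphs are sets of (subject, relation, object) triples; an action is a
# pair (triples_to_remove, triples_to_add). All names here are illustrative.

def apply_action(graph, action):
    """Apply an action to a scene graph, yielding the next intermediate state."""
    removed, added = action
    return (graph - removed) | added

def sgi_plan_check(initial, goal, actions):
    """Update the scene graph step by step and return the index of the first
    action that moves the state further from the goal (a non-optimal step),
    or None if every step makes progress or holds distance."""
    graph = set(initial)
    goal = set(goal)
    # Symmetric difference size as a crude graph-edit distance to the goal.
    distance = len(graph ^ goal)
    for i, action in enumerate(actions):
        graph = apply_action(graph, action)
        new_distance = len(graph ^ goal)
        if new_distance > distance:
            return i  # detected erroneous step
        distance = new_distance
    return None

# Toy block-rearrangement domain: stack block A on block B.
initial = {("A", "on", "table"), ("B", "on", "table")}
goal = {("A", "on", "B"), ("B", "on", "table")}
good_step = ({("A", "on", "table")}, {("A", "on", "B")})   # A: table -> B
bad_step = ({("B", "on", "table")}, {("B", "on", "A")})    # B: table -> A
```

Tracking the intermediate states explicitly, rather than comparing only the initial and goal images, is what lets the checker localize *which* step went wrong instead of merely noticing that the goal was missed.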