🤖 AI Summary
To address hallucination, high computational overhead, and poor interpretability in video-generation models for long-horizon robotic manipulation, this paper proposes a symbol-guided white-box visual planning framework. The method automatically extracts task-relevant symbols from unlabeled raw videos and constructs a physically consistent, traceable symbolic transition graph to generate goal-directed visual subgoal sequences. Integrating vision foundation models, symbolic abstraction, graph-structured planning, and cross-modal image generation, the framework avoids reliance on diffusion models, thereby ensuring transparent, physically grounded reasoning. Evaluated on a real-world robot platform, the approach improves overall task success rate by 53% and accelerates visual planning by 35× compared to baseline methods. Moreover, it supports verifiable, multi-objective, multi-stage reasoning with explicit symbolic grounding and execution traceability.
📝 Abstract
Visual planning, by offering a sequence of intermediate visual subgoals to a goal-conditioned low-level policy, achieves promising performance on long-horizon manipulation tasks. To obtain the subgoals, existing methods typically resort to video generation models but suffer from model hallucination and high computational cost. We present Vis2Plan, an efficient, explainable, white-box visual planning framework powered by symbolic guidance. From raw, unlabeled play data, Vis2Plan harnesses vision foundation models to automatically extract a compact set of task symbols, which allows building a high-level symbolic transition graph for multi-goal, multi-stage planning. At test time, given a desired task goal, our planner conducts planning at the symbolic level and assembles a sequence of physically consistent intermediate subgoal images grounded by the underlying symbolic representation. Vis2Plan outperforms strong diffusion video-generation-based visual planners, delivering a 53% higher aggregate success rate in real-robot settings while generating visual plans 35× faster. The results indicate that Vis2Plan is able to generate physically consistent image goals while offering fully inspectable reasoning steps.
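The test-time procedure described above can be sketched as a graph search: plan a path through the symbolic transition graph, then ground each symbol in a representative image to form the visual subgoal sequence. The sketch below is illustrative only; the graph contents, symbol names, and image store are hypothetical stand-ins, not the paper's actual data structures.

```python
from collections import deque

# Hypothetical symbolic transition graph (symbol -> reachable next symbols),
# of the kind Vis2Plan builds from unlabeled play data. Contents are made up.
transitions = {
    "block_on_table": ["block_grasped"],
    "block_grasped":  ["block_on_shelf", "block_on_table"],
    "block_on_shelf": [],
}

# Each symbol maps to a representative subgoal image (placeholder filenames).
symbol_to_image = {
    "block_on_table": "img_017.png",
    "block_grasped":  "img_042.png",
    "block_on_shelf": "img_088.png",
}

def plan_subgoals(start, goal):
    """BFS over the symbolic graph; return the grounded subgoal image sequence."""
    queue, parent = deque([start]), {start: None}
    while queue:
        node = queue.popleft()
        if node == goal:
            # Reconstruct the symbolic path, then ground each symbol in an image.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return [symbol_to_image[s] for s in reversed(path)]
        for nxt in transitions.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None  # goal symbol unreachable in the transition graph

print(plan_subgoals("block_on_table", "block_on_shelf"))
# → ['img_017.png', 'img_042.png', 'img_088.png']
```

Because the plan is an explicit path through the graph, every subgoal image is traceable to a symbolic state, which is what makes the reasoning inspectable; a goal-conditioned low-level policy would then be fed these images one at a time.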