Symbolically-Guided Visual Plan Inference from Uncurated Video Data

📅 2025-05-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address hallucination, high computational overhead, and poor interpretability in video-generation models for long-horizon robotic manipulation, this paper proposes a symbol-guided white-box visual planning framework. The method automatically extracts task-relevant symbols from unlabeled raw videos and constructs a physically consistent, traceable symbolic transition graph to generate goal-directed visual subgoal sequences. Integrating vision foundation models, symbolic abstraction, graph-structured planning, and cross-modal image generation, the framework avoids reliance on diffusion models, thereby ensuring transparent, physically grounded reasoning. Evaluated on a real-world robot platform, the approach improves overall task success rate by 53% and accelerates visual planning by 35× compared to baseline methods. Moreover, it supports verifiable, multi-objective, multi-stage reasoning with explicit symbolic grounding and execution traceability.

📝 Abstract
Visual planning, by offering a sequence of intermediate visual subgoals to a goal-conditioned low-level policy, achieves promising performance on long-horizon manipulation tasks. To obtain the subgoals, existing methods typically resort to video generation models but suffer from model hallucination and computational cost. We present Vis2Plan, an efficient, explainable, and white-box visual planning framework powered by symbolic guidance. From raw, unlabeled play data, Vis2Plan harnesses vision foundation models to automatically extract a compact set of task symbols, which allows building a high-level symbolic transition graph for multi-goal, multi-stage planning. At test time, given a desired task goal, our planner conducts planning at the symbolic level and assembles a sequence of physically consistent intermediate sub-goal images grounded by the underlying symbolic representation. Our Vis2Plan outperforms strong diffusion video generation-based visual planners by delivering a 53% higher aggregate success rate in real robot settings while generating visual plans 35× faster. The results indicate that Vis2Plan is able to generate physically consistent image goals while offering fully inspectable reasoning steps.
Problem

Research questions and friction points this paper is trying to address.

Overcoming model hallucination in video-based visual planning
Reducing computational cost of generating intermediate subgoals
Achieving physically consistent goal images from unlabeled data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Symbolic guidance for visual planning
Vision foundation models extract task symbols
Symbolic transition graph enables multi-goal planning
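The planning idea behind the innovation list above can be sketched in a few lines: search the symbolic transition graph for a path from the current symbolic state to the goal state, then ground each symbolic state along the path in a representative image taken from the play data. The sketch below is a minimal, hypothetical illustration (breadth-first search over a toy graph); the dictionary format, state names, and `state_to_image` mapping are assumptions for illustration, not the paper's actual data structures.

```python
from collections import deque

def plan_subgoals(transitions, state_to_image, start, goal):
    """BFS over a symbolic transition graph; returns the subgoal
    image sequence along the shortest symbolic path, or None."""
    frontier = deque([start])
    parent = {start: None}        # also serves as the visited set
    while frontier:
        state = frontier.popleft()
        if state == goal:
            # Reconstruct the symbolic path, then ground each
            # symbolic state in a stored representative frame.
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            path.reverse()
            return [state_to_image[s] for s in path]
        for nxt in transitions.get(state, ()):
            if nxt not in parent:
                parent[nxt] = state
                frontier.append(nxt)
    return None

# Toy example: three symbolic states for a pick-and-place task
# (state names and file names are made up for illustration).
transitions = {
    "block_on_table": {"block_grasped"},
    "block_grasped": {"block_on_shelf"},
}
state_to_image = {s: f"frame_{s}.png" for s in
                  ["block_on_table", "block_grasped", "block_on_shelf"]}
plan = plan_subgoals(transitions, state_to_image,
                     "block_on_table", "block_on_shelf")
print(plan)
```

Because the search runs over a small discrete graph rather than through a video diffusion model, every step of the resulting plan is inspectable and the whole computation is cheap, which is the intuition behind the reported 35× speedup.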
Wenyan Yang
Aalto University
Computer Vision, Imitation Learning, Reinforcement Learning
Ahmet Tikna
Department of Engineering and Computer Science, University of Trento
Yi Zhao
Department of Electrical Engineering and Automation, Aalto University
Yuying Zhang
Department of Electrical Engineering and Automation, Aalto University
Luigi Palopoli
Department of Engineering and Computer Science, University of Trento
Marco Roveri
University of Trento - Department of Information Engineering and Computer Science
Formal Methods, Artificial Intelligence, Computer Science
Joni Pajarinen
Associate Professor at Aalto University
Reinforcement Learning, Robotics, Machine Learning