🤖 AI Summary
To address hallucination, high computational overhead, and poor interpretability in video-generation models for long-horizon robotic manipulation, this paper proposes a symbol-guided white-box visual planning framework. The method automatically extracts task-relevant symbols from unlabeled raw videos and constructs a physically consistent, traceable symbolic transition graph to generate goal-directed visual subgoal sequences. Integrating vision foundation models, symbolic abstraction, graph-structured planning, and cross-modal image generation, the framework avoids reliance on diffusion models, thereby ensuring transparent, physically grounded reasoning. Evaluated on a real-world robot platform, the approach improves overall task success rate by 53% and accelerates visual planning by 35× compared to baseline methods. Moreover, it supports verifiable, multi-objective, multi-stage reasoning with explicit symbolic grounding and execution traceability.
📝 Abstract
Visual planning, by offering a sequence of intermediate visual subgoals to a goal-conditioned low-level policy, achieves promising performance on long-horizon manipulation tasks. To obtain the subgoals, existing methods typically resort to video generation models but suffer from model hallucination and high computational cost. We present Vis2Plan, an efficient, explainable, white-box visual planning framework powered by symbolic guidance. From raw, unlabeled play data, Vis2Plan harnesses vision foundation models to automatically extract a compact set of task symbols, which allows building a high-level symbolic transition graph for multi-goal, multi-stage planning. At test time, given a desired task goal, our planner conducts planning at the symbolic level and assembles a sequence of physically consistent intermediate subgoal images grounded by the underlying symbolic representation. Vis2Plan outperforms strong diffusion video-generation-based visual planners, delivering a 53% higher aggregate success rate in real-robot settings while generating visual plans 35× faster. The results indicate that Vis2Plan is able to generate physically consistent image goals while offering fully inspectable reasoning steps.
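The test-time procedure described above can be sketched as a graph search: plan a path through the symbolic transition graph, then ground each symbol in a representative image to form the visual subgoal sequence. The sketch below is illustrative only; the graph contents, symbol names, and image store are hypothetical stand-ins, not the paper's actual data structures.

```python
from collections import deque

# Hypothetical symbolic transition graph (symbol -> reachable next symbols),
# of the kind Vis2Plan builds from unlabeled play data. Contents are made up.
transitions = {
    "block_on_table": ["block_grasped"],
    "block_grasped":  ["block_on_shelf", "block_on_table"],
    "block_on_shelf": [],
}

# Each symbol maps to a representative subgoal image (placeholder filenames).
symbol_to_image = {
    "block_on_table": "img_017.png",
    "block_grasped":  "img_042.png",
    "block_on_shelf": "img_088.png",
}

def plan_subgoals(start, goal):
    """BFS over the symbolic graph; return the grounded subgoal image sequence."""
    queue, parent = deque([start]), {start: None}
    while queue:
        node = queue.popleft()
        if node == goal:
            # Reconstruct the symbolic path, then ground each symbol in an image.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return [symbol_to_image[s] for s in reversed(path)]
        for nxt in transitions.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None  # goal symbol unreachable in the transition graph

print(plan_subgoals("block_on_table", "block_on_shelf"))
# → ['img_017.png', 'img_042.png', 'img_088.png']
```

Because the plan is an explicit path through the graph, every subgoal image is traceable to a symbolic state, which is what makes the reasoning inspectable; a goal-conditioned low-level policy would then be fed these images one at a time.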