🤖 AI Summary
This work addresses the lack of a unified evaluation benchmark for visual planning by introducing ViPlan, the first open-source benchmark to cover a visual Blocksworld domain and simulated household robotics scenarios with symbolic predicate modeling and multi-VLM evaluation. Methodologically, it systematically compares VLM-grounded symbolic planning (PDDL planning over predicates extracted from images by a VLM) against end-to-end planning in which the VLM proposes actions directly, evaluating nine open-source VLM families across multiple sizes alongside selected closed models. The key findings: symbolic planning significantly outperforms the end-to-end approach in Blocksworld, where accurate image grounding is crucial, whereas end-to-end VLMs excel in household tasks that reward commonsense knowledge and the ability to recover from errors, showing that task characteristics largely determine which method is effective. Moreover, Chain-of-Thought prompting yields no consistent improvement, exposing persistent limitations in current VLMs' visual reasoning. ViPlan thus provides the first standardized infrastructure for fair, reproducible comparison across these two planning paradigms.
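The VLM-grounded symbolic planning loop described above can be sketched as follows. This is an illustrative toy, not ViPlan's actual implementation: the `ground_predicates` stub stands in for querying a VLM with per-predicate yes/no questions about the current image, and the Blocksworld operators and predicate names are assumptions chosen for the example.

```python
from collections import deque

def ground_predicates(observation):
    # Stub for the VLM grounding step: a real pipeline would show the current
    # image to a VLM and ask one yes/no question per predicate
    # ("Is block C on block A?"). Here the observation is already symbolic.
    return frozenset(observation)

def successors(state, blocks):
    """Yield (action, next_state) pairs for toy Blocksworld moves."""
    for x in blocks:
        if ("clear", x) not in state:
            continue  # only a clear block can be picked up
        under = next((y for y in blocks if ("on", x, y) in state), None)
        if under is not None:  # unstack x onto the table
            nxt = (state - {("on", x, under)}) | {("on-table", x), ("clear", under)}
            yield ("to-table", x), frozenset(nxt)
        for z in blocks:  # stack x onto another clear block z
            if z in (x, under) or ("clear", z) not in state:
                continue
            if ("on-table", x) in state:
                nxt = (state - {("on-table", x), ("clear", z)}) | {("on", x, z)}
            else:
                nxt = (state - {("on", x, under), ("clear", z)}) | {("on", x, z), ("clear", under)}
            yield ("stack", x, z), frozenset(nxt)

def plan(observation, goal, blocks):
    """Ground the state, then breadth-first search to any state containing the goal."""
    start = ground_predicates(observation)
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        state, actions = frontier.popleft()
        if goal <= state:
            return actions
        for action, nxt in successors(state, blocks):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, actions + [action]))
    return None  # goal unreachable
```

For example, with C stacked on A and the goal of building the tower A-on-B-on-C, the search recovers the three-step plan: move C to the table, stack B on C, then stack A on B. The end-to-end alternative benchmarked in the paper would instead skip the predicate grounding and planner entirely and ask the VLM to propose the next action directly from the image.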
📝 Abstract
Integrating Large Language Models with symbolic planners is a promising direction for obtaining verifiable and grounded plans compared to planning in natural language, with recent works extending this idea to visual domains using Vision-Language Models (VLMs). However, rigorous comparison between VLM-grounded symbolic approaches and methods that plan directly with a VLM has been hindered by a lack of common environments, evaluation protocols, and model coverage. We introduce ViPlan, the first open-source benchmark for Visual Planning with symbolic predicates and VLMs. ViPlan features a series of increasingly challenging tasks in two domains: a visual variant of the classic Blocksworld planning problem and a simulated household robotics environment. We benchmark nine open-source VLM families across multiple sizes, along with selected closed models, evaluating both VLM-grounded symbolic planning and direct use of the models to propose actions. We find symbolic planning to outperform direct VLM planning in Blocksworld, where accurate image grounding is crucial, whereas the opposite is true in the household robotics tasks, where commonsense knowledge and the ability to recover from errors are beneficial. Finally, we show that across most models and methods, there is no significant benefit to using Chain-of-Thought prompting, suggesting that current VLMs still struggle with visual reasoning.