Vision Language Models Cannot Plan, but Can They Formalize?

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) perform poorly on long-horizon multimodal planning tasks, and in particular struggle to reliably formalize real-world, low-quality, multi-view images into PDDL that a classical planner can verify. Method: We propose the "VLM-as-formalizer" paradigm and design five PDDL formalization pipelines that support one-shot, open-vocabulary, multimodal inputs, integrating intermediate representations such as image captions and scene graphs to translate vision directly into PDDL domain and problem definitions. Contribution/Results: We introduce two new real-world planning benchmarks featuring multi-view, low-fidelity imagery. Experiments show that VLM-as-formalizer significantly outperforms end-to-end plan generation; the primary bottleneck is visual perception rather than language reasoning, and while intermediate textual representations partially compensate for perception errors, they do not fully resolve them.
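The formalize-then-solve loop the summary describes can be pictured as a small pipeline: a VLM emits PDDL text, and a classical planner derives the plan. The sketch below is illustrative only and not the paper's implementation; the `vlm_generate` helper and tag-based output format are invented placeholders, and the planner invocation assumes a locally installed Fast Downward whose exact command line may differ on your setup.

```python
import subprocess
import tempfile
from pathlib import Path

PROMPT = (
    "Look at the scene in the image and write a PDDL domain and problem that "
    "capture every object and spatial relation needed to plan. Return the "
    "domain between <domain>...</domain> and the problem between "
    "<problem>...</problem>."
)

def vlm_generate(image_path: str, prompt: str) -> str:
    """Hypothetical VLM call: swap in whatever multimodal API you use."""
    raise NotImplementedError

def extract(tag: str, text: str) -> str:
    """Pull the text between <tag> and </tag> from the VLM output."""
    return text.split(f"<{tag}>")[1].split(f"</{tag}>")[0].strip()

def formalize_and_solve(image_path: str) -> str:
    # 1. VLM-as-formalizer: image -> PDDL domain + problem (not a plan).
    raw = vlm_generate(image_path, PROMPT)
    domain, problem = extract("domain", raw), extract("problem", raw)

    # 2. Hand the formalization to a classical planner for a verifiable plan.
    with tempfile.TemporaryDirectory() as tmp:
        d, p = Path(tmp, "domain.pddl"), Path(tmp, "problem.pddl")
        d.write_text(domain)
        p.write_text(problem)
        # Exact invocation depends on your planner installation.
        result = subprocess.run(
            ["fast-downward.py", str(d), str(p), "--search", "astar(lmcut())"],
            capture_output=True, text=True,
        )
    return result.stdout  # plan (or failure trace) reported by the solver
```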

📝 Abstract
The advancement of vision language models (VLMs) has empowered embodied agents to accomplish simple multimodal planning tasks, but not long-horizon ones requiring long sequences of actions. In text-only simulations, long-horizon planning has seen significant improvement brought by repositioning the role of LLMs: instead of directly generating action sequences, LLMs translate the planning domain and problem into a formal planning language such as the Planning Domain Definition Language (PDDL), so that a formal solver can derive the plan in a verifiable manner. In multimodal environments, research on VLM-as-formalizer remains scarce and usually involves gross simplifications such as a predefined object vocabulary or overly similar few-shot examples. In this work, we present a suite of five VLM-as-formalizer pipelines that tackle one-shot, open-vocabulary, and multimodal PDDL formalization. We evaluate these pipelines on an existing benchmark and present another two that, for the first time, account for planning with authentic, multi-view, and low-quality images. We conclude that VLM-as-formalizer greatly outperforms end-to-end plan generation. We reveal the bottleneck to be vision rather than language, as VLMs often fail to capture an exhaustive set of necessary object relations. While generating intermediate, textual representations such as captions or scene graphs partially compensates for this, their inconsistent gains leave headroom for future research on multimodal planning formalization.
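To make the formalization target concrete, the fragment below shows the kind of problem file a VLM must produce for a simple tabletop scene, embedded here as a Python string. The domain name, objects, predicates, and goal are invented for illustration and do not come from the paper's benchmarks.

```python
# A hand-written PDDL problem of the sort the VLM must emit for a simple
# tabletop scene; every object and relation has to be grounded from pixels.
EXAMPLE_PROBLEM = """
(define (problem stack-mugs)
  (:domain tabletop)
  (:objects mug-red mug-blue tray - item
            table - surface)
  (:init (on mug-red table)
         (on mug-blue table)
         (on tray table)
         (clear mug-red)
         (clear mug-blue)
         (handempty))
  (:goal (and (on mug-red tray)
              (on mug-blue tray))))
"""
```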
Problem

Research questions and friction points this paper is trying to address.

VLMs struggle with long-horizon multimodal planning tasks
VLM-as-formalizer research for open-vocabulary PDDL translation remains scarce
Vision limitations prevent capturing an exhaustive set of object relations during formalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLMs translate multimodal scenes into the PDDL planning language
Five pipelines handle one-shot, open-vocabulary PDDL formalization
Vision limitations partially mitigated through intermediate textual representations (sketched below)
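One way to read the last bullet: the VLM first emits a textual scene graph, which is then rendered into PDDL facts before the domain and problem are assembled. The sketch below is an assumption about how such a bridge could look, not the paper's method; the triple format and the `to_init_facts` helper are invented for illustration.

```python
# Hypothetical scene-graph intermediate: (subject, relation, object) triples
# that a VLM might extract from one or more views before PDDL generation.
SceneGraph = list[tuple[str, str, str]]

scene: SceneGraph = [
    ("mug-red", "on", "table"),
    ("mug-blue", "on", "tray"),
    ("tray", "on", "table"),
]

def to_init_facts(graph: SceneGraph) -> str:
    """Render scene-graph triples as the :init section of a PDDL problem."""
    facts = "\n         ".join(f"({rel} {subj} {obj})" for subj, rel, obj in graph)
    return f"  (:init {facts})"

print(to_init_facts(scene))
```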
Authors

Muyu He (University of Pennsylvania)
Yuxi Zheng (Drexel University)
Yuchen Liu (Drexel University)
Zijian An (affiliation unknown)
Bill Cai (Drexel University)
Jiani Huang (The Hong Kong Polytechnic University)
Lifeng Zhou (Assistant Professor, Drexel University)
Feng Liu (Drexel University)
Ziyang Li (Johns Hopkins University)
Li Zhang (Drexel University)