🤖 AI Summary
Multimodal Procedural Planning (MPP) faces the core challenge of cross-modal misalignment between textual instructions and visual object states. To address this, we propose Object State Reasoning Chain-of-Thought (OSR-CoT), a novel prompting method that enables, for the first time, zero-shot stepwise text–image collaborative plan generation. We further introduce an LLM-as-a-judge evaluation protocol and a visual step-reordering task to systematically quantify cross-modal alignment and temporal coherence. Our contributions are twofold: (1) an explicit chain-of-reasoning mechanism that models object-state transitions, and (2) a dual-path evaluation framework that decouples assessment from generation. On the RECIPEPLAN and WIKIPLAN benchmarks, OSR-CoT achieves gains of +6.8% in textual planning accuracy, +11.9% in cross-modal alignment, and +26.7% in visual step-ordering accuracy, substantially outperforming prior methods.
📝 Abstract
Multimodal Procedural Planning (MPP) aims to generate step-by-step instructions that combine text and images, with the central challenge of preserving object-state consistency across modalities while producing informative plans. Existing approaches often leverage large language models (LLMs) to refine textual steps; however, visual object-state alignment and systematic evaluation remain largely underexplored. We present MMPlanner, a zero-shot MPP framework that introduces Object State Reasoning Chain-of-Thought (OSR-CoT) prompting to explicitly model object-state transitions and generate accurate multimodal plans. To assess plan quality, we design LLM-as-a-judge protocols for planning accuracy and cross-modal alignment, and further propose a visual step-reordering task to measure temporal coherence. Experiments on RECIPEPLAN and WIKIPLAN show that MMPlanner achieves state-of-the-art performance, improving textual planning accuracy by +6.8%, cross-modal alignment by +11.9%, and visual step-ordering accuracy by +26.7%.
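The abstract describes OSR-CoT only at a high level. As a rough illustration, the idea of prompting a model to track object states before and after each step can be sketched as a prompt template. The function name, field wording, and prompt phrasing below are hypothetical assumptions for illustration, not taken from the paper:

```python
# Illustrative sketch of an OSR-CoT-style prompt, assuming the method asks
# the model to reason about object-state transitions between plan steps.
# All names and wording here are hypothetical, not the paper's actual prompt.

def build_osr_cot_prompt(goal: str, objects: list[str]) -> str:
    """Assemble a zero-shot prompt that asks the model to track each
    object's state before and after every step, so that adjacent steps
    (and their paired image captions) stay temporally coherent."""
    object_list = ", ".join(objects)
    return (
        f"Goal: {goal}\n"
        f"Relevant objects: {object_list}\n"
        "For each step of the plan:\n"
        "1. State each object's condition BEFORE the step.\n"
        "2. Give the textual instruction for the step.\n"
        "3. State each object's condition AFTER the step.\n"
        "4. Write an image caption depicting the post-step object states.\n"
        "Ensure every AFTER state matches the next step's BEFORE state."
    )

prompt = build_osr_cot_prompt("make scrambled eggs", ["eggs", "pan", "butter"])
print(prompt)
```

The explicit BEFORE/AFTER bookkeeping is what ties the text and image modalities together: the caption for each step is grounded in the same object states the next textual step assumes.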