🤖 AI Summary
Multimodal Procedural Planning (MPP) faces the core challenge of cross-modal misalignment between textual instructions and visual object states. To address this, we propose Object State Reasoning Chain-of-Thought (OSR-CoT), a novel prompting method that enables, for the first time, zero-shot stepwise text–image collaborative plan generation. We further introduce an LLM-as-a-judge evaluation protocol and a visual step-reordering task to systematically quantify cross-modal alignment and temporal coherence. Our contributions are twofold: (1) an explicit chain-of-reasoning mechanism that models object-state transitions, and (2) a dual-path evaluation framework that decouples assessment from generation. On the RECIPEPLAN and WIKIPLAN benchmarks, OSR-CoT achieves gains of +6.8% in textual planning accuracy, +11.9% in cross-modal alignment, and +26.7% in visual step-ordering accuracy, substantially outperforming prior methods.
📝 Abstract
Multimodal Procedural Planning (MPP) aims to generate step-by-step instructions that combine text and images, with the central challenge of preserving object-state consistency across modalities while producing informative plans. Existing approaches often leverage large language models (LLMs) to refine textual steps; however, visual object-state alignment and systematic evaluation remain largely underexplored. We present MMPlanner, a zero-shot MPP framework that introduces Object State Reasoning Chain-of-Thought (OSR-CoT) prompting to explicitly model object-state transitions and generate accurate multimodal plans. To assess plan quality, we design LLM-as-a-judge protocols for planning accuracy and cross-modal alignment, and further propose a visual step-reordering task to measure temporal coherence. Experiments on RECIPEPLAN and WIKIPLAN show that MMPlanner achieves state-of-the-art performance, improving textual planning accuracy by +6.8%, cross-modal alignment by +11.9%, and visual step-ordering accuracy by +26.7%.
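The abstract describes OSR-CoT only at a high level. As a rough illustration, the idea of prompting a model to track object states before and after each step can be sketched as a prompt template. The function name, field wording, and prompt phrasing below are hypothetical assumptions for illustration, not taken from the paper:

```python
# Illustrative sketch of an OSR-CoT-style prompt, assuming the method asks
# the model to reason about object-state transitions between plan steps.
# All names and wording here are hypothetical, not the paper's actual prompt.

def build_osr_cot_prompt(goal: str, objects: list[str]) -> str:
    """Assemble a zero-shot prompt that asks the model to track each
    object's state before and after every step, so that adjacent steps
    (and their paired image captions) stay temporally coherent."""
    object_list = ", ".join(objects)
    return (
        f"Goal: {goal}\n"
        f"Relevant objects: {object_list}\n"
        "For each step of the plan:\n"
        "1. State each object's condition BEFORE the step.\n"
        "2. Give the textual instruction for the step.\n"
        "3. State each object's condition AFTER the step.\n"
        "4. Write an image caption depicting the post-step object states.\n"
        "Ensure every AFTER state matches the next step's BEFORE state."
    )

prompt = build_osr_cot_prompt("make scrambled eggs", ["eggs", "pan", "butter"])
print(prompt)
```

The explicit BEFORE/AFTER bookkeeping is what ties the text and image modalities together: the caption for each step is grounded in the same object states the next textual step assumes.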