🤖 AI Summary
Existing LLM/VLM-based decision-making approaches rely solely on linguistic reasoning, which limits them in long-horizon, multimodal planning. This paper proposes Uni-Plan, the first end-to-end planning framework in which a single unified multimodal model jointly serves as the policy, dynamics model, and value function. By generating intermediate visual representations, it improves the interpretability and spatiotemporal consistency of reasoning. A novel self-discriminated filtering mechanism suppresses hallucinations in dynamics prediction. Uni-Plan combines generative modeling with joint multimodal optimization, requires no expert demonstrations, and exhibits strong data scalability. Experiments show that Uni-Plan significantly improves success rates on long-horizon planning tasks, outperforming state-of-the-art VLM-based baselines under equivalent data budgets.
📝 Abstract
With the powerful reasoning capabilities of large language models (LLMs) and vision-language models (VLMs), many recent works have explored using them for decision-making. However, most of these approaches rely solely on language-based reasoning, which limits their ability to reason about visual information and make informed decisions. Recently, a promising new direction has emerged with unified multimodal models (UMMs), which support both multimodal inputs and outputs. We believe such models have greater potential for decision-making by enabling reasoning through generated visual content. To this end, we propose Uni-Plan, a planning framework built on UMMs. Within this framework, a single model simultaneously serves as the policy, dynamics model, and value function. In addition, to avoid hallucinations in dynamics predictions, we present a novel approach, self-discriminated filtering, in which the generative model serves as its own discriminator to filter out invalid dynamics predictions. Experiments on long-horizon planning tasks show that Uni-Plan substantially improves success rates compared to VLM-based methods, while also exhibiting strong data scalability: it requires no expert demonstrations and achieves better performance under the same training-data budget. This work lays a foundation for future research in reasoning and decision-making with UMMs.