🤖 AI Summary
Existing LLM/VLM-based decision-making approaches rely solely on linguistic reasoning, which limits them in long-horizon, multimodal planning. This paper proposes Uni-Plan, the first end-to-end planning framework in which a single unified multimodal model jointly serves as the policy, dynamics model, and value function. By generating intermediate visual representations, it improves the interpretability and spatiotemporal consistency of reasoning. A novel self-discriminated filtering mechanism suppresses hallucinations in dynamics prediction. Uni-Plan combines generative modeling with joint multimodal optimization, requires no expert demonstrations, and exhibits strong data scalability. Experiments show that Uni-Plan significantly improves success rates on long-horizon planning tasks, outperforming state-of-the-art VLM-based baselines under equivalent data budgets.
📝 Abstract
With the powerful reasoning capabilities of large language models (LLMs) and vision-language models (VLMs), many recent works have explored using them for decision-making. However, most of these approaches rely solely on language-based reasoning, which limits their ability to reason about visual information and make informed decisions. Recently, a promising new direction has emerged with unified multimodal models (UMMs), which support both multimodal inputs and outputs. We believe such models have greater potential for decision-making by enabling reasoning through generated visual content. To this end, we propose Uni-Plan, a planning framework built on UMMs. Within this framework, a single model simultaneously serves as the policy, dynamics model, and value function. In addition, to avoid hallucinations in dynamics predictions, we present a novel approach, self-discriminated filtering, in which the generative model serves as its own discriminator to filter out invalid dynamics predictions. Experiments on long-horizon planning tasks show that Uni-Plan substantially improves success rates compared to VLM-based methods, while also exhibiting strong data scalability: it requires no expert demonstrations and achieves better performance under the same training-data budget. This work lays a foundation for future research in reasoning and decision-making with UMMs.