On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks

📅 2026-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the limited out-of-distribution (OOD) generalization of multimodal large language models on visual planning tasks. The authors introduce a grid-based navigation evaluation framework to systematically analyze how well Chain-of-Thought (CoT) reasoning generalizes under OOD conditions, comparing input modalities (visual vs. textual) and CoT strategies. Experiments show that while CoT improves in-distribution performance, most models struggle to generalize to OOD scenarios such as larger maps. Notably, reasoning traces expressed in a hybrid textual format achieve non-trivial OOD generalization, and purely text-based models consistently outperform multimodal counterparts that rely on visual inputs. These findings highlight the critical role of input representation in determining the generalization capacity of planning systems.

📝 Abstract
Integrating reasoning into large language models and large vision-language models has recently led to significant improvements in their capabilities. However, the generalization of reasoning models is still vaguely defined and poorly understood. In this work, we present an evaluation framework to rigorously examine how well chain-of-thought (CoT) approaches generalize on a simple planning task. Specifically, we consider a grid-based navigation task in which a model is provided with a map and must output a sequence of moves that guides a player from a start position to a goal while avoiding obstacles. The versatility of the task and its data allows us to fine-tune model variants using different input representations (visual and textual) and CoT reasoning strategies, and to systematically evaluate them under both in-distribution (ID) and out-of-distribution (OOD) test conditions. Our experiments show that, while CoT reasoning improves in-distribution generalization across all representations, out-of-distribution generalization (e.g., to larger maps) remains very limited in most cases when controlling for trivial matches with the ID data. Surprisingly, we find that reasoning traces which combine multiple text formats yield the best (and non-trivial) OOD generalization. Finally, purely text-based models consistently outperform those using image-based inputs, including a recently proposed approach relying on latent-space reasoning.
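To make the task concrete, here is a minimal sketch of one instance of the grid navigation problem the abstract describes. The map encoding (`S`/`G`/`#` characters), the move names, and the BFS reference solver are illustrative assumptions for this sketch, not the paper's actual data format or method; such a solver is how ground-truth move sequences for a dataset like this are typically generated.

```python
from collections import deque

# Toy map (illustrative encoding, not the paper's): 'S' = start,
# 'G' = goal, '#' = obstacle, '.' = free cell.
GRID = [
    "S.#",
    ".##",
    "..G",
]

# Assumed move vocabulary: name -> (row delta, column delta).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def find(grid, ch):
    """Return the (row, col) of the first cell containing ch."""
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell == ch:
                return r, c
    raise ValueError(f"{ch!r} not found in grid")

def solve(grid):
    """Return a shortest move sequence from 'S' to 'G' avoiding '#',
    found by breadth-first search, or None if the goal is unreachable."""
    rows, cols = len(grid), len(grid[0])
    start, goal = find(grid, "S"), find(grid, "G")
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for name, (dr, dc) in MOVES.items():
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != "#" and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [name]))
    return None

print(solve(GRID))  # → ['down', 'down', 'right', 'right']
```

The paper's OOD axis (e.g., larger maps) corresponds here to evaluating a model fine-tuned on small grids on instances where `GRID` has more rows and columns than anything seen during training.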
Problem

Research questions and friction points this paper is trying to address.

out-of-distribution generalization
reasoning
multimodal LLMs
visual planning
chain-of-thought
Innovation

Methods, ideas, or system contributions that make the work stand out.

out-of-distribution generalization
chain-of-thought reasoning
multimodal LLMs
visual planning
text-based reasoning