Do multimodal models imagine electric sheep?

📅 2026-05-10

🤖 AI Summary

This study investigates whether multimodal models spontaneously develop internal visual representations, akin to “mental imagery”, when solving spatial reasoning tasks. The authors fine-tune a Qwen3.5 VLM on twelve visual reasoning puzzles, supervising it only to predict the open-loop sequence of actions that leads from an initial state to a solution. Although training involves no explicit visual supervision, the model's activations after each action encode visual information about the intermediate state, which suggests that an imperfect visual world model forms as a byproduct of learning to select correct actions. The authors further propose integrating as few as sixteen visual tokens per step into the chain of thought, which raises the average solve rate from 83% to 89%, with particularly strong gains on reasoning-heavy tasks such as jigsaw and 3D mental rotation.

📝 Abstract

Yes. We find that large multimodal models develop mental imagery when solving spatial puzzles, and they do imagine sheep when solving sheep puzzles. We fine-tune a Qwen3.5 VLM to solve twelve diverse visual reasoning tasks -- including tangram, jigsaw, sokoban, 3D mental rotation, and rush hour -- that require understanding geometry, spatial relationships, and the consequences of actions. By supervising the model to predict the open-loop sequence of actions to solve a puzzle from an initial state, we show that the model's activations after each action encode meaningful visual information about the intermediate state. This finding suggests that an imperfect visual world model begins to form as a byproduct of learning to select correct actions, in the absence of any explicit visual supervision. Building on this observation, we propose two ways to sharpen and use the mental images formed by the model. We find that integrating as few as sixteen visual tokens per step into the chain of thought improves the average solve rate from 83% to 89%, with particularly strong gains on reasoning-heavy tasks such as jigsaw and 3D mental rotation.
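
The abstract does not say how the visual content of the activations was measured. One common way to test a claim of this kind is a linear probe: fit a simple linear map from hidden activations to a representation of the intermediate state and check how well it generalizes to held-out examples. The sketch below is only an illustration under stated assumptions, not the authors' method. It uses synthetic data in place of real Qwen3.5 activations and puzzle images, and every function name and parameter in it is hypothetical.

```python
import numpy as np


def fit_ridge_probe(activations, targets, l2=1e-2):
    """Fit a linear map W so that activations @ W approximates targets.

    Closed-form ridge regression: W = (A^T A + l2 * I)^-1 A^T Y.
    """
    dim = activations.shape[1]
    gram = activations.T @ activations + l2 * np.eye(dim)
    return np.linalg.solve(gram, activations.T @ targets)


def r2_score(pred, target):
    """Coefficient of determination, pooled over all output dimensions."""
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - target.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot


def make_synthetic_data(n=400, hidden_dim=64, state_dim=16, noise=0.1, seed=0):
    """Synthetic stand-in (assumption): activations are a noisy linear
    mixture of the state features, so the state is linearly decodable."""
    rng = np.random.default_rng(seed)
    states = rng.normal(size=(n, state_dim))  # e.g. flattened state features
    mixing = rng.normal(size=(state_dim, hidden_dim))
    activations = states @ mixing + noise * rng.normal(size=(n, hidden_dim))
    return activations, states


def probe_r2(activations, states, n_train=300):
    """Train the probe on the first n_train rows, score it on the rest."""
    weights = fit_ridge_probe(activations[:n_train], states[:n_train])
    return r2_score(activations[n_train:] @ weights, states[n_train:])


if __name__ == "__main__":
    acts, states = make_synthetic_data()
    print(f"held-out R^2 of the linear probe: {probe_r2(acts, states):.3f}")
```

In a real experiment, `activations` would be hidden states read out after each predicted action and `states` would be features of the true intermediate puzzle image. A high held-out score relative to a shuffled-target control would indicate that the state is linearly decodable, which is weaker than showing the model actually uses that information.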

Problem

Research questions and friction points this paper is trying to address.

mental imagery

multimodal models

visual reasoning

spatial puzzles

world model

Innovation

Methods, ideas, or system contributions that make the work stand out.

mental imagery

visual reasoning

multimodal models