🤖 AI Summary
This work investigates whether state-of-the-art multimodal models can effectively leverage intermediate visual representations for goal-directed, multi-step reasoning in a manner analogous to human mental imagery. To this end, we introduce MentisOculi, a benchmark suite of procedurally generated, hierarchically structured multi-step reasoning tasks that systematically assesses a model's capacity to generate and utilize intermediate visual representations. Our experiments reveal that, despite strong capabilities in textual reasoning and image generation, current models struggle to integrate these modalities synergistically to improve reasoning performance. Neither implicit latent tokens nor explicitly generated images consistently improve accuracy; instead, visual intermediates often degrade performance through error propagation, exposing a fundamental limitation in the visual reasoning mechanisms of existing architectures.
📝 Abstract
Frontier models are transitioning from multimodal large language models (MLLMs), which merely ingest visual information, to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution and tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicitly generated imagery, we find that they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: while they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that, despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.