MentisOculi: Revealing the Limits of Reasoning with Mental Imagery

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether state-of-the-art multimodal models can effectively leverage intermediate visual representations for goal-directed, multi-step reasoning in a manner analogous to human mental imagery. To this end, we introduce the MentisOculi benchmark suite—a structured evaluation framework comprising procedurally generated, hierarchically designed multi-step reasoning tasks—that systematically assesses a model’s capacity to generate and utilize visual intermediate representations. Our experiments reveal that, despite strong capabilities in textual reasoning and image generation, current models struggle to integrate these modalities synergistically to enhance reasoning performance. Neither implicit latent tokens nor explicitly generated images consistently improve accuracy; instead, visual intermediates often degrade performance due to error propagation, exposing a fundamental limitation in the visual reasoning mechanisms of existing architectures.

📝 Abstract
Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.
Problem

Research questions and friction points this paper is trying to address.

mental imagery
multimodal reasoning
visual representation
unified multimodal models
intermediate visualization
Innovation

Methods, ideas, or system contributions that make the work stand out.

mental imagery
unified multimodal models
visual reasoning
interleaved generation
MentisOculi
Jana Zeller
Max Planck Institute for Intelligent Systems, Tübingen, Germany; ELLIS Institute Tübingen, Tübingen, Germany; University of Tübingen, Tübingen, Germany
Thaddäus Wiedemer
Max Planck Institute for Intelligent Systems & University of Tübingen
Fanfei Li
Max Planck Institute for Intelligent Systems, Tübingen, Germany; ELLIS Institute Tübingen, Tübingen, Germany
Thomas Klein
Max Planck Institute for Intelligent Systems, Tübingen, Germany; ELLIS Institute Tübingen, Tübingen, Germany; University of Tübingen, Tübingen, Germany
Prasanna Mayilvahanan
Max Planck Institute for Intelligent Systems & University of Tübingen
Generalization, Reasoning, Discovery
Matthias Bethge
Tübingen University & Maddox Co-Founder
Computational Neuroscience, Machine Learning, Vision
Felix Wichmann
Eberhard Karls Universität Tübingen
psychophysics, vision, visual perception, human vision
Ryan Cotterell
ETH Zürich
Language, Learning, Information
Wieland Brendel
Fellow at ELLIS Institute Tübingen, Group Leader, Max Planck Institute for Intelligent Systems
machine learning, computer vision