Toward Cognitive Supersensing in Multimodal Large Language Models

📅 2026-02-02
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitation of existing multimodal large language models (MLLMs) on complex cognitive tasks that require human-like visual reasoning grounded in visual memory and abstract visual detail. To bridge this gap, we propose Cognitive Supersensing, a novel training paradigm that introduces visual imagery mechanisms into MLLMs for the first time. Specifically, a Latent Visual Imagery Prediction (LVIP) head jointly learns sequences of latent visual cognitive embeddings aligned with target answers, thereby constructing an internal vision-based reasoning chain. A subsequent reinforcement learning stage refines this chain by optimizing textual reasoning pathways grounded in the visual latents. The resulting framework establishes a synergistic vision–language cognitive reasoning architecture that transcends purely text-based chain-of-thought reasoning. Evaluated on our newly curated CogSense-Bench benchmark, the method significantly outperforms current approaches and generalizes strongly to out-of-domain mathematical and scientific visual question answering tasks.
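
The LVIP mechanism described above can be pictured with a minimal sketch. The paper does not release code here, so everything below (the class name LVIPHead, the learnable-query design, all dimensions, and the cosine alignment loss) is an illustrative assumption of what such a head could look like, not the authors' implementation.

```python
# Illustrative sketch only: module names, dimensions, and the alignment
# loss are assumptions, not the authors' API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LVIPHead(nn.Module):
    """Predicts a sequence of latent visual imagery embeddings from the
    MLLM's hidden states, forming an internal vision-based reasoning chain."""

    def __init__(self, hidden_dim=4096, visual_dim=1024, num_latents=8):
        super().__init__()
        # One learnable query per step of the imagery chain.
        self.queries = nn.Parameter(torch.randn(num_latents, hidden_dim) * 0.02)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(hidden_dim, visual_dim)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_dim) from the language backbone.
        q = self.queries.unsqueeze(0).expand(hidden_states.size(0), -1, -1)
        latents, _ = self.attn(q, hidden_states, hidden_states)
        return self.proj(latents)  # (batch, num_latents, visual_dim)

def alignment_loss(pred_latents, target_latents):
    """One plausible alignment objective: cosine distance between predicted
    imagery latents and reference visual embeddings (e.g., from a frozen
    vision encoder)."""
    return 1.0 - F.cosine_similarity(pred_latents, target_latents, dim=-1).mean()
```

During training, an alignment term of this kind would plausibly be added to the usual language-modeling loss, so the model learns to "imagine" the relevant visual evidence while producing the answer.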

๐Ÿ“ Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable success in open-vocabulary perceptual tasks, yet their ability to solve complex cognitive problems remains limited, especially when visual details are abstract and require visual memory. Current approaches primarily scale Chain-of-Thought (CoT) reasoning in the text space, even when language alone is insufficient for clear and structured reasoning, and largely neglect visual reasoning mechanisms analogous to the human visuospatial sketchpad and visual imagery. To mitigate this deficiency, we introduce Cognitive Supersensing, a novel training paradigm that endows MLLMs with human-like visual imagery capabilities by integrating a Latent Visual Imagery Prediction (LVIP) head that jointly learns sequences of visual cognitive latent embeddings and aligns them with the answer, thereby forming vision-based internal reasoning chains. We further introduce a reinforcement learning stage that optimizes text reasoning paths grounded in these visual latents. To evaluate the cognitive capabilities of MLLMs, we present CogSense-Bench, a comprehensive visual question answering (VQA) benchmark assessing five cognitive dimensions. Extensive experiments demonstrate that MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines on CogSense-Bench and exhibit superior generalization on out-of-domain mathematics and science VQA benchmarks, suggesting that internal visual imagery is potentially key to bridging the gap between perceptual recognition and cognitive understanding. We will open-source CogSense-Bench and our model weights.
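
The reinforcement learning stage can be sketched in the same spirit. The abstract does not specify the reward design, so the helper reasoning_reward below, the task-reward-plus-grounding-bonus decomposition, and the weight alpha are hypothetical choices for exposition only.

```python
# Illustrative sketch only: the reward decomposition and `alpha` are
# assumptions; the paper does not detail its RL objective here.
import torch.nn.functional as F

def reasoning_reward(answer_correct, text_embedding, visual_latents, alpha=0.5):
    """Scores one sampled text reasoning path.

    text_embedding: (dim,) embedding of the sampled rationale.
    visual_latents: (num_latents, dim) imagery latents from an LVIP-style
    head; the bonus favors rationales consistent with them.
    """
    task_reward = 1.0 if answer_correct else 0.0
    grounding = F.cosine_similarity(
        text_embedding.unsqueeze(0), visual_latents, dim=-1
    ).mean().item()
    return task_reward + alpha * grounding
```

A policy-gradient optimizer (e.g., PPO- or GRPO-style updates) would then maximize such a reward over sampled reasoning paths, with the visual latents serving as the grounding signal.
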
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Cognitive Reasoning
Visual Imagery
Visual Memory
Chain-of-Thought
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cognitive Supersensing
Latent Visual Imagery Prediction
Multimodal Large Language Models
Visual Reasoning
CogSense-Bench