From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from a fundamental perception-cognition misalignment: visual inputs trigger only shallow cross-modal alignment and fail to support coherent internal world modeling, leading to pervasive hallucinations and failures in higher-order reasoning. This survey introduces a two-tier "perception-to-cognition" analytical framework that exposes the structural gap between low-level visual representations and high-level symbolic reasoning, advocating a dynamic observe-think-verify reasoning loop. Methodologically, it surveys techniques for fine-grained cross-modal alignment, multi-step chain-of-thought reasoning, and explicit hallucination suppression, and reviews the benchmarks used to evaluate state-of-the-art MLLMs on critical reasoning tasks. The analysis identifies the core bottlenecks impeding deep multimodal reasoning and outlines a scalable pathway toward trustworthy internal world models, establishing both theoretical foundations and practical guidelines for next-generation embodied cognitive MLLMs.

📝 Abstract
Multimodal Large Language Models (MLLMs) strive to achieve a profound, human-like understanding of and interaction with the physical world, but often exhibit a shallow and incoherent integration when acquiring information (Perception) and conducting reasoning (Cognition). This disconnect leads to a spectrum of reasoning failures, with hallucination being the most prominent. Collectively, these issues expose a fundamental challenge: the ability to process pixels does not yet confer the ability to construct a coherent, credible internal world model. To systematically dissect and address this challenge, this survey introduces a novel and unified analytical framework: "From Perception to Cognition." We deconstruct the complex process of vision-language interactive understanding into two interdependent layers: Perception, the foundational ability to accurately extract visual information and achieve fine-grained alignment with textual instructions; and Cognition, the higher-order capability for proactive, multi-step, goal-oriented reasoning built upon this perceptual foundation, the core of which is the formation of a dynamic observe-think-verify reasoning loop. Guided by this framework, this paper systematically analyzes the key bottlenecks of current MLLMs at both layers. It surveys the landscape of cutting-edge methods designed to address these challenges, spanning from techniques that enhance low-level visual representations to those that improve high-level reasoning paradigms. Furthermore, we review critical benchmarks and delineate future research directions. This survey aims to provide the research community with a clear, structured perspective for understanding the intrinsic limitations of current MLLMs and to illuminate the path toward building next-generation models capable of deep reasoning and a genuine understanding of the world.
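The observe-think-verify loop named in the abstract is presented as a conceptual reasoning paradigm rather than a concrete algorithm. As a minimal sketch of how such a loop could be wired up, assuming a generic MLLM wrapper, the Python below alternates perception, reasoning, and grounding checks; the `mllm` object and its `describe_regions`, `reason`, and `check_grounding` methods are hypothetical placeholders, not an API defined by the paper or by any specific library.

```python
# Illustrative observe-think-verify loop (hypothetical interface, not the paper's code).

def observe(image, prompt, mllm):
    """Perception: extract visual evidence relevant to the prompt."""
    return mllm.describe_regions(image, prompt=prompt)  # hypothetical call

def think(evidence, question, mllm):
    """Cognition: multi-step chain-of-thought reasoning over the evidence."""
    return mllm.reason(evidence=evidence, question=question)  # hypothetical call

def verify(answer, image, mllm):
    """Verification: check the candidate answer against the image to suppress
    hallucinations; returns (is_grounded, feedback)."""
    return mllm.check_grounding(answer, image)  # hypothetical call

def observe_think_verify(image, question, mllm, max_rounds=3):
    feedback = None
    answer = None
    for _ in range(max_rounds):
        prompt = question if feedback is None else f"{question}\nRe-examine: {feedback}"
        evidence = observe(image, prompt, mllm)
        answer = think(evidence, question, mllm)
        grounded, feedback = verify(answer, image, mllm)
        if grounded:
            return answer  # accepted: the answer is supported by visual evidence
    return answer  # fall back to the last candidate after max_rounds
```

The point of the sketch is the feedback edge: when verification fails, its feedback is folded into the next observation prompt, so cognition can redirect perception instead of reasoning over a fixed, possibly incomplete visual description.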
Problem

Research questions and friction points this paper is trying to address.

Addressing shallow integration between visual perception and cognitive reasoning
Reducing reasoning failures and hallucinations in multimodal models
Building coherent internal world models from visual information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes Perception-Cognition framework for MLLMs
Introduces dynamic observe-think-verify reasoning loop
Surveys enhanced visual representations and reasoning paradigms
👥 Authors
Chenyue Zhou (Nanyang Technological University, Singapore)
Mingxuan Wang
Yanbiao Ma (Gaoling School of Artificial Intelligence, Renmin University of China)
Chenxu Wu (USTC; diffusion-based methods, multimodal learning)
Wanyi Chen
Zhe Qian
Xinyu Liu
Yiwei Zhang
Junhao Wang
Hengbo Xu
Fei Luo
Xiaohua Chen
Xiaoshuai Hao (Beijing Academy of Artificial Intelligence, BAAI; vision and language)
Hehan Li
Andi Zhang (Research Associate in Machine Learning, University of Manchester; probabilistic generative models, out-of-distribution detection, adversarial attacks)
Wenxuan Wang
Lingling Li (Associate Director of Biostatistics, Sanofi Genzyme; causal inference, missing data, propensity scores, sequential analytic methods, drug and vaccine safety)
Zhiwu Lu (Professor, Renmin University of China; machine learning, computer vision, large multimodal models, video generation)
Yang Lu (Xiamen University, China)
Yike Guo (The Hong Kong University of Science and Technology, Hong Kong SAR, China)