🤖 AI Summary
Multimodal large language models (MLLMs) suffer from a fundamental perception–cognition misalignment: visual inputs trigger only shallow cross-modal alignment and fail to support coherent internal world modeling, leading to pervasive hallucinations and higher-order reasoning failures. This survey introduces a two-layer "perception-to-cognition" analytical framework that exposes the structural gap between low-level visual representations and high-level symbolic reasoning, and advocates a dynamic observe–think–verify reasoning loop. It reviews methods spanning fine-grained cross-modal alignment, multi-step reasoning, and explicit hallucination suppression, and surveys the benchmarks used to evaluate state-of-the-art MLLMs on critical reasoning tasks. The analysis identifies the core bottlenecks impeding deep multimodal reasoning, outlines a scalable path toward trustworthy internal world models, and establishes both theoretical foundations and practical guidelines for next-generation embodied cognitive MLLMs.
📝 Abstract
Multimodal Large Language Models (MLLMs) strive to achieve a profound, human-like understanding of, and interaction with, the physical world, yet they often integrate information acquisition (Perception) and reasoning (Cognition) in a shallow and incoherent way. This disconnect produces a spectrum of reasoning failures, with hallucination the most prominent. Collectively, these issues expose a fundamental challenge: the ability to process pixels does not yet confer the ability to construct a coherent, credible internal world model. To systematically dissect and address this challenge, this survey introduces a novel and unified analytical framework, "From Perception to Cognition." We deconstruct the complex process of vision-language interactive understanding into two interdependent layers: Perception, the foundational ability to accurately extract visual information and achieve fine-grained alignment with textual instructions; and Cognition, the higher-order capability for proactive, multi-step, goal-oriented reasoning built on this perceptual foundation, whose core is the formation of a dynamic observe–think–verify reasoning loop. Guided by this framework, the paper systematically analyzes the key bottlenecks of current MLLMs at both layers and surveys the landscape of cutting-edge methods designed to address them, ranging from techniques that enhance low-level visual representations to those that improve high-level reasoning paradigms. Furthermore, we review critical benchmarks and delineate future research directions. This survey aims to give the research community a clear, structured perspective on the intrinsic limitations of current MLLMs and to illuminate the path toward next-generation models capable of deep reasoning and a genuine understanding of the world.
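The observe–think–verify loop described above can be read as a simple control cycle: gather visual evidence, propose a reasoning step, and accept it only after it is checked against that evidence. The Python sketch below is a minimal illustration, not an implementation from the surveyed work; the `Step` and `State` containers, the `observe_think_verify` driver, and the crude word-overlap grounding check are all hypothetical names introduced here for clarity.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Step:
    """A proposed reasoning step (hypothetical container for this sketch)."""
    claim: str
    is_final: bool = False

@dataclass
class State:
    """Accumulated visual evidence and claims rejected by verification."""
    evidence: list = field(default_factory=list)
    rejected: list = field(default_factory=list)

def observe_think_verify(
    observe: Callable[[State], str],        # observe: extract query-relevant visual evidence
    think: Callable[[State], Step],         # think: propose the next claim from current evidence
    verify: Callable[[Step, State], bool],  # verify: check the claim is grounded in the evidence
    max_iters: int = 5,
) -> str:
    """Minimal observe-think-verify cycle: a claim is accepted only
    after an explicit verification pass against fresh observations."""
    state = State()
    for _ in range(max_iters):
        state.evidence.append(observe(state))      # observe
        step = think(state)                        # think
        if verify(step, state):                    # verify
            if step.is_final:
                return step.claim
        else:
            state.rejected.append(step.claim)      # ungrounded claim: suppress and re-observe
    return "unresolved"

# Toy demo: evidence is plain text; verification is a crude grounding check
# (the claim must share a word with the gathered evidence).
frames = iter(["a red mug on a wooden table", "steam rising from the mug"])
answer = observe_think_verify(
    observe=lambda s: next(frames, ""),
    think=lambda s: Step("the mug is hot", is_final=True)
        if any("steam" in e for e in s.evidence)
        else Step("the mug is cold"),
    verify=lambda step, s: any(w in " ".join(s.evidence) for w in step.claim.split()),
)
print(answer)  # -> "the mug is hot"
```

The point of the sketch is the control flow, not the toy components: hallucination suppression hooks in at the verification step, where a claim that cannot be grounded in the observed evidence is rejected and the loop returns to perception rather than committing to the ungrounded answer.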