Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical gap in multimodal large language models (MLLMs): a mismatch between their perceptual and reasoning capabilities when interpreting discrete symbolic notations, such as mathematical expressions or chemical structures, which hinders genuine mastery of the symbolic languages underpinning scientific discovery. To investigate this, the authors construct a comprehensive benchmark spanning five domains: language, culture, mathematics, physics, and chemistry. Their analysis reveals a previously unreported “cognitive mismatch” phenomenon, wherein models often fail at basic symbol recognition yet succeed at complex reasoning tasks, exposing a fundamental reliance on linguistic priors rather than authentic visual perception. Through extensive cross-domain evaluation and combined qualitative and quantitative diagnostics, the study systematically identifies key weaknesses in current MLLMs’ symbolic understanding and offers crucial guidance for developing more rigorous, human-aligned multimodal intelligence.
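
To make the reported diagnostic concrete, below is a minimal, hypothetical Python sketch (not code from the paper) of how per-domain recognition accuracy could be compared against reasoning accuracy to flag a cognitive mismatch; the record format, function name, threshold, and sample data are illustrative assumptions only.

```python
# Hypothetical sketch: tabulate recognition vs. reasoning accuracy per domain
# and flag cases where reasoning clearly outpaces basic symbol recognition.
from collections import defaultdict

# Each record: (domain, task_type, is_correct); task_type is "recognition" or "reasoning".
# The domains follow the benchmark's five areas; the data here is illustrative only.
results = [
    ("mathematics", "recognition", False),
    ("mathematics", "reasoning", True),
    ("chemistry", "recognition", False),
    ("chemistry", "reasoning", True),
    ("language", "recognition", True),
    ("language", "reasoning", True),
]

def accuracy_by_domain(records):
    """Return {domain: {task_type: accuracy}} from (domain, task_type, correct) triples."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # domain -> task -> [correct, total]
    for domain, task, correct in records:
        counts[domain][task][0] += int(correct)
        counts[domain][task][1] += 1
    return {
        d: {t: c / n for t, (c, n) in tasks.items()}
        for d, tasks in counts.items()
    }

scores = accuracy_by_domain(results)
for domain, acc in scores.items():
    gap = acc.get("reasoning", 0.0) - acc.get("recognition", 0.0)
    flag = "  <-- possible cognitive mismatch" if gap > 0.2 else ""
    print(f"{domain}: recognition={acc.get('recognition', 0.0):.2f}, "
          f"reasoning={acc.get('reasoning', 0.0):.2f}{flag}")
```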

📝 Abstract
While Multimodal Large Language Models (MLLMs) have achieved remarkable success in interpreting natural scenes, their ability to process discrete symbols -- the fundamental building blocks of human cognition -- remains a critical open question. Unlike continuous visual data, symbols such as mathematical formulas, chemical structures, and linguistic characters require precise, deeper interpretation. This paper introduces a comprehensive benchmark to evaluate how top-tier MLLMs navigate these "discrete semantic spaces" across five domains: language, culture, mathematics, physics, and chemistry. Our investigation uncovers a counterintuitive phenomenon: models often fail at basic symbol recognition yet succeed in complex reasoning tasks, suggesting they rely on linguistic probability rather than true visual perception. By exposing this "cognitive mismatch", we highlight a significant gap in current AI capabilities: the struggle to truly perceive and understand the symbolic languages that underpin scientific discovery and abstract thought. This work offers a roadmap for developing more rigorous, human-aligned intelligent systems.
Problem

Research questions and friction points this paper is trying to address.

Cognitive Mismatch
Multimodal Large Language Models
Discrete Symbols
Symbol Understanding
Symbolic Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cognitive Mismatch
Multimodal Large Language Models
Discrete Symbols
Symbolic Understanding
Comprehensive Benchmark
Yinghui Li
Tsinghua University, China
Jiayi Kuang
Sun Yat-sen University, China
Peng Xing
Tsinghua University, China
Daixian Liu
Tsinghua University, China
Junnan Dong
Tencent Youtu Lab | HKPolyU
Large Language Models, GraphRAG, Agent, Knowledge Graphs
Shu-Yu Guo
Tsinghua University, China
Yangning Li
Tsinghua University, China
Qingyu Zhou
Unknown affiliation
Wenhao Jiang
GML, Tencent, PolyU
Computer Vision, Machine Learning, Foundation Models
Hai-Tao Zheng
Tsinghua University, China
Ying Shen
Sun Yat-sen University, China
Liang Lin
Fellow of IEEE/IAPR, Professor of Computer Science, Sun Yat-sen University
Embodied AI, Causal Inference and Learning, Multimodal Data Analysis
Philip S. Yu
Professor of Computer Science, University of Illinois at Chicago
Data mining, Database, Privacy