🤖 AI Summary
Large language models (LLMs) exhibit a performance–competence dissociation in representing linguistic form (signifier) versus meaning (signified), challenging assumptions about their linguistic competence.
Method: We propose the first neurolinguistic evaluation paradigm for LLMs, moving beyond traditional psycholinguistic approaches. We construct bilingual minimal-pair datasets in Chinese and German (COMPS-ZH/DE), and employ diagnostic neural probing combined with cross-layer activation pattern analysis to systematically characterize how form and meaning are encoded across hidden layers.
Contribution/Results: We identify a pervasive performance–competence dissociation: instruction tuning improves task performance without enhancing deep semantic representation; form representations are robust and cross-lingually consistent, whereas meaning representations remain weak; and model output probabilities do not reliably reflect underlying linguistic competence. Our work establishes a novel theoretical framework and empirical benchmark for assessing LLMs' true language capabilities.
📝 Abstract
This study investigates how Large Language Models (LLMs) understand the signifier (form) and the signified (meaning) by distinguishing two assessment paradigms: psycholinguistic and neurolinguistic. Traditional psycholinguistic evaluations often reflect statistical regularities that may not accurately represent LLMs' true linguistic competence. We introduce a neurolinguistic approach, combining minimal pairs with diagnostic probing to analyze activation patterns across model layers. This method allows a detailed examination of how LLMs represent form and meaning, and whether these representations are consistent across languages. We found: (1) Psycholinguistic and neurolinguistic methods reveal that language performance and competence are distinct; (2) Direct probability measurement may not accurately assess linguistic competence; (3) Instruction tuning improves performance while leaving underlying competence largely unchanged; (4) LLMs exhibit higher competence and performance in form than in meaning. Additionally, we introduce new conceptual minimal-pair datasets for Chinese (COMPS-ZH) and German (COMPS-DE), complementing existing English datasets.
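The psycholinguistic paradigm the abstract contrasts with probing works by direct probability measurement: score both members of a minimal pair under the model and check whether the acceptable variant receives higher probability. The toy sketch below uses an add-one-smoothed bigram model in place of an LLM (the corpus and sentences are invented for illustration); a real evaluation would sum the LLM's token log-probabilities instead.

```python
# Hedged sketch of psycholinguistic minimal-pair evaluation: score each
# member of a pair with a toy bigram LM and check whether the acceptable
# variant gets higher probability.
import math
from collections import Counter

corpus = "the cat sleeps . the dog sleeps . a cat purrs . the dog barks .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab = len(unigrams)

def log_prob(sentence):
    """Add-one-smoothed bigram log-probability of a space-tokenized sentence."""
    words = sentence.split()
    lp = 0.0
    for w1, w2 in zip(words, words[1:]):
        lp += math.log((bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab))
    return lp

good, bad = "the cat sleeps", "cat the sleeps"
print(log_prob(good) > log_prob(bad))  # True: the acceptable order scores higher
```

The study's point (2) is precisely that passing this kind of probability test does not guarantee that the model's internal representations encode the relevant distinction, which is why the probing analysis is needed alongside it.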