🤖 AI Summary
This paper investigates the “linguistic interpretability” of Transformer language models: specifically, whether their internal representations implicitly encode human-like linguistic knowledge. To address this, we systematically review 160 studies, synthesizing cross-lingual and cross-model evidence across four linguistic dimensions: syntax, morphology, lexical semantics, and discourse. Our methodology integrates probing, attribution analysis, and representational similarity comparison, grounded in classical linguistic theory. We thereby bridge critical gaps in multilingual representation analysis and in the interpretation of foundational pre-trained models. Results demonstrate that multilingual Transformers consistently encode hierarchical linguistic knowledge, with distinct layers exhibiting functional specialization for specific linguistic phenomena. These findings provide essential theoretical foundations for model diagnostics, controllable text generation, and interdisciplinary research at the intersection of computational linguistics and cognitive neuroscience.
📝 Abstract
Language models based on the Transformer architecture achieve excellent results on many language-related tasks, such as text classification or sentiment analysis. However, although the architecture of these models is well defined, little is known about how their internal computations help them achieve these results. As of today, this makes these models a kind of 'black box' system. There is, however, a line of research -- 'interpretability' -- that aims to learn how information is encoded inside these models. More specifically, some work studies whether Transformer-based models possess knowledge of linguistic phenomena similar to that of human speakers -- an area we call the 'linguistic interpretability' of these models. In this survey we present a comprehensive analysis of 160 research works, spread across multiple languages and models -- including multilingual ones -- that attempt to discover linguistic information from the perspective of several traditional Linguistics disciplines: Syntax, Morphology, Lexico-Semantics and Discourse. Our survey fills a gap in the existing interpretability literature, which either does not focus on linguistic knowledge in these models or has limitations -- e.g., studying only English-based models. Our survey also focuses on Pre-trained Language Models that have not been further specialized for a downstream task, with an emphasis on works that use interpretability techniques to explore models' internal representations.