Linguistic Interpretability of Transformer-based Language Models: a systematic review

📅 2025-04-09
🤖 AI Summary
This paper investigates the “linguistic interpretability” of Transformer language models: whether their internal representations implicitly encode human-like linguistic knowledge. To address this, the authors systematically review 160 studies, synthesizing cross-lingual and cross-model evidence across four linguistic dimensions: syntax, morphology, lexical semantics, and discourse. The surveyed methodologies span probing, attribution analysis, and representational similarity comparison, grounded in classical linguistic theory, thereby bridging gaps in multilingual representation analysis and in the interpretation of pre-trained (not task-specialized) models. Results show that multilingual Transformers consistently encode hierarchical linguistic knowledge, with distinct layers exhibiting functional specialization for specific linguistic phenomena. These findings provide theoretical foundations for model diagnostics, controllable text generation, and interdisciplinary research at the intersection of computational linguistics and cognitive neuroscience.
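
Of the technique families named above, probing is the most common in the surveyed literature: a small supervised classifier is trained on frozen hidden states to test whether a linguistic property (e.g., part-of-speech) is decodable from a given layer. Below is a minimal sketch of the idea; the hidden states, tag inventory, and shapes are synthetic stand-ins, not data or code from the paper.

```python
# Minimal probing sketch: train a linear classifier on frozen hidden states
# to test whether a linguistic property (here, part-of-speech) is linearly
# decodable from a given layer. All data below are illustrative stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for per-token hidden states extracted from one Transformer layer:
# shape (num_tokens, hidden_size). In practice these would come from a
# pretrained model (e.g., BERT) run over an annotated corpus.
hidden_states = rng.normal(size=(2000, 768))
pos_tags = rng.integers(0, 12, size=2000)  # stand-in POS tag ids

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, pos_tags, test_size=0.2, random_state=0
)

# A deliberately simple (linear) probe: high accuracy suggests the layer
# itself encodes the property, rather than the probe computing it.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.3f}")
```

The probe is kept linear on purpose: high accuracy then suggests the property is linearly accessible in the representation, and results are typically compared against a random-representation control baseline.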

📝 Abstract
Language models based on the Transformer architecture achieve excellent results in many language-related tasks, such as text classification or sentiment analysis. However, although the architecture of these models is well-defined, little is known about how their internal computations help them achieve their results. This renders these models, as of today, 'black box' systems. There is, however, a line of research -- 'interpretability' -- aiming to learn how information is encoded inside these models. More specifically, there is work dedicated to studying whether Transformer-based models possess knowledge of linguistic phenomena similar to human speakers -- an area we call 'linguistic interpretability' of these models. In this survey we present a comprehensive analysis of 160 research works, spread across multiple languages and models -- including multilingual ones -- that attempt to discover linguistic information from the perspective of several traditional Linguistics disciplines: Syntax, Morphology, Lexico-Semantics and Discourse. Our survey fills a gap in the existing interpretability literature, which either does not focus on linguistic knowledge in these models or presents limitations -- e.g. only studying English-based models. Our survey also focuses on Pre-trained Language Models not further specialized for a downstream task, with an emphasis on works that use interpretability techniques that explore models' internal representations.
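
As a concrete illustration of the representational similarity comparisons mentioned in the summary, the sketch below implements linear Centered Kernel Alignment (CKA; Kornblith et al., 2019), a common way to compare two layers' activations over the same inputs. The function is a generic textbook implementation and the random matrices are stand-ins for real activations, assumptions of this sketch rather than material from the survey.

```python
# Minimal representational-similarity sketch: linear Centered Kernel
# Alignment (CKA) between two sets of activations over the same tokens.
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between representations X (n, d1) and Y (n, d2)."""
    X = X - X.mean(axis=0)  # center each feature column
    Y = Y - Y.mean(axis=0)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
layer_a = rng.normal(size=(500, 768))             # stand-in activations, layer i
q, _ = np.linalg.qr(rng.normal(size=(768, 768)))  # random orthogonal matrix
layer_b = layer_a @ q                             # same information, rotated basis
layer_c = rng.normal(size=(500, 768))             # unrelated representation
print(f"CKA(rotated)   = {linear_cka(layer_a, layer_b):.3f}")  # ~1.0
print(f"CKA(unrelated) = {linear_cka(layer_a, layer_c):.3f}")  # near 0
```

Linear CKA is invariant to orthogonal rotations of the feature space, which is why the rotated copy scores ~1.0 while the unrelated matrix scores near 0; this makes it useful for comparing layers or models whose coordinate bases differ.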
Problem

Research questions and friction points this paper is trying to address.

Understanding internal computations of Transformer-based language models
Investigating linguistic knowledge in multilingual Transformer models
Analyzing interpretability techniques for models' internal representations (a minimal attribution sketch follows below)
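
As a minimal illustration of the attribution side of these techniques, the sketch below computes gradient-times-input saliency for a toy classifier over token embeddings; the model, shapes, and scores are hypothetical stand-ins, since the surveyed works apply such methods to real pretrained Transformers.

```python
# Minimal attribution sketch: gradient-times-input saliency over token
# embeddings for a toy classifier. The model and embeddings are stand-ins.
import torch

torch.manual_seed(0)
seq_len, hidden = 6, 32
embeddings = torch.randn(seq_len, hidden, requires_grad=True)

# Toy "model": mean-pool token embeddings, then a linear classifier head.
head = torch.nn.Linear(hidden, 2)
logits = head(embeddings.mean(dim=0))

# Gradient of the predicted class score w.r.t. each token embedding.
logits[logits.argmax()].backward()

# Gradient x input, summed over hidden dims -> one relevance score per token.
saliency = (embeddings.grad * embeddings).sum(dim=-1)
print("per-token relevance:", saliency.detach().tolist())
```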
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic review of Transformer models' linguistic interpretability
Analyzes 160 works across multiple languages and models
Focuses on pre-trained models' internal representations, without downstream-task specialization
👥 Authors
Miguel López-Otal, Aragon Institute of Engineering Research, University of Zaragoza, Spain
Jorge Gracia, University of Zaragoza (Semantic Web, Ontologies, Linguistic Linked Data, Ontology Matching, Query interpretation)
Jordi Bernad, Aragon Institute of Engineering Research, University of Zaragoza, Spain
Carlos Bobed, Assistant Professor at University of Zaragoza, Spain (Semantic Web, Ontologies, Knowledge Graphs, NLP, Mobile Computing)
Lucía Pitarch-Ballesteros, Aragon Institute of Engineering Research, University of Zaragoza, Spain
Emma Anglés-Herrero, Aragon Institute of Engineering Research, University of Zaragoza, Spain