Mechanistic Decomposition of Sentence Representations

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Sentence embeddings lack interpretability because deep neural transformations and pooling operations implicitly mix semantic and syntactic information, obscuring feature provenance. To address this, we propose the first mechanistic decomposition framework: leveraging dictionary learning to model token-level representations interpretably and inverting the pooling process to reveal how semantic and syntactic features are explicitly and separately encoded within linear subspaces. Experiments demonstrate that diverse linguistic attributes—including topic, tense, and dependency relations—are linearly separable in these subspaces and that the decomposition yields human-interpretable embedding components. Our approach bridges token- and sentence-level interpretability, enhancing the transparency, controllability, and analytical granularity of sentence representations. It establishes a novel paradigm for probing embedding mechanisms, enabling fine-grained, attribution-aware analysis of sentence encoders across languages.

📝 Abstract
Sentence embeddings are central to modern NLP and AI systems, yet little is known about their internal structure. While we can compare these embeddings using measures such as cosine similarity, the contributing features are not human-interpretable, and the content of an embedding seems untraceable, as it is masked by complex neural transformations and a final pooling operation that combines individual token embeddings. To alleviate this issue, we propose a new method to mechanistically decompose sentence embeddings into interpretable components, by using dictionary learning on token-level representations. We analyze how pooling compresses these features into sentence representations, and assess the latent features that reside in a sentence embedding. This bridges token-level mechanistic interpretability with sentence-level analysis, making for more transparent and controllable representations. In our studies, we obtain several interesting insights into the inner workings of sentence embedding spaces, for instance, that many semantic and syntactic aspects are linearly encoded in the embeddings.
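The abstract's claim that "many semantic and syntactic aspects are linearly encoded" can be checked with a linear probe. The following is a minimal sketch on synthetic data (not the paper's experiments): a binary attribute is planted along a hidden direction in the embedding space, and a single least-squares weight vector recovers it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic sentence embeddings (d = 16): a binary attribute (e.g. past
# vs. present tense -- illustrative data, not the paper's) is planted
# along a hidden direction, so it is linearly encoded by construction.
d, n = 16, 200
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)
emb = rng.normal(size=(n, d)) + np.outer(3.0 * (2 * labels - 1), direction)

# Least-squares linear probe: if the attribute is linearly encoded, one
# weight vector should separate the two classes.
w, *_ = np.linalg.lstsq(emb, 2.0 * labels - 1.0, rcond=None)
accuracy = ((emb @ w > 0).astype(int) == labels).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

High probe accuracy is the standard operational test for linear encoding; the paper applies this idea to attributes such as topic, tense, and dependency relations.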
Problem

Research questions and friction points this paper is trying to address.

Understanding internal structure of sentence embeddings
Making sentence embeddings interpretable and traceable
Bridging token-level and sentence-level interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decompose sentence embeddings into interpretable components
Use dictionary learning on token-level representations
Analyze pooling compression and latent features
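The decomposition in the points above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: the dictionary is random rather than learned, and a thresholded projection stands in for the sparse-coding step. The key identity it shows is that mean pooling is linear, so a pooled sentence embedding splits exactly into dictionary atoms weighted by token-averaged codes plus a residual.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy token embeddings for a 5-token sentence (hidden size d = 8).
d, n_tokens, n_atoms = 8, 5, 16
tokens = rng.normal(size=(n_tokens, d))

# Stand-in "learned" dictionary of unit-norm atoms; in the paper this
# would come from dictionary learning on real token representations.
D = rng.normal(size=(n_atoms, d))
D /= np.linalg.norm(D, axis=1, keepdims=True)

# Sparse non-negative codes via thresholded projection (a simple proxy
# for the sparse-coding step).
codes = np.maximum(tokens @ D.T - 0.5, 0.0)      # (n_tokens, n_atoms)

# Mean pooling is linear, so the pooled sentence embedding decomposes
# into dictionary atoms weighted by the token-averaged codes, plus a
# residual the dictionary does not explain.
sentence = tokens.mean(axis=0)
avg_codes = codes.mean(axis=0)                   # per-atom contribution
explained = avg_codes @ D
residual = sentence - explained

top = np.argsort(-avg_codes)[:3]
print("top contributing atoms:", top)
```

Because the decomposition is exact (`sentence == explained + residual`), the averaged codes give an attribution of the sentence embedding over interpretable atoms, which is what makes the pooled representation traceable.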
Matthieu Tehenan
University of Cambridge
Vikram Natarajan
Independent
Jonathan Michala
University of Southern California
Milton Lin
Johns Hopkins University
Juri Opitz
University of Zurich