🤖 AI Summary
Sentence embeddings lack interpretability because deep neural transformations and pooling operations implicitly mix semantic and syntactic information, obscuring feature provenance. To address this, we propose the first mechanistic decomposition framework: leveraging dictionary learning to model token-level representations interpretably and inverting the pooling process to reveal how semantic and syntactic features are explicitly and separately encoded within linear subspaces. Experiments demonstrate that diverse linguistic attributes—including topic, tense, and dependency relations—are linearly separable in these subspaces, and that the decomposition yields human-interpretable embedding components. Our approach bridges token- and sentence-level interpretability, significantly enhancing the transparency, controllability, and analytical granularity of sentence representations. It establishes a novel paradigm for probing embedding mechanisms, enabling fine-grained, attribution-aware analysis of sentence encoders across languages.
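The "inverting the pooling process" step rests on a simple linearity argument, which can be sketched as follows. In this hypothetical example (the dictionary, codes, and shapes are all synthetic stand-ins, not the paper's actual setup), each token embedding is a sparse combination of dictionary atoms; because mean pooling is linear, averaging the token codes gives an exact atom-level decomposition of the pooled sentence embedding:

```python
import numpy as np

rng = np.random.default_rng(0)
n_atoms, dim, n_tokens = 8, 16, 5

# A (hypothetical) learned dictionary of interpretable atoms, one per row.
D = rng.normal(size=(n_atoms, dim))

# Sparse token codes: each token activates only a few atoms.
A = np.zeros((n_tokens, n_atoms))
for t in range(n_tokens):
    active = rng.choice(n_atoms, size=2, replace=False)
    A[t, active] = rng.normal(size=2)

X = A @ D            # token embeddings, shape (n_tokens, dim)
s = X.mean(axis=0)   # mean-pooled sentence embedding

# "Inverting" the pooling: average the token codes instead of the tokens.
a_bar = A.mean(axis=0)   # aggregated atom activations for the sentence

# The sentence embedding decomposes exactly into dictionary atoms,
# weighted by the pooled activations ...
assert np.allclose(a_bar @ D, s)

# ... and the per-atom contribution vectors sum back to the embedding,
# giving each atom an attributable share of the sentence representation.
contributions = a_bar[:, None] * D   # shape (n_atoms, n_tokens * 0 + dim)
assert np.allclose(contributions.sum(axis=0), s)
```

The same commutation holds for any linear pooling (e.g. sum pooling); max pooling would break the exact decomposition.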
📝 Abstract
Sentence embeddings are central to modern NLP and AI systems, yet little is known about their internal structure. While we can compare these embeddings using measures such as cosine similarity, the contributing features are not human-interpretable, and the content of an embedding seems untraceable, as it is masked by complex neural transformations and a final pooling operation that combines individual token embeddings. To alleviate this issue, we propose a new method to mechanistically decompose sentence embeddings into interpretable components, by using dictionary learning on token-level representations. We analyze how pooling compresses these features into sentence representations, and assess the latent features that reside in a sentence embedding. This bridges token-level mechanistic interpretability with sentence-level analysis, making for more transparent and controllable representations. In our studies, we obtain several interesting insights into the inner workings of sentence embedding spaces, for instance, that many semantic and syntactic aspects are linearly encoded in the embeddings.
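The claim that "many semantic and syntactic aspects are linearly encoded" is typically tested with a linear probe: if a linear classifier can recover an attribute from the embeddings, that attribute occupies a (near-)linear subspace. Below is a minimal, self-contained sketch of that methodology on synthetic data (the embeddings, the "tense" labels, and the injected direction are all fabricated for illustration; the paper's actual encoders and attributes are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)
n, dim = 400, 32

# Synthetic binary attribute, e.g. past vs. present tense.
labels = rng.integers(0, 2, size=n)

# Inject the attribute linearly along one unit direction of the space.
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)

# Synthetic "sentence embeddings": isotropic noise + attribute offset.
emb = rng.normal(size=(n, dim)) + np.outer(1.5 * (2 * labels - 1), direction)

# Linear probe via least squares on +/-1 targets (a simple stand-in
# for logistic regression), trained on the first 300 examples.
w, *_ = np.linalg.lstsq(emb[:300], 2.0 * labels[:300] - 1.0, rcond=None)

# Held-out accuracy: high accuracy indicates linear encoding.
pred = (emb[300:] @ w > 0).astype(int)
acc = (pred == labels[300:]).mean()
print(f"held-out probe accuracy: {acc:.2f}")
```

A probe scoring near chance would instead suggest the attribute is absent or only non-linearly encoded; comparing probe accuracy across attributes (topic, tense, dependency relations) is what grounds linear-separability claims like the one above.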