🤖 AI Summary
Sentence embeddings lack interpretability because deep neural transformations and pooling operations implicitly mix semantic and syntactic information, obscuring feature provenance. To address this, we propose the first mechanistic decomposition framework: leveraging dictionary learning to model token-level representations interpretably and inverting the pooling process to reveal how semantic and syntactic features are explicitly and separately encoded within linear subspaces. Experiments demonstrate that diverse linguistic attributes—including topic, tense, and dependency relations—are linearly separable in these subspaces, and that the decomposition yields human-interpretable embedding components. Our approach bridges token- and sentence-level interpretability, significantly enhancing the transparency, controllability, and analytical granularity of sentence representations. It establishes a novel paradigm for probing embedding mechanisms, enabling fine-grained, attribution-aware analysis of sentence encoders across languages.
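The "inverting the pooling process" step rests on a simple linearity argument, which can be sketched as follows. In this hypothetical example (the dictionary, codes, and shapes are all synthetic stand-ins, not the paper's actual setup), each token embedding is a sparse combination of dictionary atoms; because mean pooling is linear, averaging the token codes gives an exact atom-level decomposition of the pooled sentence embedding:

```python
import numpy as np

rng = np.random.default_rng(0)
n_atoms, dim, n_tokens = 8, 16, 5

# A (hypothetical) learned dictionary of interpretable atoms, one per row.
D = rng.normal(size=(n_atoms, dim))

# Sparse token codes: each token activates only a few atoms.
A = np.zeros((n_tokens, n_atoms))
for t in range(n_tokens):
    active = rng.choice(n_atoms, size=2, replace=False)
    A[t, active] = rng.normal(size=2)

X = A @ D            # token embeddings, shape (n_tokens, dim)
s = X.mean(axis=0)   # mean-pooled sentence embedding

# "Inverting" the pooling: average the token codes instead of the tokens.
a_bar = A.mean(axis=0)   # aggregated atom activations for the sentence

# The sentence embedding decomposes exactly into dictionary atoms,
# weighted by the pooled activations ...
assert np.allclose(a_bar @ D, s)

# ... and the per-atom contribution vectors sum back to the embedding,
# giving each atom an attributable share of the sentence representation.
contributions = a_bar[:, None] * D   # shape (n_atoms, n_tokens * 0 + dim)
assert np.allclose(contributions.sum(axis=0), s)
```

The same commutation holds for any linear pooling (e.g. sum pooling); max pooling would break the exact decomposition.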
📝 Abstract
Sentence embeddings are central to modern NLP and AI systems, yet little is known about their internal structure. While we can compare these embeddings using measures such as cosine similarity, the contributing features are not human-interpretable, and the content of an embedding seems untraceable, as it is masked by complex neural transformations and a final pooling operation that combines individual token embeddings. To alleviate this issue, we propose a new method to mechanistically decompose sentence embeddings into interpretable components, by using dictionary learning on token-level representations. We analyze how pooling compresses these features into sentence representations, and assess the latent features that reside in a sentence embedding. This bridges token-level mechanistic interpretability with sentence-level analysis, making for more transparent and controllable representations. In our studies, we obtain several interesting insights into the inner workings of sentence embedding spaces, for instance, that many semantic and syntactic aspects are linearly encoded in the embeddings.
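The claim that "many semantic and syntactic aspects are linearly encoded" is typically tested with a linear probe: if a linear classifier can recover an attribute from the embeddings, that attribute occupies a (near-)linear subspace. Below is a minimal, self-contained sketch of that methodology on synthetic data (the embeddings, the "tense" labels, and the injected direction are all fabricated for illustration; the paper's actual encoders and attributes are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)
n, dim = 400, 32

# Synthetic binary attribute, e.g. past vs. present tense.
labels = rng.integers(0, 2, size=n)

# Inject the attribute linearly along one unit direction of the space.
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)

# Synthetic "sentence embeddings": isotropic noise + attribute offset.
emb = rng.normal(size=(n, dim)) + np.outer(1.5 * (2 * labels - 1), direction)

# Linear probe via least squares on +/-1 targets (a simple stand-in
# for logistic regression), trained on the first 300 examples.
w, *_ = np.linalg.lstsq(emb[:300], 2.0 * labels[:300] - 1.0, rcond=None)

# Held-out accuracy: high accuracy indicates linear encoding.
pred = (emb[300:] @ w > 0).astype(int)
acc = (pred == labels[300:]).mean()
print(f"held-out probe accuracy: {acc:.2f}")
```

A probe scoring near chance would instead suggest the attribute is absent or only non-linearly encoded; comparing probe accuracy across attributes (topic, tense, dependency relations) is what grounds linear-separability claims like the one above.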