Identifying Intervenable and Interpretable Features via Orthogonality Regularization

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of entangled, superimposed, and poorly interpretable feature representations in language models by proposing an orthogonality-regularized fine-tuning approach for sparse autoencoders. Specifically, the decoder matrix is decomposed into an approximately orthogonal set of feature bases, grounded in the principle of independent causal mechanisms. The orthogonality constraint enhances feature uniqueness and modularity, while feature identifiability and interpretability are further assessed via embedding-distance metrics and causal-intervention analysis. This enables more precise and isolated manipulation of individual features. Experimental results demonstrate that the proposed method yields substantially clearer and more controllable internal representations, with negligible degradation in downstream task performance.
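The paper does not spell out its exact loss here, but an orthogonality regularizer on an SAE decoder is commonly written as a soft penalty $\|W^\top W - I\|_F^2$ on the unit-normalized feature directions. A minimal sketch, assuming this standard formulation (the function name and shapes are illustrative, not from the paper):

```python
import torch

def orthogonality_penalty(decoder: torch.Tensor) -> torch.Tensor:
    """Soft orthogonality penalty ||W^T W - I||_F^2 on unit-normalized decoder columns."""
    # Normalize each feature direction (a decoder column) to unit length.
    W = decoder / decoder.norm(dim=0, keepdim=True)
    gram = W.T @ W                                   # pairwise cosine similarities
    eye = torch.eye(W.shape[1], device=W.device)
    return ((gram - eye) ** 2).sum()                 # zero iff columns are orthogonal

# Hypothetical shapes: 8 features living in a 16-dimensional activation space.
decoder = torch.randn(16, 8)
loss = orthogonality_penalty(decoder)
```

In fine-tuning, such a term would be added (with some weight) to the usual reconstruction-plus-sparsity objective, pushing feature directions toward mutual orthogonality without hard-constraining them.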

📝 Abstract
With recent progress on fine-tuning language models around a fixed sparse autoencoder, we disentangle the decoder matrix into almost orthogonal features. This reduces interference and superposition between the features, while keeping performance on the target dataset essentially unchanged. Our orthogonality penalty leads to identifiable features, ensuring the uniqueness of the decomposition. Further, we find that the distance between embedded feature explanations increases with stricter orthogonality penalty, a desirable property for interpretability. Invoking the $\textit{Independent Causal Mechanisms}$ principle, we argue that orthogonality promotes modular representations amenable to causal intervention. We empirically show that these increasingly orthogonalized features allow for isolated interventions. Our code is available under $\texttt{https://github.com/mrtzmllr/sae-icm}$.
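The abstract's claim that orthogonalized features "allow for isolated interventions" can be illustrated with a small sketch: clamping a single feature activation and decoding. Under the assumption of orthonormal decoder columns (the function and shapes below are hypothetical, not the paper's code), the edit shifts the reconstruction along exactly one direction and leaves projections onto all other feature directions unchanged:

```python
import torch

def intervene(feature_acts: torch.Tensor, decoder: torch.Tensor,
              feature_idx: int, new_value: float) -> torch.Tensor:
    """Clamp one SAE feature to a fixed value and decode back to model space."""
    edited = feature_acts.clone()
    edited[..., feature_idx] = new_value
    return edited @ decoder.T            # reconstructed activation vectors

# Hypothetical setup: 8 features, 16-dim activations, orthonormal decoder columns.
decoder = torch.linalg.qr(torch.randn(16, 8))[0]
acts = torch.relu(torch.randn(3, 8))    # sparse-ish feature activations
steered = intervene(acts, decoder, feature_idx=2, new_value=5.0)
```

With an orthogonal decoder the difference between `steered` and the unedited reconstruction is a multiple of a single column, which is precisely the modularity the Independent Causal Mechanisms argument appeals to.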
Problem

Research questions and friction points this paper is trying to address.

intervenable features
interpretable features
orthogonality
causal intervention
feature disentanglement
Innovation

Methods, ideas, or system contributions that make the work stand out.

orthogonality regularization
interpretable features
causal intervention
sparse autoencoder
modular representation