Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

📅 2026-05-20
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the loss of explanation faithfulness in dictionary-based explainers under distribution shift. The loss arises because the shift rotates the subspace the model actively uses, which misaligns the dictionary with out-of-distribution (OOD) samples. To mitigate this, the paper proposes the training-free Geometry-Adaptive Explainer (GAE), which realigns the dictionary with the OOD-active subspace using only unlabeled OOD activations. GAE draws on geometric analysis of subspaces, second-moment alignment, and sparse dictionary mapping, and introduces the "faithfulness gap" as a geometric metric that quantifies the misalignment. The approach preserves the original feature structure while yielding faithful OOD explanations. Experiments show that GAE reduces the faithfulness gap under distribution shift and matches or surpasses training-based baselines in causal faithfulness across multiple models and OOD settings.
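To make the idea of a geometric misalignment metric concrete, the following is a minimal sketch and not the paper's definition of the faithfulness gap. It assumes the OOD-active subspace can be approximated by the top-k eigenvectors of the uncentered second moment of the activations, and it scores misalignment as the fraction of that subspace lying outside the span of the dictionary atoms. The function names, the choice of `k`, and the toy low-rank dictionary are all hypothetical.

```python
import numpy as np


def top_k_subspace(acts, k):
    """Orthonormal basis (d, k) for the top-k eigenvectors of the
    uncentered second moment of activations with shape (n, d)."""
    second_moment = acts.T @ acts / acts.shape[0]
    _, eigvecs = np.linalg.eigh(second_moment)  # eigenvalues ascending
    return eigvecs[:, ::-1][:, :k]


def dictionary_span(dictionary, rel_tol=1e-10):
    """Orthonormal basis (d, r) for the row space of a dictionary
    whose rows are atoms, shape (n_atoms, d)."""
    _, sing, vt = np.linalg.svd(dictionary, full_matrices=False)
    return vt[sing > rel_tol * sing.max()].T


def subspace_gap(dictionary, ood_acts, k):
    """Illustrative misalignment score in [0, 1] (an assumption, not the
    paper's metric): 0 if the OOD-active subspace lies inside the
    dictionary span, 1 if it is orthogonal to it."""
    q = dictionary_span(dictionary)
    u = top_k_subspace(ood_acts, k)
    # Singular values of q.T @ u are cosines of the principal angles.
    cosines = np.linalg.svd(q.T @ u, compute_uv=False)
    captured = np.sum(cosines ** 2) / k
    return float(np.sqrt(max(0.0, 1.0 - captured)))
```

Note that an overcomplete dictionary whose atoms span the whole activation space would always score 0 under this toy measure, so a practical metric would need to weight atoms by how they are actually used.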
πŸ“ Abstract
Mechanistic interpretability aims to explain a model's behavior by identifying causally responsible internal structures. Dictionary-based explainers such as sparse autoencoders and transcoders are a primary tool, but their faithfulness under out-of-distribution (OOD) shift has received little systematic attention. We show that distribution shift rotates the subspace that the model actively uses, misaligning the explainer's dictionary trained on in-distribution (ID) activations. We formalize this misalignment as the faithfulness gap, a geometric distance between the ID dictionary and the OOD-active subspace, and show that it controls OOD faithfulness degradation. To reduce this gap, we propose the Geometry-Adaptive Explainer (GAE), which realigns the explainer's dictionary with the OOD-active subspace while preserving the original feature structure. This requires only unlabeled OOD activations and no gradient updates. We prove that GAE improves over the unadapted ID explainer, with excess loss bounded quadratically by the second-moment shift. Empirically, GAE even matches or surpasses all training-based baselines in causal faithfulness across multiple models and OOD settings.
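The abstract describes a gradient-free realignment that needs only unlabeled OOD activations and keeps the feature structure intact. The sketch below illustrates one way such a step could look; it is an assumption for illustration, not the GAE algorithm. It estimates the ID-active and OOD-active subspaces from second moments, builds an orthogonal map between them with orthogonal Procrustes, and applies that map to every dictionary atom. Because the map is orthogonal, the Gram matrix of the atoms (their norms and pairwise angles) is unchanged. The names `top_k_basis` and `realign_dictionary` and the rank `k` are hypothetical.

```python
import numpy as np


def top_k_basis(acts, k):
    """Orthonormal basis (d, k) for the top-k eigenvectors of the
    uncentered second moment of activations with shape (n, d)."""
    second_moment = acts.T @ acts / acts.shape[0]
    _, eigvecs = np.linalg.eigh(second_moment)  # eigenvalues ascending
    return eigvecs[:, ::-1][:, :k]


def realign_dictionary(dictionary, id_acts, ood_acts, k):
    """Rotate dictionary atoms (rows, shape (n_atoms, d)) so that the
    ID-active subspace is carried onto the OOD-active subspace.

    Illustrative assumption only; no gradients or labels are used.
    Returns the adapted dictionary and the (d, d) orthogonal map.
    """
    u_id = top_k_basis(id_acts, k)
    u_ood = top_k_basis(ood_acts, k)

    # Step 1: pick the basis of the OOD subspace closest to u_id
    # (orthogonal Procrustes inside the k-dimensional subspace).
    a, _, bt = np.linalg.svd(u_ood.T @ u_id)
    w = a @ bt  # (k, k) orthogonal

    # Step 2: extend to a full orthogonal map with rotation @ u_id equal
    # to u_ood @ w. Its action on the complement of the ID subspace is
    # not unique; the SVD simply picks one valid completion.
    p, _, qt = np.linalg.svd(u_ood @ w @ u_id.T)
    rotation = p @ qt  # (d, d) orthogonal

    return dictionary @ rotation.T, rotation
```

A usage call would be `adapted, rotation = realign_dictionary(dictionary, id_acts, ood_acts, k=32)`, where `k=32` is an arbitrary illustrative rank. A full-space orthogonal map is only one design choice; a method could instead adjust only the in-subspace component of each atom, at the cost of no longer preserving the Gram matrix exactly.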
Problem

Research questions and friction points this paper is trying to address.

distribution shift
dictionary-based interpretability
faithfulness gap
mechanistic interpretability
OOD
Innovation

Methods, ideas, or system contributions that make the work stand out.

geometry-adaptive
dictionary-based interpretability
distribution shift
faithfulness gap
mechanistic interpretability