FADE: Why Bad Descriptions Happen to Good Features

📅 2025-02-24

🤖 AI Summary
Mechanistic interpretability lacks standardized methods for evaluating how well a feature's natural language description matches the feature itself. Method: We propose FADE, a model-agnostic, scalable framework that evaluates feature-description alignment along four dimensions: Clarity, Responsiveness, Purity, and Faithfulness. FADE integrates perturbation analysis, attribution consistency checking, and semantic similarity modeling, and is designed to work with arbitrary models and explanation methods. Contribution/Results: We introduce a multidimensional diagnostic metric suite and find that generating accurate descriptions is harder for SAE features than for MLP neurons. Applied to existing open-source SAE and MLP feature descriptions, FADE quantifies the causes of feature-description misalignment, with the aim of making automated interpretability pipelines more reliable. We release the complete toolkit as open source.

📝 Abstract
Recent advances in mechanistic interpretability have highlighted the potential of automated interpretability pipelines for analyzing the latent representations within LLMs. While such pipelines may enhance our understanding of internal mechanisms, the field lacks standardized evaluation methods for assessing the validity of discovered features. We attempt to bridge this gap by introducing FADE: Feature Alignment to Description Evaluation, a scalable, model-agnostic framework for evaluating feature-description alignment. FADE evaluates alignment across four key metrics (Clarity, Responsiveness, Purity, and Faithfulness) and systematically quantifies the causes of misalignment between features and their descriptions. We apply FADE to analyze existing open-source feature descriptions and to assess key components of automated interpretability pipelines, with the aim of enhancing the quality of descriptions. Our findings highlight fundamental challenges in generating feature descriptions, particularly for SAEs as compared to MLP neurons, and provide insights into the limitations and future directions of automated interpretability. We release FADE as an open-source package at: https://github.com/brunibrun/FADE.
Problem

Research questions and friction points this paper is trying to address.

Lack of standardized evaluation in interpretability
Misalignment between features and descriptions
Challenges in generating accurate feature descriptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces FADE framework
Evaluates feature-description alignment
Open-source package released
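
The four metrics are named on this page but not defined. As a rough illustration of what scoring feature-description alignment can look like, the sketch below computes two simple proxies from a feature's activations and per-sample labels saying whether each sample expresses the described concept. This is a toy example under stated assumptions, not FADE's actual definitions or API: the function names `separation` and `precision_at_k`, the sample data, and the idea that concept labels are already available (for instance from a human or LLM rater) are all hypothetical.

```python
from itertools import product


def separation(activations, is_concept):
    """Probability that a concept sample activates the feature more strongly
    than a non-concept sample (ties count as 0.5). Hypothetical proxy for
    how well the feature responds to the described concept."""
    pos = [a for a, c in zip(activations, is_concept) if c]
    neg = [a for a, c in zip(activations, is_concept) if not c]
    if not pos or not neg:
        raise ValueError("need at least one concept and one non-concept sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


def precision_at_k(activations, is_concept, k):
    """Fraction of the k most strongly activating samples that express the
    concept. Hypothetical proxy for how exclusively the feature fires on it."""
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(
        zip(activations, is_concept), key=lambda pair: pair[0], reverse=True
    )
    top = ranked[:k]
    return sum(1 for _, c in top if c) / len(top)


# Made-up activations of one feature on six text samples, with made-up labels
# marking which samples match the feature's description.
activations = [0.9, 0.8, 0.7, 0.1, 0.0, 0.6]
is_concept = [True, True, False, False, False, True]

print(round(separation(activations, is_concept), 3))
print(round(precision_at_k(activations, is_concept, k=3), 3))
```

A description can score well on one proxy and poorly on the other, for example when a feature fires on the described concept but also on unrelated text, which is one reason a single number is not enough to diagnose a bad description.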
Authors
Bruno Puri
Department of Artificial Intelligence, Fraunhofer Heinrich Hertz Institute, Berlin, Germany; Department of Electrical Engineering and Computer Science, Technische Universität Berlin, Berlin, Germany
Aakriti Jain
Department of Artificial Intelligence, Fraunhofer Heinrich Hertz Institute, Berlin, Germany
Elena Golimblevskaia
Department of Artificial Intelligence, Fraunhofer Heinrich Hertz Institute, Berlin, Germany
Patrick Kahardipraja
Department of Artificial Intelligence, Fraunhofer Heinrich Hertz Institute, Berlin, Germany
Thomas Wiegand
Professor, TU Berlin and Fraunhofer HHI, Berlin, Germany
Image and Video Coding, Data Compression, Machine Learning, Communications, Digital Health
Wojciech Samek
Professor at TU Berlin, Head of AI Department at Fraunhofer HHI, BIFOLD Fellow
Deep Learning, Interpretability, Explainable AI, Trustworthy AI, Federated Learning
Sebastian Lapuschkin
Head of Explainable AI, Fraunhofer Heinrich Hertz Institute
Interpretability, Explainable AI, XAI, Machine Learning, Artificial Intelligence