🤖 AI Summary
The mechanistic interpretability field lacks standardized methods to evaluate consistency between features and their natural language descriptions.
Method: We propose FADE (Feature Alignment to Description Evaluation), a scalable, model-agnostic framework for evaluating feature-description alignment along four dimensions: clarity, responsiveness, purity, and faithfulness. FADE integrates perturbation analysis, attribution consistency checking, and semantic similarity modeling, enabling compatibility with arbitrary models and explanation methods.
Contribution/Results: We introduce a multidimensional diagnostic metric suite and apply it to existing open-source feature descriptions for SAEs and MLP neurons. The results indicate that SAE features are harder to describe accurately than MLP neurons. FADE quantifies the causes of description failure, with the aim of making automated interpretability pipelines more reliable. We release FADE as an open-source package.
📝 Abstract
Recent advances in mechanistic interpretability have highlighted the potential of automated interpretability pipelines for analyzing the latent representations within LLMs. While such pipelines may enhance our understanding of internal mechanisms, the field lacks standardized evaluation methods for assessing the validity of discovered features. We attempt to bridge this gap by introducing FADE: Feature Alignment to Description Evaluation, a scalable, model-agnostic framework for evaluating feature-description alignment. FADE evaluates alignment across four key metrics (Clarity, Responsiveness, Purity, and Faithfulness) and systematically quantifies the causes of misalignment between features and their descriptions. We apply FADE to analyze existing open-source feature descriptions and to assess key components of automated interpretability pipelines, aiming to enhance the quality of descriptions. Our findings highlight fundamental challenges in generating feature descriptions, particularly for SAEs as compared to MLP neurons, providing insights into the limitations and future directions of automated interpretability. We release FADE as an open-source package at: https://github.com/brunibrun/FADE.