Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Sparse autoencoders (SAEs) are widely used to extract human-interpretable concept representations from the internal activations of large language models (LLMs). However, their vulnerability to input perturbations, where minimal adversarial edits drastically alter concept activations while leaving the base model's outputs nearly unchanged, has been largely overlooked, undermining their reliability for model monitoring and oversight. Method: The paper establishes concept-representation robustness as a core desideratum of interpretability and introduces the first quantitative evaluation framework based on adversarial optimization in the input space, coupled with a realistic, scenario-aware benchmarking paradigm. Results: Experiments across state-of-the-art SAEs reveal pervasive robustness deficiencies, indicating that current SAE concept representations are fragile, and the paper identifies concrete directions for improving SAE design and evaluation.
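
One way to write the input-space adversarial optimization described above as a constrained objective is sketched below; this is a reading consistent with the summary and abstract, not necessarily the paper's exact formulation. Here f is the base LLM, c maps an input to the SAE concept activations computed from f's internal representation, epsilon bounds the size of the input perturbation, and tau bounds how much the base model's output may drift.

```latex
\max_{\delta \,:\, \|\delta\| \le \epsilon} \;
  \bigl\| c(x + \delta) - c(x) \bigr\|
\quad \text{subject to} \quad
  D\bigl( f(x + \delta),\, f(x) \bigr) \le \tau
```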

📝 Abstract
Sparse autoencoders (SAEs) are commonly used to interpret the internal activations of large language models (LLMs) by mapping them to human-interpretable concept representations. While existing evaluations of SAEs focus on metrics such as the reconstruction-sparsity tradeoff, human (auto-)interpretability, and feature disentanglement, they overlook a critical aspect: the robustness of concept representations to input perturbations. We argue that robustness must be a fundamental consideration for concept representations, reflecting the fidelity of concept labeling. To this end, we formulate robustness quantification as input-space optimization problems and develop a comprehensive evaluation framework featuring realistic scenarios in which adversarial perturbations are crafted to manipulate SAE representations. Empirically, we find that tiny adversarial input perturbations can effectively manipulate concept-based interpretations in most scenarios without notably affecting the outputs of the base LLMs themselves. Overall, our results suggest that SAE concept representations are fragile and may be ill-suited for applications in model monitoring and oversight.
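
The attack scenario in the abstract (tiny input changes that manipulate concept interpretations while the base model's outputs stay put) can be made concrete with a small optimization loop. Below is a minimal, self-contained sketch, not the paper's implementation: it uses toy PyTorch modules (ToyLM and ToySAE are illustrative stand-ins for a frozen base LLM and a trained SAE) and, as a simplifying assumption, perturbs continuous input embeddings rather than discrete text. The loop maximizes the change in the SAE's concept activations while penalizing KL drift in the base model's output distribution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_sae, vocab, seq = 32, 128, 100, 8

class ToyLM(nn.Module):
    """Stand-in for a frozen base LLM that consumes input embeddings."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
        self.head = nn.Linear(d_model, vocab)

    def forward(self, emb):
        h = self.body(emb)            # internal activations the SAE reads
        return self.head(h), h

class ToySAE(nn.Module):
    """Stand-in for a trained SAE encoder producing ReLU concept activations."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)

    def forward(self, h):
        return F.relu(self.enc(h))

lm, sae = ToyLM().eval(), ToySAE().eval()
for p in list(lm.parameters()) + list(sae.parameters()):
    p.requires_grad_(False)           # only the input perturbation is optimized

emb = torch.randn(1, seq, d_model)    # clean input embeddings
logits_clean, h_clean = lm(emb)
c_clean = sae(h_clean)

eps, alpha, steps, lam = 0.05, 0.01, 50, 1.0
delta = (0.001 * torch.randn_like(emb)).requires_grad_(True)  # small random start

for _ in range(steps):
    logits_adv, h_adv = lm(emb + delta)
    c_adv = sae(h_adv)
    # Maximize the shift in the SAE concept activations while penalizing
    # drift in the base model's output distribution (KL to the clean logits).
    concept_shift = (c_adv - c_clean).abs().mean()
    output_drift = F.kl_div(F.log_softmax(logits_adv, dim=-1),
                            F.softmax(logits_clean, dim=-1),
                            reduction="batchmean")
    loss = -concept_shift + lam * output_drift
    loss.backward()
    with torch.no_grad():
        delta -= alpha * delta.grad.sign()   # signed-gradient (PGD-style) step
        delta.clamp_(-eps, eps)              # keep the perturbation "tiny"
    delta.grad.zero_()

print(f"concept shift: {concept_shift.item():.4f}  "
      f"output drift (KL): {output_drift.item():.4f}")
```
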
Problem

Research questions and friction points this paper is trying to address.

Are SAE concept representations robust to small adversarial input perturbations?
Can such perturbations manipulate concept activations while leaving base LLM outputs essentially unchanged?
Are SAE-based interpretations reliable enough for model monitoring and oversight?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Establishes robustness of concept representations as a core interpretability desideratum
Formulates robustness quantification as input-space optimization problems
Develops a scenario-based adversarial perturbation evaluation framework