🤖 AI Summary
The opacity of large language models' (LLMs) internal mechanisms severely hinders their safe deployment, and existing sparse autoencoder (SAE)-based feature interpretation methods are largely passive, static, and lacking in empirical validation. To address this, we propose the first agent-based active interpretation framework, which reframes feature explanation as a closed-loop reasoning process: generate multiple hypotheses → design targeted stimulation experiments → iteratively refine interpretations based on empirical feedback. The method integrates generative hypothesis modeling, activation-guided experimental design, and SAE activation feedback, and we evaluate it systematically across SAEs from multiple LLMs. Compared to state-of-the-art approaches, the framework significantly improves interpretation accuracy (+18.7%) and verifiability, yielding, for the first time, feature explanations that are both logically rigorous and experimentally falsifiable. This work establishes a new paradigm for trustworthy AI.
📝 Abstract
Large language models (LLMs) have achieved remarkable progress, yet their internal mechanisms remain largely opaque, posing a significant challenge to their safe and reliable deployment. Sparse autoencoders (SAEs) have emerged as a promising tool for decomposing LLM representations into more interpretable features, but explaining the features captured by SAEs remains a challenging task. In this work, we propose SAGE (SAE AGentic Explainer), an agent-based framework that recasts feature interpretation from a passive, single-pass generation task into an active, explanation-driven process. SAGE implements a rigorous methodology by systematically formulating multiple explanations for each feature, designing targeted experiments to test them, and iteratively refining explanations based on empirical activation feedback. Experiments on features from SAEs of diverse language models demonstrate that SAGE produces explanations with significantly higher generative and predictive accuracy compared to state-of-the-art baselines.
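The abstract describes SAGE's hypothesize → test → refine loop only at a high level; the sketch below illustrates what such a control flow could look like. All names here (`propose_explanations`, `generate_probe_texts`, `activation`, `refine`) are hypothetical stand-ins for illustration, not the authors' implementation.

```python
# Minimal sketch of an explain-test-refine loop in the spirit of SAGE.
# All object interfaces and function names are hypothetical illustrations,
# not the paper's actual API.

from dataclasses import dataclass, field

@dataclass
class Explanation:
    text: str                       # natural-language hypothesis for the feature
    score: float = 0.0              # agreement between predicted and measured activations
    evidence: list = field(default_factory=list)

def explain_feature(feature_id, llm_agent, sae, n_hypotheses=4, n_rounds=3):
    """Iteratively explain one SAE feature via targeted activation experiments."""
    # 1. Formulate multiple candidate explanations (e.g., from top-activating examples).
    hypotheses = [Explanation(h) for h in
                  llm_agent.propose_explanations(feature_id, k=n_hypotheses)]

    for _ in range(n_rounds):
        for hyp in hypotheses:
            # 2. Design targeted experiments: texts that *should* and *should not*
            #    activate the feature if this explanation is correct.
            positives, negatives = llm_agent.generate_probe_texts(hyp.text)

            # 3. Collect empirical activation feedback from the SAE.
            pos_acts = [sae.activation(feature_id, t) for t in positives]
            neg_acts = [sae.activation(feature_id, t) for t in negatives]

            # An explanation scores well when its predicted-positive probes
            # activate the feature and its predicted-negative probes do not.
            hits = sum(a > 0 for a in pos_acts) + sum(a == 0 for a in neg_acts)
            hyp.score = hits / (len(pos_acts) + len(neg_acts))
            hyp.evidence.append((positives, pos_acts, negatives, neg_acts))

        # 4. Refine: keep the best-supported hypotheses and rewrite them
        #    in light of the accumulated evidence.
        hypotheses.sort(key=lambda h: h.score, reverse=True)
        hypotheses = [llm_agent.refine(h) for h in
                      hypotheses[: max(1, n_hypotheses // 2)]]

    return max(hypotheses, key=lambda h: h.score)
```

The key design choice this sketch captures is falsifiability: each explanation implies concrete activation predictions, and the SAE's measured activations either corroborate or refute it before the next refinement round.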