Unveiling Decision-Making in LLMs for Text Classification: Extraction of influential and interpretable concepts with Sparse Autoencoders

📅 2025-06-30
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses the lack of systematic exploration of interpretable concept extraction for text classification with large language models (LLMs). We propose a novel sparse autoencoder (SAE) architecture tailored for classification tasks, integrating a dedicated classification head and an activation-sparsity loss to explicitly model how semantic concepts influence decision-making. To assess the causal fidelity and semantic precision of the extracted concepts, we further design two evaluation metrics grounded in an external sentence encoder. Experiments across four Pythia models and two classification benchmarks demonstrate that our method consistently outperforms baselines, including ConceptShap and ICA, achieving simultaneous improvements in concept interpretability, causal attribution accuracy, and task performance. The approach establishes a reproducible, verifiable paradigm for fine-grained semantic analysis of LLM internal representations.
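The core idea is an SAE whose concept activations feed a classifier, so that each concept is tied directly to the decision. The minimal sketch below assumes a PyTorch-style implementation; the module name `ClassificationSAE`, the layer choices, and the loss weights are illustrative and not the authors' exact design.

```python
import torch
import torch.nn as nn


class ClassificationSAE(nn.Module):
    """Sparse autoencoder over LLM hidden states with a classification head.

    Sentence representations are encoded into a sparse concept space,
    reconstructed by the decoder, and classified directly from the concept
    activations, so each concept's classifier weight reflects its influence
    on the decision.
    """

    def __init__(self, d_model: int, n_concepts: int, n_classes: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_concepts)
        self.decoder = nn.Linear(n_concepts, d_model)
        self.classifier = nn.Linear(n_concepts, n_classes)

    def forward(self, h: torch.Tensor):
        # h: hidden states of the fine-tuned LLM, shape (batch, d_model)
        z = torch.relu(self.encoder(h))   # non-negative concept activations
        h_hat = self.decoder(z)           # reconstruction of the input states
        logits = self.classifier(z)       # class prediction from the concepts
        return z, h_hat, logits


def training_loss(h, y, model, alpha=1.0, beta=0.1):
    """Reconstruction + classification objective (placeholder weights).

    A plain L1 term stands in for sparsity here; the paper's activation-rate
    sparsity loss is sketched separately further down the page.
    """
    z, h_hat, logits = model(h)
    rec = nn.functional.mse_loss(h_hat, h)
    clf = nn.functional.cross_entropy(logits, y)
    return rec + alpha * clf + beta * z.abs().mean()
```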

📝 Abstract
Sparse Autoencoders (SAEs) have been successfully used to probe Large Language Models (LLMs) and extract interpretable concepts from their internal representations. These concepts are linear combinations of neuron activations that correspond to human-interpretable features. In this paper, we investigate the effectiveness of SAE-based explainability approaches for sentence classification, a domain where such methods have not been extensively explored. We present a novel SAE-based architecture tailored for text classification, leveraging a specialized classifier head and incorporating an activation rate sparsity loss. We benchmark this architecture against established methods such as ConceptShap, Independent Component Analysis, and other SAE-based concept extraction techniques. Our evaluation covers two classification benchmarks and four fine-tuned LLMs from the Pythia family. We further enrich our analysis with two novel metrics for measuring the precision of concept-based explanations, using an external sentence encoder. Our empirical results show that our architecture improves both the causality and interpretability of the extracted features.
Problem

Research questions and friction points this paper is trying to address.

Extract interpretable concepts from LLMs for classification
Evaluate SAE-based explainability in text classification tasks
Measure precision of concept explanations with novel metrics (see the sketch after this list)
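As a rough illustration of a sentence-encoder-grounded precision score (not the paper's exact metric), one can embed a concept's top-activating sentences with an off-the-shelf encoder and check how semantically coherent they are. The encoder name `all-MiniLM-L6-v2` and the averaging rule are assumptions.

```python
from itertools import combinations

import torch
from sentence_transformers import SentenceTransformer


def concept_precision(top_sentences: list[str],
                      encoder_name: str = "all-MiniLM-L6-v2") -> float:
    """Mean pairwise cosine similarity of a concept's top-activating
    sentences under an external sentence encoder; higher values suggest
    the concept corresponds to a coherent semantic notion.

    Assumes at least two sentences are provided.
    """
    encoder = SentenceTransformer(encoder_name)
    emb = encoder.encode(top_sentences, convert_to_tensor=True,
                         normalize_embeddings=True)
    sims = [torch.dot(emb[i], emb[j]).item()
            for i, j in combinations(range(len(top_sentences)), 2)]
    return sum(sims) / len(sims)
```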
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Autoencoders extract interpretable LLM concepts
Specialized classifier head enhances text classification
Activation rate sparsity loss improves feature interpretability (a sketch follows this list)
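The activation-rate sparsity loss pushes each concept to fire on only a small fraction of inputs. One plausible differentiable formulation is sketched below; the soft indicator, target rate, and penalty form are assumptions, not the paper's exact definition.

```python
import torch


def activation_rate_sparsity_loss(z: torch.Tensor,
                                  target_rate: float = 0.05,
                                  temperature: float = 1e-2) -> torch.Tensor:
    """Penalize concepts whose empirical firing rate over a batch deviates
    from a small target rate.

    z: non-negative concept activations, shape (batch, n_concepts).
    A smooth tanh gate approximates the 0/1 "concept fired" indicator so
    the rate stays differentiable.
    """
    soft_active = torch.tanh(z / temperature)   # ~ 1[z > 0] for z >= 0
    rate = soft_active.mean(dim=0)              # per-concept activation rate
    return (rate - target_rate).abs().mean()
```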
Mathis Le Bail
LIX (École Polytechnique, IP Paris, CNRS)
Jérémie Dentan
LIX (École Polytechnique, IP Paris, CNRS)
Davide Buscaldi
Associate Professor (HDR), LIPN, Université Sorbonne Paris Nord
Sonia Vanier
LIX (École Polytechnique, IP Paris, CNRS)