Sparse Autoencoder Features for Classifications and Transferability

📅 2025-02-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates sparse autoencoders (SAEs) for extracting interpretable, structured features from large language models (LLMs) to support safety-critical classification tasks, and systematically evaluates their zero-shot transferability across models, languages, and modalities. Methodologically, it first identifies the critical impact of pooling strategies and binary activation thresholds on SAE feature quality; proposes activation binarization—replacing conventional feature selection—to substantially improve computational efficiency while preserving or enhancing performance; and integrates layer selection, scaling analysis, and transfer adaptation to enhance feature generalization. Experiments demonstrate that SAE-derived features achieve macro-F1 scores exceeding 0.81 on classification tasks, significantly outperforming both raw hidden states and bag-of-words baselines. Furthermore, the approach enables successful zero-shot cross-model transfer (e.g., Gemma-2 2B → 9B-IT), zero-shot cross-lingual toxicity detection, and cross-modal generalization to image classification.
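The pipeline described above (SAE activations → pooling → binarization → lightweight classifier) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual code: the random stand-in activations, the max-pooling choice, and the zero threshold are all assumptions for demonstration.

```python
# Hypothetical sketch of an SAE-feature classification pipeline; the data
# here is a random stand-in, not real SAE activations from an LLM.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Stand-in for SAE activations: (num_texts, num_tokens, sae_width).
# In practice these come from an SAE encoder applied to LLM hidden states.
acts = rng.exponential(scale=0.1, size=(200, 16, 1024)) * (
    rng.random((200, 16, 1024)) < 0.05
)
labels = rng.integers(0, 2, size=200)

# Pooling strategy: collapse the token axis (max pooling shown here;
# the paper reports that this choice matters substantially).
pooled = acts.max(axis=1)  # (200, 1024)

# Binarization: threshold continuous activations instead of running a
# separate feature-selection step (threshold of 0 is an assumption).
threshold = 0.0
features = (pooled > threshold).astype(np.float32)

# Lightweight classifier on the binarized features.
clf = LogisticRegression(max_iter=1000).fit(features[:150], labels[:150])
preds = clf.predict(features[150:])
print("macro-F1:", f1_score(labels[150:], preds, average="macro"))
```

With random labels the F1 score is meaningless; the point is only the shape of the pipeline: pool over tokens, binarize, then fit a simple probe.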

📝 Abstract
Sparse Autoencoders (SAEs) offer the potential to uncover structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze SAEs for interpretable feature extraction from LLMs in safety-critical classification tasks. Our framework evaluates (1) model-layer selection and scaling properties, (2) SAE architectural configurations, including width and pooling strategies, and (3) the effect of binarizing continuous SAE activations. SAE-derived features achieve macro F1>0.8, outperforming hidden-state and BoW baselines while demonstrating cross-model transfer from Gemma 2 2B to 9B-IT models. These features generalize in a zero-shot manner to cross-lingual toxicity detection and visual classification tasks. Our analysis highlights the significant impact of pooling strategies and binarization thresholds, showing that binarization offers an efficient alternative to traditional feature selection while maintaining or improving performance. These findings establish new best practices for SAE-based interpretability and enable scalable, transparent deployment of LLMs in real-world applications. Full repo: https://github.com/shan23chen/MOSAIC.
Problem

Research questions and friction points this paper is trying to address.

Analyze Sparse Autoencoders for interpretable feature extraction
Evaluate SAE architectural configurations and binarization effects
Demonstrate SAE-derived features' cross-model and cross-lingual transferability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Autoencoders for interpretability
Binarization enhances feature selection
Cross-model transferability demonstrated
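The binarization innovation can be illustrated against the conventional feature-selection route. This comparison is a sketch under assumed data, not code from the paper's repo: the top-k ranking criterion (activation frequency) and the sizes are illustrative.

```python
# Illustrative comparison (not from the paper's code) of top-k feature
# selection vs. binarization on pooled SAE activations.
import numpy as np

rng = np.random.default_rng(1)
# Random stand-in for pooled SAE activations: (num_texts, sae_width).
pooled = rng.exponential(0.1, size=(100, 512)) * (rng.random((100, 512)) < 0.1)

# Conventional route: rank features (here by activation frequency, an
# assumed criterion) and keep only the k most frequently active ones.
k = 64
active_rate = (pooled > 0).mean(axis=0)
topk_idx = np.argsort(active_rate)[-k:]
selected = pooled[:, topk_idx]          # (100, 64), still continuous

# Binarization route: a single elementwise comparison over the full
# feature width, with no per-feature ranking or selection pass.
binary = (pooled > 0).astype(np.uint8)  # (100, 512)

print(selected.shape, binary.shape)
```

The design point: binarization keeps the full, interpretable SAE feature space (each column still corresponds to one named feature) while replacing the ranking-and-pruning step with one cheap threshold.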
👥 Authors
J. Gallifant — Harvard University, Mass General Brigham
Shan Chen — Harvard University, Mass General Brigham, Boston Children’s Hospital
Kuleen Sasse — Johns Hopkins University
Hugo Aerts — Professor, Harvard | Director, AI in Medicine Program, Mass General Brigham | Professor, MaastrichtU (Deep Learning, Artificial Intelligence, Bioinformatics, Radiomics, Radiogenomics)
Thomas Hartvigsen — University of Virginia
Danielle S. Bitterman — Harvard Medical School (Oncology, Natural Language Processing, Artificial Intelligence)