On the transferability of Sparse Autoencoders for interpreting compressed models

📅 2025-07-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
While model compression techniques (e.g., pruning, quantization) improve the inference efficiency of large language models (LLMs), their impact on interpretability—particularly via sparse autoencoders (SAEs)—remains poorly understood. Method: We systematically investigate the cross-compression-state transferability of SAEs trained on original LLMs to their pruned or quantized counterparts. We further propose structured pruning of the SAE itself—without retraining—to reduce its computational footprint while preserving explanatory capability. Results: We find that SAEs trained on uncompressed models transfer effectively to compressed variants, with only marginal degradation in explanation quality. Moreover, pruning the SAE directly achieves comparable interpretability to SAEs trained from scratch on compressed models—yet at drastically lower training cost. This work is the first to empirically validate strong SAE transferability across LLM compression states and establishes a novel, cost-efficient paradigm for maintaining interpretability in deployment-grade LLMs, thereby enabling practical, trustworthy model analysis.
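The structured SAE pruning described above can be sketched as follows: rank the SAE's latents by how strongly they fire on a sample of activations, then drop the least-used ones from both the encoder and decoder, with no retraining. This is an illustrative sketch assuming a plain ReLU SAE; the function and variable names (`prune_sae`, `W_enc`, `keep_ratio`, etc.) are hypothetical, not the paper's actual implementation.

```python
import numpy as np

def prune_sae(W_enc, b_enc, W_dec, acts, keep_ratio=0.5):
    """Structurally prune a trained SAE without retraining (a sketch).

    Latents are ranked by mean activation magnitude over sample hidden
    states `acts`; the least-used latents are dropped, shrinking both
    the encoder and the decoder. Assumes a ReLU SAE with encoder
    weights W_enc (d_model x n_latents) and decoder W_dec (n_latents
    x d_model); all names here are illustrative.
    """
    # Encode: ReLU(acts @ W_enc + b_enc) -> latent activations
    z = np.maximum(acts @ W_enc + b_enc, 0.0)
    importance = z.mean(axis=0)                 # per-latent usage score
    k = max(1, int(keep_ratio * W_enc.shape[1]))
    keep = np.argsort(importance)[-k:]          # indices of top-k latents
    # Slice encoder columns and decoder rows to keep only those latents
    return W_enc[:, keep], b_enc[keep], W_dec[keep, :]
```

Because only whole latent units are removed, the pruned SAE keeps the same input/output interface as the original and can be swapped in directly.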

📝 Abstract
Modern LLMs face inference efficiency challenges due to their scale. To address this, many compression methods have been proposed, such as pruning and quantization. However, the effect of compression on a model's interpretability remains elusive. While several model interpretation approaches exist, such as circuit discovery, Sparse Autoencoders (SAEs) have proven particularly effective in decomposing a model's activation space into its feature basis. In this work, we explore the differences between SAEs for the original and compressed models. We find that SAEs trained on the original model can interpret the compressed model, albeit with slight performance degradation compared to an SAE trained directly on the compressed model. Furthermore, simply pruning the original SAE achieves performance comparable to training a new SAE on the pruned model. This finding enables us to avoid the extensive training costs of SAEs.
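One common way to check whether an SAE trained on the original model still explains a compressed model's activations, as studied above, is to measure reconstruction quality: encode the compressed model's hidden states, decode them, and compare the residual error to the activations' total variance. The sketch below computes this fraction of variance unexplained (FVU) for a plain ReLU SAE; the function name `sae_fvu` and the specific SAE parameterization are illustrative assumptions, not the paper's exact metric.

```python
import numpy as np

def sae_fvu(acts, W_enc, b_enc, W_dec, b_dec):
    """Fraction of variance unexplained by an SAE's reconstruction.

    `acts` are hidden states sampled from the (possibly compressed)
    model; the SAE parameters come from training on the original
    model. Lower FVU means the transferred SAE still explains the
    activations well. Assumes a ReLU SAE; names are illustrative.
    """
    z = np.maximum(acts @ W_enc + b_enc, 0.0)   # sparse latent codes
    recon = z @ W_dec + b_dec                   # decoded reconstruction
    resid = ((acts - recon) ** 2).sum()         # residual error
    total = ((acts - acts.mean(axis=0)) ** 2).sum()
    return resid / total
```

Comparing this score on activations from the original versus the compressed model gives a direct, training-free read on how much explanation quality degrades under transfer.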
Problem

Research questions and friction points this paper is trying to address.

Effect of model compression on interpretability via SAEs
Transferability of SAEs between original and compressed models
Reducing SAE training costs for compressed models
Innovation

Methods, ideas, or system contributions that make the work stand out.

SAEs interpret compressed models effectively
Pruning original SAEs matches new SAE performance
Reduces SAE training costs significantly