Revisiting End To End Sparse Autoencoder Training -- A Short Finetune is All You Need

📅 2025-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Sparse autoencoders (SAEs) suffer from insufficient reconstruction fidelity and large cross-entropy loss gaps when applied to language model interpretability. Method: We propose a lightweight fine-tuning strategy that jointly optimizes KL divergence and mean squared error (MSE) losses over only the final 25M tokens, using parameter-efficient adaptation via LoRA or linear adapters. Contribution/Results: This is the first work to demonstrate that ultra-short end-to-end fine-tuning can closely match full end-to-end training performance; it reveals a systematic, correctable error source in MSE-trained SAEs and enables hyperparameter transfer across loss scales. Experiments show a 20–50% reduction in the cross-entropy loss gap with only ~3% additional computational overhead. The method significantly improves performance on interpretability applications—including circuit analysis—while remaining compatible with mainstream SAE architectures such as ReLU and TopK.
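The joint objective described above can be sketched as a weighted sum of a KL term (between the model's next-token distribution with original activations and with SAE reconstructions spliced in) and the usual MSE reconstruction term. A minimal numpy sketch follows; the paper's exact weighting between the two terms is not given in this summary, so `mse_weight` is a hypothetical balancing coefficient:

```python
import numpy as np

def log_softmax(x, axis=-1):
    """Numerically stable log-softmax over logits."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def kl_mse_loss(logits_orig, logits_sae, acts_orig, acts_recon, mse_weight=1.0):
    """Combined KL+MSE fine-tuning objective (sketch).

    logits_orig: model logits with original activations, shape (positions, vocab)
    logits_sae:  model logits with SAE reconstructions spliced in
    acts_orig / acts_recon: activations and their SAE reconstructions
    mse_weight:  hypothetical coefficient; not specified in the summary above
    """
    logp = log_softmax(logits_orig)   # target distribution (original model)
    logq = log_softmax(logits_sae)    # distribution after splicing in the SAE
    # KL(P_orig || P_sae), averaged over positions
    kl = (np.exp(logp) * (logp - logq)).sum(axis=-1).mean()
    # Standard MSE reconstruction term
    mse = ((acts_recon - acts_orig) ** 2).mean()
    return kl + mse_weight * mse
```

In an actual fine-tune this loss would be backpropagated through the SAE (or its LoRA/linear adapter) for the final few percent of training tokens, with the language model's own weights frozen.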

📝 Abstract
Sparse autoencoders (SAEs) are widely used for interpreting language model activations. A key evaluation metric is the increase in cross-entropy loss when replacing model activations with SAE reconstructions. Typically, SAEs are trained solely on mean squared error (MSE) using precomputed, shuffled activations. Recent work introduced training SAEs directly with a combination of KL divergence and MSE ("end-to-end" SAEs), significantly improving reconstruction accuracy at the cost of substantially increased computation, which has limited their widespread adoption. We propose a brief KL+MSE fine-tuning step applied only to the final 25M training tokens (just a few percent of typical training budgets) that achieves comparable improvements, reducing the cross-entropy loss gap by 20–50%, while incurring minimal additional computational cost. We further find that multiple fine-tuning methods (KL fine-tuning, LoRA adapters, linear adapters) yield similar, non-additive cross-entropy improvements, suggesting a common, easily correctable error source in MSE-trained SAEs. We demonstrate a straightforward method for effectively transferring hyperparameters and sparsity penalties despite scale differences between KL and MSE losses. While both ReLU and TopK SAEs see significant cross-entropy loss improvements, evaluations on supervised SAEBench metrics yield mixed results, suggesting practical benefits depend on both SAE architecture and the specific downstream task. Nonetheless, our method offers meaningful improvements in interpretability applications such as circuit analysis with minor additional cost.
Problem

Research questions and friction points this paper is trying to address.

Improving sparse autoencoder training efficiency
Reducing cross-entropy loss gap cost-effectively
Enhancing interpretability with minimal computation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Brief KL+MSE fine-tuning on final tokens
Multiple fine-tuning methods yield similar improvements
Transfer hyperparameters despite KL-MSE scale differences
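The last bullet concerns transferring a sparsity penalty tuned under an MSE objective to the KL+MSE setting, where the loss operates on a different scale. The paper's exact transfer rule is not given in this summary; one plausible sketch, assuming the penalty is simply rescaled by the ratio of typical loss magnitudes so the sparsity term keeps the same relative weight, is:

```python
def transfer_sparsity_penalty(lambda_mse, mse_loss_scale, kl_loss_scale):
    """Hypothetical rescaling of a sparsity coefficient (sketch only).

    lambda_mse:     sparsity penalty tuned against the MSE objective
    mse_loss_scale: typical magnitude of the MSE loss during training
    kl_loss_scale:  typical magnitude of the KL loss during fine-tuning

    Assumption (not from the paper): scaling by the ratio of loss
    magnitudes preserves the sparsity term's relative influence.
    """
    return lambda_mse * (kl_loss_scale / mse_loss_scale)
```

For example, if the KL loss runs at a quarter of the MSE loss's magnitude, the penalty would shrink by the same factor, keeping the sparsity/reconstruction trade-off roughly unchanged.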