🤖 AI Summary
Training large-scale sparse autoencoders (SAEs) incurs heavy computational and memory overhead, especially at large dictionary sizes, which limits their scalability for interpretability research.
Method: We propose KronSAE, a novel architecture featuring (1) a Kronecker-product decomposition of the latent space that drastically reduces the parameters and compute of the encoder's linear transformation, and (2) mAND, a differentiable activation function approximating binary AND logic that improves performance and the semantic interpretability of learned features in the factorized setting. The decoder continues to benefit from existing sparse-aware kernels.
Results: On feature disentanglement in language-model hidden states, KronSAE substantially reduces memory and FLOP consumption, enabling training with much larger dictionaries. At the same time it improves reconstruction accuracy and direction-level interpretability (assessed via feature alignment and causal mediation analysis) without compromising sparsity or fidelity.
📝 Abstract
Sparse Autoencoders (SAEs) have demonstrated significant promise in interpreting the hidden states of language models by decomposing them into interpretable latent directions. However, training SAEs at scale remains challenging, especially when large dictionary sizes are used. While decoders can leverage sparse-aware kernels for efficiency, encoders still require computationally intensive linear operations with large output dimensions. To address this, we propose KronSAE, a novel architecture that factorizes the latent representation via Kronecker product decomposition, drastically reducing memory and computational overhead. Furthermore, we introduce mAND, a differentiable activation function approximating the binary AND operation, which improves interpretability and performance in our factorized framework.
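To make the parameter saving concrete, here is a minimal NumPy sketch of the idea as described in the abstract. The exact formulation in the paper is not shown here, so the shapes, the pairwise composition, and the specific `mand` surrogate (an elementwise min of ReLUs) are illustrative assumptions: two small encoders produce pre-activations of sizes m and n, and every pair is combined into an (m·n)-dimensional latent, so the encoder needs d·(m+n) weights instead of d·(m·n).

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def mand(a, b):
    # Hypothetical differentiable surrogate for binary AND: the output
    # is nonzero only when both inputs are positive. The paper's actual
    # mAND definition may differ; this is an assumed stand-in.
    return np.minimum(relu(a), relu(b))

d, m, n = 64, 16, 8            # model dim; factor widths (m * n latents)
W_a = rng.normal(size=(m, d)) / np.sqrt(d)
W_b = rng.normal(size=(n, d)) / np.sqrt(d)

x = rng.normal(size=d)         # a hidden state to encode
a, b = W_a @ x, W_b @ x        # two small pre-activation vectors

# Compose an (m*n)-dimensional latent from every (a_i, b_j) pair,
# mirroring a Kronecker-product structure over the latent space.
z = mand(a[:, None], b[None, :]).ravel()

# Encoder parameter count: d*(m+n) versus d*(m*n) for a dense encoder.
print(W_a.size + W_b.size, d * m * n)   # 1536 vs 8192
```

With d = 64, m = 16, n = 8 the factorized encoder uses 1,536 weights where a dense encoder producing the same 128 latents would use 8,192; the gap widens quadratically as the dictionary grows.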