Data Whitening Improves Sparse Autoencoder Learning

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Sparse autoencoders (SAEs) often suffer from optimization difficulties and feature coupling when learning interpretable features from neural network activations, primarily due to strong correlations among input activations. This work identifies that such input correlations severely degrade the optimization landscape. To address this, we propose PCA-based whitening as a critical preprocessing step: it reshapes the loss surface, enhancing convexity and trainability, thereby improving feature decoupling and detection accuracy. Our approach challenges the conventional view that interpretability arises solely from the sparsity-fidelity trade-off, advocating instead that whitening should be adopted as a default step in SAE training. Extensive experiments across diverse model architectures and sparsity levels—evaluated on the SAEBench benchmark—demonstrate that whitening significantly improves sparse feature detection accuracy and feature decoupling metrics, with only a marginal increase in reconstruction loss.
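The whitening step the summary describes can be sketched in a few lines of NumPy. This is a minimal illustration of standard PCA whitening (center, rotate onto principal axes, rescale each axis to unit variance), not the paper's code; the function name and `eps` regularizer are illustrative choices.

```python
import numpy as np

def pca_whiten(X, eps=1e-5):
    """PCA-whiten activations X of shape (n_samples, d) so that
    the transformed data has (approximately) identity covariance."""
    Xc = X - X.mean(axis=0)
    # Sample covariance and its eigendecomposition (symmetric -> eigh)
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Whitening matrix: rotate onto eigenbasis, scale by 1/sqrt(eigenvalue);
    # eps guards against near-zero eigenvalues
    W = eigvecs / np.sqrt(eigvals + eps)
    return Xc @ W

rng = np.random.default_rng(0)
# Strongly correlated synthetic "activations"
A = rng.normal(size=(10_000, 8)) @ rng.normal(size=(8, 8))
Z = pca_whiten(A)
cov_Z = Z.T @ Z / (Z.shape[0] - 1)
print(np.allclose(cov_Z, np.eye(8), atol=1e-2))  # → True
```

Decorrelating the inputs this way is what, per the summary, reshapes the SAE loss surface and improves trainability.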

📝 Abstract
Sparse autoencoders (SAEs) have emerged as a promising approach for learning interpretable features from neural network activations. However, the optimization landscape for SAE training can be challenging due to correlations in the input data. We demonstrate that applying PCA Whitening to input activations -- a standard preprocessing technique in classical sparse coding -- improves SAE performance across multiple metrics. Through theoretical analysis and simulation, we show that whitening transforms the optimization landscape, making it more convex and easier to navigate. We evaluate both ReLU and Top-K SAEs across diverse model architectures, widths, and sparsity regimes. Empirical evaluation on SAEBench, a comprehensive benchmark for sparse autoencoders, reveals that whitening consistently improves interpretability metrics, including sparse probing accuracy and feature disentanglement, despite minor drops in reconstruction quality. Our results challenge the assumption that interpretability aligns with an optimal sparsity--fidelity trade-off and suggest that whitening should be considered as a default preprocessing step for SAE training, particularly when interpretability is prioritized over perfect reconstruction.
Problem

Research questions and friction points this paper is trying to address.

Correlated input activations degrade the SAE optimization landscape
Feature coupling from input correlations hampers interpretable feature learning
The sparsity–fidelity trade-off alone does not guarantee interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

PCA whitening of input activations as a default preprocessing step for SAE training
Whitening makes the optimization landscape more convex and easier to navigate
Interpretability gains (sparse probing, feature disentanglement) outweigh minor reconstruction losses
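The paper evaluates both ReLU and Top-K SAEs on whitened inputs. As a rough sketch of what a Top-K SAE forward pass looks like (keep only the k largest latent pre-activations per sample, then decode), here is a minimal NumPy version; the function and parameter names are illustrative, not the authors' implementation.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """One Top-K SAE forward pass: encode, keep only the k largest
    pre-activations in each row, apply ReLU, then decode."""
    pre = x @ W_enc + b_enc                      # (n, m) latent pre-activations
    # Indices of the k largest entries per row
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.take_along_axis(pre, idx, axis=1), axis=1)
    z = np.maximum(z, 0)                         # non-negative sparse codes
    x_hat = z @ W_dec + b_dec                    # reconstruction
    return z, x_hat

rng = np.random.default_rng(1)
d, m, k = 16, 64, 4                              # input dim, latent dim, sparsity
x = rng.normal(size=(32, d))
W_enc = rng.normal(size=(d, m)) * 0.1
W_dec = rng.normal(size=(m, d)) * 0.1
z, x_hat = topk_sae_forward(x, W_enc, np.zeros(m), W_dec, np.zeros(d), k)
print((z != 0).sum(axis=1).max() <= k)  # → True: at most k active latents per sample
```

In the paper's setup, `x` would be PCA-whitened model activations rather than raw ones; the Top-K constraint enforces the sparsity regime directly instead of via an L1 penalty.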
Ashwin Saraswatula
Case Western Reserve University, Cold Spring Harbor Laboratory
David Klindt
Cold Spring Harbor Laboratory
artificial intelligence · computational neuroscience