Improving Robustness In Sparse Autoencoders via Masked Regularization

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of feature entanglement and degraded out-of-distribution (OOD) performance in sparse autoencoders, which arise from under-constrained training objectives and undermine interpretability. To mitigate these issues, the authors propose a mask-based regularization method that randomly replaces input tokens during training to disrupt co-occurring feature patterns. This approach effectively alleviates feature absorption, enhances the stability and robustness of latent representations, and narrows the performance gap between in-distribution and OOD settings. The method is architecture-agnostic and compatible with various sparsity levels, demonstrating consistent improvements across different sparse autoencoder configurations. Furthermore, it leads to better performance on probing tasks, indicating more disentangled and semantically meaningful representations.
📝 Abstract
Sparse autoencoders (SAEs) are widely used in mechanistic interpretability to project LLM activations onto sparse latent spaces. However, sparsity alone is an imperfect proxy for interpretability, and current training objectives often result in brittle latent representations. SAEs are known to be prone to feature absorption, where general features are subsumed by more specific ones due to co-occurrence, degrading interpretability despite high reconstruction fidelity. Recent negative results on Out-of-Distribution (OOD) performance further underscore broader robustness-related failures tied to under-specified training objectives. We address this by proposing a masking-based regularization that randomly replaces tokens during training to disrupt co-occurrence patterns. This improves robustness across SAE architectures and sparsity levels, reducing absorption, enhancing probing performance, and narrowing the OOD gap. Our results point toward a practical path for more reliable interpretability tools.
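The core idea, random token replacement to break up co-occurrence patterns before activations reach the SAE, can be sketched in a few lines. This is a minimal illustration, not the paper's exact recipe: the function name `mask_tokens`, the replacement probability, and the uniform replacement distribution are all assumptions.

```python
import random

def mask_tokens(token_ids, vocab_size, replace_prob=0.15, seed=None):
    """Randomly replace a fraction of tokens with random vocabulary ids.

    Hypothetical sketch of the masking regularization described above:
    corrupting some tokens disrupts co-occurring feature patterns in the
    LLM activations that the SAE is later trained to reconstruct.
    `replace_prob` and uniform sampling are illustrative choices.
    """
    rng = random.Random(seed)
    return [
        rng.randrange(vocab_size) if rng.random() < replace_prob else t
        for t in token_ids
    ]

# Example: corrupt a toy sequence before extracting LLM activations.
ids = list(range(10))
corrupted = mask_tokens(ids, vocab_size=50_000, replace_prob=0.3, seed=0)
```

In a full pipeline, the corrupted ids would be fed through the LLM, and the resulting activations would train the SAE with its usual reconstruction-plus-sparsity objective; only the input corruption step is new.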
Problem

Research questions and friction points this paper is trying to address.

sparse autoencoders
feature absorption
robustness
out-of-distribution
interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

masked regularization
sparse autoencoders
feature absorption
out-of-distribution robustness
mechanistic interpretability