A Comparative Analysis of Sparse Autoencoder and Activation Difference in Language Model Steering

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
When sparse autoencoders (SAEs) are used for language model steering, many of the selected latents capture non-semantic features (e.g., punctuation), and constant-strength steering induces output degradation (e.g., repetition). To address these issues, the paper proposes two key improvements: (1) steering only the single most semantically relevant latent (top-1) per attribute, thereby eliminating redundancy and noise, and (2) a token-wise decaying steering strength that adapts as generation proceeds. Compared to baselines such as mean activation difference, the approach significantly improves reasoning quality on mathematical reasoning benchmarks, achieves comparable performance on IF-Eval, and, critically, enables fine-grained, interpretable, instruction-like semantic control. This establishes a more robust and semantically aligned steering paradigm for SAE-driven controllable generation.
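The top-1 selection idea can be sketched as picking the single SAE latent whose activation best separates examples that exhibit the target attribute from those that do not. This is a minimal, hypothetical criterion (mean activation difference over latents); the paper's exact relevance score is not specified here.

```python
import numpy as np

def select_top1_latent(acts_pos, acts_neg):
    """Pick the single most attribute-relevant SAE latent.

    acts_pos, acts_neg: (n_tokens, n_latents) SAE activations collected on
    text with and without the target attribute. Returns the index of the
    latent with the largest mean activation difference (a hypothetical
    top-1 criterion standing in for the paper's relevance scoring).
    """
    diff = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return int(np.argmax(diff))
```

Selecting a single latent this way discards the lower-ranked top-k dimensions that, per the paper's observation, often encode non-semantic features such as punctuation.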

📝 Abstract
Sparse autoencoders (SAEs) have recently emerged as a powerful tool for language model steering. Prior work has explored top-k SAE latents for steering, but we observe that many dimensions among the top-k latents capture non-semantic features such as punctuation rather than semantic attributes like instructions. To address this, we propose focusing on a single, most relevant SAE latent (top-1), eliminating redundant features. We further identify a limitation in constant SAE steering, which often produces degenerate outputs such as repetitive single words. To mitigate this, we introduce a token-wise decaying steering strategy, enabling more faithful comparisons with mean activation difference baselines. Empirically, we show that steering an SAE latent associated with reasoning reliably elicits step-by-step mathematical reasoning and enhances inference quality, functionally resembling the effect of appending a guiding token. Our results demonstrate that SAEs outperform mean activation difference methods on mathematical reasoning benchmarks and match their performance on IF-Eval.
Problem

Research questions and friction points this paper is trying to address.

Improving semantic feature selection in sparse autoencoder steering
Addressing degenerate outputs from constant steering with decay strategy
Enhancing mathematical reasoning quality through targeted latent steering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Focuses on single most relevant SAE latent
Introduces token-wise decaying steering strategy
Steers reasoning latent for mathematical inference