Steering Large Language Model Activations in Sparse Spaces

📅 2025-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of behavior steering during the inference phase of large language models (LLMs), where entangled activations in dense representation spaces hinder both interpretability and precise behavioral control. To tackle this, we propose Sparse Activation Steering (SAS), a method that integrates sparse autoencoders (SAEs) with contrastive prompt pairs to identify and intervene on monosemantic behavioral features in the sparse activation space, enabling interpretable and editable behavior modulation. Experiments on Gemma 2 demonstrate SAS's capability for fine-grained behavioral control; moreover, scaling SAE size significantly improves the monosemanticity of SAS steering vectors, enhancing intervention reliability and mechanistic interpretability. To our knowledge, this is the first work to systematically incorporate sparse representation learning into test-time alignment steering for LLMs, establishing a novel paradigm for AI alignment.

📝 Abstract
A key challenge in AI alignment is guiding large language models (LLMs) to follow desired behaviors at test time. Activation steering, which modifies internal model activations during inference, offers a potential solution. However, prior work in dense activation spaces struggles with superposition, wherein multiple features become entangled, limiting interpretability and precise control. In contrast, sparse representations provide an untapped opportunity for more interpretable behavior modulation. In this work, we introduce sparse activation steering (SAS), a method that leverages sparse autoencoders (SAEs) to steer LLM behavior in sparse spaces. By isolating behavior-specific features through a contrastive prompt-pairing approach, we define a set of features that can selectively reinforce or suppress behaviors. Experiments on Gemma 2 LLMs show that SAS vectors enable nuanced behavioral modulation and finer-grained control. Furthermore, scaling SAEs improves monosemanticity of SAS vectors, suggesting more reliable and interpretable interventions.
Problem

Research questions and friction points this paper is trying to address.

Guiding large language models to follow desired behaviors
Overcoming superposition in dense activation spaces
Enhancing interpretability and control with sparse activation steering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse activation steering for LLM behavior control
Contrastive prompt-pairing isolates behavior-specific features
Scaling SAEs enhances interpretable behavioral interventions
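The pipeline the summary and bullets describe — encode dense activations into a sparse SAE space, contrast positive and negative prompts to isolate behavior-specific features, then decode the sparse steering vector back into the residual stream — can be sketched as below. This is a minimal illustration with toy random matrices, not the paper's implementation: the SAE weights, `top_k` feature selection, and `alpha` scaling are all hypothetical stand-ins (the paper uses pretrained SAEs for Gemma 2).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a trained SAE (hypothetical shapes; a real setup
# would load pretrained encoder/decoder weights for the target layer).
d_model, d_sparse = 16, 64
W_enc = rng.normal(size=(d_model, d_sparse))
b_enc = np.zeros(d_sparse)
W_dec = rng.normal(size=(d_sparse, d_model))

def sae_encode(h):
    """ReLU encoder: dense activation -> sparse feature coefficients."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

def sae_decode(z):
    """Decoder: sparse coefficients -> dense activation space."""
    return z @ W_dec

def sas_vector(h_pos, h_neg, top_k=8):
    """Sparse steering vector from a contrastive prompt pair:
    difference of mean sparse codes, restricted to the top-k
    features by magnitude (a hypothetical selection rule)."""
    diff = sae_encode(h_pos).mean(axis=0) - sae_encode(h_neg).mean(axis=0)
    z = np.zeros_like(diff)
    idx = np.argsort(np.abs(diff))[-top_k:]
    z[idx] = diff[idx]
    return z

def steer(h, z_steer, alpha=1.0):
    """Reinforce (alpha > 0) or suppress (alpha < 0) the behavior by
    adding the decoded sparse steering vector to a dense activation."""
    return h + alpha * sae_decode(z_steer)

# Toy activations for a contrastive prompt pair (5 tokens each).
h_pos = rng.normal(size=(5, d_model))
h_neg = rng.normal(size=(5, d_model))
z = sas_vector(h_pos, h_neg)
h_new = steer(rng.normal(size=d_model), z, alpha=0.5)
```

Keeping the intervention sparse (only `top_k` active features) is what makes it inspectable: each nonzero coordinate can, in principle, be matched to a monosemantic feature, which dense-space steering vectors do not offer.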