AI Summary
This work proposes an inference-time alignment method that avoids the high computational cost and opacity of traditional weight-update-based approaches. By constructing a prompt-conditioned sparse autoencoder (SAE) steering mechanism, the method dynamically intervenes on latents active for the current token during decoding, achieving alignment without modifying the base model weights. This is the first prompt-conditioned dynamic control mechanism for SAEs, and it reduces alignment-stage FLOPs by up to 4.47×. Analysis shows that preference directions are driven primarily by discourse and stylistic signals. Evaluated on Gemma-2-2B/9B and Qwen3-8B, the method improves MT-Bench scores, is competitive on AlpacaEval, and maintains robust multiple-choice accuracy even with limited preference data.
Abstract
Preference alignment is usually achieved by weight-updating training on preference data, which adds substantial alignment-stage compute and provides limited mechanistic visibility. We propose Dynamic SAE Steering for Preference Alignment (DSPA), an inference-time method that makes sparse autoencoder (SAE) steering prompt-conditional. From preference triples, DSPA computes a conditional-difference map linking prompt features to generation-control features; during decoding, it modifies only token-active latents, without base-model weight updates. Across Gemma-2-2B/9B and Qwen3-8B, DSPA improves MT-Bench and is competitive on AlpacaEval while preserving multiple-choice accuracy. Under restricted preference data, DSPA remains robust and can rival the two-stage RAHF-SCIT pipeline while requiring up to $4.47\times$ fewer alignment-stage FLOPs. Finally, we audit the SAE features DSPA modifies, finding that preference directions are dominated by discourse and stylistic signals, and provide theory clarifying the conditional-difference map estimate and when top-$k$ ablation is principled.
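The core decoding-time intervention described above can be sketched as follows. This is a hypothetical, minimal illustration, not the authors' implementation: the function name `dspa_steer`, the NumPy formulation, and the use of a per-latent offset vector `delta` (standing in for the output of the conditional-difference map) are all assumptions; only the overall idea — encode the token's activation with an SAE, shift only the token-active latents, and decode the change back — comes from the abstract.

```python
import numpy as np

def dspa_steer(h, W_enc, b_enc, W_dec, delta, top_k=32):
    """Hypothetical sketch of prompt-conditioned SAE steering.

    h:      residual-stream activation for the current token, shape (d_model,)
    W_enc:  SAE encoder weights, shape (d_model, d_sae)
    b_enc:  SAE encoder bias, shape (d_sae,)
    W_dec:  SAE decoder weights, shape (d_sae, d_model)
    delta:  per-latent steering offsets (stand-in for the conditional-
            difference map's output for this prompt), shape (d_sae,)
    top_k:  only the k most strongly active latents are modified.
    """
    # SAE latent activations for the current token (ReLU encoder).
    z = np.maximum(h @ W_enc + b_enc, 0.0)
    # Restrict the intervention to token-active latents.
    active = np.argsort(z)[-top_k:]
    z_steered = z.copy()
    z_steered[active] += delta[active]
    # Add only the *change* in the SAE reconstruction back to the
    # activation, leaving the SAE's reconstruction error untouched.
    return h + (z_steered - z) @ W_dec
```

With `delta = 0` the function is the identity on `h`, which reflects the no-weight-update property: the base model's forward pass is unchanged except for the additive latent shift.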