AI Summary
This work proposes an inference-time alignment method that avoids the high computational cost and opacity of traditional weight-update-based approaches. By constructing a prompt-conditioned sparse autoencoder (SAE) steering mechanism, the method dynamically intervenes on latents active for the current token during decoding, achieving alignment without modifying the base model weights. This is the first prompt-conditioned dynamic control mechanism for SAEs, and it reduces alignment-stage FLOPs by up to 4.47×. Analysis shows that preference directions are driven primarily by discourse and stylistic signals. Evaluated on Gemma-2-2B/9B and Qwen3-8B, the method improves MT-Bench scores, is competitive on AlpacaEval, and maintains robust multiple-choice accuracy even with limited preference data.
Abstract
Preference alignment is usually achieved by weight-updating training on preference data, which adds substantial alignment-stage compute and provides limited mechanistic visibility. We propose Dynamic SAE Steering for Preference Alignment (DSPA), an inference-time method that makes sparse autoencoder (SAE) steering prompt-conditional. From preference triples, DSPA computes a conditional-difference map linking prompt features to generation-control features; during decoding, it modifies only token-active latents, without base-model weight updates. Across Gemma-2-2B/9B and Qwen3-8B, DSPA improves MT-Bench and is competitive on AlpacaEval while preserving multiple-choice accuracy. Under restricted preference data, DSPA remains robust and can rival the two-stage RAHF-SCIT pipeline while requiring up to $4.47\times$ fewer alignment-stage FLOPs. Finally, we audit the SAE features DSPA modifies, finding that preference directions are dominated by discourse and stylistic signals, and provide theory clarifying the conditional-difference map estimate and when top-$k$ ablation is principled.
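The core decoding-time intervention described above can be sketched as follows. This is a hypothetical, minimal illustration, not the authors' implementation: the function name `dspa_steer`, the NumPy formulation, and the use of a per-latent offset vector `delta` (standing in for the output of the conditional-difference map) are all assumptions; only the overall idea — encode the token's activation with an SAE, shift only the token-active latents, and decode the change back — comes from the abstract.

```python
import numpy as np

def dspa_steer(h, W_enc, b_enc, W_dec, delta, top_k=32):
    """Hypothetical sketch of prompt-conditioned SAE steering.

    h:      residual-stream activation for the current token, shape (d_model,)
    W_enc:  SAE encoder weights, shape (d_model, d_sae)
    b_enc:  SAE encoder bias, shape (d_sae,)
    W_dec:  SAE decoder weights, shape (d_sae, d_model)
    delta:  per-latent steering offsets (stand-in for the conditional-
            difference map's output for this prompt), shape (d_sae,)
    top_k:  only the k most strongly active latents are modified.
    """
    # SAE latent activations for the current token (ReLU encoder).
    z = np.maximum(h @ W_enc + b_enc, 0.0)
    # Restrict the intervention to token-active latents.
    active = np.argsort(z)[-top_k:]
    z_steered = z.copy()
    z_steered[active] += delta[active]
    # Add only the *change* in the SAE reconstruction back to the
    # activation, leaving the SAE's reconstruction error untouched.
    return h + (z_steered - z) @ W_dec
```

With `delta = 0` the function is the identity on `h`, which reflects the no-weight-update property: the base model's forward pass is unchanged except for the additive latent shift.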