🤖 AI Summary
Existing latent-space intervention methods for large language models typically operate on dense representations, often leading to semantic entanglement and hindering precise, controllable generation. This work proposes a novel approach that, for the first time, applies sparse autoencoders (SAEs) to query activations within the attention mechanism to extract disentangled and interpretable features. During inference, it combines gradient-based optimization with alignment to prototypical target behaviors to precisely steer generated content. The method effectively satisfies strict path-planning constraints in the Textualized Gridworld benchmark and successfully modulates the cognitive complexity of feedback in educational scenarios, demonstrating its unified applicability, interpretability, and efficacy in tasks requiring rule adherence and stylistic control.
📝 Abstract
Latent steering exploits internal representations of Large Language Models (LLMs) to guide generation, yet interventions on dense states can entangle distinct semantic features. In this paper, we investigate attention query activations as a high-fidelity site for precise control, hypothesizing that manipulating the attention mechanism itself offers sharper steerability than general state interventions. We introduce Prototype-Based Sparse Steering, a framework that applies Sparse Autoencoders (SAEs) specifically to query activations to decompose them into interpretable features, and then applies gradient-based optimization during inference to align the sparse representation with class prototypes of target behaviors. To validate this architectural insight, we first analyze the mechanism in Textualized Gridworld, a controlled environment for verifiable planning constraints. We demonstrate that optimizing sparse query features enables effective navigation of rigid planning requirements (i.e., safe vs. short paths), confirming the method's ability to satisfy objective rules. We then demonstrate the framework's versatility by training SAEs on a high-dimensional educational domain, where the framework steers the cognitive complexity of feedback (i.e., Bloom's Taxonomy). Our experiments establish that sparse query representations provide the necessary disentanglement for unified, interpretable control over both logical planning and stylistic nuance.
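To make the steering loop concrete, the following is a minimal toy sketch of the idea described in the abstract, not the paper's implementation. It assumes a ReLU SAE encoder, a prototype defined as the mean sparse code of example queries from the target class, a squared-distance alignment loss, and plain gradient descent applied directly to a single query vector. All names (`sae_encode`, `class_prototype`, `steer_query`, `W_ENC`, `B_ENC`) and all numeric values are hypothetical; the actual objective, prototype construction, and intervention site may differ.

```python
# Toy sketch (assumptions throughout): a 2-d "query" and a 3-feature ReLU SAE encoder.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sae_encode(W_enc, b_enc, q):
    """Sparse code z = relu(W_enc q + b_enc)."""
    return [max(0.0, a + b) for a, b in zip(matvec(W_enc, q), b_enc)]

def class_prototype(W_enc, b_enc, queries):
    """Assumed prototype: mean sparse code over example queries of one class."""
    codes = [sae_encode(W_enc, b_enc, q) for q in queries]
    return [sum(col) / len(codes) for col in zip(*codes)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def steer_query(W_enc, b_enc, q, prototype, lr=0.05, steps=200):
    """Gradient descent on q to reduce ||sae_encode(q) - prototype||^2."""
    q = list(q)
    for _ in range(steps):
        pre = [a + b for a, b in zip(matvec(W_enc, q), b_enc)]
        # dL/dpre is 2 * (z - p) where the ReLU is active, and 0 elsewhere.
        g_pre = [2.0 * (a - p) if a > 0 else 0.0 for a, p in zip(pre, prototype)]
        # Chain rule through the linear map: dL/dq = W_enc^T g_pre.
        g_q = [
            sum(W_enc[k][j] * g_pre[k] for k in range(len(W_enc)))
            for j in range(len(q))
        ]
        q = [qj - lr * gj for qj, gj in zip(q, g_q)]
    return q

# Hypothetical encoder weights and example queries for a target class.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0]
target_class_queries = [[1.0, 0.2], [0.8, 0.0]]

proto = class_prototype(W_ENC, B_ENC, target_class_queries)
q0 = [0.1, 0.9]                      # query that starts far from the prototype
q_steered = steer_query(W_ENC, B_ENC, q0, proto)

dist_before = sq_dist(sae_encode(W_ENC, B_ENC, q0), proto)
dist_after = sq_dist(sae_encode(W_ENC, B_ENC, q_steered), proto)
```

In this toy setting `dist_after` ends up far smaller than `dist_before`, because the active features are pulled toward the prototype while the inactive feature receives no gradient. In a real model the same loop would presumably run on query activations captured from an attention layer, with the steered query written back before attention scores are computed; that wiring is omitted here.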