🤖 AI Summary
Existing activation steering methods typically rest on simplifying assumptions that restrict interventions to fixed, single-step, position-invariant transformations. This work hypothesizes that these assumptions explain why such methods generalize poorly to unseen concepts and are often outperformed by simple in-context prompting. It proposes FLAS (Flow-based Activation Steering), which models the transformation from unsteered to steered activations with a concept-conditioned ordinary differential equation. The learned velocity field yields a nonlinear, multi-step, token-adaptive mapping while keeping model parameters frozen and requiring no per-concept tuning. On AxBench, FLAS is reported as the first learned method to consistently outperform in-context prompting, with held-out harmonic mean scores of 1.015 on Gemma-2-2B-IT and 1.113 on Gemma-2-9B-IT.
📝 Abstract
Activation steering has emerged as a promising alternative for controlling language-model behavior at inference time by modifying intermediate representations while keeping model parameters frozen. However, large-scale evaluations such as AxBench show that existing steering methods are often outperformed by simple in-context prompting and generalize poorly to unseen concepts. We hypothesize that these limitations arise from unvalidated simplifying assumptions shared across prior methods, which typically restrict steering interventions to fixed, single-step, position-invariant transforms. We propose FLAS (Flow-based Activation Steering), which learns a general, concept-conditioned velocity field $v_t(h,t,c)$ that transports unsteered activations to steered ones without relying on these assumptions. On AxBench, FLAS is the first learned method to consistently outperform prompting, reaching held-out harmonic means of $1.015$ on Gemma-2-2B-IT and $1.113$ on Gemma-2-9B-IT without per-concept tuning. Analysis of the learned flow shows curved, multi-step, token-varying trajectories, which suggests that previous hypotheses on activation space geometry might be incomplete.
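To make the idea of transporting activations along a velocity field concrete, the following is a minimal sketch, not the paper's implementation. It integrates $dh/dt = v(h, t, c)$ from $t=0$ to $t=1$ with fixed-step forward Euler, applied independently to each token's activation. The names `toy_velocity`, `steer`, and `steer_sequence`, the choice of Euler integration, and the step count are all assumptions made for illustration; in FLAS the velocity field is learned, whereas here it is a hand-written stand-in.

```python
from typing import Callable, List

Vector = List[float]
VelocityField = Callable[[Vector, float, Vector], Vector]


def toy_velocity(h: Vector, t: float, c: Vector) -> Vector:
    """Hypothetical stand-in for a learned v(h, t, c).

    It simply pulls the activation h toward the concept vector c and
    ignores t; a learned field could depend on all three arguments.
    """
    return [ci - hi for hi, ci in zip(h, c)]


def steer(h0: Vector, c: Vector, v: VelocityField, num_steps: int = 8) -> Vector:
    """Integrate dh/dt = v(h, t, c) from t=0 to t=1 with forward Euler."""
    h = list(h0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        velocity = v(h, t, c)
        # One Euler step: move h a small distance along the current velocity.
        h = [hi + dt * vi for hi, vi in zip(h, velocity)]
    return h


def steer_sequence(
    hs: List[Vector], c: Vector, v: VelocityField, num_steps: int = 8
) -> List[Vector]:
    """Steer every token position separately.

    Because the velocity depends on h, each token gets its own
    displacement rather than one shared steering vector.
    """
    return [steer(h, c, v, num_steps) for h in hs]


if __name__ == "__main__":
    concept = [1.0, 1.0]
    tokens = [[0.0, 0.0], [2.0, 0.0]]
    for before, after in zip(tokens, steer_sequence(tokens, concept, toy_velocity)):
        print(before, "->", [round(x, 4) for x in after])
```

With `num_steps=1` and this toy field, the update collapses to a single additive shift. Taking more steps lets the velocity be re-evaluated at intermediate states, which is what allows a state-dependent field to trace a multi-step, token-varying path. The toy field here is linear in `h`, so its paths are straight lines toward `c`; curved trajectories like those the abstract reports would need a nonlinear field.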