Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention

📅 2026-05-07

🤖 AI Summary

Existing activation intervention methods rest on oversimplified assumptions (fixed, single-step, position-invariant transformations), which limits their generalization and leaves them behind simple in-context prompting. This work proposes FLAS (Flow-based Activation Steering), which brings a continuous flow field into activation intervention. FLAS learns a concept-conditioned velocity field that transports unsteered activations to steered ones by following an ordinary differential equation, giving a nonlinear, multi-step, token-adaptive mapping while keeping model parameters frozen and requiring no per-concept tuning. On AxBench, FLAS is the first learned method to consistently outperform prompting, reaching held-out harmonic mean scores of 1.015 on Gemma-2-2B-IT and 1.113 on Gemma-2-9B-IT.
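For readers unfamiliar with harmonic-mean aggregation, the sketch below shows why such a score rewards balanced performance. It assumes the reported number combines several per-aspect scores into one value; the aspect names and sample values are hypothetical and are not taken from the paper or from AxBench.

```python
def harmonic_mean(scores):
    """Harmonic mean of non-negative scores; zero if any score is zero."""
    if not scores:
        raise ValueError("need at least one score")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    if any(s == 0 for s in scores):
        # A single failed aspect zeroes the aggregate.
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


# Hypothetical per-aspect scores for two outputs (illustrative values only).
balanced = {"aspect_a": 1.0, "aspect_b": 1.0, "aspect_c": 1.0}
lopsided = {"aspect_a": 2.0, "aspect_b": 2.0, "aspect_c": 0.0}

balanced_score = harmonic_mean(list(balanced.values()))
lopsided_score = harmonic_mean(list(lopsided.values()))
```

Under this aggregation the lopsided output scores zero despite a higher arithmetic mean, so a steering method cannot trade away one aspect entirely to maximize another.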

📝 Abstract

Activation steering has emerged as a promising alternative for controlling language-model behavior at inference time by modifying intermediate representations while keeping model parameters frozen. However, large-scale evaluations such as AxBench show that existing steering methods are often outperformed by simple in-context prompting and generalize poorly to unseen concepts. We hypothesize that these limitations arise from unvalidated simplifying assumptions shared across prior methods, which typically restrict steering interventions to fixed, single-step, position-invariant transforms. We propose FLAS (Flow-based Activation Steering), which learns a general, concept-conditioned velocity field $v_t(h,t,c)$ that transports unsteered activations to steered ones without relying on these assumptions. On AxBench, FLAS is the first learned method to consistently outperform prompting, reaching held-out harmonic means of $1.015$ on Gemma-2-2B-IT and $1.113$ on Gemma-2-9B-IT without per-concept tuning. Analysis of the learned flow shows curved, multi-step, token-varying trajectories, which suggests that previous hypotheses on activation space geometry might be incomplete.
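The velocity-field formulation can be read as integrating dh/dt = v(h, t, c) from unsteered to steered activations. The sketch below is a toy illustration under stated assumptions, not the paper's implementation: the velocity field is a small random tanh layer standing in for a learned network, and the names `make_toy_velocity_field`, `steer`, and `num_steps`, the forward Euler solver, and the unit time interval are all hypothetical choices.

```python
import numpy as np


def make_toy_velocity_field(dim, concept_dim, seed=0):
    """Return a toy v(h, t, c); random weights stand in for a trained model."""
    rng = np.random.default_rng(seed)
    W_h = rng.normal(scale=0.1, size=(dim, dim))
    W_c = rng.normal(scale=0.1, size=(concept_dim, dim))
    w_t = rng.normal(scale=0.1, size=(dim,))

    def velocity(h, t, c):
        # h: (tokens, dim) activations, t: scalar time, c: (concept_dim,) concept code.
        # The tanh makes the field nonlinear in h, so each token gets its own update.
        return np.tanh(h @ W_h + c @ W_c + t * w_t)

    return velocity


def steer(h0, c, velocity, num_steps=8):
    """Forward Euler integration of dh/dt = v(h, t, c) over t in [0, 1]."""
    h = h0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        h = h + dt * velocity(h, t, c)
    return h


# Toy usage: 4 token positions, 16-dim activations, 8-dim concept embedding.
rng = np.random.default_rng(1)
h0 = rng.normal(size=(4, 16))
c = rng.normal(size=(8,))
v = make_toy_velocity_field(dim=16, concept_dim=8)
h1 = steer(h0, c, v, num_steps=8)
delta = h1 - h0  # per-token displacement produced by the flow
```

With `num_steps=1` and a velocity that ignores `h` and `t`, this collapses to adding one fixed vector at every position, which is the fixed, single-step, position-invariant special case the abstract describes.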

Problem

Research questions and friction points this paper is trying to address.

activation steering

inference-time intervention

language model control

generalization

representation editing

Innovation

Methods, ideas, or system contributions that make the work stand out.

activation steering

flow-based modeling

inference-time intervention