AI Summary
In multi-turn dialogue, standard residual-stream activation steering loses behavioral control and coherence because of KV-cache contamination: steered token states are written to the cache and reused on every later turn. This work proposes Gated Cropped Attention-Delta steering (GCAD), which extracts the system prompt's contribution to self-attention as the steering signal and applies it with token-level gating, so that the intervention follows the model's own prompt-dependent control pathway. This alignment mitigates KV-cache contamination. Experimental results show that GCAD substantially improves long-horizon consistency: on the main multi-turn benchmark, average coherence drift improves from -18.6 to -1.9, and turn-10 trait expression rises from 78.0 to 93.1.
Abstract
Activation steering controls language model behavior by adding directions to internal representations at inference time, but standard residual-stream steering can fail in stateful dialogue. We identify KV-cache contamination as a key failure mode: steered token states are stored and repeatedly reused, turning a local perturbation into cumulative coherence degradation. To address this challenge, we propose Gated Cropped Attention-Delta steering (GCAD), which extracts steering signals from system-prompt contributions to self-attention and applies them with token-level gating. Across persona-steering experiments, GCAD preserves trait control while substantially improving long-horizon coherence. On the main multi-turn benchmark, GCAD improves average coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1. These results suggest that activation steering becomes more reliable when interventions follow the prompt-mediated pathways that models already use for behavioral control.
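The contamination mechanism described above can be illustrated with a toy sketch. It is not the paper's model or the GCAD method: it assumes a single layer, uniform averaging in place of softmax attention, identity key/value projections, and a hypothetical steering direction, and every name in it is illustrative. It only shows that a fixed per-token perturbation compounds when steered states are written back to the cache, and stays constant when they are not.

```python
import math


def mean(vectors, dim):
    """Uniform 'attention' over the cache; zeros when the cache is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def scale(v, s):
    return [x * s for x in v]


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def run_dialogue(num_tokens, steer_strength, write_to_cache):
    """Return per-token drift (L2 distance) between steered and clean states.

    Toy assumptions (not from the paper): one layer, uniform attention,
    identity key/value projections, arbitrary deterministic token inputs.
    """
    dim = 4
    direction = [1.0, 0.0, 0.0, 0.0]  # hypothetical unit steering direction
    clean_cache, steered_cache = [], []
    drifts = []
    for t in range(num_tokens):
        token_input = [math.sin(t + i) for i in range(dim)]
        # Unsteered run: token input plus the attention read from a clean cache.
        clean = add(token_input, mean(clean_cache, dim))
        # Steered run: same computation, then add the steering vector.
        base = add(token_input, mean(steered_cache, dim))
        steered = add(base, scale(direction, steer_strength))
        clean_cache.append(clean)
        # Residual-stream steering stores the steered state, so later tokens
        # attend to it; the contrast case stores the unsteered state instead.
        steered_cache.append(steered if write_to_cache else base)
        drifts.append(dist(steered, clean))
    return drifts


if __name__ == "__main__":
    print("cached  :", [round(d, 3) for d in run_dialogue(6, 0.5, True)])
    print("isolated:", [round(d, 3) for d in run_dialogue(6, 0.5, False)])
```

In this toy, the drift in the cached case grows with every token even though the steering strength is fixed, which is the cumulative effect the abstract attributes to KV-cache contamination. The isolated case is only a contrast and should not be read as an implementation of GCAD.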