Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In multi-turn dialogue, standard residual-stream activation steering loses behavioral control and coherence because of KV-cache contamination: steered token states are cached and reused on every later turn. This work proposes Gated Cropped Attention-Delta steering (GCAD), which extracts the system prompt's contribution to self-attention as the steering signal and applies it with token-level gating, so that the intervention follows the prompt-mediated pathway the model already uses for behavioral control. This alignment mitigates KV-cache contamination. On the main multi-turn benchmark, GCAD improves average coherence drift from −18.6 to −1.9 and raises turn-10 trait expression from 78.0 to 93.1.

📝 Abstract

Activation steering controls language model behavior by adding directions to internal representations at inference time, but standard residual-stream steering can fail in stateful dialogue. We identify KV-cache contamination as a key failure mode: steered token states are stored and repeatedly reused, turning a local perturbation into cumulative coherence degradation. To address this challenge, we propose Gated Cropped Attention-Delta steering (GCAD), which extracts steering signals from system-prompt contributions to self-attention and applies them with token-level gating. Across persona-steering experiments, GCAD preserves trait control while substantially improving long-horizon coherence. On the main multi-turn benchmark, GCAD improves average coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1. These results suggest that activation steering becomes more reliable when interventions follow the prompt-mediated pathways that models already use for behavioral control.
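
The abstract describes the steering signal only at a high level, so the following is a minimal toy sketch of one way to read it, not the paper's implementation. It assumes the "attention delta" is the difference between a token's attention output with and without the system-prompt keys and values, and that the gate is a per-token scalar in [0, 1]. The function names (`attend`, `attention_delta`, `gated_steer`), the single-head setup, the random vectors, and the hand-picked gate values are all hypothetical; the abstract does not say how gates are computed.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(q))
    weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def attention_delta(q, sys_kv, ctx_kv):
    """Attention output with system-prompt keys/values minus output without them.

    Assumption: this difference stands in for the "system-prompt
    contribution to self-attention" mentioned in the abstract.
    """
    sys_k, sys_v = sys_kv
    ctx_k, ctx_v = ctx_kv
    with_sys = attend(q, sys_k + ctx_k, sys_v + ctx_v)
    without_sys = attend(q, ctx_k, ctx_v)
    return [a - b for a, b in zip(with_sys, without_sys)]


def gated_steer(attn_out, delta, gate):
    """Add the delta to one token's attention output, scaled by a gate in [0, 1]."""
    g = min(1.0, max(0.0, gate))
    return [o + g * d for o, d in zip(attn_out, delta)]


def rand_vecs(rng, n, d):
    """n random d-dimensional vectors (toy stand-ins for real activations)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
dim = 8
sys_kv = (rand_vecs(rng, 3, dim), rand_vecs(rng, 3, dim))  # system-prompt keys, values
ctx_kv = (rand_vecs(rng, 5, dim), rand_vecs(rng, 5, dim))  # dialogue-context keys, values
queries = rand_vecs(rng, 4, dim)                           # one query per steered token
gates = [0.0, 0.5, 1.0, 1.0]                               # hand-picked, illustrative only

base = [attend(q, *ctx_kv) for q in queries]
deltas = [attention_delta(q, sys_kv, ctx_kv) for q in queries]
steered = [gated_steer(b, d, g) for b, d, g in zip(base, deltas, gates)]
```

Under these assumptions, a gate of 0 leaves a token's attention output unchanged, while a gate of 1 reproduces the output the token would have had if the system prompt were present in its attention context. Intermediate gates interpolate between the two.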

Problem

Research questions and friction points this paper is trying to address.

activation steering

KV-cache contamination

coherence degradation

stateful dialogue

language model behavior

Innovation

Methods, ideas, or system contributions that make the work stand out.

activation steering

KV-cache contamination

attention-level intervention