🤖 AI Summary
Fine-tuning large language models (LLMs) for safety control is computationally expensive, and existing inference-time methods lack fine-grained, adaptive intervention. This paper proposes a lightweight trainable controller that operates without modifying LLM parameters, dynamically modulating layer-wise activations at inference time using precomputed refusal-direction vectors. Crucially, the authors introduce a hierarchical weighted activation guidance mechanism that jointly predicts an input-dependent global scaling factor and layer-specific weights, enabling layer-aware, fine-grained, and adaptive safety intervention. The controller is trained via supervised learning on paired harmful/benign prompts. Experiments demonstrate substantial improvements in refusal rates on benchmarks including ToxicChat and In-The-Wild Jailbreak Prompts, outperforming state-of-the-art approaches. The method is model-agnostic and compatible with mainstream open-weight LLMs such as Llama-3.1-8B, Llama-3.2-1B, and Mistral-7B.
📝 Abstract
Controlling undesirable Large Language Model (LLM) behaviors, such as generating unsafe content or failing to adhere to safety guidelines, often relies on costly fine-tuning. Activation steering provides an alternative for inference-time control, but existing methods typically lack fine-grained, adaptive mechanisms. We introduce a novel approach using a lightweight, trainable controller network integrated during inference. This controller observes specific intermediate LLM activations and predicts both a global scaling factor and layer-specific weights. These predictions then dynamically modulate the intensity of a steering patch, derived from a pre-computed "refusal direction" vector, applied across the LLM's layers during generation. Trained on activations from both harmful and benign prompts, our controller learns to discriminatively apply nuanced, layer-aware interventions, activating steering primarily for harmful inputs. Experiments on safety benchmarks such as ToxicChat and In-The-Wild Jailbreak Prompts demonstrate that our weighted steering controller significantly increases refusal rates compared to the base LLM, achieving targeted behavioral modification without altering the original model parameters. Our experiments with Llama-3.1-8B, Llama-3.2-1B, and Mistral-7B show that our approach outperforms existing methods, presenting an efficient and adaptive technique for fine-grained control over LLM behavior at inference time.
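The core mechanism described above — a controller predicting a global scaling factor and per-layer weights that modulate a fixed refusal-direction patch — can be sketched in miniature. This is an illustrative NumPy toy, not the paper's implementation: the layer count, hidden size, controller architecture (a single linear layer per head here), and all weights are placeholder assumptions; in the paper the controller is trained on activations from paired harmful/benign prompts.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LAYERS, HIDDEN = 4, 16  # toy sizes, not from the paper

# Hypothetical pre-computed refusal-direction vector (unit norm),
# e.g. a difference-of-means direction between harmful and benign activations.
refusal_dir = rng.standard_normal(HIDDEN)
refusal_dir /= np.linalg.norm(refusal_dir)

def controller(probe_activation, W_g, W_l):
    """Toy stand-in for the trained controller.

    Predicts a global scaling factor g in (0, 1) via a sigmoid head
    and layer-specific weights w via a softmax head.
    """
    g = 1.0 / (1.0 + np.exp(-probe_activation @ W_g))      # sigmoid
    logits = probe_activation @ W_l
    w = np.exp(logits - logits.max())
    w /= w.sum()                                           # softmax over layers
    return g, w

def apply_weighted_steering(activations, g, w, direction):
    """Add g * w[l] * direction to layer l's activation (the steering patch)."""
    return activations + g * w[:, None] * direction[None, :]

# Placeholder controller weights; these would be learned from
# paired harmful/benign prompt activations.
W_g = rng.standard_normal(HIDDEN)
W_l = rng.standard_normal((HIDDEN, NUM_LAYERS))

acts = rng.standard_normal((NUM_LAYERS, HIDDEN))  # stand-in hidden states
probe = acts[1]                                   # controller observes one layer
g, w = controller(probe, W_g, W_l)
steered = apply_weighted_steering(acts, g, w, refusal_dir)
```

For benign inputs, a trained controller would drive `g` toward zero so the patch is effectively a no-op, which is how intervention stays input-dependent; the softmax over layers concentrates the patch where it is most effective.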