🤖 AI Summary
Fine-tuning large language models (LLMs) for safety control is computationally expensive, and existing inference-time methods lack fine-grained, adaptive intervention. This paper proposes a lightweight trainable controller that operates without modifying LLM parameters, dynamically modulating layer-wise activations at inference time using precomputed refusal-direction vectors. Crucially, the authors introduce a hierarchical weighted activation guidance mechanism that jointly predicts an input-dependent global scaling factor and layer-specific weights, enabling layer-aware, fine-grained, and adaptive safety intervention. The controller is trained via supervised learning on paired harmful/benign prompts. Experiments demonstrate substantial improvements in refusal rates on benchmarks including ToxicChat and In-The-Wild Jailbreak Prompts, outperforming state-of-the-art approaches. The method is model-agnostic and compatible with mainstream open-weight LLMs such as Llama-3.1-8B, Llama-3.2-1B, and Mistral-7B.
📝 Abstract
Controlling undesirable Large Language Model (LLM) behaviors, such as generating unsafe content or failing to adhere to safety guidelines, often relies on costly fine-tuning. Activation steering provides an alternative for inference-time control, but existing methods typically lack fine-grained, adaptive mechanisms. We introduce a novel approach using a lightweight, trainable controller network integrated during inference. This controller observes specific intermediate LLM activations and predicts both a global scaling factor and layer-specific weights. These predictions then dynamically modulate the intensity of a steering patch, derived from a pre-computed "refusal direction" vector, applied across the LLM's layers during generation. Trained on activations from both harmful and benign prompts, our controller learns to discriminatively apply nuanced, layer-aware interventions, activating steering primarily for harmful inputs. Experiments on safety benchmarks such as ToxicChat and In-The-Wild Jailbreak Prompts demonstrate that our weighted steering controller significantly increases refusal rates compared to the base LLM, achieving targeted behavioral modification without altering the original model parameters. Our experiments with Llama-3.1-8B, Llama-3.2-1B, and Mistral-7B show that our approach outperforms existing methods, presenting an efficient and adaptive technique for fine-grained control over LLM behavior at inference time.
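The core mechanism described above — a controller predicting a global scaling factor and per-layer weights that modulate a fixed refusal-direction patch — can be sketched in miniature. This is an illustrative NumPy toy, not the paper's implementation: the layer count, hidden size, controller architecture (a single linear layer per head here), and all weights are placeholder assumptions; in the paper the controller is trained on activations from paired harmful/benign prompts.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LAYERS, HIDDEN = 4, 16  # toy sizes, not from the paper

# Hypothetical pre-computed refusal-direction vector (unit norm),
# e.g. a difference-of-means direction between harmful and benign activations.
refusal_dir = rng.standard_normal(HIDDEN)
refusal_dir /= np.linalg.norm(refusal_dir)

def controller(probe_activation, W_g, W_l):
    """Toy stand-in for the trained controller.

    Predicts a global scaling factor g in (0, 1) via a sigmoid head
    and layer-specific weights w via a softmax head.
    """
    g = 1.0 / (1.0 + np.exp(-probe_activation @ W_g))      # sigmoid
    logits = probe_activation @ W_l
    w = np.exp(logits - logits.max())
    w /= w.sum()                                           # softmax over layers
    return g, w

def apply_weighted_steering(activations, g, w, direction):
    """Add g * w[l] * direction to layer l's activation (the steering patch)."""
    return activations + g * w[:, None] * direction[None, :]

# Placeholder controller weights; these would be learned from
# paired harmful/benign prompt activations.
W_g = rng.standard_normal(HIDDEN)
W_l = rng.standard_normal((HIDDEN, NUM_LAYERS))

acts = rng.standard_normal((NUM_LAYERS, HIDDEN))  # stand-in hidden states
probe = acts[1]                                   # controller observes one layer
g, w = controller(probe, W_g, W_l)
steered = apply_weighted_steering(acts, g, w, refusal_dir)
```

For benign inputs, a trained controller would drive `g` toward zero so the patch is effectively a no-op, which is how intervention stays input-dependent; the softmax over layers concentrates the patch where it is most effective.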