Energy-Driven Steering: Reducing False Refusals in Large Language Models

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing safety alignment methods often cause large language models (LLMs) to over-reject harmless prompts, degrading usability. To address this, we propose Energy-Driven Steering (EDS): a lightweight, inference-time technique that integrates an external energy-based model (EBM) without any fine-tuning. EDS maps LLM hidden states onto an energy landscape and dynamically steers them toward low-energy regions via gradient-based optimization, effectively decoupling safety control from knowledge representation. This enables flexible, low-overhead behavioral regulation in real time. On the ORB-H benchmark, EDS raises the compliant response rate from 57.3% to 82.6%, substantially reducing false refusals while preserving the model's original safety performance. Our key contribution is the first application of energy-based modeling to inference-time safety steering in LLMs, achieving a principled balance between safety assurance and response helpfulness.

📝 Abstract
Safety alignment of large language models (LLMs) faces a key challenge: current alignment techniques often focus only on improving safety against harmful prompts, causing LLMs to become over-cautious and refuse to respond to benign prompts. A key objective of safety alignment is therefore to enhance safety while simultaneously reducing false refusals. In this paper, we introduce Energy-Driven Steering (EDS), a novel, fine-tuning-free framework designed to resolve this challenge through dynamic, inference-time intervention. We train a lightweight, external Energy-Based Model (EBM) to assign high energy to undesirable states (false refusals or jailbreaks) and low energy to desirable ones (helpful responses or safe refusals). During inference, the EBM maps the LLM's internal activations to an "energy landscape". We use the gradient of the energy function to dynamically steer the LLM's hidden states toward low-energy regions, correcting the model to generate a desirable response in real time without modifying its weights. This method decouples behavioral control from the model's core knowledge, offering a flexible solution with minimal computational overhead. Extensive experiments across a wide range of models show that our method achieves this objective: it substantially lowers false refusal rates, for example raising compliance on the ORB-H benchmark from 57.3% to 82.6%, while maintaining baseline safety performance. Our work presents an effective paradigm for building LLMs that achieve both low false refusal rates and high safety.
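The steering step described above, mapping a hidden state to a scalar energy and following the negative energy gradient, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the EBM architecture, step size, and number of gradient steps are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

class EnergyModel(nn.Module):
    """Hypothetical lightweight EBM: maps an LLM hidden state to a scalar
    energy (high = undesirable state, low = desirable state)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.SiLU(),
            nn.Linear(256, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h).squeeze(-1)

def steer(h: torch.Tensor, ebm: EnergyModel,
          step_size: float = 0.1, n_steps: int = 3) -> torch.Tensor:
    """Gradient-descend a batch of hidden states on the energy landscape,
    leaving the LLM's weights untouched."""
    h = h.detach().clone()
    for _ in range(n_steps):
        h.requires_grad_(True)
        energy = ebm(h).sum()              # total energy of the batch
        (grad,) = torch.autograd.grad(energy, h)
        h = (h - step_size * grad).detach()  # step toward lower energy
    return h

# Usage: steer random stand-in hidden states of dimension 64.
ebm = EnergyModel(hidden_dim=64)
h0 = torch.randn(2, 64)
h1 = steer(h0, ebm)
```

In practice the steered state `h1` would replace the original activation at a chosen layer before generation continues; only the small EBM is trained, so the intervention adds minimal overhead.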
Problem

Research questions and friction points this paper is trying to address.

Reducing false refusals in LLM safety alignment
Preventing over-cautious responses to benign prompts
Maintaining safety while improving helpful response rates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic steering of hidden states during inference
Lightweight external energy model guides responses
Fine-tuning free framework maintains safety performance
Eric Hanchen Jiang
University of California, Los Angeles
Weixuan Ou
Alibaba Cloud Computing
Run Liu
Shanghai Jiao Tong University
Shengyuan Pang
Alibaba Cloud Computing
Guancheng Wan
Computer Science, UCLA
Ranjie Duan
Alibaba Group
Wei Dong
Nanyang Technological University
Kai-Wei Chang
University of California, Los Angeles
XiaoFeng Wang
Chair, ACM SIGSAC
Ying Nian Wu
UCLA Department of Statistics and Data Science
Xinfeng Li
Nanyang Technological University