The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking

📅 2025-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies an “energy loss” phenomenon in the final layer of large language models (LLMs) during Reinforcement Learning from Human Feedback (RLHF): energy loss gradually increases over the course of RL training, and an excessive increase characterizes reward hacking. The authors prove that, under mild conditions, increased energy loss lowers the upper bound on contextual relevance, which undermines alignment with human preferences. To address this, they propose Energy loss-aware PPO (EPPO), which penalizes the increase in final-layer energy loss during reward calculation and can be conceptually interpreted as an entropy-regularized RL algorithm. Extensive experiments show that the energy loss phenomenon is common across diverse LLMs and tasks, and that EPPO mitigates reward hacking and improves alignment with human preferences, with a reported average 12.7% increase in win rate over baselines.

📝 Abstract
This work identifies the Energy Loss Phenomenon in Reinforcement Learning from Human Feedback (RLHF) and its connection to reward hacking. Specifically, energy loss in the final layer of a Large Language Model (LLM) gradually increases during the RL process, with an excessive increase in energy loss characterizing reward hacking. Beyond empirical analysis, we further provide a theoretical foundation by proving that, under mild conditions, the increased energy loss reduces the upper bound of contextual relevance in LLMs, which is a critical aspect of reward hacking as the reduced contextual relevance typically indicates overfitting to reward model-favored patterns in RL. To address this issue, we propose an Energy loss-aware PPO algorithm (EPPO) which penalizes the increase in energy loss in the LLM's final layer during reward calculation to prevent excessive energy loss, thereby mitigating reward hacking. We theoretically show that EPPO can be conceptually interpreted as an entropy-regularized RL algorithm, which provides deeper insights into its effectiveness. Extensive experiments across various LLMs and tasks demonstrate the commonality of the energy loss phenomenon, as well as the effectiveness of EPPO in mitigating reward hacking and improving RLHF performance.
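
The reward shaping described in the abstract can be illustrated with a small sketch. The abstract does not define energy loss or give the penalty's exact form, so everything here is an assumption made for illustration: energy loss is taken to be the drop in L1 norm between the final layer's input and output hidden states, the penalty is one-sided (only increases relative to a reference model are penalized), and the names `energy_loss`, `penalized_reward`, and `eta` are hypothetical rather than taken from the paper.

```python
from typing import Sequence


def l1_norm(vector: Sequence[float]) -> float:
    """Sum of absolute values of the vector's entries."""
    return sum(abs(x) for x in vector)


def energy_loss(h_in: Sequence[float], h_out: Sequence[float]) -> float:
    """Assumed definition: L1 norm of the final layer's input hidden state
    minus the L1 norm of its output hidden state."""
    return l1_norm(h_in) - l1_norm(h_out)


def penalized_reward(
    reward: float,
    energy_loss_policy: float,
    energy_loss_reference: float,
    eta: float = 0.1,
) -> float:
    """Subtract a penalty proportional to the increase in energy loss of the
    RL policy over a reference model (for example, the SFT model).

    Assumption: decreases in energy loss are not penalized.
    """
    increase = max(energy_loss_policy - energy_loss_reference, 0.0)
    return reward - eta * increase


if __name__ == "__main__":
    # Toy hidden states; real ones would come from the LLM's final layer.
    h_in = [1.0, -2.0, 3.0]
    loss_reference = energy_loss(h_in, [0.5, -1.5, 2.0])  # 6.0 - 4.0 = 2.0
    loss_policy = energy_loss(h_in, [0.25, -0.5, 0.25])   # 6.0 - 1.0 = 5.0
    print(penalized_reward(2.0, loss_policy, loss_reference, eta=0.5))  # 0.5
```

In this toy case the policy's energy loss exceeds the reference by 3.0, so with `eta=0.5` the reward model score of 2.0 is reduced to 0.5. The shaped reward would then be passed to an otherwise unchanged PPO update.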
Problem

Research questions and friction points this paper is trying to address.

Energy Loss
Reinforcement Learning from Human Feedback (RLHF)
Large Language Model (LLM)
Innovation

Methods, ideas, or system contributions that make the work stand out.

RLHF Energy Loss
EPPO Algorithm
Learning Efficiency
Authors

Yuchun Miao
School of Computer Science, Wuhan University
Image Processing, Remote Sensing, Large Language Model, RLHF, Machine Learning

Sen Zhang
The University of Sydney

Liang Ding
The University of Sydney

Yuqi Zhang
National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University

Lefei Zhang
School of Computer Science, Wuhan University
Pattern Recognition, Machine Learning, Image Processing, Remote Sensing

D. Tao
Nanyang Technological University