The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking

📅 2025-01-31

🤖 AI Summary

This work identifies an “energy loss” phenomenon in the final layer of large language models (LLMs) during Reinforcement Learning from Human Feedback (RLHF): energy loss gradually increases over the course of RL training, and an excessive increase characterizes reward hacking, in which reduced contextual relevance undermines alignment with human preferences. To address this, we propose Energy loss-aware PPO (EPPO), which penalizes the increase in final-layer energy loss during reward calculation and can be interpreted conceptually as an entropy-regularized RL algorithm. EPPO is plug-and-play and compatible with standard LLM architectures. Extensive experiments show that the energy loss phenomenon is common across diverse LLMs and tasks. EPPO mitigates reward hacking, enhances response coherence, and improves alignment with human preferences, achieving an average 12.7% increase in win rate over baselines.

📝 Abstract

This work identifies the Energy Loss Phenomenon in Reinforcement Learning from Human Feedback (RLHF) and its connection to reward hacking. Specifically, energy loss in the final layer of a Large Language Model (LLM) gradually increases during the RL process, with an excessive increase in energy loss characterizing reward hacking. Beyond empirical analysis, we further provide a theoretical foundation by proving that, under mild conditions, the increased energy loss reduces the upper bound of contextual relevance in LLMs, which is a critical aspect of reward hacking as the reduced contextual relevance typically indicates overfitting to reward model-favored patterns in RL. To address this issue, we propose an Energy loss-aware PPO algorithm (EPPO) which penalizes the increase in energy loss in the LLM's final layer during reward calculation to prevent excessive energy loss, thereby mitigating reward hacking. We theoretically show that EPPO can be conceptually interpreted as an entropy-regularized RL algorithm, which provides deeper insights into its effectiveness. Extensive experiments across various LLMs and tasks demonstrate the commonality of the energy loss phenomenon, as well as the effectiveness of EPPO in mitigating reward hacking and improving RLHF performance.
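
To make the reward-side penalty concrete, here is a minimal Python sketch of the idea. The abstract does not give the formulas, so everything specific here is an assumption for illustration: energy is taken to be the L1 norm of a hidden-state vector, energy loss is the final layer's input energy minus its output energy, the baseline is a reference (for example, pre-RL) model's energy loss on the same response, and the names `l1_energy`, `energy_loss`, `penalized_reward`, and the coefficient `eta` are hypothetical. The paper's actual definitions and penalty form may differ.

```python
from typing import Sequence


def l1_energy(hidden: Sequence[float]) -> float:
    """Assumed energy measure: the L1 norm of a hidden-state vector."""
    return sum(abs(x) for x in hidden)


def energy_loss(layer_input: Sequence[float], layer_output: Sequence[float]) -> float:
    """Assumed definition: energy entering the final layer minus energy leaving it."""
    return l1_energy(layer_input) - l1_energy(layer_output)


def penalized_reward(
    reward: float,
    policy_energy_loss: float,
    reference_energy_loss: float,
    eta: float = 0.1,  # hypothetical penalty coefficient
) -> float:
    """Subtract a penalty proportional to the increase in energy loss.

    Only an increase over the reference is penalized; a policy whose
    energy loss is at or below the reference keeps its original reward.
    """
    increase = max(0.0, policy_energy_loss - reference_energy_loss)
    return reward - eta * increase


if __name__ == "__main__":
    # Toy final-layer hidden states for one response (illustrative values only).
    h_in = [1.0, -2.0, 3.0]
    h_out = [0.5, -0.5, 1.0]
    loss = energy_loss(h_in, h_out)
    print(penalized_reward(reward=2.0, policy_energy_loss=loss,
                           reference_energy_loss=1.0, eta=0.5))
```

In a PPO loop, the penalized value would stand in for the raw reward-model score before advantages are computed, which is why a change of this kind needs no modification to the model architecture.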

Problem

Research questions and friction points this paper is trying to address.

Energy Loss

Reinforcement Learning from Human Feedback (RLHF)

Large Language Model (LLM)

Innovation

Methods, ideas, or system contributions that make the work stand out.

RLHF Energy Loss

EPPO Algorithm

Learning Efficiency
