🤖 AI Summary
This work addresses the reliance of large language and multimodal model (LLM/LMM) alignment on reward models (RMs) that require extensive human annotation and separate training. We propose Activation RMs: a novel approach that extracts preference signals directly from pretrained LLMs/LMMs via neural activation steering, enabling effective reward modeling from few-shot supervision with no additional fine-tuning. Our key contribution is the first integration of activation steering into reward modeling, yielding a plug-and-play, training-free RM mechanism. We further introduce PreferenceHack, the first paired preference benchmark explicitly designed to evaluate robustness to reward hacking. Experiments demonstrate that Activation RMs achieve state-of-the-art performance on standard few-shot RM benchmarks and outperform GPT-4o on PreferenceHack, significantly mitigating reward hacking behaviors.
📝 Abstract
Aligning Large Language Models (LLMs) and Large Multimodal Models (LMMs) to human preferences is a central challenge in improving the quality of the models' generative outputs for real-world applications. A common approach is to use reward modeling to encode preferences, enabling alignment via post-training using reinforcement learning. However, traditional reward modeling is not easily adaptable to new preferences because it requires training a separate reward model, commonly on large preference datasets. To address this, we introduce Activation Reward Models (Activation RMs) -- a novel few-shot reward modeling method that leverages activation steering to construct well-aligned reward signals using minimal supervision and no additional model finetuning. Activation RMs outperform existing few-shot reward modeling approaches such as LLM-as-a-judge with in-context learning, voting-based scoring, and token probability scoring on standard reward modeling benchmarks. Furthermore, we demonstrate the effectiveness of Activation RMs in mitigating reward hacking behaviors, highlighting their utility for safety-critical applications. Toward this end, we propose PreferenceHack, a novel benchmark in the few-shot setting, the first to test reward models on reward hacking in a paired preference format. Finally, we show that Activation RMs achieve state-of-the-art performance on this benchmark, surpassing even GPT-4o.
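To make the activation-steering idea concrete, here is a minimal, hypothetical sketch (not the paper's actual implementation, whose details are not given in this excerpt): from a handful of preferred vs. rejected examples, derive a "preference direction" in activation space as the difference of class means, then score new responses by projecting their activations onto that direction. The layer choice, activation extraction, and scoring rule here are all illustrative assumptions, with synthetic activations standing in for a real model's hidden states.

```python
# Illustrative sketch of an activation-based reward signal (assumed setup,
# not the paper's implementation): derive a preference direction from
# few-shot contrastive examples and score responses by projection.
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden size; a real LLM layer would be much larger

# Few-shot supervision: hidden activations a frozen model would produce
# for preferred vs. rejected responses (synthetic stand-ins here).
preferred = rng.normal(loc=0.5, scale=1.0, size=(4, d))
rejected = rng.normal(loc=-0.5, scale=1.0, size=(4, d))

# Preference direction: normalized difference of class means.
direction = preferred.mean(axis=0) - rejected.mean(axis=0)
direction /= np.linalg.norm(direction)

def reward(activation: np.ndarray) -> float:
    """Score a response by projecting its activation onto the direction."""
    return float(activation @ direction)

# By construction, the preferred-class mean projects higher than the
# rejected-class mean (their gap equals the norm of the mean difference).
gap = reward(preferred.mean(axis=0)) - reward(rejected.mean(axis=0))
print(gap > 0)
```

No model weights are updated at any point, which is the sense in which such a reward signal is training-free and plug-and-play.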