Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the vulnerability of inference-time alignment to reward hacking, where generation exploits flaws in the reward model rather than improving on the intended objective. The approach combines temperature adjustment of the reference model with a sharpened logarithmic opinion pool (SLOP) over an ensemble of generative reward models, and introduces an algorithm for calibrating the SLOP weight parameters. Experimental results indicate that the framework suppresses reward hacking while largely preserving, and in some cases improving, alignment performance.

📝 Abstract

Inference-time alignment techniques offer a lightweight alternative or complement to costly reinforcement learning, while enabling continual adaptation as alignment objectives and reward targets evolve. Existing theoretical analyses justify these methods as approximations to sampling from distributions optimally tilted toward a given reward model. We extend these techniques by introducing reference-model temperature adjustment, which leads to further generalization of inference-time alignment to ensembles of generative reward models combined as a sharpened logarithmic opinion pool (SLOP). To mitigate reward hacking, we propose an algorithm for calibrating SLOP weight parameters and experimentally demonstrate that it improves robustness while preserving alignment performance.
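The abstract does not give the exact formulation, so the following is only a minimal sketch of the general idea on a finite candidate set, under stated assumptions. It assumes the target has the generic log-linear form p(y) ∝ π_ref(y)^(1/T) · ∏ᵢ pᵢ(y)^(wᵢ), where T is a reference-model temperature, each pᵢ is a reward model's probability for candidate y, and the weights wᵢ are not required to sum to one. The function names and the toy numbers are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Normalize unnormalized log-scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def tilted_distribution(ref_logprobs, reward_logprobs, weights, temperature=1.0):
    """Assumed form: p(y) ∝ pi_ref(y)^(1/T) * prod_i p_i(y)^(w_i).

    ref_logprobs:    log pi_ref(y) for each candidate y
    reward_logprobs: one list per reward model, log p_i(y) for each candidate
    weights:         one pooling weight w_i per reward model (assumption:
                     weights are free parameters and need not sum to one)
    temperature:     reference-model temperature T (T > 1 flattens pi_ref)
    """
    logits = []
    for j, ref_lp in enumerate(ref_logprobs):
        score = ref_lp / temperature  # tempered reference term
        for w, model_lp in zip(weights, reward_logprobs):
            score += w * model_lp[j]  # weighted log-opinion-pool term
        logits.append(score)
    return softmax(logits)


# Toy example with three candidates and two reward models (made-up numbers).
ref = [math.log(p) for p in (0.5, 0.3, 0.2)]
rewards = [
    [math.log(p) for p in (0.2, 0.3, 0.5)],
    [math.log(p) for p in (0.1, 0.6, 0.3)],
]
pooled = tilted_distribution(ref, rewards, weights=[1.0, 1.0], temperature=1.0)
best = max(range(len(pooled)), key=pooled.__getitem__)
```

In this toy setup, setting all weights to zero with T = 1 recovers the reference distribution, while raising T flattens the reference term so that the reward models have relatively more influence. Calibrating the weights to resist reward hacking is the paper's contribution and is not reproduced here.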

Problem

Research questions and friction points this paper is trying to address.

reward hacking

inference-time alignment

SLOP

alignment robustness

generative reward models

Innovation

Methods, ideas, or system contributions that make the work stand out.

inference-time alignment

temperature adjustment

SLOP