🤖 AI Summary
This work addresses the vulnerability of inference-time alignment to reward hacking by proposing a more robust alignment framework. The approach combines temperature adjustment of the reference model with a sharpened logarithmic opinion pool (SLOP) over an ensemble of generative reward models, and introduces an algorithm for calibrating the SLOP weight parameters. Experimental results indicate that the framework mitigates reward hacking while preserving, and in some cases even enhancing, alignment performance, offering a more reliable approach to inference-time alignment.
📝 Abstract
Inference-time alignment techniques offer a lightweight alternative or complement to costly reinforcement learning, while enabling continual adaptation as alignment objectives and reward targets evolve. Existing theoretical analyses justify these methods as approximations to sampling from distributions optimally tilted toward a given reward model. We extend these techniques by introducing reference-model temperature adjustment, which leads to further generalization of inference-time alignment to ensembles of generative reward models combined as a sharpened logarithmic opinion pool (SLOP). To mitigate reward hacking, we propose an algorithm for calibrating SLOP weight parameters and experimentally demonstrate that it improves robustness while preserving alignment performance.
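To make the pooling idea concrete, the sketch below computes a weighted logarithmic opinion pool over a small, finite set of candidate responses. The abstract does not state the formula, so this is only one plausible reading: the pooled distribution is taken to be proportional to the product of each model's probability raised to its weight, the reference model's weight is treated as an inverse temperature `1 / T`, and "sharpened" is assumed to mean the weights are not constrained to sum to one. The function name `log_pool` and all toy probabilities are hypothetical.

```python
import math


def log_pool(log_probs, weights):
    """Weighted logarithmic opinion pool over a finite candidate set.

    log_probs: one list per model, each holding a log-probability per candidate.
    weights:   one non-negative weight per model; they are NOT required to
               sum to 1 (assumed here to be what "sharpened" refers to).
    Returns the normalized pooled distribution as a list of probabilities.
    """
    n = len(log_probs[0])
    # Pooled log-score of each candidate: sum_i w_i * log p_i(y).
    scores = [sum(w * lp[i] for w, lp in zip(weights, log_probs)) for i in range(n)]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores)
    unnorm = [math.exp(s - m) for s in scores]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Toy distributions over three candidate responses (made-up numbers).
ref = [math.log(p) for p in (0.5, 0.3, 0.2)]    # reference model
rm_a = [math.log(p) for p in (0.2, 0.5, 0.3)]   # generative reward model A
rm_b = [math.log(p) for p in (0.1, 0.3, 0.6)]   # generative reward model B

T = 1.0  # reference-model temperature (assumed to enter as the weight 1 / T)
plain = log_pool([ref, rm_a, rm_b], [1.0 / T, 0.5, 0.5])
sharp = log_pool([ref, rm_a, rm_b], [1.0 / T, 2.0, 2.0])
```

In this toy setup, increasing the reward-model weights shifts probability mass away from the candidate the reference model favors and toward the candidates the reward models favor. A calibration procedure such as the one the abstract describes would choose these weights rather than fix them by hand; how it does so is not specified here.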