InfAlign: Inference-aware language model alignment

📅 2024-12-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing alignment methods neglect the impact of inference-time decoding strategies, such as best-of-N sampling and tree search, on model behavior, leading to degraded performance under these decoding procedures. This paper proposes IAPO (Inference-Aware Preference Optimization), a framework that explicitly models the coupling between decoding dynamics and the alignment objective. IAPO reformulates inference-time win-rate optimization as a KL-regularized reinforcement learning problem under a transformed reward function, with theoretical guarantees that its optimal policy coincides with that of standard RLHF under the corrected reward. A dedicated calibrate-and-transform RL algorithm, CTRL, instantiates this recipe with specific reward transformations for best-of-N sampling and best-of-N jailbreaking. On the Anthropic helpfulness and harmlessness dialog benchmarks, IAPO achieves 8–12% and 4–9% improvements in inference-time win rate over state-of-the-art methods that ignore inference dynamics.
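The best-of-N selection rule the summary refers to can be sketched as below. Here `sample` and `reward` are stand-in callables, not the paper's actual interfaces: in practice `sample` would decode from the language model and `reward` would be a learned reward model.

```python
def best_of_n(prompt, sample, reward, n=8, jailbreak=False):
    """Draw n responses for a prompt and return the one with the highest
    reward, or the lowest reward for the best-of-N jailbreaking variant."""
    responses = [sample(prompt) for _ in range(n)]
    pick = min if jailbreak else max
    return pick(responses, key=reward)
```

As a toy illustration, with response length as the reward, `best_of_n` returns the longest of the N sampled responses, and the jailbreaking variant returns the shortest.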

📝 Abstract
Language model alignment has become a critical step in training modern generative language models. The goal of alignment is to finetune a reference model such that the win rate of a sample from the aligned model over a sample from the reference model is high, subject to a KL divergence constraint. Today, we are increasingly using inference-time algorithms (e.g., Best-of-N, controlled decoding, tree search) to decode from language models rather than standard sampling. However, the alignment objective does not capture such inference-time decoding procedures. We show that the existing alignment framework is sub-optimal in view of such inference-time methods. We then modify the alignment objective and propose a framework for inference-aware alignment (IAPO). We prove that for any inference-time decoding algorithm, the optimal solution that optimizes the inference-time win rate of the aligned policy against the reference policy is the solution to the typical RLHF problem with a transformation of the reward. This motivates us to provide the KL-regularized calibrate-and-transform RL (CTRL) algorithm to solve this problem, which involves a reward calibration step and a KL-regularized reward maximization step with a transformation of the calibrated reward. We particularize our study to two important inference-time strategies: best-of-N sampling and best-of-N jailbreaking, where N responses are sampled from the model and the one with the highest or lowest reward is selected. We propose specific transformations for these strategies and demonstrate that our framework offers significant improvements over existing state-of-the-art methods for language model alignment. Empirically, we outperform baselines that are designed without taking inference-time decoding into consideration by 8-12% and 4-9% on inference-time win rates over the Anthropic helpfulness and harmlessness dialog benchmark datasets.
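The calibrate-and-transform recipe described in the abstract can be sketched as follows. Calibration is taken here to mean mapping a raw reward to its quantile among rewards of reference-policy samples for the same prompt, and the exponential transform with temperature `t` is an illustrative assumption rather than the paper's exact choice; the transformed reward would then replace the raw reward in the usual KL-regularized RLHF objective.

```python
import math
from bisect import bisect_right

def calibrate(raw_reward, reference_rewards):
    """Quantile of raw_reward among rewards of reference-policy samples
    for the same prompt (one plausible reward calibration step)."""
    ref = sorted(reference_rewards)
    return bisect_right(ref, raw_reward) / len(ref)

def transform(calibrated, t=4.0):
    """Monotone transform of the calibrated reward; the exponential form
    and temperature t are illustrative assumptions, tuned per inference-time
    strategy (e.g., best-of-N sampling vs. best-of-N jailbreaking)."""
    return math.exp(t * calibrated)

def ctrl_reward(raw_reward, reference_rewards, t=4.0):
    """Reward that the KL-regularized reward-maximization step would
    optimize in place of the raw reward."""
    return transform(calibrate(raw_reward, reference_rewards), t)
```

Because the transform is monotone, it preserves the ordering of calibrated rewards while reshaping their spread, which is what lets the downstream KL-regularized step target the inference-time win rate rather than the raw expected reward.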
Problem

Research questions and friction points this paper is trying to address.

alignment strategy
decoding techniques
language model adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

IAPO
CTRL
language model alignment