🤖 AI Summary
This paper addresses the misalignment between training and inference objectives in language models by proposing a reinforcement learning framework that optimizes inference-time performance directly. Methodologically, it integrates multi-sample inference-time objectives such as pass@k and majority voting into the training pipeline, using Proximal Policy Optimization (PPO) with sample-based gradient estimation to make these otherwise non-differentiable objectives trainable. Unlike conventional training paradigms that optimize only per-token likelihood, this approach keeps the training objective consistent with the model's actual inference behavior. Empirical evaluation demonstrates substantial pass@k improvements on code generation tasks and validates controllable trade-offs between accuracy and inference efficiency across diverse reasoning benchmarks. The core contributions are: (1) end-to-end optimization of inference-time multi-sample objectives during training; and (2) a direct linkage between training objectives and real-world deployment efficacy.
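The key difficulty the summary alludes to is that a group-level objective like pass@k (success if any of k samples is correct) is not differentiable in the samples. A minimal sketch of one way to handle this, using a REINFORCE-style estimator with a leave-one-out baseline (function names and the baseline choice are illustrative assumptions, not taken from the paper):

```python
def pass_at_k_reward(correct_flags):
    """Group-level pass@k reward: 1 if any of the k samples is correct."""
    return 1.0 if any(correct_flags) else 0.0

def per_sample_advantages(correct_flags):
    """Per-sample advantages for a sample-based (REINFORCE-style) gradient
    estimate of the group objective. Each sample's advantage is the group
    reward minus a leave-one-out baseline, so only samples that actually
    change the group outcome receive credit. (Hypothetical sketch; the
    paper's exact estimator and baseline may differ.)"""
    k = len(correct_flags)
    r = pass_at_k_reward(correct_flags)
    advantages = []
    for i in range(k):
        rest = correct_flags[:i] + correct_flags[i + 1:]
        baseline = pass_at_k_reward(rest) if rest else 0.0
        advantages.append(r - baseline)
    return advantages
```

With k = 3 samples where only the first is correct, `per_sample_advantages([True, False, False])` yields `[1.0, 0.0, 0.0]`: the single correct sample carries the full credit for the group's success, while the incorrect samples contribute no gradient signal.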
📝 Abstract
In this work, we investigate the merits of explicitly optimizing for inference-time algorithmic performance during model training. We show how optimizing for inference-time performance can improve overall model efficacy. We consider generic inference-time objectives with $k$ samples, with a focus on pass@$k$ and majority voting as two main applications. Training language models on reasoning datasets, we showcase the performance trade-offs enabled by training with such objectives. On code generation tasks, we show that the approach significantly improves pass@$k$ objectives compared to the baseline method.
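For reference, the pass@$k$ metric the abstract targets is usually evaluated with the standard unbiased combinatorial estimator: draw $n \ge k$ samples, count $c$ correct ones, and compute $1 - \binom{n-c}{k}/\binom{n}{k}$. A minimal implementation (this is the widely used estimator, not code from the paper):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n sampled completions, c of which
    are correct: the probability that at least one of k samples drawn
    without replacement from the n is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with $n = 4$ samples and $c = 1$ correct, `pass_at_k(4, 1, 1)` gives $1 - \binom{3}{1}/\binom{4}{1} = 0.25$, matching the intuition that a single draw succeeds a quarter of the time.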