🤖 AI Summary
This paper addresses the misalignment between training and inference objectives in language models by proposing a reinforcement learning framework that optimizes inference-time performance directly. Methodologically, it integrates multi-sample inference-time objectives such as pass@k and majority voting into the training pipeline, using Proximal Policy Optimization (PPO) with sample-based gradient estimation to make these otherwise non-differentiable objectives trainable. Unlike conventional training paradigms that optimize only per-token likelihood, this approach keeps the training objective consistent with the model's actual inference behavior. Empirical evaluation demonstrates substantial pass@k improvements on code generation tasks and validates controllable trade-offs between accuracy and inference efficiency across diverse reasoning benchmarks. The core contributions are: (1) end-to-end optimization of inference-time multi-sample objectives during training; and (2) a direct linkage between training objectives and real-world deployment efficacy.
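The key difficulty the summary alludes to is that a group-level objective like pass@k (success if any of k samples is correct) is not differentiable in the samples. A minimal sketch of one way to handle this, using a REINFORCE-style estimator with a leave-one-out baseline (function names and the baseline choice are illustrative assumptions, not taken from the paper):

```python
def pass_at_k_reward(correct_flags):
    """Group-level pass@k reward: 1 if any of the k samples is correct."""
    return 1.0 if any(correct_flags) else 0.0

def per_sample_advantages(correct_flags):
    """Per-sample advantages for a sample-based (REINFORCE-style) gradient
    estimate of the group objective. Each sample's advantage is the group
    reward minus a leave-one-out baseline, so only samples that actually
    change the group outcome receive credit. (Hypothetical sketch; the
    paper's exact estimator and baseline may differ.)"""
    k = len(correct_flags)
    r = pass_at_k_reward(correct_flags)
    advantages = []
    for i in range(k):
        rest = correct_flags[:i] + correct_flags[i + 1:]
        baseline = pass_at_k_reward(rest) if rest else 0.0
        advantages.append(r - baseline)
    return advantages
```

With k = 3 samples where only the first is correct, `per_sample_advantages([True, False, False])` yields `[1.0, 0.0, 0.0]`: the single correct sample carries the full credit for the group's success, while the incorrect samples contribute no gradient signal.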
📝 Abstract
In this work, we investigate the merits of explicitly optimizing for inference-time algorithmic performance during model training. We show how optimizing for inference-time performance can improve overall model efficacy. We consider generic inference-time objectives with $k$ samples, with a focus on pass@$k$ and majority voting as two main applications. Training language models on reasoning datasets, we showcase the performance trade-offs enabled by training with such objectives. On code generation tasks, we show that the approach significantly improves pass@$k$ objectives compared to the baseline method.
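For reference, the pass@$k$ metric the abstract targets is usually evaluated with the standard unbiased combinatorial estimator: draw $n \ge k$ samples, count $c$ correct ones, and compute $1 - \binom{n-c}{k}/\binom{n}{k}$. A minimal implementation (this is the widely used estimator, not code from the paper):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n sampled completions, c of which
    are correct: the probability that at least one of k samples drawn
    without replacement from the n is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with $n = 4$ samples and $c = 1$ correct, `pass_at_k(4, 1, 1)` gives $1 - \binom{3}{1}/\binom{4}{1} = 0.25$, matching the intuition that a single draw succeeds a quarter of the time.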