🤖 AI Summary
Current large language model (LLM) post-training employs risk-neutral objectives—maximizing expected reward—while evaluation predominantly relies on risk-seeking metrics such as Pass@k and Max@k, leading to objective–evaluation misalignment. To address this, we propose Risk-Seeking Policy Optimization (RSPO), the first method to explicitly model Pass@k and Max@k as differentiable, closed-form probabilistic objectives. RSPO eliminates gradient interference from low-reward responses via nested gradient computation over multiple sampled responses and explicit modeling of the maximum-response probability. Theoretically, we prove its convergence under standard assumptions. Empirically, RSPO achieves significant and consistent improvements in Pass@k and Max@k across multiple code and mathematical reasoning benchmarks, demonstrating both effectiveness and robustness. Our approach bridges the gap between training objectives and risk-sensitive evaluation, enabling unbiased and efficient policy optimization tailored to real-world LLM deployment criteria.
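For concreteness, both evaluation metrics can be estimated without bias from n sampled responses per prompt. The sketch below (illustrative code, not from the paper) uses the standard combinatorial Pass@k estimator together with an analogous without-replacement estimator for Max@k; the function names are our own.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k
    responses drawn without replacement from n generations, of which
    c are correct, is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct response
    return 1.0 - comb(n - c, k) / comb(n, k)

def max_at_k(rewards: list[float], k: int) -> float:
    """Unbiased Max@k estimator: expected maximum reward over a size-k
    subset drawn without replacement from the n observed rewards."""
    n = len(rewards)
    r = sorted(rewards)
    # A k-subset has maximum r[i] iff it contains r[i] plus k-1 of the
    # i smaller elements, giving comb(i, k-1) such subsets (0-indexed i).
    return sum(r[i] * comb(i, k - 1) for i in range(n)) / comb(n, k)
```

For k = 1 both estimators reduce to the risk-neutral quantities (success rate and mean reward), which makes the risk-seeking gap for k > 1 explicit.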
📝 Abstract
Current large language model post-training optimizes a risk-neutral objective that maximizes expected reward, yet evaluation relies heavily on risk-seeking metrics like Pass@k (at least one success in k trials) and Max@k (maximum reward across k responses). This mismatch in risk preferences inevitably leads to suboptimal performance. To bridge this gap, we propose Risk-Seeking Policy Optimization (RSPO), a novel method that directly targets Pass@k and Max@k during training. A key challenge in optimizing these metrics is the "hitchhiking" problem: low-reward responses are inadvertently reinforced when they co-occur with a high-reward response within a sample of k generations, resulting in inefficient optimization. RSPO addresses this problem by leveraging the closed-form probability that a given response is the maximum among k samples. Despite the complexity of nested gradients over multiple responses, RSPO yields efficient, unbiased gradient estimators for both metrics. We validate our approach with both rigorous theoretical analysis and comprehensive experimental results.
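To illustrate the hitchhiking fix, here is one plausible sketch (our own illustration, not RSPO's exact estimator): for continuous rewards, a response with reward r is the maximum of k i.i.d. draws with probability proportional to F(r)^(k-1), where F is the reward CDF under the policy. Weighting each sampled response by this quantity, approximated here with the empirical CDF of the batch, drives the credit assigned to low-reward hitchhikers toward zero instead of reinforcing every response in a successful batch.

```python
def max_weights(rewards: list[float]) -> list[float]:
    """Illustrative weighting: each of the k sampled responses is weighted
    by the (unnormalized) probability that its reward is the maximum of k
    i.i.d. draws, F(r)**(k-1), with F the empirical CDF of the batch.
    Low-reward responses receive near-zero weight, so they are not
    reinforced merely for co-occurring with a high-reward response."""
    k = len(rewards)

    def F(r: float) -> float:
        # empirical CDF: fraction of batch rewards at or below r
        return sum(x <= r for x in rewards) / k

    w = [F(r) ** (k - 1) for r in rewards]
    s = sum(w)
    return [x / s for x in w]  # normalize to a credit distribution
```

A uniform weighting of all k responses would recover the hitchhiking behavior the abstract describes; the max-probability weighting concentrates credit on the responses that actually determine Pass@k and Max@k.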