RSPO: Risk-Seeking Policy Optimization for Pass@k and Max@k Metrics in Large Language Models

📅 2025-08-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language model (LLM) post-training employs risk-neutral objectives—maximizing expected reward—while evaluation predominantly relies on risk-seeking metrics such as Pass@k and Max@k, leading to objective–evaluation misalignment. To address this, we propose Risk-Seeking Policy Optimization (RSPO), the first method to explicitly model Pass@k and Max@k as differentiable, closed-form probabilistic objectives. RSPO eliminates gradient interference from low-reward responses via nested gradient computation over multiple sampled responses and explicit modeling of the maximum-response probability. Theoretically, we prove its convergence under standard assumptions. Empirically, RSPO achieves significant and consistent improvements in Pass@k and Max@k across multiple code and mathematical reasoning benchmarks, demonstrating both effectiveness and robustness. Our approach bridges the gap between training objectives and risk-sensitive evaluation, enabling unbiased and efficient policy optimization tailored to real-world LLM deployment criteria.

📝 Abstract
Current large language model post-training optimizes a risk-neutral objective that maximizes expected reward, yet evaluation relies heavily on risk-seeking metrics like Pass@k (at least one success in k trials) and Max@k (maximum reward across k responses). This mismatch in risk preferences can inevitably lead to suboptimal performance. To bridge this gap, we propose Risk-Seeking Policy Optimization (RSPO), a novel method that directly targets Pass@k and Max@k during training. A key challenge in optimizing these metrics is the "hitchhiking" problem: low-reward responses are inadvertently reinforced if they co-occur with a high-reward response within a sample of k generations, resulting in inefficient optimization. RSPO addresses this problem by leveraging the closed-form probability that a given response is the maximum among k samples. Despite the complexity of nested gradients over multiple responses, RSPO produces efficient, unbiased gradient estimators for both metrics. We validate our approach with both rigorous theoretical analysis and comprehensive experimental results.
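For reference, the Pass@k metric the paper targets is typically estimated from n generated samples with c observed successes using the standard unbiased combinatorial estimator (this is the widely used form from Chen et al., 2021, not code from this paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: the probability that at least one of k
    draws (without replacement) from n samples containing c successes
    is a success, i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 4 samples of which c = 2 pass, the estimated Pass@2 is 1 - C(2,2)/C(4,2) = 5/6.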
Problem

Research questions and friction points this paper is trying to address.

Mismatch between risk-neutral training and risk-seeking evaluation metrics
Hitchhiking problem in optimizing Pass@k and Max@k metrics
Need for efficient unbiased gradient estimators for risk-seeking objectives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimizes Pass@k and Max@k metrics directly
Addresses hitchhiking with closed-form probability
Efficient unbiased gradient estimators for metrics
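The intuition behind the closed-form max-probability can be sketched as follows (the function name and empirical-CDF weighting below are my own illustrative assumptions, not the paper's estimator): if each sampled response's reward r_i is weighted by the probability that it would be the maximum among k i.i.d. draws, roughly F(r_i)^(k-1) with F the reward CDF, then low-reward "hitchhikers" receive near-zero weight instead of sharing credit with the best response.

```python
def max_prob_weights(rewards: list[float], k: int) -> list[float]:
    """Weight each sampled response by an empirical estimate of
    P(its reward is the max among k i.i.d. draws) ~= F(r_i) ** (k - 1),
    where F is the empirical CDF over the batch. Illustrative sketch only."""
    n = len(rewards)
    weights = []
    for r in rewards:
        cdf = sum(1 for x in rewards if x <= r) / n  # empirical F(r)
        weights.append(cdf ** (k - 1))
    return weights
```

For a batch with rewards [0.1, 0.5, 0.9] and k = 3, the best response gets weight 1 while the worst gets (1/3)^2 = 1/9, so its gradient contribution is heavily suppressed.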
Kaichen Zhang
Hong Kong University of Science and Technology (Guangzhou)
Shenghao Gao
Hong Kong University of Science and Technology (Guangzhou)
Yuzhong Hong
Zuoyebang Education Technology
Haipeng Sun
Zuoyebang Education Technology
Junwei Bao
zuoyebang.com // JD.com // MSRA
NLP, LLM, QA+Dialog, Generation
Hongfei Jiang
Zuoyebang Education Technology
Yang Song
Zuoyebang Education Technology
Dingqian Hong
Zuoyebang Education Technology
Hui Xiong
Senior Scientist, Candela Corporation
Ultrafast dynamics, atomic molecular physics, free electron laser