🤖 AI Summary
Current large language model (LLM) post-training employs risk-neutral objectives—maximizing expected reward—while evaluation predominantly relies on risk-seeking metrics such as Pass@k and Max@k, leading to objective–evaluation misalignment. To address this, we propose Risk-Seeking Policy Optimization (RSPO), the first method to explicitly model Pass@k and Max@k as differentiable, closed-form probabilistic objectives. RSPO eliminates gradient interference from low-reward responses via nested gradient computation over multiple sampled responses and explicit modeling of the maximum-response probability. Theoretically, we prove its convergence under standard assumptions. Empirically, RSPO achieves significant and consistent improvements in Pass@k and Max@k across multiple code and mathematical reasoning benchmarks, demonstrating both effectiveness and robustness. Our approach bridges the gap between training objectives and risk-sensitive evaluation, enabling unbiased and efficient policy optimization tailored to real-world LLM deployment criteria.
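For concreteness, both evaluation metrics can be estimated without bias from n sampled responses per prompt. The sketch below (illustrative code, not from the paper) uses the standard combinatorial Pass@k estimator together with an analogous without-replacement estimator for Max@k; the function names are our own.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k
    responses drawn without replacement from n generations, of which
    c are correct, is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct response
    return 1.0 - comb(n - c, k) / comb(n, k)

def max_at_k(rewards: list[float], k: int) -> float:
    """Unbiased Max@k estimator: expected maximum reward over a size-k
    subset drawn without replacement from the n observed rewards."""
    n = len(rewards)
    r = sorted(rewards)
    # A k-subset has maximum r[i] iff it contains r[i] plus k-1 of the
    # i smaller elements, giving comb(i, k-1) such subsets (0-indexed i).
    return sum(r[i] * comb(i, k - 1) for i in range(n)) / comb(n, k)
```

For k = 1 both estimators reduce to the risk-neutral quantities (success rate and mean reward), which makes the risk-seeking gap for k > 1 explicit.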
📝 Abstract
Current large language model post-training optimizes a risk-neutral objective that maximizes expected reward, yet evaluation relies heavily on risk-seeking metrics like Pass@k (at least one success in k trials) and Max@k (maximum reward across k responses). This mismatch in risk preferences inevitably leads to suboptimal performance. To bridge this gap, we propose Risk-Seeking Policy Optimization (RSPO), a novel method that directly targets Pass@k and Max@k during training. A key challenge in optimizing these metrics is the "hitchhiking" problem: low-reward responses are inadvertently reinforced when they co-occur with a high-reward response within a sample of k generations, resulting in inefficient optimization. RSPO addresses this problem by leveraging the closed-form probability that a given response is the maximum among k samples. Despite the complexity of nested gradients over multiple responses, RSPO yields efficient, unbiased gradient estimators for both metrics. We validate our approach with both rigorous theoretical analysis and comprehensive experimental results.
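To illustrate the hitchhiking fix, here is one plausible sketch (our own illustration, not RSPO's exact estimator): for continuous rewards, a response with reward r is the maximum of k i.i.d. draws with probability proportional to F(r)^(k-1), where F is the reward CDF under the policy. Weighting each sampled response by this quantity, approximated here with the empirical CDF of the batch, drives the credit assigned to low-reward hitchhikers toward zero instead of reinforcing every response in a successful batch.

```python
def max_weights(rewards: list[float]) -> list[float]:
    """Illustrative weighting: each of the k sampled responses is weighted
    by the (unnormalized) probability that its reward is the maximum of k
    i.i.d. draws, F(r)**(k-1), with F the empirical CDF of the batch.
    Low-reward responses receive near-zero weight, so they are not
    reinforced merely for co-occurring with a high-reward response."""
    k = len(rewards)

    def F(r: float) -> float:
        # empirical CDF: fraction of batch rewards at or below r
        return sum(x <= r for x in rewards) / k

    w = [F(r) ** (k - 1) for r in rewards]
    s = sum(w)
    return [x / s for x in w]  # normalize to a credit distribution
```

A uniform weighting of all k responses would recover the hitchhiking behavior the abstract describes; the max-probability weighting concentrates credit on the responses that actually determine Pass@k and Max@k.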