SetPO: Set-Level Policy Optimization for Diversity-Preserving LLM Reasoning

📅 2026-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the tendency of large language models to sacrifice output diversity during reinforcement learning–based optimization of reasoning capabilities, which often leads to a collapse of the solution space. To mitigate this, the authors propose a set-level policy optimization method that, for the first time, integrates trajectory diversity directly into the policy gradient framework. Diversity among reasoning trajectories is measured via kernelized similarity, and a leave-one-out estimate of each trajectory’s marginal contribution to global diversity is incorporated as a shaping term in the advantage function. Theoretical analysis shows that rarer trajectories contribute more significantly to overall diversity. The approach is plug-and-play and model-agnostic, consistently outperforming strong baselines across multiple mathematical reasoning benchmarks with models of varying scales, while simultaneously improving both Pass@1 and Pass@K metrics.
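
To make the mechanism concrete, below is a minimal sketch of the core computation: kernelized pairwise similarity between trajectory embeddings, a set-level diversity score, and each trajectory's leave-one-out marginal contribution. The RBF kernel, the negative-mean-similarity diversity functional, and all function names here are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def rbf_kernel(embeddings: np.ndarray, bandwidth: float = 1.0) -> np.ndarray:
    # Pairwise RBF similarity between trajectory embeddings (one row per trajectory).
    # Uses ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2; clamp tiny negatives from rounding.
    sq_norms = np.sum(embeddings ** 2, axis=1)
    sq_dists = sq_norms[:, None] - 2.0 * embeddings @ embeddings.T + sq_norms[None, :]
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))

def set_diversity(K: np.ndarray) -> float:
    # One simple set-level diversity score (an assumption, not the paper's choice):
    # negative mean off-diagonal similarity, so sets of near-duplicates score low.
    n = K.shape[0]
    if n < 2:
        return 0.0
    return -(K.sum() - np.trace(K)) / (n * (n - 1))

def leave_one_out_contributions(K: np.ndarray) -> np.ndarray:
    # Marginal contribution of trajectory i: D(S) - D(S \ {i}).
    # A trajectory dissimilar to the rest (a rare one) raises set diversity,
    # so removing it lowers the score and its contribution comes out larger.
    n = K.shape[0]
    full = set_diversity(K)
    contribs = np.empty(n)
    for i in range(n):
        keep = [j for j in range(n) if j != i]
        contribs[i] = full - set_diversity(K[np.ix_(keep, keep)])
    return contribs

# Toy usage: 4 sampled trajectories embedded in 8 dimensions.
embs = np.random.default_rng(0).normal(size=(4, 8))
contribs = leave_one_out_contributions(rbf_kernel(embs))
```
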

📝 Abstract
Reinforcement learning with verifiable rewards has shown notable effectiveness in enhancing the reasoning performance of large language models (LLMs), especially on mathematics tasks. However, such improvements often come with reduced outcome diversity, where the model concentrates probability mass on a narrow set of solutions. Motivated by diminishing-returns principles, we introduce a set-level diversity objective defined over sampled trajectories using kernelized similarity. Our approach derives a leave-one-out marginal contribution for each sampled trajectory and integrates this objective as a plug-in advantage-shaping term for policy optimization. We further investigate the contribution of a single trajectory to language-model diversity within a distribution-perturbation framework. This analysis establishes a monotonicity property: rarer trajectories yield consistently higher marginal contributions to global diversity. Extensive experiments across a range of model scales demonstrate the effectiveness of the proposed algorithm, which consistently outperforms strong baselines in both Pass@1 and Pass@K across various benchmarks.
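
The plug-in advantage shaping described in the abstract can be sketched as follows, assuming a group-normalized (GRPO-style) reward advantage as the base estimator and a single weighting coefficient `beta`; both choices are assumptions for illustration rather than the paper's exact formulation.

```python
import numpy as np

def shaped_advantages(rewards: np.ndarray,
                      diversity_contribs: np.ndarray,
                      beta: float = 0.1) -> np.ndarray:
    # Base advantage: group-normalized verifiable rewards (GRPO-style; assumed).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Shaping term: normalized leave-one-out diversity contributions,
    # weighted by a hypothetical coefficient beta.
    div = (diversity_contribs - diversity_contribs.mean()) / (diversity_contribs.std() + 1e-8)
    return adv + beta * div

# Toy usage: a group of 4 trajectories with 0/1 verifiable rewards and
# diversity contributions as produced by a leave-one-out estimate.
rewards = np.array([1.0, 1.0, 0.0, 1.0])
contribs = np.array([0.02, -0.01, 0.05, -0.01])
print(shaped_advantages(rewards, contribs))
```

Under this shaping, two equally correct trajectories receive different advantages when one follows a rarer reasoning path, which is consistent with the abstract's claim of improving both Pass@1 and Pass@K.
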
Problem

Research questions and friction points this paper is trying to address.

diversity preservation
large language models
reinforcement learning
reasoning
outcome diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Set-Level Policy Optimization
Diversity-Preserving RL
Kernelized Similarity
Marginal Contribution
Trajectory Diversity
Chenyi Li
Peking University
Yuan Zhang
JD Explore Academy, China
Bo Wang
Beijing Institute of Technology
Guoqing Ma
JD Explore Academy, China
Wei Tang
JD Explore Academy, China
Haoyang Huang
JD Explore Academy (present) | StepFun | Microsoft Research
Multimodal & Multilingual Foundation Model
Nan Duan
JD.Com (present) | StepFun | Microsoft Research
NLP | Artificial General Intelligence