SetPO: Set-Level Policy Optimization for Diversity-Preserving LLM Reasoning

📅 2026-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the tendency of large language models to sacrifice output diversity during reinforcement learning–based optimization of reasoning capabilities, which often leads to a collapse of the solution space. To mitigate this, the authors propose a set-level policy optimization method that, for the first time, integrates trajectory diversity directly into the policy gradient framework. Diversity among reasoning trajectories is measured via kernelized similarity, and a leave-one-out estimate of each trajectory’s marginal contribution to global diversity is incorporated as a shaping term in the advantage function. Theoretical analysis shows that rarer trajectories contribute more significantly to overall diversity. The approach is plug-and-play and model-agnostic, consistently outperforming strong baselines across multiple mathematical reasoning benchmarks with models of varying scales, while simultaneously improving both Pass@1 and Pass@K metrics.
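
To make the mechanism concrete, below is a minimal sketch of the core computation: kernelized pairwise similarity between trajectory embeddings, a set-level diversity score, and each trajectory's leave-one-out marginal contribution. The RBF kernel, the negative-mean-similarity diversity functional, and all function names here are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def rbf_kernel(embeddings: np.ndarray, bandwidth: float = 1.0) -> np.ndarray:
    # Pairwise RBF similarity between trajectory embeddings (one row per trajectory).
    # Uses ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2; clamp tiny negatives from rounding.
    sq_norms = np.sum(embeddings ** 2, axis=1)
    sq_dists = sq_norms[:, None] - 2.0 * embeddings @ embeddings.T + sq_norms[None, :]
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))

def set_diversity(K: np.ndarray) -> float:
    # One simple set-level diversity score (an assumption, not the paper's choice):
    # negative mean off-diagonal similarity, so sets of near-duplicates score low.
    n = K.shape[0]
    if n < 2:
        return 0.0
    return -(K.sum() - np.trace(K)) / (n * (n - 1))

def leave_one_out_contributions(K: np.ndarray) -> np.ndarray:
    # Marginal contribution of trajectory i: D(S) - D(S \ {i}).
    # A trajectory dissimilar to the rest (a rare one) raises set diversity,
    # so removing it lowers the score and its contribution comes out larger.
    n = K.shape[0]
    full = set_diversity(K)
    contribs = np.empty(n)
    for i in range(n):
        keep = [j for j in range(n) if j != i]
        contribs[i] = full - set_diversity(K[np.ix_(keep, keep)])
    return contribs

# Toy usage: 4 sampled trajectories embedded in 8 dimensions.
embs = np.random.default_rng(0).normal(size=(4, 8))
contribs = leave_one_out_contributions(rbf_kernel(embs))
```
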

📝 Abstract
Reinforcement learning with verifiable rewards has shown notable effectiveness in enhancing the reasoning performance of large language models (LLMs), especially on mathematics tasks. However, such improvements often come with reduced outcome diversity, where the model concentrates probability mass on a narrow set of solutions. Motivated by diminishing-returns principles, we introduce a set-level diversity objective defined over sampled trajectories using kernelized similarity. Our approach derives a leave-one-out marginal contribution for each sampled trajectory and integrates this objective as a plug-in advantage-shaping term for policy optimization. We further investigate the contribution of a single trajectory to language-model diversity within a distribution-perturbation framework. This analysis establishes a monotonicity property: rarer trajectories yield consistently higher marginal contributions to global diversity. Extensive experiments across a range of model scales demonstrate the effectiveness of the proposed algorithm, which consistently outperforms strong baselines in both Pass@1 and Pass@K across various benchmarks.
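
The plug-in advantage shaping described in the abstract can be sketched as follows, assuming a group-normalized (GRPO-style) reward advantage as the base estimator and a single weighting coefficient `beta`; both choices are assumptions for illustration rather than the paper's exact formulation.

```python
import numpy as np

def shaped_advantages(rewards: np.ndarray,
                      diversity_contribs: np.ndarray,
                      beta: float = 0.1) -> np.ndarray:
    # Base advantage: group-normalized verifiable rewards (GRPO-style; assumed).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Shaping term: normalized leave-one-out diversity contributions,
    # weighted by a hypothetical coefficient beta.
    div = (diversity_contribs - diversity_contribs.mean()) / (diversity_contribs.std() + 1e-8)
    return adv + beta * div

# Toy usage: a group of 4 trajectories with 0/1 verifiable rewards and
# diversity contributions as produced by a leave-one-out estimate.
rewards = np.array([1.0, 1.0, 0.0, 1.0])
contribs = np.array([0.02, -0.01, 0.05, -0.01])
print(shaped_advantages(rewards, contribs))
```

Under this shaping, two equally correct trajectories receive different advantages when one follows a rarer reasoning path, which is consistent with the abstract's claim of improving both Pass@1 and Pass@K.
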
Problem

Research questions and friction points this paper is trying to address.

diversity preservation
large language models
reinforcement learning
reasoning
outcome diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Set-Level Policy Optimization
Diversity-Preserving RL
Kernelized Similarity
Marginal Contribution
Trajectory Diversity
Chenyi Li
Peking University
Yuan Zhang
JD Explore Academy, China
Bo Wang
Beijing Institute of Technology
Guoqing Ma
JD Explore Academy, China
Wei Tang
JD Explore Academy, China
Haoyang Huang
JD Explore Academy (present) | StepFun | Microsoft Research
Multimodal & Multilingual Foundation Model
Nan Duan
JD.Com (present) | StepFun | Microsoft Research
NLP | Artificial General Intelligence