DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models

📅 2025-05-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the problem that reward signals in reinforcement-learning-based post-training of large language models (LLMs) neglect semantic diversity, leading to inconsistent trade-offs between generation quality and diversity, this paper proposes a diversity-aware reward adjustment mechanism. The method introduces submodular mutual information (SMI) into reward computation, dynamically reweighting rewards to suppress redundant sampling and amplify rewards for high-quality, semantically diverse outputs, thereby jointly optimizing exploration and exploitation. Integrated into the GRPO framework, this yields the DRA-GRPO training paradigm. Evaluated on five mathematical reasoning benchmarks, DRA-GRPO achieves a state-of-the-art average accuracy of 58.2% using only 7,000 samples and approximately $55 in training cost, significantly improving efficiency and performance in low-resource settings. The core contribution is SMI-driven reward modeling that explicitly incorporates semantic diversity.
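The SMI-driven reweighting described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: the graph-cut instantiation of SMI, the cosine-similarity embeddings, and the exponential weighting form are all assumptions made here for concreteness.

```python
import numpy as np

def graph_cut_smi(i, embeddings):
    """Graph-cut-style submodular mutual information between completion i
    and the rest of the sampled group: the sum of its pairwise similarities.
    A high value means completion i is semantically redundant in the group."""
    sims = embeddings @ embeddings[i]      # cosine similarities (rows are unit vectors)
    return float(np.sum(sims) - sims[i])   # exclude self-similarity

def diversity_adjusted_rewards(rewards, embeddings, beta=1.0):
    """Reweight each scalar reward by a factor that decays with the
    completion's SMI to its group: redundant completions are suppressed,
    diverse ones amplified (exponential form is a hypothetical choice)."""
    n = len(rewards)
    smi = np.array([graph_cut_smi(i, embeddings) for i in range(n)])
    smi_norm = (smi - smi.mean()) / (smi.std() + 1e-8)  # center within the group
    weights = np.exp(-beta * smi_norm)                  # low SMI -> weight > 1
    return np.asarray(rewards, dtype=float) * weights

# Toy group: two near-duplicate completions and one semantically distinct one.
emb = np.array([[1.0, 0.0], [0.99, 0.141], [0.0, 1.0]])
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
adj = diversity_adjusted_rewards([1.0, 1.0, 1.0], emb)
```

In this toy case the distinct third completion ends up with the largest adjusted reward, while the two near-duplicates are downweighted, which is the qualitative behavior the summary describes.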

📝 Abstract
Recent advances in reinforcement learning for language model post-training, such as Group Relative Policy Optimization (GRPO), have shown promise in low-resource settings. However, GRPO typically relies on solution-level and scalar reward signals that fail to capture the semantic diversity among sampled completions. This leads to what we identify as a diversity-quality inconsistency, where distinct reasoning paths may receive indistinguishable rewards. To address this limitation, we propose *Diversity-aware Reward Adjustment* (DRA), a method that explicitly incorporates semantic diversity into the reward computation. DRA uses Submodular Mutual Information (SMI) to downweight redundant completions and amplify rewards for diverse ones. This encourages better exploration during learning, while maintaining stable exploitation of high-quality samples. Our method integrates seamlessly with both GRPO and its variant DR. GRPO, resulting in *DRA-GRPO* and *DGA-DR. GRPO*. We evaluate our method on five mathematical reasoning benchmarks and find that it outperforms recent strong baselines. It achieves state-of-the-art performance with an average accuracy of 58.2%, using only 7,000 fine-tuning samples and a total training cost of approximately $55. The code is available at https://github.com/xiwenc1/DRA-GRPO.
Problem

Research questions and friction points this paper is trying to address.

Addresses diversity-quality inconsistency in reward signals
Incorporates semantic diversity into reward computation
Enhances exploration while maintaining high-quality sample exploitation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diversity-aware Reward Adjustment (DRA) reweights rewards by semantic diversity.
Uses Submodular Mutual Information (SMI) to measure redundancy among sampled completions.
Integrates seamlessly with GRPO and DR. GRPO for improved performance at low cost.