Group-Aware Reinforcement Learning for Output Diversity in Large Language Models

📅 2025-11-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) often suffer from mode collapse, resulting in insufficient output diversity and limiting their practicality in open-ended generation tasks. To address this, we propose Group-Aware Policy Optimization (GAPO), a reinforcement learning method that models the group-level output distribution and introduces a frequency-aware diversity reward to encourage more uniform sampling within the GRPO framework. GAPO requires no supervised labels and is directly applicable to open prompting settings. Experiments across benchmarks—including GSM8K, MATH, HumanEval, and MMLU-Pro—demonstrate that GAPO maintains or improves task accuracy while significantly enhancing output diversity and coverage. Notably, GAPO achieves, for the first time, joint optimization of high accuracy and high diversity within a single training paradigm, bridging a critical gap between fidelity and expressiveness in LLM-based generation.
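To ground the GRPO framework the summary refers to: in standard GRPO, each completion's reward is normalized against the statistics of its own sampled group. The sketch below shows that group-relative advantage computation; it illustrates standard GRPO, not this paper's released code, and the function name is illustrative.

```python
# Sketch of GRPO's group-relative advantage: each reward in a sampled
# group is standardized against that group's mean and spread. This is
# the baseline mechanism GAPO extends with group-level rewards.
import statistics

def group_relative_advantages(rewards):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# Two correct (reward 1.0) and two incorrect (reward 0.0) completions:
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
# → [1.0, -1.0, 1.0, -1.0]
```

Because advantages are computed per completion from scalar rewards, vanilla GRPO cannot see group-wide properties such as how many distinct answers appeared; GAPO's contribution is to score the group as a whole before this step.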

📝 Abstract
Large Language Models (LLMs) often suffer from mode collapse, repeatedly generating the same few completions even when many valid answers exist, limiting their diversity across a wide range of tasks. We introduce Group-Aware Policy Optimization (GAPO), a simple extension of the recent and popular Group Relative Policy Optimization (GRPO) that computes rewards over the group as a whole. GAPO enables learning from group-level properties such as diversity and coverage. We demonstrate GAPO using a frequency-aware reward function that encourages uniform sampling over valid LLM completions, and show that GAPO-trained models produce valid and more diverse responses. Beyond this setup, GAPO generalizes to open-ended prompts and improves response diversity without compromising accuracy on standard LLM benchmarks (GSM8K, MATH, HumanEval, MMLU-Pro). Our code will be made publicly available.
Problem

Research questions and friction points this paper is trying to address.

Addressing mode collapse in Large Language Models
Improving output diversity across multiple tasks
Enhancing response variety without sacrificing accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Group-Aware Policy Optimization extends GRPO
Computes rewards over group-level diversity properties
Uses a frequency-aware reward to encourage uniform sampling over valid completions
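The frequency-aware reward named above can be sketched as follows. This is a hypothetical reconstruction under stated assumptions (the function name, the inverse-frequency weighting, and the zero reward for invalid outputs are illustrative choices, not the paper's released code): each completion in a sampled group is rewarded in inverse proportion to how often it appears, so duplicated modes are down-weighted and rare valid answers are favored.

```python
# Hypothetical frequency-aware group reward in the spirit of GAPO:
# reward a completion by the inverse of its count within its own
# sampled group, so the policy is pushed toward uniform sampling
# over valid completions. Details are assumptions, not the paper's code.
from collections import Counter

def frequency_aware_rewards(completions, is_valid):
    counts = Counter(completions)  # group-level statistic GRPO alone cannot see
    rewards = []
    for c in completions:
        if not is_valid(c):
            rewards.append(0.0)            # invalid outputs earn nothing
        else:
            rewards.append(1.0 / counts[c])  # seen k times → reward 1/k
    return rewards

# "A" dominates the group, so each copy is down-weighted to 1/3:
group = ["A", "A", "A", "B", "C"]
print(frequency_aware_rewards(group, lambda c: True))
```

Feeding these rewards into GRPO's usual group-relative normalization then yields positive advantages for under-sampled valid answers and negative ones for over-sampled modes, which is how the method targets uniformity without a supervised diversity label.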