Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses mode collapse in on-policy reinforcement learning methods such as GRPO, where the mode-seeking behavior of reverse KL minimization causes premature convergence to a single high-reward solution at the expense of solution diversity. To mitigate this, the authors propose Distribution-Matching Policy Optimization (DMPO), which approximates forward KL minimization within on-policy RL. DMPO constructs a group-level target distribution over sampled trajectories, proportional to their rewards, and aligns the policy distribution to this target. This yields mode-covering behavior without sampling from the intractable global target distribution and sustains exploration throughout training. On NP-Bench, DMPO reaches a Quality Ratio of 43.9% on text tasks (vs. 40.1% for GRPO) and 43.1% on vision tasks (vs. 38.4%), relative improvements of 9% and 12%, respectively. The gains carry over to mathematical reasoning (+2.0%) and out-of-domain tasks (+2.3%).
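The contrast between reverse and forward KL can be illustrated with a toy calculation. The sketch below is not taken from the paper: the two-solution target and the `eps` values are hypothetical, chosen only to show how each divergence reacts as a policy collapses onto one of two equally good solutions.

```python
import math

def kl(a, b):
    """KL(a || b) for two discrete distributions given as lists of probabilities."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

# Hypothetical target: two equally good solutions (an assumption for illustration).
target = [0.5, 0.5]

# Shrink the mass on the second solution to mimic a collapsing policy.
for eps in (1e-1, 1e-3, 1e-6):
    policy = [1.0 - eps, eps]
    reverse = kl(policy, target)   # KL(policy || target), the mode-seeking direction
    forward = kl(target, policy)   # KL(target || policy), the mode-covering direction
    print(f"eps={eps:g}  reverse KL={reverse:.4f}  forward KL={forward:.4f}")
```

In this toy setting the reverse KL stays below ln 2 however far the policy collapses, while the forward KL keeps growing as `eps` shrinks, which is the intuition behind calling the former mode-seeking and the latter mode-covering.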

📝 Abstract

On-policy reinforcement learning methods like GRPO suffer from mode collapse: they exhibit reduced solution diversity, concentrating probability mass on a single solution once discovered and ceasing exploration of alternative strategies. We show this stems from reverse KL minimization's mode-seeking behavior, which reinforces the first high-reward trajectory found rather than maintaining a distribution over multiple diverse solutions. We propose DMPO (Distribution-Matching Policy Optimization), which prevents mode collapse through principled approximation of forward KL minimization. DMPO constructs a group-level target distribution over sampled trajectories proportional to their rewards, then aligns the policy distribution to this target. This provides mode-covering behavior without requiring sampling from the intractable global target distribution, enabling sustained exploration throughout training. We validate DMPO on NP-hard combinatorial optimization, where exponentially many feasible solutions exist but only a few approach optimality, an ideal testbed for evaluating exploration. DMPO achieves 43.9% Quality Ratio on text-based NP-Bench (vs. GRPO's 40.1%) and 43.1% on vision-based NP-Bench (vs. 38.4%), demonstrating 9% and 12% relative improvements, respectively. These gains generalize to mathematical reasoning (+2.0%) and out-of-domain tasks (+2.3%), showing that diversity-preserving training enhances general reasoning capabilities across modalities. Our work establishes distribution matching as a practical, principled approach to preventing mode collapse in on-policy RL, with consistent quality improvements demonstrating sustained exploration across diverse reasoning tasks.
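A minimal sketch of what a group-level distribution-matching objective could look like follows. The abstract does not give the exact formulation, so several pieces here are assumptions: rewards are taken to be non-negative and normalized by their sum, the policy is restricted to the sampled group by a softmax over sequence log-probabilities, and the function names `group_target`, `group_policy`, and `forward_kl` are hypothetical.

```python
import math

def group_target(rewards):
    """Target over the sampled group, proportional to (non-negative) rewards."""
    total = sum(rewards)
    if total <= 0:
        # Assumption: fall back to a uniform target when no trajectory is rewarded.
        return [1.0 / len(rewards)] * len(rewards)
    return [r / total for r in rewards]

def group_policy(seq_logprobs):
    """Policy restricted to the group: softmax over sequence log-probabilities."""
    m = max(seq_logprobs)  # subtract the max for numerical stability
    exps = [math.exp(lp - m) for lp in seq_logprobs]
    z = sum(exps)
    return [e / z for e in exps]

def forward_kl(target, policy):
    """KL(target || policy); terms with zero target mass contribute nothing."""
    return sum(q * math.log(q / p) for q, p in zip(target, policy) if q > 0)

# Hypothetical group of four sampled trajectories for one prompt.
rewards = [0.9, 0.8, 0.1, 0.0]
seq_logprobs = [-2.0, -9.0, -3.0, -4.0]  # the policy currently favors the first one

target = group_target(rewards)
policy = group_policy(seq_logprobs)
loss = forward_kl(target, policy)
print(f"target={target}")
print(f"policy={policy}")
print(f"forward KL loss={loss:.4f}")
```

Under these assumptions the second trajectory has a high reward but low policy probability, so it dominates the loss. Minimizing such an objective would push probability toward every well-rewarded trajectory in the group rather than only the one the policy already prefers.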

Problem

Research questions and friction points this paper is trying to address.

mode collapse

reinforcement learning

solution diversity

on-policy RL

distribution matching

Innovation

Methods, ideas, or system contributions that make the work stand out.

Distribution Matching

Mode Collapse

Forward KL Minimization