DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning

📅 2026-02-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited reasoning diversity and the weak, unstable learning signals of existing verifier-based reinforcement learning methods for large language models. To overcome these issues, the authors propose a dual-scale diversity regularization framework that decouples reasoning diversity into global pattern disparity and local intra-trajectory stochasticity. The former encourages structural differences among distinct correct reasoning paths, while the latter enhances internal exploration via a length-invariant token-level entropy regularizer. A global-to-local coupling allocation mechanism is further introduced to enable synergistic optimization of the two scales. Theoretically, the approach preserves optimal correctness under bounded regularization while maintaining effective learning signals. Extensive experiments across multiple reasoning benchmarks demonstrate significant improvements in both accuracy and pass@k, validating the efficacy of dual-scale diversity in promoting deep exploration.
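The global component above rewards structural differences among distinct correct reasoning paths. The summary does not specify the disparity measure, so the sketch below uses mean pairwise Jaccard distance over token bigrams purely as an illustrative stand-in; the function names and the choice of metric are assumptions, not the authors' implementation:

```python
from itertools import combinations

def ngram_set(tokens, n=2):
    """Set of token n-grams in one reasoning path (n=2 is an arbitrary choice)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def global_disparity(correct_paths, n=2):
    """Mean pairwise Jaccard distance among correct reasoning paths.

    Hypothetical proxy for DSDR's global pattern disparity: 0.0 when all
    correct paths share the same n-grams, 1.0 when they share none.
    """
    if len(correct_paths) < 2:
        return 0.0  # disparity is undefined for fewer than two paths
    dists = []
    for a, b in combinations(correct_paths, 2):
        sa, sb = ngram_set(a, n), ngram_set(b, n)
        union = sa | sb
        dists.append((1.0 - len(sa & sb) / len(union)) if union else 0.0)
    return sum(dists) / len(dists)
```

Two identical paths score 0.0, two paths with disjoint bigrams score 1.0; a batch of sampled trajectories with a higher score would receive a larger global diversity reward under this reading.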

📝 Abstract
Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and prematurely stop deep exploration, while conventional entropy regularization introduces only local stochasticity and fails to induce meaningful path-level diversity, leading to weak and unstable learning signals in group-based policy optimization. We propose DSDR, a Dual-Scale Diversity Regularization reinforcement learning framework that decomposes diversity in LLM reasoning into global and coupling components. Globally, DSDR promotes diversity among correct reasoning trajectories to explore distinct solution modes. Locally, it applies a length-invariant, token-level entropy regularization restricted to correct trajectories, preventing entropy collapse within each mode while preserving correctness. The two scales are coupled through a global-to-local allocation mechanism that emphasizes local regularization for more distinctive correct trajectories. We provide theoretical support showing that DSDR preserves optimal correctness under bounded regularization, sustains informative learning signals in group-based optimization, and yields a principled global-to-local coupling rule. Experiments on multiple reasoning benchmarks demonstrate consistent improvements in accuracy and pass@k, highlighting the importance of dual-scale diversity for deep exploration in RLVR. Code is available at https://github.com/SUSTechBruce/DSDR.
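The local component in the abstract — a length-invariant, token-level entropy bonus restricted to correct trajectories, allocated according to how distinctive each trajectory is — can be sketched as follows. Using the mean token entropy for length invariance and a linear distinctiveness weight are assumptions for illustration, as is the function name; this is not the paper's implementation:

```python
def dsdr_bonus(token_entropies, correct, distinctiveness, beta=0.01):
    """Illustrative dual-scale diversity bonus (hypothetical sketch).

    token_entropies: per-trajectory lists of token-level policy entropies
    correct: verifier outcome per trajectory (True/False)
    distinctiveness: per-trajectory global disparity score in [0, 1]
    beta: regularization strength, kept small per the paper's bounded-
          regularization condition for preserving optimal correctness
    """
    bonus = 0.0
    for entropies, ok, dist in zip(token_entropies, correct, distinctiveness):
        if not ok or not entropies:
            continue  # entropy bonus applies only to correct trajectories
        mean_h = sum(entropies) / len(entropies)  # mean, not sum: length-invariant
        # global-to-local coupling: more distinctive correct paths
        # receive a larger share of the local exploration bonus
        bonus += beta * dist * mean_h
    return bonus
```

For example, with two sampled trajectories where only the first is correct, only that trajectory contributes, scaled by its distinctiveness; incorrect trajectories are excluded so the bonus cannot trade correctness for entropy.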
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
large language model
reasoning
exploration
diversity regularization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-Scale Diversity Regularization
Reinforcement Learning with Verifiers
Reasoning Trajectory Diversity
Entropy Regularization
Global-to-Local Coupling