🤖 AI Summary
This work addresses a tension between reinforcement learning (RL)-based post-training and inference-time scaling in large language models: RL sharpens the output distribution, which reduces the solution diversity that sampling-based inference methods rely on. To mitigate this, the authors propose Exploration-Driven Optimization (EDO), which extends reward-biasing-style exploration objectives to iterative post-training and integrates them into iterative Direct Preference Optimization (iDPO) and Group Relative Policy Optimization (GRPO), yielding ED-iDPO and ED-GRPO. Both variants produce more diverse solutions and reason better, particularly when combined with test-time computation such as self-consistency. EDO improves accuracy by 1.0–1.3% over the strongest baselines across three in-distribution reasoning benchmarks and yields an additional 1.5% average gain on five out-of-distribution tasks, while preserving model entropy, stabilizing RL training dynamics, and helping prevent over-optimization collapse.
📝 Abstract
Post-training techniques combined with inference-time scaling significantly enhance the reasoning and alignment capabilities of large language models (LLMs). However, a fundamental tension arises: inference-time methods benefit from diverse sampling from a relatively flattened probability distribution, whereas reinforcement learning (RL)-based post-training inherently sharpens these distributions. To address this, we propose Exploration-Driven Optimization (EDO), which extends reward-biasing-style exploration objectives to iterative post-training and integrates them into standard RL objectives, encouraging greater diversity in sampled solutions while facilitating more effective inference-time computation. We incorporate EDO into iterative Direct Preference Optimization (iDPO) and Group Relative Policy Optimization (GRPO), resulting in two variants: ED-iDPO and ED-GRPO. Extensive experiments demonstrate that both ED-iDPO and ED-GRPO exhibit greater solution diversity and improved reasoning abilities, particularly when combined with test-time computation techniques like self-consistency. Across three in-distribution reasoning benchmarks, EDO achieves a 1.0–1.3% improvement over the strongest baselines, and delivers an additional 1.5% average gain on five out-of-distribution tasks. Beyond accuracy, EDO preserves model entropy and stabilizes RL training dynamics, highlighting its effectiveness in preventing over-optimization collapse. Taken together, these results establish EDO as a practical framework for balancing exploration and exploitation in LLM reasoning, especially in settings that rely on test-time scaling.
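
To make the test-time side of this tension concrete, the following is a minimal sketch of self-consistency (majority voting over sampled final answers), together with the Shannon entropy of the sampled answers as a rough proxy for solution diversity. It is not the authors' implementation: the function names and the toy answer lists are hypothetical, and the entropy here is computed over final answers rather than over the model's token distribution.

```python
import math
from collections import Counter


def self_consistency(answers):
    """Return the majority-vote answer among sampled final answers.

    Ties are resolved in favor of the answer that was sampled first.
    """
    return Counter(answers).most_common(1)[0][0]


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical answer distribution."""
    counts = Counter(answers)
    n = len(answers)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())


# Hypothetical samples: a sharpened policy repeats one answer,
# while a more exploratory policy spreads mass over several candidates.
sharp = ["12"] * 8
diverse = ["12", "15", "15", "12", "15", "9", "15", "15"]

print(self_consistency(sharp), answer_entropy(sharp))
print(self_consistency(diverse), answer_entropy(diverse))
```

In this toy setup, voting cannot change the sharpened policy's answer because every sample agrees, whereas the more diverse samples give the vote several candidates to choose among. This is only an illustration of why a flatter sampling distribution can matter for test-time computation, not evidence for the reported gains.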