🤖 AI Summary
Existing unsupervised self-improvement methods suffer from exploration decay and entropy collapse: outputs become increasingly short, homogeneous, and poorly generalizable. To address this, we propose EVOL-RL, an unsupervised reinforcement learning framework that integrates “majority voting + novelty-driven mutation” into language model self-evolution. EVOL-RL combines GRPO optimization, semantic novelty rewards, asymmetric clipping, and entropy regularization to sustain both output diversity and reasoning depth without labeled data. After label-free training on AIME24, Qwen3-4B-Base gains 11.8 percentage points in AIME25 pass@1 over the TTRL baseline (4.6% → 16.4%) and 19.4 points in pass@16 (18.5% → 37.9%). Cross-domain generalization gains are also observed on GPQA. The core contribution is addressing the exploration-exploitation trade-off in a fully unsupervised setting, enabling stable and diverse self-evolution of language models.
📝 Abstract
Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges. Existing label-free methods (confidence minimization, self-consistency, or majority-vote objectives) stabilize learning but steadily shrink exploration, causing an entropy collapse: generations become shorter, less diverse, and brittle. Unlike prior approaches such as Test-Time Reinforcement Learning (TTRL), which primarily adapt models to the immediate unlabeled dataset at hand, our goal is broader: to enable general improvements without sacrificing the model's inherent exploration capacity and generalization ability, i.e., evolving. We formalize this issue and propose EVolution-Oriented and Label-free Reinforcement Learning (EVOL-RL), a simple rule that couples stability with variation in a label-free setting. EVOL-RL keeps the majority-voted answer as a stable anchor (selection) while adding a novelty-aware reward that favors responses whose reasoning, measured in semantic space, differs from what has already been produced (variation). Implemented with GRPO, EVOL-RL also uses asymmetric clipping to preserve strong signals and an entropy regularizer to sustain search. This majority-for-selection + novelty-for-variation design prevents collapse, maintains longer and more informative chains of thought, and improves both pass@1 and pass@n. EVOL-RL consistently outperforms the majority-only TTRL baseline; e.g., training on label-free AIME24 lifts Qwen3-4B-Base AIME25 pass@1 from TTRL's 4.6% to 16.4%, and pass@16 from 18.5% to 37.9%. EVOL-RL not only prevents diversity collapse but also unlocks stronger generalization across domains (e.g., GPQA). Furthermore, we demonstrate that EVOL-RL boosts performance in the RLVR setting, highlighting its broad applicability.
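The majority-for-selection + novelty-for-variation idea can be illustrated with a minimal sketch. The abstract does not give the exact reward formula, so everything below is an assumption made for illustration: the function names, the ±1 base reward, the `novelty_weight` parameter, and the use of mean cosine similarity between precomputed response embeddings as the semantic novelty measure. The final helper applies the standard group-relative normalization used by GRPO.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def majority_answer(answers):
    """Selection: the most frequent final answer in the sampled group."""
    return Counter(answers).most_common(1)[0][0]


def novelty_scores(embeddings):
    """Variation: 1 - mean cosine similarity to the other responses in the group.

    Assumption: each embedding is a semantic vector of one response's reasoning,
    produced by some external encoder that is not shown here.
    """
    scores = []
    for i, e in enumerate(embeddings):
        sims = [cosine(e, other) for j, other in enumerate(embeddings) if j != i]
        scores.append(1.0 - sum(sims) / len(sims) if sims else 0.0)
    return scores


def evol_rewards(answers, embeddings, novelty_weight=0.5):
    """Hypothetical reward: +1/-1 for agreeing with the majority, plus a novelty bonus.

    With novelty_weight < 1, the bonus (at most 2 * novelty_weight) cannot lift
    a minority response above a majority one, so the vote stays the anchor.
    """
    anchor = majority_answer(answers)
    rewards = [
        (1.0 if ans == anchor else -1.0) + novelty_weight * nov
        for ans, nov in zip(answers, novelty_scores(embeddings))
    ]
    return anchor, rewards


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards standardized within the sampled group."""
    mean = sum(rewards) / len(rewards)
    std = sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]
```

In this sketch, a response that reaches the majority answer through reasoning unlike the rest of the group earns the highest reward, while near-duplicate majority responses earn less. Asymmetric clipping and the entropy regularizer act on the policy update itself and are not modeled here.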