🤖 AI Summary
This work identifies three anomalous phenomena in Reinforcement Learning from Verifiable Rewards (RLVR) with large language models: two-phase learning, V-shaped response-length evolution, and severe catastrophic forgetting. To explain them, we propose the “reasoning-as-semantic-network-self-organization” hypothesis: RLVR training induces structural phase transitions in a semantic complex network whose persistently sparse topology (mean degree ≈ 2) produces skill isolation and abrupt capability emergence. We present the first dynamical model of RLVR as a semantic-network phase-transition process, enabling the principled design of a heating mechanism at the point of maximal frustration and of SFT pre-warming to relieve competitive bottlenecks. On this basis, we introduce *Annealed-RLVR*, a unified algorithmic framework integrating RLVR, supervised fine-tuning (SFT), complex-network theory, and phase-transition analysis. Evaluated on a 1.5B-parameter model, Annealed-RLVR significantly mitigates forgetting and outperforms standard RLVR on both in-distribution and out-of-distribution reasoning benchmarks.
📝 Abstract
Training large language models with Reinforcement Learning from Verifiable Rewards (RLVR) exhibits a set of distinctive and puzzling behaviors that remain poorly understood, including a two-stage learning curve, V-shaped response-length trajectories, and a pronounced vulnerability to catastrophic forgetting. In this work, we propose that these seemingly disparate phenomena can be explained by a single unifying theory: the model's reasoning process maps onto the self-organization of a semantic complex network whose topology remains persistently sparse, with the average degree pinned close to two. This topology imposes a fundamental mechanism on forgetting and learning: it first drives the system into a maximally frustrated state where “skill islands” form, learning slows, and forgetting is induced; it then enters a sharp growth phase in which new skills are “bolted on”, driven by phase-transition-like learning at the network's frontier. Equipped with this theory, we propose *Annealed-RLVR*, a principled algorithm that introduces an SFT-based “heating” step at the point of maximal frustration to resolve the competitive bottleneck and enhance the model's reasoning capability. Experiments on a 1.5B-parameter model demonstrate that the approach outperforms standard RLVR on both in-distribution and out-of-distribution benchmarks. By recasting RLVR from black-box optimization into a predictable process of structural self-organization, our work provides new physical intuition for engineering the emergent reasoning capabilities of future AI systems.
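To make the scheduling idea concrete, here is a minimal sketch of how a trainer might detect the maximal-frustration point and switch to the SFT “heating” step. This is not the paper's implementation: the signals (a reward plateau coinciding with shrinking response lengths, i.e. the bottom of the V-shaped length curve) and all thresholds are illustrative assumptions.

```python
def detect_max_frustration(reward_history, length_history, window=50):
    """Heuristic trigger for the SFT 'heating' step (illustrative only).

    Fires when the training reward has plateaued over the last `window`
    steps while the mean response length is still shrinking, which this
    sketch treats as a proxy for the maximally frustrated state.
    """
    if len(reward_history) < 2 * window or len(length_history) < 2 * window:
        return False  # not enough history to compare two windows

    mean = lambda xs: sum(xs) / len(xs)
    # Reward plateau: recent window mean barely differs from the previous one.
    reward_plateau = abs(
        mean(reward_history[-window:]) - mean(reward_history[-2 * window:-window])
    ) < 0.01  # assumed tolerance
    # Length still shrinking: descending toward the bottom of the V.
    length_shrinking = (
        mean(length_history[-window:]) < mean(length_history[-2 * window:-window])
    )
    return reward_plateau and length_shrinking


def training_step(step, reward_history, length_history):
    """Toy dispatch: anneal (RLVR) by default, heat (SFT) at frustration."""
    if detect_max_frustration(reward_history, length_history):
        return "sft_heating"   # hypothetical phase name
    return "rlvr"
```

In this sketch the trigger is purely observational, so it composes with any RLVR loop; a real implementation would also need hysteresis (to avoid flapping between phases) and a budget for how long the heating phase lasts.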