AI Summary
This work addresses the inefficiency of large language models in reasoning, which often stems from unstructured reasoning trajectories that produce redundant steps, such as performing unnecessary verifications after arriving at the correct answer. To mitigate this, the authors propose Structured Reasoning (SCR), a framework that explicitly decouples reasoning into three components: generation, verification, and revision. SCR incorporates a dynamic termination supervision mechanism that guides the model to halt reasoning at the appropriate time. The framework is trained via a two-stage progressive reinforcement learning strategy: the first stage focuses on generation and self-verification, while the second refines the revision process, thereby disentangling the learning signals for distinct reasoning capabilities. Experiments on three mainstream models demonstrate that SCR substantially improves both reasoning efficiency and self-verification accuracy, reducing output length by up to 50%.
Abstract
Large language models (LLMs) achieve strong performance by generating long chains of thought, but longer traces often introduce redundant or ineffective reasoning steps. One typical behavior is performing unnecessary verification and revision even after reaching the correct answer. This limitation stems from the unstructured nature of reasoning trajectories and the lack of targeted supervision for critical reasoning abilities. To address this, we propose Structured Reasoning (SCR), a framework that decouples reasoning trajectories into explicit, evaluable, and trainable components. We implement SCR primarily through a Generate-Verify-Revise paradigm: we construct structured training data and apply Dynamic Termination Supervision to guide the model in deciding when to terminate reasoning. To avoid interference between the learning signals for different reasoning abilities, we adopt a progressive two-stage reinforcement learning strategy: the first stage targets initial generation and self-verification, and the second stage focuses on revision. Extensive experiments on three backbone models show that SCR substantially improves reasoning efficiency and self-verification accuracy. Moreover, compared with existing reasoning paradigms, it reduces output token length by up to 50%.
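To make the Generate-Verify-Revise paradigm concrete, the control flow it describes can be sketched as a simple loop that terminates as soon as self-verification passes, instead of re-verifying a correct answer. This is only an illustrative sketch: the function names, the placeholder component implementations, and the `max_rounds` budget are assumptions for demonstration, not part of the SCR implementation, where each component would be an LLM call trained with the two-stage RL strategy.

```python
# Hypothetical sketch of a Generate-Verify-Revise loop with dynamic
# termination. The three components are stand-in functions here; in SCR
# they correspond to trained capabilities of a single LLM.

def generate(question: str) -> str:
    # Placeholder for the initial-generation component.
    return question.upper()

def verify(question: str, answer: str) -> bool:
    # Placeholder for self-verification: True means the answer is judged
    # correct, which is the signal to stop reasoning.
    return answer == question.upper()

def revise(question: str, answer: str) -> str:
    # Placeholder for the revision component, invoked only on failure.
    return question.upper()

def structured_reason(question: str, max_rounds: int = 3) -> str:
    """Decoupled reasoning loop: generate once, then alternate
    verification and revision, halting as soon as verification passes
    (dynamic termination) rather than performing redundant checks."""
    answer = generate(question)
    for _ in range(max_rounds):
        if verify(question, answer):
            return answer  # terminate early; no further verification
        answer = revise(question, answer)
    return answer

print(structured_reason("what is 2+2"))
```

The point of the structure is that termination is an explicit, supervisable decision (the `verify` branch), which is what Dynamic Termination Supervision targets during training.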