Boosting LLM Reasoning via Spontaneous Self-Correction

📅 2025-06-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) struggle with real-time, spontaneous self-correction in mathematical reasoning, relying instead on post-hoc, multi-step self-refinement that incurs high latency and requires external prompting or auxiliary systems. Method: The authors propose SPOC, a spontaneous self-correction framework that enables a single LLM to assume dual roles, solution "proposer" and "verifier", within one inference pass. SPOC interleaves step-by-step solution generation with real-time verification judgments and dynamically terminates generation upon detecting errors, eliminating the need for additional prompting or external modules. Training combines synthetic-data fine-tuning, which instills self-verification and multi-role collaboration, with online reinforcement learning to improve proposal and verification accuracy. Results: On MATH500, AMC23, and AIME24, SPOC raises the accuracy of Llama-3.1-8B-Instruct by 8.8%, 10.0%, and 3.3%, and of Llama-3.1-70B-Instruct by 11.6%, 20.0%, and 6.7%, respectively, outperforming existing self-correction paradigms.
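The single-pass interleaving described above can be sketched as a decoding loop that alternates proposal and verification and stops as soon as a step fails verification. This is a minimal illustration, not the paper's implementation: the stub model, step format, and verdict logic are all assumptions standing in for the fine-tuned LLM acting in both roles.

```python
def generate_step(problem, steps):
    """Stub proposer: returns the next solution step.

    A placeholder for the fine-tuned LLM in its 'proposer' role;
    here it just replays canned steps for illustration.
    """
    canned = ["Let x = 3.", "Then x + 4 = 7.", "Answer: 7."]
    return canned[len(steps)] if len(steps) < len(canned) else None


def verify_step(problem, steps, step):
    """Stub verifier: the same model, in its 'verifier' role, judges the step.

    Returns True if the step is judged correct (trivial check here).
    """
    return "x + 4 = 8" not in step


def spoc_decode(problem, max_steps=8):
    """Interleave proposal and verification in one pass.

    Generation terminates immediately when verification fails,
    rather than refining the full solution post hoc.
    """
    steps = []
    for _ in range(max_steps):
        step = generate_step(problem, steps)
        if step is None:
            break
        if not verify_step(problem, steps, step):
            return steps, False  # error detected: stop early
        steps.append(step)
        if step.startswith("Answer:"):
            return steps, True  # verified final answer
    return steps, False


steps, ok = spoc_decode("What is 3 + 4?")
print(ok, steps[-1])
```

The key property the sketch captures is that verification cost is paid inline during generation, so a wrong path is abandoned immediately instead of after a full (wasted) solution.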

📝 Abstract
While large language models (LLMs) have demonstrated remarkable success on a broad range of tasks, math reasoning remains a challenging one. One of the approaches for improving math reasoning is self-correction, which designs self-improving loops to let the model correct its own mistakes. However, existing self-correction approaches treat corrections as standalone post-generation refinements, relying on extra prompt and system designs to elicit self-corrections, instead of performing real-time, spontaneous self-corrections in a single pass. To address this, we propose SPOC, a spontaneous self-correction approach that enables LLMs to generate interleaved solutions and verifications in a single inference pass, with generation dynamically terminated based on verification outcomes, thereby effectively scaling inference time compute. SPOC considers a multi-agent perspective by assigning dual roles -- solution proposer and verifier -- to the same model. We adopt a simple yet effective approach to generate synthetic data for fine-tuning, enabling the model to develop capabilities for self-verification and multi-agent collaboration. We further improve its solution proposal and verification accuracy through online reinforcement learning. Experiments on mathematical reasoning benchmarks show that SPOC significantly improves performance. Notably, SPOC boosts the accuracy of Llama-3.1-8B and 70B Instruct models, achieving gains of 8.8% and 11.6% on MATH500, 10.0% and 20.0% on AMC23, and 3.3% and 6.7% on AIME24, respectively.
Problem

Research questions and friction points this paper is trying to address.

Enabling real-time self-correction in LLMs for math reasoning
Reducing reliance on extra prompts for self-improving loops
Improving accuracy through multi-agent collaboration and reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-pass interleaved solution and verification generation
Dual-role multi-agent model for self-correction
Synthetic data fine-tuning for self-verification
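One way to picture the synthetic fine-tuning data behind the last bullet is a target string that interleaves proposer steps with verifier judgments, so the model learns both roles from a single sequence. The field names and role tags below are hypothetical, chosen only to illustrate the interleaved format; the paper's actual schema may differ.

```python
# Hypothetical shape of one synthetic fine-tuning example.
# [PROPOSER] / [VERIFIER] tags and field names are illustrative assumptions.
example = {
    "prompt": "Solve: 2x + 3 = 11.",
    "target": (
        "[PROPOSER] Subtract 3 from both sides: 2x = 8.\n"
        "[VERIFIER] Correct.\n"
        "[PROPOSER] Divide both sides by 2: x = 4.\n"
        "[VERIFIER] Correct.\n"
        "[PROPOSER] Answer: x = 4."
    ),
}

# Each proposer step is immediately followed by a verification judgment.
print(example["target"].count("[VERIFIER]"))
```

Fine-tuning on sequences of this shape would teach the model to emit its own verification judgments inline, which is what makes the single-pass early termination at inference time possible.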