FABSVer: Faster Training and Better Self-Verification for LLM Mathematical Reasoning

📅 2026-05-27

🤖 AI Summary

This work addresses two problems: large language models cannot reliably verify their own solutions in mathematical reasoning, and existing approaches to self-verification are costly to train. The authors propose a unified training framework that folds problem solving and verification into a single generation pass and optimizes both jointly with reward-based reinforcement learning, adding a Dynamic Reference Model Update (DRMU) mechanism. The approach improves self-verification, outperforms state-of-the-art methods on multiple mathematical benchmarks, and cuts training time to 51%–71% of that of prior approaches. The study also finds that model scale plays a critical role in verification capability.

📝 Abstract

While large language models have made significant progress in mathematical reasoning, they remain unreliable at judging the correctness of their own solutions. Existing approaches that equip models with self-verification typically treat solution generation and verification as two separate tasks, leading to substantially increased training time. In this paper, we propose FABSVer, which fuses these two tasks into a single generation pass, dramatically reducing training overhead while jointly optimizing both capabilities. We further identify a convergence bottleneck both theoretically and empirically: as training progresses, the reward reaches a plateau because the policy is constrained by a fixed reference model. To overcome this, we introduce Dynamic Reference Model Update (DRMU), which raises the reward ceiling and enables sustained reward growth. Extensive experiments on math benchmarks demonstrate that FABSVer achieves superior self-verification and reasoning performance across three model scales, while requiring only 51%–71% of the training time of existing methods. Analysis further reveals distinct learning phases in how models acquire self-verification, and that the gap between verify and answer rewards shrinks noticeably as model size increases.
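The plateau described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's algorithm: it assumes a standard KL-regularized objective (expected reward minus `beta` times the KL divergence to the reference), a binary reward, and a hypothetical schedule in which the reference is simply replaced by the current optimum after each round. Under those assumptions the regularized optimum has the closed form π* ∝ π_ref · exp(r / β), so its success probability is pinned to the reference's success probability. The function names, `beta`, and the starting probability are illustrative choices, not values from the paper.

```python
import math


def kl_optimal_success_prob(p_ref: float, beta: float) -> float:
    """Success probability of the KL-regularized optimum pi* ∝ pi_ref * exp(r / beta)
    for a binary reward r in {0, 1}, given the reference success probability p_ref."""
    weight_correct = p_ref * math.exp(1.0 / beta)  # mass on reward-1 outputs
    weight_wrong = 1.0 - p_ref                     # mass on reward-0 outputs (exp(0) = 1)
    return weight_correct / (weight_correct + weight_wrong)


def reward_ceiling(p_ref0: float, beta: float, num_ref_updates: int) -> list[float]:
    """Expected reward at the regularized optimum in each round.

    Assumption (hypothetical schedule): after every round the reference
    is replaced by the optimum reached in that round.
    """
    p_ref = p_ref0
    ceilings = []
    for _ in range(num_ref_updates + 1):
        p_star = kl_optimal_success_prob(p_ref, beta)
        ceilings.append(p_star)
        p_ref = p_star  # dynamic reference update
    return ceilings


# Fixed reference: a single ceiling that training cannot move past.
fixed = reward_ceiling(0.2, beta=1.0, num_ref_updates=0)
# Updated reference: the ceiling rises with every update.
dynamic = reward_ceiling(0.2, beta=1.0, num_ref_updates=3)
print(fixed, dynamic)
```

In this toy setting each reference update multiplies the odds of a correct output by exp(1 / β), so the ceiling keeps rising toward 1 instead of stalling at the first value. How FABSVer actually schedules and performs the update is not specified in the abstract.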

Problem

Research questions and friction points this paper is trying to address.

mathematical reasoning

self-verification

large language models

training efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-verification

mathematical reasoning

dynamic reference model update

training efficiency