🤖 AI Summary
Large language models (LLMs) trained with reinforcement learning (RL) suffer from low sample efficiency and perform poorly on challenging multi-step mathematical reasoning tasks.
Method: We propose QuestA, a problem-augmentation framework that injects high-quality partial solutions into RL training and dynamically adjusts problem difficulty to reduce optimization complexity and provide dense supervisory signals.
Contribution/Results: Theoretical analysis demonstrates that QuestA significantly improves sample efficiency. Empirical evaluation on a 1.5B-parameter model establishes new state-of-the-art results across multiple high-difficulty mathematical benchmarks: AIME24 (+5.3% absolute, 67.1%), AIME25 (+10.0%, 59.5%), and HMMT25 (+4.0%, 35.5%). QuestA overcomes fundamental convergence and generalization bottlenecks of conventional RL in complex reasoning, delivering a scalable framework for continual improvement of LLMs’ mathematical reasoning capabilities.
📝 Abstract
Reinforcement learning (RL) has become a key component in training large language reasoning models (LLMs). However, recent studies question its effectiveness in improving multi-step reasoning, particularly on hard problems. To address this challenge, we propose a simple yet effective strategy, Question Augmentation: introduce partial solutions during training to reduce problem difficulty and provide more informative learning signals. When applied during RL training on math reasoning tasks, our method, QuestA, improves not only pass@1 but also pass@k, particularly on problems where standard RL struggles to make progress. This enables continual improvement over strong open-source models such as DeepScaleR and OpenMath Nemotron, further enhancing their reasoning capabilities. We achieve new state-of-the-art results on math benchmarks with 1.5B-parameter models: 67.1% (+5.3%) on AIME24, 59.5% (+10.0%) on AIME25, and 35.5% (+4.0%) on HMMT25. Further, we provide theoretical explanations of why QuestA improves sample efficiency, offering a practical and generalizable pathway for expanding reasoning capability through RL.
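The core idea described above, revealing part of a reference solution to reduce a problem's difficulty and then scaling the hint back as training progresses, might be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, prompt wording, and the linear hint schedule are all assumptions made for exposition.

```python
# Illustrative sketch of question augmentation with partial solutions.
# Assumption: a reference solution is available as newline-separated steps;
# `hint_fraction` controls how much of it is revealed to the model.

def augment_question(question: str, reference_solution: str,
                     hint_fraction: float) -> str:
    """Prepend a prefix of the reference solution to lower difficulty."""
    steps = reference_solution.split("\n")
    n_hint = int(len(steps) * hint_fraction)
    hint = "\n".join(steps[:n_hint])
    if not hint:  # no hint requested: return the original problem
        return question
    return f"{question}\n\nPartial solution:\n{hint}\n\nContinue from here."


def hint_schedule(step: int, total_steps: int, start: float = 0.5) -> float:
    """Hypothetical linear anneal: shrink the hint so problems return
    to full difficulty by the end of RL training."""
    return max(0.0, start * (1 - step / total_steps))
```

In an RL loop, one would call `hint_schedule` each training step and build prompts with `augment_question`, so early rollouts see easier, partially solved problems (denser reward signal) and later rollouts see the unmodified problems.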