PCL-Reasoner-V1.5: Advancing Math Reasoning with Offline Reinforcement Learning

📅 2026-01-21
🤖 AI Summary
This work addresses the limited performance and training instability of large language models on complex mathematical reasoning tasks by proposing an offline reinforcement learning (RL) method. The resulting 32-billion-parameter model is built on Qwen2.5-32B and refined via supervised fine-tuning (SFT) followed by the proposed offline RL, which the authors report provides better training stability and efficiency than standard online RL methods such as GRPO. All experiments were conducted on Huawei Ascend 910C NPUs. The model achieves state-of-the-art performance among models post-trained on Qwen2.5-32B, with average accuracies of 90.9% on AIME 2024 and 85.6% on AIME 2025.

📝 Abstract
We present PCL-Reasoner-V1.5, a 32-billion-parameter large language model (LLM) for mathematical reasoning. The model is built upon Qwen2.5-32B and refined via supervised fine-tuning (SFT) followed by reinforcement learning (RL). A central innovation is our proposed offline RL method, which provides superior training stability and efficiency over standard online RL methods such as GRPO. Our model achieves state-of-the-art performance among models post-trained on Qwen2.5-32B, attaining average accuracies of 90.9% on AIME 2024 and 85.6% on AIME 2025. Our work demonstrates offline RL as a stable and efficient paradigm for advancing reasoning in LLMs. All experiments were conducted on Huawei Ascend 910C NPUs.
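The abstract reports "average accuracies" without describing the evaluation protocol. As a minimal sketch, assuming the figure is a mean of per-run accuracy over several independent sampling runs (a common convention for AIME-style benchmarks, not something confirmed here), the metric could be computed as follows. The function name `average_accuracy` and the toy data are hypothetical and are not results from the paper.

```python
from typing import Sequence


def average_accuracy(correct: Sequence[Sequence[bool]]) -> float:
    """Return mean accuracy in percent over independent runs.

    correct[r][p] is True if run r answered problem p correctly.
    Assumption: every run covers the same non-empty problem set.
    """
    if not correct or any(len(run) == 0 for run in correct):
        raise ValueError("need at least one run, each with at least one problem")
    # Accuracy of each run, then the unweighted mean across runs.
    per_run = [sum(run) / len(run) for run in correct]
    return 100.0 * sum(per_run) / len(per_run)


# Hypothetical toy data: 2 runs over 4 problems (illustrative only).
runs = [
    [True, True, False, True],
    [True, False, False, True],
]
score = average_accuracy(runs)
```

Averaging over runs rather than reporting a single pass reduces the variance that sampling-based decoding introduces on small benchmarks such as AIME.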
Problem

Research questions and friction points this paper is trying to address.

mathematical reasoning
large language model
offline reinforcement learning
training stability
reasoning performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

offline reinforcement learning
mathematical reasoning
large language model
training stability
Qwen2.5-32B
Yao Lu
Peng Cheng Laboratory
Dengdong Fan
Peng Cheng Laboratory
Jianzheng Nie
Peng Cheng Laboratory
Fan Xu
Peng Cheng Laboratory
Jie Chen
Peking University
computer vision, deep learning, medical image analysis
Bin Zhou
Peng Cheng Laboratory, Peking University
Yonghong Tian
Peng Cheng Laboratory, Peking University