🤖 AI Summary
This work proposes Offline eXploration-Aware fine-tuning (OXA), a novel supervised fine-tuning (SFT) approach that explicitly incorporates exploration awareness to address the limitation of conventional SFT methods in mathematical reasoning—namely, their inability to model exploratory behavior and thus provide high-quality initial policies for reinforcement learning. OXA introduces a confidence-aware data reweighting mechanism that amplifies low-confidence yet verified teacher-distilled samples to absorb novel reasoning patterns while suppressing high-confidence but incorrect self-distilled samples to reallocate probability mass. This strategy increases the entropy of the initial policy, yielding a more effective starting point for subsequent reinforcement learning. Evaluated across six mathematical reasoning benchmarks, OXA consistently outperforms standard SFT, achieving average gains of +6 Pass@1 and +5 Pass@k on the Qwen2.5-1.5B-Math model and maintaining its advantage throughout RLVR training.
📝 Abstract
By encouraging self-exploration, reinforcement learning from verifiable rewards (RLVR) has significantly advanced the mathematical reasoning capabilities of large language models. As the starting point for RLVR, the capacity of supervised fine-tuning (SFT) to memorize new chain-of-thought trajectories provides a crucial initialization that shapes the subsequent exploration landscape. However, existing research primarily focuses on facilitating exploration during RLVR training, leaving exploration-aware SFT under-explored. To bridge this gap, we propose Offline eXploration-Aware (OXA) fine-tuning. Specifically, OXA optimizes two objectives: promoting low-confidence verified teacher-distillation data to internalize previously uncaptured reasoning patterns, and suppressing high-confidence incorrect self-distillation data to redistribute probability mass from incorrect patterns toward potentially correct candidates. Experimental results across six benchmarks show that OXA consistently improves mathematical reasoning performance, notably achieving an average gain of $+6$ Pass@1 and $+5$ Pass@$k$ points over conventional SFT on Qwen2.5-1.5B-Math. Crucially, OXA elevates initial policy entropy, and its performance gains persist throughout extensive RLVR training, demonstrating the long-term value of the approach.
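The abstract does not give OXA's exact weighting function, but the two objectives can be sketched as a per-sample weight applied to the SFT loss. In this illustrative sketch, the function name `oxa_weight`, the scaling factor `alpha`, and the linear forms are assumptions for exposition, not the paper's actual formulation:

```python
def oxa_weight(confidence: float, is_correct: bool, is_teacher: bool,
               alpha: float = 2.0) -> float:
    """Hypothetical per-sample loss weight mimicking OXA's two objectives.

    - Verified (correct) teacher-distilled samples are upweighted as the
      model's confidence falls, so novel reasoning patterns the model has
      not yet captured receive more gradient signal.
    - Incorrect self-distilled samples receive a negative weight that grows
      with confidence, pushing probability mass away from confident mistakes.
    - All other samples keep the plain SFT weight of 1.

    `alpha` and the linear scaling are illustrative choices.
    """
    if is_teacher and is_correct:
        return 1.0 + alpha * (1.0 - confidence)   # promote low-confidence verified data
    if not is_teacher and not is_correct:
        return -alpha * confidence                # suppress high-confidence wrong data
    return 1.0                                    # standard SFT weight otherwise
```

In training, this weight would multiply each sample's token-level negative log-likelihood: a negative weight reverses the gradient on confident incorrect self-distilled traces, redistributing probability mass, while the boosted weight on low-confidence verified teacher traces helps the policy internalize new patterns and raises initial entropy.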