🤖 AI Summary
To address the heavy data and computational overhead of reinforcement learning (RL)-based fine-tuning for enhancing large language models’ (LLMs) reasoning capabilities, this paper proposes SPaRFT—a self-paced reinforcement fine-tuning framework. SPaRFT combines a joint semantic-and-difficulty clustering mechanism for data compression with a multi-armed bandit (MAB)-driven dynamic sample scheduling strategy, enabling adaptive curriculum learning that is aware of the model’s current capability. Unlike existing methods that rely on heuristic or costly data selection, SPaRFT requires no additional annotations or pretraining overhead, significantly improving training efficiency and generalization under limited-data regimes. Experiments across multiple reasoning benchmarks demonstrate that SPaRFT matches or surpasses state-of-the-art (SOTA) accuracy while using up to 100× fewer training samples. These results validate SPaRFT’s efficiency, scalability, and practical deployability.
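The summary describes a joint semantic-and-difficulty clustering step for data compression but gives no implementation details. The sketch below is a minimal, hypothetical illustration: each problem is represented by a semantic embedding with a scalar difficulty appended as an extra feature, a plain k-means groups them, and a few samples are kept per cluster. All function names, the difficulty weight `w`, and the per-cluster budget are assumptions, not the paper's actual procedure.

```python
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain k-means over feature vectors (lists of floats)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def cluster_and_reduce(embeddings, difficulties, k, per_cluster, w=1.0):
    """Joint semantic+difficulty clustering, then keep a small subset per cluster.

    embeddings:   hypothetical semantic vectors, one per training problem
    difficulties: scalar difficulty estimates (e.g. 1 - observed solve rate)
    w:            assumed weight balancing difficulty against semantics
    """
    feats = [e + [w * d] for e, d in zip(embeddings, difficulties)]
    assign = kmeans(feats, k)
    kept = []
    for c in range(k):
        idx = [i for i, a in enumerate(assign) if a == c]
        kept.extend(idx[:per_cluster])  # compact yet diverse subset
    return assign, kept
```

Appending difficulty as one extra coordinate is only one way to realize "joint" clustering; the paper may well use a different feature combination or clustering algorithm.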
📝 Abstract
Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL). However, such methods require extensive data and compute, making them impractical for smaller models. Current approaches to curriculum learning or data selection are largely heuristic-driven or demand extensive computational resources, limiting their scalability and generalizability. We propose **SPaRFT**, a self-paced learning framework that paces training to the capability of the model being trained by optimizing which data to use and when. First, we apply *cluster-based data reduction* to partition training data by semantics and difficulty, extracting a compact yet diverse subset that reduces redundancy. Then, a *multi-armed bandit* treats the data clusters as arms and is optimized to allocate training samples based on the model's current performance. Experiments across multiple reasoning benchmarks show that SPaRFT achieves comparable or better accuracy than state-of-the-art baselines while using up to 100× fewer samples. Ablation studies and analyses further highlight the importance of both data clustering and adaptive selection. Our results demonstrate that carefully curated, performance-driven training curricula can unlock strong reasoning abilities in LLMs with minimal resources.
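The abstract's "clusters as arms" idea can be sketched with a standard bandit algorithm, though the abstract does not say which one SPaRFT actually uses. Below is a minimal illustration assuming UCB1: each cluster is an arm, and the reward for pulling it (a hypothetical `reward_fn`, e.g. the change in validation accuracy after training on a batch from that cluster) steers where the next training samples come from.

```python
import math


def ucb1_select(counts, rewards, t):
    """Pick the arm (cluster) with the highest UCB1 score."""
    for i, n in enumerate(counts):
        if n == 0:
            return i  # try every arm once before exploiting
    return max(
        range(len(counts)),
        key=lambda i: rewards[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i]),
    )


def run_curriculum(clusters, reward_fn, steps):
    """Allocate training batches across clusters with a UCB1 bandit.

    clusters:  list of cluster ids (the bandit arms)
    reward_fn: maps a cluster id to an observed reward, e.g. the
               improvement in validation accuracy after training on it
               (an assumed stand-in for SPaRFT's actual signal)
    """
    counts = [0] * len(clusters)
    rewards = [0.0] * len(clusters)
    history = []
    for t in range(1, steps + 1):
        arm = ucb1_select(counts, rewards, t)
        r = reward_fn(clusters[arm])
        counts[arm] += 1
        rewards[arm] += r
        history.append(arm)
    return counts, history
```

Because the model's performance shifts as it learns, a practical scheduler would likely need a non-stationary variant (e.g. discounted rewards or a sliding window); the stationary UCB1 above is only the simplest instance of the arms-over-clusters idea.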