🤖 AI Summary
To address the heavy data and computational overhead of reinforcement learning (RL)-based fine-tuning for enhancing large language models’ (LLMs) reasoning capabilities, this paper proposes SPaRFT—a self-paced reinforcement fine-tuning framework. SPaRFT combines a joint semantic-and-difficulty clustering mechanism for data compression with a multi-armed bandit (MAB)-driven dynamic sample scheduling strategy, enabling adaptive curriculum learning that is aware of the model’s current capability. Unlike existing methods that rely on heuristic or costly data selection, SPaRFT requires no additional annotations or pretraining overhead, significantly improving training efficiency and generalization under limited-data regimes. Experiments across multiple reasoning benchmarks demonstrate that SPaRFT matches or surpasses state-of-the-art (SOTA) accuracy while using up to 100× fewer training samples. These results validate SPaRFT’s efficiency, scalability, and practical deployability.
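The summary describes a joint semantic-and-difficulty clustering step for data compression but gives no implementation details. The sketch below is a minimal, hypothetical illustration: each problem is represented by a semantic embedding with a scalar difficulty appended as an extra feature, a plain k-means groups them, and a few samples are kept per cluster. All function names, the difficulty weight `w`, and the per-cluster budget are assumptions, not the paper's actual procedure.

```python
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain k-means over feature vectors (lists of floats)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def cluster_and_reduce(embeddings, difficulties, k, per_cluster, w=1.0):
    """Joint semantic+difficulty clustering, then keep a small subset per cluster.

    embeddings:   hypothetical semantic vectors, one per training problem
    difficulties: scalar difficulty estimates (e.g. 1 - observed solve rate)
    w:            assumed weight balancing difficulty against semantics
    """
    feats = [e + [w * d] for e, d in zip(embeddings, difficulties)]
    assign = kmeans(feats, k)
    kept = []
    for c in range(k):
        idx = [i for i, a in enumerate(assign) if a == c]
        kept.extend(idx[:per_cluster])  # compact yet diverse subset
    return assign, kept
```

Appending difficulty as one extra coordinate is only one way to realize "joint" clustering; the paper may well use a different feature combination or clustering algorithm.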
📝 Abstract
Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL). However, such methods require extensive data and compute, making them impractical for smaller models. Current approaches to curriculum learning or data selection are largely heuristic-driven or demand extensive computational resources, limiting their scalability and generalizability. We propose **SPaRFT**, a self-paced learning framework that paces training to the capability of the model being trained by optimizing which data to use and when. First, we apply *cluster-based data reduction* to partition training data by semantics and difficulty, extracting a compact yet diverse subset that reduces redundancy. Then, a *multi-armed bandit* treats the data clusters as arms and is optimized to allocate training samples based on the model's current performance. Experiments across multiple reasoning benchmarks show that SPaRFT achieves comparable or better accuracy than state-of-the-art baselines while using up to 100× fewer samples. Ablation studies and analyses further highlight the importance of both data clustering and adaptive selection. Our results demonstrate that carefully curated, performance-driven training curricula can unlock strong reasoning abilities in LLMs with minimal resources.
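The abstract's "clusters as arms" idea can be sketched with a standard bandit algorithm, though the abstract does not say which one SPaRFT actually uses. Below is a minimal illustration assuming UCB1: each cluster is an arm, and the reward for pulling it (a hypothetical `reward_fn`, e.g. the change in validation accuracy after training on a batch from that cluster) steers where the next training samples come from.

```python
import math


def ucb1_select(counts, rewards, t):
    """Pick the arm (cluster) with the highest UCB1 score."""
    for i, n in enumerate(counts):
        if n == 0:
            return i  # try every arm once before exploiting
    return max(
        range(len(counts)),
        key=lambda i: rewards[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i]),
    )


def run_curriculum(clusters, reward_fn, steps):
    """Allocate training batches across clusters with a UCB1 bandit.

    clusters:  list of cluster ids (the bandit arms)
    reward_fn: maps a cluster id to an observed reward, e.g. the
               improvement in validation accuracy after training on it
               (an assumed stand-in for SPaRFT's actual signal)
    """
    counts = [0] * len(clusters)
    rewards = [0.0] * len(clusters)
    history = []
    for t in range(1, steps + 1):
        arm = ucb1_select(counts, rewards, t)
        r = reward_fn(clusters[arm])
        counts[arm] += 1
        rewards[arm] += r
        history.append(arm)
    return counts, history
```

Because the model's performance shifts as it learns, a practical scheduler would likely need a non-stationary variant (e.g. discounted rewards or a sliding window); the stationary UCB1 above is only the simplest instance of the arms-over-clusters idea.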