🤖 AI Summary
Problem: Self-teaching reasoning language models (e.g., STaR, RFT) suffer from training-sample imbalance under stochastic data sampling: they overfit easy instances while leaving hard cases under-covered.
Method: We propose AdaSTaR, a dual-adaptive data sampling framework that jointly applies (1) diversity-aware balanced sampling to mitigate observation imbalance, and (2) curriculum-style difficulty scheduling dynamically calibrated to the model's evolving capability. The method integrates seamlessly with rejection sampling fine-tuning (RFT), requiring no additional annotations or architectural modifications.
Results: AdaSTaR achieves the best test accuracy on all six reasoning benchmarks and reduces training FLOPs by 58.6% on average, with gains that generalize across different pre-trained LMs and larger model scales, improving both training efficiency and generalization.
📝 Abstract
Self-Taught Reasoners (STaR), also known as Rejection Sampling Fine-Tuning (RFT), is an integral part of the training pipeline of self-improving reasoning Language Models (LMs). The self-improving mechanism often employs random observation (data) sampling, which produces an imbalance in how often observations are trained on: the model inefficiently over-trains on already-solved examples while under-training on challenging ones. In response, we introduce Adaptive STaR (AdaSTaR), a novel algorithm that rectifies this by integrating two adaptive sampling principles: (1) Adaptive Sampling for Diversity: promoting balanced training across observations, and (2) Adaptive Sampling for Curriculum: dynamically adjusting data difficulty to match the model's evolving strength. Across six benchmarks, AdaSTaR achieves the best test accuracy in all instances (6/6) and reduces training FLOPs by an average of 58.6% against an extensive list of baselines. These improvements in performance and efficiency generalize to different pre-trained LMs and larger models, paving the way for more efficient and effective self-improving LMs.
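The two sampling principles above can be sketched as a per-example weighting scheme. The following is a minimal illustrative sketch, not the paper's actual algorithm: the function name `sampling_weights`, the inverse-count diversity term, and the use of `1 - success_rate` as a difficulty proxy matched against a scalar model-capability estimate are all assumptions made here for illustration.

```python
import random

def sampling_weights(stats, capability, alpha=1.0, beta=1.0):
    """Illustrative weights combining the two adaptive principles.

    stats: list of dicts per training observation with
        'trained_count' - how many times it has been trained on,
        'success_rate'  - fraction of sampled rationales that were correct.
    capability: scalar in [0, 1] estimating current model strength
        (hypothetical proxy, e.g. recent average success rate).
    """
    max_count = max(s["trained_count"] for s in stats) + 1
    weights = []
    for s in stats:
        # Diversity: up-weight observations trained on less often.
        diversity = 1.0 - s["trained_count"] / max_count
        # Curriculum: prefer difficulty near the model's current capability
        # (difficulty proxied here as 1 - success_rate).
        curriculum = 1.0 - abs((1.0 - s["success_rate"]) - capability)
        weights.append(max(alpha * diversity + beta * curriculum, 1e-6))
    return weights

def adaptive_sample(stats, capability, k, seed=None):
    """Draw k observation indices proportionally to the adaptive weights."""
    rng = random.Random(seed)
    w = sampling_weights(stats, capability)
    return rng.choices(range(len(stats)), weights=w, k=k)
```

Under this sketch, a frequently solved, heavily trained example receives a low weight, while a rarely sampled example whose difficulty matches the model's current capability is prioritized, which is the imbalance correction the summary describes.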