🤖 AI Summary
Problem: Self-teaching reasoning language models (e.g., STaR, RFT) suffer from training-sample imbalance under stochastic data sampling: they overfit easy instances while leaving hard cases under-covered.
Method: We propose AdaSTaR, a dual-adaptive data sampling framework that jointly applies (1) diversity-aware balanced sampling to mitigate observation imbalance, and (2) curriculum-style difficulty scheduling dynamically calibrated to the model's evolving capability. The method integrates seamlessly with rejection sampling fine-tuning (RFT), requiring no additional annotations or architectural modifications.
Results: AdaSTaR achieves the best test accuracy on all six reasoning benchmarks and reduces training FLOPs by 58.6% on average, with gains that generalize across different pre-trained LMs and larger model scales, improving both training efficiency and generalization.
📝 Abstract
Self-Taught Reasoners (STaR), also known as Rejection Sampling Fine-Tuning (RFT), is an integral part of the training pipeline of self-improving reasoning Language Models (LMs). The self-improving mechanism often employs random observation (data) sampling, which produces an imbalance in how often observations are trained on: the model inefficiently over-trains on already-solved examples while under-training on challenging ones. In response, we introduce Adaptive STaR (AdaSTaR), a novel algorithm that rectifies this by integrating two adaptive sampling principles: (1) Adaptive Sampling for Diversity: promoting balanced training across observations, and (2) Adaptive Sampling for Curriculum: dynamically adjusting data difficulty to match the model's evolving strength. Across six benchmarks, AdaSTaR achieves the best test accuracy in all instances (6/6) and reduces training FLOPs by an average of 58.6% against an extensive list of baselines. These improvements in performance and efficiency generalize to different pre-trained LMs and larger models, paving the way for more efficient and effective self-improving LMs.
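The two sampling principles above can be sketched as a per-example weighting scheme. The following is a minimal illustrative sketch, not the paper's actual algorithm: the function name `sampling_weights`, the inverse-count diversity term, and the use of `1 - success_rate` as a difficulty proxy matched against a scalar model-capability estimate are all assumptions made here for illustration.

```python
import random

def sampling_weights(stats, capability, alpha=1.0, beta=1.0):
    """Illustrative weights combining the two adaptive principles.

    stats: list of dicts per training observation with
        'trained_count' - how many times it has been trained on,
        'success_rate'  - fraction of sampled rationales that were correct.
    capability: scalar in [0, 1] estimating current model strength
        (hypothetical proxy, e.g. recent average success rate).
    """
    max_count = max(s["trained_count"] for s in stats) + 1
    weights = []
    for s in stats:
        # Diversity: up-weight observations trained on less often.
        diversity = 1.0 - s["trained_count"] / max_count
        # Curriculum: prefer difficulty near the model's current capability
        # (difficulty proxied here as 1 - success_rate).
        curriculum = 1.0 - abs((1.0 - s["success_rate"]) - capability)
        weights.append(max(alpha * diversity + beta * curriculum, 1e-6))
    return weights

def adaptive_sample(stats, capability, k, seed=None):
    """Draw k observation indices proportionally to the adaptive weights."""
    rng = random.Random(seed)
    w = sampling_weights(stats, capability)
    return rng.choices(range(len(stats)), weights=w, k=k)
```

Under this sketch, a frequently solved, heavily trained example receives a low weight, while a rarely sampled example whose difficulty matches the model's current capability is prioritized, which is the imbalance correction the summary describes.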