AdaSTaR: Adaptive Data Sampling for Training Self-Taught Reasoners

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Self-teaching reasoning language models (e.g., STaR, RFT) suffer from training-sample imbalance under stochastic data sampling, overfitting on easy instances while leaving hard ones under-covered. Method: a dual-adaptive data sampling framework that jointly applies (1) diversity-aware balanced sampling, keeping training coverage even across observations, and (2) curriculum-style difficulty scheduling, dynamically calibrated to the model's evolving capability. The method integrates seamlessly with rejection sampling fine-tuning (RFT), requiring no additional annotations or architectural modifications. Results: best test accuracy on all six reasoning benchmarks, a 58.6% average reduction in training FLOPs, and gains that generalize across different pre-trained LMs and larger model scales, improving both training efficiency and downstream performance.

📝 Abstract
Self-Taught Reasoners (STaR), synonymously known as Rejection sampling Fine-Tuning (RFT), is an integral part of the training pipeline of self-improving reasoning Language Models (LMs). The self-improving mechanism often employs random observation (data) sampling. However, this results in trained observation imbalance; inefficiently over-training on solved examples while under-training on challenging ones. In response, we introduce Adaptive STaR (AdaSTaR), a novel algorithm that rectifies this by integrating two adaptive sampling principles: (1) Adaptive Sampling for Diversity: promoting balanced training across observations, and (2) Adaptive Sampling for Curriculum: dynamically adjusting data difficulty to match the model's evolving strength. Across six benchmarks, AdaSTaR achieves best test accuracy in all instances (6/6) and reduces training FLOPs by an average of 58.6% against an extensive list of baselines. These improvements in performance and efficiency generalize to different pre-trained LMs and larger models, paving the way for more efficient and effective self-improving LMs.
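The two sampling principles from the abstract can be illustrated with a minimal sketch. This is a hypothetical weighting scheme written for this summary, not the paper's actual algorithm: a diversity term favors observations trained on less often, and a curriculum term favors observations whose estimated difficulty (empirical failure rate) is close to the model's current capability. The function name, the per-example `stats` fields, and the `alpha`/`beta` mixing weights are all assumptions.

```python
import random

def adaptive_sample(stats, capability, k, alpha=1.0, beta=1.0):
    """Pick k observation indices by combining two illustrative scores:
    - diversity: inverse of how often the example was trained on so far;
    - curriculum: highest when the example's empirical failure rate
      is close to the model's current capability level in [0, 1].
    """
    weights = []
    for s in stats:
        # Diversity term: under-trained examples get larger weight.
        diversity = alpha / (1 + s["trained"])
        # Curriculum term: peaks when difficulty matches capability.
        difficulty = s["failures"] / max(1, s["attempts"])
        curriculum = beta * (1 - abs(difficulty - capability))
        weights.append(diversity + curriculum)
    # Weighted sampling with replacement over observation indices.
    return random.choices(range(len(stats)), weights=weights, k=k)
```

Under this toy scoring, a frequently solved, heavily trained example gets a low weight on both terms, which matches the imbalance the abstract describes random sampling as failing to correct.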
Problem

Research questions and friction points this paper is trying to address.

Addresses trained observation imbalance in self-improving LMs
Improves efficiency by reducing redundant training on solved examples
Dynamically adjusts data difficulty to match model capability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive sampling for balanced training diversity
Dynamic curriculum sampling for model strength
Reduces training FLOPs by 58.6% on average
Authors

Woosung Koh (Trillion Labs, KAIST AI)
Wonbeen Oh (Yonsei University)
Jaein Jang (Yonsei University)
MinHyung Lee (Yonsei University)
Hyeongjin Kim (Yonsei University)
Ah Yeon Kim (Yonsei University)
Joonkee Kim (LG AI Research)
Junghyun Lee (KAIST AI)
Taehyeon Kim (LG AI Research)
Se-Young Yun (KAIST AI)