HS-STaR: Hierarchical Sampling for Self-Taught Reasoners via Difficulty Estimation and Budget Reallocation

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Self-Taught Reasoner (STaR) pipelines allocate their sampling budget uniformly across all questions, ignoring heterogeneity in problem difficulty and learning utility; in particular, they underestimate the high pedagogical value of questions near the model's capability boundary. Method: a hierarchical sampling framework with (1) lightweight reward-guided pre-sampling to estimate question difficulty and identify boundary-level problems, and (2) a difficulty-aware, two-stage dynamic budget reallocation mechanism that increases high-quality response sampling without exceeding the original sampling budget. The approach integrates reinforcement-based self-training, multi-round sampling optimization, and difficulty-informed data filtering. Contribution/Results: Evaluated on mathematical reasoning benchmarks (e.g., GSM8K), the method significantly outperforms state-of-the-art STaR variants, achieving substantial accuracy gains while preserving the original total sampling budget.
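The reward-guided pre-sampling step can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function names, the use of mean reward score as the difficulty proxy, and the boundary thresholds are all assumptions.

```python
def estimate_difficulty(problems, generate, reward, k=4):
    """Pre-sampling phase (sketch): draw k responses per problem and
    use the mean reward-model score as a difficulty proxy.
    `generate` and `reward` are hypothetical callables standing in
    for the policy LLM and the reward model."""
    difficulty = {}
    for p in problems:
        scores = [reward(p, generate(p)) for _ in range(k)]
        # High mean reward -> easy; low -> too hard; mid-range -> boundary.
        difficulty[p] = sum(scores) / k
    return difficulty

def boundary_problems(difficulty, low=0.2, high=0.8):
    """Keep problems that are neither trivial nor hopeless for the
    model (the thresholds here are illustrative, not from the paper)."""
    return [p for p, d in difficulty.items() if low <= d <= high]
```

With a toy reward model, only the mid-difficulty problem survives the filter, which mirrors the paper's focus on boundary-level questions.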

📝 Abstract
Self-taught reasoners (STaRs) enhance the mathematical reasoning abilities of large language models (LLMs) by leveraging self-generated responses for self-training. Recent studies have incorporated reward models to guide response selection or decoding, aiming to obtain higher-quality data. However, they typically allocate a uniform sampling budget across all problems, overlooking the varying utility of problems at different difficulty levels. In this work, we conduct an empirical study and find that problems near the boundary of the LLM's reasoning capability offer significantly greater learning utility than both easy and overly difficult ones. To identify and exploit such problems, we propose HS-STaR, a Hierarchical Sampling framework for Self-Taught Reasoners. Given a fixed sampling budget, HS-STaR first performs lightweight pre-sampling with a reward-guided difficulty estimation strategy to efficiently identify boundary-level problems. Subsequently, it dynamically reallocates the remaining budget toward these high-utility problems during a re-sampling phase, maximizing the generation of valuable training data. Extensive experiments across multiple reasoning benchmarks and backbone LLMs demonstrate that HS-STaR significantly outperforms other baselines without requiring additional sampling budget.
Problem

Research questions and friction points this paper is trying to address.

Optimizes sampling budget allocation for self-taught reasoners
Identifies high-utility problems near LLM capability boundary
Enhances training data quality without extra sampling cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical sampling for self-taught reasoners
Reward-guided difficulty estimation strategy
Dynamic budget reallocation for high-utility problems
Feng Xiong
Alibaba Group
Hongling Xu
Harbin Institute of Technology at Shenzhen
Yifei Wang
Alibaba Group
Runxi Cheng
Tsinghua University
Yong Wang
Alibaba Group
Xiangxiang Chu
Alibaba Group