🤖 AI Summary
Existing LLM self-training methods under-sample difficult queries, leading to inadequate learning on complex tasks. To address this, we propose DAST, a difficulty-aware self-training framework built on a sampling-based dynamic difficulty estimation mechanism. DAST combines difficulty-weighted data augmentation with supervised fine-tuning (SFT) and direct preference optimization (DPO), overcoming the limitations of uniform sampling in conventional approaches. By identifying challenging queries and reinforcing both response generation and preference learning on them, DAST improves model performance and generalization on demanding tasks such as mathematical reasoning. Experiments show consistent gains across multiple benchmarks, validating the benefit of difficulty-aware strategies in LLM self-training.
📝 Abstract
Current self-training methods for Large Language Models (LLMs) tend to under-sample challenging queries, leading to inadequate learning on difficult problems and limiting model capability. This work therefore proposes a difficulty-aware self-training (DAST) framework that improves both the quantity and quality of self-generated responses to challenging queries during self-training. DAST comprises three components: 1) sampling-based difficulty level estimation, 2) difficulty-aware data augmentation, and 3) a self-training algorithm combining SFT and DPO. Experiments on mathematical tasks demonstrate the effectiveness and generalization of DAST, highlighting the critical role of difficulty-aware strategies in advancing LLM self-training.
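The abstract does not give the exact formula for component 1), but a natural reading of "sampling-based difficulty level estimation" is that difficulty is the failure rate over k sampled responses per query. A minimal sketch under that assumption follows; `model_sample_fn` and `is_correct_fn` are hypothetical stand-ins for the model's sampler and an answer verifier, not names from the paper:

```python
def estimate_difficulty(query, model_sample_fn, is_correct_fn, k=8):
    """Sampling-based difficulty estimate (a sketch, not the paper's exact method).

    Samples k responses for the query and returns the failure rate
    in [0, 1]: queries the model rarely answers correctly score high,
    so the augmentation step can allocate them more samples.
    """
    responses = [model_sample_fn(query) for _ in range(k)]
    failures = sum(1 for r in responses if not is_correct_fn(query, r))
    return failures / k


# Deterministic toy usage with stub functions standing in for an LLM:
always_wrong = lambda q: "41"
always_right = lambda q: "42"
check = lambda q, r: r == "42"

hard_score = estimate_difficulty("What is 6*7?", always_wrong, check, k=4)
easy_score = estimate_difficulty("What is 6*7?", always_right, check, k=4)
```

A difficulty-weighted augmentation step could then sample additional responses in proportion to this score, concentrating self-training data on the hard queries the abstract says are under-sampled.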