🤖 AI Summary
Current LLM reasoning enhancement heavily relies on human annotation or reward modeling, facing severe scalability limitations and prohibitive annotation costs. This paper introduces the first fully unsupervised self-training framework for reasoning improvement—requiring no external supervision signals and instead leveraging only the model’s own generation and introspection to refine reasoning paths. Our method features: (1) a stepwise lookahead resampling strategy that integrates Monte Carlo simulation of future outcomes to improve search efficiency; and (2) an Advantage Calibration Optimization (ACO) loss function that enhances noise robustness and gradient stability. Evaluated across mathematical reasoning and code generation tasks, our approach significantly outperforms supervised baselines. It is the first to empirically validate the feasibility of sustained, autonomous reasoning improvement in LLMs under a purely unsupervised paradigm. This work shifts the paradigm of reasoning advancement—from dependence on deep, costly annotation toward exploitation of broad, unlabeled data.
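The stepwise lookahead resampling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample_step` and `rollout` are hypothetical stand-ins for the LLM's step generator and its Monte Carlo simulation of a completion's outcome score.

```python
def foresight_resample(sample_step, rollout, n_candidates=4, n_rollouts=2):
    """Stepwise lookahead resampling (sketch): sample candidate next steps,
    estimate each step's value by Monte Carlo simulation of future outcomes,
    and exploit the step with the highest estimated foresight value.

    `sample_step` and `rollout` are hypothetical placeholders for the LLM's
    own generation calls; the real system scores simulated completions."""
    candidates = [sample_step() for _ in range(n_candidates)]
    values = []
    for step in candidates:
        # Simulate several futures continuing from this step and average
        # their outcome scores to estimate the step's value.
        scores = [rollout(step) for _ in range(n_rollouts)]
        values.append(sum(scores) / len(scores))
    # Exploit: keep the candidate step with the best estimated value.
    best = max(range(n_candidates), key=lambda i: values[i])
    return candidates[best], values
```

In the full framework this selection would be applied at every reasoning step, trading extra rollout compute for a better search over response sequences.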
📝 Abstract
Advancing LLM reasoning skills has attracted wide interest. However, current post-training techniques rely heavily on supervisory signals, such as outcome supervision or auxiliary reward models, which face scalability problems and high annotation costs. This motivates us to enhance LLM reasoning without external supervision. We introduce a generalizable and purely unsupervised self-training framework, named Genius. Without external auxiliary signals, Genius seeks the optimal response sequence in a stepwise manner and uses it to optimize the LLM. To explore potential steps and exploit the optimal ones, Genius introduces a stepwise foresight re-sampling strategy that samples steps and estimates their values by simulating future outcomes. Further, we recognize that the unsupervised setting inevitably induces intrinsic noise and uncertainty. To provide robust optimization, we propose an advantage-calibrated optimization (ACO) loss function that mitigates estimation inconsistencies. Combining these techniques, Genius provides an advanced initial step towards self-improving LLM reasoning with general queries and without supervision, revolutionizing reasoning scaling laws given the vast availability of general queries. The code will be released at https://github.com/xufangzhi/Genius.
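The advantage-calibration idea can be illustrated with a toy loss. This is a sketch under an assumed calibration rule (a sigmoid gate that shrinks an advantage estimate when it conflicts with the model's own confidence), not the paper's exact ACO formulation; `logprobs`, `advantages`, and `confidences` are illustrative inputs.

```python
import math

def aco_loss(logprobs, advantages, confidences, beta=1.0):
    """Advantage-calibrated optimization loss (sketch). Unsupervised step
    values are noisy, so each raw advantage is gated: when the estimated
    advantage disagrees with the model's own confidence, its contribution
    to the gradient is shrunk, stabilizing optimization.

    The sigmoid gating rule below is an illustrative assumption, not the
    paper's exact calibration."""
    total = 0.0
    for lp, adv, conf in zip(logprobs, advantages, confidences):
        # Map confidence in [0, 1] to a signed agreement term in [-1, 1];
        # the gate is near 1 when advantage and confidence agree in sign,
        # and shrinks toward 0 when they conflict (a likely noisy estimate).
        gate = 1.0 / (1.0 + math.exp(-beta * adv * (2.0 * conf - 1.0)))
        total -= gate * adv * lp  # policy-gradient-style weighted log-prob
    return total / len(logprobs)
```

The gate leaves consistent samples nearly untouched while damping inconsistent ones, which is one simple way to trade a little signal for robustness to noisy unsupervised value estimates.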