The Self-Improvement Paradox: Can Language Models Bootstrap Reasoning Capabilities without External Scaffolding?

📅 2025-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether large language models (LLMs) can autonomously enhance their mathematical reasoning capabilities without any external supervision, relying solely on self-generated data. To this end, the authors propose Crescent, a fully autonomous framework that (1) employs a bait prompt to elicit diverse raw questions from the model; (2) applies rejection sampling-based self-deduplication to those questions and multi-round majority voting over model responses to construct high-quality synthetic question-answer pairs; and (3) performs fine-tuning and knowledge distillation using only this self-synthesized data. Crescent is the first method to achieve purely self-driven capability evolution, requiring no seed data, human annotations, or auxiliary third-party models. Experiments demonstrate substantial gains in the zero-shot mathematical reasoning performance of base models without compromising general-purpose capabilities. Moreover, the distilled models outperform state-of-the-art approaches that rely on seed-data augmentation, empirically validating that LLMs can achieve genuine capability leaps via internal consistency mechanisms.

📝 Abstract
Self-improving large language models (LLMs) -- i.e., to improve the performance of an LLM by fine-tuning it with synthetic data generated by itself -- is a promising way to advance the capabilities of LLMs while avoiding extensive supervision. Existing approaches to self-improvement often rely on external supervision signals in the form of seed data and/or assistance from third-party models. This paper presents Crescent -- a simple yet effective framework for generating high-quality synthetic question-answer data in a fully autonomous manner. Crescent first elicits the LLM to generate raw questions via a bait prompt, then diversifies these questions leveraging a rejection sampling-based self-deduplication, and finally feeds the questions to the LLM and collects the corresponding answers by means of majority voting. We show that Crescent sheds light on the potential of true self-improvement with zero external supervision signals for math reasoning; in particular, Crescent-generated question-answer pairs suffice to (i) improve the reasoning capabilities of an LLM while preserving its general performance (especially in the 0-shot setting); and (ii) distil LLM knowledge to weaker models more effectively than existing methods based on seed-dataset augmentation.
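The three stages described in the abstract can be sketched end to end. This is a minimal toy illustration, not the paper's implementation: `jaccard` stands in for whatever similarity criterion Crescent actually uses, and `ask`/`solve` are stand-ins for the LLM's question-generation and answer-generation calls.

```python
import random
from collections import Counter

def jaccard(a, b):
    """Toy lexical similarity; a stand-in for the paper's actual dedup criterion."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return 1.0 if not (wa | wb) else len(wa & wb) / len(wa | wb)

def self_deduplicate(questions, threshold=0.7):
    """Rejection-sampling-style filter: reject a question that is
    too similar to one already kept."""
    kept = []
    for q in questions:
        if all(jaccard(q, k) < threshold for k in kept):
            kept.append(q)
    return kept

def majority_vote(samples):
    """Most frequent answer and its vote share (a self-consistency score)."""
    answer, votes = Counter(samples).most_common(1)[0]
    return answer, votes / len(samples)

def crescent_pipeline(ask, solve, bait_prompt, n_questions=20, n_samples=8):
    """Elicit raw questions, deduplicate them, then majority-vote the answers."""
    raw = [ask(bait_prompt) for _ in range(n_questions)]
    pairs = []
    for q in self_deduplicate(raw):
        best, share = majority_vote([solve(q) for _ in range(n_samples)])
        pairs.append((q, best, share))
    return pairs

# Toy stand-ins for the model's question- and answer-generation calls.
rng = random.Random(0)
bank = ["What is 7 + 5?", "What is 9 * 3?"]
ask = lambda prompt: rng.choice(bank)
solve = lambda q: (rng.choice(["12", "12", "12", "11"]) if "7" in q
                   else rng.choice(["27", "27", "26"]))
pairs = crescent_pipeline(ask, solve, "Pose a new math problem.")
```

The vote share attached to each pair gives a natural confidence signal: pairs with low agreement can be discarded before fine-tuning.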
Problem

Research questions and friction points this paper is trying to address.

Self-improvement of LLMs without external supervision
Autonomous generation of synthetic question-answer data
Enhancing reasoning capabilities in zero-shot settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autonomous synthetic data generation
Rejection sampling-based self-deduplication
Majority voting for answer collection
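After voting, the pairs feed the fine-tuning and distillation stage as ordinary supervised data. A minimal sketch, assuming (question, answer, agreement) triples from the voting stage; the 0.5 agreement cutoff and the prompt/completion field names are illustrative assumptions, not the paper's choices.

```python
import json
import tempfile

def to_sft_dataset(pairs, path, min_agreement=0.5):
    """Write (question, answer, agreement) triples as a JSONL supervised
    fine-tuning file, dropping low-consensus pairs. The cutoff and field
    names are hypothetical."""
    kept = 0
    with open(path, "w") as f:
        for question, answer, agreement in pairs:
            if agreement < min_agreement:
                continue  # low self-consistency suggests a noisy label
            f.write(json.dumps({"prompt": question, "completion": answer}) + "\n")
            kept += 1
    return kept

path = tempfile.mktemp(suffix=".jsonl")
n = to_sft_dataset([("What is 7 + 5?", "12", 0.875),
                    ("What is 9 * 3?", "26", 0.375)], path)
```

Filtering on agreement is one concrete way the "internal consistency mechanisms" mentioned in the summary translate into a cleaner training set.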