Incentivizing Reasoning from Weak Supervision

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high cost of enhancing large language models' (LLMs) reasoning capabilities, which typically relies on high-quality chain-of-thought (CoT) annotations or reinforcement learning (RL). The authors propose a weak-to-strong (W2S) paradigm in which a stronger student model is improved via supervised fine-tuning (SFT) on reasoning traces produced by significantly weaker reasoners, requiring neither reward modeling nor expensive expert demonstrations. Experiments across diverse reasoning benchmarks, including mathematical, logical, and commonsense reasoning, show that this simple recipe recovers roughly 94% of the performance gains of costly RL at a fraction of the training cost. The method generalizes across model architectures, and the code is publicly available as open source.

📝 Abstract
Large language models (LLMs) have demonstrated impressive performance on reasoning-intensive tasks, but enhancing their reasoning abilities typically relies on either reinforcement learning (RL) with verifiable signals or supervised fine-tuning (SFT) with high-quality long chain-of-thought (CoT) demonstrations, both of which are expensive. In this paper, we study a novel problem of incentivizing the reasoning capacity of LLMs without expensive high-quality demonstrations and reinforcement learning. We investigate whether the reasoning capabilities of LLMs can be effectively incentivized via supervision from significantly weaker models. We further analyze when and why such weak supervision succeeds in eliciting reasoning abilities in stronger models. Our findings show that supervision from significantly weaker reasoners can substantially improve student reasoning performance, recovering close to 94% of the gains of expensive RL at a fraction of the cost. Experiments across diverse benchmarks and model architectures demonstrate that weak reasoners can effectively incentivize reasoning in stronger student models, consistently improving performance across a wide range of reasoning tasks. Our results suggest that this simple weak-to-strong paradigm is a promising and generalizable alternative to costly methods for incentivizing strong reasoning capabilities at inference-time in LLMs. The code is publicly available at https://github.com/yuanyige/W2SR.
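The weak-to-strong recipe described in the abstract can be pictured as a simple data pipeline: a weak reasoner samples chain-of-thought traces, traces whose final answers pass a verifiable check are kept, and the surviving (question, trace) pairs become the SFT dataset for the stronger student. The sketch below illustrates that filtering step only; `weak_reasoner` is a toy stand-in (in practice a small LLM sampled with a CoT prompt), and the names and logic here are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass
class Example:
    question: str
    answer: str  # gold final answer, used only to filter weak traces


def weak_reasoner(question: str) -> tuple[str, str]:
    """Toy stand-in for a weak teacher model: returns (cot_trace, final_answer).

    It "solves" small additions but errs on larger inputs, mimicking a
    teacher that is only sometimes correct.
    """
    a, b = (int(t) for t in question.replace("?", "").split("+"))
    guess = a + b if a < 10 else a + b + 1  # deliberately wrong for a >= 10
    trace = f"We add {a} and {b} to get {guess}."
    return trace, str(guess)


def check_answer(predicted: str, gold: str) -> bool:
    """Verifiable signal: exact match on the final answer."""
    return predicted.strip() == gold.strip()


def build_sft_dataset(examples: list[Example]) -> list[dict]:
    """Keep only traces whose final answer checks out, as SFT pairs."""
    dataset = []
    for ex in examples:
        trace, final = weak_reasoner(ex.question)
        if check_answer(final, ex.answer):
            dataset.append({
                "prompt": ex.question,
                "completion": f"{trace}\nAnswer: {final}",
            })
    return dataset
```

The resulting prompt/completion pairs would then be fed to any standard SFT loop; the point of the paradigm is that no reward model or RL machinery is needed beyond this filtered-generation step.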
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM reasoning without costly high-quality demonstrations
Using weak model supervision to improve strong model reasoning
Analyzing conditions for weak supervision success in reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes stronger students on reasoning traces from significantly weaker models (W2SR)
Recovers roughly 94% of RL's performance gains at a fraction of the training cost
Generalizes across diverse reasoning benchmarks and model architectures