🤖 AI Summary
Enhancing the reasoning capabilities of large language models (LLMs) typically requires extensive labeled data and costly supervised fine-tuning. Method: This paper proposes LightReasoner, a framework that uses small language models (SLMs) as unsupervised teaching signals. By contrasting the behavioral discrepancies between a strong and a weak model along reasoning paths, it automatically identifies high-value "critical reasoning moments" and generates high-quality supervision samples without ground-truth labels. The method operates in two stages: (1) sampling, which localizes these critical reasoning moments; and (2) fine-tuning, which performs targeted optimization on the contrastively generated samples. Contribution/Results: LightReasoner eliminates reliance on human annotations and drastically reduces training overhead: it achieves up to a 28.1% accuracy gain across seven mathematical reasoning benchmarks, cuts training time by 90%, reduces sampled problems by 80%, and lowers tuned token consumption by 99%. Its core innovation lies in distilling reasoning knowledge via inter-model behavioral contrast, enabling efficient, lightweight, and label-free reasoning enhancement.
📝 Abstract
Large language models (LLMs) have demonstrated remarkable progress in reasoning, often through supervised fine-tuning (SFT). However, SFT is resource-intensive, relying on large curated datasets, rejection-sampled demonstrations, and uniform optimization across all tokens, even though only a fraction carry meaningful learning value. In this work, we explore a counterintuitive idea: can smaller language models (SLMs) teach larger language models (LLMs) by revealing high-value reasoning moments that reflect the latter's unique strength? We propose LightReasoner, a novel framework that leverages the behavioral divergence between a stronger expert model (LLM) and a weaker amateur model (SLM). LightReasoner operates in two stages: (1) a sampling stage that pinpoints critical reasoning moments and constructs supervision examples capturing the expert's advantage through expert-amateur contrast, and (2) a fine-tuning stage that aligns the expert model with these distilled examples, amplifying its reasoning strengths. Across seven mathematical benchmarks, LightReasoner improves accuracy by up to 28.1%, while reducing time consumption by 90%, sampled problems by 80%, and tuned token usage by 99%, all without relying on ground-truth labels. By turning weaker SLMs into effective teaching signals, LightReasoner offers a scalable and resource-efficient approach for advancing LLM reasoning. Code is available at: https://github.com/HKUDS/LightReasoner
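The sampling stage's expert-amateur contrast can be illustrated with a minimal sketch. This is not the paper's exact formulation: it assumes KL divergence between the two models' next-token distributions as the contrast score and a hypothetical threshold for flagging "critical reasoning moments"; the real framework operates on full model outputs and constructs supervision examples from the flagged steps.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete next-token distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def select_critical_steps(expert_dists, amateur_dists, threshold=0.5):
    """Flag reasoning steps where expert and amateur diverge most.

    expert_dists / amateur_dists: per-step next-token probability
    distributions from the strong (LLM) and weak (SLM) model.
    Steps whose divergence exceeds the threshold are the candidate
    supervision examples; the rest are skipped, which is where the
    token-efficiency savings come from.
    """
    scores = [kl_divergence(p, q) for p, q in zip(expert_dists, amateur_dists)]
    critical = [i for i, s in enumerate(scores) if s > threshold]
    return critical, scores

# Toy example: 3 reasoning steps over a 3-token vocabulary.
expert = [
    [0.9, 0.05, 0.05],   # step 0: expert is confident
    [0.34, 0.33, 0.33],  # step 1: both models uncertain
    [0.8, 0.1, 0.1],     # step 2: expert confident again
]
amateur = [
    [0.1, 0.45, 0.45],   # step 0: amateur disagrees -> critical
    [0.33, 0.34, 0.33],  # step 1: near-identical -> skip
    [0.75, 0.15, 0.1],   # step 2: mild disagreement -> skip
]

critical, scores = select_critical_steps(expert, amateur)
```

Under these toy inputs only step 0, where the two models disagree sharply, is selected, so the fine-tuning stage would see one contrastive example instead of three, mirroring the paper's claim that only a fraction of tokens carry meaningful learning value.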