🤖 AI Summary
Small language models (SLMs) often lack robust metacognitive awareness and reasoning capabilities. Method: We propose ReflectEvo, the first self-driven reflective evolution framework for SLMs, in which models iteratively generate high-quality, multi-domain reflective data (460K samples) without relying on large-model distillation or human annotation. ReflectEvo combines instruction expansion and multi-task coverage to construct diverse training data, then trains on it with supervised fine-tuning (SFT) and direct preference optimization (DPO). Contribution/Results: On standard reasoning benchmarks, Llama-3 and Mistral gain 18.8 and 26.7 percentage points in accuracy, reaching 71.2% and 71.1%, respectively. On BIG-bench, they match or surpass three leading open-weight large language models (Llama-3-70B, Mixtral-8x22B, and Qwen2-72B), demonstrating that self-reflection enables error localization, correction, and sustained self-improvement in SLMs.
📝 Abstract
We present ReflectEvo, a novel pipeline demonstrating that small language models (SLMs) can enhance meta-introspection through reflection learning. The pipeline iteratively generates self-reflections for self-training, fostering a continuous, self-evolving process. Leveraging this pipeline, we construct ReflectEvo-460k, a large-scale, comprehensive, self-generated reflection dataset with broadened instructions and diverse multi-domain tasks. Building on this dataset, we show that reflection learning with SFT and DPO improves SLMs' reasoning abilities, boosting Llama-3 from 52.4% to 71.2% and Mistral from 44.4% to 71.1%. These results validate that ReflectEvo can rival or even surpass the reasoning capability of three prominent open-source models on BIG-bench without distillation from superior models or fine-grained human annotation. We further analyze in depth the high quality of the self-generated reflections and their impact on error localization and correction. Our work highlights the potential of continuously enhancing the reasoning performance of SLMs through iterative reflection learning over the long run.
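The before and after accuracies quoted in the abstract can be turned into gains in two ways: an absolute gain in percentage points, or a relative gain over the baseline. The short Python check below is only an illustrative sketch. The `summarize_gain` helper and the dictionary layout are hypothetical names introduced here, not part of ReflectEvo; only the four accuracy figures come from the abstract.

```python
# Accuracies (%) before and after reflection learning, as quoted in the abstract.
reported = {
    "Llama-3": (52.4, 71.2),
    "Mistral": (44.4, 71.1),
}


def summarize_gain(before: float, after: float) -> dict:
    """Hypothetical helper: absolute gain (percentage points) and relative gain (%)."""
    absolute = round(after - before, 1)
    relative = round(100 * (after - before) / before, 1)
    return {"absolute_points": absolute, "relative_percent": relative}


gains = {model: summarize_gain(*scores) for model, scores in reported.items()}

for model, g in gains.items():
    print(f"{model}: +{g['absolute_points']} points ({g['relative_percent']}% relative)")
```

The absolute gains computed this way are 18.8 and 26.7 percentage points. The relative gains are larger numbers, so it matters which of the two a reported improvement refers to.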