🤖 AI Summary
Long-chain-of-thought (long-CoT) reasoning instruction tuning faces three challenges: large-scale data requirements, the absence of principled sample-selection criteria, and high computational overhead. Method: We propose the first instruction-quality modeling approach grounded in emergent “rethinking behaviors” (e.g., self-correction, backtracking) exhibited in reasoning traces; further, we design a weighted joint ranking mechanism that integrates quantified problem difficulty with reasoning-trajectory length to automatically select high-quality long-reasoning samples. Contribution/Results: Using only 10% of curated data from OpenR1-Math-220k for supervised fine-tuning, our method achieves performance on par with or exceeding full-data training and OpenR1-Qwen-7B across nine mathematical benchmarks, significantly enhancing long-chain reasoning capability. The approach is scalable and generalizes across diverse data pools, establishing a novel paradigm for efficiently activating long-chain reasoning in large language models.
📝 Abstract
A practical approach to activating long chain-of-thought reasoning ability in pre-trained large language models is to perform supervised fine-tuning on instruction datasets synthesized by strong Large Reasoning Models such as DeepSeek-R1, offering a cost-effective alternative to reinforcement learning. However, large-scale instruction sets with more than 100k samples incur significant training overhead, while effective strategies for automatic long-CoT instruction selection remain unexplored. In this work, we propose Select2Reason, a novel and efficient instruction-tuning data selection framework for long-CoT reasoning. From the perspective of the emergence of rethinking behaviors such as self-correction and backtracking, we investigate common metrics that may determine the quality of long-CoT reasoning instructions. Select2Reason leverages a quantifier to estimate the difficulty of a question and jointly incorporates a reasoning-trace length-based heuristic through a weighted ranking scheme to prioritize high-utility examples. Empirical results on OpenR1-Math-220k demonstrate that fine-tuning an LLM on only 10% of the data selected by Select2Reason achieves performance competitive with or superior to full-data tuning and the open-source baseline OpenR1-Qwen-7B across three competition-level and six comprehensive mathematical benchmarks. Further experiments highlight its scalability across varying data sizes, efficiency during inference, and adaptability to other instruction pools with minimal cost.
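The weighted joint ranking described above can be sketched in a few lines. This is a minimal illustration based only on the abstract, not the paper's actual implementation: the weight `w`, the min-max normalization, and the use of token count as trace length are all assumptions for illustration; how the difficulty quantifier produces its score is left abstract.

```python
# Hypothetical sketch of Select2Reason-style selection: combine a per-question
# difficulty score with reasoning-trace length via a weighted sum, rank, and
# keep the top fraction. The weight w, min-max normalization, and word-count
# length proxy are illustrative assumptions, not the paper's exact design.

def minmax(xs):
    """Scale a list of numbers to [0, 1] so the two metrics are comparable."""
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]

def select2reason_rank(samples, w=0.5, keep_ratio=0.1):
    """samples: list of dicts with 'difficulty' (float, from some quantifier)
    and 'trace' (str, the long-CoT reasoning trajectory)."""
    diff = minmax([s["difficulty"] for s in samples])
    length = minmax([len(s["trace"].split()) for s in samples])
    # Weighted joint score: higher difficulty and longer traces rank first.
    scores = [w * d + (1 - w) * l for d, l in zip(diff, length)]
    order = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    k = max(1, int(keep_ratio * len(samples)))
    return [samples[i] for i in order[:k]]
```

With `keep_ratio=0.1`, this mirrors the paper's setting of fine-tuning on the top 10% of the pool; the balance between difficulty and length is controlled by a single weight, which in practice would be tuned.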