🤖 AI Summary
Long-chain-of-thought (long-CoT) reasoning instruction tuning faces three challenges: large-scale data requirements, the absence of principled sample-selection criteria, and high computational overhead. Method: We propose the first instruction-quality modeling approach grounded in emergent “rethinking behaviors” (e.g., self-correction, backtracking) exhibited in reasoning traces; further, we design a weighted joint ranking mechanism that integrates quantified problem difficulty with reasoning-trajectory length to automatically select high-quality long-reasoning samples. Contribution/Results: Using only 10% of curated data from OpenR1-Math-220k for supervised fine-tuning, our method achieves performance on par with or exceeding full-data training and OpenR1-Qwen-7B across nine mathematical benchmarks, significantly enhancing long-chain reasoning capability. The approach is scalable and generalizes across diverse data pools, establishing a novel paradigm for efficiently activating long-chain reasoning in large language models.
📝 Abstract
A practical approach to activating long chain-of-thought reasoning ability in pre-trained large language models is to perform supervised fine-tuning on instruction datasets synthesized by strong Large Reasoning Models such as DeepSeek-R1, offering a cost-effective alternative to reinforcement learning. However, large-scale instruction sets with more than 100k samples incur significant training overhead, while effective strategies for automatic long-CoT instruction selection remain unexplored. In this work, we propose Select2Reason, a novel and efficient instruction-tuning data selection framework for long-CoT reasoning. From the perspective of the emergence of rethinking behaviors such as self-correction and backtracking, we investigate common metrics that may determine the quality of long-CoT reasoning instructions. Select2Reason leverages a quantifier to estimate the difficulty of a question and jointly incorporates a reasoning-trace length-based heuristic through a weighted ranking scheme to prioritize high-utility examples. Empirical results on OpenR1-Math-220k demonstrate that fine-tuning an LLM on only 10% of the data selected by Select2Reason achieves performance competitive with or superior to full-data tuning and the open-source baseline OpenR1-Qwen-7B across three competition-level and six comprehensive mathematical benchmarks. Further experiments highlight its scalability across varying data sizes, efficiency during inference, and adaptability to other instruction pools with minimal cost.
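The weighted joint ranking described above can be sketched in a few lines. This is a minimal illustration based only on the abstract, not the paper's actual implementation: the weight `w`, the min-max normalization, and the use of token count as trace length are all assumptions for illustration; how the difficulty quantifier produces its score is left abstract.

```python
# Hypothetical sketch of Select2Reason-style selection: combine a per-question
# difficulty score with reasoning-trace length via a weighted sum, rank, and
# keep the top fraction. The weight w, min-max normalization, and word-count
# length proxy are illustrative assumptions, not the paper's exact design.

def minmax(xs):
    """Scale a list of numbers to [0, 1] so the two metrics are comparable."""
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]

def select2reason_rank(samples, w=0.5, keep_ratio=0.1):
    """samples: list of dicts with 'difficulty' (float, from some quantifier)
    and 'trace' (str, the long-CoT reasoning trajectory)."""
    diff = minmax([s["difficulty"] for s in samples])
    length = minmax([len(s["trace"].split()) for s in samples])
    # Weighted joint score: higher difficulty and longer traces rank first.
    scores = [w * d + (1 - w) * l for d, l in zip(diff, length)]
    order = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    k = max(1, int(keep_ratio * len(samples)))
    return [samples[i] for i in order[:k]]
```

With `keep_ratio=0.1`, this mirrors the paper's setting of fine-tuning on the top 10% of the pool; the balance between difficulty and length is controlled by a single weight, which in practice would be tuned.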