Unified Data Selection for LLM Reasoning

📅 2026-05-21
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the difficulty of obtaining high-quality data for training large language models on complex reasoning tasks, where existing data selection methods are either computationally expensive or unreliable at separating high-quality from low-quality reasoning samples. The authors propose High-Entropy Sum (HES), a training-free metric that scores reasoning quality by summing the entropy of only the highest-entropy tokens in each sample (for example, the top 0.5%). HES is validated across three mainstream training paradigms: supervised fine-tuning (SFT), rejection fine-tuning (RFT), and reinforcement learning (RL). In SFT, training on only the top 20% of HES-ranked data matches the performance of training on the full dataset, while training on the lowest-HES data degrades it. In RFT and RL, HES-based selection significantly outperforms the baseline methods it is compared against, improving reasoning capability while reducing computational overhead.
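One way to write the metric down, as a hedged formalization rather than the paper's own notation: the symbols below (p_θ for the scoring model, 𝒱 for the vocabulary, T for the number of tokens in a sample, ρ for the kept fraction) are assumptions introduced here, and the summary does not say which model supplies the token distributions.

```latex
H_t = -\sum_{v \in \mathcal{V}} p_\theta(v \mid x_{<t}) \log p_\theta(v \mid x_{<t}),
\qquad
\mathrm{HES}(x) = \sum_{t \in \mathcal{T}_\rho(x)} H_t,
\qquad
|\mathcal{T}_\rho(x)| = \lceil \rho T \rceil
```

Here 𝒯_ρ(x) denotes the token positions with the largest entropies H_t, and ρ = 0.005 corresponds to the 0.5% example. Rounding up with the ceiling is also an assumption; the source does not specify how fractional token counts are handled.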
📝 Abstract
Effectively training Large Language Models (LLMs) for complex, long-CoT reasoning is often bottlenecked by the need for massive high-quality reasoning data. Existing methods are either computationally expensive or fail to reliably distinguish high- from low-quality reasoning samples. To address this, we propose High-Entropy Sum (HES), a training-free metric that quantifies reasoning quality by summing only the entropy of the top (e.g., 0.5%) highest-entropy tokens in each reasoning sample. We validate HES across three mainstream training paradigms: Supervised Fine-tuning (SFT), Rejection Fine-tuning (RFT), and Reinforcement Learning (RL), with extensive results demonstrating its consistent effectiveness and significantly reduced computational overhead. In SFT, training on the top 20% HES-ranked data matches full-dataset performance, while using the lowest-HES data degrades it. In RFT, our HES-based training approach significantly outperforms baseline methods. In RL, HES-selected successful trajectories enable the model to learn strong reasoning patterns, significantly surpassing other compared methods. Our findings establish HES as a robust, training-free metric that enables a unified, effective, and efficient method for developing advanced reasoning in LLMs.
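The scoring and selection steps described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`token_entropy`, `high_entropy_sum`, `select_top_by_hes`) are hypothetical, per-token entropies are assumed to be already available from some scoring model, and rounding the token count up with a floor of one token is a choice made here because the abstract does not specify it.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_sum(entropies: Sequence[float], top_fraction: float = 0.005) -> float:
    """Sum the entropies of the highest-entropy tokens in one sample.

    `top_fraction` = 0.005 mirrors the 0.5% example in the abstract.
    Assumption: the count is rounded up and at least one token is kept.
    """
    if not entropies:
        return 0.0
    k = max(1, math.ceil(top_fraction * len(entropies)))
    return sum(sorted(entropies, reverse=True)[:k])


def select_top_by_hes(samples: List[Sequence[float]], keep_fraction: float = 0.2) -> List[int]:
    """Return the indices (in ascending order) of the highest-HES samples.

    Each element of `samples` is the list of per-token entropies for one
    reasoning sample. `keep_fraction` = 0.2 mirrors the top-20% SFT setting.
    """
    scores = [high_entropy_sum(s) for s in samples]
    n_keep = max(1, math.ceil(keep_fraction * len(samples)))
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])
```

In practice the per-token entropies would come from a forward pass of a language model over each reasoning sample, which is what keeps the metric training-free: scoring requires inference only, with no gradient updates.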
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
reasoning data selection
long-chain-of-thought reasoning
data quality assessment
computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

High-Entropy Sum
data selection
LLM reasoning
training-free metric
chain-of-thought