🤖 AI Summary
This work addresses the limited cross-domain generalization of existing data selection methods for supervised fine-tuning of large language models, where trade-offs between efficiency and performance remain challenging. It reveals, for the first time, a directional relationship between differential entropy shifts and task domains: reasoning tasks favor entropy increase (cognitive expansion), whereas general instruction-following tasks prefer entropy decrease (cognitive compression). Building on this insight, the authors propose InstructDiff, a unified domain-adaptive data selection framework that integrates warm-up calibration, bidirectional negative log-likelihood (NLL) filtering, and differential entropy-based ranking to enable efficient, task-aware data curation. Using only 10% of the training data, the method outperforms full-data training by 17% (relative) on mathematical reasoning tasks and by 52% on general instruction-following benchmarks, substantially surpassing current baselines.
📝 Abstract
Supervised fine-tuning (SFT) is fundamental to adapting large language models, yet training on complete datasets incurs prohibitive costs with diminishing returns. Existing data selection methods suffer from severe domain specificity: techniques optimized for general instruction-following fail on reasoning tasks, and vice versa. We observe that measuring entropy differences between base models and minimally instruction-tuned calibrated models reveals a pattern: samples with the lowest differential entropy consistently yield optimal performance across domains, yet this principle manifests domain-adaptively, with reasoning tasks favoring entropy increase (cognitive expansion) and general tasks favoring entropy decrease (cognitive compression). We introduce InstructDiff, a unified framework that operationalizes differential entropy as a domain-adaptive selection criterion through warm-up calibration, bidirectional NLL filtering, and entropy-based ranking. Extensive experiments show that InstructDiff achieves a 17% relative improvement over full-data training on mathematical reasoning and 52% on general instruction-following, outperforming prior baselines while using only 10% of the data.
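To make the selection criterion concrete, the core idea can be sketched as follows: score each candidate sample by the entropy shift between the base model and a lightly calibrated model, then rank domain-adaptively. This is a minimal illustrative sketch, not the authors' implementation; the function names, the toy per-token distributions, and the `domain` flag are all assumptions for illustration.

```python
# Hypothetical sketch of differential-entropy ranking. All names and the
# toy distributions are illustrative assumptions, not InstructDiff's code.
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def sample_entropy(per_token_probs):
    """Mean token-level entropy over a sample's positions."""
    return sum(token_entropy(p) for p in per_token_probs) / len(per_token_probs)

def differential_entropy(base_probs, calibrated_probs):
    """Entropy shift: calibrated minus base. Positive means entropy increase
    (cognitive expansion, favored for reasoning); negative means decrease
    (cognitive compression, favored for general instruction-following)."""
    return sample_entropy(calibrated_probs) - sample_entropy(base_probs)

def select_top_fraction(samples, domain, fraction=0.10):
    """Rank samples by differential entropy, domain-adaptively, keep top fraction."""
    # Reasoning domain: prefer the largest entropy increase;
    # general domain: prefer the largest entropy decrease.
    reverse = (domain == "reasoning")
    ranked = sorted(samples, key=lambda s: s["delta_h"], reverse=reverse)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]

# Toy usage: two token positions over a two-token vocabulary.
base = [[0.5, 0.5], [0.9, 0.1]]
calib = [[0.8, 0.2], [0.95, 0.05]]
dh = differential_entropy(base, calib)   # negative: calibration sharpened the model

samples = [{"id": i, "delta_h": d} for i, d in enumerate([-0.3, 0.2, -0.1, 0.4])]
top = select_top_fraction(samples, "reasoning", fraction=0.5)
print(dh, [s["id"] for s in top])
```

In a real pipeline the per-token distributions would come from forward passes of the base and warm-up-calibrated models over each candidate sample, with bidirectional NLL filtering applied before this ranking step.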