Scaling Towards the Information Boundary of Instruction Set: InfinityInstruct-Subject Technical Report

📅 2025-07-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing instruction datasets, though reaching millions in scale, suffer from insufficient coverage across task types, limited diversity across knowledge domains, and inadequate depth in instruction complexity—constraining fine-tuned models’ generalization to complex instructions and low-resource domains. To address this, we propose a “coverage–depth” co-enhancement paradigm, introducing a closed-loop data construction framework integrating hierarchical annotation, informative seed selection, evolutionary synthesis, and defect-driven targeted generation. This framework shifts emphasis from mere quantity to qualitative advancement, substantially expanding the information-theoretic boundary of instruction distributions. Leveraging it, we curate a high-quality dataset of 1.5 million instructions. Empirical evaluation across multiple foundation models and benchmarks (e.g., MT-Bench, AlpacaEval) demonstrates systematic improvements in instruction-following capability—particularly on challenging tasks requiring long-chain reasoning and cross-domain inference.
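The closed-loop construction described above can be sketched as a simple iteration: annotate the pool, pick seeds, evolve them into harder variants, then diagnose gaps and generate data to fill them. The sketch below is purely illustrative; every function name and the toy first-word "topic" logic are assumptions for demonstration, not the paper's actual components, which rely on LLM-based annotation, synthesis, and diagnosis.

```python
def assign_labels(instructions):
    # Hierarchical annotation stub: bucket each instruction by its first
    # word as a toy "topic" label.
    return {inst: inst.split()[0].lower() for inst in instructions}

def select_seeds(labeled, per_label=1):
    # Informative seed selection stub: keep a few instructions per topic,
    # so every covered area contributes seeds for the next round.
    buckets = {}
    for inst, label in labeled.items():
        buckets.setdefault(label, []).append(inst)
    seeds = []
    for insts in buckets.values():
        seeds.extend(insts[:per_label])
    return seeds

def evolve(seed):
    # Evolutionary synthesis stub: the paper uses an LLM to rewrite seeds
    # into deeper variants; here we just append a complexity-raising clause.
    return seed + " Explain your reasoning step by step."

def diagnose(dataset, required_topics):
    # Defect-driven diagnosis stub: report required topics the dataset
    # does not yet cover.
    covered = {inst.split()[0].lower() for inst in dataset}
    return sorted(required_topics - covered)

def closed_loop(dataset, required_topics, rounds=2):
    # One pass per round: annotate -> select -> evolve -> diagnose -> fill.
    for _ in range(rounds):
        labeled = assign_labels(dataset)
        seeds = select_seeds(labeled)
        dataset = dataset + [evolve(s) for s in seeds]
        for gap in diagnose(dataset, required_topics):
            # Targeted generation stub: add a placeholder instruction
            # for each uncovered topic.
            dataset.append(f"{gap.capitalize()} task: describe a basic example.")
    return dataset
```

The key design point mirrored here is that depth (evolution of seeds) and coverage (gap-filling after diagnosis) are enhanced inside the same loop rather than in separate one-shot passes.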

📝 Abstract
Instruction tuning has become a foundation for unlocking the capabilities of large-scale pretrained models and improving their performance on complex tasks. Thus, the construction of high-quality instruction datasets is crucial for enhancing model performance and generalizability. Although current instruction datasets have reached tens of millions of samples, models fine-tuned on them may still struggle with complex instruction following and with tasks in rare domains. This is primarily due to limited expansion in both "coverage" (of task types and knowledge areas) and "depth" (instruction complexity) of the instruction set. To address this issue, we propose a systematic instruction data construction framework that integrates a hierarchical labeling system, an informative seed selection algorithm, an evolutionary data synthesis process, and model deficiency diagnosis with targeted data generation. These components form an iterative closed loop that continuously enhances the coverage and depth of the instruction data. Based on this framework, we construct InfinityInstruct-Subject, a high-quality dataset containing ~1.5 million instructions. Experiments on multiple foundation models and benchmark tasks demonstrate its effectiveness in improving instruction-following capabilities. Further analyses suggest that InfinityInstruct-Subject offers greater coverage and depth than comparable synthesized instruction datasets. Our work lays a theoretical and practical foundation for the efficient, continuous evolution of instruction datasets, moving from data quantity expansion to qualitative improvement.
Problem

Research questions and friction points this paper is trying to address.

Enhancing instruction dataset coverage and depth
Improving complex instruction-following in rare domains
Systematic framework for high-quality instruction data construction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical labeling system enhances data organization
Evolutionary synthesis process expands instruction complexity
Deficiency diagnosis targets specific data generation needs
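The "informative seed selection" idea above can be illustrated with a greedy farthest-point heuristic: repeatedly pick the candidate most dissimilar to the seeds chosen so far, so the seed set spreads across the instruction space. This toy version (the function names and the word-overlap distance are assumptions for illustration, not the paper's algorithm, which would operate on richer representations) uses Jaccard distance over word sets:

```python
def jaccard_distance(a, b):
    # Distance between two instructions as 1 minus their word-set overlap.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(wa & wb) / len(wa | wb)

def farthest_point_seeds(candidates, k):
    # Greedy max-min selection: start from the first candidate, then
    # repeatedly add the candidate whose minimum distance to the chosen
    # seeds is largest, maximizing spread of the seed set.
    seeds = [candidates[0]]
    while len(seeds) < k:
        best = max(
            (c for c in candidates if c not in seeds),
            key=lambda c: min(jaccard_distance(c, s) for s in seeds),
        )
        seeds.append(best)
    return seeds
```

Given near-duplicate candidates, this heuristic prefers a topically distant instruction over a close paraphrase, which matches the stated goal of seeding diverse rather than redundant regions of the instruction space.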