🤖 AI Summary
Existing open-source instruction datasets suffer from narrow domain coverage (e.g., focusing solely on mathematics or programming), limiting LLM generalization and widening the performance gap with closed-source models. To address this, we propose Infinity-Instruct, a two-phase synthesis paradigm integrating data selection, instruction evolution, and diagnostic filtering: the first phase selects 7.4M foundational instructions, and the second synthesizes 1.5M high-quality chat instructions. This framework provides the first empirical validation that foundational and conversational capabilities can be jointly optimized. Leveraging Infinity-Instruct, we develop InfInstruct-LLaMA3.1-70B, which achieves an 8.6% improvement over GPT-4-0314 on instruction-following benchmarks and surpasses officially fine-tuned variants across diverse evaluation suites, including mathematical reasoning, code generation, and general dialogue.
📝 Abstract
Large Language Models (LLMs) demonstrate strong performance in real-world applications, yet existing open-source instruction datasets often concentrate on narrow domains, such as mathematics or coding, limiting generalization and widening the gap with proprietary models. To bridge this gap, we introduce Infinity-Instruct, a high-quality instruction dataset designed to enhance both foundational and chat capabilities of LLMs through a two-phase pipeline. In Phase 1, we curate 7.4M high-quality foundational instructions (InfInstruct-F-7.4M) from over 100M samples using hybrid data selection techniques. In Phase 2, we synthesize 1.5M high-quality chat instructions (InfInstruct-G-1.5M) through a two-stage process involving instruction selection, evolution, and diagnostic filtering. We empirically evaluate Infinity-Instruct by fine-tuning several open-source models, including Mistral, LLaMA, Qwen, and Yi, and observe substantial performance gains across both foundational and instruction-following benchmarks, consistently surpassing their official instruction-tuned counterparts. Notably, InfInstruct-LLaMA3.1-70B outperforms GPT-4-0314 by 8.6% on instruction-following tasks while achieving comparable foundational performance. These results underscore the synergy between foundational and chat training and offer new insights into holistic LLM development. Our dataset (https://huggingface.co/datasets/BAAI/Infinity-Instruct) and code (https://gitee.com/li-touch/infinity-instruct) have been publicly released.
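The pipeline described above (hybrid selection of foundational instructions, then evolution and diagnostic filtering of chat instructions) can be sketched at a high level. This is a minimal toy illustration, not the released implementation: the `Instruction` class, the threshold-based selector, the string-appending "evolution" step, and the length-based diagnostic are all hypothetical stand-ins for the paper's learned scorers, LLM-driven rewriting, and model-based diagnostics.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Instruction:
    text: str
    quality: float                       # score from a hypothetical quality scorer
    tags: List[str] = field(default_factory=list)

def select_foundational(pool: List[Instruction], threshold: float) -> List[Instruction]:
    """Phase 1 (sketch): keep instructions whose quality score clears a
    threshold, standing in for the hybrid data-selection techniques."""
    return [ins for ins in pool if ins.quality >= threshold]

def evolve(ins: Instruction, mutate: Callable[[str], str]) -> Instruction:
    """Phase 2a (sketch): rewrite an instruction into a richer variant;
    in the real pipeline an LLM would perform this rewriting."""
    return Instruction(text=mutate(ins.text),
                       quality=ins.quality,
                       tags=ins.tags + ["evolved"])

def diagnostic_filter(pool: List[Instruction], min_len: int = 10) -> List[Instruction]:
    """Phase 2b (sketch): drop instructions that fail a diagnostic check
    (a trivial length check here, as a placeholder)."""
    return [ins for ins in pool if len(ins.text) >= min_len]

# Toy run over a three-instruction pool
pool = [
    Instruction("Solve 2x + 3 = 7.", quality=0.9),
    Instruction("hi", quality=0.2),
    Instruction("Write a Python function to reverse a string.", quality=0.8),
]
selected = select_foundational(pool, threshold=0.5)
evolved = [evolve(i, lambda t: t + " Explain each step.") for i in selected]
final = diagnostic_filter(evolved)
print(len(selected), len(final))  # 2 2
```

The real system operates at the scale of 100M+ candidate samples and uses model-based scoring at each stage; the structure above only mirrors the order of operations the abstract describes.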