🤖 AI Summary
To address the weak generalization of large language models (LLMs) caused by narrow instruction-tuning data distributions and their misalignment with pre-training knowledge, this paper proposes a coverage-aligned adaptive instruction-data synthesis framework. Methodologically, it is the first to systematically align instruction-tuning distributions with pre-training distributions: it introduces a coverage-bias detection mechanism to identify knowledge gaps, employs controllable text rewriting to transform underrepresented pre-training texts into high-quality instruction-response pairs, and designs a balanced fusion strategy for multi-stage data integration. The framework achieves significant performance gains across three fully open-source LLMs and eight benchmark datasets, and ablation studies confirm that the components work synergistically. This work establishes a novel paradigm for preserving pre-training knowledge while enhancing task-specific adaptation in LLMs.
📝 Abstract
Instruction tuning enhances large language models (LLMs) to follow human instructions across diverse tasks, relying on high-quality datasets to guide behavior. However, these datasets, whether manually curated or synthetically generated, are often narrowly focused and misaligned with the broad distributions captured during pre-training, limiting LLM generalization and effective use of pre-trained knowledge. We propose *Aligning Instruction Tuning with Pre-training* (AITP), a method that bridges this gap by identifying coverage shortfalls in instruction-tuning datasets and rewriting underrepresented pre-training data into high-quality instruction-response pairs. This approach enriches dataset diversity while preserving task-specific objectives. Evaluations on three fully open LLMs across eight benchmarks demonstrate consistent performance improvements with AITP. Ablations highlight the benefits of adaptive data selection, controlled rewriting, and balanced integration, emphasizing the importance of aligning instruction tuning with pre-training distributions to unlock the full potential of LLMs.
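The coverage-shortfall detection step could be implemented in several ways; one common approach is to flag pre-training samples whose embeddings lie far from the instruction-tuning set. The sketch below uses mean k-nearest-neighbor distance in a shared embedding space with a quantile threshold. The embedding model, the k-NN criterion, and the threshold are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def coverage_gaps(pretrain_emb, instruct_emb, k=5, quantile=0.9):
    """Flag pre-training samples that the instruction set underrepresents.

    A sample's gap score is its mean distance to the k nearest
    instruction embeddings; scores at or above the given quantile are
    treated as coverage gaps (candidates for rewriting into
    instruction-response pairs). This is a sketch, not AITP's exact rule.
    """
    # Pairwise Euclidean distances, shape (n_pretrain, n_instruct).
    d = np.linalg.norm(
        pretrain_emb[:, None, :] - instruct_emb[None, :, :], axis=-1
    )
    # Mean distance to the k nearest instruction embeddings per sample.
    knn_mean = np.sort(d, axis=1)[:, :k].mean(axis=1)
    threshold = np.quantile(knn_mean, quantile)
    return np.where(knn_mean >= threshold)[0]

# Tiny demo with synthetic embeddings (real use would embed actual text).
rng = np.random.default_rng(0)
instruct = rng.normal(0.0, 1.0, (100, 8))            # well-covered region
pretrain = np.vstack([
    rng.normal(0.0, 1.0, (18, 8)),                   # already covered
    rng.normal(10.0, 1.0, (2, 8)),                   # a distant topic
])
gap_idx = coverage_gaps(pretrain, instruct)          # flags the distant rows
```

In a full pipeline, the flagged texts would then go through the controlled rewriting and balanced-integration stages the abstract describes.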