MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space

📅 2025-04-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of jointly optimizing quality and diversity in open-source instruction-tuning data, this paper proposes Maximum Information Gain (MIG) sampling, a novel method based on semantic label graphs. Unlike conventional heuristic metrics relying on distance or clustering, MIG constructs a semantic label graph within the instruction embedding space to jointly model data informativeness and distributional structure, then performs greedy iterative sampling to maximize information gain. Its core innovation lies in explicitly incorporating semantic structure into instruction data selection—enabling joint quantification and optimization of both quality and diversity for the first time. Experiments demonstrate that an SFT model trained on only 5% of the Tulu3 dataset achieves performance comparable to the full-data baseline, yielding +5.73% and +6.89% improvements on AlpacaEval and WildBench, respectively—substantially outperforming existing state-of-the-art methods.

📝 Abstract
Data quality and diversity are key to the construction of effective instruction-tuning datasets. With the increasing availability of open-source instruction-tuning datasets, it is advantageous to automatically select high-quality and diverse subsets from a vast amount of data. Existing methods typically prioritize instance quality and use heuristic rules to maintain diversity. However, this absence of a comprehensive view of the entire collection often leads to suboptimal results. Moreover, heuristic rules generally focus on distance or clustering within the embedding space, which fails to accurately capture the intent of complex instructions in the semantic space. To bridge this gap, we propose a unified method for quantifying the information content of datasets. This method models the semantic space by constructing a label graph and quantifies diversity based on the distribution of information within the graph. Based on such a measurement, we further introduce an efficient sampling method that selects data samples iteratively to Maximize the Information Gain (MIG) in semantic space. Experiments on various datasets and base models demonstrate that MIG consistently outperforms state-of-the-art methods. Notably, the model fine-tuned with 5% Tulu3 data sampled by MIG achieves comparable performance to the official SFT model trained on the full dataset, with improvements of +5.73% on AlpacaEval and +6.89% on WildBench.
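The paper's exact gain function and label-graph propagation are not reproduced on this page. As a rough illustration of the greedy iterative selection the abstract describes, the sketch below assumes each candidate sample carries a quality score and a set of semantic labels, and uses a concave (square-root) accumulation of quality mass per label so that repeated labels yield diminishing returns; the function name and the gain formula are illustrative stand-ins, not the authors' implementation.

```python
import math

def greedy_mig_sample(samples, k):
    """Greedily pick k samples maximizing marginal information gain.

    `samples` is a list of (quality, labels) pairs. The gain accumulates
    quality mass per semantic label through a concave sqrt, so adding a
    sample whose labels are already well covered contributes less than
    one touching fresh labels -- a simplified stand-in for the paper's
    label-graph information measure.
    """
    mass = {}                       # accumulated quality mass per label
    selected = []
    remaining = list(range(len(samples)))
    for _ in range(min(k, len(remaining))):
        best_i, best_gain = None, float("-inf")
        for i in remaining:
            quality, labels = samples[i]
            # marginal gain of adding sample i, summed over its labels
            gain = sum(
                math.sqrt(mass.get(lab, 0.0) + quality)
                - math.sqrt(mass.get(lab, 0.0))
                for lab in labels
            )
            if gain > best_gain:
                best_i, best_gain = i, gain
        quality, labels = samples[best_i]
        for lab in labels:          # commit the winner's quality mass
            mass[lab] = mass.get(lab, 0.0) + quality
        selected.append(best_i)
        remaining.remove(best_i)
    return selected
```

With three equal-quality samples labeled `{"math"}`, `{"math"}`, and `{"code"}` and a budget of two, the diminishing-returns gain steers the second pick toward the unseen `"code"` label rather than a second `"math"` sample, which is the quality-plus-diversity behavior the abstract claims for MIG.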
Problem

Research questions and friction points this paper is trying to address.

Automatically select high-quality instruction-tuning datasets
Maximize information gain in semantic space
Improve diversity and quality of data subsets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Maximizes information gain in semantic space
Models semantic space using label graph
Selects data iteratively for diversity
👥 Authors
Yicheng Chen
Shanghai AI Laboratory, Fudan University
Yining Li
Shanghai AI Laboratory
Kai Hu
Shanghai AI Laboratory, Carnegie Mellon University
Zerun Ma
Shanghai AI Laboratory
Haochen Ye
Shanghai AI Laboratory
Kai Chen
Shanghai AI Laboratory