🤖 AI Summary
Existing data selection methods for instruction tuning of large language models (LLMs) struggle to ensure high quality, diversity, and alignment with target objectives simultaneously, largely because they rely on flat embeddings or coarse-grained labels that overlook fine-grained knowledge and its hierarchical structure. To overcome this limitation, the authors propose the Tree-aware Aligned Global Sampling (TAGS) framework, which organizes fine-grained knowledge into a hierarchical knowledge tree through LLM-driven atomic annotation and bottom-up clustering. TAGS further introduces tree-aware metrics for quality and diversity, along with a KL-divergence constraint, enabling precise and interpretable targeted sampling. Remarkably, using only 5% of the data, TAGS outperforms full-data training by 5.84%, and alignment-aware sampling yields an additional average performance gain of 4.24%, significantly surpassing current state-of-the-art methods.
📝 Abstract
Effective and controllable data selection is critical for LLM instruction tuning, especially with massive open-source datasets. Existing approaches primarily rely on instance-level quality scores or on diversity metrics based on embedding clusters or semantic tags. However, constrained by the flatness of embedding spaces or the coarseness of tags, these approaches overlook fine-grained knowledge and its intrinsic hierarchical dependencies, consequently hindering precise data valuation and knowledge-aligned sampling. To address this challenge, we propose Tree-aware Aligned Global Sampling (TAGS), a unified framework that leverages a knowledge tree built from fine-grained tags, thereby enabling joint control of global quality, diversity, and target alignment. Using an LLM-based tagger, we extract atomic knowledge concepts, which are organized into a global tree through bottom-up hierarchical clustering. By grounding data instances onto this tree, a tree-aware metric then quantifies data quality and diversity, facilitating effective sampling. Our controllable sampling strategy maximizes tree-level information gain and enforces leaf-level alignment via KL divergence for specific domains. Extensive experiments demonstrate that TAGS significantly outperforms state-of-the-art baselines. Notably, it surpasses the full-dataset model by **+5.84%** using only **5%** of the data, while our aligned sampling strategy further boosts average performance by **+4.24%**.
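The leaf-level alignment idea can be sketched in a few lines: select a subset whose empirical distribution over knowledge-tree leaves stays close, in KL divergence, to a target domain distribution. The greedy selection rule, the two-leaf toy tree, and all names below are our own illustrative assumptions; the paper's actual sampling objective also maximizes tree-level information gain, which this sketch omits.

```python
import math

# Hypothetical sketch of KL-aligned sampling: greedily pick instances so
# that the chosen subset's leaf distribution approaches a target one.

def kl(p, q, eps=1e-9):
    """KL(p || q) over a shared list of leaf categories."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def leaf_dist(counts):
    total = sum(counts) or 1
    return [c / total for c in counts]

def greedy_aligned_sample(instances, target, k):
    """instances: list of (id, leaf_index); pick k minimizing KL to target."""
    counts = [0] * len(target)
    chosen, pool = [], list(instances)
    for _ in range(k):
        best = None
        for idx, (_, leaf) in enumerate(pool):
            trial = counts[:]
            trial[leaf] += 1  # tentatively add this instance's leaf
            d = kl(target, leaf_dist(trial))
            if best is None or d < best[0]:
                best = (d, idx)
        _, idx = best
        iid, leaf = pool.pop(idx)
        counts[leaf] += 1
        chosen.append(iid)
    return chosen, counts

# Toy pool: leaf 0 = "math", leaf 1 = "code"; target favors math 3:1.
pool = [("a", 0), ("b", 0), ("c", 0), ("d", 1), ("e", 1), ("f", 1)]
chosen, counts = greedy_aligned_sample(pool, target=[0.75, 0.25], k=4)
print(counts)  # → [3, 1], matching the 3:1 target ratio
```

Even this naive greedy rule recovers the target mix exactly on the toy pool, which conveys why a KL constraint gives interpretable, controllable domain alignment.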