From Tags to Trees: Structuring Fine-Grained Knowledge for Controllable Data Selection in LLM Instruction Tuning

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge in instruction tuning of large language models (LLMs) where existing data selection methods struggle to simultaneously ensure high quality, diversity, and alignment with target objectives, often due to reliance on flat embeddings or coarse-grained labels that overlook fine-grained knowledge and its hierarchical structure. To overcome this limitation, the authors propose the Tree-aware Aligned Global Sampling (TAGS) framework, which organizes fine-grained knowledge into a hierarchical knowledge tree through LLM-driven atomic annotation and bottom-up clustering. TAGS further introduces tree-aware metrics for quality and diversity, along with a KL divergence constraint, enabling precise and interpretable targeted sampling. Remarkably, using only 5% of the data, TAGS outperforms full-data training by 5.84%, and alignment-aware sampling yields an additional average performance gain of 4.24%, significantly surpassing current state-of-the-art methods.
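The tree-construction step described above (atomic tags merged bottom-up into a hierarchy) can be sketched as follows. This is an illustrative toy, not the paper's implementation: `build_tree`, the 2-d embeddings, and the centroid-similarity merge rule are all assumptions for demonstration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def centroid(vectors):
    """Element-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def build_tree(tags, embed):
    """Bottom-up agglomerative clustering: repeatedly merge the two
    clusters whose centroids are most similar, yielding a binary tree
    whose leaves are atomic knowledge tags."""
    clusters = [{"tags": [t], "vecs": [embed[t]], "children": []} for t in tags]
    while len(clusters) > 1:
        best, best_sim = None, -2.0
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(centroid(clusters[i]["vecs"]),
                             centroid(clusters[j]["vecs"]))
                if sim > best_sim:
                    best_sim, best = sim, (i, j)
        i, j = best
        merged = {"tags": clusters[i]["tags"] + clusters[j]["tags"],
                  "vecs": clusters[i]["vecs"] + clusters[j]["vecs"],
                  "children": [clusters[i], clusters[j]]}
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]
```

With toy embeddings, semantically close tags (e.g. "addition" and "subtraction") merge before distant ones (e.g. "python"), producing the kind of interpretable hierarchy the authors exploit for tree-aware sampling.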

📝 Abstract
Effective and controllable data selection is critical for LLM instruction tuning, especially with massive open-source datasets. Existing approaches primarily rely on instance-level quality scores, or diversity metrics based on embedding clusters or semantic tags. However, constrained by the flatness of embedding spaces or the coarseness of tags, these approaches overlook fine-grained knowledge and its intrinsic hierarchical dependencies, consequently hindering precise data valuation and knowledge-aligned sampling. To address this challenge, we propose Tree-aware Aligned Global Sampling (TAGS), a unified framework that leverages a knowledge tree built from fine-grained tags, thereby enabling joint control of global quality, diversity, and target alignment. Using an LLM-based tagger, we extract atomic knowledge concepts, which are organized into a global tree through bottom-up hierarchical clustering. By grounding data instances onto this tree, a tree-aware metric then quantifies data quality and diversity, facilitating effective sampling. Our controllable sampling strategy maximizes tree-level information gain and enforces leaf-level alignment via KL-divergence for specific domains. Extensive experiments demonstrate that TAGS significantly outperforms state-of-the-art baselines. Notably, it surpasses the full-dataset model by +5.84% using only 5% of the data, while our aligned sampling strategy further boosts average performance by +4.24%.
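The leaf-level alignment idea (keeping the selected subset's leaf-tag distribution close to a target distribution under a KL constraint) can be sketched greedily. Everything here is a hypothetical illustration: `greedy_aligned_sample`, the instance format, and the greedy KL-minimization objective are assumptions, not the paper's exact optimization.

```python
import math
from collections import Counter

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two discrete distributions over leaf tags."""
    return sum(pi * math.log((pi + eps) / (q.get(k, 0.0) + eps))
               for k, pi in p.items())

def leaf_distribution(samples):
    """Normalize leaf-tag counts of a sample set into a distribution."""
    counts = Counter(tag for s in samples for tag in s["tags"])
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def greedy_aligned_sample(pool, target_dist, budget):
    """Greedily pick the instance that keeps the selected subset's
    leaf distribution closest (in KL) to the target distribution."""
    selected = []
    for _ in range(budget):
        best, best_kl = None, float("inf")
        for cand in pool:
            if cand in selected:
                continue
            trial = leaf_distribution(selected + [cand])
            kl = kl_divergence(trial, target_dist)
            if kl < best_kl:
                best, best_kl = cand, kl
        if best is None:
            break
        selected.append(best)
    return selected
```

For example, with a target distribution uniform over "math" and "code", the greedy loop picks one instance of each and avoids off-target tags such as "chat", mimicking the alignment-aware sampling behavior described in the summary.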
Problem

Research questions and friction points this paper is trying to address.

data selection
instruction tuning
fine-grained knowledge
hierarchical dependencies
controllable sampling
Innovation

Methods, ideas, or system contributions that make the work stand out.

knowledge tree
fine-grained tagging
controllable data selection
hierarchical clustering
instruction tuning
Zihan Niu — University of Science and Technology of China
Wenping Hu — Klear Team, Kuaishou Technology
Junmin Chen — Klear Team, Kuaishou Technology
Xiyue Wang — Klear Team, Kuaishou Technology
Tong Xu — Professor, University of Science and Technology of China (Data Mining)
Ruiming Tang — Klear Team, Kuaishou Technology