TaP: A Taxonomy-Guided Framework for Automated and Scalable Preference Data Generation

📅 2025-06-30

🤖 AI Summary
Building multilingual preference datasets is resource-intensive, and existing datasets are dominated by English and offer limited linguistic and task diversity. To address these challenges, the paper proposes TaP (Taxonomy-Guided Preference Data Generation), described as the first framework to use a taxonomy-guided mechanism for automated, scalable generation of multilingual preference data. A hierarchical taxonomy gives fine-grained control over data types, difficulty levels, and language distributions. The generated data supports both supervised fine-tuning (SFT) for instruction following and preference optimization (e.g., RLHF and DPO) for alignment with human preferences. This structured guidance is reported to improve training efficiency and generalization in low-resource, few-shot settings. Empirically, models trained on TaP-generated data outperform models trained on existing open-source datasets (e.g., UltraFeedback, PKU-SafeRLHF) across multiple benchmarks, including a dataset that is 180 times larger.
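As a rough illustration of what taxonomy-guided control over topic, difficulty, and language could look like, the sketch below walks the leaves of a small hierarchy and attaches a difficulty level and a language to each generation request. Everything in it is an assumption made for illustration: the taxonomy contents, the language weights, the difficulty labels, and the helper names `sample_specs` and `to_prompt_request` are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter

# Hypothetical two-level taxonomy (category -> subtopics). Illustrative only;
# this is not the taxonomy used in the paper.
TAXONOMY = {
    "mathematics": ["algebra", "probability"],
    "coding": ["debugging", "algorithms"],
    "writing": ["summarization", "email drafting"],
}

# Assumed target language mix and difficulty levels.
LANGUAGE_WEIGHTS = {"en": 0.25, "zh": 0.25, "sw": 0.25, "id": 0.25}
DIFFICULTIES = ["easy", "medium", "hard"]


def leaves(taxonomy):
    """Flatten the taxonomy into (category, subtopic) leaf pairs."""
    return [(cat, sub) for cat, subs in taxonomy.items() for sub in subs]


def sample_specs(n, seed=0):
    """Draw n generation specs, cycling over leaves so coverage stays even."""
    rng = random.Random(seed)
    leaf_list = leaves(TAXONOMY)
    langs = list(LANGUAGE_WEIGHTS)
    weights = list(LANGUAGE_WEIGHTS.values())
    specs = []
    for i in range(n):
        category, subtopic = leaf_list[i % len(leaf_list)]
        specs.append({
            "category": category,
            "subtopic": subtopic,
            "difficulty": rng.choice(DIFFICULTIES),
            "language": rng.choices(langs, weights=weights, k=1)[0],
        })
    return specs


def to_prompt_request(spec):
    """Render a spec as a request to a prompt-writing LLM (hypothetical wording)."""
    return (
        f"Write one {spec['difficulty']} user prompt in language "
        f"'{spec['language']}' about {spec['category']} / {spec['subtopic']}."
    )


specs = sample_specs(12)
coverage = Counter((s["category"], s["subtopic"]) for s in specs)
```

Cycling over the leaves rather than sampling them at random is one simple way to guarantee that every taxonomy node is represented; a real pipeline might weight nodes differently.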

📝 Abstract
Conducting supervised fine-tuning and preference fine-tuning on large language models (LLMs) requires high-quality datasets to improve their ability to follow instructions and align with human preferences and values. However, constructing such datasets is resource-intensive, and most available datasets for supervised and preference fine-tuning are in English. To address these challenges, we propose the Taxonomy-Guided Preference Data Generation (TaP) framework, which facilitates automated and scalable construction of preference datasets across various languages. TaP is grounded in a structured taxonomy that allows fine-grained control over dataset composition, thereby ensuring both diversity and comprehensive coverage. We employ TaP-generated datasets to perform supervised and preference fine-tuning on various LLMs. Experimental results demonstrate that LLMs trained on TaP-generated datasets outperform those trained on existing open-source datasets. Remarkably, LLMs trained on TaP-generated datasets surpass the performance of those trained on an open-source dataset that is 180 times larger.
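The abstract says the same generated datasets are used for both supervised and preference fine-tuning. A minimal sketch of how one preference record could feed both stages is shown below, assuming each record holds a prompt with a preferred and a dispreferred response. The `PreferenceRecord` class, its field names, and both converter functions are hypothetical and do not describe the paper's actual data schema.

```python
from dataclasses import dataclass


@dataclass
class PreferenceRecord:
    """One generated item: a prompt plus a preferred and a dispreferred reply."""
    prompt: str
    language: str
    chosen: str
    rejected: str


def to_sft_example(rec):
    """Supervised fine-tuning only needs the prompt and the preferred reply."""
    return {"prompt": rec.prompt, "completion": rec.chosen}


def to_preference_example(rec):
    """Preference fine-tuning (e.g., DPO) needs both replies for the same prompt."""
    return {"prompt": rec.prompt, "chosen": rec.chosen, "rejected": rec.rejected}


record = PreferenceRecord(
    prompt="Explain what a prime number is.",
    language="en",
    chosen="A prime number is an integer greater than 1 whose only divisors are 1 and itself.",
    rejected="A prime number is any odd number.",
)
sft_example = to_sft_example(record)
preference_example = to_preference_example(record)
```

Keeping a single record type and deriving both views from it means the two training stages always see consistent prompts.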

Problem

Research questions and friction points this paper is trying to address.

Generating high-quality multilingual preference datasets efficiently
Reducing resource-intensive manual dataset construction for LLMs
Ensuring diversity and coverage in preference fine-tuning data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Taxonomy-guided automated preference data generation
Structured taxonomy ensures diversity and coverage
Outperforms models trained on larger datasets

Authors

Renren Jin
College of Intelligence and Computing, Tianjin University
Natural Language Processing

Tianhao Shen
College of Intelligence and Computing, Tianjin University, Tianjin, China

Xinwei Wu
College of Intelligence and Computing, Tianjin University, Tianjin, China

Dan Shi
College of Intelligence and Computing, Tianjin University, Tianjin, China

Haoran Sun
College of Intelligence and Computing, Tianjin University, Tianjin, China

Wuwei Huang
Xiaomi AI Lab, Beijing, China

Quandong Wang
Senior Speech Engineer, Xiaomi Corporation, Beijing, China
Far field speech recognition/enhancement/separation

Wei Liu
Xiaomi AI Lab, Beijing, China

Jian Luan
Toshiba, Microsoft, Xiaomi
LLM, VLM, TTS, Singing Synthesis

Bin Wang
Xiaomi AI Lab, Beijing, China

Deyi Xiong
Professor, College of Intelligence and Computing, Tianjin University, China
Natural Language Processing, Large Language Models, AI4Science