TaP: A Taxonomy-Guided Framework for Automated and Scalable Preference Data Generation

📅 2025-06-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multilingual preference dataset construction faces bottlenecks including high resource demands, English dominance, and insufficient linguistic and task diversity. To address these challenges, we propose TaP (Taxonomy-guided Preference generation), the first framework to introduce a taxonomy-guided mechanism for automated, scalable multilingual preference data generation. Leveraging a hierarchical taxonomy, TaP enables fine-grained control over data types, difficulty levels, and language distributions, supporting both instruction-following and human-preference-aligned supervised fine-tuning (SFT) as well as preference optimization (e.g., RLHF and DPO). This structured guidance significantly improves training efficiency and generalization under low-resource, few-shot settings. Empirical results demonstrate that TaP-generated data—despite being only 1/180 the size of leading open-source datasets (e.g., UltraFeedback, PKU-SafeRLHF)—achieves state-of-the-art performance across multiple benchmarks.
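The taxonomy-guided control described above can be sketched as sampling a path through a hierarchical taxonomy and attaching control attributes (language, difficulty) to form a generation spec. This is a minimal illustrative sketch, not the paper's implementation; the toy taxonomy and field names below are assumptions.

```python
import random

# Hypothetical miniature taxonomy; TaP's actual taxonomy is far larger
# and covers many task types and languages.
TAXONOMY = {
    "reasoning": {
        "math": ["arithmetic", "algebra"],
        "logic": ["deduction", "induction"],
    },
    "writing": {
        "creative": ["story", "poem"],
        "technical": ["summary", "documentation"],
    },
}

def sample_leaf(taxonomy, rng):
    """Walk the taxonomy top-down, picking one child at each level."""
    path = []
    node = taxonomy
    while isinstance(node, dict):
        key = rng.choice(sorted(node))
        path.append(key)
        node = node[key]
    # node is now a list of leaf topics
    path.append(rng.choice(sorted(node)))
    return path

def build_prompt_spec(path, language, difficulty):
    """Combine a taxonomy path with control attributes into a generation spec
    that an LLM could be prompted with to synthesize an instruction."""
    return {
        "category": " / ".join(path),
        "language": language,
        "difficulty": difficulty,
    }

rng = random.Random(0)
spec = build_prompt_spec(sample_leaf(TAXONOMY, rng), "zh", "medium")
print(spec)
```

Because each sample is drawn along an explicit taxonomy path, the distribution over categories, languages, and difficulty levels can be controlled directly rather than emerging implicitly from seed data.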

📝 Abstract
Conducting supervised fine-tuning and preference fine-tuning on large language models (LLMs) requires high-quality datasets to improve their ability to follow instructions and align with human preferences and values. However, constructing such datasets is resource-intensive, and most available datasets for supervised and preference fine-tuning are in English. To address these challenges, we propose the Taxonomy-Guided Preference Data Generation (TaP) framework, which facilitates automated and scalable construction of preference datasets across various languages. TaP is grounded in a structured taxonomy that allows fine-grained control over dataset composition, thereby ensuring both diversity and comprehensive coverage. We employ TaP-generated datasets to perform supervised and preference fine-tuning on various LLMs. Experimental results demonstrate that LLMs trained on TaP-generated datasets outperform those trained on existing open-source datasets. Remarkably, LLMs trained on TaP-generated datasets surpass the performance of those trained on an open-source dataset that is 180 times larger.
Problem

Research questions and friction points this paper is trying to address.

Generating high-quality multilingual preference datasets efficiently
Reducing resource-intensive manual dataset construction for LLMs
Ensuring diversity and coverage in preference fine-tuning data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Taxonomy-guided automated preference data generation
Structured taxonomy ensures diversity and coverage
Outperforms models trained on larger datasets
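Since the generated data feeds both SFT and preference optimization (e.g., DPO), one common way to materialize it is as (prompt, chosen, rejected) pairs derived from ranked candidate responses. The sketch below is an assumption about a typical pairing scheme, not the paper's exact pipeline; the function name and record fields are illustrative.

```python
# Minimal sketch (assumed pairing scheme, not TaP's exact format) of turning
# a best-first ranked list of responses into DPO-style preference pairs.
def to_preference_pairs(prompt, ranked_responses):
    """ranked_responses: list of candidate answers, best first.
    Every higher-ranked response is preferred over every lower-ranked one."""
    pairs = []
    for i, chosen in enumerate(ranked_responses):
        for rejected in ranked_responses[i + 1:]:
            pairs.append({
                "prompt": prompt,
                "chosen": chosen,
                "rejected": rejected,
            })
    return pairs

pairs = to_preference_pairs(
    "Explain preference optimization briefly.",
    ["detailed correct answer", "terse answer", "off-topic answer"],
)
print(len(pairs))  # n ranked responses yield n*(n-1)/2 ordered pairs
```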
Renren Jin
College of Intelligence and Computing, Tianjin University
Natural Language Processing
Tianhao Shen
College of Intelligence and Computing, Tianjin University, Tianjin, China
Xinwei Wu
College of Intelligence and Computing, Tianjin University, Tianjin, China
Dan Shi
College of Intelligence and Computing, Tianjin University, Tianjin, China
Haoran Sun
College of Intelligence and Computing, Tianjin University, Tianjin, China
Wuwei Huang
Xiaomi AI Lab, Beijing, China
Quandong Wang
Senior Speech Engineer, Xiaomi Corporation, Beijing, China
Far field speech recognition/enhancement/separation
Wei Liu
Xiaomi AI Lab, Beijing, China
Jian Luan
Toshiba, Microsoft, Xiaomi
LLM, VLM, TTS, Singing Synthesis
Bin Wang
Xiaomi AI Lab, Beijing, China
Deyi Xiong
Professor, College of Intelligence and Computing, Tianjin University, China
Natural Language Processing, Large Language Models, AI4Science