🤖 AI Summary
Current multilingual preference dataset construction faces several bottlenecks: high resource demands, English dominance, and insufficient linguistic and task diversity. To address these challenges, the authors propose TaP (Taxonomy-Guided Preference Data Generation), the first framework to introduce a taxonomy-guided mechanism for automated, scalable multilingual preference data generation. Leveraging a hierarchical taxonomy, TaP enables fine-grained control over data types, difficulty levels, and language distributions, supporting both supervised fine-tuning (SFT) for instruction following and preference optimization (e.g., RLHF and DPO) for human alignment. This structured guidance improves training efficiency and generalization in low-resource, few-shot settings. Empirical results demonstrate that models trained on TaP-generated data—despite it being only 1/180 the size of leading open-source datasets (e.g., UltraFeedback, PKU-SafeRLHF)—achieve state-of-the-art performance across multiple benchmarks.
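To make the taxonomy-guided idea concrete, here is a minimal, hypothetical sketch of how one might sample taxonomy cells (category × subtopic × difficulty × language) and turn each into a data-generation prompt. The taxonomy, category names, and prompt template below are illustrative assumptions, not the paper's actual taxonomy, which is larger and hierarchical.

```python
import random

# Hypothetical toy taxonomy; TaP's real taxonomy is far larger and hierarchical.
TAXONOMY = {
    "reasoning": ["math word problems", "logical deduction"],
    "safety": ["privacy", "refusing harmful advice"],
    "writing": ["summarization", "email drafting"],
}
DIFFICULTIES = ["easy", "medium", "hard"]
LANGUAGES = ["English", "Chinese", "German"]

def sample_generation_prompts(n, seed=0):
    """Sample n distinct taxonomy cells and render each as an
    instruction-generation prompt to send to a teacher LLM."""
    rng = random.Random(seed)
    # Enumerate every (category, subtopic, difficulty, language) cell.
    cells = [
        (cat, sub, diff, lang)
        for cat, subs in TAXONOMY.items()
        for sub in subs
        for diff in DIFFICULTIES
        for lang in LANGUAGES
    ]
    # Sampling without replacement keeps coverage broad across the taxonomy.
    return [
        f"Write a {diff} {cat} instruction about {sub}, in {lang}, "
        f"then produce a preferred and a dispreferred response."
        for cat, sub, diff, lang in rng.sample(cells, n)
    ]

for prompt in sample_generation_prompts(3):
    print(prompt)
```

Because the cell enumeration is explicit, the same skeleton lets one skew the sampling weights toward under-represented languages or harder difficulty levels, which is the kind of fine-grained compositional control the summary describes.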
📝 Abstract
Conducting supervised fine-tuning and preference fine-tuning on large language models (LLMs) requires high-quality datasets to improve their ability to follow instructions and align with human preferences and values. However, constructing such datasets is resource-intensive, and most available datasets for supervised and preference fine-tuning are in English. To address these challenges, we propose the **Ta**xonomy-Guided **P**reference Data Generation (TaP) framework, which facilitates automated and scalable construction of preference datasets across various languages. TaP is grounded in a structured taxonomy that allows fine-grained control over dataset composition, thereby ensuring both diversity and comprehensive coverage. We employ TaP-generated datasets to perform supervised and preference fine-tuning on various LLMs. Experimental results demonstrate that LLMs trained on TaP-generated datasets outperform those trained on existing open-source datasets. Remarkably, LLMs trained on TaP-generated datasets surpass the performance of those trained on an open-source dataset that is 180 times larger.