🤖 AI Summary
Current multilingual preference dataset construction faces several bottlenecks: high resource demands, English dominance, and insufficient linguistic and task diversity. To address these challenges, the authors propose TaP (Taxonomy-Guided Preference Data Generation), the first framework to introduce a taxonomy-guided mechanism for automated, scalable multilingual preference data generation. Leveraging a hierarchical taxonomy, TaP enables fine-grained control over data types, difficulty levels, and language distributions, supporting both supervised fine-tuning (SFT) for instruction following and preference optimization (e.g., RLHF and DPO) for human alignment. This structured guidance improves training efficiency and generalization in low-resource, few-shot settings. Empirical results demonstrate that models trained on TaP-generated data—despite it being only 1/180 the size of leading open-source datasets (e.g., UltraFeedback, PKU-SafeRLHF)—achieve state-of-the-art performance across multiple benchmarks.
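To make the taxonomy-guided idea concrete, here is a minimal, hypothetical sketch of how one might sample taxonomy cells (category × subtopic × difficulty × language) and turn each into a data-generation prompt. The taxonomy, category names, and prompt template below are illustrative assumptions, not the paper's actual taxonomy, which is larger and hierarchical.

```python
import random

# Hypothetical toy taxonomy; TaP's real taxonomy is far larger and hierarchical.
TAXONOMY = {
    "reasoning": ["math word problems", "logical deduction"],
    "safety": ["privacy", "refusing harmful advice"],
    "writing": ["summarization", "email drafting"],
}
DIFFICULTIES = ["easy", "medium", "hard"]
LANGUAGES = ["English", "Chinese", "German"]

def sample_generation_prompts(n, seed=0):
    """Sample n distinct taxonomy cells and render each as an
    instruction-generation prompt to send to a teacher LLM."""
    rng = random.Random(seed)
    # Enumerate every (category, subtopic, difficulty, language) cell.
    cells = [
        (cat, sub, diff, lang)
        for cat, subs in TAXONOMY.items()
        for sub in subs
        for diff in DIFFICULTIES
        for lang in LANGUAGES
    ]
    # Sampling without replacement keeps coverage broad across the taxonomy.
    return [
        f"Write a {diff} {cat} instruction about {sub}, in {lang}, "
        f"then produce a preferred and a dispreferred response."
        for cat, sub, diff, lang in rng.sample(cells, n)
    ]

for prompt in sample_generation_prompts(3):
    print(prompt)
```

Because the cell enumeration is explicit, the same skeleton lets one skew the sampling weights toward under-represented languages or harder difficulty levels, which is the kind of fine-grained compositional control the summary describes.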
📝 Abstract
Conducting supervised fine-tuning and preference fine-tuning on large language models (LLMs) requires high-quality datasets to improve their ability to follow instructions and align with human preferences and values. However, constructing such datasets is resource-intensive, and most available datasets for supervised and preference fine-tuning are in English. To address these challenges, we propose the **Ta**xonomy-Guided **P**reference Data Generation (TaP) framework, which facilitates automated and scalable construction of preference datasets across various languages. TaP is grounded in a structured taxonomy that allows fine-grained control over dataset composition, thereby ensuring both diversity and comprehensive coverage. We employ TaP-generated datasets to perform supervised and preference fine-tuning on various LLMs. Experimental results demonstrate that LLMs trained on TaP-generated datasets outperform those trained on existing open-source datasets. Remarkably, LLMs trained on TaP-generated datasets surpass the performance of those trained on an open-source dataset that is 180 times larger.