🤖 AI Summary
This study asks whether human-originated signals are still needed for instruction tuning, and answers affirmatively. The proposed approach is a lightweight synthetic paradigm: high-quality human-written instructions are paired with responses generated by open-weight LLMs to construct instruction-tuning datasets. Models fine-tuned on these datasets consistently outperform those fine-tuned on existing ones, reaching state-of-the-art performance on both English and Japanese instruction-following benchmarks. The construction recipe adapts easily to other languages, and analyses reveal a key limitation of cross-lingual transfer: tuning in a new language teaches models to follow instructions but does not impart culture-specific knowledge in that language. The datasets, synthesized with open-weight LLMs, and the fine-tuned models are released under permissive licenses, supporting reproducibility and diverse use cases.
📝 Abstract
Instruction tuning is crucial for enabling Large Language Models (LLMs) to solve real-world tasks. Prior work has shown the effectiveness of instruction-tuning data synthesized solely from LLMs, raising a fundamental question: Do we still need human-originated signals for instruction tuning? This work answers the question affirmatively: we build state-of-the-art instruction-tuning datasets sourced from human-written instructions, by simply pairing them with LLM-generated responses. LLMs fine-tuned on our datasets consistently outperform those fine-tuned on existing ones. Our data construction approach can be easily adapted to other languages; we build datasets for Japanese and confirm that LLMs tuned with our data reach state-of-the-art performance. Analyses suggest that instruction tuning in a new language allows LLMs to follow instructions, while the tuned models exhibit a notable lack of culture-specific knowledge in that language. The datasets and fine-tuned models will be publicly available; synthesized with open-weight LLMs, they are distributed under permissive licenses, allowing for diverse use cases.
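The core data-construction idea — pairing each human-written instruction with an LLM-generated response — can be sketched as below. This is a minimal illustration, not the paper's actual pipeline: `generate_response` is a hypothetical stand-in for a call to an open-weight LLM, and all names are assumptions for the sake of the example.

```python
def generate_response(instruction: str) -> str:
    """Placeholder for an open-weight LLM call (e.g. a local
    inference endpoint); here it just returns a stub string."""
    return f"[model response to: {instruction}]"


def build_dataset(human_instructions: list[str]) -> list[dict]:
    """Pair each human-written instruction with a synthesized
    response, yielding instruction-tuning examples."""
    return [
        {"instruction": inst, "response": generate_response(inst)}
        for inst in human_instructions
    ]


# Human-written instructions are the only human-originated input;
# responses are fully synthetic.
dataset = build_dataset([
    "Summarize the plot of 'Hamlet' in two sentences.",
    "Explain why the sky appears blue.",
])
```

The same recipe transfers to other languages by swapping in human-written instructions from that language, which is how the paper extends the approach to Japanese.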