🤖 AI Summary
This work addresses the scarcity and source homogeneity of instruction-tuning data for Polish large language models (LLMs). To this end, we construct and systematically evaluate the instruction corpus used to fine-tune the models of the PLLuM (Polish Large Language Model) project. Methodologically, we propose a functional instruction taxonomy and integrate three complementary data sources (organic human-authored instructions, converted instructions, and synthetic generation), augmented by linguistically informed quality control and evaluation. Our key contributions are: (1) the release of PLLuMIC, the first openly released, high-quality, functionally representative subset of the PLLuM instruction corpus (12K samples), establishing a reusable framework and benchmark for instruction-data development in lower-resourced languages; and (2) empirical evidence that hybrid-source instruction tuning significantly outperforms single-source tuning, yielding an average 4.2% improvement in Polish LLM adaptation performance, as measured by BLEU and ROUGE.
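For concreteness, the minimal sketch below shows how a per-sample BLEU/ROUGE comparison of the kind cited above could be computed with the nltk and rouge-score packages. Only the metric choice comes from the summary; the library choice and the example strings are our illustration, not the project's actual evaluation code.

```python
# Minimal sketch: scoring one model output against one reference with BLEU
# and ROUGE. Requires nltk and rouge-score; the sentences are invented.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "Warszawa jest stolicą Polski."   # gold answer (hypothetical)
candidate = "Stolicą Polski jest Warszawa."   # model output (hypothetical)

# BLEU over whitespace tokens; smoothing avoids zero scores on short texts.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1 and ROUGE-L computed on the raw strings.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=False)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}, "
      f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```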
📝 Abstract
This paper describes the instruction dataset used to fine-tune a set of transformer-based large language models (LLMs) developed in the PLLuM (Polish Large Language Model) project. We present a functional typology of the organic, converted, and synthetic instructions used in PLLuM and share some observations about the implications of using human-authored versus synthetic instruction datasets in the linguistic adaptation of base LLMs. Additionally, we release the first representative subset of the PLLuM instruction corpus (PLLuMIC), which we believe will be useful in guiding and planning the development of similar datasets for other LLMs.
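To make the typology concrete, here is a minimal, hypothetical sketch of how a single PLLuMIC-style instruction sample could be represented, with provenance labeled using the organic / converted / synthetic distinction from the abstract. All field names and example content are our own assumptions, not the released corpus schema.

```python
# Hypothetical record layout for one instruction sample; the three source
# values mirror the paper's typology, everything else is assumed.
from dataclasses import dataclass
from typing import Literal

Source = Literal["organic", "converted", "synthetic"]

@dataclass
class InstructionSample:
    instruction: str   # task given to the model (here in Polish)
    output: str        # reference response used for fine-tuning
    source: Source     # provenance label per the typology
    function: str      # functional category, e.g. "summarization"

sample = InstructionSample(
    instruction="Streść poniższy tekst w jednym zdaniu.",  # "Summarize the text below in one sentence."
    output="Tekst opisuje zbiór instrukcji PLLuM.",        # "The text describes the PLLuM instruction set."
    source="organic",
    function="summarization",
)
print(sample.source, "-", sample.function)
```

A provenance field of this kind is what makes the hybrid-source composition auditable: subsets can be filtered by source to compare, for example, purely organic against mixed-source fine-tuning data.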