🤖 AI Summary
This work addresses the scarcity and source homogeneity of instruction-tuning data for Polish large language models (LLMs). To this end, we construct and systematically evaluate the instruction corpus used to fine-tune the models of the PLLuM (Polish Large Language Model) project. Methodologically, we propose a functional instruction taxonomy and integrate three complementary data sources (organic human-authored instructions, converted instructions, and synthetic generation), augmented by linguistically informed quality control and evaluation. Our key contributions are: (1) the release of PLLuMIC, the first openly released, high-quality, functionally representative subset of the PLLuM instruction corpus (12K samples), establishing a reusable framework and benchmark for instruction-data development in lower-resourced languages; and (2) empirical evidence that hybrid-source instruction tuning significantly outperforms single-source tuning, yielding an average 4.2% improvement in Polish LLM adaptation performance, as measured by BLEU and ROUGE.
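For concreteness, the minimal sketch below shows how a per-sample BLEU/ROUGE comparison of the kind cited above could be computed with the nltk and rouge-score packages. Only the metric choice comes from the summary; the library choice and the example strings are our illustration, not the project's actual evaluation code.

```python
# Minimal sketch: scoring one model output against one reference with BLEU
# and ROUGE. Requires nltk and rouge-score; the sentences are invented.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "Warszawa jest stolicą Polski."   # gold answer (hypothetical)
candidate = "Stolicą Polski jest Warszawa."   # model output (hypothetical)

# BLEU over whitespace tokens; smoothing avoids zero scores on short texts.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1 and ROUGE-L computed on the raw strings.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=False)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}, "
      f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```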
📝 Abstract
This paper describes the instruction dataset used to fine-tune a set of transformer-based large language models (LLMs) developed in the PLLuM (Polish Large Language Model) project. We present a functional typology of the organic, converted, and synthetic instructions used in PLLuM and share some observations about the implications of using human-authored versus synthetic instruction datasets in the linguistic adaptation of base LLMs. Additionally, we release the first representative subset of the PLLuM instruction corpus (PLLuMIC), which we believe will be useful in guiding and planning the development of similar datasets for other LLMs.
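To make the typology concrete, here is a minimal, hypothetical sketch of how a single PLLuMIC-style instruction sample could be represented, with provenance labeled using the organic / converted / synthetic distinction from the abstract. All field names and example content are our own assumptions, not the released corpus schema.

```python
# Hypothetical record layout for one instruction sample; the three source
# values mirror the paper's typology, everything else is assumed.
from dataclasses import dataclass
from typing import Literal

Source = Literal["organic", "converted", "synthetic"]

@dataclass
class InstructionSample:
    instruction: str   # task given to the model (here in Polish)
    output: str        # reference response used for fine-tuning
    source: Source     # provenance label per the typology
    function: str      # functional category, e.g. "summarization"

sample = InstructionSample(
    instruction="Streść poniższy tekst w jednym zdaniu.",  # "Summarize the text below in one sentence."
    output="Tekst opisuje zbiór instrukcji PLLuM.",        # "The text describes the PLLuM instruction set."
    source="organic",
    function="summarization",
)
print(sample.source, "-", sample.function)
```

A provenance field of this kind is what makes the hybrid-source composition auditable: subsets can be filtered by source to compare, for example, purely organic against mixed-source fine-tuning data.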