The PLLuM Instruction Corpus

📅 2025-11-21
🤖 AI Summary
This work addresses the scarcity and source homogeneity of instruction-tuning data for Polish large language models (LLMs). To this end, we construct and systematically evaluate the instruction corpus of the PLLuM (Polish Large Language Model) project. Methodologically, we propose a functional instruction taxonomy and integrate three complementary data sources (human-authored instructions, cross-lingual translation, and synthetic generation), augmented by linguistically informed quality control and evaluation. Our key contributions are: (1) the release of PLLuMIC, the first open-source, high-quality, functionally comprehensive Polish instruction subset (12K samples), establishing a reusable framework and benchmark for low-resource language instruction-data development; and (2) empirical evidence that hybrid-source instruction tuning outperforms single-source approaches, yielding an average 4.2% improvement in Polish LLM adaptation performance (measured via BLEU and ROUGE).

📝 Abstract
This paper describes the instruction dataset used to fine-tune a set of transformer-based large language models (LLMs) developed in the PLLuM (Polish Large Language Model) project. We present a functional typology of the organic, converted, and synthetic instructions used in PLLuM and share some observations about the implications of using human-authored versus synthetic instruction datasets in the linguistic adaptation of base LLMs. Additionally, we release the first representative subset of the PLLuM instruction corpus (PLLuMIC), which we believe to be useful in guiding and planning the development of similar datasets for other LLMs.
Problem

Research questions and friction points this paper is trying to address.

Developing a Polish language model using diverse instruction datasets
Analyzing human-authored versus synthetic instruction dataset impacts
Releasing representative subset for guiding similar dataset development
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines organic, converted, and synthetic instruction datasets
Develops functional typology for linguistic adaptation
Releases representative subset for guiding similar datasets

Authors

Piotr Pęzik
University of Lodz
Filip Żarnecki
University of Lodz
Konrad Kaczyński
University of Lodz
Anna Cichosz
University of Lodz
Zuzanna Deckert
University of Lodz
Monika Garnys
University of Lodz
Izabela Grabarczyk
University of Lodz
Wojciech Janowski
University of Lodz
Sylwia Karasińska
University of Lodz
Aleksandra Kujawiak
University of Lodz
Piotr Misztela
University of Lodz
Maria Szymańska
University of Lodz
Karolina Walkusz
University of Lodz
Igor Siek
University of Lodz
Maciej Chrabąszcz
Warsaw University of Technology, NASK - National Research Institute
AI Safety, Deep Learning
Anna Kołos
NASK National Research Institute
Agnieszka Karlińska
NASK National Research Institute
Karolina Seweryn
NASK - National Research Institute, Warsaw University of Technology
Aleksandra Krasnodębska
NASK National Research Institute
Paula Betscher
NASK National Research Institute
Zofia Cieślińska
NASK National Research Institute
Katarzyna Kowol
NASK National Research Institute
Artur Wilczek
NASK National Research Institute
Maciej Trzciński
NASK National Research Institute
Katarzyna Dziewulska
NASK National Research Institute