PLLuM: A Family of Polish Large Language Models

📅 2025-11-05
📈 Citations: 2
Influential: 0
🤖 AI Summary
To address the scarcity and inadequate cultural adaptation of large language models (LLMs) for non-English languages, the PLLuM project introduces the largest open-source family of foundation models built specifically for Polish. Methodologically: (1) it curates a new 140-billion-token Polish pre-training corpus, a 77k custom instruction dataset, and a 100k preference optimization dataset; (2) it combines pre-training, instruction tuning, and preference optimization to produce both base and instruction-tuned variants; and (3) it incorporates a hybrid module for output correction and safety filtering, grounded in a Responsible AI framework with strict data governance. The primary contributions are: (i) a publicly released family of PLLuM models; (ii) a demonstration of their utility in a downstream task within public administration; and (iii) a step toward open research and sovereign, culturally relevant AI technologies in Poland.

📝 Abstract
Large Language Models (LLMs) play a central role in modern artificial intelligence, yet their development has been primarily focused on English, resulting in limited support for other languages. We present PLLuM (Polish Large Language Model), the largest open-source family of foundation models tailored specifically for the Polish language. Developed by a consortium of major Polish research institutions, PLLuM addresses the need for high-quality, transparent, and culturally relevant language models beyond the English-centric commercial landscape. We describe the development process, including the construction of a new 140-billion-token Polish text corpus for pre-training, a custom dataset of 77k instructions, and a 100k preference optimization dataset. A key component is a Responsible AI framework that incorporates strict data governance and a hybrid module for output correction and safety filtering. We detail the models' architecture, training procedures, and alignment techniques for both base and instruction-tuned variants, and demonstrate their utility in a downstream task within public administration. By releasing these models publicly, PLLuM aims to foster open research and strengthen sovereign AI technologies in Poland.
Problem

Research questions and friction points this paper is trying to address.

Addressing limited Polish language support in large language models
Developing culturally relevant AI beyond English-centric commercial systems
Creating transparent foundation models with responsible AI frameworks
Innovation

Methods, ideas, or system contributions that make the work stand out.

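Developed the largest open-source family of Polish foundation models
Built a new 140B-token pre-training corpus, a 77k instruction dataset, and a 100k preference optimization dataset
Implemented a Responsible AI framework with strict data governance and a hybrid module for output correction and safety filtering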
Authors

Jan Kocoń
Department of Artificial Intelligence, Wrocław University of Science and Technology
Artificial Intelligence, Natural Language Processing, Large Language Models, Transformers, Personalized NLP
Maciej Piasecki
Department of Artificial Intelligence, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, Wrocław 50-370, Poland
Arkadiusz Janz
Wrocław University of Science and Technology
machine learning, natural language processing, computational linguistics
Teddy Ferdinan
Department of Artificial Intelligence, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, Wrocław 50-370, Poland
Łukasz Radliński
Department of Artificial Intelligence, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, Wrocław 50-370, Poland
Bartłomiej Koptyra
Department of Artificial Intelligence, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, Wrocław 50-370, Poland
Marcin Oleksy
PhD, Politechnika Wrocławska
Corpus Linguistics, Artificial Intelligence, Natural Language Processing, Information Extraction
Stanisław Woźniak
Department of Artificial Intelligence, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, Wrocław 50-370, Poland
Paweł Walkowiak
Department of Artificial Intelligence, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, Wrocław 50-370, Poland
Konrad Wojtasik
Wrocław University of Science and Technology
Natural Language Processing
Julia Moska
Department of Artificial Intelligence, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, Wrocław 50-370, Poland
Tomasz Naskręta
Department of Artificial Intelligence, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, Wrocław 50-370, Poland
Bartosz Walkowiak
Department of Artificial Intelligence, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, Wrocław 50-370, Poland
Mateusz Gniewkowski
Department of Artificial Intelligence, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, Wrocław 50-370, Poland
Kamil Szyć
Department of Artificial Intelligence, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, Wrocław 50-370, Poland
Dawid Motyka
Department of Artificial Intelligence, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, Wrocław 50-370, Poland
Dawid Banach
Department of Artificial Intelligence, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, Wrocław 50-370, Poland
Jonatan Dalasiński
Department of Artificial Intelligence, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, Wrocław 50-370, Poland
Ewa Rudnicka
Department of Artificial Intelligence, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, Wrocław 50-370, Poland
Bartłomiej Alberski
Department of Artificial Intelligence, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, Wrocław 50-370, Poland
Tomasz Walkowiak
Politechnika Wrocławska
NLP
Aleksander Szczęsny
Department of Artificial Intelligence, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, Wrocław 50-370, Poland
Maciej Markiewicz
Department of Artificial Intelligence, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, Wrocław 50-370, Poland
Tomasz Bernaś
Department of Artificial Intelligence, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, Wrocław 50-370, Poland; Institute of Slavic Studies, Polish Academy of Sciences, ul. Jaracza 1, Warszawa 00-378, Poland
Hubert Mazur
Department of Artificial Intelligence, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, Wrocław 50-370, Poland
Kamil Żyta
Department of Artificial Intelligence, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, Wrocław 50-370, Poland
Mateusz Tykierko
Department of Artificial Intelligence, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, Wrocław 50-370, Poland
Grzegorz Chodak
Department of Artificial Intelligence, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, Wrocław 50-370, Poland
Tomasz Kajdanowicz
Wrocław University of Technology
Data Science, Machine Learning, Representation Learning
Przemysław Kazienko
Politechnika Wrocławska
NLP, affective computing, wearables, machine learning, social networks
Agnieszka Karlińska
NASK National Research Institute, ul. Kolska 12, Warszawa 01-045, Poland
Karolina Seweryn
NASK - National Research Institute, Warsaw University of Technology
Anna Kołos
NASK National Research Institute, ul. Kolska 12, Warszawa 01-045, Poland
Maciej Chrabąszcz
Warsaw University of Technology, NASK - National Research Institute
AI Safety, Deep Learning
Katarzyna Lorenc
NASK - National Research Institute
Aleksandra Krasnodębska
NASK National Research Institute, ul. Kolska 12, Warszawa 01-045, Poland
Artur Wilczek
NASK National Research Institute, ul. Kolska 12, Warszawa 01-045, Poland
Katarzyna Dziewulska
NASK National Research Institute, ul. Kolska 12, Warszawa 01-045, Poland
Paula Betscher
NASK National Research Institute, ul. Kolska 12, Warszawa 01-045, Poland
Zofia Cieślińska
NASK National Research Institute, ul. Kolska 12, Warszawa 01-045, Poland
Katarzyna Kowol
NASK National Research Institute, ul. Kolska 12, Warszawa 01-045, Poland
Daria Mikoś
NASK National Research Institute, ul. Kolska 12, Warszawa 01-045, Poland
Maciej Trzciński
NASK National Research Institute, ul. Kolska 12, Warszawa 01-045, Poland
Dawid Krutul
NASK National Research Institute, ul. Kolska 12, Warszawa 01-045, Poland
Marek Kozłowski
National Information Processing Institute, al. Niepodległości 188B, Warszawa 00-608, Poland
Sławomir Dadas
National Information Processing Institute, Warsaw, Poland
machine learning, natural language processing
Rafał Poświata
National Information Processing Institute
natural language processing, machine learning, deep learning, sentiment analysis
Michał Perełkiewicz
National Information Processing Institute, Warsaw, Poland
machine learning, deep learning, neural networks
Małgorzata Grębowiec
National Information Processing Institute, Warsaw, Poland
Maciej Kazuła
National Information Processing Institute, al. Niepodległości 188B, Warszawa 00-608, Poland
Marcin Białas
National Information Processing Institute, al. Niepodległości 188B, Warszawa 00-608, Poland
Roman Roszko
Institute of Slavic Studies, Polish Academy of Sciences, ul. Jaracza 1, Warszawa 00-378, Poland
Danuta Roszko
Institute of Slavic Studies, Polish Academy of Sciences, ul. Jaracza 1, Warszawa 00-378, Poland
Jurgita Vaičenonienė
Institute of Slavic Studies, Polish Academy of Sciences, ul. Jaracza 1, Warszawa 00-378, Poland
Andrius Utka
Institute of Slavic Studies, Polish Academy of Sciences, ul. Jaracza 1, Warszawa 00-378, Poland
Paweł Levchuk
Institute of Slavic Studies, Polish Academy of Sciences, ul. Jaracza 1, Warszawa 00-378, Poland
Paweł Kowalski
Institute of Slavic Studies, Polish Academy of Sciences, ul. Jaracza 1, Warszawa 00-378, Poland
Irena Prawdzic-Jankowska
Institute of Slavic Studies, Polish Academy of Sciences, ul. Jaracza 1, Warszawa 00-378, Poland
Maciej Ogrodniczuk
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Monika Borys
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Anna Bulińska
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Wiktoria Gumienna
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Witold Kieraś
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Dorota Komosińska
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Katarzyna Krasnowska-Kieraś
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Łukasz Kobyliński
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Martyna Lewandowska
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Marek Łaziński
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Mikołaj Łątkowski
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Dawid Mastalerz
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Beata Milewicz
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Agnieszka Anna Mykowiecka
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Angelika Peljak-Łapińska
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Sandra Penno
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Zuzanna Przybysz
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Michał Rudolf
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Piotr Rybak
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Karolina Saputa
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Aleksandra Tomaszewska
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Aleksander Wawer
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Marcin Woliński
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Joanna Wołoszyn
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Alina Wróblewska
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Bartosz Żuk
Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, Warsaw, 01-248, Poland
Filip Żarnecki
University of Łódź, ul. Gabriela Narutowicza 68, Łódź 90-136, Poland
Konrad Kaczyński
University of Łódź, ul. Gabriela Narutowicza 68, Łódź 90-136, Poland
Anna Cichosz
University of Łódź, ul. Gabriela Narutowicza 68, Łódź 90-136, Poland
Zuzanna Deckert
University of Łódź, ul. Gabriela Narutowicza 68, Łódź 90-136, Poland
Monika Garnys
University of Łódź, ul. Gabriela Narutowicza 68, Łódź 90-136, Poland
Izabela Grabarczyk
University of Łódź, ul. Gabriela Narutowicza 68, Łódź 90-136, Poland
Wojciech Janowski
University of Łódź, ul. Gabriela Narutowicza 68, Łódź 90-136, Poland
Sylwia Karasińska
University of Łódź, ul. Gabriela Narutowicza 68, Łódź 90-136, Poland
Aleksandra Kujawiak
University of Łódź, ul. Gabriela Narutowicza 68, Łódź 90-136, Poland
Piotr Misztela
University of Łódź, ul. Gabriela Narutowicza 68, Łódź 90-136, Poland
Maria Szymańska
University of Łódź, ul. Gabriela Narutowicza 68, Łódź 90-136, Poland
Karolina Walkusz
University of Łódź, ul. Gabriela Narutowicza 68, Łódź 90-136, Poland
Igor Siek
University of Łódź, ul. Gabriela Narutowicza 68, Łódź 90-136, Poland
Jakub Kwiatkowski
University of Łódź, ul. Gabriela Narutowicza 68, Łódź 90-136, Poland
Piotr Pęzik
University of Łódź, ul. Gabriela Narutowicza 68, Łódź 90-136, Poland