PLLuM: A Family of Polish Large Language Models

📅 2025-11-05

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

To address the scarcity and weak cultural adaptation of large language models (LLMs) for non-English languages, the PLLuM project introduces the largest open-source family of foundation models built specifically for Polish. Methodologically: (1) it builds a new 140-billion-token Polish pretraining corpus, a 77k custom instruction dataset, and a 100k preference optimization dataset; (2) it trains base and instruction-tuned variants through pretraining, instruction tuning, and preference alignment; and (3) it adds a hybrid module for output correction and safety filtering, grounded in a Responsible AI framework with strict data governance. The primary contributions are: (i) a publicly released family of PLLuM models; (ii) a demonstration of their utility in a downstream task within public administration; and (iii) a step toward closing the gap in Polish LLMs and strengthening open research and sovereign AI technologies in Poland.

📝 Abstract

Large Language Models (LLMs) play a central role in modern artificial intelligence, yet their development has been primarily focused on English, resulting in limited support for other languages. We present PLLuM (Polish Large Language Model), the largest open-source family of foundation models tailored specifically for the Polish language. Developed by a consortium of major Polish research institutions, PLLuM addresses the need for high-quality, transparent, and culturally relevant language models beyond the English-centric commercial landscape. We describe the development process, including the construction of a new 140-billion-token Polish text corpus for pre-training, a 77k custom instruction dataset, and a 100k preference optimization dataset. A key component is a Responsible AI framework that incorporates strict data governance and a hybrid module for output correction and safety filtering. We detail the models' architecture, training procedures, and alignment techniques for both base and instruction-tuned variants, and demonstrate their utility in a downstream task within public administration. By releasing these models publicly, PLLuM aims to foster open research and strengthen sovereign AI technologies in Poland.
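The abstract mentions a preference optimization dataset but does not name the training objective. As a minimal sketch, assuming a DPO-style pairwise loss (a common choice, not confirmed by this summary), the per-example computation could look like the following. The function name `dpo_loss`, its parameters, and the default `beta` are hypothetical and chosen only for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO-style loss for one (chosen, rejected) response pair.

    Assumption: inputs are summed token log-probabilities of each response
    under the trained policy and under a frozen reference model. This is an
    illustrative objective, not the method confirmed for PLLuM.
    """
    # How much more the policy prefers "chosen" over "rejected",
    # measured relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    # Policy favors the chosen answer more than the reference does: lower loss.
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
    # Policy favors the rejected answer: higher loss.
    print(dpo_loss(-5.0, -1.0, -2.0, -2.0))
```

The loss shrinks as the policy assigns relatively more probability to the preferred response than the reference model does, which is the behavior a preference dataset of chosen and rejected pairs is meant to induce.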

Problem

Research questions and friction points this paper is trying to address.

Addressing limited Polish language support in large language models

Developing culturally relevant AI beyond English-centric commercial systems

Creating transparent foundation models with responsible AI frameworks

Innovation

Methods, ideas, or system contributions that make the work stand out.

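Developed the largest open-source family of foundation models for Polish

Built a new 140B-token Polish corpus, a 77k instruction dataset, and a 100k preference optimization dataset

Implemented a Responsible AI framework with strict data governance, output correction, and safety filtering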
