PL-Guard: Benchmarking Language Model Safety for Polish

📅 2025-06-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM safety evaluation tools exhibit a strong bias toward high-resource languages, leaving lower-resource languages such as Polish without systematic safety assessment. Method: We introduce the first human-annotated Polish-language LLM safety classification benchmark together with an adversarial perturbation dataset, featuring fine-grained safety categories and diverse attack types. Our approach combines supervised fine-tuning and adversarial testing of HerBERT, Llama-Guard-3-8B, and a Polish-adapted PLLuM model. Results: The lightweight HerBERT classifier achieves significantly higher accuracy than state-of-the-art guard models under both standard and adversarial settings, demonstrating the efficacy of domain-adapted lightweight models. This work fills a critical gap in LLM safety evaluation for low-resource languages and provides a reusable methodology and benchmark infrastructure for non-English safety assessment.
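The adversarial perturbations mentioned above can be illustrated with a minimal sketch (not the authors' code): one common character-level attack on Polish text is ASCII-folding the diacritics, which changes the surface form seen by a classifier while staying readable to humans. The mapping table below is an assumption for illustration.

```python
# Minimal sketch of a character-level adversarial perturbation for Polish:
# replace Polish diacritics with ASCII look-alikes. This is an illustrative
# example, not the perturbation pipeline used in the paper.
DIACRITIC_MAP = str.maketrans({
    "ą": "a", "ć": "c", "ę": "e", "ł": "l", "ń": "n",
    "ó": "o", "ś": "s", "ź": "z", "ż": "z",
    "Ą": "A", "Ć": "C", "Ę": "E", "Ł": "L", "Ń": "N",
    "Ó": "O", "Ś": "S", "Ź": "Z", "Ż": "Z",
})

def strip_diacritics(text: str) -> str:
    """Return an adversarially perturbed variant with diacritics ASCII-folded."""
    return text.translate(DIACRITIC_MAP)

print(strip_diacritics("Groźba użycia przemocy"))  # → "Grozba uzycia przemocy"
```

A model that memorized only the canonical spellings of unsafe phrases will miss the folded variants, which is exactly the robustness failure the perturbed dataset is designed to expose.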

📝 Abstract
Despite increasing efforts to ensure the safety of large language models (LLMs), most existing safety assessments and moderation tools remain heavily biased toward English and other high-resource languages, leaving the majority of global languages underexamined. To address this gap, we introduce a manually annotated benchmark dataset for language model safety classification in Polish. We also create adversarially perturbed variants of these samples designed to challenge model robustness. We conduct a series of experiments to evaluate LLM-based and classifier-based models of varying sizes and architectures. Specifically, we fine-tune three models: Llama-Guard-3-8B, a HerBERT-based classifier (a Polish BERT derivative), and PLLuM, a Polish-adapted Llama-8B model. We train these models using different combinations of annotated data and evaluate their performance, comparing it against publicly available guard models. Results demonstrate that the HerBERT-based classifier achieves the highest overall performance, particularly under adversarial conditions.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLM safety bias in non-English languages
Creating Polish safety benchmark with adversarial samples
Evaluating model robustness across architectures and sizes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Manually annotated benchmark dataset for Polish safety
Adversarially perturbed variants to test robustness
Fine-tuned HerBERT-based classifier achieves best performance
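The robustness comparison described above (standard vs. adversarial performance) can be sketched as follows. This is a hypothetical illustration, not the paper's evaluation code: `predict` stands in for any safety classifier (e.g. a fine-tuned HerBERT model), and the gap between clean and perturbed accuracy is the robustness measure.

```python
# Hypothetical sketch of a clean-vs-adversarial robustness comparison.
# `predict` is any callable mapping text -> label; here it would be a
# trained safety classifier, not defined in this snippet.

def accuracy(preds, labels):
    """Fraction of predictions matching the gold labels."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def robustness_gap(predict, clean, perturbed, labels):
    """Score a classifier on clean samples and their adversarial variants.

    Returns (clean_accuracy, adversarial_accuracy, accuracy_drop).
    """
    clean_acc = accuracy([predict(x) for x in clean], labels)
    adv_acc = accuracy([predict(x) for x in perturbed], labels)
    return clean_acc, adv_acc, clean_acc - adv_acc
```

A large accuracy drop indicates the model relies on surface forms that the perturbations disturb, which is the behavior the benchmark's adversarial split is built to detect.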
Aleksandra Krasnodębska
NASK – National Research Institute, Warsaw, Poland
Karolina Seweryn
NASK – National Research Institute, Warsaw University of Technology
Szymon Łukasik
NASK – National Research Institute, Warsaw, Poland
Wojciech Kusa
NASK – National Research Institute
Natural Language Processing · Information Retrieval · Machine Learning · LLMs