🤖 AI Summary
Existing LLM safety evaluation tools exhibit a strong bias toward high-resource languages, leaving low-resource languages such as Polish without systematic safety assessment. Method: We introduce the first human-annotated Polish-language LLM safety classification benchmark and adversarial perturbation dataset, featuring fine-grained safety categories and diverse attack types. Our approach combines supervised fine-tuning and adversarial testing of HerBERT, Llama-Guard-3-8B, and a Polish-adapted PLLuM model. Results: The lightweight HerBERT classifier achieves significantly higher accuracy than state-of-the-art guard models under both standard and adversarial settings, demonstrating the efficacy of domain-adapted lightweight models. This work fills a critical gap in LLM safety evaluation for low-resource languages and provides a reusable methodology and benchmark infrastructure for non-English safety assessment.
📝 Abstract
Despite increasing efforts to ensure the safety of large language models (LLMs), most existing safety assessments and moderation tools remain heavily biased toward English and other high-resource languages, leaving the majority of global languages underexamined. To address this gap, we introduce a manually annotated benchmark dataset for language model safety classification in Polish. We also create adversarially perturbed variants of these samples designed to challenge model robustness. We conduct a series of experiments to evaluate LLM-based and classifier-based models of varying sizes and architectures. Specifically, we fine-tune three models: Llama-Guard-3-8B, a HerBERT-based classifier (a Polish BERT derivative), and PLLuM, a Polish-adapted Llama-8B model. We train these models using different combinations of annotated data and evaluate their performance, comparing it against publicly available guard models. Results demonstrate that the HerBERT-based classifier achieves the highest overall performance, particularly under adversarial conditions.
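To make the idea of adversarially perturbed samples concrete, here is a minimal sketch of character-level perturbations one might apply to Polish text, such as stripping diacritics or breaking tokenization with inserted spaces. The specific attack types used in the paper are not listed here, so the functions and perturbation choices below are illustrative assumptions, not the authors' method.

```python
# Illustrative sketch (assumption): two simple character-level
# perturbations for Polish text; not the paper's actual attack set.
import random

# Map Polish diacritic characters to their ASCII look-alikes.
DIACRITIC_MAP = str.maketrans("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ", "acelnoszzACELNOSZZ")

def strip_diacritics(text: str) -> str:
    """Replace Polish diacritics with plain ASCII letters,
    preserving surface meaning while changing the token sequence."""
    return text.translate(DIACRITIC_MAP)

def inject_spaces(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Randomly insert spaces after letters to disrupt tokenization."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        out.append(ch)
        if ch.isalpha() and rng.random() < rate:
            out.append(" ")
    return "".join(out)

# Example on a Polish pangram: diacritics are flattened to ASCII.
print(strip_diacritics("Zażółć gęślą jaźń"))  # → Zazolc gesla jazn
```

Perturbations like these preserve readability for humans but can shift a subword tokenizer's segmentation, which is one common way to probe the robustness of safety classifiers.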