LLMzSzŁ: a comprehensive LLM benchmark for Polish

📅 2025-01-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of authoritative evaluation benchmarks for Polish large language models (LLMs). We introduce LLMzSzŁ, a large-scale benchmark for Polish built from authentic national exams, comprising nearly 19,000 multiple-choice questions across 154 domains. Using closed-book automated scoring, we systematically evaluate open-source multilingual, English, and Polish LLMs. Key findings include: (1) multilingual models achieve higher overall accuracy than monolingual ones, yet lightweight Polish-specific models remain attractive under resource constraints; (2) model accuracy correlates strongly with human test-taker pass rates; and (3) LLMs can detect logical inconsistencies and annotation errors in exam items, underscoring their potential for quality assurance in educational assessment. LLMzSzŁ fills a critical gap in LLM evaluation for Central and Eastern European languages and supports research on cross-lingual knowledge transfer.

📝 Abstract
This article introduces the first comprehensive benchmark for the Polish language at this scale: LLMzSzŁ (LLMs Behind the School Desk). It is based on a coherent collection of Polish national exams, including both academic and professional tests extracted from the archives of the Polish Central Examination Board. It covers 4 types of exams, coming from 154 domains, and altogether consists of almost 19k closed-ended questions. We investigate the performance of open-source multilingual, English, and Polish LLMs to verify LLMs' abilities to transfer knowledge between languages. We also examine the correlation between LLM and human performance at the levels of model accuracy and exam pass rate. We show that multilingual LLMs can obtain superior results over monolingual ones; however, monolingual models may be beneficial when model size matters. Our analysis highlights the potential of LLMs in assisting with exam validation, particularly in identifying anomalies or errors in examination tasks.
Problem

Research questions and friction points this paper is trying to address.

Multilingual Models
Polish Language
Cross-lingual Knowledge Transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMzSzŁ
cross-lingual knowledge transfer
legal benchmark
Krzysztof Jassem
Adam Mickiewicz University
Natural Language Processing
Michal Ciesiolka
Adam Mickiewicz University, Center for Artificial Intelligence AMU
Filip Gralinski
Adam Mickiewicz University, Center for Artificial Intelligence AMU
Piotr Jablonski
Adam Mickiewicz University, Center for Artificial Intelligence AMU
Jakub Pokrywka
Adam Mickiewicz University, Center for Artificial Intelligence AMU
Marek Kubis
Adam Mickiewicz University in Poznań
discourse analysis, dialogue modeling, natural language processing, computational lexical semantics
Monika Jablonska
Adam Mickiewicz University, Center for Artificial Intelligence AMU
Ryszard Staruch
Adam Mickiewicz University, Center for Artificial Intelligence AMU