LLM Probe: Evaluating LLMs for Low-Resource Languages

📅 2026-03-31
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the lack of systematic evaluation of large language models (LLMs) for low-resource, morphologically rich languages, hindered by scarce annotated data and the absence of dedicated benchmarking frameworks. To bridge this gap, the authors propose LLM Probe, the first multidimensional evaluation framework tailored to such languages, encompassing lexical alignment, part-of-speech tagging, morphosyntactic probing, and translation accuracy. They also introduce a high-quality, manually annotated bilingual benchmark dataset. Integrating lexical resources and linguistic annotations, the framework supports multitask evaluation across both causal language models and sequence-to-sequence architectures. Experimental results reveal that sequence-to-sequence models significantly outperform causal models on morphosyntactic and translation tasks, whereas causal models excel in lexical alignment, highlighting distinct architectural strengths in low-resource settings.
๐Ÿ“ Abstract
Despite rapid advances in large language models (LLMs), their linguistic abilities in low-resource and morphologically rich languages are still not well understood due to limited annotated resources and the absence of standardized evaluation frameworks. This paper presents LLM Probe, a lexicon-based assessment framework designed to systematically evaluate the linguistic skills of LLMs in low-resource language environments. The framework analyzes models across four areas of language understanding: lexical alignment, part-of-speech recognition, morphosyntactic probing, and translation accuracy. To illustrate the framework, we create a manually annotated benchmark dataset using a low-resource Semitic language as a case study. The dataset comprises bilingual lexicons with linguistic annotations, including part-of-speech tags, grammatical gender, and morphosyntactic features, which demonstrate high inter-annotator agreement to ensure reliable annotations. We test a variety of models, including causal language models and sequence-to-sequence architectures. The results reveal notable differences in performance across various linguistic tasks: sequence-to-sequence models generally excel in morphosyntactic analysis and translation quality, whereas causal models demonstrate strong performance in lexical alignment but exhibit weaker translation accuracy. Our results emphasize the need for linguistically grounded evaluation to better understand LLM limitations in low-resource settings. We release LLM Probe and the accompanying benchmark dataset as open-source tools to promote reproducible benchmarking and to support the development of more inclusive multilingual language technologies.
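The abstract notes that the benchmark's annotations "demonstrate high inter-annotator agreement." A standard way to quantify agreement between two annotators is Cohen's kappa; the sketch below is purely illustrative (the toy POS-tag lists are invented, and the paper does not specify which agreement statistic it uses):

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(ann_a) == len(ann_b) and ann_a
    n = len(ann_a)
    # Observed agreement: fraction of items where the annotators match.
    observed = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(ann_a), Counter(ann_b)
    labels = set(ann_a) | set(ann_b)
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical POS-tag annotations from two annotators.
a = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]
b = ["NOUN", "VERB", "NOUN", "NOUN", "NOUN", "VERB"]
print(round(cohens_kappa(a, b), 3))  # → 0.7
```

Kappa corrects raw agreement for how often annotators would match by chance, which matters for skewed tag distributions common in morphologically rich corpora.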
Problem

Research questions and friction points this paper aims to address.

low-resource languages
large language models
linguistic evaluation
morphologically rich languages
evaluation framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

low-resource languages
morphosyntactic probing
lexicon-based evaluation
multilingual LLM benchmarking
linguistically grounded assessment
Hailay Kidu Teklehaymanot
L3S Research Center, Leibniz University Hannover, Germany
Gebrearegawi Gebremariam
Aksum University, Ethiopia
Wolfgang Nejdl
Professor of Computer Science, Leibniz Universität Hannover, L3S Research Center, Hannover, Germany
Information Retrieval · Web Science · Social Media · Data Mining · Semantic Technologies