PersLitEval: Fine-grained Benchmark and Evaluation of LLMs on Persian Literature Questions

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the lack of fine-grained evaluation benchmarks for large language models (LLMs) on non-English literary corpora, focusing on Persian literature. The authors introduce PersLitEval, the first comprehensive Persian literary assessment benchmark, comprising 4,514 multiple-choice questions in eight fine-grained categories (including spelling, morphology, rhetoric, and grammar), sourced from materials for the Iranian Konkur university entrance examination and analyzed across three tiers of task difficulty. A systematic evaluation of six prominent LLMs under ten prompting strategies shows that models perform relatively well on conceptual understanding tasks but are consistently weak on formal linguistic tasks such as spelling. Few-shot prompting with explanatory rationales yields the largest performance gains, particularly on form-based tasks. The work provides a fine-grained evaluation framework and an error analysis intended to advance the assessment of LLM capabilities in non-English literary domains.
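To make "few-shot prompting with explanatory rationales" concrete, the following is a minimal sketch of how such a prompt might be assembled for a multiple-choice item. The prompt layout, the field names (`question`, `options`, `rationale`, `answer`), and the helper functions are all hypothetical illustrations; the paper's actual templates are not given here.

```python
def format_item(question, options):
    """Render one multiple-choice item as numbered lines (assumed layout)."""
    lines = [f"Question: {question}"]
    for i, opt in enumerate(options, start=1):
        lines.append(f"{i}) {opt}")
    return "\n".join(lines)


def build_explained_few_shot_prompt(examples, question, options):
    """Prepend solved examples, each with a rationale, to the target question.

    `examples` is a list of dicts with hypothetical keys:
    'question', 'options', 'rationale', 'answer'.
    """
    parts = []
    for ex in examples:
        parts.append(format_item(ex["question"], ex["options"]))
        parts.append(f"Explanation: {ex['rationale']}")
        parts.append(f"Answer: {ex['answer']}")
        parts.append("")  # blank line between demonstrations
    parts.append(format_item(question, options))
    parts.append("Answer:")
    return "\n".join(parts)


# Placeholder content only; real items would be Persian Konkur-style questions.
examples = [
    {
        "question": "<example question>",
        "options": ["<option A>", "<option B>", "<option C>", "<option D>"],
        "rationale": "<why the correct option is right>",
        "answer": 2,
    }
]
prompt = build_explained_few_shot_prompt(
    examples, "<target question>", ["<w>", "<x>", "<y>", "<z>"]
)
```

A plain few-shot baseline would be the same construction with the `Explanation:` line omitted, which is what makes the two strategies directly comparable.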

📝 Abstract

Despite impressive multilingual capabilities, large language models (LLMs) remain poorly evaluated on literary knowledge in non-English languages. We introduce PersLitEval, a benchmark of 4,514 Persian literature multiple-choice questions across eight fine-grained categories spanning spelling, literary devices, grammar, vocabulary, word formation, and conceptual understanding, sourced from materials for the Konkur university entrance examination. We evaluate six LLMs across ten prompting strategies, revealing striking category-level disparities across three tiers of task difficulty: models reach higher accuracy on conceptual similarity tasks but struggle with formal linguistic analysis, with spelling and word formation proving the hardest across all models. Prompting strategy has a significant impact on performance, with explained few-shot examples yielding the best results, particularly on formal linguistic categories. An error analysis identifies three failure modes: semantic comprehension gaps, formal linguistic knowledge gaps, and counting/enumeration errors, suggesting that different categories require different improvement strategies.
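The category-level comparison described above amounts to computing accuracy separately for each (category, prompting strategy) pair. A minimal sketch of that bookkeeping follows; the record fields (`category`, `strategy`, `gold`, `pred`) and the sample rows are invented for illustration and are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return accuracy for each (category, strategy) pair.

    `records` is an iterable of dicts with hypothetical keys:
    'category', 'strategy', 'gold' (correct option), 'pred' (model's option).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        key = (r["category"], r["strategy"])
        total[key] += 1
        correct[key] += int(r["pred"] == r["gold"])
    return {key: correct[key] / total[key] for key in total}


# Made-up rows, only to show the shape of the data.
records = [
    {"category": "spelling", "strategy": "zero_shot", "gold": 2, "pred": 1},
    {"category": "spelling", "strategy": "zero_shot", "gold": 3, "pred": 3},
    {"category": "conceptual", "strategy": "zero_shot", "gold": 1, "pred": 1},
    {"category": "spelling", "strategy": "few_shot_explained", "gold": 2, "pred": 2},
]
scores = accuracy_by_category(records)
for (category, strategy), acc in sorted(scores.items()):
    print(f"{category:12s} {strategy:20s} {acc:.2f}")
```

Keeping the key as a (category, strategy) pair rather than pooling all questions is what exposes disparities that a single overall accuracy figure would hide.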

Problem

Research questions and friction points this paper is trying to address.

large language models

Persian literature

multilingual evaluation

literary knowledge

benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained benchmark

Persian literature

LLM evaluation