Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large language models (LLMs) lack systematic evaluation of their deep understanding of Arabic poetry, a critical gap in Arabic computational humanities. Method: The authors introduce the first multidimensional benchmark covering 12 historical periods, 21 poetic forms, and diverse meters. It is built on a manually curated corpus of Arabic poetry with expert-annotated explanations along four dimensions (semantic, metaphorical, prosodic, and cultural-contextual), together with a tailored evaluation protocol and an open-source evaluation suite. Contribution/Results: This work pioneers the systematic assessment of LLMs' culturally sensitive interpretation and structured reasoning in Arabic poetry, moving beyond the surface-level tasks prevalent in prior Arabic NLP benchmarks. Empirical results show that state-of-the-art LLMs perform significantly worse on this benchmark than on general Arabic NLP tasks. The benchmark is publicly released, establishing a foundational infrastructure for computational research on Arabic poetry.

📝 Abstract
Arabic poetry stands as one of the most sophisticated and culturally embedded forms of expression in the Arabic language, known for its layered meanings, stylistic diversity, and deep historical continuity. Although large language models (LLMs) have demonstrated strong performance across languages and tasks, their ability to understand Arabic poetry remains largely unexplored. In this work, we introduce `Fann or Flop`, the first benchmark designed to assess LLMs' comprehension of Arabic poetry across twelve historical eras, covering 21 core poetic genres and a variety of metrical forms, from classical structures to contemporary free verse. The benchmark comprises a curated corpus of poems with explanations that assess semantic understanding, metaphor interpretation, prosodic awareness, and cultural context. We argue that poetic comprehension is a strong indicator of how well an LLM understands classical Arabic, since, unlike surface-level tasks, this domain demands deeper interpretive reasoning and cultural sensitivity. Our evaluation of state-of-the-art LLMs shows that most models struggle with poetic understanding despite strong results on standard Arabic benchmarks. We release `Fann or Flop` along with its evaluation suite as an open-source resource to enable rigorous evaluation and advancement of Arabic language models. Code is available at: https://github.com/mbzuai-oryx/FannOrFlop.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' understanding of Arabic poetry across eras and genres
Evaluating semantic, metaphorical, and cultural comprehension in Arabic poetry
Addressing gaps in LLMs' interpretive reasoning for classical Arabic
Innovation

Methods, ideas, or system contributions that make the work stand out.

First benchmark for Arabic poetry understanding in LLMs
Covers 12 eras, 21 genres, and various metrical forms
Assesses semantic, metaphorical, prosodic, and cultural understanding
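The four-dimensional assessment above can be pictured as a small aggregation step over per-dimension judge scores. The sketch below is purely illustrative: the item fields, dimension names, and scoring function are assumptions for exposition, not the schema or evaluation suite actually released with the benchmark.

```python
from dataclasses import dataclass, field
from statistics import mean

# Dimensions described in the paper's annotation scheme.
DIMENSIONS = ("semantic", "metaphorical", "prosodic", "cultural")

@dataclass
class PoemItem:
    """Hypothetical benchmark item: era/genre metadata plus verse text."""
    era: str                      # one of the 12 historical periods
    genre: str                    # one of the 21 poetic forms
    verse: str
    explanations: dict = field(default_factory=dict)  # gold explanations per dimension

def aggregate_scores(per_dim_scores: dict) -> dict:
    """Average judge scores (0-1) per dimension, plus an unweighted overall mean."""
    summary = {d: mean(per_dim_scores[d]) for d in DIMENSIONS}
    summary["overall"] = mean(summary[d] for d in DIMENSIONS)
    return summary

# Toy scores for two poems, e.g. from an LLM-as-judge comparison of model
# explanations against the expert annotations.
scores = {
    "semantic":     [0.8, 0.6],
    "metaphorical": [0.4, 0.5],
    "prosodic":     [0.3, 0.2],
    "cultural":     [0.6, 0.7],
}
print(aggregate_scores(scores))
```

Reporting one score per dimension, rather than a single accuracy, is what lets such a benchmark show where models fail (e.g. prosody) even when semantic scores look reasonable.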