🤖 AI Summary
Current large vision-language models (VLMs) are evaluated almost exclusively on English-centric multimodal benchmarks, leaving a critical gap in our understanding of their performance in non-English languages and culturally diverse contexts. Method: To address this, we introduce MEENA (also known as PersianMMMU), the first multimodal educational evaluation benchmark tailored to Persian. It spans primary through upper-secondary school curricula and covers scientific and reasoning tasks including mathematics, physics, diagram and chart interpretation, Persian art and literature, and logical reasoning. The benchmark uniquely combines culturally grounded original Persian content, aligned bilingual (Persian/English) annotations, fine-grained difficulty labels, and joint image-text reasoning, comprising approximately 7,500 Persian and 3,000 English questions. Contribution/Results: MEENA enables rigorous assessment of cross-lingual understanding, visual attention, and hallucination. Empirical evaluation reveals systematic modality misalignment and generation biases in state-of-the-art VLMs in non-English settings, establishing foundational infrastructure for multilingual VLM development and evaluation.
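To make the bilingual structure and metadata concrete, a single benchmark entry might look like the sketch below. The field names are illustrative assumptions, not the dataset's released schema.

```python
# Illustrative example of one MEENA entry (hypothetical field names,
# NOT the released schema), reflecting the features described above:
# bilingual question text, an associated image, difficulty metadata,
# and a descriptive answer.
sample_entry = {
    "id": "math-0042",
    "subject": "mathematics",
    "grade_level": "upper_secondary",   # primary ... upper secondary
    "difficulty": "hard",               # fine-grained difficulty label
    "image": "images/math-0042.png",    # diagram/chart for the question
    "question_fa": "...",               # original Persian question
    "question_en": "...",               # aligned English counterpart
    "choices_fa": ["...", "...", "...", "..."],
    "choices_en": ["...", "...", "...", "..."],
    "answer": "B",
    "answer_description": "...",        # descriptive (worked) answer
}
```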
📝 Abstract
Recent advancements in large vision-language models (VLMs) have primarily focused on English, with limited attention given to other languages. To address this gap, we introduce MEENA (also known as PersianMMMU), the first dataset designed to evaluate VLMs in Persian across scientific, reasoning, and human-level understanding tasks. Our dataset comprises approximately 7,500 Persian and 3,000 English questions covering a wide range of topics, including reasoning, mathematics, physics, diagrams, charts, and Persian art and literature. Key features of MEENA include: (1) diverse subject coverage spanning educational levels from primary to upper secondary school, (2) rich metadata, including difficulty levels and descriptive answers, (3) original Persian data that preserves cultural nuances, (4) a bilingual structure for assessing cross-linguistic performance, and (5) a suite of experiments probing capabilities such as overall performance, the model's ability to attend to images, and its tendency to hallucinate. We hope this benchmark contributes to enhancing VLM capabilities beyond English.
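As one concrete reading of feature (5), a common way to probe whether a model truly attends to the image is to compare its answers with and without the image present. The sketch below illustrates that idea only; `query_vlm` and the entry fields are hypothetical placeholders, and this is not the paper's actual evaluation harness.

```python
# Minimal sketch of an image-attention probe: if a VLM gives the same
# answer whether or not the image is provided, it is likely ignoring
# the visual input. `query_vlm` is a hypothetical helper, not an API
# defined by the paper.
from typing import Optional

def query_vlm(question: str, image_path: Optional[str] = None) -> str:
    """Placeholder: send a question (and optional image) to a VLM."""
    raise NotImplementedError

def image_attention_rate(entries: list[dict]) -> float:
    """Fraction of entries whose answer changes when the image is removed."""
    changed = 0
    for entry in entries:
        with_image = query_vlm(entry["question_fa"], entry["image"])
        without_image = query_vlm(entry["question_fa"], None)
        changed += int(with_image != without_image)
    return changed / len(entries)
```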