🤖 AI Summary
Existing vision-language models (VLMs) exhibit poor cross-dialect generalization and weak comprehension of cultural elements when applied to Arabic visual understanding across major dialects (Jordanian, Emirati, Egyptian, Moroccan).
Method: We introduce JEEM, the first multi-dialect Arabic vision-language evaluation benchmark, covering image captioning and visual question answering with an emphasis on cultural diversity and regional adaptation. JEEM is built on manually annotated, culturally rich image data drawn from multiple regions, and uses a standardized protocol to evaluate five open-source Arabic VLMs alongside GPT-4V.
Contribution/Results: Experiments show that all open-source models substantially underperform GPT-4V; GPT-4V, in turn, exhibits uneven proficiency across dialects and limited visual reasoning, yet remains the strongest model in the comparison. This work provides the first systematic analysis of cultural perception bottlenecks in multi-dialect Arabic VLMs, establishing a benchmark and empirical foundation for developing culturally aware Arabic VLMs.
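To make the evaluation protocol concrete, below is a minimal sketch of what a JEEM-style per-dialect, per-task evaluation loop might look like. All names here (`run_model`, `score`, the dataset layout) are hypothetical placeholders, not the benchmark's actual API or metrics, which the paper defines separately.

```python
# Hypothetical sketch of a per-dialect, per-task VLM evaluation loop.
# Data format, model interfaces, and metrics are assumptions for illustration.
from statistics import mean

DIALECTS = ["Jordan", "Emirates", "Egypt", "Morocco"]
TASKS = ["captioning", "vqa"]

def run_model(model_name, task, example):
    """Placeholder for a real VLM call (e.g. an open-source Arabic VLM or GPT-4V)."""
    return "generated Arabic text"

def score(prediction, reference):
    """Placeholder metric; the benchmark's actual scoring protocol may differ."""
    return float(prediction == reference)

def evaluate(model_name, dataset):
    # dataset maps (dialect, task) -> list of {"image", "prompt", "reference"} examples
    results = {}
    for dialect in DIALECTS:
        for task in TASKS:
            examples = dataset.get((dialect, task), [])
            if not examples:
                continue
            scores = [score(run_model(model_name, task, ex), ex["reference"])
                      for ex in examples]
            results[(dialect, task)] = mean(scores)
    return results
```

Reporting scores per (dialect, task) cell, rather than a single aggregate, is what exposes the uneven dialect proficiency noted in the results.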
📝 Abstract
We introduce JEEM, a benchmark designed to evaluate Vision-Language Models (VLMs) on visual understanding across four Arabic-speaking countries: Jordan, the Emirates, Egypt, and Morocco. JEEM includes the tasks of image captioning and visual question answering, and features culturally rich and regionally diverse content. This dataset aims to assess the ability of VLMs to generalize across dialects and accurately interpret cultural elements in visual contexts. In an evaluation of five prominent open-source Arabic VLMs and GPT-4V, we find that the Arabic VLMs consistently underperform, struggling with both visual understanding and dialect-specific generation. While GPT-4V ranks best in this comparison, the model's linguistic competence varies across dialects, and its visual understanding capabilities lag behind. This underscores the need for more inclusive models and the value of culturally diverse evaluation paradigms.