AI Summary
To address the lack of evaluation benchmarks for multilingual information-seeking question answering (QA) in the low-resource language varieties of western Asia and northern Africa, this paper introduces TyDi QA-WANA, a natively collected, long-context multilingual QA benchmark. It covers 10 local language varieties and comprises 28K QA pairs elicited from genuinely curious askers; each question is paired with an entire article that may or may not contain the answer, making the task suitable for evaluating how well models use large text contexts. The data was collected directly in each language variety, without translation, to avoid issues of cultural relevance. Alongside the benchmark, the authors report results for two baseline models and release all data and code, providing a platform for advancing long-context multilingual understanding in under-resourced settings.
Abstract
We present TyDi QA-WANA, a question-answering dataset consisting of 28K examples divided among 10 language varieties of western Asia and northern Africa. The data collection process was designed to elicit information-seeking questions, where the asker is genuinely curious to know the answer. Each question is paired with an entire article that may or may not contain the answer; the relatively large size of the articles results in a task suitable for evaluating models' abilities to utilize large text contexts in answering questions. Furthermore, the data was collected directly in each language variety, without the use of translation, in order to avoid issues of cultural relevance. We present the performance of two baseline models, and release our code and data to facilitate further improvement by the research community.