TyDi QA-WANA: A Benchmark for Information-Seeking Question Answering in Languages of West Asia and North Africa

📅 2025-07-23
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the lack of evaluation benchmarks for multilingual information-seeking question answering (QA) in languages of West Asia and North Africa (WANA), this paper introduces TyDi QA-WANA, a natively collected, long-context multilingual QA benchmark. It covers 10 language varieties and comprises 28K QA pairs, each pairing a genuinely information-seeking question with a full article that may or may not contain the answer; the relatively large article size makes the task suitable for evaluating models' ability to use long text contexts. The data was collected directly in each language variety, without translation, to avoid issues of cultural relevance. The authors report results for two baseline models and release their data and code. TyDi QA-WANA provides a platform for advancing long-context multilingual understanding and evaluation in under-resourced settings.

๐Ÿ“ Abstract
We present TyDi QA-WANA, a question-answering dataset consisting of 28K examples divided among 10 language varieties of western Asia and northern Africa. The data collection process was designed to elicit information-seeking questions, where the asker is genuinely curious to know the answer. Each question is paired with an entire article that may or may not contain the answer; the relatively large size of the articles results in a task suitable for evaluating models' abilities to utilize large text contexts in answering questions. Furthermore, the data was collected directly in each language variety, without the use of translation, in order to avoid issues of cultural relevance. We present performance of two baseline models, and release our code and data to facilitate further improvement by the research community.
Problem

Research questions and friction points this paper is trying to address.

Benchmark for QA in West Asia and North Africa languages
Evaluates model ability to utilize large text contexts
Ensures cultural relevance by avoiding translation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dataset for 10 West Asia and North Africa languages
Information-seeking questions with full article context
Direct data collection without translation for cultural relevance
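To make the task format above concrete, here is a minimal hypothetical sketch of how one such example might be represented and checked. The field names and record layout are assumptions for illustration only, not the dataset's actual released schema; consult the authors' code and data for the real format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QAExample:
    """Hypothetical layout for one information-seeking QA example.

    Field names are illustrative; the released dataset defines the real schema.
    """
    question: str
    article: str            # full article text, which may or may not contain the answer
    answer: Optional[str]   # None when the article does not answer the question

def is_answerable(example: QAExample) -> bool:
    """An example counts as answerable when an answer span occurs in the article."""
    return example.answer is not None and example.answer in example.article

# Two toy examples: one answerable, one where the article lacks the answer.
ex_yes = QAExample(
    question="When was the city founded?",
    article="The city was founded in 969 CE and grew into a major capital.",
    answer="969 CE",
)
ex_no = QAExample(
    question="What is the city's population today?",
    article="The city was founded in 969 CE and grew into a major capital.",
    answer=None,
)
```

This mirrors the benchmark's key property that a question is paired with an entire article and the answer may be absent, so models must both locate and decline answers over long contexts.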