AI Summary
To address the lack of evaluation benchmarks for multilingual information-seeking question answering (QA) in the low-resource language varieties of western Asia and northern Africa, this paper introduces TyDi QA-WANA, a natively collected, long-context multilingual QA benchmark. It covers 10 local language varieties and comprises 28K QA pairs elicited from genuinely curious askers; each question is paired with an entire article that may or may not contain the answer, making the task suitable for evaluating how well models use large text contexts. The data was collected directly in each language variety, without translation, to avoid issues of cultural relevance. Alongside the benchmark, the authors report results for two baseline models and release all data and code, providing a platform for advancing long-context multilingual understanding in under-resourced settings.
Abstract
We present TyDi QA-WANA, a question-answering dataset consisting of 28K examples divided among 10 language varieties of western Asia and northern Africa. The data collection process was designed to elicit information-seeking questions, where the asker is genuinely curious to know the answer. Each question is paired with an entire article that may or may not contain the answer; the relatively large size of the articles results in a task suitable for evaluating models' abilities to utilize large text contexts in answering questions. Furthermore, the data was collected directly in each language variety, without the use of translation, in order to avoid issues of cultural relevance. We present the performance of two baseline models, and release our code and data to facilitate further improvement by the research community.