NativQA: Multilingual Culturally-Aligned Natural Query for LLMs

📅 2024-07-13
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
🤖 AI Summary
Existing natural question-answering (QA) benchmarks lack native-speaker-driven design and region-specific cultural alignment, which hinders fine-grained evaluation and adaptation of large language models (LLMs) along cultural and linguistic dimensions. To address this, we propose NativQA, a language-agnostic, scalable framework, and introduce MultiNativQA, the first native-user-driven, multilingual, regionally and culturally aligned natural QA benchmark. It spans seven languages (including extremely low-resource ones), nine geographic regions, and 18 thematic domains, comprising ~64k high-quality samples. Data collection integrates cross-regional native user queries, expert annotation, cultural sensitivity validation, and multilingual consistency alignment. Systematic evaluations of leading open- and closed-source LLMs on MultiNativQA reveal, for the first time, substantial performance disparities across regional cultural contexts. Both the benchmark dataset and the implementation code are fully open-sourced.

📝 Abstract
Natural Question Answering (QA) datasets play a crucial role in evaluating the capabilities of large language models (LLMs) and in ensuring their effectiveness in real-world applications. Although numerous QA datasets have been developed, along with some parallel work, there is a notable lack of both a framework and large-scale, region-specific datasets built from queries posed by native users in their own languages. This gap hinders effective benchmarking and the development of models fine-tuned to regional and cultural specificities. In this study, we propose NativQA, a scalable, language-independent framework for seamlessly constructing culturally and regionally aligned QA datasets in native languages for LLM evaluation and tuning. We demonstrate the efficacy of the proposed framework by designing MultiNativQA, a multilingual natural QA dataset consisting of ~64k manually annotated QA pairs in seven languages, ranging from high- to extremely low-resource, based on queries from native speakers in nine regions and covering 18 topics. We benchmark open- and closed-source LLMs on the MultiNativQA dataset. The MultiNativQA dataset (https://huggingface.co/datasets/QCRI/MultiNativQA) and the experimental scripts (https://gitlab.com/nativqa/multinativqa) are publicly available for the community.
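
For readers who want to inspect the data: since the abstract links the dataset on the Hugging Face Hub, a minimal loading sketch with the `datasets` library is shown below. The choice of configuration and the printed fields are assumptions for illustration; the actual language configurations, splits, and schema are documented on the dataset card.

# Minimal sketch, assuming the standard Hugging Face `datasets` API.
# The configuration and schema below are illustrative; see the dataset
# card at https://huggingface.co/datasets/QCRI/MultiNativQA for the
# actual language configs, splits, and field names.
from datasets import get_dataset_config_names, load_dataset

configs = get_dataset_config_names("QCRI/MultiNativQA")
print(configs)  # available language/region configurations

ds = load_dataset("QCRI/MultiNativQA", configs[0])
print(ds)  # shows the splits and their sizes for that configuration
print(next(iter(ds.values()))[0])  # peek at one example record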
Problem

Research questions and friction points this paper is trying to address.

Lack of region-specific QA datasets in native languages
Need for culturally aligned LLM evaluation frameworks
Absence of scalable multilingual QA data collection methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable language-independent framework for QA datasets
Multilingual, culturally aligned manual annotation approach
Publicly available dataset for LLM benchmarking
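
As a concrete illustration of the benchmarking point above: the page does not spell out the scoring protocol, but a common choice for evaluating free-text answers in natural QA is token-level F1 between a model's answer and the reference (SQuAD-style). The sketch below shows that metric; the metric choice is an assumption for illustration, not the paper's confirmed protocol.

# Minimal sketch of token-level F1 for QA evaluation (SQuAD-style).
# Assumption: MultiNativQA's official metrics may differ; this is only
# a common baseline for scoring free-text answers against references.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: score a hypothetical model answer against a reference.
print(token_f1("Doha is the capital of Qatar", "The capital of Qatar is Doha"))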
👥 Authors

Md Arid Hasan
PhD Student, University of Toronto
LLMs, Multimodality, Bias in LLMs, Responsible AI

Maram Hasanain
Postdoc, Qatar Computing Research Institute, HBKU
Information Retrieval, Social Media

Fatema Ahmad
Qatar Computing Research Institute, Qatar

Sahinur Rahman Laskar
UPES, India

Sunaya Upadhyay
Carnegie Mellon University in Qatar, Qatar

Vrunda N. Sukhadia
Qatar Computing Research Institute, Qatar

Mucahid Kutlu
Assistant Professor, Qatar University
Information Retrieval, Natural Language Processing

Shammur A. Chowdhury
Qatar Computing Research Institute, Qatar

Firoj Alam
Qatar Computing Research Institute, Qatar