NativQA: Multilingual Culturally-Aligned Natural Query for LLMs

📅 2024-07-13
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
🤖 AI Summary
Existing natural question-answering (QA) benchmarks lack native-speaker-driven design and region-specific cultural alignment, which hinders fine-grained evaluation and adaptation of large language models (LLMs) along cultural and linguistic dimensions. To address this, we propose NativQA, a language-agnostic, scalable framework, and introduce MultiNativQA, the first native-user-driven, multilingual, regionally and culturally aligned natural QA benchmark. It spans seven languages (including extremely low-resource ones), nine geographic regions, and 18 thematic domains, comprising ~64k high-quality samples. Data collection integrates cross-regional native user queries, expert annotation, cultural sensitivity validation, and multilingual consistency alignment. Systematic evaluations of leading open- and closed-source LLMs on MultiNativQA reveal, for the first time, substantial performance disparities across regional cultural contexts. Both the benchmark dataset and the implementation code are fully open-sourced.

📝 Abstract
Natural Question Answering (QA) datasets play a crucial role in evaluating the capabilities of large language models (LLMs) and in ensuring their effectiveness in real-world applications. Although numerous QA datasets have been developed, along with some parallel work, there is a notable lack of both a framework and large-scale, region-specific datasets built from queries posed by native users in their own languages. This gap hinders effective benchmarking and the development of models fine-tuned to regional and cultural specificities. In this study, we propose NativQA, a scalable, language-independent framework for seamlessly constructing culturally and regionally aligned QA datasets in native languages for LLM evaluation and tuning. We demonstrate the efficacy of the proposed framework by designing MultiNativQA, a multilingual natural QA dataset consisting of ~64k manually annotated QA pairs in seven languages, ranging from high- to extremely low-resource, based on queries from native speakers in nine regions and covering 18 topics. We benchmark open- and closed-source LLMs on the MultiNativQA dataset. The MultiNativQA dataset (https://huggingface.co/datasets/QCRI/MultiNativQA) and the experimental scripts (https://gitlab.com/nativqa/multinativqa) are publicly available for the community.
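
For readers who want to inspect the data: since the abstract links the dataset on the Hugging Face Hub, a minimal loading sketch with the `datasets` library is shown below. The choice of configuration and the printed fields are assumptions for illustration; the actual language configurations, splits, and schema are documented on the dataset card.

# Minimal sketch, assuming the standard Hugging Face `datasets` API.
# The configuration and schema below are illustrative; see the dataset
# card at https://huggingface.co/datasets/QCRI/MultiNativQA for the
# actual language configs, splits, and field names.
from datasets import get_dataset_config_names, load_dataset

configs = get_dataset_config_names("QCRI/MultiNativQA")
print(configs)  # available language/region configurations

ds = load_dataset("QCRI/MultiNativQA", configs[0])
print(ds)  # shows the splits and their sizes for that configuration
print(next(iter(ds.values()))[0])  # peek at one example record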
Problem

Research questions and friction points this paper is trying to address.

Lack of region-specific QA datasets in native languages
Need for culturally aligned LLM evaluation frameworks
Absence of scalable multilingual QA data collection methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable language-independent framework for QA datasets
Multilingual, culturally aligned manual annotation approach
Publicly available dataset for LLM benchmarking
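
As a concrete illustration of the benchmarking point above: the page does not spell out the scoring protocol, but a common choice for evaluating free-text answers in natural QA is token-level F1 between a model's answer and the reference (SQuAD-style). The sketch below shows that metric; the metric choice is an assumption for illustration, not the paper's confirmed protocol.

# Minimal sketch of token-level F1 for QA evaluation (SQuAD-style).
# Assumption: MultiNativQA's official metrics may differ; this is only
# a common baseline for scoring free-text answers against references.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: score a hypothetical model answer against a reference.
print(token_f1("Doha is the capital of Qatar", "The capital of Qatar is Doha"))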
👥 Authors

Md Arid Hasan
PhD Student, University of Toronto
LLMs, Multimodality, Bias in LLMs, Responsible AI

Maram Hasanain
Postdoc, Qatar Computing Research Institute, HBKU
Information Retrieval, Social Media

Fatema Ahmad
Qatar Computing Research Institute, Qatar

Sahinur Rahman Laskar
UPES, India

Sunaya Upadhyay
Carnegie Mellon University in Qatar, Qatar

Vrunda N. Sukhadia
Qatar Computing Research Institute, Qatar

Mucahid Kutlu
Assistant Professor, Qatar University
Information Retrieval, Natural Language Processing

Shammur A. Chowdhury
Qatar Computing Research Institute, Qatar

Firoj Alam
Qatar Computing Research Institute, Qatar