CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset

📅 2025-07-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses two challenges: (1) the inconsistency of large language models (LLMs) in regional knowledge and cross-lingual generation, and (2) the weak correlation between conventional automatic evaluation metrics (e.g., BLEU, ROUGE) and human judgments. To this end, we introduce CUS-QA, a multilingual, multimodal, open-ended question answering benchmark covering Czechia, Slovakia, and Ukraine, with question-answer pairs written by native speakers and accompanied by English translations. The methodology combines native-speaker data curation, prompted LLM baselines, and human evaluation of answer correctness. Experiments reveal substantial gaps in mainstream LLMs' regional knowledge and show that, apart from LLM-based evaluation, standard automatic metrics correlate only minimally (ρ < 0.1) with human assessments. Key contributions: (1) a regional multimodal open-ended QA dataset; (2) systematic evidence of automatic-metric failure in cross-lingual open-ended evaluation; and (3) an empirical foundation for region-aware knowledge modeling and more trustworthy evaluation.
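To ground the metric-reliability claim, here is a minimal sketch of the rank-correlation analysis it implies, assuming scipy is available; the per-answer scores below are invented for illustration and are not the paper's data.

```python
from scipy.stats import spearmanr

# Hypothetical per-answer scores: an automatic metric (e.g., ROUGE-L F1)
# and human correctness ratings of the same system outputs.
metric_scores = [0.42, 0.31, 0.55, 0.12, 0.48, 0.27, 0.60, 0.19]
human_ratings = [4, 2, 3, 5, 4, 1, 2, 5]  # e.g., 1-5 correctness judgments

# Spearman's rho measures rank agreement between the two score lists;
# the paper reports near-zero values for standard surface metrics.
rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```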

📝 Abstract
We introduce a benchmark for open-ended regional question answering that encompasses both textual and visual modalities. We also provide strong baselines using state-of-the-art large language models (LLMs). Our dataset consists of manually curated questions and answers grounded in Wikipedia, created by native speakers from Czechia, Slovakia, and Ukraine, with accompanying English translations. It includes both purely textual questions and those requiring visual understanding. As a baseline, we evaluate state-of-the-art LLMs through prompting and complement this with human judgments of answer correctness. Using these human evaluations, we analyze the reliability of existing automatic evaluation metrics. Our baseline results highlight a significant gap in regional knowledge among current LLMs. Moreover, apart from LLM-based evaluation, there is minimal correlation between automated metrics and human judgment. We release this dataset as a resource to (1) assess regional knowledge in LLMs, (2) study cross-lingual generation consistency in a challenging setting, and (3) advance the development of evaluation metrics for open-ended question answering.
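As a rough illustration of the prompting baseline described in the abstract, the sketch below queries a chat LLM with one regional question; the model name, prompt wording, and example question are assumptions rather than the paper's actual setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical open-ended regional question (Czech):
# "Who founded Charles University?"
question_cs = "Kdo založil Univerzitu Karlovu?"

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative; the paper's model list may differ
    messages=[
        {"role": "system", "content": "Answer the question concisely."},
        {"role": "user", "content": question_cs},
    ],
)
print(response.choices[0].message.content)
```

Answers collected this way would then be judged for correctness by humans and scored with automatic metrics, enabling a correlation analysis like the one sketched above.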
Problem

Research questions and friction points this paper is trying to address.

Assessing regional knowledge gaps in large language models
Evaluating cross-lingual consistency in question answering (a rough consistency check is sketched after this list)
Improving automated metrics for open-ended QA evaluation
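For the cross-lingual consistency point above, one crude check is to compare the answer a model gives when a question is asked in its original language (translated to English afterwards) against its answer to the English version of the same question. The `token_f1` helper and the example strings below are hypothetical.

```python
import re

def token_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two answers (a crude consistency proxy)."""
    ta = re.findall(r"\w+", a.lower())
    tb = re.findall(r"\w+", b.lower())
    common = sum(min(ta.count(t), tb.count(t)) for t in set(ta))
    if common == 0:
        return 0.0
    precision, recall = common / len(ta), common / len(tb)
    return 2 * precision * recall / (precision + recall)

# Hypothetical answers to the same question: one elicited via the Czech
# question (then translated to English), one via the English question.
answer_via_czech = "Charles University was founded by Charles IV in 1348."
answer_via_english = "It was founded in 1348 by Emperor Charles IV."
print(f"cross-lingual token F1: {token_f1(answer_via_czech, answer_via_english):.2f}")
```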
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines textual and visual modalities for QA
Uses state-of-the-art LLMs for baselines
Evaluates metrics via human judgments
👤 Authors
Jindřich Libovický
Charles University
natural language processing, multilinguality, neural machine translation, language and vision
Jindřich Helcl
University of Oslo
Andrei Manea
Charles University
Gianluca Vico
Charles University