KoSimpleQA: A Korean Factuality Benchmark with an Analysis of Reasoning LLMs

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Korean large language models (LLMs) exhibit significant deficiencies in factual question answering grounded in Korean cultural knowledge. Method: We introduce KoSimpleQA, the first standardized benchmark tailored to the Korean cultural context, comprising 1,000 short-answer questions with unambiguous ground-truth answers. We systematically evaluate leading open-source Korean-capable LLMs and conduct a novel comparative analysis between reasoning and non-reasoning inference modes. Contribution/Results: The best-performing model achieves only 33.7% accuracy, substantially lower than its score on the English SimpleQA, highlighting the unique challenge of culturally specific factual QA in Korean. Our analysis demonstrates that explicit reasoning improves both factual recall and calibrated refusal under uncertainty. KoSimpleQA establishes a rigorous, practical standard for assessing factuality in Korean LLMs.

📝 Abstract
We present Korean SimpleQA (KoSimpleQA), a benchmark for evaluating factuality in large language models (LLMs) with a focus on Korean cultural knowledge. KoSimpleQA is designed to be challenging yet easy to grade, consisting of 1,000 short, fact-seeking questions with unambiguous answers. We conduct a comprehensive evaluation across a diverse set of open-source LLMs of varying sizes that support Korean, and find that even the strongest model generates a correct answer only 33.7% of the time, underscoring the challenging nature of KoSimpleQA. Notably, performance rankings on KoSimpleQA differ substantially from those on the English SimpleQA, highlighting the unique value of our dataset. Furthermore, our analysis of reasoning LLMs shows that engaging reasoning capabilities in the factual QA task can both help models better elicit their latent knowledge and improve their ability to abstain when uncertain. KoSimpleQA can be found at https://anonymous.4open.science/r/KoSimpleQA-62EB.
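The abstract's evaluation setup (graded short-answer QA where abstaining when uncertain is rewarded) can be sketched as follows. This is a minimal illustration in the style of SimpleQA-type grading, not the paper's released code: it assumes each response has already been graded as "correct", "incorrect", or "not_attempted" (an abstention), and the function names and sample data are hypothetical.

```python
# Sketch of SimpleQA-style scoring: overall accuracy, abstention rate,
# and accuracy among attempted answers (which rewards calibrated refusal).
from collections import Counter

def score(grades):
    """grades: list of "correct" / "incorrect" / "not_attempted" labels.

    Returns (accuracy, abstention_rate, correct_given_attempted)."""
    counts = Counter(grades)
    total = len(grades)
    accuracy = counts["correct"] / total
    abstention_rate = counts["not_attempted"] / total
    # Abstaining on an unknown answer does not hurt this last metric,
    # so a well-calibrated model can refuse rather than guess.
    attempted = total - counts["not_attempted"]
    correct_given_attempted = counts["correct"] / attempted if attempted else 0.0
    return accuracy, abstention_rate, correct_given_attempted

# Illustrative grades for four questions
print(score(["correct", "incorrect", "not_attempted", "correct"]))
```

On this toy input the model answers half the questions correctly overall, abstains on a quarter, and is correct on two of its three attempts.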
Problem

Research questions and friction points this paper is trying to address.

Evaluating factuality in Korean language models
Assessing cultural knowledge reasoning capabilities
Benchmarking model performance on Korean-specific questions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark tests Korean cultural knowledge factuality
Evaluates reasoning models' latent knowledge elicitation capabilities
Analyzes models' abstention improvement when uncertain