KAIO: A Collection of More Challenging Korean Questions

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Korean-language benchmarks are scarce, heavily reliant on translation, narrowly scoped, and outdated—leading to model performance saturation and data contamination. Method: We introduce KAIO, the first private, state-of-the-art Korean mathematical reasoning benchmark tailored for frontier large language models, emphasizing long-chain reasoning evaluation. Its design features: (1) fully human-authored, high-difficulty math problems to eliminate translation artifacts; (2) a held-out evaluation framework with private, dynamically released test sets to substantially mitigate data contamination; and (3) infrastructure supporting continuous iteration and reliable longitudinal assessment. Results: Initial evaluations reveal GPT-5 (62.8%) and Gemini-2.5-Pro (52.3%) as top performers, while leading open-source models—including Qwen3-235B and DeepSeek-R1—score below 30%, underscoring substantial room for improvement in Korean-language complex reasoning capabilities.

📝 Abstract
With the advancement of mid/post-training techniques, LLMs are pushing their boundaries at an accelerated pace. Legacy benchmarks saturate quickly (e.g., broad suites like MMLU over the years, newer ones like GPQA-D even faster), which makes frontier progress hard to track. The problem is especially acute in Korean: widely used benchmarks are fewer, often translated or narrow in scope, and updated more slowly, so saturation and contamination arrive sooner. As a result, no Korean benchmark currently exists that can evaluate and rank frontier models. To bridge this gap, we introduce KAIO, a Korean, math-centric benchmark that stresses long-chain reasoning. Unlike recent Korean suites that are at or near saturation, KAIO remains far from saturated: the best-performing model, GPT-5, attains 62.8, followed by Gemini-2.5-Pro (52.3). Open models such as Qwen3-235B and DeepSeek-R1 cluster below 30, demonstrating substantial headroom and enabling robust tracking of frontier progress in Korean. To reduce contamination, KAIO will remain private and be served via a held-out evaluator until the best publicly known model reaches at least 80% accuracy, after which we will release the set and iterate to a harder version.
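The abstract describes a simple release-gating protocol: the test set stays private behind a held-out evaluator until the best publicly known model reaches 80% accuracy. A minimal sketch of that rule, assuming a plain accuracy metric and using the reported scores as fractions (function names and the `dict`-of-scores shape are illustrative, not from the paper):

```python
# Hypothetical sketch of KAIO's release-gating rule: the private set is
# released only once the best public model clears the 80% threshold.
# Names and data shapes here are illustrative assumptions.

RELEASE_THRESHOLD = 0.80

def accuracy(num_correct: int, num_total: int) -> float:
    """Fraction of benchmark problems answered correctly."""
    return num_correct / num_total

def should_release(best_scores: dict[str, float]) -> bool:
    """Release the held-out set once any model reaches the threshold."""
    return max(best_scores.values()) >= RELEASE_THRESHOLD

# Scores reported in the abstract, as fractions:
scores = {"GPT-5": 0.628, "Gemini-2.5-Pro": 0.523}
print(should_release(scores))  # → False: the benchmark stays private
```

At current scores the gate stays closed, matching the abstract's claim that KAIO remains far from saturated.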
Problem

Research questions and friction points this paper is trying to address.

Lack of challenging Korean benchmarks for LLMs
Existing Korean benchmarks saturate and contaminate quickly
Need for math-centric evaluation stressing long-chain reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Korean math-centric benchmark KAIO
Stresses long-chain reasoning capabilities
Private held-out evaluator reduces contamination