Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
The absence of a low-contamination, high-quality benchmark for multistep soft reasoning over long Korean texts hinders progress in Korean NLP. Method: We introduce Ko-MuSR, the first Korean-specific benchmark for multistep soft reasoning, comprising human-verified long narrative passages, logically consistent reasoning chains, and multiple-choice questions, built with stringent contamination control. We adapt the MuSR framework to Korean via human annotation and evaluate prompting strategies combining few-shot examples, reasoning traces, and task-specific hints. Results: Experiments show that leading multilingual LLMs outperform Korean-specialized models on Ko-MuSR, indicating cross-lingual generalization of reasoning ability; carefully designed prompting strategies push model accuracy close to human performance. This work establishes a reliable, reproducible benchmark and a methodological foundation for studying complex, long-context reasoning in Korean.

📝 Abstract
We present Ko-MuSR, the first benchmark to comprehensively evaluate multistep, soft reasoning in long Korean narratives while minimizing data contamination. Built following MuSR, Ko-MuSR features fully Korean narratives, reasoning chains, and multiple-choice questions verified by human annotators for logical consistency and answerability. Evaluations of four large language models -- two multilingual and two Korean-specialized -- show that multilingual models outperform Korean-focused ones even in Korean reasoning tasks, indicating cross-lingual generalization of reasoning ability. Carefully designed prompting strategies, which combine few-shot examples, reasoning traces, and task-specific hints, further boost accuracy, approaching human-level performance. Ko-MuSR offers a solid foundation for advancing Korean NLP by enabling systematic evaluation of long-context reasoning and prompting strategies.
Problem

Research questions and friction points this paper is trying to address.

Evaluating multistep soft reasoning in Korean narratives
Assessing cross-lingual generalization of reasoning abilities
Developing effective prompting strategies for Korean reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Korean benchmark for multistep soft reasoning
Human-verified narratives with logical consistency checks
Prompting strategies combining examples and reasoning traces
Authors

Chanwoo Park
Dept. of Computer Science, Seoul National University
Suyoung Park
Graduate School of Data Science, Seoul National University
JiA Kang
Graduate School of Data Science, Seoul National University
Jongyeon Park
Graduate School of Data Science, Seoul National University
Sangho Kim
Associate Professor of Biomedical Engineering, National University of Singapore (Blood Rheology, Microcirculation, Hemodynamics, Gas Transport)
Hyunji M. Park
Graduate School of Data Science, Seoul National University
Sumin Bae
Dept. of Computer Science, Seoul National University
Mingyu Kang
UC Berkeley (quantum physics, quantum computing)
Jaejin Lee
Dept. of Computer Science, Seoul National University and Graduate School of Data Science, Seoul National University