🤖 AI Summary
The absence of a low-contamination, high-quality benchmark for multi-step soft reasoning over long Korean texts hinders progress in Korean NLP. Method: We introduce Ko-MuSR, the first Korean-specific benchmark for multi-step soft reasoning, comprising long narrative passages, logically consistent reasoning chains, and multiple-choice questions, all verified by human annotators and built with stringent contamination control. The benchmark adapts the MuSR framework to Korean, and evaluation uses prompting strategies that combine few-shot examples, reasoning traces, and task-specific hints. Results: Experiments on four large language models show that leading multilingual models outperform Korean-specialized ones on Ko-MuSR, indicating cross-lingual generalization of reasoning ability, and that carefully designed prompting pushes accuracy close to human performance. This work establishes a reliable, low-contamination benchmark for studying complex, long-context reasoning in Korean.
📝 Abstract
We present Ko-MuSR, the first benchmark to comprehensively evaluate multi-step soft reasoning in long Korean narratives while minimizing data contamination. Built following MuSR, Ko-MuSR features fully Korean narratives, reasoning chains, and multiple-choice questions verified by human annotators for logical consistency and answerability. Evaluations of four large language models -- two multilingual and two Korean-specialized -- show that multilingual models outperform Korean-focused ones even on Korean reasoning tasks, indicating cross-lingual generalization of reasoning ability. Carefully designed prompting strategies, which combine few-shot examples, reasoning traces, and task-specific hints, further boost accuracy, approaching human-level performance. Ko-MuSR offers a solid foundation for advancing Korean NLP by enabling systematic evaluation of long-context reasoning and prompting strategies.