AI Summary
This study addresses the poor performance of current large language models (LLMs) on Korean negation understanding, primarily attributed to the absence of evaluation benchmarks that reflect real-world linguistic distributions. To bridge this gap, the authors construct Thunder-KoNUBench, the first sentence-level benchmark for Korean negation comprehension, grounded in large-scale corpus analysis. They systematically evaluate 47 LLMs on this benchmark, revealing the nuanced effects of model scale and instruction tuning on negation understanding. Furthermore, they demonstrate that targeted fine-tuning on Thunder-KoNUBench not only substantially improves models' ability to handle Korean negation but also enhances their general contextual comprehension. This benchmark thus provides a critical resource and a new direction for advancing Korean language understanding research.
Abstract
Although negation is known to challenge large language models (LLMs), benchmarks for evaluating negation understanding, especially in Korean, are scarce. We conduct a corpus-based analysis of Korean negation and show that LLM performance degrades under negation. We then introduce Thunder-KoNUBench, a sentence-level benchmark that reflects the empirical distribution of Korean negation phenomena. Evaluating 47 LLMs, we analyze the effects of model size and instruction tuning, and show that fine-tuning on Thunder-KoNUBench improves negation understanding and broader contextual comprehension in Korean.