Evaluating Large Language Models on Understanding Korean Indirect Speech Acts

📅 2025-02-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study systematically evaluates the pragmatic understanding of mainstream large language models (LLMs) with respect to Korean indirect speech acts, focusing on authentic dialogues where the literal meaning diverges from the intended meaning. We propose the first Korean-specific, dual-paradigm evaluation framework comprising multiple-choice questions (MCQ) and open-ended questions (OEQ), integrating contextualized prompting and human-annotated ground-truth benchmarks. Experimental results reveal a substantial pragmatic gap: even the strongest model, Claude3-Opus, achieves only 71.94% accuracy on MCQ and 65.0% on OEQ, significantly below human performance. All models struggle to detect indirectness, and performance deteriorates markedly as the degree of indirectness increases. This work provides the first quantitative characterization of LLMs' limitations in Korean indirect speech comprehension, establishing a novel benchmark and analytical lens for assessing pragmatic competence in multilingual LLMs.
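The summary describes a contextualized MCQ protocol scored against human-annotated answers. As a rough illustration of what the MCQ half of such a loop can look like, here is a minimal Python sketch; the MCQItem fields, prompt wording, example item, and `ask_model` callable are all hypothetical placeholders, not the authors' released code or data.

```python
# Illustrative sketch only: every field, prompt, and example below is a
# hypothetical reconstruction of the described MCQ evaluation, not the
# paper's actual pipeline or dataset.

from dataclasses import dataclass

@dataclass
class MCQItem:
    context: str        # preceding dialogue turns
    utterance: str      # target utterance whose intention is being tested
    options: list[str]  # candidate paraphrases of the speaker's intention
    gold: str           # human-annotated correct option label, e.g. "B"

def build_prompt(item: MCQItem) -> str:
    """Assemble a contextualized multiple-choice prompt (format is assumed)."""
    option_lines = "\n".join(
        f"{label}. {text}" for label, text in zip("ABCD", item.options)
    )
    return (
        "Given the dialogue context, choose the option that best describes "
        "the speaker's intended meaning.\n\n"
        f"Context:\n{item.context}\n\n"
        f"Utterance: {item.utterance}\n\n"
        f"Options:\n{option_lines}\n\n"
        "Answer with a single letter (A-D)."
    )

def evaluate_mcq(items: list[MCQItem], ask_model) -> float:
    """Return accuracy; `ask_model` is any callable mapping prompt -> reply."""
    correct = sum(
        ask_model(build_prompt(item)).strip().upper()[:1] == item.gold
        for item in items
    )
    return correct / len(items)

# Toy usage with a stub "model" that always answers B:
demo = [MCQItem(
    context="A: 뭐 필요한 거 있어? (Do you need anything?)",
    utterance="B: 방이 좀 춥네요. (The room is a bit cold.)",
    options=["A factual report of the temperature",
             "An indirect request to turn up the heat",
             "A refusal of A's offer",
             "A compliment about the room"],
    gold="B",
)]
print(f"MCQ accuracy: {evaluate_mcq(demo, lambda p: 'B'):.1%}")
```

The OEQ paradigm would presumably replace the lettered options with a free-form question about the speaker's intention, which is why the paper analyzes OEQ response patterns rather than reducing them to a single choice.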

📝 Abstract
Accurately understanding the intention of an utterance is crucial in conversational communication. As conversational artificial intelligence models are rapidly being developed and applied in various fields, it is important to evaluate LLMs' ability to understand the intention behind a user's utterance. This study evaluates whether current LLMs can infer the intention of an utterance from the given conversational context, particularly in cases where the actual intention differs from the surface-level, literal meaning of the sentence, i.e., indirect speech acts. Our findings reveal that Claude3-Opus outperformed the competing models, with 71.94% accuracy on the MCQ task and 65% on the OEQ task, showing a clear advantage. In general, proprietary models exhibited higher performance than open-source models. Nevertheless, no LLM reached the level of human performance. Most LLMs, except for Claude3-Opus, performed significantly worse on indirect speech acts than on direct speech acts, where the intention is explicitly revealed in the utterance. This study not only provides an overall pragmatic evaluation of each LLM's language use through an analysis of OEQ response patterns, but also underscores the need for further research to improve LLMs' understanding of indirect speech acts for more natural communication with humans.
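The abstract's direct-versus-indirect contrast amounts to a per-category accuracy breakdown. Below is a minimal sketch of that analysis, with an invented record shape and toy numbers rather than the paper's data.

```python
# Hypothetical breakdown of accuracy by speech-act type; the record format
# and toy scores are illustrative, not taken from the paper's dataset.
from collections import defaultdict

def accuracy_by_act_type(records):
    """records: iterable of (act_type, is_correct) pairs."""
    totals, hits = defaultdict(int), defaultdict(int)
    for act_type, is_correct in records:
        totals[act_type] += 1
        hits[act_type] += int(is_correct)
    return {t: hits[t] / totals[t] for t in totals}

# Toy scores mimicking the reported pattern (direct > indirect):
toy = [("direct", True), ("direct", True), ("direct", False),
       ("indirect", True), ("indirect", False), ("indirect", False)]
print(accuracy_by_act_type(toy))
# -> {'direct': 0.666..., 'indirect': 0.333...}
```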
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on Korean indirect speech acts
Assessing intention understanding in conversational AI
Comparing human and LLM performance in communication
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates LLMs on Korean indirect speech acts
Claude3-Opus outperforms in MCQ and OEQ
Proprietary models surpass open-source in performance
Youngeun Koo
Department of German Language and Literature, Sungkyunkwan University, Seoul, South Korea
Jiwoo Lee
Dojun Park
Artificial Intelligence Institute of Seoul National University(AIIS), Seoul, South Korea
Seohyun Park
Korea University
Sungeun Lee
Artificial Intelligence Institute of Seoul National University(AIIS), Seoul, South Korea; Department of German Language and Literature, Seoul National University, Seoul, South Korea; Brain Humanities Lab, Seoul National University, Seoul, South Korea