How Hypocritical Is Your LLM Judge? Listener-Speaker Asymmetries in the Pragmatic Competence of Large Language Models

📅 2026-04-17

🤖 AI Summary

This study asks whether large language models (LLMs) show consistent pragmatic competence across two roles: as speakers, generating contextually appropriate utterances, and as listeners, judging the appropriateness of others' utterances. It compares the generation and judgment performance of several open-weight and proprietary LLMs across three pragmatic settings. The results show a robust asymmetry: many models perform substantially better as pragmatic listeners than as speakers, suggesting that the ability to judge pragmatically appropriate language is only weakly aligned with the ability to produce it. Because success in one role does not reliably predict success in the other, the authors call for evaluation practices that assess generation and judgment together.

📝 Abstract

Large language models (LLMs) are increasingly studied as repositories of linguistic knowledge. In this line of work, models are commonly evaluated both as generators of language and as judges of linguistic output, yet these two roles are rarely examined in direct relation to one another. As a result, it remains unclear whether success in one role aligns with success in the other. In this paper, we address this question for pragmatic competence by comparing LLMs' performance as pragmatic listeners, judging the appropriateness of linguistic outputs, and as pragmatic speakers, generating pragmatically appropriate language. We evaluate multiple open-weight and proprietary LLMs across three pragmatic settings. We find a robust asymmetry between pragmatic evaluation and pragmatic generation: many models perform substantially better as listeners than as speakers. Our results suggest that pragmatic judging and pragmatic generation are only weakly aligned in current LLMs, calling for more integrated evaluation practices.
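To make the listener-speaker comparison concrete, the sketch below shows one way such an asymmetry could be quantified. It is not the authors' method: the binary per-item scoring, the `role_gaps` and `pearson` helpers, the placeholder model names, and the toy outcomes are all assumptions made for illustration, and the paper's actual metrics and data may differ.

```python
from statistics import mean


def accuracy(outcomes):
    """Fraction of items scored as correct (1) in a list of 0/1 outcomes."""
    return sum(outcomes) / len(outcomes)


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def role_gaps(results):
    """Listener accuracy minus speaker accuracy for each model."""
    return {
        model: accuracy(roles["listener"]) - accuracy(roles["speaker"])
        for model, roles in results.items()
    }


# Toy placeholder data (hypothetical, NOT results from the paper):
# 1 = pragmatically appropriate judgment or generation, 0 = not.
toy_results = {
    "model_a": {"listener": [1, 1, 1, 0], "speaker": [1, 0, 0, 0]},
    "model_b": {"listener": [1, 1, 0, 0], "speaker": [1, 1, 0, 0]},
    "model_c": {"listener": [1, 1, 1, 1], "speaker": [1, 1, 0, 0]},
}

# A positive gap means the model does better as a listener than as a speaker.
gaps = role_gaps(toy_results)

# Correlation of the two role scores across models; a value near zero
# would indicate that the roles are only weakly aligned.
listener_scores = [accuracy(r["listener"]) for r in toy_results.values()]
speaker_scores = [accuracy(r["speaker"]) for r in toy_results.values()]
alignment = pearson(listener_scores, speaker_scores)
```

Reporting the per-model gap alongside the cross-model correlation keeps two questions separate: whether a given model is better in one role, and whether strength in one role predicts strength in the other.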

Problem

Research questions and friction points this paper is trying to address.

pragmatic competence

large language models

listener-speaker asymmetry

language generation

language evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

pragmatic competence

listener-speaker asymmetry

large language models