🤖 AI Summary
This study addresses the lack of systematic evaluation of large language models’ (LLMs) pragmatic reasoning about presupposition projection in conditional sentences. The authors construct a controlled, normed dataset of conditionals and pair a behavioral experiment with human participants with a linguistically grounded automated evaluation framework that integrates, for the first time, a theory-driven checklist with LLM-as-a-Judge methodology, enabling a parallel comparison of human and model performance. Results show that humans integrate probabilistic expectations with pragmatic cues when making judgments, whereas LLMs align only partially with human ratings and show no consistent capacity for pragmatic inference: the models that best match human ratings often lack coherent pragmatic reasoning. This suggests that model performance may stem from surface-level pattern matching rather than pragmatic competence, pointing to significant limitations in their handling of presuppositional phenomena.
📝 Abstract
Presupposition projection in conditionals is central to theories of meaning and pragmatics, yet it remains largely unevaluated in large language models. We address this gap through a parallel behavioral study comparing human judgments and LLM predictions on a normed dataset of conditional sentences that controls the relation between the antecedent and the projected presupposition. We collect likelihood ratings from 120 participants and four LLMs under matched contextual conditions. Results show that humans integrate probabilistic and pragmatic cues in their judgments, whereas LLMs show variable alignment with human patterns. Using a linguistically motivated checklist within an LLM-as-a-Judge framework, we further evaluate model reasoning. We observe that the models that best match human ratings often lack coherent pragmatic reasoning, while models with stronger reasoning produce less human-like judgments. These findings suggest that LLMs' performance on such tasks may result from surface pattern matching rather than pragmatic competence. Our findings highlight the importance of benchmarks grounded in linguistic theory for comparing humans and models.