🤖 AI Summary
This study investigates whether small language models (BabyLMs) pretrained on fewer than 10M or fewer than 100M tokens possess pragmatic competence—specifically, the ability to detect violations of Gricean conversational maxims and thereby infer implicatures. To this end, we adapt paradigms from developmental psychology research on child language comprehension to model evaluation, constructing the first benchmark for Gricean maxim violation detection tailored to lightweight models. The benchmark comprises dialogic scenarios covering adherence to and violations of five maxims: Quantity, Quality, Relation, Manner, and Politeness. We employ fine-grained evaluation via perplexity and conditional probability comparisons. Results show that pragmatic sensitivity improves with pretraining data scale but remains substantially below the performance of both human children and large language models. Our primary contribution is a theory-driven, developmentally informed evaluation paradigm for pragmatic competence, providing a reproducible benchmark and analytical framework for studying the semantics–pragmatics interface in small language models.
📝 Abstract
Implicit meanings are integral to human communication, making it essential for language models to identify and interpret them. Grice (1975) proposed a set of conversational maxims that guide cooperative dialogue, noting that speakers may deliberately violate these principles to express meanings beyond their literal words, and that listeners, in turn, recognize such violations to draw pragmatic inferences.
Building on the study by Surian et al. (1996) of children's sensitivity to violations of Gricean maxims, we introduce a novel benchmark to test whether language models pretrained on fewer than 10M or fewer than 100M tokens can distinguish maxim-adhering from maxim-violating utterances. We compare these BabyLMs across five maxims and situate their performance relative to children and a Large Language Model (LLM) pretrained on 3T tokens.
We find that, overall, models trained on fewer than 100M tokens outperform those trained on fewer than 10M, yet fall short of both child-level and LLM competence. Our results suggest that modest increases in pretraining data improve some aspects of pragmatic behavior, yielding finer-grained differentiation between pragmatic dimensions.
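The perplexity-comparison evaluation mentioned above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes per-token log-probabilities for each candidate reply have already been extracted from a model, and the numbers below are purely illustrative, not drawn from the benchmark.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean per-token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probabilities a model might assign to two
# candidate replies to the same dialogue context (illustrative numbers).
adhering_logprobs = [-0.8, -1.1, -0.9, -1.2]    # maxim-adhering reply
violating_logprobs = [-2.4, -3.1, -2.8, -3.5]   # maxim-violating reply

ppl_adhering = perplexity(adhering_logprobs)
ppl_violating = perplexity(violating_logprobs)

# The model counts as sensitive to the violation if it finds the
# adhering reply less surprising, i.e. assigns it lower perplexity.
print(ppl_adhering < ppl_violating)
```

Under this scoring rule, a model's sensitivity to a given maxim can be summarized as the fraction of scenario pairs in which the adhering utterance receives the lower perplexity.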