🤖 AI Summary
This work investigates whether large language models (LLMs) can interpret numbers non-literally, as humans do, in contexts involving hyperbole and pragmatic halo effects, focusing on their capacity for pragmatic inference. Method: We first decompose the Rational Speech Act (RSA) framework into empirically testable submodules, then run behavioral comparisons between humans and models and evaluate pragmatic reasoning on a dedicated benchmark. The results show that LLM failures stem from flawed inference mechanisms rather than gaps in knowledge. Building on this diagnosis, we propose an RSA-inspired chain-of-thought prompting method. Contribution/Results: Our approach significantly improves LLM alignment with human judgments on numerical non-literal interpretation tasks. Key contributions include: (1) an interpretable, cognitively grounded diagnostic pathway for model–human alignment; (2) identification of critical breakpoints in LLM pragmatic reasoning; and (3) a theoretically motivated intervention strategy, advancing a new paradigm for developing human-aligned language models.
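For reference, the recursion being decomposed is usually written as follows. This is the generic RSA formulation (literal listener, pragmatic speaker, pragmatic listener); the paper's actual submodules for hyperbole and halo effects may add further variables such as affect and communicative goals, so treat this as a background sketch rather than the exact model evaluated.

```latex
% Generic RSA recursion; a background sketch, not necessarily the exact
% decomposition tested in the paper.
\begin{align*}
  % literal listener: truth-conditional meaning of utterance u times the prior over states m
  L_0(m \mid u) &\propto [\![u]\!](m)\; P(m) \\
  % pragmatic speaker: soft-max trade-off between informativity and utterance cost C(u)
  S_1(u \mid m) &\propto \exp\!\bigl(\alpha \,[\log L_0(m \mid u) - C(u)]\bigr) \\
  % pragmatic listener: Bayesian inversion of the speaker model
  L_1(m \mid u) &\propto S_1(u \mid m)\; P(m)
\end{align*}
```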
📝 Abstract
Humans naturally interpret numbers non-literally, effortlessly combining context, world knowledge, and speaker intent. We investigate whether large language models (LLMs) interpret numbers similarly, focusing on hyperbole and pragmatic halo effects. Through systematic comparison with human data and computational models of pragmatic reasoning, we find that LLMs diverge from human interpretation in striking ways. By decomposing pragmatic reasoning into testable components grounded in the Rational Speech Act framework, we pinpoint where LLM processing diverges from human cognition: not in prior knowledge, but in reasoning with it. This insight leads to a targeted solution: chain-of-thought prompting inspired by an RSA model makes LLMs' interpretations more human-like. Our work demonstrates how computational cognitive models can both diagnose AI-human differences and guide the development of more human-like language understanding capabilities.
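To make the kind of pragmatic reasoning at stake concrete, below is a minimal runnable sketch of an RSA hyperbole model in the spirit of classic treatments (e.g., Kao et al., 2014), in which a listener jointly infers the true price, the speaker's affect, and the speaker's communicative goal. All states, priors, and parameter values are invented for illustration; they are not the paper's stimuli, model, or results.

```python
# Minimal sketch of a Kao-et-al.-style RSA model of hyperbole, included only to
# make the "decomposed submodules" concrete (prior, literal listener, speaker,
# pragmatic listener).  Illustrative only: all numbers below (prices, priors,
# affect probabilities, alpha) are invented and are NOT the paper's items.
import itertools
import numpy as np

prices = [50, 500, 10_000]                 # possible true kettle prices (states)
utterances = [50, 500, 10_000]             # number words the speaker can use
price_prior = {50: 0.55, 500: 0.44, 10_000: 0.01}
affect_given_price = {50: 0.1, 500: 0.4, 10_000: 0.9}   # P(speaker upset | price)
goals = ["price", "affect", "both"]        # what the speaker cares about (QUD)
alpha = 1.0                                # speaker rationality

meanings = list(itertools.product(prices, [0, 1]))       # (price, upset?)

def prior(m):
    s, a = m
    return price_prior[s] * (affect_given_price[s] if a else 1 - affect_given_price[s])

def literal_listener(m, u):
    # L0: exact literal semantics over prices, combined with the joint prior.
    s, a = m
    num = prior(m) if s == u else 0.0
    den = sum(prior(mm) for mm in meanings if mm[0] == u)
    return num / den if den > 0 else 0.0

def project(m, g):
    s, a = m
    return {"price": s, "affect": a, "both": (s, a)}[g]

def speaker(u, m, g):
    # S1: chooses an utterance to convey the goal-relevant part of the meaning.
    def utility(utt):
        p = sum(literal_listener(mm, utt) for mm in meanings
                if project(mm, g) == project(m, g))
        return alpha * np.log(p) if p > 0 else -np.inf
    scores = np.array([utility(utt) for utt in utterances])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs[utterances.index(u)]

def pragmatic_listener(u):
    # L1: jointly infers price, affect, and goal, then marginalizes out the goal.
    post = np.array([sum(prior(m) * (1 / len(goals)) * speaker(u, m, g) for g in goals)
                     for m in meanings])
    return post / post.sum()

for u in utterances:
    dist = pragmatic_listener(u)
    print(f'"The kettle cost {u} dollars" ->',
          {f"price={s}, upset={bool(a)}": round(p, 3) for (s, a), p in zip(meanings, dist)})
```

Under these made-up parameters, the pragmatic listener that hears "The kettle cost 10,000 dollars" puts most of its probability on a much lower price together with the speaker being upset, i.e., a hyperbolic rather than literal reading. This is the style of step-by-step inference that RSA-inspired chain-of-thought prompting aims to elicit from LLMs.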