🤖 AI Summary
This study investigates whether large language models (LLMs) possess the capacity to use subtext -- implicit communication that transcends literal meaning. To this end, it introduces the first quantifiable evaluation framework tailored to subtext, a subjectively complex phenomenon, and constructs four benchmark tasks encompassing allegory generation and interpretation, multi-agent interaction, multimodal games, and common-knowledge reasoning. The findings reveal a pervasive tendency among state-of-the-art models toward overly literal outputs: 60% of responses in visual metaphor tasks rely on explicit cues. While explicitly providing shared context reduces literalism by 30%-50%, models still exhibit marked deficiencies in reasoning with implicit common knowledge. This work establishes a foundational methodology for the systematic assessment of implicit communicative competence in artificial intelligence.
📝 Abstract
Human communication is fundamentally creative and often makes use of subtext -- implied meaning that goes beyond the literal content of the text. Here, we systematically study whether language models can use subtext in communicative settings, and introduce four new evaluation suites to assess these capabilities. Our evaluation settings range from writing and interpreting allegories to playing multi-agent and multi-modal games inspired by the rules of board games like Dixit. We find that frontier models generally exhibit a strong bias towards overly literal, explicit communication and thereby fail to account for nuanced constraints -- even the best-performing models generate literal clues 60% of the time in one of our environments, Visual Allusions. However, we find that some models can sometimes make use of common ground with another party to communicate with subtext, achieving a 30%-50% reduction in overly literal clues; but they struggle to infer the presence of common ground when it is not explicitly stated. For allegory understanding, we find that paratextual and persona conditions significantly shift the interpretation of subtext. Overall, our work provides quantifiable measures for an inherently complex and subjective phenomenon like subtext, and reveals many weaknesses and idiosyncrasies of current LLMs. We hope this research inspires future work on socially grounded creative communication and reasoning.
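The literalism figures quoted above (a 60% literal-clue rate, and a 30%-50% relative reduction when common ground is provided) can be made concrete with a minimal sketch. This is an illustrative assumption, not the paper's actual evaluation code: the function names, labels, and toy data below are invented for exposition.

```python
# Hypothetical sketch of the literalism metrics described in the abstract.
# "literal" vs. "subtextual" labels would come from human or model judges.

def literal_rate(labels):
    """Fraction of clues judged 'literal' (as opposed to 'subtextual')."""
    return sum(1 for label in labels if label == "literal") / len(labels)

def relative_reduction(baseline, with_common_ground):
    """Percentage drop in the literal-clue rate after adding common ground."""
    return 100 * (baseline - with_common_ground) / baseline

# Toy judged labels, for illustration only.
baseline_labels = ["literal"] * 6 + ["subtextual"] * 4   # 60% literal
grounded_labels = ["literal"] * 3 + ["subtextual"] * 7   # 30% literal

b = literal_rate(baseline_labels)
g = literal_rate(grounded_labels)
print(f"baseline literal rate: {b:.0%}")                               # 60%
print(f"reduction with common ground: {relative_reduction(b, g):.0f}%")  # 50%
```

A 60% baseline dropping to 30% corresponds to the upper end (50%) of the reported 30%-50% reduction range.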