🤖 AI Summary
This study addresses the lack of systematic evaluation of large language models' (LLMs) ability to interpret pragmatic cues, specifically ironic emoji usage (e.g., 🙂, 🙃), in social media contexts, where sociodemographic factors critically shape interpretation.
Method: We conduct the first comparative analysis between human annotators and GPT-4o using zero-shot prompting to assess emoji irony judgments, benchmarked against a large-scale, demographically stratified crowdsourced dataset (spanning age, gender, and other sociodemographic dimensions).
Results: While overall model–human agreement reaches 68%, GPT-4o significantly underestimates irony prevalence among younger users and fails to capture gender-specific pragmatic patterns, revealing structural deficits in sociocultural contextual reasoning.
Contribution: We propose a novel evaluation paradigm for pragmatic robustness in LLMs, empirically diagnosing a fundamental insensitivity to sociodemographic variables. This work provides both empirical evidence and methodological foundations for developing socially aware language models.
📝 Abstract
Emojis have become a universal language in online communication, often carrying nuanced and context-dependent meanings. Among these, irony poses a significant challenge for Large Language Models (LLMs) due to its inherent incongruity between appearance and intent. This study examines the ability of GPT-4o to interpret irony in emojis. By prompting GPT-4o to evaluate the likelihood of specific emojis being used to express irony on social media and comparing its interpretations with human perceptions, we aim to bridge the gap between machine and human understanding. Our findings reveal nuanced insights into GPT-4o's interpretive capabilities, highlighting areas of alignment with and divergence from human behavior. Additionally, this research underscores the importance of demographic factors, such as age and gender, in shaping emoji interpretation and evaluates how these factors influence GPT-4o's performance.