🤖 AI Summary
This study investigates the communicative competence of large language models (LLMs) in complex workplace social scenarios, such as delivering critical feedback or declining requests. The authors developed the HR Simulator, a game-based platform where both human participants and LLMs composed response emails, which were then evaluated through multidimensional automated scoring and preference analysis using ten models of varying capabilities, including GPT-4o. The work presents the first quantitative assessment of LLM performance in workplace moral dilemmas, revealing that LLM-generated emails achieved acceptance rates of 48–54%, significantly outperforming human-written messages at 23.5%. Human-AI collaborative editing further enhanced effectiveness. Notably, more capable model evaluators demonstrated greater scoring consistency and a marked preference for nuanced, indirect phrasing—exhibiting an "emergent tact" that suggests AI may drive convergence toward standardized norms in professional communication.
📝 Abstract
Email communication increasingly involves large language models (LLMs), but we lack intuition on how they will read, write, and optimize for nuanced social goals. We introduce HR Simulator, a game where communication is the core mechanic: players play as a Human Resources officer and write emails to solve socially challenging workplace scenarios. An analysis of 600+ human and LLM emails with LLM-as-judge evaluation reveals evidence that larger LLMs become more homogeneous in their email quality judgments. Under LLM judges, humans underperform LLMs (e.g., 23.5% vs. 48–54% success rate), but a human+LLM approach can outperform LLM-only (e.g., from 40% to nearly 100% in one scenario). In cases where models' email preferences disagree, emergent tact is a plausible explanation: weaker models prefer less tactful strategies while stronger models prefer more tactful ones. Regarding tone, LLM emails are more formal and empathetic while human emails are more varied. LLM rewrites make human emails more formal and empathetic, but models still struggle to imitate human emails in the low-empathy, low-formality quadrant, which highlights a limitation of current post-training approaches. Our results demonstrate the efficacy of communication games as instruments to measure communication in the era of LLMs, and we posit human-LLM co-writing as an effective form of communication in that future.
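The LLM-as-judge preference setup described in the abstract can be sketched in miniature. The code below is a toy illustration, not the paper's actual pipeline: `heuristic_judge` stands in for a real LLM call, crudely "preferring" the email that hedges more as a proxy for tact, and all function names and the `HEDGES` word list are assumptions for illustration.

```python
# Toy sketch of pairwise LLM-as-judge email evaluation.
# `heuristic_judge` is a stand-in for a real LLM judge call.

HEDGES = {"perhaps", "might", "could", "appreciate", "understand", "unfortunately"}

def tact_score(email: str) -> int:
    """Count hedging/softening words as a crude proxy for tactfulness."""
    return sum(1 for word in email.lower().split() if word.strip(".,;!?") in HEDGES)

def heuristic_judge(email_a: str, email_b: str) -> str:
    """Return 'A' or 'B' for the preferred email (ties go to 'A')."""
    return "A" if tact_score(email_a) >= tact_score(email_b) else "B"

def win_rate(judge, candidate_emails, baseline_emails) -> float:
    """Fraction of pairings in which the judge prefers the candidate email."""
    wins = sum(
        judge(cand, base) == "A"
        for cand, base in zip(candidate_emails, baseline_emails)
    )
    return wins / len(candidate_emails)

blunt = ["Your request is denied. Do not ask again."]
tactful = ["I understand this matters; unfortunately we might not be able to approve it."]
print(win_rate(heuristic_judge, tactful, blunt))  # 1.0 with this toy judge
```

With a real LLM judge, the paper's finding would correspond to stronger judge models assigning higher win rates to the more tactful candidate set than weaker judges do.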