🤖 AI Summary
This study investigates how prompt tone (friendly, neutral, or rude) affects the accuracy of mainstream large language models (LLMs)—GPT-4o mini, Gemini 2.0 Flash, and Llama 4 Scout—on the MMMLU benchmark across six STEM and humanities tasks.
Method: Using controlled prompt variants, pairwise accuracy comparisons, and two-tailed t-tests, we conduct the first cross-model-family, cross-disciplinary quantitative assessment of tone sensitivity.
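The pairwise comparison with a two-tailed t-test can be sketched as follows. This is a minimal illustration, not the study's actual evaluation code: the correctness vectors are hypothetical, and the p-value uses a normal approximation to Welch's t-test rather than the exact t-distribution.

```python
import math

def two_tailed_t(xs, ys):
    """Welch's two-tailed t-test on two samples of per-question
    correctness (1 = correct, 0 = incorrect). The p-value uses a
    normal approximation, reasonable at benchmark sample sizes."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    se = math.sqrt(vx / nx + vy / ny)  # standard error of the accuracy difference
    t = (mx - my) / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))
    return t, p

# Hypothetical per-question outcomes for one task under two tones:
friendly = [1] * 60 + [0] * 40   # 60% accuracy over 100 questions
rude     = [1] * 48 + [0] * 52   # 48% accuracy over 100 questions
t, p = two_tailed_t(friendly, rude)
```

In this toy setup even a 12-point accuracy gap does not reach significance at the 0.05 level with only 100 questions per condition, which illustrates why the paper's observation that effect detection depends on data scale matters in practice.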
Contribution/Results: (1) Friendly and neutral prompts consistently outperform rude ones; (2) statistically significant tone effects occur only in select humanities tasks and exhibit strong model dependence—GPT-4o mini and Llama 4 Scout show sensitivity, whereas Gemini 2.0 Flash remains robust; (3) aggregated cross-task analysis reveals attenuated effects, indicating high tone robustness in mixed-domain applications. Crucially, effect detection is contingent on sufficient data scale. These findings provide novel empirical evidence for LLM prompt engineering and robustness evaluation, highlighting the interplay between tone, domain, architecture, and data size.
📝 Abstract
Prompt engineering has emerged as a critical factor influencing large language model (LLM) performance, yet the impact of pragmatic elements such as linguistic tone and politeness remains underexplored, particularly across different model families. In this work, we propose a systematic evaluation framework to examine how interaction tone affects model accuracy and apply it to three recently released and widely available LLMs: GPT-4o mini (OpenAI), Gemini 2.0 Flash (Google DeepMind), and Llama 4 Scout (Meta). Using the MMMLU benchmark, we evaluate model performance under Very Friendly, Neutral, and Very Rude prompt variants across six tasks spanning STEM and Humanities domains, and analyze pairwise accuracy differences with statistical significance testing.
Our results show that tone sensitivity is both model-dependent and domain-specific. Neutral or Very Friendly prompts generally yield higher accuracy than Very Rude prompts, but statistically significant effects appear only in a subset of Humanities tasks, where rude tone reduces accuracy for GPT and Llama, while Gemini remains comparatively tone-insensitive. When performance is aggregated across tasks within each domain, tone effects diminish and largely lose statistical significance. In contrast to earlier studies, these findings suggest that dataset scale and coverage materially influence the detection of tone effects. Overall, our study indicates that while interaction tone can matter in specific interpretive settings, modern LLMs are broadly robust to tonal variation in typical mixed-domain use, providing practical guidance for prompt design and model selection in real-world deployments.