🤖 AI Summary
Large language models (LLMs) are increasingly deployed in mental health support, yet their systematic emotional and affective responses to depression, anxiety, and stress remain poorly characterized.

Method: This study conducts a rigorous, multi-dimensional sentiment and emotion analysis of 2,880 LLM-generated responses across eight state-of-the-art models, controlling for user demographic variables. Responses were quantitatively scored with validated affective lexicons and fine-grained emotion detection tools.

Contribution/Results: We identify distinct “affective fingerprints” across models: model architecture exerts significantly greater influence on emotional output than user-profile specifications. Anxiety-related prompts elicit the highest fear intensity (0.974); Llama demonstrates the most consistently positive valence, whereas Mixtral carries the strongest negative emotion load. Stress-related responses show the highest average optimism (0.755). This work provides the first empirical evidence of systematic affective biases in LLMs’ mental health support outputs, offering critical insights for model selection, safety alignment, and responsible clinical integration.
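The abstract does not name the scoring tools, so the following is only a minimal sketch of what per-response emotion scoring could look like, assuming a GoEmotions-style multi-label classifier; the checkpoint name, `score_response` helper, and example text are illustrative assumptions, not the paper's pipeline.

```python
# Hypothetical sketch: score one LLM response on fine-grained emotions.
# The GoEmotions checkpoint below is an assumed stand-in for the paper's
# unnamed "fine-grained emotion detection tools".
from transformers import pipeline

emotion_clf = pipeline(
    "text-classification",
    model="SamLowe/roberta-base-go_emotions",  # assumed checkpoint
    top_k=None,  # return a score for every emotion label, not just the top one
)

def score_response(text: str) -> dict[str, float]:
    """Map an LLM response to a {emotion_label: score} dictionary."""
    label_scores = emotion_clf([text])[0]
    return {item["label"]: item["score"] for item in label_scores}

example = (
    "It is completely understandable to feel anxious. Grounding exercises "
    "can help, and these feelings usually pass with time."
)
top5 = sorted(score_response(example).items(), key=lambda kv: -kv[1])[:5]
print(top5)
```

A multi-label setup like this would yield per-emotion intensities (e.g., fear, sadness, optimism) for each of the 2,880 responses, which can then be aggregated by model, condition, or user profile.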
📝 Abstract
Depression, anxiety, and stress are widespread mental health concerns that increasingly drive individuals to seek information from Large Language Models (LLMs). This study investigates how eight LLMs (Claude Sonnet, Copilot, Gemini Pro, GPT-4o, GPT-4o mini, Llama, Mixtral, and Perplexity) respond to twenty pragmatic questions about each of depression, anxiety, and stress when those questions are framed for six user profiles (baseline, woman, man, young, old, and university student). The models generated 2,880 answers (8 models × 3 conditions × 20 questions × 6 profiles), which we scored for sentiment and emotions using state-of-the-art tools. Our analysis revealed that optimism, fear, and sadness dominated the emotional landscape across all outputs, with neutral sentiment maintaining consistently high values. Gratitude, joy, and trust appeared at moderate levels, while emotions such as anger, disgust, and love were rarely expressed. The choice of LLM significantly influenced emotional expression patterns: Mixtral exhibited the highest levels of negative emotions, including disapproval, annoyance, and sadness, while Llama produced the most optimistic and joyful responses. The type of mental health condition dramatically shaped emotional responses: anxiety prompts elicited extraordinarily high fear scores (0.974), depression prompts generated elevated sadness (0.686) and the highest negative sentiment, while stress-related queries produced the most optimistic responses (0.755) with elevated joy and trust. In contrast, demographic framing of queries produced only marginal variations in emotional tone. Statistical analyses confirmed significant model-specific and condition-specific differences, while demographic influences remained minimal. These findings highlight the critical importance of model selection in mental health applications, as each LLM exhibits a distinct emotional signature that could significantly impact user experience and outcomes.
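The abstract reports significant model- and condition-level differences but does not specify the tests used; as a minimal sketch, assuming per-response emotion scores in long format and a non-parametric Kruskal-Wallis comparison (the test choice, `scores.csv` file, and column names are all assumptions), the analysis might look like:

```python
# Hypothetical sketch: test whether an emotion score differs across groups.
# Assumes a long-format table with one row per scored response, e.g. columns:
# model, condition, profile, fear, sadness, optimism, ...
import pandas as pd
from scipy.stats import kruskal

df = pd.read_csv("scores.csv")  # assumed file holding the 2,880 scored answers

for factor in ("model", "condition", "profile"):
    # One group of fear scores per level of the factor (e.g., per model).
    groups = [g["fear"].to_numpy() for _, g in df.groupby(factor)]
    stat, p = kruskal(*groups)
    print(f"fear ~ {factor}: H = {stat:.2f}, p = {p:.4g}")
```

Under this kind of design, one would expect small p-values for the model and condition factors and a non-significant result for profile, mirroring the pattern the abstract reports.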