🤖 AI Summary
This study investigates whether making language model responses more supportive in mental health dialogue systems comes at the expense of safety. Using system prompts with three levels of supportiveness, the authors evaluate 1,440 responses from six large language models to 80 synthetically generated queries, applying a human-validated LLM-based evaluation framework to automatically assess both safety and empathic quality. The work shows that strongly validating prompts generally reduce model safety, whereas moderately supportive prompts can balance empathy and safety. Furthermore, substantial performance differences across models highlight the need to tailor prompt engineering and deployment strategies to both model characteristics and domain-specific requirements.
📝 Abstract
Large language models (LLMs) are being integrated into socially assistive robots (SARs) and other conversational agents that provide mental health and well-being support. These agents are often designed to sound empathic and supportive to maximize user engagement, yet it remains unclear how increasing the level of supportive framing in system prompts influences safety-relevant behavior. We evaluated 6 LLMs across 3 system prompts with varying levels of supportiveness on 80 synthetic queries spanning 4 well-being domains (1,440 responses). An LLM-judge framework, validated against human ratings, assessed safety and care quality. Moderately supportive prompts improved empathy and constructive support while maintaining safety. In contrast, strongly validating prompts significantly degraded safety, and in some cases care, across all domains, with substantial variation across models. We discuss implications for prompt design, model selection, and domain-specific safeguards in SAR deployment.
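To make the study design concrete, below is a minimal Python sketch of the evaluation grid as described in the abstract: six models, three supportiveness levels, and 80 queries, with each response scored by an LLM judge for safety and care quality. The model names, prompt wordings, domain labels, and the scoring stub are placeholders introduced for illustration only; they are not the authors' actual pipeline or rubric.

```python
"""Sketch of the 6 models x 3 prompts x 80 queries evaluation grid (illustrative only)."""
from itertools import product
import random

MODELS = [f"llm_{i}" for i in range(6)]                       # six LLMs under test (names assumed)
PROMPTS = {                                                   # three supportiveness levels (wording assumed)
    "neutral": "Respond factually and concisely.",
    "moderately_supportive": "Respond warmly and offer constructive support.",
    "strongly_validating": "Validate the user's feelings above all else.",
}
DOMAINS = ["stress", "sleep", "relationships", "motivation"]  # four well-being domains (labels assumed)
QUERIES = [{"id": i, "domain": DOMAINS[i % 4]} for i in range(80)]  # 80 synthetic queries (placeholders)


def generate(model: str, system_prompt: str, query: dict) -> str:
    """Stand-in for querying a model; real code would call an inference API here."""
    return f"{model} reply to query {query['id']} under prompt '{system_prompt[:24]}...'"


def judge(response: str) -> dict:
    """Stand-in for the LLM-judge rubric (safety and care quality; scale assumed)."""
    return {"safety": random.randint(1, 5), "care_quality": random.randint(1, 5)}


results = [
    {"model": m, "prompt": level, "query": q["id"], "domain": q["domain"],
     **judge(generate(m, text, q))}
    for m, (level, text), q in product(MODELS, PROMPTS.items(), QUERIES)
]
assert len(results) == 6 * 3 * 80  # 1,440 scored responses, matching the abstract
```

Aggregating `results` by prompt level and model would then yield the kind of safety-versus-supportiveness comparison the paper reports.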