🤖 AI Summary
This study addresses psychosocial safety risks of large language models (LLMs) in emotionally sensitive domains—such as mental health support and crisis intervention—by systematically evaluating their behaviour across five high-risk dimensions: privacy violations, discriminatory behaviour, mental manipulation, psychological harm, and insulting behaviour. We propose the first psychosocially grounded multi-agent evaluation framework, integrating the LLM-as-a-judge paradigm with four complementary mechanisms: single-agent scoring, dual-agent correction, multi-agent debate, and stochastic majority voting. Risk is quantified in an interpretable, fine-grained, and cross-model manner using a universal three-level scoring rubric. Evaluated on the PKU-SafeRLHF dataset, the framework significantly outperforms baseline methods. It is open-sourced with an interactive web interface; a formative study with 12 domain practitioners confirms its utility for prompt engineering, model auditing, and regulatory oversight.
📝 Abstract
Large language models (LLMs) now mediate many web-based mental-health, crisis, and other emotionally sensitive services, yet their psychosocial safety in these settings remains poorly understood and weakly evaluated. We present DialogGuard, a multi-agent framework for assessing psychosocial risks in LLM-generated responses along five high-severity dimensions: privacy violations, discriminatory behaviour, mental manipulation, psychological harm, and insulting behaviour. DialogGuard can be applied to diverse generative models through four LLM-as-a-judge pipelines (single-agent scoring, dual-agent correction, multi-agent debate, and stochastic majority voting), all grounded in a shared three-level rubric usable by both human annotators and LLM judges. Using PKU-SafeRLHF with its human safety annotations, we show that multi-agent mechanisms detect psychosocial risks more accurately than non-LLM baselines and single-agent judging; dual-agent correction and majority voting provide the best trade-off between accuracy, alignment with human ratings, and robustness, while debate attains higher recall but over-flags borderline cases. We release DialogGuard as open-source software with a web interface that provides per-dimension risk scores and explainable natural-language rationales. A formative study with 12 practitioners illustrates how it supports prompt design, auditing, and supervision of web-facing applications for vulnerable users.
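To make the aggregation idea concrete, the stochastic majority-voting pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `majority_vote` and the 0/1/2 score encoding for the three-level rubric are assumptions for the example, and real judge scores would come from LLM calls rather than hard-coded lists.

```python
import random
from collections import Counter

def majority_vote(scores, rng=None):
    """Aggregate per-judge rubric scores (e.g. 0=safe, 1=borderline, 2=unsafe)
    for one risk dimension by majority vote, breaking ties at random
    (the "stochastic" element of stochastic majority voting).

    `scores` is a list of integer ratings from independent LLM judges.
    """
    rng = rng or random.Random()
    counts = Counter(scores)
    top = max(counts.values())
    winners = [s for s, c in counts.items() if c == top]
    # A unique plurality wins outright; otherwise sample among tied scores.
    return winners[0] if len(winners) == 1 else rng.choice(winners)

# Example: three hypothetical judges rate one response on one dimension.
print(majority_vote([2, 2, 1]))  # -> 2 (clear majority)
print(majority_vote([0, 2], rng=random.Random(0)))  # tie, broken stochastically
```

In a full pipeline this aggregation would run once per risk dimension, with each judge's rating produced by a separate rubric-conditioned LLM call.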