MATRIX: Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation

📅 2025-08-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current clinical dialogue systems lack systematic evaluation methodologies tailored for safety-critical scenarios. To address this, we propose MATRIX—a novel framework that tightly integrates structured safety engineering with scalable dialogue AI assessment, establishing a comprehensive safety evaluation system encompassing clinical risk scenarios, behavioral norms, and failure modes. Our key contributions are threefold: (1) BehvJudge, an expert-validated evaluator achieving F1 = 0.96 in hazardous utterance detection across 240 clinical dialogues—surpassing human clinicians; (2) PatBot, a high-fidelity patient simulation agent covering 14 risk categories across 10 clinical domains, validated for realism in 2,100 simulated interactions; and (3) a regulatory-aligned safety auditing workflow. MATRIX establishes a reproducible, scalable, and interpretable automated assessment paradigm for safety verification of medical AI systems.

📝 Abstract
Despite the growing use of large language models (LLMs) in clinical dialogue systems, existing evaluations focus on task completion or fluency, offering little insight into the behavioral and risk management requirements essential for safety-critical systems. This paper presents MATRIX (Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation), a structured, extensible framework for safety-oriented evaluation of clinical dialogue agents. MATRIX integrates three components: (1) a safety-aligned taxonomy of clinical scenarios, expected system behaviors, and failure modes derived through structured safety engineering methods; (2) BehvJudge, an LLM-based evaluator for detecting safety-relevant dialogue failures, validated against expert clinician annotations; and (3) PatBot, a simulated patient agent capable of producing diverse, scenario-conditioned responses, evaluated for realism and behavioral fidelity through human-factors expert review and a patient-preference study. Across three experiments, we show that MATRIX enables systematic, scalable safety evaluation. BehvJudge with Gemini 2.5-Pro achieves expert-level hazard detection (F1 0.96, sensitivity 0.999), outperforming clinicians in a blinded assessment of 240 dialogues. We also conducted one of the first realism analyses of LLM-based patient simulation, showing that PatBot reliably simulates realistic patient behavior in quantitative and qualitative evaluations. Using MATRIX, we demonstrate its effectiveness in benchmarking five LLM agents across 2,100 simulated dialogues spanning 14 hazard scenarios and 10 clinical domains. MATRIX is the first framework to unify structured safety engineering with scalable, validated conversational AI evaluation, enabling regulator-aligned safety auditing. We release all evaluation tools, prompts, structured scenarios, and datasets.
Problem

Research questions and friction points this paper is trying to address.

Evaluating safety risks in clinical dialogue systems
Detecting safety-relevant dialogue failures systematically
Simulating realistic patient interactions for safety testing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent simulation framework for clinical safety evaluation
LLM-based evaluator detecting safety-relevant dialogue failures
Simulated patient agent producing scenario-conditioned responses
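The judge-based evaluation pattern above can be sketched in miniature: an evaluator labels each dialogue as hazardous or safe, and those labels are scored against expert clinician annotations using sensitivity and F1 (the metrics the paper reports for BehvJudge). In the paper the judge is an LLM (Gemini 2.5-Pro); here `judge_dialogue` is a hypothetical keyword stub so the scoring logic is runnable, and all function and field names are illustrative assumptions rather than the paper's API.

```python
# Minimal sketch of a BehvJudge-style hazard-detection evaluation loop.
# The real judge is an LLM call; `judge_dialogue` is a stand-in stub.

def judge_dialogue(dialogue: str) -> bool:
    """Placeholder for an LLM judge that flags hazardous utterances."""
    return "overdose" in dialogue.lower()  # illustrative heuristic only

def score(predictions: list[bool], gold: list[bool]) -> dict[str, float]:
    """Sensitivity (recall) and F1 of judge labels vs. expert annotations."""
    tp = sum(p and g for p, g in zip(predictions, gold))
    fp = sum(p and not g for p, g in zip(predictions, gold))
    fn = sum(g and not p for p, g in zip(predictions, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return {"sensitivity": sensitivity, "f1": f1}

# Hypothetical mini-benchmark: two dialogues with expert gold labels.
dialogues = [
    "Patient: I doubled my dose yesterday, is overdose a risk?",
    "Patient: My vision is a little blurry after surgery.",
]
gold = [True, False]
preds = [judge_dialogue(d) for d in dialogues]
print(score(preds, gold))  # → {'sensitivity': 1.0, 'f1': 1.0}
```

At benchmark scale, the same loop runs over the full dialogue set (240 dialogues in the paper's blinded comparison) and the resulting sensitivity/F1 are compared against clinician annotators.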
Ernest Lim
Ufonia Limited, University of York
Yajie Vera He
Ufonia Limited
Jared Joselowitz
Ufonia Limited
Kate Preston
University of York
Mohita Chowdhury
Ufonia Limited
Louis Williams
Ufonia Limited
Aisling Higham
Ufonia Limited
Katrina Mason
Ufonia Limited
Mariane Melo
Ufonia Limited
Tom Lawton
Improvement Academy, Bradford Institute for Health Research
critical care, routine data, simulation modelling, artificial intelligence, safety
Yan Jia
University of York
Ibrahim Habli
Professor of Safety-Critical Systems at the University of York
Safety, AI Safety, Autonomous Systems, Software Engineering