🤖 AI Summary
Clinical chatbots face serious safety challenges: without reliable safe-response generation, they risk misdiagnosis or harmful outputs. To address this, we propose TACOS, a fine-grained 21-class safety taxonomy designed specifically for clinical conversational agents. TACOS is the first framework to jointly model safety control and tool invocation within user intent recognition, enabling differentiated safety thresholds for clinical versus non-clinical queries and explicit modeling of external tool dependencies. Leveraging a human-annotated TACOS dataset, we develop an end-to-end safety intent classifier and tool router built on pretrained language models, supporting coordinated multi-level safety policies. Experiments demonstrate substantial improvements in safe-response accuracy. Our analysis further shows that both the training data distribution and the base model's prior knowledge critically influence safety performance, two factors previously underexplored in clinical LLM safety research.
📝 Abstract
Safety is a paramount concern in clinical chatbot applications, where inaccurate or harmful responses can lead to serious consequences. Existing methods, such as guardrails and tool calling, often fall short of the nuanced demands of the clinical domain. In this paper, we introduce TACOS (TAxonomy of COmprehensive Safety for Clinical Agents), a fine-grained, 21-class taxonomy that integrates safety filtering and tool selection into a single user intent classification step. The taxonomy covers a wide spectrum of clinical and non-clinical queries, explicitly modeling varying safety thresholds and external tool dependencies. To validate our framework, we curate a TACOS-annotated dataset and perform extensive experiments. Our results demonstrate the value of a taxonomy specialized for clinical agent settings, and reveal useful insights about training data distribution and the pretrained knowledge of base models.
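The core idea, collapsing safety filtering and tool selection into one intent-classification step whose predicted class determines both the safety policy and any external tool, can be sketched as below. This is a minimal illustration only: the class names, policies, and tools are invented placeholders, the keyword classifier stands in for the paper's fine-tuned language model, and none of it reproduces the actual 21-class TACOS taxonomy.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Route:
    """Outcome of the single intent-classification step."""
    safety_policy: str          # e.g. answer / answer_with_disclaimer / refuse_and_redirect
    tool: Optional[str] = None  # external tool dependency, if the intent requires one

# Toy routing table keyed by intent class (hypothetical class names, not TACOS classes)
ROUTES = {
    "clinical_drug_interaction": Route("answer_with_disclaimer", tool="drug_db_lookup"),
    "clinical_emergency":        Route("refuse_and_redirect"),
    "nonclinical_smalltalk":     Route("answer"),
}

def classify_intent(query: str) -> str:
    """Stand-in for the fine-tuned LM classifier: simple keyword rules."""
    q = query.lower()
    if "interact" in q or "drug" in q:
        return "clinical_drug_interaction"
    if "chest pain" in q or "emergency" in q:
        return "clinical_emergency"
    return "nonclinical_smalltalk"

def route(query: str) -> Route:
    """One classification call yields both the safety policy and the tool."""
    return ROUTES[classify_intent(query)]
```

The point of the sketch is the shape of the interface: because safety threshold and tool dependency are both properties of the predicted intent class, the agent needs only one classification pass per query instead of separate guardrail and tool-selection stages.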