CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing safety evaluation benchmarks for large language models (LLMs) in healthcare lack clinical specificity, fine-grained harm categorization, and comprehensive coverage of jailbreaking attacks. Method: We propose CARES, the first safety and adversarial robustness benchmark tailored to medical LLMs, comprising 18,000+ clinically grounded prompts structured along eight safety principles, four harm severity levels, and four prompt styles. It introduces a ternary response protocol (Accept/Caution/Refuse) and a quantitative Safety Score. The methodology integrates multi-level human annotation, a lightweight jailbreak detector, and reminder-based conditioning of model responses. Contribution/Results: A systematic evaluation of 20+ state-of-the-art medical LLMs reveals that role-playing and steganographic prompts reduce refusal rates by up to 47%; the proposed mitigation improves safety response accuracy by 32%, substantially reducing both over-refusal and under-refusal failures.

📝 Abstract
Large language models (LLMs) are increasingly deployed in medical contexts, raising critical concerns about safety, alignment, and susceptibility to adversarial manipulation. While prior benchmarks assess model refusal capabilities for harmful prompts, they often lack clinical specificity, graded harmfulness levels, and coverage of jailbreak-style attacks. We introduce CARES (Clinical Adversarial Robustness and Evaluation of Safety), a benchmark for evaluating LLM safety in healthcare. CARES includes over 18,000 prompts spanning eight medical safety principles, four harm levels, and four prompting styles (direct, indirect, obfuscated, and role-play) to simulate both malicious and benign use cases. We propose a three-way response evaluation protocol (Accept, Caution, Refuse) and a fine-grained Safety Score metric to assess model behavior. Our analysis reveals that many state-of-the-art LLMs remain vulnerable to jailbreaks that subtly rephrase harmful prompts, while also over-refusing safe but atypically phrased queries. Finally, we propose a mitigation strategy using a lightweight classifier to detect jailbreak attempts and steer models toward safer behavior via reminder-based conditioning. CARES provides a rigorous framework for testing and improving medical LLM safety under adversarial and ambiguous conditions.
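The three-way protocol and Safety Score can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact scoring formula is not given in the abstract, so the partial-credit weighting for Caution responses here is a hypothetical choice.

```python
from enum import Enum

class Response(Enum):
    """Ternary response labels from the CARES evaluation protocol."""
    ACCEPT = "accept"
    CAUTION = "caution"
    REFUSE = "refuse"

def safety_score(expected: list[Response], observed: list[Response]) -> float:
    """Illustrative Safety Score: fraction of prompts where the model's
    ternary label matches the expected one, with partial credit when
    either side is Caution (assumed weighting; the paper's formula may differ)."""
    total = 0.0
    for exp, obs in zip(expected, observed):
        if obs == exp:
            total += 1.0  # exact agreement
        elif Response.CAUTION in (exp, obs):
            total += 0.5  # hedged behavior gets partial credit
    return total / len(expected)
```

Under this scheme, a model that refuses where it should accept (or vice versa) scores zero on that prompt, capturing both under-refusal and over-refusal in a single metric.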
Problem

Research questions and friction points this paper is trying to address.

Evaluates safety and adversarial robustness in medical LLMs
Addresses lack of clinical specificity in existing benchmarks
Mitigates jailbreak vulnerabilities and over-refusal in medical queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CARES benchmark for medical LLM safety
Proposes three-way response evaluation protocol
Uses lightweight classifier for jailbreak detection
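The detect-then-remind pipeline above can be sketched as a few lines of Python. The patterns and reminder text here are hypothetical stand-ins: the paper's detector is a trained lightweight classifier, not a regex list, but the control flow (flag a suspected jailbreak, then prepend a safety reminder before the model sees the prompt) matches the mitigation strategy described in the abstract.

```python
import re

# Hypothetical trigger patterns for the jailbreak styles CARES covers
# (role-play, obfuscation, indirect phrasing); a real detector would be learned.
JAILBREAK_PATTERNS = [
    r"pretend you are",
    r"role[- ]?play",
    r"ignore (all|previous) (instructions|rules)",
    r"decode the following",
]

SAFETY_REMINDER = (
    "Reminder: you are a medical assistant. Refuse harmful requests, "
    "answer benign ones, and add cautions for ambiguous ones."
)

def detect_jailbreak(prompt: str) -> bool:
    """Flag prompts matching any known jailbreak pattern (illustrative detector)."""
    return any(re.search(p, prompt, re.IGNORECASE) for p in JAILBREAK_PATTERNS)

def condition_prompt(prompt: str) -> str:
    """Prepend a safety reminder when a jailbreak attempt is detected
    (reminder-based conditioning); pass benign prompts through unchanged."""
    if detect_jailbreak(prompt):
        return f"{SAFETY_REMINDER}\n\n{prompt}"
    return prompt
```

Benign queries reach the model untouched, so the conditioning step adds overhead only on flagged inputs, which is what keeps the mitigation lightweight.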
👥 Authors
Sijia Chen (Northeastern University)
Xiaomin Li (Harvard University)
Mengxue Zhang (UMass Amherst)
Eric Hanchen Jiang (UCLA)
Qingcheng Zeng (Northwestern University)
Chen-Hsiang Yu (Northeastern University)