MPIB: A Benchmark for Medical Prompt Injection Attacks and Clinical Safety in LLMs

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the vulnerability of large language models (LLMs) to prompt injection attacks in clinical settings, where manipulated outputs can jeopardize patient safety. To mitigate this risk, the authors propose a clinically grounded evaluation framework for prompt injection robustness, introducing the Medical Prompt Injection Benchmark (MPIB): a dataset of adversarial examples validated through a multi-stage review process. The study further presents a joint evaluation scheme combining the Clinical Harm Event Rate (CHER) with the Attack Success Rate (ASR) to differentiate mere instruction adherence from actual patient risk. Through experiments covering both direct injection and RAG-mediated indirect injection, the research shows that attack location critically influences model robustness. The MPIB dataset and associated evaluation tools are publicly released to support future research in clinical AI safety.

📝 Abstract
Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems are increasingly integrated into clinical workflows; however, prompt injection attacks can steer these systems toward clinically unsafe or misleading outputs. We introduce the Medical Prompt Injection Benchmark (MPIB), a dataset-and-benchmark suite for evaluating clinical safety under both direct prompt injection and indirect, RAG-mediated injection across clinically grounded tasks. MPIB emphasizes outcome-level risk via the Clinical Harm Event Rate (CHER), which measures high-severity clinical harm events under a clinically grounded taxonomy, and reports CHER alongside Attack Success Rate (ASR) to disentangle instruction compliance from downstream patient risk. The benchmark comprises 9,697 curated instances constructed through multi-stage quality gates and clinical safety linting. Evaluating MPIB across a diverse set of baseline LLMs and defense configurations, we find that ASR and CHER can diverge substantially, and that robustness depends critically on whether adversarial instructions appear in the user query or in retrieved context. We release MPIB with evaluation code, adversarial baselines, and comprehensive documentation to support reproducible and systematic research on clinical prompt injection. Code and data are available at GitHub (code) and Hugging Face (data).
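The abstract's central point is that ASR and CHER can diverge: a model may comply with an injected instruction without causing clinical harm, or produce a harmful output without strictly complying. A minimal sketch of how such a divergence shows up numerically is below; the field names (`complied`, `harm_severity`) and the severity threshold are illustrative assumptions, not MPIB's actual schema or taxonomy.

```python
# Illustrative sketch: ASR vs. CHER on per-attack evaluation records.
# "complied" and "harm_severity" are hypothetical field names, not MPIB's schema.

def attack_success_rate(results):
    """Fraction of attacks where the model followed the injected instruction."""
    return sum(r["complied"] for r in results) / len(results)

def clinical_harm_event_rate(results, severity_threshold=3):
    """Fraction of attacks whose output is a high-severity clinical harm event."""
    return sum(r["harm_severity"] >= severity_threshold for r in results) / len(results)

results = [
    {"complied": True,  "harm_severity": 4},  # compliance that causes harm
    {"complied": True,  "harm_severity": 0},  # compliance, clinically benign
    {"complied": True,  "harm_severity": 1},  # compliance, low-severity outcome
    {"complied": False, "harm_severity": 0},  # refusal, no harm
]

asr = attack_success_rate(results)        # 0.75
cher = clinical_harm_event_rate(results)  # 0.25
```

The point of reporting both numbers, as the paper argues, is that a high ASR with a low CHER indicates instruction compliance that is mostly clinically benign, while the reverse would indicate harm arising even without overt compliance.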
Problem

Research questions and friction points this paper is trying to address.

prompt injection · clinical safety · large language models · retrieval-augmented generation · medical AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

prompt injection · clinical safety · retrieval-augmented generation · harm evaluation · large language models