🤖 AI Summary
This work addresses a critical gap in the evaluation of medical large language models (LLMs), which has predominantly focused on accuracy while overlooking privacy risks arising from the recombination of fine-grained medical details in retrieval-augmented generation (RAG) systems—potentially enabling patient re-identification. To bridge this gap, the study proposes the first joint privacy–utility evaluation framework tailored for open-domain medical question answering. It leverages a multi-agent and human-in-the-loop approach to synthesize sensitive contexts and queries, and introduces an automated privacy leakage detection method based on RoBERTa-NLI. Experiments across nine mainstream LLMs reveal a prevalent trade-off between privacy preservation and utility. The proposed automated evaluator achieves an average agreement rate of 85.9% with human expert judgments, thereby establishing a foundational benchmark for privacy-compliant assessment in medical AI.
📝 Abstract
Recent advances in Retrieval-Augmented Generation (RAG) have enabled large language models (LLMs) to ground outputs in clinical evidence. However, connecting LLMs with external databases introduces the risk of contextual leakage: a subtle privacy threat where unique combinations of medical details enable patient re-identification even without explicit identifiers. Current benchmarks in healthcare heavily focus on accuracy, ignoring such privacy issues, despite strict regulations like Health Insurance Portability and Accountability Act (HIPAA) and General Data Protection Regulation (GDPR). To fill this gap, we present MedPriv-Bench, the first benchmark specifically designed to jointly evaluate privacy preservation and clinical utility in medical open-ended question answering. Our framework utilizes a multi-agent, human-in-the-loop pipeline to synthesize sensitive medical contexts and clinically relevant queries that create realistic privacy pressure. We establish a standardized evaluation protocol leveraging a pre-trained RoBERTa-Natural Language Inference (NLI) model as an automated judge to quantify data leakage, achieving an average of 85.9% alignment with human experts. Through an extensive evaluation of 9 representative LLMs, we demonstrate a pervasive privacy-utility trade-off. Our findings underscore the necessity of domain-specific benchmarks to validate the safety and efficacy of medical AI systems in privacy-sensitive environments.