MedPriv-Bench: Benchmarking the Privacy-Utility Trade-off of Large Language Models in Medical Open-Ended Question Answering

📅 2026-03-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical gap in the evaluation of medical large language models (LLMs), which has predominantly focused on accuracy while overlooking privacy risks arising from the recombination of fine-grained medical details in retrieval-augmented generation (RAG) systems, a risk that can enable patient re-identification. To bridge this gap, the study proposes the first joint privacy-utility evaluation framework tailored for open-ended medical question answering. It leverages a multi-agent, human-in-the-loop approach to synthesize sensitive contexts and queries, and introduces an automated privacy-leakage detection method based on RoBERTa-NLI. Experiments across nine mainstream LLMs reveal a pervasive trade-off between privacy preservation and utility. The proposed automated evaluator achieves an average agreement rate of 85.9% with human expert judgments, establishing a foundational benchmark for privacy-compliant assessment in medical AI.

📝 Abstract
Recent advances in Retrieval-Augmented Generation (RAG) have enabled large language models (LLMs) to ground outputs in clinical evidence. However, connecting LLMs with external databases introduces the risk of contextual leakage: a subtle privacy threat where unique combinations of medical details enable patient re-identification even without explicit identifiers. Current benchmarks in healthcare heavily focus on accuracy, ignoring such privacy issues, despite strict regulations like Health Insurance Portability and Accountability Act (HIPAA) and General Data Protection Regulation (GDPR). To fill this gap, we present MedPriv-Bench, the first benchmark specifically designed to jointly evaluate privacy preservation and clinical utility in medical open-ended question answering. Our framework utilizes a multi-agent, human-in-the-loop pipeline to synthesize sensitive medical contexts and clinically relevant queries that create realistic privacy pressure. We establish a standardized evaluation protocol leveraging a pre-trained RoBERTa-Natural Language Inference (NLI) model as an automated judge to quantify data leakage, achieving an average of 85.9% alignment with human experts. Through an extensive evaluation of 9 representative LLMs, we demonstrate a pervasive privacy-utility trade-off. Our findings underscore the necessity of domain-specific benchmarks to validate the safety and efficacy of medical AI systems in privacy-sensitive environments.
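The abstract's automated judge scores whether a model's answer entails (and thus leaks) sensitive facts from the retrieved context. A minimal sketch of that idea is shown below; the function names, the threshold value, and the substring-based stub scorer are illustrative assumptions, not the paper's implementation, which uses a pre-trained RoBERTa-NLI model as the entailment scorer.

```python
from typing import Callable, List

def detect_leakage(
    context_facts: List[str],
    model_answer: str,
    nli_entail_prob: Callable[[str, str], float],
    threshold: float = 0.5,  # illustrative cutoff, not from the paper
) -> List[str]:
    """Flag sensitive context facts that the model's answer entails.

    nli_entail_prob(premise, hypothesis) -> P(entailment); in practice
    this callable could be backed by a RoBERTa-NLI checkpoint, but any
    entailment scorer with this signature works (hypothetical wiring).
    """
    return [
        fact
        for fact in context_facts
        if nli_entail_prob(model_answer, fact) >= threshold
    ]

# Trivial stub scorer for illustration: entailment = substring match.
def stub_scorer(premise: str, hypothesis: str) -> float:
    return 1.0 if hypothesis.lower() in premise.lower() else 0.0

facts = ["the patient is 42 years old", "diagnosed with lupus"]
answer = "Given that the patient is 42 years old, monitor renal function."
leaked = detect_leakage(facts, answer, stub_scorer)
# leaked contains only the age fact, which the answer restates verbatim
```

Framing leakage as entailment, rather than exact string matching, lets the judge catch paraphrased disclosures, which is presumably why an NLI model is used as the scorer.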
Problem

Research questions and friction points this paper is trying to address.

privacy leakage
medical question answering
LLM benchmarking
contextual re-identification
privacy-utility trade-off
Innovation

Methods, ideas, or system contributions that make the work stand out.

privacy-utility trade-off
medical LLM benchmarking
contextual leakage
human-in-the-loop synthesis
automated privacy evaluation
Shaowei Guan
Centre for Smart Health, School of Nursing, The Hong Kong Polytechnic University, Hong Kong
Yu Zhai
Department of Language Science and Technology, The Hong Kong Polytechnic University, Hong Kong
Hin Chi Kwok
Centre for Smart Health, School of Nursing, The Hong Kong Polytechnic University, Hong Kong
Jiawei Du
National Taiwan University; ex-Intern @ Samsung Research
Speech processing, Neural coding, Generative AI, AI security
Xinyu Feng
Centre for Smart Health, School of Nursing, The Hong Kong Polytechnic University, Hong Kong
Jing Li
Associate Professor, The Hong Kong Polytechnic University
Natural Language Processing, Human-Centered AI, Embodied Artificial Intelligence
Harry Qin
Centre for Smart Health, School of Nursing, The Hong Kong Polytechnic University, Hong Kong
Vivian Hui
Centre for Smart Health, School of Nursing, The Hong Kong Polytechnic University, Hong Kong