MedPriv-Bench: Benchmarking the Privacy-Utility Trade-off of Large Language Models in Medical Open-End Question Answering

📅 2026-03-15

📈 Citations: 0

✨ Influential: 0

career value

201K/year

🤖 AI Summary

This work addresses a critical gap in the evaluation of medical large language models (LLMs), which has predominantly focused on accuracy while overlooking privacy risks arising from the recombination of fine-grained medical details in retrieval-augmented generation (RAG) systems—potentially enabling patient re-identification. To bridge this gap, the study proposes the first joint privacy–utility evaluation framework tailored for open-domain medical question answering. It leverages a multi-agent and human-in-the-loop approach to synthesize sensitive contexts and queries, and introduces an automated privacy leakage detection method based on RoBERTa-NLI. Experiments across nine mainstream LLMs reveal a prevalent trade-off between privacy preservation and utility. The proposed automated evaluator achieves an average agreement rate of 85.9% with human expert judgments, thereby establishing a foundational benchmark for privacy-compliant assessment in medical AI.

Technology Category

Application Category

📝 Abstract

Recent advances in Retrieval-Augmented Generation (RAG) have enabled large language models (LLMs) to ground outputs in clinical evidence. However, connecting LLMs with external databases introduces the risk of contextual leakage: a subtle privacy threat where unique combinations of medical details enable patient re-identification even without explicit identifiers. Current benchmarks in healthcare heavily focus on accuracy, ignoring such privacy issues, despite strict regulations like Health Insurance Portability and Accountability Act (HIPAA) and General Data Protection Regulation (GDPR). To fill this gap, we present MedPriv-Bench, the first benchmark specifically designed to jointly evaluate privacy preservation and clinical utility in medical open-ended question answering. Our framework utilizes a multi-agent, human-in-the-loop pipeline to synthesize sensitive medical contexts and clinically relevant queries that create realistic privacy pressure. We establish a standardized evaluation protocol leveraging a pre-trained RoBERTa-Natural Language Inference (NLI) model as an automated judge to quantify data leakage, achieving an average of 85.9% alignment with human experts. Through an extensive evaluation of 9 representative LLMs, we demonstrate a pervasive privacy-utility trade-off. Our findings underscore the necessity of domain-specific benchmarks to validate the safety and efficacy of medical AI systems in privacy-sensitive environments.

Problem

Research questions and friction points this paper is trying to address.

privacy leakage

medical question answering

LLM benchmarking

contextual re-identification

privacy-utility trade-off

Innovation

Methods, ideas, or system contributions that make the work stand out.

privacy-utility trade-off

medical LLM benchmarking

contextual leakage