BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text

📅 2025-04-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing clinical LLM evaluation benchmarks rely predominantly on medical examination questions or PubMed abstracts, failing to reflect the complexity of real-world electronic health records (EHRs) and suffering from limited linguistic, specialty, and task diversity. Method: We introduce BRIDGE, the first EHR-centric, multilingual (9 languages), multispecialty, multitask (87 tasks) clinical text understanding benchmark, designed to rigorously evaluate the generalization capabilities of 52 state-of-the-art LLMs. We employ a standardized evaluation protocol and three inference paradigms (zero-shot, few-shot, chain-of-thought), for a total of 13,572 experiments (52 models × 87 tasks × 3 inference strategies). Contribution/Results: Our analysis reveals that (1) top open-weight models match or surpass closed-source counterparts; (2) medically fine-tuned LLMs built on older architectures often underperform modern general-purpose models; and (3) performance varies significantly across language, clinical specialty, and task type. BRIDGE is publicly released with a dynamic leaderboard, establishing a reproducible reference for clinical LLM evaluation.

📝 Abstract
Large language models (LLMs) hold great promise for medical applications and are evolving rapidly, with new models being released at an accelerated pace. However, current evaluations of LLMs in clinical contexts remain limited. Most existing benchmarks rely on medical exam-style questions or PubMed-derived text, failing to capture the complexity of real-world electronic health record (EHR) data. Others focus narrowly on specific application scenarios, limiting their generalizability across broader clinical use. To address this gap, we present BRIDGE, a comprehensive multilingual benchmark comprising 87 tasks sourced from real-world clinical data across nine languages. We systematically evaluated 52 state-of-the-art LLMs (including DeepSeek-R1, GPT-4o, Gemini, and Llama 4) under various inference strategies. Across a total of 13,572 experiments, our results reveal substantial performance variation across model sizes, languages, natural language processing tasks, and clinical specialties. Notably, we demonstrate that open-source LLMs can achieve performance comparable to proprietary models, while medically fine-tuned LLMs built on older architectures often underperform newer general-purpose models. BRIDGE and its corresponding leaderboard serve as a foundational resource and a unique reference for the development and evaluation of new LLMs in real-world clinical text understanding.
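As a concrete illustration of the inference strategies compared in the paper, the following minimal sketch shows how zero-shot, few-shot, and chain-of-thought prompts differ for a clinical classification task. It is a generic illustration, not the authors' released evaluation code; the task instruction, labels, and example note are hypothetical placeholders.

```python
# Minimal sketch of the three inference paradigms compared in BRIDGE.
# Hypothetical task: smoking-status classification from a clinical note.
from typing import List, Tuple

TASK_INSTRUCTION = (
    "Classify the patient's smoking status from the clinical note. "
    "Answer with one of: CURRENT, PAST, NEVER, UNKNOWN."
)

def zero_shot(note: str) -> str:
    # Zero-shot: task instruction plus the input, nothing else.
    return f"{TASK_INSTRUCTION}\n\nNote: {note}\nAnswer:"

def few_shot(note: str, demos: List[Tuple[str, str]]) -> str:
    # Few-shot: labeled demonstrations are prepended before the query.
    shots = "\n\n".join(f"Note: {d}\nAnswer: {a}" for d, a in demos)
    return f"{TASK_INSTRUCTION}\n\n{shots}\n\nNote: {note}\nAnswer:"

def chain_of_thought(note: str) -> str:
    # CoT: the model is asked to reason before committing to a label.
    return (
        f"{TASK_INSTRUCTION}\n\nNote: {note}\n"
        "Think step by step, then give the final label on the last line."
    )

if __name__ == "__main__":
    demos = [("Smokes 1 pack/day for 20 years.", "CURRENT")]
    note = "Pt quit smoking 10 years ago; denies current tobacco use."
    print(zero_shot(note))
    print(few_shot(note, demos))
    print(chain_of_thought(note))
```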
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs in real-world clinical text understanding
Addressing limitations of current medical benchmarks
Assessing multilingual performance across clinical specialties
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual benchmark from real-world clinical data
Evaluated 52 LLMs across diverse clinical tasks (see the aggregation sketch after this list)
Open-source models match proprietary model performance
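The headline experiment count follows directly from the study design: 52 models × 87 tasks × 3 inference strategies = 13,572 experiments, with results then sliced by language and clinical specialty. The sketch below shows one plausible way to aggregate such per-task scores into per-language and per-specialty means; the record structure and scores are hypothetical, and this is not the authors' released evaluation code.

```python
# Minimal sketch (hypothetical record structure, not the released code):
# aggregate per-(model, task) scores into per-language and per-specialty
# means, the kind of slicing behind the paper's variation findings.
from collections import defaultdict
from statistics import mean

results = [  # one record per (model, task) evaluation; values are made up
    {"model": "gpt-4o",  "language": "en", "specialty": "cardiology", "score": 0.81},
    {"model": "gpt-4o",  "language": "zh", "specialty": "oncology",   "score": 0.74},
    {"model": "llama-4", "language": "en", "specialty": "cardiology", "score": 0.79},
    {"model": "llama-4", "language": "zh", "specialty": "oncology",   "score": 0.70},
]

def slice_mean(records: list, key: str) -> dict:
    # Group scores by (model, slice value) and average within each group.
    groups = defaultdict(list)
    for r in records:
        groups[(r["model"], r[key])].append(r["score"])
    return {k: mean(v) for k, v in groups.items()}

print(slice_mean(results, "language"))   # per-model, per-language means
print(slice_mean(results, "specialty"))  # per-model, per-specialty means
```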
👥 Authors

Jiageng Wu
Harvard University
Public health · Digital healthcare
Bowen Gu
Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
Ren Zhou
Siebel School of Computing and Data Science, The Grainger College of Engineering, University of Illinois Urbana-Champaign, Urbana, IL, USA
Kevin Xie
University of Toronto
Doug Snyder
Department of Otorhinolaryngology – Head & Neck Surgery, Mayo Clinic, Rochester, MN, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA, USA
Yixing Jiang
Stanford
Valentina Carducci
Department of Otorhinolaryngology – Head & Neck Surgery, Mayo Clinic, Rochester, MN, USA
Richard Wyss
Brigham and Women's Hospital/Harvard Medical School
Pharmacoepidemiology · Healthcare Data Science · Medical Informatics · Causal Inference
Rishi J Desai
Brigham & Women's Hospital/Harvard Medical School
Pharmacoepidemiology
Emily Alsentzer
Assistant Professor, Stanford University
machine learning for healthcare
Leo Anthony Celi
Massachusetts Institute of Technology
Adam Rodman
Assistant Professor of Medicine, Harvard Medical School
Clinical reasoning · AI · Digital education · Medical history
Sebastian Schneeweiss
Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
Jonathan H. Chen
Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA; Division of Hospital Medicine, Stanford University, Stanford, CA, USA; Stanford Clinical Excellence Research Center, Stanford University, Stanford, CA, USA
Santiago Romero-Brufau
Assistant Professor, Mayo Clinic
Early warning scores · Machine learning · Clinical implementation · Clinical informatics
Kueiyu Joshua Lin
Harvard Medical School
Jie Yang
Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA; Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, MA, USA; Broad Institute of MIT and Harvard, Cambridge, MA, USA; Harvard Data Science Initiative, Harvard University, Cambridge, MA, USA