CliniBench: A Clinical Outcome Prediction Benchmark for Generative and Encoder-Based Language Models

📅 2025-09-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study systematically evaluates the performance gap between generative large language models (LLMs) and encoder-based classifiers in predicting discharge diagnoses in real-world clinical settings. Method: The authors introduce CliniBench—the first clinical outcome prediction benchmark designed for fair comparison between these two model families—built on MIMIC-IV admission records. Experiments cover 12 generative LLMs and 3 encoder-based classifiers, and several retrieval augmentation strategies for in-context learning from similar patients are assessed. Contribution/Results: Encoder-based classifiers consistently outperform generative LLMs; however, retrieval-augmented in-context learning substantially improves LLM accuracy, narrowing the performance gap. The work establishes a standardized evaluation setup for clinical diagnosis prediction and provides empirical evidence to inform model selection and deployment decisions in healthcare AI.

📝 Abstract
With their growing capabilities, generative large language models (LLMs) are being increasingly investigated for complex medical tasks. However, their effectiveness in real-world clinical applications remains underexplored. To address this, we present CliniBench, the first benchmark that enables comparability of well-studied encoder-based classifiers and generative LLMs for discharge diagnosis prediction from admission notes in the MIMIC-IV dataset. Our extensive study compares 12 generative LLMs and 3 encoder-based classifiers and demonstrates that encoder-based classifiers consistently outperform generative models in diagnosis prediction. We assess several retrieval augmentation strategies for in-context learning from similar patients and find that they provide notable performance improvements for generative LLMs.
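The retrieval augmentation the abstract describes—selecting similar patients' admission notes and their known diagnoses as in-context examples—can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy corpus, the bag-of-words similarity (a stand-in for a learned retriever), and all function names are assumptions for demonstration.

```python
import math
from collections import Counter

def embed(text):
    """Bag-of-words term-frequency vector (stand-in for a learned embedder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_note, corpus, k=2):
    """Return the k admissions most similar to the query note."""
    q = embed(query_note)
    ranked = sorted(corpus, key=lambda ex: cosine(q, embed(ex["note"])), reverse=True)
    return ranked[:k]

def build_prompt(query_note, examples):
    """Assemble a few-shot prompt from retrieved similar admissions."""
    parts = ["Predict the discharge diagnoses for the final admission note.\n"]
    for ex in examples:
        parts.append(f"Admission note: {ex['note']}\nDiagnoses: {ex['diagnoses']}\n")
    parts.append(f"Admission note: {query_note}\nDiagnoses:")
    return "\n".join(parts)

# Toy corpus of (note, diagnoses) pairs; in the paper these come from MIMIC-IV.
corpus = [
    {"note": "chest pain radiating to left arm elevated troponin",
     "diagnoses": "acute myocardial infarction"},
    {"note": "fever productive cough infiltrate on chest xray",
     "diagnoses": "pneumonia"},
    {"note": "polyuria polydipsia elevated glucose",
     "diagnoses": "type 2 diabetes mellitus"},
]
query = "crushing chest pain troponin rise"
prompt = build_prompt(query, retrieve(query, corpus, k=2))
```

The resulting prompt, which interleaves retrieved notes with their diagnoses before the query note, would then be sent to the generative LLM; the encoder-based baselines instead classify the admission note directly against the diagnosis label space.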
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' effectiveness in clinical outcome prediction
Comparing generative and encoder models for diagnosis prediction
Assessing retrieval augmentation for improving LLM performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark compares encoder and generative models
Retrieval augmentation improves generative model performance
Encoder classifiers outperform generative models in diagnosis prediction