🤖 AI Summary
This study systematically evaluates the performance gap between generative large language models (LLMs) and encoder-based classifiers for predicting discharge diagnoses in real-world clinical settings. Method: We introduce CliniBench—the first clinical outcome prediction benchmark designed for fair comparison between these two model families—built on MIMIC-IV admission records. Experiments cover 12 generative LLMs and 3 encoder-based classifiers, and we assess retrieval-augmented in-context learning (RAG-ICL) strategies that supply examples from similar patients to improve generative diagnosis prediction. Contribution/Results: Encoder-based classifiers consistently outperform generative LLMs; however, RAG-ICL yields notable accuracy gains for LLMs, narrowing the performance gap. This work establishes a standardized evaluation paradigm for clinical diagnosis prediction and provides empirical evidence to inform model selection, deployment strategies, and regulatory considerations in healthcare AI.
📝 Abstract
With their growing capabilities, generative large language models (LLMs) are increasingly being investigated for complex medical tasks. However, their effectiveness in real-world clinical applications remains underexplored. To address this, we present CliniBench, the first benchmark that enables direct comparison between well-studied encoder-based classifiers and generative LLMs for discharge diagnosis prediction from admission notes in the MIMIC-IV dataset. Our extensive study compares 12 generative LLMs and 3 encoder-based classifiers and demonstrates that encoder-based classifiers consistently outperform generative models in diagnosis prediction. We assess several retrieval augmentation strategies for in-context learning from similar patients and find that they provide notable performance improvements for generative LLMs.
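The retrieval-augmented in-context learning idea described above can be sketched in a few lines: retrieve admissions similar to the query note, then prepend those cases (note plus known discharge diagnoses) as few-shot examples in the LLM prompt. The sketch below is illustrative only, assuming a simple bag-of-words cosine retriever and a hypothetical mini-corpus with ICD-style codes; the paper's actual retrieval strategies, data, and prompt format may differ.

```python
from collections import Counter
from math import sqrt

def bow_vector(text):
    """Lowercased bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_examples(query_note, corpus, k=2):
    """Return the k corpus entries most similar to the query note."""
    q = bow_vector(query_note)
    ranked = sorted(corpus, key=lambda ex: cosine(q, bow_vector(ex["note"])), reverse=True)
    return ranked[:k]

def build_prompt(query_note, examples):
    """Prepend retrieved patient cases as in-context examples for the LLM."""
    parts = [
        f"Admission note: {ex['note']}\nDischarge diagnoses: {', '.join(ex['codes'])}"
        for ex in examples
    ]
    parts.append(f"Admission note: {query_note}\nDischarge diagnoses:")
    return "\n\n".join(parts)

# Hypothetical mini-corpus of admission notes with known discharge diagnoses.
corpus = [
    {"note": "chest pain radiating to left arm elevated troponin", "codes": ["I21.4"]},
    {"note": "productive cough fever infiltrate on chest x-ray", "codes": ["J18.9"]},
    {"note": "polyuria polydipsia elevated blood glucose", "codes": ["E11.9"]},
]

query = "acute chest pain with elevated troponin levels"
examples = retrieve_examples(query, corpus, k=1)
prompt = build_prompt(query, examples)
```

In a full pipeline, the bag-of-words retriever would typically be replaced by a learned embedding model, and `prompt` would be sent to the generative LLM, whose completion is parsed into predicted diagnosis codes.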