🤖 AI Summary
This study systematically evaluates the performance gap between generative large language models (LLMs) and encoder-based classifiers for predicting discharge diagnoses in real-world clinical settings. Method: We introduce CliniBench—the first clinical outcome prediction benchmark designed for fair comparison between these two model families—built on MIMIC-IV admission records. Experiments cover 12 generative LLMs and 3 encoder-based classifiers, and we assess retrieval-augmented in-context learning (RAG-ICL) strategies that supply examples from similar patients to improve generative diagnosis prediction. Contribution/Results: Encoder-based classifiers consistently outperform generative LLMs; however, RAG-ICL yields notable accuracy gains for LLMs, narrowing the performance gap. This work establishes a standardized evaluation paradigm for clinical diagnosis prediction and provides empirical evidence to inform model selection, deployment strategies, and regulatory considerations in healthcare AI.
📝 Abstract
With their growing capabilities, generative large language models (LLMs) are increasingly being investigated for complex medical tasks. However, their effectiveness in real-world clinical applications remains underexplored. To address this, we present CliniBench, the first benchmark that enables direct comparison between well-studied encoder-based classifiers and generative LLMs for discharge diagnosis prediction from admission notes in the MIMIC-IV dataset. Our extensive study compares 12 generative LLMs and 3 encoder-based classifiers and demonstrates that encoder-based classifiers consistently outperform generative models in diagnosis prediction. We assess several retrieval augmentation strategies for in-context learning from similar patients and find that they provide notable performance improvements for generative LLMs.
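The retrieval-augmented in-context learning idea described above can be sketched in a few lines: retrieve admissions similar to the query note, then prepend those cases (note plus known discharge diagnoses) as few-shot examples in the LLM prompt. The sketch below is illustrative only, assuming a simple bag-of-words cosine retriever and a hypothetical mini-corpus with ICD-style codes; the paper's actual retrieval strategies, data, and prompt format may differ.

```python
from collections import Counter
from math import sqrt

def bow_vector(text):
    """Lowercased bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_examples(query_note, corpus, k=2):
    """Return the k corpus entries most similar to the query note."""
    q = bow_vector(query_note)
    ranked = sorted(corpus, key=lambda ex: cosine(q, bow_vector(ex["note"])), reverse=True)
    return ranked[:k]

def build_prompt(query_note, examples):
    """Prepend retrieved patient cases as in-context examples for the LLM."""
    parts = [
        f"Admission note: {ex['note']}\nDischarge diagnoses: {', '.join(ex['codes'])}"
        for ex in examples
    ]
    parts.append(f"Admission note: {query_note}\nDischarge diagnoses:")
    return "\n\n".join(parts)

# Hypothetical mini-corpus of admission notes with known discharge diagnoses.
corpus = [
    {"note": "chest pain radiating to left arm elevated troponin", "codes": ["I21.4"]},
    {"note": "productive cough fever infiltrate on chest x-ray", "codes": ["J18.9"]},
    {"note": "polyuria polydipsia elevated blood glucose", "codes": ["E11.9"]},
]

query = "acute chest pain with elevated troponin levels"
examples = retrieve_examples(query, corpus, k=1)
prompt = build_prompt(query, examples)
```

In a full pipeline, the bag-of-words retriever would typically be replaced by a learned embedding model, and `prompt` would be sent to the generative LLM, whose completion is parsed into predicted diagnosis codes.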