L2D-Clinical: Learning to Defer for Adaptive Model Selection in Clinical Text Classification

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses instance-level model selection in clinical text classification, where specialized fine-tuned models such as ClinicalBERT and general-purpose large language models (LLMs) each have distinct strengths but are not coordinated adaptively. The authors propose L2D-Clinical, a framework that uses a learning-to-defer mechanism to decide, per instance, whether a BERT-based classifier should defer its prediction to an LLM. Deferral decisions are informed by uncertainty estimates and textual features. Evaluated on adverse drug event detection (ADE Corpus V2) and MIMIC-IV treatment outcome classification (with multi-LLM consensus labels as ground truth), the approach achieves F1 scores of 0.928 (+1.7 points over BERT) and 0.980 (+9.3 points over BERT), respectively, while deferring only 7% and 16.8% of instances, which keeps LLM API costs low.

📝 Abstract

Clinical text classification requires choosing between specialized fine-tuned models (BERT variants) and general-purpose large language models (LLMs), yet neither dominates across all instances. We introduce Learning to Defer for clinical text (L2D-Clinical), a framework that learns when a BERT classifier should defer to an LLM based on uncertainty signals and text characteristics. Unlike prior L2D work that defers to human experts assumed universally superior, our approach enables adaptive deferral, improving accuracy when the LLM complements BERT. We evaluate on two English clinical tasks: (1) ADE detection (ADE Corpus V2), where BioBERT (F1=0.911) outperforms the LLM (F1=0.765), and (2) treatment outcome classification (MIMIC-IV with multi-LLM consensus ground truth), where GPT-5-nano (F1=0.967) outperforms ClinicalBERT (F1=0.887). On ADE, L2D-Clinical achieves F1=0.928 (+1.7 points over BERT) by selectively deferring 7% of instances where the LLM's high recall compensates for BERT's misses. On MIMIC, L2D-Clinical achieves F1=0.980 (+9.3 points over BERT) by deferring only 16.8% of cases to the LLM. The key insight is that L2D-Clinical learns to selectively leverage LLM strengths while minimizing API costs.
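To make the deferral idea concrete, here is a minimal Python sketch of per-instance routing between a BERT-style classifier and an LLM. It is an illustration under stated assumptions, not the authors' method: the abstract describes a deferral policy learned from uncertainty signals and text characteristics, whereas this sketch substitutes a fixed threshold on predictive entropy. The names `binary_entropy`, `predict_with_deferral`, `entropy_threshold`, and the toy probabilities are all hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) prediction; 0.0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def predict_with_deferral(
    texts: Sequence[str],
    bert_prob: Callable[[str], float],   # assumed: returns P(positive) from the BERT model
    llm_label: Callable[[str], int],     # assumed: returns a 0/1 label from the LLM
    entropy_threshold: float = 0.9,      # hypothetical stand-in for a learned deferral policy
) -> Tuple[List[int], float]:
    """Return (labels, deferral_rate), calling the LLM only on uncertain instances."""
    labels: List[int] = []
    deferred = 0
    for text in texts:
        p = bert_prob(text)
        if binary_entropy(p) > entropy_threshold:
            # BERT is uncertain: defer this instance to the LLM.
            labels.append(llm_label(text))
            deferred += 1
        else:
            # BERT is confident: keep its own prediction.
            labels.append(int(p >= 0.5))
    rate = deferred / len(texts) if texts else 0.0
    return labels, rate


# Toy stand-ins for the two models (made-up values, not results from the paper).
toy_probs = {"note a": 0.97, "note b": 0.52, "note c": 0.08, "note d": 0.95}
toy_llm = {"note b": 0}

labels, rate = predict_with_deferral(
    list(toy_probs), toy_probs.__getitem__, lambda t: toy_llm.get(t, 1)
)
print(labels, rate)
```

In this toy run only "note b" exceeds the entropy threshold, so it is the only instance sent to the LLM. The deferral rate is the quantity that drives API cost, corresponding to the 7% and 16.8% figures reported in the abstract.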

Problem

Research questions and friction points this paper is trying to address.

clinical text classification

model selection

learning to defer

large language models

BERT

Innovation

Methods, ideas, or system contributions that make the work stand out.

Learning to Defer

Clinical Text Classification

Model Selection