🤖 AI Summary
To address the insufficient reliability of AI models in clinical decision support, this work proposes an uncertainty-aware selective prediction mechanism that lets a model abstain from diagnosis when its prediction confidence falls below a task-adaptive threshold. To tackle cross-task uncertainty calibration, which is particularly challenging across heterogeneous medical tasks, the authors introduce HUQ-2, a novel uncertainty quantification method that integrates Bayesian approximate inference, ensemble-based estimation, and adaptive thresholding. HUQ-2 is evaluated on diverse clinical NLP tasks, including in-hospital mortality prediction, multi-label ICD coding recommendation, outpatient triage, and depression/anxiety detection, and validated on MIMIC-III/IV, a proprietary outpatient dataset, and multi-source mental health datasets, providing a unified evaluation across cross-task, multi-source, heterogeneous clinical text data. Experiments demonstrate a 12.6–18.3% improvement in selective prediction AUC, over 94% accuracy on retained (non-abstained) samples, and substantial gains in calibration, robustness, and clinical interpretability, establishing HUQ-2 as a new state-of-the-art uncertainty modeling framework for medical text analysis.
📝 Abstract
This study addresses the critical issue of reliability in AI-assisted medical diagnosis. We focus on the selective prediction approach, which allows a diagnosis system to abstain from providing a decision when it is not confident in the diagnosis. Such selective prediction (or abstention) approaches are usually based on modeling the predictive uncertainty of the machine learning models involved. This study explores uncertainty quantification in machine learning models for medical text analysis, addressing diverse tasks across multiple datasets. We focus on binary mortality prediction from textual data in MIMIC-III, multi-label medical code prediction using ICD-10 codes from MIMIC-IV, and multi-class classification on a private outpatient visits dataset. Additionally, we analyze mental health datasets targeting depression and anxiety detection, drawing on various text-based sources such as essays, social media posts, and clinical descriptions. Beyond comparing existing uncertainty methods, we introduce HUQ-2, a new state-of-the-art method for enhancing reliability in selective prediction tasks. Our results provide a detailed comparison of uncertainty quantification methods and demonstrate the effectiveness of HUQ-2 in capturing and evaluating uncertainty, paving the way for more reliable and interpretable applications in medical text analysis.
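To make the selective prediction setup concrete, the sketch below shows the abstention mechanism the abstract describes: the model withholds a prediction when its confidence falls below a threshold, and quality at each coverage level is summarized by a risk–coverage curve. The confidence score here is the maximum softmax probability, a common baseline; HUQ-2 itself combines richer uncertainty estimates, and the threshold value is illustrative, not from the paper.

```python
import numpy as np


def selective_predict(probs, threshold):
    """Predict the argmax class, but abstain when confidence is low.

    probs: (n_samples, n_classes) array of predicted class probabilities.
    Confidence is taken as the max class probability (a baseline choice;
    HUQ-2 uses hybrid uncertainty estimates instead).
    """
    confidence = probs.max(axis=1)
    preds = probs.argmax(axis=1)
    abstain = confidence < threshold
    return preds, abstain


def risk_coverage_curve(probs, labels):
    """Sort samples by confidence (most confident first) and compute the
    empirical error rate (risk) at each coverage level; the area under
    this curve is a standard selective prediction metric."""
    confidence = probs.max(axis=1)
    preds = probs.argmax(axis=1)
    order = np.argsort(-confidence)  # most confident first
    errors = (preds[order] != labels[order]).astype(float)
    n = len(labels)
    coverage = np.arange(1, n + 1) / n
    risk = np.cumsum(errors) / np.arange(1, n + 1)
    return coverage, risk


# Toy example: 4 samples, 2 classes.
probs = np.array([[0.9, 0.1], [0.55, 0.45], [0.2, 0.8], [0.51, 0.49]])
labels = np.array([0, 1, 1, 0])
preds, abstain = selective_predict(probs, threshold=0.6)
coverage, risk = risk_coverage_curve(probs, labels)
```

In this toy run, the two low-confidence samples (confidence 0.55 and 0.51) are abstained on, and both retained predictions happen to be correct; lowering the risk on retained samples at the cost of coverage is exactly the trade-off the selective prediction AUC measures.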