🤖 AI Summary
Large language models (LLMs) generalize well but suffer from low domain-specific accuracy and poor interpretability in medical time-series analysis. To address this, we propose ConMIL, a plug-and-play decision-support small model that combines multiple instance learning (MIL) with conformal prediction (CP). MIL localizes the clinically relevant signal segments, and CP produces calibrated, set-valued predictions, which together improve both accuracy and interpretability. Paired with the multimodal LLM Qwen2-VL-7B, ConMIL reaches 94.92% precision on confident samples in arrhythmia detection and 96.82% in sleep staging, compared with standalone LLM accuracy of 46.13% and 13.16%, a gap of more than 48 percentage points on each task. The work points to an approach to medical time-series interpretation that pairs the task-specific precision of a small model with the contextual reasoning of an LLM.
📝 Abstract
Large language models (LLMs) exhibit remarkable capabilities in visual inspection of medical time-series data, achieving proficiency comparable to human clinicians. However, their broad scope limits domain-specific precision, and proprietary weights hinder fine-tuning for specialized datasets. In contrast, small specialized models (SSMs) excel in targeted tasks but lack the contextual reasoning required for complex clinical decision-making. To address these challenges, we propose ConMIL (Conformalized Multiple Instance Learning), a decision-support SSM that integrates seamlessly with LLMs. By using Multiple Instance Learning (MIL) to identify clinically significant signal segments and conformal prediction for calibrated set-valued outputs, ConMIL enhances LLMs' interpretative capabilities for medical time-series analysis. Experimental results demonstrate that ConMIL significantly improves the performance of state-of-the-art LLMs, such as ChatGPT4.0 and Qwen2-VL-7B. Specifically, ConMIL-supported Qwen2-VL-7B achieves 94.92% and 96.82% precision for confident samples in arrhythmia detection and sleep staging, compared to standalone LLM accuracy of 46.13% and 13.16%. These findings highlight the potential of ConMIL to bridge task-specific precision and broader contextual reasoning, enabling more reliable and interpretable AI-driven clinical decision support.
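The two ingredients named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the pooling mechanism or the conformal score, so the attention-style pooling, the score `1 - p(true class)`, and all function names below are assumptions chosen for illustration. The sketch shows how per-segment weights can indicate which signal segments matter, and how a split-conformal threshold turns class probabilities into a set-valued output. Treating a sample as "confident" when its set contains exactly one class is likewise an assumption about how that term might be operationalized.

```python
import math
import numpy as np

def attention_pool(instance_embeddings, w):
    """Attention-style MIL pooling (assumed form, not from the paper).

    instance_embeddings: (num_segments, dim) array, one row per signal segment.
    w: (dim,) scoring vector (hypothetical learned parameter).
    Returns the pooled bag embedding and the per-segment weights.
    """
    logits = instance_embeddings @ w
    weights = np.exp(logits - logits.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ instance_embeddings, weights

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split-conformal threshold from a held-out calibration set.

    Uses the nonconformity score 1 - p(true class) and the
    ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: include every class.
        return float("inf")
    return float(np.sort(scores)[k - 1])

def prediction_set(probs, qhat):
    """Return every class whose score 1 - p is within the threshold."""
    return [int(c) for c in np.flatnonzero(1.0 - probs <= qhat)]

def is_confident(pred_set):
    """Assumed definition: a sample is confident if its set is a singleton."""
    return len(pred_set) == 1
```

Under this sketch, a downstream LLM prompt could include the prediction set and the highest-weighted segments as supporting evidence; how ConMIL actually passes this information to the LLM is not described in the abstract.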