Enhancing Visual Inspection Capability of Multi-Modal Large Language Models on Medical Time Series with Supportive Conformalized and Interpretable Small Specialized Models

📅 2025-01-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models (LLMs) generalize well but show low task-specific accuracy and poor interpretability in medical time-series analysis. To address this, we propose ConMIL, a plug-and-play decision-support small model that combines multiple instance learning (MIL) with conformal prediction (CP). MIL localizes clinically relevant signal segments, and CP produces calibrated set-valued predictions; together they improve both accuracy and interpretability. Paired with the multimodal LLM Qwen2-VL-7B, ConMIL reaches 94.92% precision on confident samples for arrhythmia detection and 96.82% for sleep staging, against 46.13% and 13.16% accuracy for the standalone LLM. These results suggest that pairing a small specialized model with an LLM can combine task-specific precision with broader contextual reasoning for clinical decision support.
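To make the segment-localization idea concrete, here is a minimal sketch of attention-style MIL pooling. It assumes a recording is split into segments (instances), each already embedded as a vector, and that a scoring vector has been learned. The summary does not say which pooling ConMIL uses, so `attention_pool` and `attn_vector` are hypothetical names for illustration only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(segment_embeddings, attn_vector):
    """Pool segment embeddings into one bag embedding.

    Assumption: each segment is scored by a dot product with a learned
    vector; the softmax of those scores gives per-segment weights, which
    can be read as how much each segment contributed to the decision.
    """
    scores = [sum(a * b for a, b in zip(emb, attn_vector))
              for emb in segment_embeddings]
    weights = softmax(scores)
    dim = len(segment_embeddings[0])
    bag = [sum(w * emb[d] for w, emb in zip(weights, segment_embeddings))
           for d in range(dim)]
    return bag, weights
```

In a sketch like this, the segment with the largest weight is the one a clinician would be pointed to first; the bag embedding would then feed a classifier head.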

📝 Abstract

Large language models (LLMs) exhibit remarkable capabilities in visual inspection of medical time-series data, achieving proficiency comparable to human clinicians. However, their broad scope limits domain-specific precision, and proprietary weights hinder fine-tuning for specialized datasets. In contrast, small specialized models (SSMs) excel in targeted tasks but lack the contextual reasoning required for complex clinical decision-making. To address these challenges, we propose ConMIL (Conformalized Multiple Instance Learning), a decision-support SSM that integrates seamlessly with LLMs. By using Multiple Instance Learning (MIL) to identify clinically significant signal segments and conformal prediction for calibrated set-valued outputs, ConMIL enhances LLMs' interpretative capabilities for medical time-series analysis. Experimental results demonstrate that ConMIL significantly improves the performance of state-of-the-art LLMs, such as ChatGPT4.0 and Qwen2-VL-7B. Specifically, ConMIL-supported Qwen2-VL-7B achieves 94.92% and 96.82% precision for confident samples in arrhythmia detection and sleep staging, compared to standalone LLM accuracy of 46.13% and 13.16%. These findings highlight the potential of ConMIL to bridge task-specific precision and broader contextual reasoning, enabling more reliable and interpretable AI-driven clinical decision support.
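The "calibrated set-valued outputs" can be illustrated with standard split conformal prediction. This is a minimal sketch under stated assumptions, not the authors' implementation: the nonconformity score (one minus the probability of the true class) is a common choice that the abstract does not confirm, and treating a single-label prediction set as a "confident sample" is likewise an assumption. All function names are hypothetical.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Compute the score threshold from a held-out calibration set.

    cal_probs: list of per-class probability lists from the small model.
    cal_labels: list of true class indices.
    alpha: target miscoverage rate (e.g. 0.1 for 90% coverage).
    """
    # Assumed nonconformity score: 1 - probability of the true class.
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank used in split conformal prediction.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: include every class.
        return float("inf")
    return scores[rank - 1]

def prediction_set(probs, q_hat):
    """Return every class whose nonconformity score is within the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= q_hat]

def is_confident(pred_set):
    """Assumption: a sample counts as confident when exactly one class remains."""
    return len(pred_set) == 1
```

Under these assumptions, a sample with a single-label set could be answered directly from the small model, while a multi-label set could be passed to the LLM as a shortlist of candidate classes alongside the plotted signal.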

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Medical Time Series Data

Specialized Task Precision

Innovation

Methods, ideas, or system contributions that make the work stand out.

ConMIL

Medical Time Series Analysis

Accuracy Enhancement
