AI Summary
Clinical workflow fragmentation severely impedes efficiency: heterogeneous scripting, ad-hoc model ensembles, and lack of data-driven modality identification and standardized outputs result in high deployment overhead, costly monitoring, and poor interoperability. To address this, we propose a healthcare-first vision-language unified framework that pioneers the use of a single vision-language model (VLM) for two-tier clinical decision-making: first, an auditable, three-stage routing mechanism matches inputs to expert-defined model cards; second, domain-specific multi-task joint inference (with early-exit capability and candidate arbitration) adheres to clinical risk constraints. Leveraging phased prompting, a candidate answer selector, and specialty-specific fine-tuning, our framework unifies modality identification, abnormality classification, model selection, and multi-task reasoning. Evaluated across gastroenterology, hematology, ophthalmology, and pathology, our single-model solution achieves performance on par with specialized models while substantially reducing deployment complexity, operational overhead, and integration effort.
Abstract
Clinical workflows are fragmented: a patchwork of scripts and task-specific networks often handles triage, task selection, and model deployment. These pipelines are rarely streamlined for data science work, which reduces efficiency and raises operational costs. Workflows also lack data-driven model identification (from imaging/tabular inputs) and standardized delivery of model outputs. In response, we present a practical, healthcare-first framework that uses a single vision-language model (VLM) in two complementary roles. First (Solution 1), the VLM acts as an aware model-card matcher that routes an incoming image to the appropriate specialist model via a three-stage workflow (modality -> primary abnormality -> model-card id). Checks are provided by (i) stagewise prompts that allow early exit via None/Normal/Other and (ii) a stagewise answer selector that arbitrates between the top-2 candidates at each stage, reducing the chance of an incorrect selection and aligning the workflow with clinical risk tolerance. Second (Solution 2), we fine-tune the VLM on specialty-specific datasets so that a single model covers multiple downstream tasks within each specialty, maintaining performance while simplifying deployment. Across gastroenterology, hematology, ophthalmology, and pathology, our single-model deployment matches or approaches specialized baselines.
Compared with pipelines composed of many task-specific agents, this approach shows that one VLM can both decide (route an input to the right model card) and do (perform the downstream tasks itself). It may reduce the effort required of data scientists, shorten the time spent on monitoring, increase the transparency of model selection (with per-stage justifications), and lower integration overhead.
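The three-stage routing in Solution 1 (modality -> primary abnormality -> model-card id, with early exit and top-2 arbitration) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `route`, `toy_propose`, and `toy_select`, the callable signatures, and the example labels and card id are all hypothetical, and the VLM calls are replaced by plain stubs. It also assumes that any of None/Normal/Other ends routing at whichever stage it is chosen, which the abstract does not spell out.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple

# Stage order taken from the abstract; the identifiers themselves are illustrative.
STAGES = ("modality", "primary_abnormality", "model_card_id")
# Early-exit answers named in the abstract (assumed here to apply at every stage).
EARLY_EXIT = {"None", "Normal", "Other"}

# Hypothetical interfaces standing in for prompted VLM calls:
# a proposer returns candidate labels ranked best-first,
# a selector picks one label out of the top-2 candidates.
Proposer = Callable[[str, str, Sequence[str]], List[str]]
Selector = Callable[[str, str, Sequence[str]], str]


@dataclass
class RoutingResult:
    model_card_id: Optional[str]  # None means routing stopped early
    trace: List[Tuple[str, str]] = field(default_factory=list)  # (stage, choice)


def route(image_ref: str, propose: Proposer, select: Selector) -> RoutingResult:
    """Run the stages in order, arbitrating between the top-2 candidates."""
    trace: List[Tuple[str, str]] = []
    for stage in STAGES:
        context = [label for _, label in trace]  # earlier decisions
        top2 = propose(stage, image_ref, context)[:2]
        if not top2:
            raise ValueError(f"no candidates proposed at stage {stage!r}")
        choice = top2[0] if len(top2) == 1 else select(stage, image_ref, top2)
        if choice not in top2:
            raise ValueError(f"selector returned a label outside the top-2: {choice!r}")
        trace.append((stage, choice))
        if choice in EARLY_EXIT:
            # Stop before any specialist model is selected.
            return RoutingResult(model_card_id=None, trace=trace)
    return RoutingResult(model_card_id=trace[-1][1], trace=trace)


def toy_propose(stage: str, image_ref: str, context: Sequence[str]) -> List[str]:
    """Stub proposer with made-up labels; a real system would query the VLM."""
    table = {
        "modality": ["endoscopy", "fundus"],
        "primary_abnormality": ["polyp", "Normal"],
        "model_card_id": ["card-polyp-seg", "Other"],
    }
    return table[stage]


def toy_select(stage: str, image_ref: str, top2: Sequence[str]) -> str:
    """Stub selector that always keeps the first-ranked candidate."""
    return top2[0]


result = route("img-001", toy_propose, toy_select)
print(result.model_card_id, result.trace)
```

The `trace` field keeps every per-stage decision, which is one simple way to support the auditable, per-stage view of model selection that the abstract describes.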