🤖 AI Summary
This work addresses the limitations of existing AI models in cardiac disease diagnosis: their restriction to single-modality inputs, lack of interactivity, and inability to effectively integrate multimodal data such as electrocardiograms (ECG), echocardiography, and cardiac magnetic resonance (CMR). To overcome these challenges, the authors propose a hierarchical agent-based multimodal vision–language system that incorporates domain-specific visual encoders, modality-specialized vision–language expert models, a multi-stage language model refinement mechanism, and a multimodal coordinator. This architecture mitigates hallucination and spurious reasoning while enabling end-to-end analysis of each modality independently and jointly. Experimental results demonstrate diagnostic accuracies of 87–91% for ECG, 67–86% for echocardiography, and 85–88% for CMR tasks, with a multimodal fusion accuracy of 70%, substantially outperforming current approaches and improving generated text quality by 1.7–3.0×.
📝 Abstract
Cardiovascular disease remains the leading cause of global mortality, and diagnosis still depends on expert human interpretation of complex cardiac tests. Current AI vision–language models are limited to single-modality inputs and are non-interactive. We present MARCUS (Multimodal Autonomous Reasoning and Chat for Ultrasound and Signals), an agentic vision–language system for end-to-end interpretation of electrocardiograms (ECGs), echocardiograms, and cardiac magnetic resonance imaging (CMR), both independently and as multimodal input. MARCUS employs a hierarchical agentic architecture comprising modality-specific vision–language expert models, each integrating domain-trained visual encoders with multi-stage language model optimization, coordinated by a multimodal orchestrator. Trained on 13.5 million images (0.25M ECGs, 1.3M echocardiogram images, 12M CMR images) and our novel expert-curated dataset spanning 1.6 million questions, MARCUS achieves state-of-the-art performance surpassing frontier models (GPT-5 Thinking, Gemini 2.5 Pro Deep Think). Across internal (Stanford) and external (UCSF) test cohorts, MARCUS achieves accuracies of 87–91% for ECG, 67–86% for echocardiography, and 85–88% for CMR, outperforming frontier models by 34–45% (P<0.001). On multimodal cases, MARCUS achieved 70% accuracy, nearly triple that of frontier models (22–28%), with 1.7–3.0× higher free-text quality scores. Our agentic architecture also confers resistance to mirage reasoning, whereby vision–language models derive conclusions from unintended textual signals or hallucinated visual content. MARCUS demonstrates that domain-specific visual encoders combined with an agentic orchestrator enable multimodal cardiac interpretation. We release our models, code, and benchmark open-source.
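The expert-plus-orchestrator pattern the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every name, the expert stubs, and the concatenation-based fusion step are assumptions for exposition only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-ins for modality-specific vision-language experts.
# In the paper, each expert pairs a domain-trained visual encoder with an
# optimized language model; here each is reduced to a callable mapping raw
# study bytes to a findings string.
def ecg_expert(study: bytes) -> str:
    return "ECG: sinus rhythm, no acute ischemic changes"

def echo_expert(study: bytes) -> str:
    return "Echo: LVEF ~55%, no significant valvular disease"

def cmr_expert(study: bytes) -> str:
    return "CMR: no late gadolinium enhancement"

@dataclass
class Orchestrator:
    """Top layer of the hierarchical agentic architecture: routes each
    modality to its registered expert, then fuses per-modality findings."""
    experts: Dict[str, Callable[[bytes], str]]

    def interpret(self, studies: Dict[str, bytes]) -> str:
        findings: List[str] = []
        for modality, data in studies.items():
            expert = self.experts.get(modality)
            if expert is None:
                raise ValueError(f"no expert for modality {modality!r}")
            findings.append(expert(data))
        # A real orchestrator would run a joint reasoning pass over the
        # expert outputs; this sketch just concatenates them.
        return " | ".join(findings)

orchestrator = Orchestrator(
    experts={"ecg": ecg_expert, "echo": echo_expert, "cmr": cmr_expert}
)
# Works for a single modality or any subset (multimodal input).
report = orchestrator.interpret({"ecg": b"...", "cmr": b"..."})
```

Keeping the experts behind a single dispatch interface is what lets the system handle each modality independently or jointly with the same entry point.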