MARCUS: An agentic, multimodal vision-language model for cardiac diagnosis and management

📅 2026-03-23
🤖 AI Summary
This work addresses the limitations of existing AI models in cardiac disease diagnosis—namely, their restriction to single-modality inputs, lack of interactivity, and inability to effectively integrate multimodal data such as electrocardiograms (ECG), echocardiography, and cardiac magnetic resonance (CMR). To overcome these challenges, the authors propose a hierarchical agent-based multimodal vision–language system that incorporates domain-specific visual encoders, modality-specialized vision–language expert models, a multi-stage language model refinement mechanism, and a multimodal coordinator. This architecture effectively mitigates hallucination and spurious reasoning while enabling end-to-end independent and joint multimodal analysis. Experimental results demonstrate diagnostic accuracies of 87–91% for ECG, 67–86% for echocardiography, and 85–88% for CMR tasks, with a multimodal fusion accuracy of 70%, substantially outperforming current approaches and improving generated text quality by 1.7–3.0×.

📝 Abstract
Cardiovascular disease remains the leading cause of global mortality, with progress hindered by human interpretation of complex cardiac tests. Current AI vision-language models are limited to single-modality inputs and are non-interactive. We present MARCUS (Multimodal Autonomous Reasoning and Chat for Ultrasound and Signals), an agentic vision-language system for end-to-end interpretation of electrocardiograms (ECGs), echocardiograms, and cardiac magnetic resonance imaging (CMR) independently and as multimodal input. MARCUS employs a hierarchical agentic architecture comprising modality-specific vision-language expert models, each integrating domain-trained visual encoders with multi-stage language model optimization, coordinated by a multimodal orchestrator. Trained on 13.5 million images (0.25M ECGs, 1.3M echocardiogram images, 12M CMR images) and our novel expert-curated dataset spanning 1.6 million questions, MARCUS achieves state-of-the-art performance surpassing frontier models (GPT-5 Thinking, Gemini 2.5 Pro Deep Think). Across internal (Stanford) and external (UCSF) test cohorts, MARCUS achieves accuracies of 87-91% for ECG, 67-86% for echocardiography, and 85-88% for CMR, outperforming frontier models by 34-45% (P<0.001). On multimodal cases, MARCUS achieved 70% accuracy, nearly triple that of frontier models (22-28%), with 1.7-3.0x higher free-text quality scores. Our agentic architecture also confers resistance to mirage reasoning, whereby vision-language models derive reasoning from unintended textual signals or hallucinated visual content. MARCUS demonstrates that domain-specific visual encoders with an agentic orchestrator enable multimodal cardiac interpretation. We release our models, code, and benchmark open-source.
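As a rough illustration of the hierarchical design the abstract describes, the sketch below routes each modality to its own expert and passes the per-modality reports to a coordinator. It is not the authors' code: `Orchestrator`, `ExpertReport`, and the stub experts are hypothetical names, the experts here return fixed strings rather than running vision-language models, and the merge step is assumed to be a plain concatenation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical type: an "expert" maps raw modality data to a text report.
# In the paper these are vision-language models; here they are plain stubs.
Expert = Callable[[bytes], str]


@dataclass
class ExpertReport:
    modality: str
    text: str


class Orchestrator:
    """Illustrative coordinator: one expert per modality, then a merge step."""

    def __init__(self, experts: Dict[str, Expert]) -> None:
        self.experts = experts

    def run_experts(self, study: Dict[str, bytes]) -> List[ExpertReport]:
        reports = []
        for modality, data in study.items():
            if modality not in self.experts:
                raise KeyError(f"no expert registered for modality {modality!r}")
            # Each expert receives only its own modality's data.
            reports.append(ExpertReport(modality, self.experts[modality](data)))
        return reports

    def interpret(self, study: Dict[str, bytes]) -> str:
        reports = self.run_experts(study)
        # Assumption: merging is a simple concatenation of tagged reports.
        return "\n".join(f"[{r.modality}] {r.text}" for r in reports)


# Stub experts standing in for ECG, echocardiography, and CMR models.
experts: Dict[str, Expert] = {
    "ecg": lambda data: f"ECG report from {len(data)} bytes",
    "echo": lambda data: f"Echo report from {len(data)} bytes",
    "cmr": lambda data: f"CMR report from {len(data)} bytes",
}

orchestrator = Orchestrator(experts)
summary = orchestrator.interpret({"ecg": b"\x00" * 4, "cmr": b"\x00" * 8})
print(summary)
```

The same `interpret` call covers both single-modality and multimodal studies: a study with one key runs one expert, and a study with several keys runs each matching expert before the merge.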
Problem

Research questions and friction points this paper is trying to address.

cardiac diagnosis
multimodal interpretation
vision-language models
electrocardiogram
echocardiography
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic architecture
multimodal vision-language model
domain-specific visual encoder
mirage reasoning resistance
cardiac diagnosis
Jack W O'Sullivan
Division of Cardiology, Department of Medicine, Stanford University, CA, USA
Mohammad Asadi
Department of Electrical Engineering, Stanford University, CA, USA
Lennart Elbe
Department of Medicine, Radiology, and Pediatrics, UCSF, CA, USA
Akshay Chaudhari
Assistant Professor, Stanford University
Biomedical Imaging, Multi-Modal Learning, Deep Learning, Radiology
Tahoura Nedaee
Department of Biology, Stanford University, CA, USA
Francois Haddad
Division of Cardiology, Department of Medicine, Stanford University, CA, USA
Michael Salerno
UCSF
Cardiovascular MRI, Cardiovascular Imaging
Li Fei-Fei
Department of Computer Science, Stanford University, CA, USA
Ehsan Adeli
Stanford University
Computer Vision, Computational Neuroscience, Precision Healthcare, Ambient Intelligence
Rima Arnaout
Department of Medicine, Radiology, and Pediatrics, UCSF, CA, USA
Euan A Ashley
Division of Cardiology, Department of Medicine, Stanford University, CA, USA