CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of clinical accountability in existing large vision-language models (VLMs) for medical multimodal reasoning, which often hallucinate and fail to align with clinicians' staged, evidence-driven diagnostic workflows. To bridge this gap, the authors propose CARE, a framework that decouples the reasoning process to emulate clinical practice: a compact VLM first identifies medical entities; an expert segmentation model then generates pixel-level evidence; and that evidence informs an evidence-augmented VLM for final reasoning. A novel VLM coordinator dynamically orchestrates tool invocation and answer verification. CARE integrates clinical accountability into multimodal medical reasoning through modular design, explicit evidence grounding, and agent-based coordination, substantially enhancing transparency and reliability. On standard medical VQA benchmarks, CARE-Flow (the coordinator-free variant) outperforms same-scale state-of-the-art models by 10.9%, and CARE-Coord (with the coordinator) surpasses an even stronger, heavily pre-trained SOTA by 5.2%.
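The staged pipeline described in the summary can be sketched as a small piece of orchestration code. This is a minimal illustrative sketch, not the authors' implementation: every function and parameter name below (`propose_entities`, `segment_entity`, `grounded_answer`, `verify`, `max_rounds`) is an assumption introduced here for clarity.

```python
# Hypothetical sketch of the CARE pipeline: a compact VLM proposes entities,
# an expert model segments pixel-level ROI evidence, and a grounded VLM
# reasons over the image augmented with that evidence. All names are
# illustrative assumptions, not the paper's API.
from typing import Any, Callable, Dict, Tuple

def care_flow(image: Any,
              question: str,
              propose_entities: Callable[[Any, str], list],
              segment_entity: Callable[[Any, str], Any],
              grounded_answer: Callable[[Any, Dict[str, Any], str], str]
              ) -> Tuple[str, Dict[str, Any]]:
    """Coordinator-free CARE-Flow: entities -> pixel evidence -> grounded answer."""
    entities = propose_entities(image, question)             # stage 1: compact VLM
    rois = {e: segment_entity(image, e) for e in entities}   # stage 2: expert grounding
    answer = grounded_answer(image, rois, question)          # stage 3: grounded VLM
    return answer, rois  # ROI evidence is returned alongside the answer for accountability

def care_coord(image: Any,
               question: str,
               flow: Callable[[Any, str], Tuple[str, Dict[str, Any]]],
               verify: Callable[[str, Dict[str, Any], str], bool],
               max_rounds: int = 3) -> Tuple[str, Dict[str, Any]]:
    """Coordinator variant: re-run the flow until evidence and answer are consistent."""
    answer, rois = flow(image, question)
    for _ in range(max_rounds - 1):
        if verify(answer, rois, question):  # evidence-answer consistency review
            break
        answer, rois = flow(image, question)
    return answer, rois
```

Returning the ROI evidence together with the answer mirrors the paper's accountability goal: a downstream clinician (or the coordinator's verifier) can inspect the pixel-level grounding behind each prediction rather than trusting an end-to-end black box.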

📝 Abstract
Large visual language models (VLMs) have shown strong multi-modal medical reasoning ability, but most operate as end-to-end black boxes, diverging from clinicians' evidence-based, staged workflows and hindering clinical accountability. Complementarily, expert visual grounding models can accurately localize regions of interest (ROIs), providing explicit, reliable evidence that improves both reasoning accuracy and trust. In this paper, we introduce CARE, advancing Clinical Accountability in multi-modal medical Reasoning with an Evidence-grounded agentic framework. Unlike existing approaches that couple grounding and reasoning within a single generalist model, CARE decomposes the task into coordinated sub-modules to reduce shortcut learning and hallucination: a compact VLM proposes relevant medical entities; an expert entity-referring segmentation model produces pixel-level ROI evidence; and a grounded VLM reasons over the full image augmented by ROI hints. The VLMs are optimized with reinforcement learning with verifiable rewards to align answers with supporting evidence. Furthermore, a VLM coordinator plans tool invocation and reviews evidence-answer consistency, providing agentic control and final verification. Evaluated on standard medical VQA benchmarks, our CARE-Flow (coordinator-free) improves average accuracy by 10.9% over the same-size (10B) state-of-the-art (SOTA). With dynamic planning and answer review, our CARE-Coord yields a further gain, outperforming the heavily pre-trained SOTA by 5.2%. Our experiments demonstrate that an agentic framework that emulates clinical workflows, incorporating decoupled specialized models and explicit evidence, yields more accurate and accountable medical AI.
Problem

Research questions and friction points this paper is trying to address.

clinical accountability
multi-modal medical reasoning
visual language models
evidence grounding
black-box models
Innovation

Methods, ideas, or system contributions that make the work stand out.

evidence-grounded reasoning
clinical accountability
agentic framework
visual grounding
reinforcement learning with verifiable rewards