MedOpenClaw: Auditable Medical Imaging Agents Reasoning over Uncurated Full Studies

📅 2026-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluations of medical vision-language models are limited to curated 2D images and fail to reflect real-world clinical workflows, which require active exploration of full-volume, multimodal 3D imaging studies. To address this gap, this work proposes MedFlowBench, the first agent evaluation benchmark supporting study-level, interactive assessment across multi-sequence and multimodal 3D medical images, and introduces the auditable MedOpenClaw runtime environment, which enables models to interact dynamically with standard clinical tools such as 3D Slicer. The framework supports three operational modes: viewer-only, tool-calling, and open-ended interaction, and covers multimodal data including brain MRI and thoracic CT/PET. Experiments reveal that while state-of-the-art models perform adequately on basic tasks, their performance degrades significantly when they invoke specialized tools, owing to insufficient spatial localization capabilities, highlighting a critical bottleneck in medical AI agent development.

📝 Abstract
Current evaluations of vision-language models (VLMs) on medical imaging tasks oversimplify clinical reality by relying on pre-selected 2D images that demand significant manual curation. This setup misses the core challenge of real-world diagnostics: a true clinical agent must actively navigate full 3D volumes across multiple sequences or modalities to gather evidence and ultimately support a final decision. To address this, we propose MEDOPENCLAW, an auditable runtime designed to let VLMs operate dynamically within standard medical tools and viewers (e.g., 3D Slicer). On top of this runtime, we introduce MEDFLOWBENCH, a full-study medical imaging benchmark covering multi-sequence brain MRI and lung CT/PET, which systematically evaluates medical agentic capabilities across viewer-only, tool-use, and open-method tracks. Initial results reveal a critical insight: while state-of-the-art LLMs/VLMs (e.g., Gemini 3.1 Pro and GPT-5.4) can successfully navigate the viewer to solve basic study-level tasks, their performance paradoxically degrades when they are given access to professional support tools, owing to a lack of precise spatial grounding. By bridging the gap between static-image perception and interactive clinical workflows, MEDOPENCLAW and MEDFLOWBENCH establish a reproducible foundation for developing auditable, full-study medical imaging agents.
Problem

Research questions and friction points this paper is trying to address.

medical imaging
vision-language models
full-study reasoning
clinical workflow
3D navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

medical imaging agents
full-study reasoning
auditable runtime
interactive clinical workflow
spatial grounding