🤖 AI Summary
Existing medical multimodal large language models (MLLMs) generate textual reasoning chains but suffer from insufficient localization accuracy and diagnostic reasoning fidelity in tasks requiring dynamic, fine-grained focus on pathological regions. To address this, we propose Ophiuchus—a novel framework enabling autonomous, adaptive decision-making on *when* and *where* to invoke visual grounding tools, seamlessly integrating localized sub-image features into multimodal chain-of-thought reasoning for diagnostic-level “image thinking.” Our method introduces a three-stage training paradigm: (1) cold-start tool integration, (2) self-reflective fine-tuning, and (3) reward-driven agent reinforcement learning—jointly enhancing the model’s intrinsic perceptual capability and external tool orchestration. Evaluated across diverse medical benchmarks—including visual question answering, lesion detection, and reasoning-aware segmentation—Ophiuchus consistently outperforms both closed- and open-source state-of-the-art models, achieving significant gains in fine-grained lesion localization and diagnostic reasoning accuracy.
📝 Abstract
Recent reasoning based medical MLLMs have made progress in generating step by step textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on fine-grained visual regions to achieve precise grounding and diagnosis. We introduce Ophiuchus, a versatile, tool-augmented framework that equips an MLLM to (i) decide when additional visual evidence is needed, (ii) determine where to probe and ground within the medical image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved, multimodal chain of thought. In contrast to prior approaches limited by the performance ceiling of specialized tools, Ophiuchus integrates the model's inherent grounding and perception capabilities with external tools, thereby fostering higher-level reasoning. The core of our method is a three-stage training strategy: cold-start training with tool-integrated reasoning data to achieve basic tool selection and adaptation for inspecting key regions; self-reflection fine-tuning to strengthen reflective reasoning and encourage revisiting tool outputs; and Agentic Tool Reinforcement Learning to directly optimize task-specific rewards and emulate expert-like diagnostic behavior. Extensive experiments show that Ophiuchus consistently outperforms both closed-source and open-source SOTA methods across diverse medical benchmarks, including VQA, detection, and reasoning-based segmentation. Our approach illuminates a path toward medical AI agents that can genuinely "think with images" through tool-integrated reasoning. Datasets, codes, and trained models will be released publicly.