Incentivizing Tool-augmented Thinking with Images for Medical Image Analysis

📅 2025-12-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing medical multimodal large language models (MLLMs) generate textual reasoning chains, but their localization accuracy and diagnostic reasoning fidelity fall short on tasks that require dynamic, fine-grained focus on pathological regions. To address this, the paper proposes Ophiuchus, a framework that lets the model decide autonomously *when* and *where* to invoke visual grounding tools and then integrates the resulting sub-image content into a multimodal chain of thought, so that it can "think with images" during diagnosis. The method uses a three-stage training paradigm: (1) cold-start training on tool-integrated reasoning data, (2) self-reflection fine-tuning, and (3) agentic tool reinforcement learning driven by task-specific rewards. Together, these stages strengthen both the model's own perception and its use of external tools. Across medical benchmarks covering visual question answering, lesion detection, and reasoning-based segmentation, Ophiuchus is reported to outperform both closed- and open-source state-of-the-art models, with gains in fine-grained lesion localization and diagnostic reasoning accuracy.

📝 Abstract

Recent reasoning-based medical MLLMs have made progress in generating step-by-step textual reasoning chains. However, they still struggle with complex tasks that require dynamic, iterative focus on fine-grained visual regions to achieve precise grounding and diagnosis. We introduce Ophiuchus, a versatile, tool-augmented framework that equips an MLLM to (i) decide when additional visual evidence is needed, (ii) determine where to probe and ground within the medical image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved, multimodal chain of thought. In contrast to prior approaches limited by the performance ceiling of specialized tools, Ophiuchus integrates the model's inherent grounding and perception capabilities with external tools, thereby fostering higher-level reasoning. The core of our method is a three-stage training strategy: cold-start training with tool-integrated reasoning data to achieve basic tool selection and adaptation for inspecting key regions; self-reflection fine-tuning to strengthen reflective reasoning and encourage revisiting tool outputs; and Agentic Tool Reinforcement Learning to directly optimize task-specific rewards and emulate expert-like diagnostic behavior. Extensive experiments show that Ophiuchus consistently outperforms both closed-source and open-source SOTA methods across diverse medical benchmarks, including VQA, detection, and reasoning-based segmentation. Our approach illuminates a path toward medical AI agents that can genuinely "think with images" through tool-integrated reasoning. Datasets, code, and trained models will be released publicly.
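The when/where/weave-back loop described in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the paper's implementation: the `crop` tool, the `Step` record, the `think_with_images` loop, and the `toy_policy` stand-in for the MLLM are all hypothetical names, and the image is a plain nested list so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive


def crop(image: Image, box: Box) -> Image:
    """Hypothetical zoom-in tool: return the sub-image inside `box`, clamped to the image."""
    h, w = len(image), len(image[0])
    x0, y0, x1, y1 = box
    x0, x1 = max(0, x0), min(w, x1)
    y0, y1 = max(0, y0), min(h, y1)
    if x0 >= x1 or y0 >= y1:
        raise ValueError(f"empty crop for box {box}")
    return [row[x0:x1] for row in image[y0:y1]]


@dataclass
class Step:
    thought: str                     # textual reasoning for this step
    box: Optional[Box] = None        # where the policy chose to look, if anywhere
    evidence: Optional[Image] = None # sub-image woven back into the chain


# Assumed policy interface: given the image and the chain so far, return a
# thought plus either a box to inspect (more evidence needed) or None (done).
Policy = Callable[[Image, List[Step]], Tuple[str, Optional[Box]]]


def think_with_images(image: Image, policy: Policy, max_steps: int = 4) -> List[Step]:
    """Interleave textual thoughts with cropped visual evidence."""
    chain: List[Step] = []
    for _ in range(max_steps):
        thought, box = policy(image, chain)
        if box is None:              # "when": the policy decides no more evidence is needed
            chain.append(Step(thought))
            break
        # "where": the policy picked a region; the crop is appended to the chain
        chain.append(Step(thought, box, crop(image, box)))
    return chain


def toy_policy(image: Image, chain: List[Step]) -> Tuple[str, Optional[Box]]:
    """Stand-in for the MLLM: zoom on the brightest pixel once, then conclude."""
    if not chain:
        coords = ((r, c) for r in range(len(image)) for c in range(len(image[0])))
        y, x = max(coords, key=lambda rc: image[rc[0]][rc[1]])
        return "Bright region looks suspicious; zoom in.", (x - 1, y - 1, x + 2, y + 2)
    peak = max(max(row) for row in chain[-1].evidence)
    return f"Peak intensity in the crop is {peak}; stop.", None
```

Calling `think_with_images(img, toy_policy)` on a small image returns a chain whose first step carries the cropped evidence and whose last step carries only text. In a real system the policy would be the trained model and the tool set would be richer than a single crop.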

Problem

Research questions and friction points this paper is trying to address.

Medical MLLMs struggle to focus dynamically and iteratively on fine-grained visual regions, which limits precise grounding and diagnosis.

Prior tool-based approaches are bounded by the performance ceiling of the specialized tools they rely on.

Tool-augmented reasoning must be trained to show expert-like diagnostic behavior, which the paper addresses with a three-stage training strategy.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Tool-augmented framework for dynamic visual region focusing

Three-stage training with tool-integrated reasoning and reinforcement learning

Integration of the model's inherent grounding and perception with external tools for higher-level reasoning
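The summary says only that the reinforcement-learning stage optimizes task-specific rewards, without giving their form. As a rough illustration of what such a reward could look like for a task that has both a textual answer and a localized lesion, the sketch below mixes exact-match answer correctness with box IoU. The function names, the equal weights, and the choice of IoU are assumptions made for this example, not details taken from the paper.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def toy_reward(
    pred_answer: str,
    gold_answer: str,
    pred_box: Box,
    gold_box: Box,
    w_answer: float = 0.5,  # assumed weight, not from the paper
    w_box: float = 0.5,     # assumed weight, not from the paper
) -> float:
    """Hypothetical reward: weighted sum of answer correctness and localization quality."""
    answer_term = 1.0 if pred_answer.strip().lower() == gold_answer.strip().lower() else 0.0
    box_term = iou(pred_box, gold_box)
    return w_answer * answer_term + w_box * box_term
```

With these weights, a rollout that names the right finding and localizes it exactly scores 1.0, while one that localizes correctly but answers wrongly scores 0.5. A segmentation task would presumably swap the box term for a mask-overlap measure.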

🔎 Similar Papers

Visual Evaluative AI: A Hypothesis-Driven Tool with Concept-Based Explanations and Weight of Evidence