MEDVISTAGYM: A Scalable Training Environment for Thinking with Medical Images via Tool-Integrated Reinforcement Learning

📅 2026-01-12
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the limited ability of existing medical vision-language models to re-examine, verify, or integrate visual evidence through interactive operations during multi-step reasoning, particularly their inability to invoke and coordinate external tools. To overcome this, the authors propose MedVistaGym, the first open-source, extensible training framework that supports tool integration, region grounding, and multi-evidence fusion. By leveraging trajectory sampling and end-to-end reinforcement learning, MedVistaGym enables models to autonomously select and jointly orchestrate multiple visual tools for dynamic, interactive reasoning. The resulting model, MedVistaGym-R1-8B, achieves average improvements of 19.10%–24.21% over current tool-augmented baselines across six medical visual question answering benchmarks, demonstrating the efficacy of structured agentic training in enhancing multimodal tool-integrated reasoning in the medical domain.

📝 Abstract
Vision language models (VLMs) achieve strong performance on general image understanding but struggle to think with medical images, especially when performing multi-step reasoning through iterative visual interaction. Medical VLMs often rely on static visual embeddings and single-pass inference, preventing models from re-examining, verifying, or refining visual evidence during reasoning. While tool-integrated reasoning offers a promising path forward, open-source VLMs lack the training infrastructure to learn effective tool selection, invocation, and coordination in multimodal medical reasoning. We introduce MedVistaGym, a scalable and interactive training environment that incentivizes tool-integrated visual reasoning for medical image analysis. MedVistaGym equips VLMs to determine when and which tools to invoke, localize task-relevant image regions, and integrate single or multiple sub-image evidence into interleaved multimodal reasoning within a unified, executable interface for agentic training. Using MedVistaGym, we train MedVistaGym-R1 to interleave tool use with agentic reasoning through trajectory sampling and end-to-end reinforcement learning. Across six medical VQA benchmarks, MedVistaGym-R1-8B exceeds comparably sized tool-augmented baselines by 19.10% to 24.21%, demonstrating that structured agentic training, not tool access alone, unlocks effective tool-integrated reasoning for medical image analysis.
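The abstract describes a loop in which the model interleaves reasoning with tool calls that return sub-image evidence. The paper does not publish this interface here, so the following is a minimal illustrative sketch under assumed names (`zoom_in`, `run_episode`, `toy_policy` are all hypothetical, not the paper's API): the policy repeatedly chooses either a tool invocation on a region or a final answer, and each tool result is appended to the context as new visual evidence.

```python
# Hypothetical sketch of an interleaved tool-use loop: the policy alternates
# between invoking a visual tool on an image region and emitting a final
# answer. All names here are illustrative assumptions, not the paper's API.

def zoom_in(image, box):
    """Return the sub-image inside box = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]

TOOLS = {"zoom_in": zoom_in}

def run_episode(policy_step, image, question, max_turns=4):
    """Interleave reasoning with tool calls until the policy answers."""
    context = [("question", question), ("image", image)]
    for _ in range(max_turns):
        action = policy_step(context)           # model picks the next action
        if action["type"] == "answer":          # terminate with a final answer
            return action["text"], context
        tool = TOOLS[action["tool"]]            # invoke the selected tool
        evidence = tool(image, action["args"])
        context.append(("evidence", evidence))  # fuse sub-image evidence
    return None, context

# Toy policy: zoom once, then answer from the cropped evidence.
def toy_policy(context):
    if context[-1][0] != "evidence":
        return {"type": "tool", "tool": "zoom_in", "args": (1, 1, 3, 3)}
    crop = context[-1][1]
    return {"type": "answer", "text": f"region mean={sum(map(sum, crop)) / 4}"}

image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
answer, ctx = run_episode(toy_policy, image, "What is in the center region?")
print(answer)  # → region mean=7.5
```

The point of the sketch is the control flow the abstract emphasizes: tool selection and evidence fusion happen inside the reasoning loop, so they can be optimized end-to-end rather than bolted on after a single forward pass.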
Problem

Research questions and friction points this paper is trying to address.

medical vision-language models
multi-step reasoning
tool-integrated reasoning
iterative visual interaction
medical image analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

tool-integrated reasoning
reinforcement learning
medical vision-language models
agentic training
interactive multimodal reasoning
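The training recipe named above (trajectory sampling plus end-to-end reinforcement learning) can be sketched in miniature. The reward design below is an assumption for illustration only (answer correctness plus a small bonus for successful tool use), and the group-relative advantage is one common choice for trajectory-level RL, not necessarily the paper's:

```python
# Hypothetical sketch of trajectory sampling with an outcome-based reward.
# The reward terms and the group-relative advantage are illustrative
# assumptions, not the paper's actual reward or optimizer design.
import random

def reward(traj, gold_answer):
    """Outcome reward: 1 for a correct answer, +0.1 if a tool was invoked."""
    r = 1.0 if traj["answer"] == gold_answer else 0.0
    return r + (0.1 if traj["used_tool"] else 0.0)

def sample_trajectories(policy, question, n=8, seed=0):
    """Roll out the stochastic policy n times on the same question."""
    rng = random.Random(seed)
    return [policy(question, rng) for _ in range(n)]

def group_advantages(rewards):
    """Group-relative advantage: each reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Toy stochastic policy: sometimes zooms before answering, sometimes guesses.
def toy_policy(question, rng):
    used_tool = rng.random() < 0.5
    answer = "pneumonia" if used_tool else rng.choice(["pneumonia", "normal"])
    return {"answer": answer, "used_tool": used_tool}

trajs = sample_trajectories(toy_policy, "Findings?", n=8, seed=0)
rs = [reward(t, "pneumonia") for t in trajs]
advs = group_advantages(rs)
print(advs)  # positive advantages mark trajectories to reinforce
```

Trajectories that both answer correctly and use tools earn above-average reward, so the policy gradient pushes the model toward tool-integrated reasoning rather than single-pass guessing, which is the paper's central claim about structured agentic training.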