IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 1
Influential: 1
📄 PDF
🤖 AI Summary
This work addresses critical limitations of existing medical multimodal large language models (MLLMs)—namely catastrophic forgetting, poor out-of-domain generalization, and the absence of iterative refinement mechanisms—in pixel-level understanding tasks. The authors propose the first agentic MLLM framework tailored for medical imaging, reframing segmentation as a vision-centric, multi-step decision-making process. By interleaving reasoning with click-based actions that invoke segmentation tools, the framework generates high-quality masks and enables iterative refinement without altering the underlying model architecture. A two-stage training strategy is employed: cold-start supervised fine-tuning followed by reinforcement learning guided by fine-grained rewards, which together cultivate pixel-level reasoning. Extensive experiments demonstrate that the method significantly outperforms both open- and closed-source state-of-the-art approaches across diverse and complex medical referring and segmentation tasks, exhibiting superior robustness and generalization.

📝 Abstract
Recent research on medical MLLMs has gradually shifted its focus from image-level understanding to fine-grained, pixel-level comprehension. Although segmentation serves as the foundation for pixel-level understanding, existing approaches face two major challenges. First, they introduce implicit segmentation tokens and require simultaneous fine-tuning of both the MLLM and external pixel decoders, which increases the risk of catastrophic forgetting and limits generalization to out-of-domain scenarios. Second, most methods rely on single-pass reasoning and lack the capability to iteratively refine segmentation results, leading to suboptimal performance. To overcome these limitations, we propose a novel agentic MLLM, named IBISAgent, that reformulates segmentation as a vision-centric, multi-step decision-making process. IBISAgent enables MLLMs to generate interleaved reasoning and text-based click actions, invoke segmentation tools, and produce high-quality masks without architectural modifications. By iteratively performing multi-step visual reasoning on masked image features, IBISAgent naturally supports mask refinement and promotes the development of pixel-level visual reasoning capabilities. We further design a two-stage training framework consisting of cold-start supervised fine-tuning and agentic reinforcement learning with tailored, fine-grained rewards, enhancing the model's robustness in complex medical referring and reasoning segmentation tasks. Extensive experiments demonstrate that IBISAgent consistently outperforms both closed-source and open-source SOTA methods. All datasets, code, and trained models will be released publicly.
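The abstract's loop—the MLLM reasons, emits text-based click actions, invokes a segmentation tool, then refines on masked features—can be sketched as follows. This is a minimal illustrative toy, not IBISAgent's implementation: the click policy, the thresholding "tool", and the stopping rule are all placeholder assumptions standing in for the MLLM and a promptable segmenter.

```python
import numpy as np

def propose_clicks(image, mask):
    """Stand-in for the MLLM's reasoning step: emit an (x, y) click action.
    Toy policy: click the brightest pixel not yet covered by the mask."""
    remaining = np.where(mask, -np.inf, image)
    y, x = np.unravel_index(np.argmax(remaining), image.shape)
    return [(x, y)]

def segment_from_clicks(image, clicks, threshold=0.5):
    """Stand-in for the external segmentation tool: a fixed-threshold
    region rule. A real promptable segmenter would condition on the clicks."""
    mask = np.zeros_like(image, dtype=bool)
    for x, y in clicks:
        mask |= image >= threshold  # toy rule; clicks are accumulated prompts
    return mask

def agentic_segmentation(image, max_steps=3):
    """Multi-step decision loop: reason -> click -> segment -> refine,
    stopping once an extra step no longer changes the mask."""
    mask = np.zeros_like(image, dtype=bool)
    all_clicks = []
    for _ in range(max_steps):
        all_clicks += propose_clicks(image, mask)
        new_mask = segment_from_clicks(image, all_clicks)
        if np.array_equal(new_mask, mask):  # no refinement -> stop early
            break
        mask = new_mask
    return mask

image = np.array([[0.1, 0.9, 0.8],
                  [0.2, 0.95, 0.1],
                  [0.0, 0.3, 0.7]])
mask = agentic_segmentation(image)
print(mask.sum())  # number of pixels in the final mask
```

In the paper's framing, the key point this illustrates is that the language model itself never predicts mask pixels; it only emits discrete click actions and observes tool output, so no segmentation tokens or decoder fine-tuning touch the MLLM's weights.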
Problem

Research questions and friction points this paper is trying to address.

biomedical object segmentation
pixel-level visual reasoning
multimodal large language models
referring segmentation
catastrophic forgetting
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic MLLM
pixel-level visual reasoning
iterative segmentation refinement
reinforcement learning
medical image segmentation
Yankai Jiang
Zhejiang University
Qiaoru Li
Zhejiang University
Binlu Xu
Zhejiang University
Haoran Sun
Shanghai Artificial Intelligence Laboratory
Chao Ding
Shanghai Artificial Intelligence Laboratory
Junting Dong
Zhejiang University
Computer Vision
Yuxiang Cai
Zhejiang University
Xuhong Zhang
Zhejiang University
LLM, VLM, VLA, Trustworthy AI
Jianwei Yin
Professor of Computer Science and Technology, Zhejiang University
Service Computing, Computer Architecture, Distributed Computing, AI