🤖 AI Summary
Medical vision-language models often generate clinically relevant descriptions that lack visual grounding, a limitation attributed to the scarcity of high-quality referring expression localization data. To address this, the work proposes the first scalable framework for constructing medical referring expression datasets: it automatically converts expert-annotated segmentation masks into spatial anchors and synthesizes high-fidelity clinical query–image pairs by combining geometric rules, medical priors, and multi-stage validation (format checking, rule-based constraints, and visual verification). The resulting MedGround-35K dataset substantially improves model performance in referring expression localization, multi-object semantic disambiguation, and generalization to unseen clinical scenarios, aligning language precisely with verifiable visual evidence.
📝 Abstract
Vision-Language Models (VLMs) can generate convincing clinical narratives, yet frequently struggle to visually ground their statements. We posit that this limitation arises from the scarcity of high-quality, large-scale clinical referring-localization pairs. To address this, we introduce MedGround, an automated pipeline that transforms segmentation resources into high-quality medical referring grounding data. Leveraging expert masks as spatial anchors, MedGround precisely derives localization targets, extracts shape and spatial cues, and guides VLMs to synthesize natural, clinically grounded queries that reflect morphology and location. To ensure data rigor, a multi-stage verification system combines strict formatting checks, geometry- and medical-prior rules, and image-based visual judging to filter out ambiguous or visually unsupported samples. Finally, we present MedGround-35K, a novel multimodal medical dataset. Extensive experiments demonstrate that VLMs trained with MedGround-35K consistently improve in referring grounding, multi-object semantic disambiguation, and generalization to unseen grounding settings. This work highlights MedGround as a scalable, data-driven approach to anchoring medical language to verifiable visual evidence. Dataset and code will be released publicly upon acceptance.
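To make the pipeline concrete, the sketch below illustrates the two mechanical stages the abstract describes: deriving a spatial anchor with shape/spatial cues from an expert mask, and a rule-based validation pass. This is an illustrative assumption, not the authors' released code; the function names, cue set (`extent`, `side`), and rejection rules are hypothetical stand-ins for the paper's stricter checks.

```python
def mask_to_anchor(mask):
    """Convert a binary segmentation mask (list of lists of 0/1) into a
    bounding-box anchor plus simple shape and spatial cues.
    Hypothetical sketch; MedGround's actual cue set is not specified here."""
    ys = [r for r, row in enumerate(mask) if any(row)]
    xs = [c for row in mask for c, v in enumerate(row) if v]
    if not ys:
        return None  # empty mask: nothing to ground
    x0, y0, x1, y1 = min(xs), min(ys), max(xs), max(ys)
    w, h = x1 - x0 + 1, y1 - y0 + 1
    area = sum(sum(row) for row in mask)
    height, width = len(mask), len(mask[0])
    cx = (x0 + x1) / 2 / width  # normalized horizontal center
    return {
        "bbox": (x0, y0, x1, y1),
        "extent": round(area / (w * h), 3),  # fill ratio of the box (shape cue)
        "side": "left" if cx < 0.5 else "right",  # image-space spatial cue
    }

def passes_rule_checks(sample):
    """Stand-in for the rule-based validation stage: reject degenerate
    boxes and empty queries before any visual verification."""
    x0, y0, x1, y1 = sample["bbox"]
    return x1 > x0 and y1 > y0 and bool(sample.get("query", "").strip())
```

In this sketch, cues such as `extent` and `side` would be handed to a VLM as conditioning context for query synthesis, and `passes_rule_checks` models only the cheap rule stage; the paper's pipeline adds format checks, medical-prior constraints, and an image-based visual judge on top.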