MedVision: Dataset and Benchmark for Quantitative Medical Image Analysis

📅 2025-11-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing medical vision-language models (VLMs) focus mainly on classification and qualitative description, lacking clinically essential quantitative reasoning capabilities such as tumor size estimation and anatomical angle measurement. To address this gap, the authors introduce MedVision, a large-scale dataset and benchmark explicitly designed for quantitative analysis of medical images. MedVision encompasses three core tasks: detection of anatomical structures and abnormalities, tumor/lesion size estimation, and angle/distance measurement, compiled from 22 public datasets totaling 30.8 million image-annotation pairs. Off-the-shelf VLMs perform poorly on these tasks, but supervised fine-tuning on MedVision substantially enhances numerical perception and spatial reasoning while preserving linguistic comprehension, significantly reducing error rates (a reported 32.7% mean reduction) across all three quantitative tasks. This work establishes a verifiable, scalable VLM paradigm for clinical quantification and decision support.

📝 Abstract
Current vision-language models (VLMs) in medicine are primarily designed for categorical question answering (e.g., "Is this normal or abnormal?") or qualitative descriptive tasks. However, clinical decision-making often relies on quantitative assessments, such as measuring the size of a tumor or the angle of a joint, from which physicians draw their own diagnostic conclusions. This quantitative reasoning capability remains underexplored and poorly supported in existing VLMs. In this work, we introduce MedVision, a large-scale dataset and benchmark specifically designed to evaluate and improve VLMs on quantitative medical image analysis. MedVision spans 22 public datasets covering diverse anatomies and modalities, with 30.8 million image-annotation pairs. We focus on three representative quantitative tasks: (1) detection of anatomical structures and abnormalities, (2) tumor/lesion (T/L) size estimation, and (3) angle/distance (A/D) measurement. Our benchmarks show that current off-the-shelf VLMs perform poorly on these tasks. However, with supervised fine-tuning on MedVision, we significantly enhance their performance across detection, T/L estimation, and A/D measurement, demonstrating reduced error rates and improved precision. This work provides a foundation for developing VLMs with robust quantitative reasoning capabilities in medical imaging. Code and data are available at https://medvision-vlm.github.io.
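The abstract reports "reduced error rates and improved precision" for the size and angle estimation tasks without naming an exact metric. A minimal sketch of one common choice for scoring such quantitative predictions, mean relative error, is shown below; the function name and the example measurements are illustrative, not taken from the paper.

```python
def mean_relative_error(predictions, targets):
    """Average |pred - true| / |true| over paired scalar measurements,
    e.g. predicted vs. ground-truth tumor diameters or joint angles."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have equal length")
    return sum(abs(p - t) / abs(t) for p, t in zip(predictions, targets)) / len(targets)

# Hypothetical example: tumor diameters (mm) predicted by a VLM vs. ground truth.
pred_mm = [21.0, 34.5, 12.0]
true_mm = [20.0, 30.0, 15.0]
print(round(mean_relative_error(pred_mm, true_mm), 4))  # → 0.1333
```

The same function applies unchanged to angle or distance estimates, since all three MedVision measurement tasks reduce to comparing a predicted scalar against an annotated ground truth.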
Problem

Research questions and friction points this paper is trying to address.

Addressing the lack of quantitative reasoning in medical vision-language models
Providing a benchmark for tumor size estimation and anatomical measurements
Improving detection and measurement accuracy in medical image analysis
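The detection task listed above (localizing anatomical structures and abnormalities) is typically scored by intersection-over-union between predicted and annotated boxes. The paper does not state its exact detection metric, so the sketch below is an assumption showing the standard IoU computation for axis-aligned boxes.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap width/height, clamped to zero when the boxes are disjoint.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical example: predicted lesion box vs. annotated ground truth.
print(round(iou((0, 0, 2, 2), (1, 1, 3, 3)), 4))  # → 0.1429
```

A threshold on IoU (commonly 0.5) then decides whether a predicted detection counts as correct.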
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset for quantitative medical image analysis
Supervised fine-tuning enhances detection and measurement accuracy
Benchmark spans diverse anatomies and imaging modalities