U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding

📅 2025-05-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Ultrasound image quality is highly susceptible to operator variability, noise, and anatomical heterogeneity, yet the capabilities of Large Vision-Language Models (LVLMs) on ultrasound imagery have not been systematically assessed. Method: We introduce U2-BENCH, the first comprehensive benchmark for LVLM-based ultrasound understanding. It comprises eight clinically motivated tasks spanning 15 anatomical regions, 50 clinical scenarios, and 7,241 real-world cases, and covers classification, detection, regression, and text generation. We establish a unified, open-source, multi-granularity evaluation framework specifically designed for dynamic, noise-sensitive, operator-dependent medical imaging. Contribution/Results: Evaluating 20 state-of-the-art LVLMs across these tasks, we find strong performance on image-level classification but significant bottlenecks in spatial localization and clinical report generation. The benchmark fills a critical gap in ultrasound AI evaluation, providing a reproducible foundation and concrete directions for future work.

📝 Abstract

Ultrasound is a widely used imaging modality critical to global healthcare, yet its interpretation remains challenging because image quality varies with the operator, noise, and anatomical structure. Although large vision-language models (LVLMs) have demonstrated impressive multimodal capabilities across natural and medical domains, their performance on ultrasound remains largely unexplored. We introduce U2-BENCH, the first comprehensive benchmark to evaluate LVLMs on ultrasound understanding across classification, detection, regression, and text generation tasks. U2-BENCH aggregates 7,241 cases spanning 15 anatomical regions and defines 8 clinically inspired tasks, such as diagnosis, view recognition, lesion localization, clinical value estimation, and report generation, across 50 ultrasound application scenarios. We evaluate 20 state-of-the-art LVLMs, both open- and closed-source, general-purpose and medical-specific. Our results reveal strong performance on image-level classification, but persistent challenges in spatial reasoning and clinical language generation. U2-BENCH establishes a rigorous and unified testbed to assess and accelerate LVLM research in the uniquely multimodal domain of medical ultrasound imaging.
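To make the four task families concrete, the following is a minimal sketch of how per-family scores could be aggregated over benchmark cases. It is not the U2-BENCH evaluation code: the metric choices (exact-match accuracy, box IoU, absolute error, token-level F1), the case dictionary layout, and all function names are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def accuracy(pred, gold):
    """Exact-match score for a single classification case."""
    return 1.0 if pred == gold else 0.0


def box_iou(pred, gold):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(pred[0], gold[0]), max(pred[1], gold[1])
    ix2, iy2 = min(pred[2], gold[2]), min(pred[3], gold[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    def area(box):
        return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

    union = area(pred) + area(gold) - inter
    return inter / union if union > 0 else 0.0


def abs_error(pred, gold):
    """Absolute error for a single regression case (lower is better)."""
    return abs(pred - gold)


def token_f1(pred, gold):
    """Whitespace-token F1 between generated and reference text."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


# Hypothetical mapping from task family to metric (an assumption).
METRICS = {
    "classification": accuracy,
    "detection": box_iou,
    "regression": abs_error,
    "generation": token_f1,
}


def evaluate(cases):
    """Average each metric over the cases of its task family."""
    scores = defaultdict(list)
    for case in cases:
        metric = METRICS[case["type"]]
        scores[case["type"]].append(metric(case["pred"], case["gold"]))
    return {family: sum(v) / len(v) for family, v in scores.items()}


# Toy cases with made-up values, purely to exercise the functions.
cases = [
    {"type": "classification", "pred": "benign", "gold": "benign"},
    {"type": "detection", "pred": (0, 0, 2, 2), "gold": (1, 1, 3, 3)},
    {"type": "regression", "pred": 52.0, "gold": 50.0},
    {"type": "generation", "pred": "normal liver", "gold": "normal liver echo"},
]
results = evaluate(cases)
```

Reporting one number per task family, rather than a single pooled score, keeps a strong classification result from masking weak localization or report generation, which is the pattern the abstract describes.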

Problem

Research questions and friction points this paper is trying to address.

Evaluating LVLMs on ultrasound understanding tasks

Assessing performance across classification, detection, regression, and text generation

Identifying persistent challenges in spatial reasoning and clinical language generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

First benchmark for LVLMs on ultrasound

Evaluates 20 models across 50 scenarios

Tests classification, detection, regression, and text generation

🔎 Similar Papers

LMOD: A Large Multimodal Ophthalmology Dataset and Benchmark for Large Vision-Language Models