🤖 AI Summary
This work addresses a limitation of existing evaluations of vision-language models, which rely on static ultrasound images and therefore do not test whether a model understands dynamic procedural context. To bridge this gap, we introduce ReXSonoVQA, the first video-based question-answering benchmark specifically designed for end-to-end ultrasound examination workflows. The benchmark assesses three core capabilities: action-goal reasoning, artifact resolution and optimization, and procedural context understanding and planning. Using a video QA format that combines multiple-choice and open-ended questions, we evaluate state-of-the-art models under zero-shot settings. The results show that current models extract some procedural information but gain little over text-only baselines on troubleshooting questions, which points to weak causal reasoning and underscores the value of the benchmark for advancing intelligent ultrasound systems.
📝 Abstract
Ultrasound acquisition requires skilled probe manipulation and real-time adjustments. Vision-language models (VLMs) could enable autonomous ultrasound systems, but existing benchmarks evaluate only static images, not dynamic procedural understanding. We introduce ReXSonoVQA, a video QA benchmark with 514 video clips and 514 questions (249 MCQ, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. Zero-shot evaluation of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro shows VLMs can extract some procedural information, but troubleshooting questions remain challenging with minimal gains over text-only baselines, exposing limitations in causal reasoning. ReXSonoVQA enables developing perception systems for ultrasound training, guidance, and robotic automation.
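The abstract does not spell out how answers are scored, so the following is only a minimal sketch of how the comparison against a text-only baseline might be computed for the multiple-choice subset. It assumes exact-match scoring on option letters and one question per clip (consistent with 514 clips and 514 questions); the `QAItem` fields, the function names, and the toy items are all hypothetical and are not taken from the benchmark. Free-response scoring is not covered.

```python
from dataclasses import dataclass
from typing import Optional

# The three competencies named in the abstract.
COMPETENCIES = (
    "Action-Goal Reasoning",
    "Artifact Resolution & Optimization",
    "Procedure Context & Planning",
)


@dataclass
class QAItem:
    """One benchmark item. Field names are hypothetical, not the real schema."""
    clip_id: str
    competency: str
    question: str
    answer: str                     # gold option letter (MCQ) or reference text
    options: Optional[list] = None  # None marks a free-response item


def mcq_accuracy(items, predictions):
    """Exact-match accuracy over MCQ items (assumed scoring rule).

    `predictions` maps clip_id to a predicted option letter; a missing
    prediction counts as wrong.
    """
    mcq = [it for it in items if it.options is not None]
    if not mcq:
        return 0.0
    correct = sum(
        predictions.get(it.clip_id, "").strip().upper() == it.answer.strip().upper()
        for it in mcq
    )
    return correct / len(mcq)


def video_gain(items, video_preds, text_only_preds):
    """Accuracy with video input minus accuracy with the question text alone."""
    return mcq_accuracy(items, video_preds) - mcq_accuracy(items, text_only_preds)


# Toy data, invented purely for illustration.
items = [
    QAItem("c1", COMPETENCIES[0], "Why is the probe tilted?", "A", ["A", "B", "C"]),
    QAItem("c2", COMPETENCIES[1], "How is the shadowing resolved?", "B", ["A", "B", "C"]),
    QAItem("c3", COMPETENCIES[2], "Which view comes next?", "C", ["A", "B", "C"]),
    QAItem("c4", COMPETENCIES[1], "What adjustment is made?", "Increase gain"),
]
video_preds = {"c1": "A", "c2": "b ", "c3": "A"}
text_only_preds = {"c1": "A", "c2": "C", "c3": "B"}

gain = video_gain(items, video_preds, text_only_preds)
```

A gain near zero on a competency would mean the video contributes little beyond what the question text already gives away, which is the pattern the abstract reports for troubleshooting questions.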