ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding

📅 2026-04-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing vision-language model evaluation: models are assessed primarily on static ultrasound images, and they struggle to comprehend dynamic procedural context. To bridge this gap, we introduce the first video-based question-answering benchmark designed specifically for end-to-end ultrasound examination workflows. The benchmark assesses three core capabilities: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. Using a video QA format that combines multiple-choice and open-ended questions, we evaluate state-of-the-art models in a zero-shot setting. The results reveal significant deficiencies in current models' troubleshooting and causal reasoning, which supports the need for a procedure-centric benchmark in developing intelligent ultrasound systems.

📝 Abstract

Ultrasound acquisition requires skilled probe manipulation and real-time adjustments. Vision-language models (VLMs) could enable autonomous ultrasound systems, but existing benchmarks evaluate only static images, not dynamic procedural understanding. We introduce ReXSonoVQA, a video QA benchmark with 514 video clips and 514 questions (249 MCQ, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. Zero-shot evaluation of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro shows VLMs can extract some procedural information, but troubleshooting questions remain challenging with minimal gains over text-only baselines, exposing limitations in causal reasoning. ReXSonoVQA enables developing perception systems for ultrasound training, guidance, and robotic automation.
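
The comparison described in the abstract (accuracy on multiple-choice questions per competency, and the gain of video input over a text-only baseline) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `McqItem` schema, the function names, the clip IDs, and the toy predictions are hypothetical and do not reflect the benchmark's released format, its scoring protocol, or any reported result. Free-response scoring is not covered.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class McqItem:
    """One multiple-choice item (hypothetical schema, not the released format)."""
    clip_id: str
    competency: str  # e.g. "Action-Goal Reasoning"
    answer: str      # gold option letter, e.g. "B"


def accuracy_by_competency(items, predictions):
    """Return {competency: fraction correct}; a missing prediction counts as wrong."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item.competency] += 1
        predicted = predictions.get(item.clip_id, "").strip().upper()
        if predicted == item.answer.upper():
            correct[item.competency] += 1
    return {name: correct[name] / total[name] for name in total}


def gain_over_text_only(video_acc, text_acc):
    """Per-competency accuracy difference: video input minus text-only baseline."""
    return {name: video_acc[name] - text_acc.get(name, 0.0) for name in video_acc}


# Toy data invented for this sketch; these are not benchmark items or results.
items = [
    McqItem("clip-1", "Action-Goal Reasoning", "A"),
    McqItem("clip-2", "Action-Goal Reasoning", "C"),
    McqItem("clip-3", "Artifact Resolution & Optimization", "B"),
    McqItem("clip-4", "Artifact Resolution & Optimization", "D"),
]
video_predictions = {"clip-1": "A", "clip-2": "c", "clip-3": "B", "clip-4": "A"}
text_predictions = {"clip-1": "A", "clip-2": "B", "clip-3": "B", "clip-4": "A"}

video_acc = accuracy_by_competency(items, video_predictions)
text_acc = accuracy_by_competency(items, text_predictions)
gains = gain_over_text_only(video_acc, text_acc)
print(gains)
```

Reporting the gain per competency rather than as a single overall number makes it visible when video input helps on one question type but adds little on another.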

Problem

Research questions and friction points this paper is trying to address.

ultrasound understanding

video question answering

procedural reasoning

vision-language models

benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

Video QA

Ultrasound Procedure Understanding

Vision-Language Models