MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing medical audio benchmarks are limited by privacy constraints and high annotation costs, making it difficult to evaluate models’ multimodal reasoning capabilities in real-world clinical settings. To address this gap, this work proposes MedMosaic, a large-scale, diverse medical audio question-answering benchmark that integrates authentic physiological sounds, synthetic noisy speech, and both short and long clinical dialogues. It comprises 46,701 question-answer pairs spanning multiple-choice, multi-turn, and open-ended formats. Leveraging real recordings, controllable speech synthesis, and expert medical annotations, the authors establish a multimodal evaluation framework to systematically assess 13 state-of-the-art models. Results reveal that even the best-performing model (Gemini-2.5-Pro) achieves only about 68.1% accuracy, highlighting substantial room for improvement in medical audio understanding.

📝 Abstract

We present MedMosaic, a medical audio question-answering dataset designed to benchmark language and audio reasoning models under realistic clinical constraints. Medical audio data is difficult to collect because of privacy regulations and the high annotation costs that come with the need for domain expertise. As a result, existing benchmarks tend to underrepresent complex medical audio scenarios. To address these challenges, MedMosaic features a diverse range of medical audio types: condition-related physiological sounds, carefully constructed synthetic voices that mimic speech with artifacts, and real clinical conversations of both short and long duration to model varying context lengths. The dataset contains a total of 46,701 question-answer pairs spanning multiple-choice, sequential multi-turn, and open-ended formats, enabling systematic evaluation of multi-hop reasoning and answer generation capabilities. Benchmarking 13 audio and multimodal reasoning models reveals that reasoning remains challenging for all evaluated systems, with substantial performance variation across question types. In particular, even a state-of-the-art model such as Gemini-2.5-Pro achieves only about 68.1% accuracy. These findings underscore persistent limitations in medical reasoning and highlight the need for more robust, domain-specific multimodal reasoning models.

Problem

Research questions and friction points this paper is trying to address.

medical audio

question answering

benchmark

multimodal reasoning

clinical constraints

Innovation

Methods, ideas, or system contributions that make the work stand out.

medical audio benchmark

multimodal reasoning

clinical conversation modeling