🤖 AI Summary
Medical audio diagnosis, particularly of heart and lung sounds, faces critical bottlenecks: heavy reliance on handcrafted features and severe scarcity of labeled data. To address these challenges, we propose CaReAQA, the first cross-modal diagnostic reasoning framework designed specifically for cardiopulmonary auscultation. It integrates foundation audio encoders (Whisper/AST variants) with large language models (LLaMA/Qwen) to build an end-to-end system that maps audio to clinically interpretable question answering. Our key contributions are: (1) the first joint audio–language reasoning paradigm for cardiopulmonary sounds; (2) CaReSound, the first medical audio benchmark featuring structured metadata and open-ended clinical question–answer pairs; and (3) support for unstructured, answer-agnostic clinical reasoning. Experiments demonstrate state-of-the-art performance: 86.2% accuracy on open-ended diagnostic tasks and 56.9% on cross-domain closed-set classification, substantially outperforming audio-only and text-only baselines.
📝 Abstract
Medical audio signals, such as heart and lung sounds, play a crucial role in clinical diagnosis. However, analyzing these signals remains challenging: traditional methods rely on handcrafted features or supervised deep learning models that demand extensive labeled datasets, limiting their scalability and applicability. To address these issues, we propose CaReAQA, an audio-language model that integrates a foundation audio model with the reasoning capabilities of large language models, enabling clinically relevant, open-ended diagnostic responses. Alongside CaReAQA, we introduce CaReSound, a benchmark dataset of annotated medical audio recordings enriched with metadata and paired question-answer examples, intended to drive progress in diagnostic reasoning research. Evaluation results show that CaReAQA achieves 86.2% accuracy on open-ended diagnostic reasoning tasks, outperforming baseline models. It also generalizes well to closed-ended classification tasks, achieving an average accuracy of 56.9% on unseen datasets. Our findings show how audio-language integration and reasoning advance medical diagnostics, enabling efficient AI systems for clinical decision support.
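To make the audio-language integration described above concrete, here is a minimal NumPy sketch of the general pattern such systems follow: a (frozen) audio encoder produces frame embeddings, a trainable adapter projects them into the LLM's token-embedding space, and the projected audio tokens are prepended to the embedded question as a soft prompt. All dimensions, function names, and the random projections are illustrative assumptions, not details taken from the CaReAQA paper.

```python
import numpy as np

# Illustrative dimensions (assumptions, not values from the paper).
AUDIO_DIM, LLM_DIM, N_FRAMES, FRAME_LEN = 512, 1024, 50, 160

rng = np.random.default_rng(0)

def encode_audio(waveform: np.ndarray) -> np.ndarray:
    """Stand-in for a frozen audio encoder (e.g. a Whisper/AST variant):
    maps a raw waveform to a sequence of frame embeddings."""
    # A real encoder applies convolution/attention stacks; here we just
    # fold the waveform into fixed-size frames and project them randomly.
    frames = waveform[: N_FRAMES * FRAME_LEN].reshape(N_FRAMES, FRAME_LEN)
    W = rng.standard_normal((FRAME_LEN, AUDIO_DIM)) / np.sqrt(FRAME_LEN)
    return frames @ W  # shape (N_FRAMES, AUDIO_DIM)

def project_to_llm_space(audio_emb: np.ndarray) -> np.ndarray:
    """Trainable linear adapter mapping audio embeddings into the LLM
    token-embedding space so they can be consumed as soft prompt tokens."""
    W = rng.standard_normal((AUDIO_DIM, LLM_DIM)) / np.sqrt(AUDIO_DIM)
    return audio_emb @ W  # shape (N_FRAMES, LLM_DIM)

def build_llm_input(audio_tokens: np.ndarray,
                    question_tokens: np.ndarray) -> np.ndarray:
    """Concatenate projected audio tokens with the embedded clinical
    question, forming the prefix sequence the LLM decodes an answer from."""
    return np.concatenate([audio_tokens, question_tokens], axis=0)

waveform = rng.standard_normal(16_000)            # 1 s of audio at 16 kHz
audio_tokens = project_to_llm_space(encode_audio(waveform))
question_tokens = rng.standard_normal((12, LLM_DIM))  # embedded question
llm_input = build_llm_input(audio_tokens, question_tokens)
print(llm_input.shape)  # (62, 1024): 50 audio tokens + 12 question tokens
```

In practice the adapter (and optionally the LLM) is trained on audio-question-answer triples such as those in CaReSound, while the audio encoder is typically kept frozen to exploit its pretrained representations.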