🤖 AI Summary
Medical audio diagnosis, particularly of heart and lung sounds, faces critical bottlenecks: heavy reliance on handcrafted features and severe scarcity of labeled data. To address these challenges, we propose CaReAQA, the first cross-modal diagnostic reasoning framework designed specifically for cardiopulmonary auscultation. It integrates foundation audio encoders (Whisper/AST variants) with large language models (LLaMA/Qwen) to build an end-to-end system that maps audio to clinically interpretable question answering. Our key contributions are: (1) the first joint audio–language reasoning paradigm for cardiopulmonary sounds; (2) CaReSound, the first medical audio benchmark featuring structured metadata and open-ended clinical question–answer pairs; and (3) support for unstructured, answer-agnostic clinical reasoning. Experiments demonstrate state-of-the-art performance: 86.2% accuracy on open-ended diagnostic tasks and 56.9% on cross-domain closed-set classification, substantially outperforming audio-only and text-only baselines.
📝 Abstract
Medical audio signals, such as heart and lung sounds, play a crucial role in clinical diagnosis. However, analyzing these signals remains challenging: traditional methods rely on handcrafted features or supervised deep learning models that demand extensive labeled datasets, limiting their scalability and applicability. To address these issues, we propose CaReAQA, an audio-language model that integrates a foundation audio model with the reasoning capabilities of large language models, enabling clinically relevant, open-ended diagnostic responses. Alongside CaReAQA, we introduce CaReSound, a benchmark dataset of annotated medical audio recordings enriched with metadata and paired question-answer examples, intended to drive progress in diagnostic reasoning research. Evaluation results show that CaReAQA achieves 86.2% accuracy on open-ended diagnostic reasoning tasks, outperforming baseline models. It also generalizes well to closed-ended classification tasks, achieving an average accuracy of 56.9% on unseen datasets. Our findings show how audio-language integration and reasoning advance medical diagnostics, enabling efficient AI systems for clinical decision support.
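To make the audio-language integration described above concrete, here is a minimal NumPy sketch of the general pattern such systems follow: a (frozen) audio encoder produces frame embeddings, a trainable adapter projects them into the LLM's token-embedding space, and the projected audio tokens are prepended to the embedded question as a soft prompt. All dimensions, function names, and the random projections are illustrative assumptions, not details taken from the CaReAQA paper.

```python
import numpy as np

# Illustrative dimensions (assumptions, not values from the paper).
AUDIO_DIM, LLM_DIM, N_FRAMES, FRAME_LEN = 512, 1024, 50, 160

rng = np.random.default_rng(0)

def encode_audio(waveform: np.ndarray) -> np.ndarray:
    """Stand-in for a frozen audio encoder (e.g. a Whisper/AST variant):
    maps a raw waveform to a sequence of frame embeddings."""
    # A real encoder applies convolution/attention stacks; here we just
    # fold the waveform into fixed-size frames and project them randomly.
    frames = waveform[: N_FRAMES * FRAME_LEN].reshape(N_FRAMES, FRAME_LEN)
    W = rng.standard_normal((FRAME_LEN, AUDIO_DIM)) / np.sqrt(FRAME_LEN)
    return frames @ W  # shape (N_FRAMES, AUDIO_DIM)

def project_to_llm_space(audio_emb: np.ndarray) -> np.ndarray:
    """Trainable linear adapter mapping audio embeddings into the LLM
    token-embedding space so they can be consumed as soft prompt tokens."""
    W = rng.standard_normal((AUDIO_DIM, LLM_DIM)) / np.sqrt(AUDIO_DIM)
    return audio_emb @ W  # shape (N_FRAMES, LLM_DIM)

def build_llm_input(audio_tokens: np.ndarray,
                    question_tokens: np.ndarray) -> np.ndarray:
    """Concatenate projected audio tokens with the embedded clinical
    question, forming the prefix sequence the LLM decodes an answer from."""
    return np.concatenate([audio_tokens, question_tokens], axis=0)

waveform = rng.standard_normal(16_000)            # 1 s of audio at 16 kHz
audio_tokens = project_to_llm_space(encode_audio(waveform))
question_tokens = rng.standard_normal((12, LLM_DIM))  # embedded question
llm_input = build_llm_input(audio_tokens, question_tokens)
print(llm_input.shape)  # (62, 1024): 50 audio tokens + 12 question tokens
```

In practice the adapter (and optionally the LLM) is trained on audio-question-answer triples such as those in CaReSound, while the audio encoder is typically kept frozen to exploit its pretrained representations.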