Patient-Level Multimodal Question Answering from Multi-Site Auscultation Recordings

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of subjective interpretation in auscultation recordings and the inability of general-purpose audio language models to capture subtle physiological signal characteristics. To overcome these challenges, the authors propose a lightweight, domain-specific encoder that integrates a multi-site auscultation signal aggregation strategy with a gated cross-attention mechanism. This approach aligns multi-channel auscultatory features with the embedding space of a frozen large language model, leveraging its broad world knowledge for holistic patient-level assessment. The method mitigates temporal truncation issues without requiring extensive retraining and achieves state-of-the-art performance on the CaReSound benchmark, yielding an F1-macro score of 0.865 and a BERTScore of 0.952.
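To make the core mechanism concrete, here is a minimal NumPy sketch of gated cross-attention as described above: frozen-LLM token embeddings attend over auscultation encoder features, and a tanh gate (initialized near zero) controls how much audio information is injected, so the frozen LLM's behavior is preserved at the start of training. This is an illustrative sketch only, not the authors' implementation; the function name, shapes, and projection matrices are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(queries, audio_feats, Wq, Wk, Wv, gate):
    """Hypothetical single-head gated cross-attention.

    queries:     (T_txt, d)   token embeddings in the frozen LLM space
    audio_feats: (T_aud, d_a) outputs of the domain-specific audio encoder
    Wq, Wk, Wv:  learned projections aligning the two modalities
    gate:        scalar; tanh(0) = 0 leaves the frozen LLM untouched
    """
    Q = queries @ Wq                                  # (T_txt, d_h)
    K = audio_feats @ Wk                              # (T_aud, d_h)
    V = audio_feats @ Wv                              # (T_aud, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))    # (T_txt, T_aud)
    # Gated residual: audio evidence is blended into the text stream
    return queries + np.tanh(gate) * (attn @ V)
```

With `gate=0.0` the output equals the input queries, which is why this style of adapter can be bolted onto a frozen LLM without extensive retraining.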

📝 Abstract
Auscultation is a vital diagnostic tool, yet its utility is often limited by subjective interpretation. While general-purpose Audio-Language Models (ALMs) excel in general domains, they struggle with the nuances of physiological signals. We propose a framework that aligns multi-site auscultation recordings directly with a frozen Large Language Model (LLM) embedding space via gated cross-attention. By leveraging the LLM's latent world knowledge, our approach moves beyond isolated classification toward holistic, patient-level assessment. On the CaReSound benchmark, our model achieves a state-of-the-art 0.865 F1-macro and 0.952 BERTScore. We demonstrate that lightweight, domain-specific encoders rival large-scale ALMs and that multi-site aggregation provides spatial redundancy that mitigates temporal truncation. This alignment of medical acoustics with text foundations offers a scalable path for bridging signal processing and clinical assessment.
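The abstract's claim that multi-site aggregation "provides spatial redundancy that mitigates temporal truncation" can be sketched as follows: if the frame budget is split evenly across auscultation sites before concatenation, truncation shortens every site a little rather than dropping later sites entirely. This is an assumption-laden illustration, not the paper's actual aggregation strategy; the function and its signature are hypothetical.

```python
import numpy as np

def aggregate_sites(site_feats, max_frames):
    """Concatenate per-site feature sequences under a fixed frame budget.

    site_feats: list of (T_i, d) arrays, one per recording site
    max_frames: total frame budget imposed by the LLM context window
    """
    per_site = max_frames // len(site_feats)
    # Clip each site to its share of the budget so every site survives
    clipped = [f[:per_site] for f in site_feats]
    return np.concatenate(clipped, axis=0)
```

A naive alternative, concatenating all sites and truncating the tail, would discard the last sites entirely once the budget is exceeded; the even split above keeps partial coverage of each site.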
Problem

Research questions and friction points this paper is trying to address.

multimodal question answering
auscultation recordings
physiological signals
patient-level assessment
audio-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal alignment
gated cross-attention
frozen LLM
multi-site auscultation
patient-level assessment
Fan Wu
Munich Institute of Robotics and Machine Intelligence, Technical University of Munich
Robotics · Machine Intelligence
Tsai-Ning Wang
Eindhoven University of Technology
Nicolas Zumarraga
Agentic Systems Lab, ETH Zurich
Ning Wang
Agentic Systems Lab, ETH Zurich
Markus Kreft
ETH Zurich
machine learning · energy efficiency · smart grid · electric vehicles · sustainability
Kevin O'Sullivan
Agentic Systems Lab, ETH Zurich
Elgar Fleisch
Professor for Information and Technology Management
Internet of Things · Information Management · Technology Management
Oliver Aalami
Stanford Mussallem Center for Biodesign, Stanford University
Paul Schmiedmayer
Stanford University
Digital Health · TSLMAI · Software Engineering · Mobile Applications
Robert Jakob
Agentic Systems Lab, ETH Zurich
Patrick Langer
Agentic Systems Lab, ETH Zurich; Stanford Mussallem Center for Biodesign, Stanford University; Centre for Digital Health Interventions, ETH Zurich