🤖 AI Summary
This work addresses the limited generalization of existing biomedical audio question-answering systems, which stems from the high heterogeneity of respiratory sound data and insufficient support for diverse question types and answer formats. To overcome these challenges, the authors propose a hierarchical routing generative model that integrates a Mixture of Audio Experts (Audio MoE) with a Mixture of Language Adapters (Language MoA). Built upon a frozen large language model, the framework employs a two-stage conditional specialization mechanism to dynamically select the optimal audio encoder and LoRA adapter, thereby unifying support for both continuous and discrete answers across varied question intents. The method achieves an in-domain accuracy of 0.72, substantially outperforming current state-of-the-art baselines (0.61 and 0.67), and demonstrates strong diagnostic generalization across cross-domain, cross-modal, and cross-task scenarios.
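The two-stage conditional specialization described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the class and variable names (`TwoStageRouter`, `W_audio`, `W_lang`), the linear gating functions, and the hard argmax selection are all assumptions made for illustration; the paper does not specify its gating details here.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

class TwoStageRouter:
    """Hypothetical sketch of hierarchical routing: stage 1 picks an
    audio expert from the recording features, stage 2 picks a LoRA
    adapter on the frozen LLM from the query-intent features."""

    def __init__(self, n_audio_experts, n_adapters, feat_dim, intent_dim):
        # Randomly initialised gating weights stand in for trained routers.
        self.W_audio = rng.normal(size=(feat_dim, n_audio_experts))
        self.W_lang = rng.normal(size=(intent_dim, n_adapters))

    def route(self, audio_feat, intent_feat):
        # Stage 1: gate over pre-trained audio encoders for this recording.
        audio_gate = softmax(audio_feat @ self.W_audio)
        # Stage 2: gate over LoRA adapters for this query intent.
        lang_gate = softmax(intent_feat @ self.W_lang)
        # Hard top-1 selection (an assumption; soft mixing is also possible).
        return int(audio_gate.argmax()), int(lang_gate.argmax())

router = TwoStageRouter(n_audio_experts=4, n_adapters=3, feat_dim=8, intent_dim=5)
expert_idx, adapter_idx = router.route(rng.normal(size=8), rng.normal(size=5))
```

The key design point the sketch captures is that the two gates condition on different inputs: the acoustic gate sees only the recording, while the adapter gate sees only the query, so acoustic heterogeneity and intent/format diversity are handled by separate specialization mechanisms.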
📝 Abstract
Conversational generative AI is rapidly entering healthcare, where general-purpose models must integrate heterogeneous patient signals and support diverse interaction styles while producing clinically meaningful outputs. In respiratory care, non-invasive audio, such as recordings captured via mobile microphones, enables scalable screening and longitudinal monitoring, but the heterogeneity challenge is particularly acute: recordings vary widely across devices, environments, and acquisition protocols, and questions span multiple intents and formats. Existing biomedical audio-language QA systems are typically monolithic, lacking specialization mechanisms for diverse respiratory corpora and query intents, and they have been validated only in limited conditions, leaving it unclear how reliably they handle the distribution shifts encountered in real-world deployment. To address these limitations, we introduce RAMoEA-QA, a hierarchically routed generative model for respiratory audio question answering that unifies multiple question types and supports both discrete and continuous targets within a single multimodal system. RAMoEA-QA applies two-stage conditional specialization: an Audio Mixture-of-Experts routes each recording to a suitable pre-trained audio encoder, and a Language Mixture-of-Adapters selects a LoRA adapter on a shared frozen LLM to match the query intent and answer format. By specializing both acoustic representations and generation behaviour per example, RAMoEA-QA consistently outperforms strong baselines and routing ablations with minimal parameter overhead, improving in-domain test accuracy to 0.72 (vs. 0.61 and 0.67 for state-of-the-art baselines) and exhibiting the strongest generalization for diagnosis under domain, modality, and task shifts.
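Because the LLM is frozen, "minimal parameter overhead" comes from the fact that only the small low-rank adapters are trained and swapped per query. A minimal NumPy sketch of how a routed LoRA adapter modifies a frozen projection (the standard LoRA form, W·x + α·B·A·x); the function name, shapes, and scaling are assumptions for illustration, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(1)

def lora_forward(x, W_frozen, adapters, idx, alpha=1.0):
    """Frozen linear layer plus the routed low-rank update.

    x        : (d_in,) input activation
    W_frozen : (d_in, d_out) frozen LLM weight (never updated)
    adapters : list of (A, B) pairs, A: (d_in, r), B: (r, d_out), r << d_in
    idx      : adapter chosen by the language router for this query
    """
    A, B = adapters[idx]
    # Base path uses frozen weights; only the A @ B path is adapter-specific.
    return x @ W_frozen + alpha * (x @ A @ B)

d_in, d_out, rank = 6, 4, 2
W = rng.normal(size=(d_in, d_out))
adapters = [(rng.normal(size=(d_in, rank)), rng.normal(size=(rank, d_out)))
            for _ in range(3)]
x = rng.normal(size=d_in)
y = lora_forward(x, W, adapters, idx=0)
```

Each adapter adds only (d_in + d_out)·r parameters versus d_in·d_out for the frozen matrix, which is why keeping one adapter per intent/answer format remains cheap relative to the shared LLM.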