🤖 AI Summary
People with voice disorders worldwide face barriers including limited access to diagnosis and insufficient multilingual support. To address these, we propose VocalAgent, the first audio large language model (Audio-LLM) tailored for vocal fold health diagnosis. It is built on the Qwen-Audio-Chat architecture and trained on hospital-collected tri-modal data (speech audio, transcribed text, and clinical labels) via instruction tuning and safety alignment for clinical deployment. Our key contributions are: (1) a safety-aware evaluation framework integrating diagnostic bias mitigation, cross-lingual robustness validation, and modality ablation analysis; and (2) an empirical demonstration of state-of-the-art performance on multilingual voice disorder classification, with superior accuracy, strong generalization across languages and demographics, and practical clinical deployability. This work lays the groundwork for equitable, globally scalable voice health diagnostics.
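To make the instruction-tuning setup concrete, below is a minimal sketch of how one tri-modal record (speech audio, transcript, clinical label) might be serialized into a chat-format training sample. The JSON schema, prompt wording, and disorder label set here are illustrative assumptions, not the paper's released format.

```python
# Illustrative only: the schema, prompt text, and label set are assumptions,
# not the paper's actual data format.
import json

# Hypothetical label set; the paper's clinical taxonomy may differ.
DISORDER_LABELS = ["healthy", "vocal fold polyp", "vocal fold paralysis",
                   "laryngitis", "vocal fold nodules"]

def build_sample(audio_path: str, transcript: str, label: str) -> dict:
    """Serialize one tri-modal record into a chat-style instruction-tuning example."""
    assert label in DISORDER_LABELS
    prompt = (
        f"Audio: <audio>{audio_path}</audio>\n"
        f"Transcript: {transcript}\n"
        f"Classify the speaker's vocal fold condition. "
        f"Choose one of: {', '.join(DISORDER_LABELS)}."
    )
    return {
        "conversations": [
            {"from": "user", "value": prompt},
            {"from": "assistant", "value": label},  # clinical label as target
        ]
    }

if __name__ == "__main__":
    sample = build_sample("patient_0042.wav",
                          "the rainbow is a division of white light",
                          "laryngitis")
    print(json.dumps(sample, indent=2))
```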
📝 Abstract
Vocal health plays a crucial role in people's lives, significantly impacting their ability to communicate and interact. Yet despite the global prevalence of voice disorders, many people lack access to convenient diagnosis and treatment. This paper introduces VocalAgent, an audio large language model (LLM) that addresses these challenges through vocal health diagnosis. We fine-tune Qwen-Audio-Chat on three datasets collected in situ from hospital patients, and present a multifaceted evaluation framework encompassing a safety assessment to mitigate diagnostic biases, a cross-lingual performance analysis, and modality ablation studies. VocalAgent achieves superior accuracy on voice disorder classification compared to state-of-the-art baselines. Its LLM-based approach offers a scalable path toward broader adoption of voice health diagnostics, while underscoring the importance of ethical and technical validation.
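As a sketch of how such a system can be queried at inference time, the snippet below follows the published Qwen-Audio-Chat chat interface that VocalAgent builds on. The diagnostic prompt, audio filename, and candidate label list are illustrative assumptions, not VocalAgent's exact protocol, and the base checkpoint stands in for the fine-tuned weights.

```python
# Inference sketch via the Qwen-Audio-Chat interface. The prompt, filename,
# and label list are illustrative; substitute fine-tuned weights where available.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen-Audio-Chat"  # base model that VocalAgent fine-tunes

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", trust_remote_code=True
).eval()

# Qwen-Audio-Chat accepts interleaved audio/text inputs via from_list_format.
query = tokenizer.from_list_format([
    {"audio": "patient_0042.wav"},  # hypothetical patient recording
    {"text": "Classify the speaker's vocal fold condition. Choose one of: "
             "healthy, vocal fold polyp, vocal fold paralysis, laryngitis, "
             "vocal fold nodules."},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```

A modality ablation of the kind the paper reports could reuse this pattern by issuing the query with audio only, text only, or both, and comparing classification accuracy across the three conditions.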