🤖 AI Summary
This work addresses three challenges of clinical speech recognition (acoustic noise, dense medical terminology, and speaker variability) by proposing Au-M-ol, an end-to-end architecture that integrates a pretrained large language model (LLM) into the medical speech processing pipeline. An audio encoder extracts acoustic features, and a lightweight adapter layer projects them into the LLM’s input space, enabling context-aware transcription and clinical language understanding. On medical transcription tasks, the approach reduces word error rate (WER) by 56% compared to state-of-the-art baselines, and it performs well under adverse acoustic conditions and with complex medical jargon.
📝 Abstract
In this work, we present Au-M-ol, a novel multimodal architecture that extends Large Language Models (LLMs) with audio processing. It is designed to improve performance on clinically relevant tasks such as Automatic Speech Recognition (ASR). Au-M-ol has three main components: (1) an audio encoder that extracts rich acoustic features from medical speech, (2) an adaptation layer that maps audio features into the LLM input space, and (3) a pretrained LLM that performs transcription and clinical language understanding. This design allows the model to interpret spoken medical content directly, improving both accuracy and robustness. In experiments, Au-M-ol reduces Word Error Rate (WER) by 56% compared to state-of-the-art baselines on medical transcription tasks. The model also performs well in challenging conditions, including noisy environments, domain-specific terminology, and speaker variability. These results suggest that Au-M-ol is a strong candidate for real-world clinical applications, where reliable and context-aware audio understanding is essential.
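For readers unfamiliar with the headline metric, the following is a minimal sketch of how WER and a relative WER reduction are conventionally computed. It is not the paper's evaluation code. It uses the standard definition of WER (word-level edit distance divided by the number of reference words) and assumes the reported 56% is a relative reduction, which the abstract does not state explicitly. The function names, the example sentences, and the two WER values are hypothetical and chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missing)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, model_wer: float) -> float:
    """Fraction of the baseline WER that the model removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - model_wer) / baseline_wer


# Hypothetical transcript pair: one substituted word out of six reference words.
wer = word_error_rate(
    "patient denies chest pain or dyspnea",
    "patient denies chess pain or dyspnea",
)
print(f"{wer:.3f}")  # 0.167

# Illustrative WER values only; these are NOT results from the paper.
print(f"{relative_wer_reduction(0.25, 0.11):.2f}")  # 0.56
```

Under this reading, a 56% reduction means the model's WER is 44% of the baseline's, whatever the baseline's absolute value is. The abstract does not report the absolute WER values.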