🤖 AI Summary
Pretrained audio models excel at capturing acoustic features of auscultatory sounds but lack clinical semantic understanding, limiting their diagnostic performance. To address this, we propose AcuLa, a framework that jointly optimizes a pretrained audio encoder and a medical language model via audio-language contrastive alignment and self-supervised representation learning. Leveraging large language models, AcuLa generates high-quality clinical reports to construct large-scale, temporally preserved audio-text pairs, thereby infusing clinical semantics without sacrificing fine-grained temporal structure. The method requires only lightweight post-training fine-tuning yet significantly enhances clinical perceptual capability. Evaluated on 10 public datasets across 18 cardiopulmonary diagnostic tasks, AcuLa achieves state-of-the-art performance, raising the average AUROC from 0.68 to 0.79. Notably, for COVID-19 cough detection, AUROC improves dramatically from 0.55 to 0.89, demonstrating both the efficacy and generalizability of clinical semantic alignment.
📄 Abstract
Pre-trained audio models excel at detecting acoustic patterns in auscultation sounds but often fail to grasp their clinical significance, limiting their use and performance in diagnostic tasks. To bridge this gap, we introduce AcuLa (Audio-Clinical Understanding via Language Alignment), a lightweight post-training framework that instills semantic understanding into any audio encoder by aligning it with a medical language model, which acts as a "semantic teacher." To enable alignment at scale, we construct a large-scale dataset by leveraging off-the-shelf large language models to translate the rich, structured metadata accompanying existing audio recordings into coherent clinical reports. Our alignment strategy combines a representation-level contrastive objective with a self-supervised modeling objective, ensuring that the model learns clinical semantics while preserving fine-grained temporal cues. AcuLa achieves state-of-the-art results across 18 diverse cardio-respiratory tasks from 10 different datasets, improving the mean AUROC on classification benchmarks from 0.68 to 0.79 and, on the most challenging COVID-19 cough detection task, boosting the AUROC from 0.55 to 0.89. Our work demonstrates that this audio-language alignment transforms purely acoustic models into clinically aware diagnostic tools, establishing a novel paradigm for enhancing physiological understanding in audio-based health monitoring.
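To make the representation-level contrastive objective concrete, the sketch below shows a CLIP-style symmetric InfoNCE loss over paired audio and report embeddings. This is a minimal illustration under our own assumptions, not the paper's actual implementation: the function name, temperature value, and NumPy formulation are hypothetical, and the paper's full objective additionally includes a self-supervised modeling term that is not shown here.

```python
import numpy as np

def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss for paired audio/text embeddings.

    audio_emb, text_emb: (N, D) arrays; row i of each matrix is a matched
    audio-report pair. Hypothetical sketch, not the official AcuLa code.
    """
    # L2-normalize rows so dot products are cosine similarities.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = a @ t.T / temperature   # (N, N); matched pairs on the diagonal
    labels = np.arange(len(a))

    def cross_entropy(lg, lb):
        # Numerically stable row-wise log-softmax cross-entropy.
        z = lg - lg.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (cross_entropy(logits, labels) +
                  cross_entropy(logits.T, labels))
```

Minimizing this loss pulls each audio clip's embedding toward its own clinical report and pushes it away from the other reports in the batch, which is how the language model can act as a "semantic teacher" for the audio encoder.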