🤖 AI Summary
This study addresses the challenge of real-time detection of hearing-difficulty moments in everyday conversation. We propose an end-to-end detection method based on audio-language models (ALMs), diverging from conventional approaches such as ASR-based keyword spotting or fine-tuned Wav2Vec. Our method leverages the joint acoustic-semantic representations of multimodal pretrained models to achieve continuous, fine-grained localization of hearing-difficulty segments within conversational speech. Experiments on realistic dialogue data show significant improvements: the proposed approach achieves an average 12.3% higher F1-score and reduces detection latency by over 40% compared to baseline methods. The core contribution is the first application of ALMs to dynamic hearing-difficulty recognition, enabling low-latency, robust intervention triggering for intelligent hearing-assistance devices.
📝 Abstract
Individuals regularly experience Hearing Difficulty Moments in everyday conversation. Identifying these moments is particularly significant in hearing assistive technology, where timely interventions are key to real-time hearing assistance. In this paper, we propose and compare machine learning solutions for continuously detecting utterances that signal these specific moments in conversational audio. We show that audio language models, through their multimodal reasoning capabilities, excel at this task, significantly outperforming both a simple automatic speech recognition (ASR) hotword heuristic and a more conventional fine-tuning approach with Wav2Vec, an audio-only architecture that is state-of-the-art for ASR.
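To make the ASR hotword baseline concrete, here is a minimal sketch of what such a heuristic might look like: scan a time-stamped ASR transcript for repair-initiation phrases and flag their timestamps as candidate hearing-difficulty moments. The phrase list (`HOTWORDS`) and the `detect_difficulty_moments` helper are hypothetical illustrations, not the paper's actual implementation.

```python
# Hypothetical hotword list of repair-initiation phrases (assumption,
# not taken from the paper).
HOTWORDS = {"huh", "what", "pardon", "sorry"}

def detect_difficulty_moments(words):
    """words: list of (token, start_sec) pairs from an ASR transcript.

    Returns the start times of tokens that match the hotword list,
    i.e. candidate hearing-difficulty moments.
    """
    return [t for token, t in words
            if token.lower().strip("?!.,") in HOTWORDS]

# Example: a toy transcript with word-level timestamps.
transcript = [("I", 0.2), ("went", 0.4), ("Huh?", 1.1), ("pardon", 2.3)]
print(detect_difficulty_moments(transcript))  # → [1.1, 2.3]
```

A keyword heuristic like this is cheap but brittle: it fires on any "what" regardless of intent and misses non-lexical cues (prosody, leaning in), which is the gap the ALM-based approach is claimed to close.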