🤖 AI Summary
Existing public acoustic datasets suffer from limited accessibility, insufficient scene coverage, and a lack of hearing-relevant semantic labels, hindering fair evaluation and edge deployment of hearing-aid scene recognition models. To address this, we introduce AHEAD-DS, a standardised, open-source dataset designed specifically for hearing-assistive applications, and propose YAMNet+, a lightweight, efficient model built on YAMNet that integrates transfer learning, knowledge distillation, and structural pruning. Evaluated on the AHEAD-DS test set, YAMNet+ achieves 0.83 mean average precision (mAP) and 0.93 classification accuracy, with inference latency of roughly 30ms per second of audio (plus about 50ms to load the model), enabling real-time operation on Android devices. This work bridges critical gaps in both data and model resources for hearing-aware scene recognition, establishing a reproducible benchmark and a practical, deployable solution for resource-constrained hearing aids.
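The summary names transfer learning from YAMNet as a key ingredient. The snippet below is a minimal sketch of that idea, assuming the publicly available TensorFlow Hub YAMNet model: a small classifier head trained on YAMNet's clip-level embeddings. The head sizes, the multi-label sigmoid output, and the helper names are illustrative assumptions, not the authors' exact YAMNet+ architecture (which additionally uses knowledge distillation and structural pruning).

```python
import tensorflow as tf
import tensorflow_hub as hub

NUM_CLASSES = 14  # audiologically relevant scene categories in AHEAD-DS

# Pretrained YAMNet backbone from TensorFlow Hub (frozen feature extractor).
yamnet = hub.load('https://tfhub.dev/google/yamnet/1')

def extract_embedding(waveform):
    """Map a mono 16 kHz float32 waveform to a clip-level 1024-d YAMNet embedding."""
    _, embeddings, _ = yamnet(waveform)          # per-frame embeddings, shape (frames, 1024)
    return tf.reduce_mean(embeddings, axis=0)    # average over frames

# Lightweight classification head trained on AHEAD-DS embeddings (transfer learning).
# Layer sizes and the multi-label sigmoid output are assumptions for illustration.
classifier = tf.keras.Sequential([
    tf.keras.Input(shape=(1024,)),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(NUM_CLASSES, activation='sigmoid'),
])
classifier.compile(optimizer='adam',
                   loss='binary_crossentropy',
                   metrics=[tf.keras.metrics.AUC(curve='PR')])
```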
📝 Abstract
Scene recognition of audiologically relevant environments is important for hearing aids; however, it is challenging, in part because of the limitations of existing datasets. Datasets often lack public accessibility, completeness, or audiologically relevant labels, hindering systematic comparison of machine learning models. Deploying these models on resource-constrained edge devices presents another challenge. Our solution is twofold: we leverage several open-source datasets to create AHEAD-DS, a dataset designed for scene recognition of audiologically relevant environments, and introduce YAMNet+, a sound recognition model. AHEAD-DS aims to provide a standardised, publicly available dataset with consistent labels relevant to hearing aids, facilitating model comparison. YAMNet+ is designed for deployment on edge devices such as smartphones connected to hearing devices, including hearing aids and wireless earphones with hearing aid functionality, and serves as a baseline model for sound-based scene recognition. YAMNet+ achieved a mean average precision of 0.83 and an accuracy of 0.93 on the AHEAD-DS test set across fourteen categories of audiologically relevant environments. We found that applying transfer learning from the pretrained YAMNet model was essential. We demonstrated real-time sound-based scene recognition on edge devices by deploying YAMNet+ to an Android smartphone. Even on a Google Pixel 3 (a phone with modest specifications, released in 2018), loading the model takes approximately 50ms, and processing time grows approximately linearly at 30ms per second of audio. Our website and code are available at https://github.com/Australian-Future-Hearing-Initiative.
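To make the deployment path concrete, the sketch below shows one common way a trained Keras classifier can be exported to TensorFlow Lite for on-device Android inference, followed by the latency arithmetic reported above. This is a hedged illustration under assumptions (placeholder model, file name, and optimisation flag), not the authors' release pipeline.

```python
import tensorflow as tf

# Stand-in classifier head; the real YAMNet+ model is described in the paper.
classifier = tf.keras.Sequential([
    tf.keras.Input(shape=(1024,)),                     # YAMNet embedding size
    tf.keras.layers.Dense(14, activation='sigmoid'),   # 14 AHEAD-DS scene categories
])

# Convert to TensorFlow Lite with default post-training optimisations for edge devices.
converter = tf.lite.TFLiteConverter.from_keras_model(classifier)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open('yamnet_plus.tflite', 'wb') as f:          # hypothetical output file name
    f.write(tflite_model)

# Reported on-device latency scales roughly as:
#   total_ms ≈ 50 (model load) + 30 × seconds_of_audio
# e.g. a 5-second clip takes about 200ms, comfortably real-time on a Pixel 3.
```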