AuditoryHuM: Auditory Scene Label Generation and Clustering using Human-MLLM Collaboration

📅 2026-02-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high cost and inherent trade-off in manual annotation of auditory scenes, where achieving both fine-grained semantic labels and acoustic separability is challenging. The authors propose an unsupervised framework that leverages collaboration between humans and multimodal large language models (MLLMs) to automatically generate semantically meaningful scene labels. These labels guide a clustering process enhanced by a penalized adjusted silhouette coefficient, enabling controllable granularity and semantic coherence. Label–audio consistency is evaluated via Human-CLAP zero-shot alignment, with targeted human-in-the-loop refinement to improve outcomes. Experiments on ADVANCE, AHEAD-DS, and TAU 2019 datasets demonstrate that the approach efficiently constructs a standardized auditory scene taxonomy suitable for deployment on resource-constrained edge devices such as hearing aids.
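The summary's "penalized adjusted silhouette coefficient" can be sketched as a cluster-cohesion score minus a penalty on the number of clusters, which lets granularity be tuned. The exact penalty form is not given in this page, so the linear term and the weight `lam` below are assumptions for illustration:

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient, computed directly from pairwise distances."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = [q for q in clusters[l] if q is not p]
        if not own:  # singleton cluster: silhouette conventionally 0
            scores.append(0.0)
            continue
        # a: mean distance to own cluster; b: mean distance to nearest other cluster
        a = sum(math.dist(p, q) for q in own) / len(own)
        b = min(sum(math.dist(p, q) for q in clusters[m]) / len(clusters[m])
                for m in clusters if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def penalized_silhouette(points, labels, lam=0.05):
    """Adjusted silhouette: cohesion minus a penalty on the cluster count,
    trading acoustic separability against label granularity.
    `lam` and the linear penalty are hypothetical; the paper's exact
    adjustment is not specified on this page."""
    k = len(set(labels))
    return silhouette(points, labels) - lam * k
```

Under this sketch, one would evaluate candidate clusterings at several values of k and keep the one maximizing the penalized score, with `lam` controlling how strongly fine-grained taxonomies are discouraged.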

📝 Abstract
Manual annotation of audio datasets is labour-intensive, and it is challenging to balance label granularity with acoustic separability. We introduce AuditoryHuM, a novel framework for the unsupervised discovery and clustering of auditory scene labels using a collaborative Human-Multimodal Large Language Model (MLLM) approach. By leveraging MLLMs (Gemma and Qwen), the framework generates contextually relevant labels for audio data. To ensure label quality and mitigate hallucinations, we employ zero-shot learning techniques (Human-CLAP) to quantify the alignment between generated text labels and raw audio content. A strategically targeted human-in-the-loop intervention is then used to refine the least aligned pairs. The discovered labels are grouped into thematically cohesive clusters using an adjusted silhouette score that incorporates a penalty parameter to balance cluster cohesion and thematic granularity. Evaluated across three diverse auditory scene datasets (ADVANCE, AHEAD-DS, and TAU 2019), AuditoryHuM provides a scalable, low-cost solution for creating standardised taxonomies. This solution facilitates the training of lightweight scene recognition models deployable to edge devices, such as hearing aids and smart home assistants. Project page and code: https://github.com/Australian-Future-Hearing-Initiative
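The abstract's Human-CLAP filtering step amounts to scoring each label-audio pair by embedding similarity and routing only the worst-aligned fraction to human reviewers. The sketch below assumes precomputed embedding vectors and a hypothetical review fraction `frac`; the actual Human-CLAP model and thresholds are not described on this page:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def flag_for_review(audio_embs, text_embs, frac=0.1):
    """Rank label-audio pairs by CLAP-style cosine alignment and return the
    indices of the least-aligned fraction for targeted human refinement.
    `frac` and the embeddings are illustrative stand-ins for the paper's
    Human-CLAP zero-shot alignment scores."""
    sims = [cosine(a, t) for a, t in zip(audio_embs, text_embs)]
    n_flag = max(1, int(len(sims) * frac))
    order = sorted(range(len(sims)), key=lambda i: sims[i])
    return order[:n_flag], sims
```

Concentrating human effort on the lowest-similarity pairs is what makes the intervention "strategically targeted": well-aligned MLLM labels pass through untouched, keeping annotation cost low.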
Problem

Research questions and friction points this paper is trying to address.

audio annotation
label granularity
acoustic separability
auditory scene labeling
taxonomy creation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-MLLM collaboration
unsupervised auditory label generation
zero-shot audio-text alignment
human-in-the-loop refinement
adjusted silhouette clustering
Henry Zhong
Australian Hearing Hub, Macquarie University, Sydney, Australia
Jörg M. Buchholz
Australian Hearing Hub, Macquarie University, Sydney, Australia
Julian Maclaren
Google Research Australia, Sydney, Australia
Simon Carlile
University of Sydney
Auditory neuroscience
Richard F. Lyon
Research Scientist, Google Inc.
Machine Hearing · Signal Processing · Image Sensors · Photography