🤖 AI Summary
In bioacoustic sound event detection (SED), the high cost of annotating large-scale audio corpora and the low efficiency of active learning (AL) under sparse-event conditions pose significant challenges. To address these, this paper proposes an efficient AL annotation strategy that operates at the full-recording level. Its core contribution is a “Top-K entropy” uncertainty aggregation mechanism: instead of conventional mean-based aggregation, which dilutes the signal in sparse-event scenarios, the method lets the K most uncertain segments represent the uncertainty of the entire recording. This design markedly improves sample selection for AL in low-event-density audio. Experiments show that the proposed approach matches a fully supervised model while using only 8% of the annotated data. Evaluation on realistic multi-source audio mixtures—including meerkat calls, dog barks, and infant cries over park soundscapes—confirms its effectiveness and generalization across diverse bioacoustic events.
📝 Abstract
The vast amounts of audio data collected in Sound Event Detection (SED) applications require efficient annotation strategies to enable supervised learning. Manual labeling is expensive and time-consuming, making Active Learning (AL) a promising approach for reducing annotation effort. We introduce Top K Entropy, a novel uncertainty aggregation strategy for AL that prioritizes the most uncertain segments within an audio recording instead of averaging uncertainty across all segments. This approach enables the selection of entire recordings for annotation, improving efficiency in sparse data scenarios. We compare Top K Entropy to random sampling and Mean Entropy, and show that fewer labels can yield the same model performance, particularly on datasets with sparse sound events. Evaluations are conducted on audio mixtures that combine park soundscape recordings with meerkat, dog, and baby-crying sound events, representing real-world bioacoustic monitoring scenarios. Using Top K Entropy for active learning, we achieve performance comparable to training on the fully labeled dataset with only 8% of the labels. Top K Entropy outperforms Mean Entropy, suggesting that it is best to let the most uncertain segments represent the uncertainty of an audio file. The findings highlight the potential of AL for scalable annotation in audio and time-series applications, including bioacoustics.
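To make the aggregation step concrete, here is a minimal sketch of Top-K entropy versus mean-entropy recording scoring. It assumes per-segment binary event probabilities from a detector; the function names, the entropy base, and the choice of K are illustrative assumptions, not the paper's released implementation:

```python
import numpy as np

def binary_entropy(p):
    """Per-segment binary entropy (bits) of predicted event probabilities."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def mean_entropy(probs):
    """Baseline: recording score = mean entropy over all segments."""
    return float(binary_entropy(probs).mean())

def top_k_entropy(probs, k=5):
    """Top-K aggregation: recording score = mean entropy of the K most
    uncertain segments, so a few ambiguous segments dominate the score."""
    ent = binary_entropy(probs)
    k = min(k, ent.size)                 # guard against short recordings
    return float(np.sort(ent)[-k:].mean())

# Sparse-event recording: one ambiguous segment among many confident ones.
probs = [0.5] + [0.01] * 99
# Mean entropy dilutes the single uncertain segment; Top-K surfaces it,
# so this recording ranks higher in the AL acquisition queue.
```

In an AL loop, unlabeled recordings would be ranked by this score and the top-ranked ones sent for annotation; with sparse events, mean aggregation washes out the few informative segments, which is exactly the failure mode Top-K avoids.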