Robust detection of overlapping bioacoustic sound events

📅 2025-03-04

📈 Citations: 0

✨ Influential: 0

career value

195K/year

🤖 AI Summary

Addressing the challenge of overlapping sound event detection in bioacoustics, this paper proposes Voxaboxen: a novel framework that models acoustic events as temporal bounding boxes and introduces the first onset/offset bidirectional detection paradigm. It integrates self-supervised audio encoders (AST/SSAST), temporal bounding box regression, bidirectional temporal prediction, and graph matching. We construct the first benchmark dataset—specifically for overlapping scenarios—using zebra finch vocalizations, accompanied by a dedicated evaluation protocol. Voxaboxen achieves state-of-the-art performance across seven public datasets and our new benchmark, significantly outperforming frame-wise multi-label classification and mainstream sound event detection (SED) methods. Ablation studies demonstrate consistent robustness gains under strong overlap conditions, overcoming fundamental limitations of conventional approaches.

Technology Category

Application Category

📝 Abstract

We propose a method for accurately detecting bioacoustic sound events that is robust to overlapping events, a common issue in domains such as ethology, ecology and conservation. While standard methods employ a frame-based, multi-label approach, we introduce an onset-based detection method which we name Voxaboxen. It takes inspiration from object detection methods in computer vision, but simultaneously takes advantage of recent advances in self-supervised audio encoders. For each time window, Voxaboxen predicts whether it contains the start of a vocalization and how long the vocalization is. It also does the same in reverse, predicting whether each window contains the end of a vocalization, and how long ago it started. The two resulting sets of bounding boxes are then fused using a graph-matching algorithm. We also release a new dataset designed to measure performance on detecting overlapping vocalizations. This consists of recordings of zebra finches annotated with temporally-strong labels and showing frequent overlaps. We test Voxaboxen on seven existing data sets and on our new data set. We compare Voxaboxen to natural baselines and existing sound event detection methods and demonstrate SotA results. Further experiments show that improvements are robust to frequent vocalization overlap.

Problem

Research questions and friction points this paper is trying to address.

Detects overlapping bioacoustic sound events accurately.

Introduces Voxaboxen for onset-based sound event detection.

Releases new dataset for testing overlapping vocalization detection.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Onset-based detection method named Voxaboxen

Uses self-supervised audio encoders for accuracy

Graph-matching algorithm fuses bounding boxes

🔎 Similar Papers

No similar papers found.