AI Summary
To address the scarcity of high-quality temporal annotations in sound event detection (SED), this paper proposes SynSonic, the first data augmentation framework to integrate text-to-audio diffusion models with an energy-envelope-driven ControlNet. By conditioning audio generation on energy envelopes, SynSonic produces temporally coherent and semantically controllable sound event samples. A dual-classifier joint scoring mechanism suppresses generation artifacts and ensures sample fidelity. SynSonic further incorporates Mix-up and SpecAugment into an end-to-end augmentation pipeline. Evaluated on standard benchmarks, SynSonic significantly improves Polyphonic Sound Detection Scores (PSDS1 and PSDS2), enhancing both temporal localization accuracy and multi-class discrimination. The framework establishes a scalable, high-fidelity synthetic-data paradigm for low-resource SED.
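The paper does not include an implementation here, but the dual-classifier joint scoring step can be pictured roughly as follows. This is a minimal sketch: the classifier handles (`clf_a`, `clf_b`), the geometric-mean fusion, and the 0.5 threshold are illustrative assumptions, not the paper's published recipe.

```python
import torch

def joint_score_filter(samples, clf_a, clf_b, target_class, threshold=0.5):
    """Keep generated clips that both classifiers confidently assign to
    the target event class.

    `clf_a` and `clf_b` are assumed to map a batch of waveforms to
    per-class probabilities of shape (batch, num_classes); the
    geometric-mean fusion and the 0.5 threshold are illustrative
    choices, not the paper's published recipe.
    """
    with torch.no_grad():
        p_a = clf_a(samples)[:, target_class]  # confidence from classifier A
        p_b = clf_b(samples)[:, target_class]  # confidence from classifier B
    joint = (p_a * p_b).sqrt()  # geometric mean penalizes disagreement
    keep = joint >= threshold   # discard low-fidelity or artifact-heavy clips
    return samples[keep], joint[keep]
```

A geometric mean drops toward zero whenever either classifier is unconfident, so a clip survives only if both models agree it contains the intended event, which is what makes joint scoring stricter than filtering with either classifier alone.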
Abstract
Data synthesis and augmentation are essential for Sound Event Detection (SED) due to the scarcity of temporally labeled data. While augmentation methods like SpecAugment and Mix-up can enhance model performance, they remain constrained by the diversity of existing samples. Recent generative models offer new opportunities, yet their direct application to SED is challenging due to the lack of precise temporal annotations and the risk of introducing noise through unreliable filtering. To address these challenges and enable generative augmentation for SED, we propose SynSonic, a data augmentation method tailored to this task. SynSonic leverages text-to-audio diffusion models guided by an energy-envelope ControlNet to generate temporally coherent sound events. A joint-score filtering strategy with dual classifiers ensures sample quality, and we explore its practical integration into training pipelines. Experimental results show that SynSonic improves Polyphonic Sound Detection Scores (PSDS1 and PSDS2), enhancing both temporal localization and sound class discrimination.
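To make the conditioning signal concrete: an energy envelope can be read off a reference clip as a sequence of frame-level RMS values, which the ControlNet then uses to dictate where in time the generated event carries energy. The sketch below is a minimal Python illustration; the function name, the frame/hop sizes, and the peak normalization are assumptions, not the paper's published preprocessing.

```python
import numpy as np

def energy_envelope(waveform: np.ndarray, frame_len: int = 1024, hop: int = 512) -> np.ndarray:
    """Frame-level RMS energy envelope of a mono float waveform.

    One plausible form of the conditioning signal for an
    energy-envelope ControlNet; the frame/hop sizes and the peak
    normalization are assumptions of this sketch, not the paper's
    exact parameters.
    """
    n_frames = 1 + max(0, len(waveform) - frame_len) // hop
    env = np.empty(n_frames)
    for i in range(n_frames):
        frame = waveform[i * hop : i * hop + frame_len]
        env[i] = np.sqrt(np.mean(frame ** 2))  # RMS energy of this frame
    return env / (env.max() + 1e-8)  # normalize to [0, 1] for conditioning
```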