AI Summary
To address the scarcity of high-quality temporal annotations in sound event detection (SED), this paper proposes SynSonic, the first data augmentation framework to integrate text-to-audio diffusion models with an energy-envelope-driven ControlNet. By conditioning audio generation on energy envelopes, SynSonic produces temporally coherent and semantically controllable sound event samples. A dual-classifier joint scoring mechanism suppresses generation artifacts and ensures sample fidelity. SynSonic further incorporates Mix-up and SpecAugment into an end-to-end augmentation pipeline. Evaluated on standard benchmarks, SynSonic significantly improves Polyphonic Sound Detection Scores (PSDS1 and PSDS2), enhancing both temporal localization accuracy and multi-class discrimination. The framework establishes a scalable, high-fidelity synthetic-data paradigm for low-resource SED.
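The paper does not include an implementation here, but the dual-classifier joint scoring step can be pictured roughly as follows. This is a minimal sketch: the classifier handles (`clf_a`, `clf_b`), the geometric-mean fusion, and the 0.5 threshold are illustrative assumptions, not the paper's published recipe.

```python
import torch

def joint_score_filter(samples, clf_a, clf_b, target_class, threshold=0.5):
    """Keep generated clips that both classifiers confidently assign to
    the target event class.

    `clf_a` and `clf_b` are assumed to map a batch of waveforms to
    per-class probabilities of shape (batch, num_classes); the
    geometric-mean fusion and the 0.5 threshold are illustrative
    choices, not the paper's published recipe.
    """
    with torch.no_grad():
        p_a = clf_a(samples)[:, target_class]  # confidence from classifier A
        p_b = clf_b(samples)[:, target_class]  # confidence from classifier B
    joint = (p_a * p_b).sqrt()  # geometric mean penalizes disagreement
    keep = joint >= threshold   # discard low-fidelity or artifact-heavy clips
    return samples[keep], joint[keep]
```

A geometric mean drops toward zero whenever either classifier is unconfident, so a clip survives only if both models agree it contains the intended event, which is what makes joint scoring stricter than filtering with either classifier alone.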
Abstract
Data synthesis and augmentation are essential for Sound Event Detection (SED) due to the scarcity of temporally labeled data. While augmentation methods like SpecAugment and Mix-up can enhance model performance, they remain constrained by the diversity of existing samples. Recent generative models offer new opportunities, yet their direct application to SED is challenging due to the lack of precise temporal annotations and the risk of introducing noise through unreliable filtering. To address these challenges and enable generative augmentation for SED, we propose SynSonic, a data augmentation method tailored to this task. SynSonic leverages text-to-audio diffusion models guided by an energy-envelope ControlNet to generate temporally coherent sound events. A joint-score filtering strategy with dual classifiers ensures sample quality, and we explore its practical integration into training pipelines. Experimental results show that SynSonic improves Polyphonic Sound Detection Scores (PSDS1 and PSDS2), enhancing both temporal localization and sound class discrimination.
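To make the conditioning signal concrete: an energy envelope can be read off a reference clip as a sequence of frame-level RMS values, which the ControlNet then uses to dictate where in time the generated event carries energy. The sketch below is a minimal Python illustration; the function name, the frame/hop sizes, and the peak normalization are assumptions, not the paper's published preprocessing.

```python
import numpy as np

def energy_envelope(waveform: np.ndarray, frame_len: int = 1024, hop: int = 512) -> np.ndarray:
    """Frame-level RMS energy envelope of a mono float waveform.

    One plausible form of the conditioning signal for an
    energy-envelope ControlNet; the frame/hop sizes and the peak
    normalization are assumptions of this sketch, not the paper's
    exact parameters.
    """
    n_frames = 1 + max(0, len(waveform) - frame_len) // hop
    env = np.empty(n_frames)
    for i in range(n_frames):
        frame = waveform[i * hop : i * hop + frame_len]
        env[i] = np.sqrt(np.mean(frame ** 2))  # RMS energy of this frame
    return env / (env.max() + 1e-8)  # normalize to [0, 1] for conditioning
```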