🤖 AI Summary
This study addresses the lack of systematic evaluation for generative audio models in realistic sound scene synthesis by introducing the first standardized benchmark dedicated to sound scene generation. Methodologically, it defines a unified challenge task and a multidimensional evaluation protocol that combines objective metrics, chiefly the Fréchet Audio Distance (FAD), with subjective listening tests, jointly assessing source separation accuracy, spatial consistency, and semantic coherence. The contributions are threefold: (1) it establishes the first principled link between generative audio research and scene-level, application-oriented evaluation; (2) through empirical analysis of the four participating teams, it identifies critical bottlenecks in current models' capacity for complex acoustic modeling, especially spatial semantics and multi-source interaction; and (3) it provides a reproducible benchmark and concrete, actionable directions for future work in generative sound scene modeling.
📝 Abstract
This paper presents Task 7 of the DCASE 2024 Challenge: sound scene synthesis. Recent advances in sound synthesis and generative models have enabled the creation of realistic and diverse audio content. We introduce a standardized evaluation framework for comparing sound scene synthesis systems, incorporating both objective and subjective metrics. The challenge attracted four submissions, which were evaluated using the Fréchet Audio Distance (FAD) and human perceptual ratings. Our analysis yields significant insights into the current capabilities and limitations of sound scene synthesis systems, and highlights areas for future improvement in this rapidly evolving field.
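For context on the objective metric: FAD is the Fréchet (Wasserstein-2) distance between two Gaussians fitted to embeddings of reference and generated audio, typically extracted with a pretrained audio encoder. Below is a minimal sketch of that computation, assuming the embeddings have already been extracted; the function name and inputs are illustrative, not the challenge's actual evaluation code.

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(emb_ref: np.ndarray, emb_gen: np.ndarray) -> float:
    """FAD between two embedding sets of shape (n_samples, dim).

    Implements ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2}),
    i.e. the Fréchet distance between Gaussians fitted to each set.
    """
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    # rowvar=False: rows are samples, columns are embedding dimensions
    sigma_r = np.cov(emb_ref, rowvar=False)
    sigma_g = np.cov(emb_gen, rowvar=False)

    diff = mu_r - mu_g
    # Matrix square root of the covariance product
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error

    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```

A lower FAD indicates that the generated audio's embedding distribution is closer to the reference set's; note that the score depends heavily on which embedding model is used, so scores are only comparable under the same encoder.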