Description and Discussion on DCASE 2025 Challenge Task 4: Spatial Semantic Segmentation of Sound Scenes

📅 2025-06-12
🤖 AI Summary
This work introduces Spatial Semantic Segmentation of Sound Scenes (S5), a task aimed at immersive communication: detecting and separating sound events from multichannel input signals that mix multiple sound events with spatial information. The long-term goal is to output dry sound object signals together with metadata giving each object's sound event class and spatial information such as direction, covering 6 degrees of freedom (6DoF). Because existing challenge tasks already cover some of these functions, the DCASE 2025 Challenge Task 4 setting focuses on detection and separation. The paper describes the task setting, presents the newly recorded and curated DCASE2025 Task 4 Dataset, and reports experimental results for an S5 system trained and evaluated on that dataset.

📝 Abstract
Spatial Semantic Segmentation of Sound Scenes (S5) aims to enhance technologies for sound event detection and separation from multi-channel input signals that mix multiple sound events with spatial information. This is a fundamental basis of immersive communication. The ultimate goal is to separate sound event signals with 6 Degrees of Freedom (6DoF) information into dry sound object signals and metadata describing the object type (sound event class) and spatial information, including direction. However, because several existing challenge tasks already provide some of these functions, this year's task focuses on detecting and separating sound events from multi-channel spatial input signals. This paper outlines the S5 task setting of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 Challenge Task 4 and the DCASE2025 Task 4 Dataset, newly recorded and curated for this task. We also report experimental results for an S5 system trained and evaluated on this dataset. The full version of this paper will be published after the challenge results are made public.

Problem

Research questions and friction points this paper is trying to address.

Enhance sound event detection and separation from multi-channel signals
Separate sound events with 6DoF spatial information into dry signals
Detect and classify sound events in immersive communication scenarios

Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial semantic segmentation of sound scenes
Multi-channel sound event detection and separation
6DoF sound object signal extraction
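
The input and output described in the abstract (a multi-channel mixture in, dry sound object signals with class metadata out) can be pictured as a small data structure. The following is only an illustrative sketch: the names `MultichannelMixture`, `SoundObject`, and `validate_output`, the optional `direction_deg` field, and the example labels are all hypothetical and are not taken from the paper, the dataset, or any challenge toolkit.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MultichannelMixture:
    """Hypothetical container for the multi-channel input signal."""
    channels: List[List[float]]  # one list of samples per microphone channel
    sample_rate: int

    def num_samples(self) -> int:
        # All channels are assumed to have the same length.
        return len(self.channels[0]) if self.channels else 0


@dataclass
class SoundObject:
    """Hypothetical container for one separated sound event."""
    event_class: str       # sound event class label (object type)
    waveform: List[float]  # single-channel dry signal
    # Spatial metadata such as direction belongs to the ultimate goal;
    # it is left optional here because this year's task focuses on
    # detection and separation.
    direction_deg: Optional[Tuple[float, float]] = None


def validate_output(mixture: MultichannelMixture,
                    objects: List[SoundObject]) -> bool:
    """Return True if every object has a label and matches the mixture length."""
    n = mixture.num_samples()
    return all(obj.event_class and len(obj.waveform) == n for obj in objects)


# Toy example: 4 channels of 8 silent samples (sizes are arbitrary).
mixture = MultichannelMixture(channels=[[0.0] * 8 for _ in range(4)],
                              sample_rate=16000)
objects = [
    SoundObject(event_class="speech", waveform=[0.0] * 8),
    SoundObject(event_class="doorbell", waveform=[0.0] * 8,
                direction_deg=(30.0, 0.0)),
]
is_valid = validate_output(mixture, objects)
```

Keeping the spatial field optional mirrors the distinction the abstract draws between the ultimate goal (object signals plus spatial metadata) and the narrower focus of this year's task.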