🤖 AI Summary
This work addresses a limitation of existing audio tagging methods: they label sound events anywhere in a recording but cannot restrict tagging to a specified spatial region (e.g., a given azimuth range or distance from the microphone) in spatial audio. The authors formally introduce “region-specific audio tagging” as a new task. Methodologically, they study feature representations that jointly encode spectral, spatial (directional), and positional information; extend the pre-trained models PANNs and AST into spatially aware architectures; and show that incorporating directional features also improves omnidirectional tagging. Experiments on both simulated and real microphone-array datasets demonstrate accurate region-specific sound event identification, validating both the feasibility of the proposed task and the effectiveness of the method. The work thus opens a new direction for spatial audio understanding and provides a foundation for future research in spatially grounded audio analysis.
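To make the feature design concrete, here is a minimal sketch in PyTorch of how spectral, spatial, and region-description features could be stacked into a single network input. Everything here is an illustrative assumption rather than the paper's exact formulation: the function name `region_features`, the choice of sin/cos inter-channel phase differences (IPDs) as the spatial feature, the mel-scale pooling of those IPDs, and the 10 m distance normalisation are all ours.

```python
# Hypothetical sketch of region-conditioned input features
# (illustrative assumptions, not the paper's actual code).
import math
import torch
import torchaudio

def region_features(wav, sr, region_az, region_dist,
                    n_fft=1024, hop=320, n_mels=64):
    """wav: (channels, samples) recording from a microphone array.
    region_az: (lo, hi) azimuth bounds in radians of the query region.
    region_dist: assumed maximum source distance of the region, in metres.
    Returns a (feature_channels, n_mels, frames) tensor."""
    C = wav.shape[0]

    # Spectral features: per-channel log-mel spectrogram.
    melspec = torchaudio.transforms.MelSpectrogram(
        sample_rate=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
    logmel = torch.log(melspec(wav) + 1e-6)                  # (C, n_mels, T)

    # Spatial features: inter-channel phase differences w.r.t. mic 0,
    # as sin/cos maps to avoid phase wrapping, pooled onto the mel scale.
    spec = torch.stft(wav, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft),
                      return_complex=True)                   # (C, F, T)
    fbank = torchaudio.functional.melscale_fbanks(
        n_freqs=n_fft // 2 + 1, f_min=0.0, f_max=sr / 2,
        n_mels=n_mels, sample_rate=sr)                       # (F, n_mels)
    ipd = torch.angle(spec[1:] * spec[:1].conj())            # (C-1, F, T)
    ipd_feats = torch.cat([torch.cos(ipd), torch.sin(ipd)])  # (2(C-1), F, T)
    ipd_mel = torch.einsum("cft,fm->cmt", ipd_feats, fbank)

    # Position (region) encoding: azimuth bounds as sin/cos plus a crudely
    # normalised distance, broadcast as constant maps so every
    # time-frequency bin is conditioned on the query region.
    lo, hi = region_az
    desc = torch.tensor([math.sin(lo), math.cos(lo),
                         math.sin(hi), math.cos(hi),
                         region_dist / 10.0])  # assumed 10 m normaliser
    region_maps = desc[:, None, None].expand(-1, n_mels, logmel.shape[-1])

    return torch.cat([logmel, ipd_mel, region_maps], dim=0)
```

Broadcasting the region description as constant feature maps is only one conditioning choice; a FiLM-style modulation or a learned embedding added inside the backbone would serve the same purpose.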
📝 Abstract
Audio tagging aims to label sound events appearing in an audio recording. In this paper, we propose region-specific audio tagging, a new task that labels sound events within a given region for spatial audio recorded by a microphone array. The region can be specified as an angular space or as a distance from the microphone. We first study the performance of different combinations of spectral, spatial, and position features. We then extend state-of-the-art audio tagging systems, such as pre-trained audio neural networks (PANNs) and the audio spectrogram transformer (AST), to the proposed region-specific audio tagging task. Experimental results on both simulated and real datasets show the feasibility of the proposed task and the effectiveness of the proposed method. Further experiments show that incorporating the directional features is beneficial for omnidirectional tagging.
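The abstract mentions extending pre-trained tagging systems such as PANNs and AST to the new task. One common way to do this, offered here as an assumption rather than a detail from the abstract, is to inflate the backbone's first convolution so it accepts the extra spatial and region feature channels while reusing the pretrained weights:

```python
# Channel-inflation trick for adapting a pretrained single-channel tagging
# backbone to multi-channel spatial input (an assumed approach, not
# necessarily the paper's method).
import torch
import torch.nn as nn

def inflate_first_conv(conv: nn.Conv2d, new_in_channels: int) -> nn.Conv2d:
    """Replace a Conv2d trained on `conv.in_channels` inputs (typically 1,
    for log-mel) with one accepting `new_in_channels`, reusing its weights."""
    new_conv = nn.Conv2d(new_in_channels, conv.out_channels,
                         kernel_size=conv.kernel_size, stride=conv.stride,
                         padding=conv.padding, bias=conv.bias is not None)
    with torch.no_grad():
        # Tile the pretrained kernel across the new input channels and
        # rescale so initial activations keep roughly the same magnitude.
        w = conv.weight                                  # (out, old_in, kh, kw)
        reps = -(-new_in_channels // conv.in_channels)   # ceil division
        tiled = w.repeat(1, reps, 1, 1)[:, :new_in_channels]
        new_conv.weight.copy_(tiled * conv.in_channels / new_in_channels)
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv
```

After swapping the stem, the rest of the pretrained backbone (e.g., a CNN14-style PANNs model) would be fine-tuned on region-specific labels; an analogous adaptation applies to the patch-embedding projection of a transformer such as AST.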