Region-Specific Audio Tagging for Spatial Sound

📅 2025-09-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation of existing audio tagging methods, which cannot localize and tag sound events within specific spatial regions (e.g., designated azimuth angles or radial distances) in spatial audio. We formally introduce “region-specific audio tagging” as a novel task. Methodologically, we propose a multimodal feature representation that jointly encodes spectral, spatial directional, and positional information; extend pre-trained models—PANNs and AST—into spatially aware architectures; and incorporate directional feature enhancement to improve omnidirectional tagging capability. Experiments on both simulated and real microphone array datasets demonstrate substantial improvements in region-specific sound source identification accuracy. Results validate both the well-posedness of the proposed task and the effectiveness of our technical approach. This work establishes a new paradigm for spatial audio understanding and provides a scalable, foundational framework for future research in spatially grounded audio analysis.

📝 Abstract
Audio tagging aims to label sound events appearing in an audio recording. In this paper, we propose region-specific audio tagging, a new task which labels sound events in a given region for spatial audio recorded by a microphone array. The region can be specified as an angular space or a distance from the microphone. We first study the performance of different combinations of spectral, spatial, and position features. Then we extend state-of-the-art audio tagging systems such as pre-trained audio neural networks (PANNs) and audio spectrogram transformer (AST) to the proposed region-specific audio tagging task. Experimental results on both the simulated and the real datasets show the feasibility of the proposed task and the effectiveness of the proposed method. Further experiments show that incorporating the directional features is beneficial for omnidirectional tagging.
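The abstract describes stacking spectral, spatial, and position features together with a specification of the queried region. As a rough illustration only (the paper's actual features and architecture are not detailed here), the sketch below combines a log-magnitude STFT, inter-channel phase differences from a two-microphone pair, and a hypothetical one-hot azimuth-region query; all function names and parameter choices are assumptions for illustration.

```python
import numpy as np

def stft_mag(x, n_fft=512, hop=256):
    """Magnitude STFT of a mono signal (frames x bins)."""
    win = np.hanning(n_fft)
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        frames.append(np.abs(np.fft.rfft(x[start:start + n_fft] * win)))
    return np.array(frames)

def ipd(x1, x2, n_fft=512, hop=256):
    """Inter-channel phase difference between two mics (frames x bins)."""
    win = np.hanning(n_fft)
    frames = []
    for start in range(0, min(len(x1), len(x2)) - n_fft + 1, hop):
        s1 = np.fft.rfft(x1[start:start + n_fft] * win)
        s2 = np.fft.rfft(x2[start:start + n_fft] * win)
        frames.append(np.angle(s1 * np.conj(s2)))
    return np.array(frames)

def region_embedding(azimuth_deg, n_bins=36):
    """One-hot encoding of the queried azimuth region (10-degree bins)."""
    emb = np.zeros(n_bins)
    emb[int(azimuth_deg % 360) // (360 // n_bins)] = 1.0
    return emb

# Toy two-channel signal: a small phase offset stands in for a direction cue.
fs = 16000
t = np.arange(fs) / fs
x1 = np.sin(2 * np.pi * 440 * t)
x2 = np.sin(2 * np.pi * 440 * t + 0.1)

spec = np.log1p(stft_mag(x1))                           # spectral feature
phase = ipd(x1, x2)                                      # spatial (directional) feature
region = np.tile(region_embedding(90.0), (spec.shape[0], 1))  # region query
features = np.concatenate([spec, phase, region], axis=1)
```

In a model like the extended PANNs or AST, a per-frame stack of this kind would replace the single-channel spectrogram input, letting the network condition its tags on the queried region.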
Problem

Research questions and friction points this paper is trying to address.

Labeling sound events in specific spatial regions
Combining spectral, spatial, and position features
Extending audio tagging systems for directional audio
Innovation

Methods, ideas, or system contributions that make the work stand out.

Region-specific audio tagging for spatial sound
Extending PANNs and AST systems
Incorporating directional features for omnidirectional tagging
Jinzheng Zhao
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK
Yong Xu
Tencent AI Lab, Bellevue, WA, USA
Haohe Liu
Research Scientist at Meta AI
Audio Generation, Audio Classification, Speech Quality Enhancement, Music Source Separation
Davide Berghi
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK
Xinyuan Qian
Associate Professor, University of Science and Technology Beijing, China
speech processing, multimedia, human robot interaction
Qiuqiang Kong
The Chinese University of Hong Kong
Audio Processing, Artificial Intelligence
Junqi Zhao
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK
Mark D. Plumbley
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK
Wenwu Wang
Professor, University of Surrey, UK
signal processing, machine learning, machine listening, audio/speech/audio-visual, multimodal fusion