🤖 AI Summary
To address coarse-grained audio-text fusion and insufficient long-range dependency modeling in multimodal acoustic scene classification (ASC), this paper proposes ASCMamba. Methodologically: (1) a DenseEncoder extracts hierarchical spectral features from spectrograms; (2) dual-path Mamba blocks, built on state space models, capture long-range temporal and frequency dependencies; (3) textual cues (the recording location and the time of recording) are integrated with the audio for joint multimodal modeling; (4) a two-step pseudo-labeling mechanism generates more reliable pseudo-labels. In the APSIPA ASC 2025 Grand Challenge, ASCMamba outperforms all participating teams and achieves a 6.2% improvement over the baseline.
📝 Abstract
Acoustic Scene Classification (ASC) is a fundamental problem in computational audition that seeks to classify environments based on their distinctive acoustic features. The APSIPA ASC 2025 Grand Challenge introduces a multimodal ASC task. Unlike traditional ASC systems, which rely solely on audio inputs, this challenge provides additional textual information as input, including the location where the audio was recorded and the time of recording. In this paper, we present our system for this task. Specifically, we propose a multimodal network, ASCMamba, which integrates audio and textual information for fine-grained acoustic scene understanding and effective multimodal ASC. ASCMamba employs a DenseEncoder to extract hierarchical spectral features from spectrograms, followed by dual-path Mamba blocks that capture long-range temporal and frequency dependencies using Mamba-based state space models. In addition, we present a two-step pseudo-labeling mechanism to generate more reliable pseudo-labels. Results show that the proposed system outperforms all participating teams and achieves a 6.2% improvement over the baseline. Code, models, and pre-trained checkpoints are available at https://github.com/S-Orion/ASCMamba.git.
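The dual-path idea, modeling dependencies along the time axis and then along the frequency axis of a spectrogram, can be illustrated with a small sketch. This is not the authors' implementation: a toy linear state-space recurrence (h_t = a·h_{t-1} + b·x_t) stands in for a Mamba block, and the function names (`ssm_scan`, `dual_path_block`), the parameters `a` and `b`, the residual connections, and the time-then-frequency ordering are all assumptions made for illustration.

```python
# Illustrative sketch only. A fixed linear recurrence replaces the selective
# (input-dependent) state space model used in real Mamba blocks.

def ssm_scan(seq, a=0.9, b=0.1):
    """Run h_t = a * h_{t-1} + b * x_t over a 1-D sequence; return every h_t."""
    h, out = 0.0, []
    for x in seq:
        h = a * h + b * x
        out.append(h)
    return out


def transpose(grid):
    """Swap the two axes of a rectangular list-of-lists."""
    return [list(col) for col in zip(*grid)]


def dual_path_block(spec, a=0.9, b=0.1):
    """Apply the scan along time, then along frequency, with residuals.

    spec is a list of F frequency bins, each a list of T time frames.
    The output has the same F x T shape.
    """
    # Time path: each frequency bin is scanned across its time frames.
    time_out = [
        [x + y for x, y in zip(row, ssm_scan(row, a, b))] for row in spec
    ]
    # Frequency path: each time frame is scanned across frequency bins.
    freq_out = [
        [x + y for x, y in zip(col, ssm_scan(col, a, b))]
        for col in transpose(time_out)
    ]
    return transpose(freq_out)


if __name__ == "__main__":
    # A single impulse at (frequency 0, time 0) in a 2 x 2 "spectrogram".
    impulse = [[1.0, 0.0], [0.0, 0.0]]
    for row in dual_path_block(impulse, a=0.5, b=1.0):
        print(row)
```

With an impulse in one corner, every cell of the output becomes non-zero, which shows how alternating the two scan directions lets information from one time-frequency position reach positions that differ in both time and frequency.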