ASCMamba: Multimodal Time-Frequency Mamba for Acoustic Scene Classification

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address coarse-grained audio-text fusion and insufficient long-range dependency modeling in multimodal acoustic scene classification (ASC), this paper proposes ASCMamba. Methodologically: (1) a dual-path Mamba module is designed to separately capture long-range temporal and spectral dependencies; (2) a DenseEncoder extracts hierarchical spectrogram features, while a state-space model processes the time-frequency sequence; (3) semantic textual cues, such as recording location and timestamp, are integrated to form an end-to-end multimodal joint modeling framework; (4) a two-step pseudo-labeling mechanism is introduced to improve label reliability under weak supervision. Evaluated on the APSIPA ASC 2025 Grand Challenge, ASCMamba outperforms all participating teams and achieves a 6.2% improvement over the baseline.
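The dual-path idea above, scanning a spectrogram once along time and once along frequency, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the real ASCMamba uses Mamba state-space blocks, while here a toy causal scan (an exponential moving average, a hypothetical stand-in) plays the role of the sequence model on each axis.

```python
import numpy as np

def scan(seq, alpha=0.9):
    """Toy causal scan along axis 0: h[t] = alpha*h[t-1] + (1-alpha)*x[t].
    A stand-in for any sequence model (e.g. a Mamba block) applied along one axis."""
    out = np.zeros_like(seq)
    h = np.zeros(seq.shape[1:])
    for t in range(seq.shape[0]):
        h = alpha * h + (1 - alpha) * seq[t]
        out[t] = h
    return out

def dual_path_block(spec):
    """Dual-path processing of a (time, freq) spectrogram:
    scan over time frames, then over frequency bins, with residual connections."""
    temporal = spec + scan(spec)              # temporal path: sequence over time
    spectral = temporal + scan(temporal.T).T  # spectral path: sequence over frequency
    return spectral

spec = np.random.default_rng(0).standard_normal((100, 64))  # 100 frames, 64 bins
out = dual_path_block(spec)
print(out.shape)  # (100, 64): output keeps the spectrogram shape
```

Transposing between the two scans is what lets a single 1-D sequence model cover both long-range temporal and spectral dependencies; stacking such blocks alternates the two views.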

📝 Abstract
Acoustic Scene Classification (ASC) is a fundamental problem in computational audition that seeks to classify environments based on their distinctive acoustic features. In the ASC task of the APSIPA ASC 2025 Grand Challenge, the organizers introduce a multimodal ASC task. Unlike traditional ASC systems that rely solely on audio inputs, this challenge provides additional textual information, including the location where the audio is recorded and the time of recording. In this paper, we present our system for the ASC task of the APSIPA ASC 2025 Grand Challenge. Specifically, we propose a multimodal network, ASCMamba, which integrates audio and textual information for fine-grained acoustic scene understanding and effective multimodal ASC. ASCMamba employs a DenseEncoder to extract hierarchical spectral features from spectrograms, followed by dual-path Mamba blocks that capture long-range temporal and frequency dependencies using Mamba-based state-space models. In addition, we present a two-step pseudo-labeling mechanism to generate more reliable pseudo-labels. Results show that the proposed system outperforms all participating teams and achieves a 6.2% improvement over the baseline. Code, models, and pre-trained checkpoints are available at https://github.com/S-Orion/ASCMamba.git.
Problem

Research questions and friction points this paper is trying to address.

Classifying environments using multimodal audio and text inputs
Integrating location and time metadata with acoustic features
Improving acoustic scene classification accuracy with multimodal fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal network integrating audio and textual information
Dual-path Mamba blocks capturing long-range dependencies
Two-step pseudo-labeling mechanism for reliable pseudo-labels
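The two-step pseudo-labeling mechanism listed above can be illustrated with a common pattern; the paper's exact procedure is not reproduced here, so treat this as a hypothetical sketch: step one keeps only predictions whose confidence clears a threshold, and step two cross-checks them against a second pass (or second model) and retains only agreements.

```python
import numpy as np

def step1_filter(probs, threshold=0.9):
    """Step 1: keep predictions whose max class probability clears a threshold."""
    conf = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    keep = conf >= threshold
    return labels, keep

def step2_agree(labels_a, keep_a, labels_b, keep_b):
    """Step 2: retain pseudo-labels only where both passes are confident and agree."""
    keep = keep_a & keep_b & (labels_a == labels_b)
    return labels_a, keep

# Toy class probabilities from two passes over three unlabeled clips (2 classes).
probs_a = np.array([[0.95, 0.05], [0.55, 0.45], [0.08, 0.92]])
probs_b = np.array([[0.91, 0.09], [0.97, 0.03], [0.10, 0.90]])

la, ka = step1_filter(probs_a)
lb, kb = step1_filter(probs_b)
labels, keep = step2_agree(la, ka, lb, kb)
print(labels[keep].tolist())  # [0, 1]: clips 0 and 2 survive, clip 1 is too uncertain
```

Filtering twice trades coverage for precision: fewer pseudo-labels survive, but the ones that do are far more likely to be correct, which is the point of the mechanism under weak supervision.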
Bochao Sun
School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an, China
Dong Wang
School of Automation, Northwestern Polytechnical University, Xi’an, China
Han Yin
Tongyi Speech Lab, Alibaba Group
Audio Understanding · Multimodal LLM