ASCMamba: Multimodal Time-Frequency Mamba for Acoustic Scene Classification

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address coarse-grained audio-text fusion and insufficient long-range dependency modeling in multimodal acoustic scene classification (ASC), this paper proposes ASCMamba. Methodologically: (1) a dual-path Mamba module is designed to separately capture long-range temporal and spectral dependencies; (2) a DenseEncoder extracts hierarchical spectrogram features, while a state-space model processes the time-frequency sequence; (3) semantic textual cues, such as recording location and timestamp, are integrated to form an end-to-end multimodal joint modeling framework; (4) a two-step pseudo-labeling mechanism is introduced to improve label reliability under weak supervision. Evaluated on the APSIPA ASC 2025 Grand Challenge, ASCMamba outperforms all participating teams and achieves a 6.2% improvement over the baseline.
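The dual-path idea above, scanning a spectrogram once along time and once along frequency, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the real ASCMamba uses Mamba state-space blocks, while here a toy causal scan (an exponential moving average, a hypothetical stand-in) plays the role of the sequence model on each axis.

```python
import numpy as np

def scan(seq, alpha=0.9):
    """Toy causal scan along axis 0: h[t] = alpha*h[t-1] + (1-alpha)*x[t].
    A stand-in for any sequence model (e.g. a Mamba block) applied along one axis."""
    out = np.zeros_like(seq)
    h = np.zeros(seq.shape[1:])
    for t in range(seq.shape[0]):
        h = alpha * h + (1 - alpha) * seq[t]
        out[t] = h
    return out

def dual_path_block(spec):
    """Dual-path processing of a (time, freq) spectrogram:
    scan over time frames, then over frequency bins, with residual connections."""
    temporal = spec + scan(spec)              # temporal path: sequence over time
    spectral = temporal + scan(temporal.T).T  # spectral path: sequence over frequency
    return spectral

spec = np.random.default_rng(0).standard_normal((100, 64))  # 100 frames, 64 bins
out = dual_path_block(spec)
print(out.shape)  # (100, 64): output keeps the spectrogram shape
```

Transposing between the two scans is what lets a single 1-D sequence model cover both long-range temporal and spectral dependencies; stacking such blocks alternates the two views.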

📝 Abstract
Acoustic Scene Classification (ASC) is a fundamental problem in computational audition that seeks to classify environments based on their distinctive acoustic features. In the ASC task of the APSIPA ASC 2025 Grand Challenge, the organizers introduce a multimodal ASC task. Unlike traditional ASC systems that rely solely on audio inputs, this challenge provides additional textual information, including the location where the audio is recorded and the time of recording. In this paper, we present our system for the ASC task of the APSIPA ASC 2025 Grand Challenge. Specifically, we propose a multimodal network, ASCMamba, which integrates audio and textual information for fine-grained acoustic scene understanding and effective multimodal ASC. ASCMamba employs a DenseEncoder to extract hierarchical spectral features from spectrograms, followed by dual-path Mamba blocks that capture long-range temporal and frequency dependencies using Mamba-based state-space models. In addition, we present a two-step pseudo-labeling mechanism to generate more reliable pseudo-labels. Results show that the proposed system outperforms all participating teams and achieves a 6.2% improvement over the baseline. Code, models, and pre-trained checkpoints are available at https://github.com/S-Orion/ASCMamba.git.
Problem

Research questions and friction points this paper is trying to address.

Classifying environments using multimodal audio and text inputs
Integrating location and time metadata with acoustic features
Improving acoustic scene classification accuracy with multimodal fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal network integrating audio and textual information
Dual-path Mamba blocks capturing long-range dependencies
Two-step pseudo-labeling mechanism for reliable pseudo-labels
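The two-step pseudo-labeling mechanism listed above can be illustrated with a common pattern; the paper's exact procedure is not reproduced here, so treat this as a hypothetical sketch: step one keeps only predictions whose confidence clears a threshold, and step two cross-checks them against a second pass (or second model) and retains only agreements.

```python
import numpy as np

def step1_filter(probs, threshold=0.9):
    """Step 1: keep predictions whose max class probability clears a threshold."""
    conf = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    keep = conf >= threshold
    return labels, keep

def step2_agree(labels_a, keep_a, labels_b, keep_b):
    """Step 2: retain pseudo-labels only where both passes are confident and agree."""
    keep = keep_a & keep_b & (labels_a == labels_b)
    return labels_a, keep

# Toy class probabilities from two passes over three unlabeled clips (2 classes).
probs_a = np.array([[0.95, 0.05], [0.55, 0.45], [0.08, 0.92]])
probs_b = np.array([[0.91, 0.09], [0.97, 0.03], [0.10, 0.90]])

la, ka = step1_filter(probs_a)
lb, kb = step1_filter(probs_b)
labels, keep = step2_agree(la, ka, lb, kb)
print(labels[keep].tolist())  # [0, 1]: clips 0 and 2 survive, clip 1 is too uncertain
```

Filtering twice trades coverage for precision: fewer pseudo-labels survive, but the ones that do are far more likely to be correct, which is the point of the mechanism under weak supervision.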
Bochao Sun
School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an, China
Dong Wang
School of Automation, Northwestern Polytechnical University, Xi’an, China
Han Yin
Tongyi Speech Lab, Alibaba Group
Audio Understanding · Multimodal LLM