🤖 AI Summary
To address the limited generalization of acoustic scene classification (ASC) models caused by differences in acoustic environments across cities, this paper proposes a scene classification method that incorporates city-specific acoustic priors. The core method introduces city classification as an auxiliary supervision task and transfers city-level acoustic discriminative knowledge to the primary ASC model via knowledge distillation. Building on a joint city-scene labeling setup compatible with both CNN- and Transformer-based backbones, the method jointly learns cross-city invariant features and city-specific acoustic cues. Evaluated on the DCASE 2023 Task 1 benchmark, the approach consistently improves the accuracy of multiple state-of-the-art ASC models, with an average gain of +1.8%. The empirical results show that explicitly modeling city-level acoustic variation substantially improves the robustness and generalization of ASC systems.
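To make the distillation idea concrete, below is a minimal sketch of what such a combined objective could look like, assuming a Hinton-style temperature-scaled KL term between the frozen city classifier's outputs and the student's auxiliary city head. The function name, the `alpha` weighting, and the temperature value are illustrative placeholders, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def city2scene_loss(scene_logits, city_logits_student, city_logits_teacher,
                    scene_labels, alpha=0.5, temperature=2.0):
    """Combined objective: scene cross-entropy plus city knowledge distillation.

    scene_logits:        student predictions over scene classes   (B, n_scenes)
    city_logits_student: student auxiliary-head city predictions  (B, n_cities)
    city_logits_teacher: frozen city classifier's predictions     (B, n_cities)
    alpha, temperature:  illustrative hyperparameters (not from the paper).
    """
    # Primary task: standard cross-entropy against the scene labels.
    ce = F.cross_entropy(scene_logits, scene_labels)

    # Distillation: KL divergence between temperature-softened city
    # distributions of teacher and student (classic Hinton-style KD).
    kd = F.kl_div(
        F.log_softmax(city_logits_student / temperature, dim=-1),
        F.softmax(city_logits_teacher / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    return (1 - alpha) * ce + alpha * kd
```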
📝 Abstract
Acoustic scene recordings are often collected from a diverse range of cities. Most existing acoustic scene classification (ASC) approaches focus on identifying common acoustic scene patterns across cities to enhance generalization. In contrast, we hypothesize that city-specific environmental and cultural differences in acoustic features are beneficial for the ASC task. In this paper, we introduce City2Scene, a novel framework that leverages city features to improve ASC. City2Scene transfers city-specific knowledge from city classification models to a scene classification model using knowledge distillation. We evaluate City2Scene on the DCASE Challenge Task 1 datasets, where each audio clip is annotated with both scene and city labels. Experimental results demonstrate that city features provide valuable information for classifying scenes. By distilling the city-specific knowledge, City2Scene effectively improves accuracy for various state-of-the-art ASC backbone models, including both CNNs and Transformers.
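For completeness, here is a toy training step that exercises the loss sketched above. The linear layers and random features are stand-ins for the CNN/Transformer backbones and spectrogram inputs described in the paper; all names and dimensions here are hypothetical.

```python
import torch

B, n_scenes, n_cities, d = 8, 10, 12, 128
feats = torch.randn(B, d)                      # stand-in for audio embeddings
scene_labels = torch.randint(0, n_scenes, (B,))

scene_head = torch.nn.Linear(d, n_scenes)      # student: primary ASC head
city_head = torch.nn.Linear(d, n_cities)       # student: auxiliary city head
city_teacher = torch.nn.Linear(d, n_cities)    # stand-in for a frozen, pretrained city model

with torch.no_grad():                          # teacher is not updated
    teacher_logits = city_teacher(feats)

loss = city2scene_loss(scene_head(feats), city_head(feats),
                       teacher_logits, scene_labels)
loss.backward()                                # gradients flow only into the student heads
```

In this setup, only the scene labels and the teacher's soft city posteriors supervise the student, so the auxiliary city head can be discarded at inference time.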