🤖 AI Summary
To address performance degradation in sound event classification (SEC) caused by microphone device variability, this paper proposes a unified feature-mapping framework based on a frequency-response-conditioned CycleGAN. The core innovation is the incorporation of microphone frequency response information into the CycleGAN generator via Feature-wise Linear Modulation (FiLM), enabling a single model to perform bidirectional, unpaired time-frequency feature translation across arbitrary microphone pairs. This removes the conventional requirement of training a separate model for each device pair and substantially improves cross-device robustness. Experiments on standard SEC benchmarks show that the method achieves a 2.6% absolute improvement in macro-average F1 score over the state of the art while reducing inter-device variability in macro-average F1 score by 0.8%, validating its effectiveness and generalizability.
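To make the conditioning mechanism concrete, here is a minimal NumPy sketch of FiLM applied to a generator's intermediate time-frequency feature map. All shapes, the frequency-response descriptor, and the linear FiLM generator are illustrative assumptions, not the authors' implementation: the idea is only that a per-device frequency response vector is projected to per-channel scale (gamma) and shift (beta) parameters that modulate the features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: channels, frequency bins, time frames.
C, F, T = 8, 64, 32
R = 16  # length of the (assumed) frequency-response descriptor

# Intermediate time-frequency feature map inside the generator.
features = rng.standard_normal((C, F, T))

# Frequency response descriptor of the target microphone (assumed input).
freq_response = rng.standard_normal(R)

# Small linear "FiLM generator": maps the descriptor to 2*C parameters.
# In practice this would be a learned layer; here it is random for illustration.
W = rng.standard_normal((2 * C, R)) * 0.1
b = np.zeros(2 * C)
film_params = W @ freq_response + b
gamma, beta = film_params[:C], film_params[C:]

# Feature-wise Linear Modulation: scale and shift each channel,
# broadcasting the per-channel parameters over frequency and time.
modulated = gamma[:, None, None] * features + beta[:, None, None]

print(modulated.shape)
```

Because the device identity enters only through `gamma` and `beta`, a single generator can be steered toward any target microphone whose frequency response is available, which is what allows the many-to-many mapping described above.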
📝 Abstract
In this study, we introduce Unified Microphone Conversion, a unified generative framework that enhances the resilience of sound event classification systems against device variability. Addressing the limitations of previous work, we condition the generator network on frequency response information to achieve many-to-many device mapping. This overcomes an inherent limitation of CycleGAN, which requires a separate model for each device pair. Our framework leverages CycleGAN's strength of unpaired training to simulate device characteristics in audio recordings and significantly extends its scalability by integrating frequency-response-related information via Feature-wise Linear Modulation. Experimental results show that our method outperforms the state-of-the-art method by 2.6% and reduces variability by 0.8% in macro-average F1 score.