🤖 AI Summary
In real-world anomalous sound detection (ASD), distribution shifts such as unknown low signal-to-noise ratios and heterogeneous noise types degrade model generalization and distort learned representations.
Method: We propose a "retain-not-denoise" training paradigm built upon a frozen self-supervised audio encoder. Our approach introduces a hybrid embedding alignment mechanism: using teacher representations formed as convex combinations of clean-source and noise embeddings as supervision, it jointly optimizes a multi-label classification loss and a hybrid alignment loss to guide the student model toward robust, consistent representations of mixed acoustic sources. Inference requires no additional adaptation, preserving efficiency.
Results: Experiments demonstrate substantial improvements in out-of-distribution generalization under stationary, non-stationary, and noise-mismatch conditions. The method narrows the gap between learned and ideal mixed-source representations, offering a scalable, robust solution for ASD in realistic acoustic environments.
📝 Abstract
Anomalous sound detection (ASD) in the wild requires robustness to distribution shifts such as unseen low-SNR mixtures of machine sounds and noise types. State-of-the-art systems extract embeddings from an adapted audio encoder and detect anomalies via nearest-neighbor search, but fine-tuning on noisy machine sounds often acts like a denoising objective, suppressing noise and reducing generalization under mismatched mixtures or inconsistent labeling. Training-free systems with frozen self-supervised learning (SSL) encoders avoid this issue and show strong first-shot generalization, yet their performance drops when mixture embeddings deviate from clean-source embeddings. We propose to improve SSL backbones with a retain-not-denoise strategy that better preserves information from mixed sound sources. The approach combines a multi-label audio tagging loss with a mixture alignment loss that aligns student mixture embeddings to convex combinations of teacher embeddings of clean and noise inputs. Controlled experiments on stationary, non-stationary, and mismatched noise subsets demonstrate improved robustness under distribution shifts, narrowing the gap toward oracle mixture representations.
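To make the objective concrete, here is a minimal numpy sketch of how such a hybrid loss could be computed. All names (`hybrid_loss`, `lam`, `alpha`) and the specific choices of MSE for alignment and binary cross-entropy for tagging are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hybrid_loss(student_mix_emb, teacher_clean_emb, teacher_noise_emb,
                logits, labels, lam=0.7, alpha=1.0):
    """Illustrative sketch of a retain-not-denoise objective:
    a multi-label BCE tagging loss plus an alignment loss that pulls
    the student's mixture embedding toward a convex combination of
    frozen teacher embeddings of the clean source and the noise.
    The MSE/BCE pairing and the weights lam, alpha are assumptions."""
    # Convex teacher target: lam * clean + (1 - lam) * noise
    target = lam * teacher_clean_emb + (1.0 - lam) * teacher_noise_emb
    align = np.mean((student_mix_emb - target) ** 2)  # alignment term (MSE here)

    # Multi-label tagging term: elementwise binary cross-entropy
    p = sigmoid(logits)
    eps = 1e-9
    bce = -np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))
    return bce + alpha * align

# Toy usage: a student embedding exactly at the convex target and
# confident correct tag logits should yield a near-zero loss.
clean = np.ones(4)
noise = np.zeros(4)
good = hybrid_loss(0.7 * clean, clean, noise,
                   logits=np.array([10.0, -10.0]),
                   labels=np.array([1.0, 0.0]))
bad = hybrid_loss(0.7 * clean + 0.5, clean, noise,
                  logits=np.array([10.0, -10.0]),
                  labels=np.array([1.0, 0.0]))
```

A perturbed mixture embedding (`bad`) incurs a strictly higher loss than one sitting at the convex teacher target (`good`), which is the behavior the alignment term is meant to enforce.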