🤖 AI Summary
In real-world anomalous sound detection (ASD), distribution shifts such as unknown low signal-to-noise ratios and heterogeneous noise types degrade model generalization and distort learned representations.
Method: We propose a "retain-not-denoise" training paradigm built upon a frozen self-supervised audio encoder. Our approach introduces a hybrid embedding alignment mechanism: using teacher representations formed as convex combinations of clean-source and noise embeddings as supervision, it jointly optimizes a multi-label classification loss and a hybrid alignment loss to guide the student model toward robust, consistent representations of mixed acoustic sources. Inference requires no additional adaptation, preserving efficiency.
Results: Experiments demonstrate substantial improvements in out-of-distribution generalization under stationary, non-stationary, and noise-mismatch conditions. The method narrows the gap between learned and ideal mixed-source representations, offering a scalable, robust solution for ASD in realistic acoustic environments.
📝 Abstract
Anomalous sound detection (ASD) in the wild requires robustness to distribution shifts such as unseen low-SNR mixtures of machine sounds and noise types. State-of-the-art systems extract embeddings from an adapted audio encoder and detect anomalies via nearest-neighbor search, but fine-tuning on noisy machine sounds often acts like a denoising objective, suppressing noise and reducing generalization under mismatched mixtures or inconsistent labeling. Training-free systems with frozen self-supervised learning (SSL) encoders avoid this issue and show strong first-shot generalization, yet their performance drops when mixture embeddings deviate from clean-source embeddings. We propose to improve SSL backbones with a retain-not-denoise strategy that better preserves information from mixed sound sources. The approach combines a multi-label audio tagging loss with a mixture alignment loss that aligns student mixture embeddings to convex combinations of teacher embeddings of clean and noise inputs. Controlled experiments on stationary, non-stationary, and mismatched noise subsets demonstrate improved robustness under distribution shifts, narrowing the gap toward oracle mixture representations.
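To make the objective concrete, here is a minimal numpy sketch of how such a hybrid loss could be computed. All names (`hybrid_loss`, `lam`, `alpha`) and the specific choices of MSE for alignment and binary cross-entropy for tagging are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hybrid_loss(student_mix_emb, teacher_clean_emb, teacher_noise_emb,
                logits, labels, lam=0.7, alpha=1.0):
    """Illustrative sketch of a retain-not-denoise objective:
    a multi-label BCE tagging loss plus an alignment loss that pulls
    the student's mixture embedding toward a convex combination of
    frozen teacher embeddings of the clean source and the noise.
    The MSE/BCE pairing and the weights lam, alpha are assumptions."""
    # Convex teacher target: lam * clean + (1 - lam) * noise
    target = lam * teacher_clean_emb + (1.0 - lam) * teacher_noise_emb
    align = np.mean((student_mix_emb - target) ** 2)  # alignment term (MSE here)

    # Multi-label tagging term: elementwise binary cross-entropy
    p = sigmoid(logits)
    eps = 1e-9
    bce = -np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))
    return bce + alpha * align

# Toy usage: a student embedding exactly at the convex target and
# confident correct tag logits should yield a near-zero loss.
clean = np.ones(4)
noise = np.zeros(4)
good = hybrid_loss(0.7 * clean, clean, noise,
                   logits=np.array([10.0, -10.0]),
                   labels=np.array([1.0, 0.0]))
bad = hybrid_loss(0.7 * clean + 0.5, clean, noise,
                  logits=np.array([10.0, -10.0]),
                  labels=np.array([1.0, 0.0]))
```

A perturbed mixture embedding (`bad`) incurs a strictly higher loss than one sitting at the convex teacher target (`good`), which is the behavior the alignment term is meant to enforce.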