Retaining Mixture Representations for Domain Generalized Anomalous Sound Detection

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Distribution shifts in real-world anomalous sound detection (ASD), such as unknown low signal-to-noise ratios and heterogeneous noise types, degrade model generalization and distort learned representations. Method: We propose a "preserve-rather-than-denoise" training paradigm built upon a frozen self-supervised audio encoder. Our approach introduces a hybrid embedding alignment mechanism: using teacher representations derived from convex combinations of clean sources and noise as supervision, it jointly optimizes a multi-label classification loss and a hybrid alignment loss to guide the student model toward robust, consistent representations of mixed acoustic sources. Inference requires no additional adaptation, preserving efficiency. Results: Experiments demonstrate substantial improvements in out-of-distribution generalization under stationary/non-stationary noise and noise-mismatch conditions. The method effectively narrows the gap between learned and ideal mixed-source representations, offering a scalable, robust solution for ASD in realistic acoustic environments.

📝 Abstract
Anomalous sound detection (ASD) in the wild requires robustness to distribution shifts such as unseen low-SNR input mixtures of machine and noise types. State-of-the-art systems extract embeddings from an adapted audio encoder and detect anomalies via nearest-neighbor search, but fine-tuning on noisy machine sounds often acts like a denoising objective, suppressing noise and reducing generalization under mismatched mixtures or inconsistent labeling. Training-free systems with frozen self-supervised learning (SSL) encoders avoid this issue and show strong first-shot generalization, yet their performance drops when mixture embeddings deviate from clean-source embeddings. We propose to improve SSL backbones with a retain-not-denoise strategy that better preserves information from mixed sound sources. The approach combines a multi-label audio tagging loss with a mixture alignment loss that aligns student mixture embeddings to convex teacher embeddings of clean and noise inputs. Controlled experiments on stationary, non-stationary, and mismatched noise subsets demonstrate improved robustness under distribution shifts, narrowing the gap toward oracle mixture representations.
Problem

Research questions and friction points this paper is trying to address.

Detecting anomalous sounds under domain distribution shifts
Preserving mixture representations instead of denoising sounds
Improving generalization with mixture alignment and multi-label training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retain-not-denoise strategy preserves mixed sound information
Multi-label audio tagging loss with mixture alignment
Aligns student mixture embeddings to convex teacher embeddings
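The training objective described above can be illustrated with a minimal sketch. The paper's actual encoders, loss weights, and label set are not specified here, so the linear `encoder`, the mixing weight `alpha`, the tag vector, and the equal weighting of the two losses are all hypothetical stand-ins; the sketch only shows the shape of the idea: a teacher target formed as a convex combination of clean and noise embeddings, a student embedding of the mixture, an alignment loss between them, and a multi-label tagging loss on top.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W):
    """Toy stand-in for a frozen SSL audio encoder (hypothetical linear map + tanh)."""
    return np.tanh(x @ W)

D_IN, D_EMB = 16, 8
W_teacher = rng.standard_normal((D_IN, D_EMB))                     # frozen teacher
W_student = W_teacher + 0.1 * rng.standard_normal((D_IN, D_EMB))   # trainable student

clean = rng.standard_normal(D_IN)   # clean machine-sound features (stand-in)
noise = rng.standard_normal(D_IN)   # background-noise features (stand-in)
alpha = 0.7                         # mixing weight; SNR-dependent in practice
mixture = alpha * clean + (1 - alpha) * noise

# Teacher target: convex combination of the clean and noise embeddings.
t = alpha * encoder(clean, W_teacher) + (1 - alpha) * encoder(noise, W_teacher)
# Student embedding of the mixed input.
s = encoder(mixture, W_student)

# Mixture alignment loss: pull the student mixture embedding toward the teacher target.
align_loss = np.mean((s - t) ** 2)

# Multi-label audio tagging loss (binary cross-entropy over hypothetical tags,
# reusing the embedding as logits purely for compactness in this sketch).
labels = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0])
probs = 1.0 / (1.0 + np.exp(-s))
tag_loss = -np.mean(labels * np.log(probs + 1e-9)
                    + (1 - labels) * np.log(1 - probs + 1e-9))

total_loss = tag_loss + align_loss  # joint objective (weighting is an assumption)
print(float(total_loss))
```

In the actual system only the student would be updated to minimize this joint objective, while the teacher stays frozen; at inference the student embeddings feed a nearest-neighbor anomaly detector with no extra adaptation step.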
Phurich Saengthong
Institute of Science Tokyo
Tomoya Nishida
R&D Group, Hitachi Ltd., Japan
Kota Dohi
R&D Group, Hitachi Ltd., Japan
Natsuo Yamashita
R&D Group, Hitachi Ltd., Japan
Yohei Kawaguchi
Hitachi, Ltd.
Acoustic Signal Processing, Signal Processing, Machine Learning, Speech Processing, AI