TLDiffGAN: A Latent Diffusion-GAN Framework with Temporal Information Fusion for Anomalous Sound Detection

📅 2026-02-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of modeling the complex distribution of normal sounds in unsupervised anomalous sound detection, a task where diffusion models remain largely unexplored. The authors propose a dual-branch framework, TLDiffGAN. One branch integrates a latent diffusion model into the generator of a generative adversarial network (GAN) and trains it adversarially to improve generation quality. The other branch uses a pretrained audio encoder to extract features directly from raw waveforms, so that the framework combines time-domain waveform features with Mel-spectrogram representations. A TMixup spectrogram augmentation strategy is introduced to increase sensitivity to subtle, localized temporal patterns. Presented as the first approach to combine latent diffusion models with GANs for anomalous sound detection, the method is reported to achieve state-of-the-art performance on DCASE 2020 Task 2 and to localize anomalies well in the time-frequency domain.
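For readers unfamiliar with the diffusion component, the following is a minimal sketch of the standard DDPM-style forward noising step applied to a latent vector. It is not the paper's implementation: the linear beta schedule, the step count, the latent shape, and the function names `linear_beta_schedule` and `forward_noise` are all assumptions chosen for illustration.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (a common default, assumed here)."""
    return np.linspace(beta_start, beta_end, num_steps)


def forward_noise(z0, t, alpha_bar, rng):
    """Sample z_t ~ q(z_t | z_0) = N(sqrt(alpha_bar_t) * z_0, (1 - alpha_bar_t) * I).

    Returns the noised latent and the Gaussian noise that was added.
    """
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps


# Hypothetical setup: 1000 diffusion steps and an 8 x 16 latent.
betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal-retention factor

rng = np.random.default_rng(0)
z0 = rng.standard_normal((8, 16))    # stand-in for an encoded Mel-spectrogram
zt, eps = forward_noise(z0, 500, alpha_bar, rng)
```

In a latent diffusion model, a denoising network is typically trained to predict `eps` from `zt` and `t`. How TLDiffGAN couples that objective with the adversarial loss is not specified in this summary.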

📝 Abstract

Existing generative models for unsupervised anomalous sound detection are limited by their inability to fully capture the complex feature distribution of normal sounds, while the potential of powerful diffusion models in this domain remains largely unexplored. To address this challenge, we propose a novel framework, TLDiffGAN, which consists of two complementary branches. One branch incorporates a latent diffusion model into the GAN generator for adversarial training, thereby making the discriminator's task more challenging and improving the quality of generated samples. The other branch leverages pretrained audio model encoders to extract features directly from raw audio waveforms for auxiliary discrimination. This framework effectively captures feature representations of normal sounds from both raw audio and Mel spectrograms. Moreover, we introduce a TMixup spectrogram augmentation technique to enhance sensitivity to subtle and localized temporal patterns that are often overlooked. Extensive experiments on the DCASE 2020 Challenge Task 2 dataset demonstrate the superior detection performance of TLDiffGAN, as well as its strong capability in anomalous time-frequency localization.
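The abstract does not define TMixup, so the sketch below should be read only as one plausible form of time-localized spectrogram mixing, not as the authors' method. It blends a randomly placed block of time frames from a second spectrogram into the first. The function name `time_segment_mixup`, the parameters `seg_len` and `lam`, and the (n_mels, n_frames) layout are all hypothetical.

```python
import numpy as np


def time_segment_mixup(spec_a, spec_b, seg_len, lam, rng):
    """Blend a random time segment of spec_b into spec_a.

    Both inputs are arrays of shape (n_mels, n_frames). Inside the chosen
    segment the output is lam * spec_a + (1 - lam) * spec_b; outside the
    segment it equals spec_a. Returns the mixed array and the start frame.
    """
    if spec_a.shape != spec_b.shape:
        raise ValueError("spectrograms must have the same shape")
    n_frames = spec_a.shape[1]
    if not 0 < seg_len <= n_frames:
        raise ValueError("seg_len must be in (0, n_frames]")

    # Pick where the mixed segment starts along the time axis.
    start = int(rng.integers(0, n_frames - seg_len + 1))
    stop = start + seg_len

    mixed = spec_a.copy()
    mixed[:, start:stop] = lam * spec_a[:, start:stop] + (1.0 - lam) * spec_b[:, start:stop]
    return mixed, start


# Toy inputs that make the mixed region easy to see.
rng = np.random.default_rng(0)
spec_a = np.zeros((4, 20))
spec_b = np.ones((4, 20))
mixed, start = time_segment_mixup(spec_a, spec_b, seg_len=5, lam=0.25, rng=rng)
```

Restricting the mix to a short time window, rather than blending whole spectrograms, is one way an augmentation could emphasize localized temporal patterns. Whether TMixup works this way cannot be confirmed from the abstract alone.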

Problem

Research questions and open problems this paper addresses.

anomalous sound detection

generative models

diffusion models

feature distribution

unsupervised learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Diffusion-GAN

Temporal Information Fusion

TMixup Augmentation