Latent Multi-view Learning for Robust Environmental Sound Representations

📅 2025-10-02

📈 Citations: 0

✨ Influential: 0

career value

208K/year

🤖 AI Summary

This work addresses the insufficient robustness of environmental sound representation learning by proposing a multi-view self-supervised framework that integrates contrastive learning with generative modeling. Methodologically, it introduces view-specific encoders and a shared-subspace disentanglement architecture, jointly optimizing contrastive loss and multi-view reconstruction objectives within a compressed latent space to achieve targeted disentanglement of sound-source content from device-related factors and facilitate structured information flow. Its key innovation lies in the first integration of contrastive learning into a generative pipeline, where subspace decomposition unifies the modeling of invariant (content) and variant (device/context) acoustic features. Experiments on the Urban Acoustic Sensor Network dataset demonstrate that the method significantly outperforms state-of-the-art self-supervised baselines, achieving an average accuracy improvement of 4.2% across downstream tasks including sound source classification and sensor identification.

Technology Category

Application Category

📝 Abstract

Self-supervised learning (SSL) approaches, such as contrastive and generative methods, have advanced environmental sound representation learning using unlabeled data. However, how these approaches can complement each other within a unified framework remains relatively underexplored. In this work, we propose a multi-view learning framework that integrates contrastive principles into a generative pipeline to capture sound source and device information. Our method encodes compressed audio latents into view-specific and view-common subspaces, guided by two self-supervised objectives: contrastive learning for targeted information flow between subspaces, and reconstruction for overall information preservation. We evaluate our method on an urban sound sensor network dataset for sound source and sensor classification, demonstrating improved downstream performance over traditional SSL techniques. Additionally, we investigate the model's potential to disentangle environmental sound attributes within the structured latent space under varied training configurations.

Problem

Research questions and friction points this paper is trying to address.

Integrating contrastive and generative SSL methods for robust sound representations

Capturing sound source and device information through multi-view learning

Disentangling environmental sound attributes in structured latent spaces

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates contrastive learning into generative pipeline

Encodes audio latents into view-specific and common subspaces

Uses dual objectives for information flow and preservation

🔎 Similar Papers

Compositional Audio Representation Learning