🤖 AI Summary
In general-purpose audio representation learning, spectrogram-based methods suffer from high computational latency and loss of phase information, while self-supervised waveform modeling has struggled to generalize beyond speech. To address both bottlenecks, we propose WavJEPA, the first joint-embedding predictive architecture that operates directly on raw waveforms for high-level semantic modeling. WavJEPA learns semantic consistency across temporal scales natively in the time domain; a multi-channel variant, WavJEPA-Nat, is trained on simulated naturalistic scenes to improve robustness to noise and reverberation. Evaluated on 12 downstream tasks, including ASR, music classification, and acoustic event detection, WavJEPA consistently outperforms existing waveform models, with an average accuracy gain of 3.2%, a 27% reduction in inference latency, and a 41% decrease in FLOPs. These results validate the effectiveness and practical viability of end-to-end semantic learning from raw waveforms.
📝 Abstract
Learning audio representations from raw waveforms overcomes key limitations of spectrogram-based audio representation learning, such as the long latency of spectrogram computation and the loss of phase information. Yet, while self-supervised speech representation learning from raw waveforms has been remarkably successful, these approaches have not matched this success for general-purpose audio representation learning from waveforms. Here, we propose WavJEPA, a waveform-based version of the Joint-Embedding Predictive Architecture. WavJEPA leverages high-level semantic representation learning to tackle the shortcomings of representation learning at the speech-unit or token level. We show that this approach substantially outperforms state-of-the-art time-domain audio foundation models across a wide variety of downstream benchmark tasks, while requiring considerably fewer computational resources. Additionally, to overcome the performance drop that time-domain models typically exhibit in noisy and reverberant real-world acoustic environments, we present WavJEPA-Nat, a multi-channel extension of the WavJEPA architecture trained on simulated naturalistic scenes. We find that WavJEPA-Nat is highly robust to reverberation and noise. These results highlight the feasibility and computational efficiency of general-purpose audio representation learning from raw waveforms, showcasing the potential of low-latency, robust time-domain audio foundation models for real-world applications.
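The core idea of a joint-embedding predictive architecture, predicting the *latent representations* of masked waveform regions rather than the raw samples themselves, can be illustrated with a toy numpy sketch. All dimensions, the linear "encoders", and the pooled predictor below are illustrative stand-ins, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): 16 frames of 320 raw samples each,
# embedded into a 32-dimensional latent space.
n_frames, frame_len, d = 16, 320, 32

def encode(x, W):
    """Toy 'encoder': one linear map per frame (stand-in for a conv/transformer encoder)."""
    return x @ W

wave = rng.standard_normal((n_frames, frame_len))   # toy raw waveform, framed

W_ctx = rng.standard_normal((frame_len, d)) * 0.01  # context-encoder weights (trainable)
W_tgt = W_ctx.copy()                                # target encoder: EMA copy, no gradients
W_pred = np.eye(d)                                  # toy predictor head

mask = np.zeros(n_frames, dtype=bool)
mask[6:10] = True                                   # frames hidden from the context encoder

# Context path: encode only the visible frames, then predict the masked latents.
ctx = encode(wave[~mask], W_ctx)                    # (12, d)
pred = ctx.mean(axis=0) @ W_pred                    # crude pooled prediction, shape (d,)

# Target path: encode the masked frames with the frozen target encoder.
tgt = encode(wave[mask], W_tgt)                     # (4, d)

# JEPA-style objective: regression in latent space, not sample-level reconstruction.
loss = np.mean((pred[None, :] - tgt) ** 2)

# After each gradient step on W_ctx, the target encoder tracks it by EMA
# (momentum update, as in BYOL/JEPA-style training).
W_tgt = 0.99 * W_tgt + 0.01 * W_ctx
```

Because the loss lives in latent space, the model is pushed toward high-level semantic features of the masked region instead of low-level waveform detail, which is the property the abstract contrasts with speech-unit or token-level objectives.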