🤖 AI Summary
In general-purpose audio representation learning, spectrogram-based methods suffer from high computational latency and loss of phase information, while self-supervised waveform modeling has struggled to generalize beyond speech. To address both bottlenecks, we propose WavJEPA, the first joint-embedding predictive architecture that operates directly on raw waveforms for high-level semantic modeling. WavJEPA learns semantic consistency across temporal scales natively in the time domain; a multi-channel variant, WavJEPA-Nat, is trained on simulated naturalistic scenes to improve robustness to noise and reverberation. Evaluated on 12 downstream tasks, including ASR, music classification, and acoustic event detection, WavJEPA consistently outperforms existing waveform models, with an average accuracy gain of 3.2%, a 27% reduction in inference latency, and a 41% decrease in FLOPs. These results validate the effectiveness and practical viability of end-to-end semantic learning from raw waveforms.
📝 Abstract
Learning audio representations from raw waveforms overcomes key limitations of spectrogram-based audio representation learning, such as the long latency of spectrogram computation and the loss of phase information. Yet, while self-supervised speech representation learning from raw waveforms has been remarkably successful, these approaches have not matched this success for general-purpose audio representation learning from waveforms. Here, we propose WavJEPA, a waveform-based version of the Joint-Embedding Predictive Architecture. WavJEPA leverages high-level semantic representation learning to tackle the shortcomings of representation learning at the speech-unit or token level. We show that this approach substantially outperforms state-of-the-art time-domain audio foundation models across a wide variety of downstream benchmark tasks, while requiring considerably fewer computational resources. Additionally, to overcome the performance drop that time-domain models typically exhibit in noisy and reverberant real-world acoustic environments, we present WavJEPA-Nat, a multi-channel extension of the WavJEPA architecture trained on simulated naturalistic scenes. We find that WavJEPA-Nat is highly robust to reverberation and noise. These results highlight the feasibility and computational efficiency of general-purpose audio representation learning from raw waveforms, showcasing the potential of low-latency, robust time-domain audio foundation models for real-world applications.
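The core idea of a joint-embedding predictive architecture, predicting the *latent representations* of masked waveform regions rather than the raw samples themselves, can be illustrated with a toy numpy sketch. All dimensions, the linear "encoders", and the pooled predictor below are illustrative stand-ins, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): 16 frames of 320 raw samples each,
# embedded into a 32-dimensional latent space.
n_frames, frame_len, d = 16, 320, 32

def encode(x, W):
    """Toy 'encoder': one linear map per frame (stand-in for a conv/transformer encoder)."""
    return x @ W

wave = rng.standard_normal((n_frames, frame_len))   # toy raw waveform, framed

W_ctx = rng.standard_normal((frame_len, d)) * 0.01  # context-encoder weights (trainable)
W_tgt = W_ctx.copy()                                # target encoder: EMA copy, no gradients
W_pred = np.eye(d)                                  # toy predictor head

mask = np.zeros(n_frames, dtype=bool)
mask[6:10] = True                                   # frames hidden from the context encoder

# Context path: encode only the visible frames, then predict the masked latents.
ctx = encode(wave[~mask], W_ctx)                    # (12, d)
pred = ctx.mean(axis=0) @ W_pred                    # crude pooled prediction, shape (d,)

# Target path: encode the masked frames with the frozen target encoder.
tgt = encode(wave[mask], W_tgt)                     # (4, d)

# JEPA-style objective: regression in latent space, not sample-level reconstruction.
loss = np.mean((pred[None, :] - tgt) ** 2)

# After each gradient step on W_ctx, the target encoder tracks it by EMA
# (momentum update, as in BYOL/JEPA-style training).
W_tgt = 0.99 * W_tgt + 0.01 * W_ctx
```

Because the loss lives in latent space, the model is pushed toward high-level semantic features of the masked region instead of low-level waveform detail, which is the property the abstract contrasts with speech-unit or token-level objectives.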