🤖 AI Summary
This work proposes WavFlow, an end-to-end framework for high-fidelity audio generation directly in raw waveform space, avoiding the added complexity and potential information loss of latent-space compression. WavFlow reshapes waveforms into two-dimensional token grids via waveform patchify, applies amplitude lifting to align signal scales, and trains with direct x-prediction under flow matching. The model is trained from scratch on 5 million automatically curated video-text-audio triplets to learn semantic alignment and temporal synchronization. WavFlow matches or exceeds established latent-based methods on standard benchmarks, reaching FD_PaSST = 59.98, IS_PANNs = 17.40, and DeSync = 0.44 on VGGSound (video-to-audio) and FD_PANNs = 10.63 and IS_PANNs = 12.62 on AudioCaps (text-to-audio), which supports the feasibility of uncompressed waveform-based audio synthesis.
📝 Abstract
Modern audio generation predominantly relies on latent-space compression, introducing additional complexity and potential information loss. In this work, we challenge this paradigm with WavFlow, a framework that generates high-fidelity audio directly in raw waveform space without intermediate representations. To overcome the inherent difficulties of modeling high-dimensional and low-energy signals, we reshape audio into 2D token grids through waveform patchify and introduce amplitude lifting to align signal scales, enabling stable optimization via direct x-prediction in flow matching. To capture complex semantic alignment and temporal synchronization, we leverage an automated data pipeline to curate 5 million high-quality video-text-audio triplets, allowing the model to learn fine-grained acoustic patterns from scratch. Experimental results show that WavFlow achieves competitive performance on the video-to-audio benchmark VGGSound (FD_PaSST: 59.98, IS_PANNs: 17.40, DeSync: 0.44) and the text-to-audio benchmark AudioCaps (FD_PANNs: 10.63, IS_PANNs: 12.62), matching or exceeding the performance of established latent-based methods. Our work demonstrates that intermediate compression is not a prerequisite for high-quality synthesis, offering a simpler and more scalable alternative for multimodal audio generation.
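The abstract names three ingredients (waveform patchify, amplitude lifting, and x-prediction in flow matching) without giving their details. The following is a minimal NumPy sketch of how such steps could look, not the authors' implementation. Everything concrete here is an assumption made for illustration: the patch length, the grid width, modeling amplitude lifting as a constant gain, the linear noise-to-data path, and all function names.

```python
import numpy as np


def patchify(wave, patch_len, grid_width):
    """Reshape a 1D waveform into a (rows, grid_width, patch_len) token grid.

    The tail is zero-padded so the sample count fills the grid exactly.
    The grid layout is an illustrative assumption, not taken from the paper.
    """
    num_tokens = -(-len(wave) // patch_len)   # ceiling division
    rows = -(-num_tokens // grid_width)
    padded = np.zeros(rows * grid_width * patch_len, dtype=wave.dtype)
    padded[: len(wave)] = wave
    return padded.reshape(rows, grid_width, patch_len)


def unpatchify(grid, num_samples):
    """Invert patchify by flattening the grid and dropping the padding."""
    return grid.reshape(-1)[:num_samples]


def lift_amplitude(wave, gain):
    """Scale a low-energy waveform; a constant gain is an assumed stand-in."""
    return wave * gain


def sample_noisy_input(x, rng):
    """Draw t in [0, 1) and x_t = (1 - t) * noise + t * x (assumed linear path).

    Under x-prediction, a network would take (x_t, t) and regress x itself.
    """
    t = rng.uniform()
    noise = rng.standard_normal(x.shape)
    x_t = (1.0 - t) * noise + t * x
    return t, x_t, noise


def velocity_from_x(x_pred, x_t, t):
    """Convert a clean-signal prediction into the velocity of the linear path.

    Since x - x_t = (1 - t) * (x - noise), the velocity x - noise equals
    (x_pred - x_t) / (1 - t) when x_pred is exact.
    """
    return (x_pred - x_t) / (1.0 - t)


# Usage: a quiet 1000-sample sine, lifted and arranged as a token grid.
rng = np.random.default_rng(0)
wave = 0.01 * np.sin(np.linspace(0.0, 20.0, 1000))
lifted = lift_amplitude(wave, gain=50.0)
grid = patchify(lifted, patch_len=16, grid_width=8)
restored = unpatchify(grid, len(lifted))

t, x_t, noise = sample_noisy_input(grid, rng)
velocity = velocity_from_x(grid, x_t, t)   # uses the true x as a perfect prediction
```

In this sketch the network target is the clean grid itself rather than noise or velocity, which is what "direct x-prediction" refers to. The velocity needed to integrate the flow at sampling time can still be recovered from that prediction, as `velocity_from_x` shows.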