🤖 AI Summary
Addressing the challenges of long-range temporal modeling and high computational cost in Foley sound synthesis, this paper proposes what is, to the authors' knowledge, the first conditional diffusion audio synthesis framework based on the Mamba selective state space model (SSM). Unlike conventional Transformer- or U-Net-based architectures, the approach integrates Mamba into Foley generation, leveraging its linear-time complexity and strong sequence-modeling capability to build an efficient, high-fidelity conditional denoising diffusion probabilistic model (DDPM). Quantitatively, the method improves significantly over state-of-the-art (SOTA) baselines on objective metrics (including STFT-L1 and Fréchet Audio Distance, FAD), attains a notably higher mean opinion score (MOS) in subjective evaluation, and accelerates inference by 2.3×. The work empirically validates the effectiveness and practicality of SSMs for professional-grade audio generation and establishes a new paradigm for long-horizon conditional audio synthesis.
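The linear-time property highlighted above comes from Mamba's selective scan: the SSM parameters are functions of the input at each timestep, yet the state update remains a constant-cost recurrence. A minimal single-channel sketch in NumPy (illustrative only; all weight names here are assumptions, not the paper's implementation):

```python
import numpy as np

def selective_ssm_scan(x, W_dt, W_B, W_C, A):
    """Toy 1-channel selective SSM: the step size and the input/output
    matrices depend on the current input x_t ('selection'), while each
    state update costs O(d_state), giving linear time in sequence length."""
    T, d_state = len(x), A.shape[0]
    h = np.zeros(d_state)
    y = np.empty(T)
    for t in range(T):
        dt = np.log1p(np.exp(W_dt * x[t]))   # softplus: positive step size
        A_bar = np.exp(dt * A)               # discretized decay (A < 0)
        B_t = W_B * x[t]                     # input-dependent input matrix
        C_t = W_C * x[t]                     # input-dependent output matrix
        h = A_bar * h + dt * B_t * x[t]      # recurrent state update
        y[t] = C_t @ h                       # scalar readout per timestep
    return y

rng = np.random.default_rng(0)
d_state = 4
y = selective_ssm_scan(rng.standard_normal(16),
                       W_dt=0.5,
                       W_B=rng.standard_normal(d_state),
                       W_C=rng.standard_normal(d_state),
                       A=-np.abs(rng.standard_normal(d_state)))
```

In the real model this scan runs over thousands of channels in parallel with a hardware-aware kernel, which is what makes it attractive for long audio sequences.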
📝 Abstract
Recent advancements in deep learning have led to widespread use of techniques for audio content generation, notably employing Denoising Diffusion Probabilistic Models (DDPM) across various tasks. Among these, Foley Sound Synthesis is of particular interest for its role in multimedia content creation. Given the temporal-dependent nature of sound, it is crucial to design generative models that can effectively handle the sequential modeling of audio samples. Selective State Space Models (SSMs) have recently emerged as a valid alternative to earlier sequence-modeling techniques, demonstrating competitive performance with lower computational complexity. In this paper, we introduce MambaFoley, a diffusion-based model that, to the best of our knowledge, is the first to leverage the recently proposed SSM known as Mamba for the Foley sound generation task. To evaluate the effectiveness of the proposed method, we compare it with a state-of-the-art Foley sound generative model using both objective and subjective analyses.
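The DDPM framework referenced in the abstract rests on a closed-form forward noising process: a waveform x₀ can be jumped directly to any diffusion step t, and the denoiser is trained to predict the injected noise. A minimal sketch (generic linear noise schedule; not the paper's configuration):

```python
import numpy as np

def ddpm_forward(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.
    A conditional denoiser would be trained to recover eps from (x_t, t, label)."""
    alpha_bar = np.cumprod(1.0 - betas)[t]   # cumulative product of (1 - beta)
    eps = rng.standard_normal(x0.shape)      # Gaussian noise to inject
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)        # common linear schedule
x0 = rng.standard_normal(16000)              # stand-in for a 1 s audio waveform
x_t, eps = ddpm_forward(x0, 999, betas, rng) # at the last step, x_t is near-pure noise
```

At the final step alpha_bar is tiny, so x_t is essentially the noise sample; generation then runs this process in reverse, guided by the conditioning signal.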