🤖 AI Summary
Addressing the challenges of long-range temporal modeling and high computational cost in Foley sound synthesis, this paper proposes what is, to the authors' knowledge, the first conditional diffusion audio synthesis framework based on the Mamba selective state space model (SSM). Unlike conventional Transformer- or U-Net-based architectures, the approach integrates Mamba into Foley generation, leveraging its linear-time complexity and strong sequence-modeling capability to build an efficient, high-fidelity conditional denoising diffusion probabilistic model (DDPM). Quantitatively, the method improves significantly over state-of-the-art (SOTA) baselines on objective metrics (including STFT-L1 and Fréchet Audio Distance, FAD), attains a notably higher mean opinion score (MOS) in subjective evaluation, and accelerates inference by 2.3×. The work empirically validates the effectiveness and practicality of SSMs for professional-grade audio generation and establishes a new paradigm for long-horizon conditional audio synthesis.
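The linear-time property highlighted above comes from Mamba's selective scan: the SSM parameters are functions of the input at each timestep, yet the state update remains a constant-cost recurrence. A minimal single-channel sketch in NumPy (illustrative only; all weight names here are assumptions, not the paper's implementation):

```python
import numpy as np

def selective_ssm_scan(x, W_dt, W_B, W_C, A):
    """Toy 1-channel selective SSM: the step size and the input/output
    matrices depend on the current input x_t ('selection'), while each
    state update costs O(d_state), giving linear time in sequence length."""
    T, d_state = len(x), A.shape[0]
    h = np.zeros(d_state)
    y = np.empty(T)
    for t in range(T):
        dt = np.log1p(np.exp(W_dt * x[t]))   # softplus: positive step size
        A_bar = np.exp(dt * A)               # discretized decay (A < 0)
        B_t = W_B * x[t]                     # input-dependent input matrix
        C_t = W_C * x[t]                     # input-dependent output matrix
        h = A_bar * h + dt * B_t * x[t]      # recurrent state update
        y[t] = C_t @ h                       # scalar readout per timestep
    return y

rng = np.random.default_rng(0)
d_state = 4
y = selective_ssm_scan(rng.standard_normal(16),
                       W_dt=0.5,
                       W_B=rng.standard_normal(d_state),
                       W_C=rng.standard_normal(d_state),
                       A=-np.abs(rng.standard_normal(d_state)))
```

In the real model this scan runs over thousands of channels in parallel with a hardware-aware kernel, which is what makes it attractive for long audio sequences.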
📝 Abstract
Recent advancements in deep learning have led to widespread use of techniques for audio content generation, notably employing Denoising Diffusion Probabilistic Models (DDPM) across various tasks. Among these, Foley Sound Synthesis is of particular interest for its role in multimedia content creation. Given the temporal-dependent nature of sound, it is crucial to design generative models that can effectively handle the sequential modeling of audio samples. Selective State Space Models (SSMs) have recently emerged as a valid alternative to earlier sequence-modeling techniques, demonstrating competitive performance with lower computational complexity. In this paper, we introduce MambaFoley, a diffusion-based model that, to the best of our knowledge, is the first to leverage the recently proposed SSM known as Mamba for the Foley sound generation task. To evaluate the effectiveness of the proposed method, we compare it with a state-of-the-art Foley sound generative model using both objective and subjective analyses.
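The DDPM framework referenced in the abstract rests on a closed-form forward noising process: a waveform x₀ can be jumped directly to any diffusion step t, and the denoiser is trained to predict the injected noise. A minimal sketch (generic linear noise schedule; not the paper's configuration):

```python
import numpy as np

def ddpm_forward(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.
    A conditional denoiser would be trained to recover eps from (x_t, t, label)."""
    alpha_bar = np.cumprod(1.0 - betas)[t]   # cumulative product of (1 - beta)
    eps = rng.standard_normal(x0.shape)      # Gaussian noise to inject
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)        # common linear schedule
x0 = rng.standard_normal(16000)              # stand-in for a 1 s audio waveform
x_t, eps = ddpm_forward(x0, 999, betas, rng) # at the last step, x_t is near-pure noise
```

At the final step alpha_bar is tiny, so x_t is essentially the noise sample; generation then runs this process in reverse, guided by the conditioning signal.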