MDSGen: Fast and Efficient Masked Diffusion Temporal-Aware Transformers for Open-Domain Sound Generation

📅 2024-10-03
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses model redundancy, excessive memory consumption, and slow inference in open-domain vision-guided audio generation. It proposes a lightweight end-to-end masked diffusion Transformer framework that replaces the conventional U-Net backbone and requires no pretrained diffusion model. Two components drive the design: a redundant video feature removal module that compacts the visual conditioning, and a temporal-aware masking strategy that exploits temporal context during denoising. On the VGGSound benchmark, the smallest 5M-parameter model achieves 97.9% audio–visual alignment accuracy while using 172× fewer parameters, 371% less memory, and 36× faster inference than the 860M-parameter prior state of the art (93.9% accuracy). The larger 131M-parameter model reaches nearly 99% accuracy with 6.5× fewer parameters than that baseline. The framework substantially improves computational efficiency and deployment feasibility while maintaining state-of-the-art alignment fidelity.

📝 Abstract
We introduce MDSGen, a novel framework for vision-guided open-domain sound generation optimized for model parameter size, memory consumption, and inference speed. This framework incorporates two key innovations: (1) a redundant video feature removal module that filters out unnecessary visual information, and (2) a temporal-aware masking strategy that leverages temporal context for enhanced audio generation accuracy. In contrast to existing resource-heavy Unet-based models, MDSGen employs denoising masked diffusion transformers, facilitating efficient generation without reliance on pre-trained diffusion models. Evaluated on the benchmark VGGSound dataset, our smallest model (5M parameters) achieves 97.9% alignment accuracy, using 172× fewer parameters, 371% less memory, and offering 36× faster inference than the current 860M-parameter state-of-the-art model (93.9% accuracy). The larger model (131M parameters) reaches nearly 99% accuracy while requiring 6.5× fewer parameters. These results highlight the scalability and effectiveness of our approach. The code is available at https://bit.ly/mdsgen.
Problem

Research questions and friction points this paper is trying to address.

Optimize vision-guided sound generation
Reduce model size and memory usage
Enhance inference speed and accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked diffusion transformers
Redundant video feature removal
Temporal-aware masking strategy