🤖 AI Summary
This work addresses model redundancy, excessive memory consumption, and slow inference in open-domain vision-guided audio generation. We propose the first lightweight end-to-end masked diffusion Transformer framework for this task. Our method introduces a redundant video feature pruning module and a temporal-aware masking strategy, replacing the conventional U-Net architecture and removing the need for a pretrained diffusion model. It combines compact video feature encoding, temporal-context-driven denoising, and end-to-end optimization without pretraining. Experiments demonstrate that our 5M-parameter model achieves 97.9% audio–visual alignment accuracy while using 172× fewer parameters, 371% less memory, and 36× faster inference than the prior 860M-parameter state of the art (93.9% accuracy). Our larger 131M-parameter model reaches nearly 99% accuracy while still requiring 6.5× fewer parameters than that baseline. The proposed framework significantly improves computational efficiency and deployment feasibility while maintaining state-of-the-art alignment fidelity.
📝 Abstract
We introduce MDSGen, a novel framework for vision-guided open-domain sound generation optimized for model parameter size, memory consumption, and inference speed. This framework incorporates two key innovations: (1) a redundant video feature removal module that filters out unnecessary visual information, and (2) a temporal-aware masking strategy that leverages temporal context for enhanced audio generation accuracy. In contrast to existing resource-heavy U-Net-based models, MDSGen employs denoising masked diffusion transformers, facilitating efficient generation without reliance on pre-trained diffusion models. Evaluated on the benchmark VGGSound dataset, our smallest model (5M parameters) achieves 97.9% alignment accuracy, using 172× fewer parameters, 371% less memory, and offering 36× faster inference than the current 860M-parameter state-of-the-art model (93.9% accuracy). The larger model (131M parameters) reaches nearly 99% accuracy while requiring 6.5× fewer parameters. These results highlight the scalability and effectiveness of our approach. The code is available at https://bit.ly/mdsgen.
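The temporal-aware masking idea can be sketched as follows. This is a minimal illustration of masking along the time axis of an audio-latent token grid, not the paper's implementation; the function name, grid layout, and mask ratio are all assumptions. Instead of hiding spectrogram-latent tokens uniformly at random, whole time steps are masked together, so the tokens the denoiser does see retain complete frequency context at each kept step.

```python
import numpy as np

def temporal_aware_mask(num_time, num_freq, mask_ratio, seed=0):
    """Build a boolean mask over a (time x freq) grid of audio latent
    tokens, hiding entire time steps rather than individual tokens.
    True marks a token withheld from the denoiser during training."""
    rng = np.random.default_rng(seed)
    n_masked = int(round(mask_ratio * num_time))
    # Sample which time steps to hide, without replacement.
    masked_steps = rng.choice(num_time, size=n_masked, replace=False)
    mask = np.zeros((num_time, num_freq), dtype=bool)
    mask[masked_steps] = True  # mask every frequency bin at those steps
    return mask

# A 16-step x 8-bin token grid with half the time steps hidden:
mask = temporal_aware_mask(16, 8, mask_ratio=0.5)
print(mask.shape, mask.mean())  # (16, 8) 0.5
```

Each row of the resulting mask is either fully hidden or fully visible, which is what distinguishes this scheme from token-uniform random masking.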