Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing contrastive audio-text pretraining methods struggle to bridge the modality gap between audio and text representations, hindering effective integration with large language models (LLMs). To address this, we propose Diffusion-Link, the first lightweight bridging module that introduces diffusion probabilistic modeling into audio-text alignment. Implemented as three residual MLP blocks, it generatively maps frozen audio embeddings onto the distribution of text embeddings, without external knowledge or fine-tuning of the multimodal encoders, and substantially narrows the cross-modal representation gap. On AudioCaps, it achieves relative improvements of 52.5% (zero-shot) and 7.5% (fully supervised) over prior work, establishing new state-of-the-art results for automatic audio captioning. This work pioneers a generative paradigm for modality bridging, offering a scalable, low-overhead alignment pathway between multimodal encoders and LLMs.

📝 Abstract
Contrastive audio-language pretraining yields powerful joint representations, yet a persistent audio-text modality gap limits the benefits of coupling multimodal encoders with large language models (LLMs). We present Diffusion-Link, a diffusion-based modality-bridging module that generatively maps audio embeddings into the text-embedding distribution. The module is trained on the output embeddings of the frozen multimodal encoder and implemented as a lightweight network with three residual MLP blocks. To assess the effect of Diffusion-Link on multimodal encoder-LLM coupling, we evaluate on Automatic Audio Captioning (AAC); to our knowledge, this is the first application of diffusion-based modality bridging to AAC. We report two results. (1) Modality-gap analysis: on similarity and geometric criteria, Diffusion-Link reduces the modality gap the most among prior diffusion-based methods and shows a collective migration of audio embeddings toward the text distribution. (2) Downstream AAC: attaching Diffusion-Link to the same multimodal LLM baseline achieves state-of-the-art on AudioCaps in both zero-shot and fully supervised captioning without external knowledge, with relative gains up to 52.5% and 7.5%, respectively. These findings show that closing the modality gap is pivotal for effective coupling between multimodal encoders and LLMs, and diffusion-based modality bridging offers a promising direction beyond knowledge-retrieval-centric designs. Code will be released upon acceptance: https://github.com/DevKiHyun/Diffusion-Link
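The abstract's modality-gap analysis refers to similarity and geometric criteria. As an illustrative sketch only (not the authors' evaluation code; the embeddings, dimensions, and distributions here are all hypothetical), two commonly used measures are the Euclidean distance between per-modality centroids and the mean cosine similarity of paired embeddings:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical L2-normalized embeddings from a frozen contrastive encoder;
# the text cloud is shifted to mimic a modality gap.
audio = rng.normal(size=(100, 8))
audio /= np.linalg.norm(audio, axis=1, keepdims=True)
text = rng.normal(loc=0.5, size=(100, 8))
text /= np.linalg.norm(text, axis=1, keepdims=True)

# Geometric criterion: distance between the per-modality centroids.
gap = float(np.linalg.norm(audio.mean(axis=0) - text.mean(axis=0)))

# Similarity criterion: mean cosine similarity between paired embeddings
# (rows are unit-norm, so the dot product is the cosine).
mean_cos = float(np.mean(np.sum(audio * text, axis=1)))

print(f"centroid gap: {gap:.3f}, mean pairwise cosine: {mean_cos:.3f}")
```

A bridging module that works would shrink the centroid gap and raise the paired cosine similarity after mapping the audio embeddings.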
Problem

Research questions and friction points this paper is trying to address.

Bridging the audio-text modality gap in multimodal representations
Improving coupling between multimodal encoders and large language models
Enhancing automatic audio captioning through diffusion-based modality bridging
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion model bridges audio-text modality gap
Lightweight network with three residual MLP blocks
Generatively maps audio embeddings into text distribution
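The design above can be sketched in miniature: a three-block residual MLP acting as the denoiser inside a DDPM-style reverse process that starts from an audio embedding and iterates toward the text-embedding distribution. This is a toy illustration under stated assumptions, not the released Diffusion-Link code; the dimensions, step count, timestep conditioning, and random weights are all hypothetical, and the reverse-step noise term is omitted for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8      # embedding dimension (illustrative; real encoders are larger)
STEPS = 10   # number of diffusion steps (hypothetical)

def residual_mlp_block(x, w1, w2):
    """One residual MLP block: x + MLP(x), with tanh as a stand-in activation."""
    h = np.tanh(x @ w1)
    return x + h @ w2

# Three residual MLP blocks, as the paper describes (weights random here).
weights = [(rng.normal(scale=0.1, size=(DIM, DIM)),
            rng.normal(scale=0.1, size=(DIM, DIM))) for _ in range(3)]

def denoiser(x_t, t):
    """Predict the noise in the current embedding; the timestep is folded in
    as a crude additive conditioning signal (a simplification)."""
    x = x_t + t / STEPS
    for w1, w2 in weights:
        x = residual_mlp_block(x, w1, w2)
    return x

# Standard DDPM noise schedule.
betas = np.linspace(1e-4, 0.02, STEPS)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def bridge(audio_embedding):
    """Reverse process: denoise an audio embedding toward the text distribution."""
    x = audio_embedding.copy()
    for t in reversed(range(STEPS)):
        eps = denoiser(x, t)
        # Posterior mean of the reverse step (stochastic noise term omitted).
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    return x

audio_emb = rng.normal(size=DIM)
text_like = bridge(audio_emb)
print(text_like.shape)  # (8,)
```

In a real system the output `text_like` would be handed to the LLM in place of the raw audio embedding; here the point is only the shape of the computation, a frozen-encoder embedding passed through a lightweight learned reverse diffusion.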