Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing contrastive audio-text pretraining methods struggle to bridge the modality gap between audio and text representations, hindering effective integration with large language models (LLMs). To address this, we propose Diffusion-Link, the first lightweight bridging module that introduces diffusion probabilistic modeling into audio-text alignment. Implemented as three residual MLP blocks, it generatively maps frozen audio embeddings onto the distribution of text embeddings, without external knowledge or fine-tuning of the multimodal encoders, and substantially narrows the cross-modal representation gap. On AudioCaps, it achieves relative improvements of 52.5% (zero-shot) and 7.5% (fully supervised) over prior work, establishing new state-of-the-art results for automatic audio captioning. This work pioneers a generative paradigm for modality bridging, offering a scalable, low-overhead alignment pathway between multimodal encoders and LLMs.

📝 Abstract
Contrastive audio-language pretraining yields powerful joint representations, yet a persistent audio-text modality gap limits the benefits of coupling multimodal encoders with large language models (LLMs). We present Diffusion-Link, a diffusion-based modality-bridging module that generatively maps audio embeddings into the text-embedding distribution. The module is trained on the output embeddings of the frozen multimodal encoder and implemented as a lightweight network with three residual MLP blocks. To assess the effect of Diffusion-Link on multimodal encoder-LLM coupling, we evaluate on Automatic Audio Captioning (AAC); to our knowledge, this is the first application of diffusion-based modality bridging to AAC. We report two results. (1) Modality-gap analysis: on similarity and geometric criteria, Diffusion-Link reduces the modality gap the most among prior diffusion-based methods and shows a collective migration of audio embeddings toward the text distribution. (2) Downstream AAC: attaching Diffusion-Link to the same multimodal LLM baseline achieves state-of-the-art on AudioCaps in both zero-shot and fully supervised captioning without external knowledge, with relative gains up to 52.5% and 7.5%, respectively. These findings show that closing the modality gap is pivotal for effective coupling between multimodal encoders and LLMs, and diffusion-based modality bridging offers a promising direction beyond knowledge-retrieval-centric designs. Code will be released upon acceptance: https://github.com/DevKiHyun/Diffusion-Link
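The abstract's modality-gap analysis refers to similarity and geometric criteria. As an illustrative sketch only (not the authors' evaluation code; the embeddings, dimensions, and distributions here are all hypothetical), two commonly used measures are the Euclidean distance between per-modality centroids and the mean cosine similarity of paired embeddings:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical L2-normalized embeddings from a frozen contrastive encoder;
# the text cloud is shifted to mimic a modality gap.
audio = rng.normal(size=(100, 8))
audio /= np.linalg.norm(audio, axis=1, keepdims=True)
text = rng.normal(loc=0.5, size=(100, 8))
text /= np.linalg.norm(text, axis=1, keepdims=True)

# Geometric criterion: distance between the per-modality centroids.
gap = float(np.linalg.norm(audio.mean(axis=0) - text.mean(axis=0)))

# Similarity criterion: mean cosine similarity between paired embeddings
# (rows are unit-norm, so the dot product is the cosine).
mean_cos = float(np.mean(np.sum(audio * text, axis=1)))

print(f"centroid gap: {gap:.3f}, mean pairwise cosine: {mean_cos:.3f}")
```

A bridging module that works would shrink the centroid gap and raise the paired cosine similarity after mapping the audio embeddings.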
Problem

Research questions and friction points this paper is trying to address.

Bridging the audio-text modality gap in multimodal representations
Improving coupling between multimodal encoders and large language models
Enhancing automatic audio captioning through diffusion-based modality bridging
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion model bridges audio-text modality gap
Lightweight network with three residual MLP blocks
Generatively maps audio embeddings into text distribution
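The design above can be sketched in miniature: a three-block residual MLP acting as the denoiser inside a DDPM-style reverse process that starts from an audio embedding and iterates toward the text-embedding distribution. This is a toy illustration under stated assumptions, not the released Diffusion-Link code; the dimensions, step count, timestep conditioning, and random weights are all hypothetical, and the reverse-step noise term is omitted for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8      # embedding dimension (illustrative; real encoders are larger)
STEPS = 10   # number of diffusion steps (hypothetical)

def residual_mlp_block(x, w1, w2):
    """One residual MLP block: x + MLP(x), with tanh as a stand-in activation."""
    h = np.tanh(x @ w1)
    return x + h @ w2

# Three residual MLP blocks, as the paper describes (weights random here).
weights = [(rng.normal(scale=0.1, size=(DIM, DIM)),
            rng.normal(scale=0.1, size=(DIM, DIM))) for _ in range(3)]

def denoiser(x_t, t):
    """Predict the noise in the current embedding; the timestep is folded in
    as a crude additive conditioning signal (a simplification)."""
    x = x_t + t / STEPS
    for w1, w2 in weights:
        x = residual_mlp_block(x, w1, w2)
    return x

# Standard DDPM noise schedule.
betas = np.linspace(1e-4, 0.02, STEPS)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def bridge(audio_embedding):
    """Reverse process: denoise an audio embedding toward the text distribution."""
    x = audio_embedding.copy()
    for t in reversed(range(STEPS)):
        eps = denoiser(x, t)
        # Posterior mean of the reverse step (stochastic noise term omitted).
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    return x

audio_emb = rng.normal(size=DIM)
text_like = bridge(audio_emb)
print(text_like.shape)  # (8,)
```

In a real system the output `text_like` would be handed to the LLM in place of the raw audio embedding; here the point is only the shape of the computation, a frozen-encoder embedding passed through a lightweight learned reverse diffusion.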