LAMB: LLM-based Audio Captioning with Modality Gap Bridging via Cauchy-Schwarz Divergence

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limited semantic description capability of large language models (LLMs) in audio captioning tasks, which stems from insufficient modality alignment between audio features and the LLM text embedding space. To overcome this, the authors propose the LAMB framework, which achieves both global and token-level cross-modal alignment through a Cross-Modal Aligner and a Two-Stream Adapter, jointly optimizing Cauchy-Schwarz divergence minimization and mutual information maximization. Additionally, a Token Guide module is introduced to directly steer the generation of more accurate audio descriptions within the LLM's embedding space. Experimental results demonstrate that LAMB achieves state-of-the-art performance on the AudioCaps dataset, significantly enhancing the LLM's reasoning and generative capabilities for audio description.

📝 Abstract
Automated Audio Captioning aims to describe the semantic content of input audio. Recent works have employed large language models (LLMs) as a text decoder to leverage their reasoning capabilities. However, prior approaches that project audio features into the LLM embedding space without considering cross-modal alignment fail to fully utilize these capabilities. To address this, we propose LAMB, an LLM-based audio captioning framework that bridges the modality gap between audio embeddings and the LLM text embedding space. LAMB incorporates a Cross-Modal Aligner that minimizes Cauchy-Schwarz divergence while maximizing mutual information, yielding tighter alignment between audio and text at both global and token levels. We further design a Two-Stream Adapter that extracts semantically enriched audio embeddings, thereby delivering richer information to the Cross-Modal Aligner. Finally, leveraging the aligned audio embeddings, a proposed Token Guide directly computes scores within the LLM text embedding space to steer the output logits of generated captions. Experimental results confirm that our framework strengthens the reasoning capabilities of the LLM decoder, achieving state-of-the-art performance on AudioCaps.
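The abstract's central objective is minimizing the Cauchy-Schwarz (CS) divergence between audio embeddings and LLM text embeddings. The paper's entry here gives no implementation details, but the CS divergence between two sample sets is commonly estimated with Gaussian kernels as D_CS = log(mean k(x,x)) + log(mean k(y,y)) − 2·log(mean k(x,y)). The sketch below is a minimal, hypothetical illustration of that standard estimator on random stand-in embeddings; function names, the bandwidth `sigma`, and the synthetic data are all assumptions, not the authors' code.

```python
import numpy as np

def gaussian_gram(a, b, sigma=1.0):
    """Pairwise Gaussian kernel matrix: k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-d2 / (2.0 * sigma**2))

def cs_divergence(x, y, sigma=1.0):
    """Empirical Cauchy-Schwarz divergence between sample sets x (n,d) and y (m,d).

    D_CS = log mean k(x,x) + log mean k(y,y) - 2 log mean k(x,y).
    Non-negative, and zero when the two kernel density estimates coincide,
    so minimizing it pulls the two embedding distributions together.
    """
    kxx = gaussian_gram(x, x, sigma).mean()
    kyy = gaussian_gram(y, y, sigma).mean()
    kxy = gaussian_gram(x, y, sigma).mean()
    return np.log(kxx) + np.log(kyy) - 2.0 * np.log(kxy)

rng = np.random.default_rng(0)
audio = rng.normal(0.0, 1.0, size=(64, 16))            # stand-in audio token embeddings
text_near = audio + 0.1 * rng.normal(size=(64, 16))    # text embeddings already close to audio
text_far = rng.normal(3.0, 1.0, size=(64, 16))         # text embeddings with a large modality gap

print(cs_divergence(audio, text_near))  # small: distributions nearly aligned
print(cs_divergence(audio, text_far))   # large: distributions far apart
```

In LAMB this quantity would serve as a training loss over batches of audio and text token embeddings; the toy data above only demonstrates that the estimator shrinks as the two sets align.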
Problem

Research questions and friction points this paper is trying to address.

Audio Captioning
Large Language Models
Modality Gap
Cross-Modal Alignment
Cauchy-Schwarz Divergence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Modal Alignment
Cauchy-Schwarz Divergence
Large Language Models
Audio Captioning
Modality Gap Bridging
Hyeongkeun Lee
KAIST
Deep Learning · Multimodal Learning · Video Understanding
Jongmin Choi
Korea Advanced Institute of Science and Technology, South Korea
Kihyun Nam
Korea Advanced Institute of Science and Technology, South Korea
Joon Son Chung
KAIST
Machine Learning · Speech Processing · Computer Vision