LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport

📅 2025-01-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing audio-visual captioning methods suffer from insufficient cross-modal fusion and often overlook fine-grained semantic details. This paper proposes LAVCap, a large language model (LLM)-based audio-visual captioning framework that leverages visual information to enhance audio semantic understanding. Its core contributions are: (i) an alignment loss grounded in optimal transport theory, enabling fine-grained semantic matching between audio and visual features; and (ii) a learnable optimal transport attention module that supports dynamic cross-modal feature weighting and fusion. The method requires neither large-scale pretraining nor post-processing. Evaluated on the AudioCaps benchmark, it outperforms existing state-of-the-art approaches, with notable improvements in caption accuracy, lexical richness, and contextual coherence.
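Neither the summary nor the abstract spells out the alignment loss in code, but the underlying idea can be illustrated. Below is a minimal PyTorch sketch, assuming an entropy-regularized (Sinkhorn) optimal transport formulation with cosine-distance costs and uniform marginals over the audio and visual token sets; the function names (`sinkhorn`, `ot_alignment_loss`) and hyperparameters (`eps`, `n_iters`) are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sinkhorn(cost, n_iters=20, eps=0.1):
    """Entropy-regularized OT: compute a soft transport plan for a cost matrix.

    cost: (B, Na, Nv) pairwise costs between audio and visual tokens.
    Returns a plan of the same shape whose rows/columns respect the marginals.
    """
    B, Na, Nv = cost.shape
    # Assumption: uniform mass over the tokens of each modality.
    mu = torch.full((B, Na), 1.0 / Na, device=cost.device)
    nu = torch.full((B, Nv), 1.0 / Nv, device=cost.device)
    K = torch.exp(-cost / eps)  # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):
        v = nu / (torch.einsum('bnm,bn->bm', K, u) + 1e-8)  # K^T u
        u = mu / (torch.einsum('bnm,bm->bn', K, v) + 1e-8)  # K v
    # Transport plan T = diag(u) K diag(v)
    return u.unsqueeze(-1) * K * v.unsqueeze(1)

def ot_alignment_loss(audio, visual):
    """Alignment loss as the transport cost between the two token sets.

    audio: (B, Na, D) audio features; visual: (B, Nv, D) visual features.
    """
    a = F.normalize(audio, dim=-1)
    v = F.normalize(visual, dim=-1)
    cost = 1.0 - torch.einsum('bnd,bmd->bnm', a, v)  # cosine distance
    plan = sinkhorn(cost)
    return (plan * cost).sum(dim=(1, 2)).mean()
```

Minimizing this quantity pushes the two feature sets toward a low-cost matching, which is one standard way an OT-based loss can "bridge the modality gap" described above.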

📝 Abstract
Automated audio captioning is a task that generates textual descriptions for audio content, and recent studies have explored using visual information to enhance captioning quality. However, current methods often fail to effectively fuse audio and visual data, missing important semantic cues from each modality. To address this, we introduce LAVCap, a large language model (LLM)-based audio-visual captioning framework that effectively integrates visual information with audio to improve audio captioning performance. LAVCap employs an optimal transport-based alignment loss to bridge the modality gap between audio and visual features, enabling more effective semantic extraction. Additionally, we propose an optimal transport attention module that enhances audio-visual fusion using an optimal transport assignment map. Combined with the optimal training strategy, experimental results demonstrate that each component of our framework is effective. LAVCap outperforms existing state-of-the-art methods on the AudioCaps dataset, without relying on large datasets or post-processing. Code is available at https://github.com/NAVER-INTEL-Co-Lab/gaudi-lavcap.
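The abstract also mentions an optimal transport attention module that fuses the modalities via an OT assignment map. The sketch below shows one plausible reading, assuming the Sinkhorn transport plan replaces the usual softmax attention map when aggregating visual features into the audio stream. It reuses the `sinkhorn` helper from the previous snippet; the class name `OTAttention`, the linear projections, and the residual fusion are assumptions, not the released code (see the GitHub link above for the actual implementation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OTAttention(nn.Module):
    """Cross-modal attention weighted by an OT assignment map (sketch).

    Attention weights come from a Sinkhorn transport plan between audio
    queries and visual keys rather than from a softmax over similarities.
    """
    def __init__(self, dim, eps=0.1, n_iters=20):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.eps = eps
        self.n_iters = n_iters

    def forward(self, audio, visual):
        # audio: (B, Na, D), visual: (B, Nv, D)
        q = F.normalize(self.q(audio), dim=-1)
        k = F.normalize(self.k(visual), dim=-1)
        cost = 1.0 - torch.einsum('bnd,bmd->bnm', q, k)
        # sinkhorn(...) is the helper defined in the previous sketch.
        plan = sinkhorn(cost, self.n_iters, self.eps)  # (B, Na, Nv)
        # Row-normalize so each audio token receives a convex combination
        # of visual values, then fuse with a residual connection.
        attn = plan / (plan.sum(dim=-1, keepdim=True) + 1e-8)
        return audio + attn @ self.v(visual)
```

A design note on this reading: because the transport plan respects marginal constraints, visual tokens cannot be ignored or over-attended wholesale the way a plain softmax allows, which is one interpretation of the "dynamic cross-modal feature weighting" claimed in the summary.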
Problem

Research questions and friction points this paper is trying to address.

Audio-Visual Fusion
Caption Generation
Information Retention
Innovation

Methods, ideas, or system contributions that make the work stand out.

LAVCap
Optimal Transport-based Audio-Visual Captioning
Large Language Models