AC/DC: LLM-based Audio Comprehension via Dialogue Continuation

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Audio captioning suffers from weak model generalization because reference captions for the same clip vary widely, which acts as annotation noise and raises overfitting risk in conventional end-to-end text generation. Method: This paper proposes an instruction-following paradigm that leverages the dialogue continuation capability of large language models (LLMs), reformulating audio captioning as a conversational response task rather than direct sequence generation. Through LLM-driven audio-text alignment and dialogue-style supervised learning, the approach models a caption's semantic meaning instead of its surface form. Contribution/Results: The method introduces a training paradigm that replaces direct caption generation with dialogue continuation, eliminating the need for multi-task instruction fine-tuning while generalizing to unseen instruction types. Experiments on AudioCaps, WavCaps, and Clotho, evaluated with AudioBench audio-scene question answering, demonstrate improved zero-shot instruction-following accuracy and strong cross-task, cross-instruction generalization.

📝 Abstract
We propose an instruction-following audio comprehension model that leverages the dialogue continuation ability of large language models (LLMs). Instead of directly generating target captions in training data, the proposed method trains a model to produce responses as if the input caption triggered a dialogue. This dialogue continuation training mitigates the caption variation problem. Learning to continue a dialogue effectively captures the caption's meaning beyond its surface-level words. As a result, our model enables zero-shot instruction-following capability without multitask instruction tuning, even trained solely on audio captioning datasets. Experiments on AudioCaps, WavCaps, and Clotho datasets with AudioBench audio-scene question-answering tests demonstrate our model's ability to follow various unseen instructions.
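The abstract describes the core training idea: instead of supervising the model to emit the target caption directly, the caption is treated as a user utterance that opens a dialogue, and the supervision target is the response that utterance would elicit. A minimal sketch of how such a training record might be constructed is below; the names (`build_example`, `AUDIO_TOKEN`) and the record layout are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of dialogue-continuation data construction,
# assuming an <audio> placeholder token where audio embeddings are
# injected. The paper uses an LLM to generate the continuation; here
# a fixed string stands in for it to show the record structure only.
AUDIO_TOKEN = "<audio>"  # assumed placeholder, not from the paper


def build_example(caption: str, continuation: str) -> dict:
    """Turn a (caption, continuation) pair into a training record.

    The caption is framed as the user turn of a dialogue; the model is
    trained on the conversational response rather than the caption
    itself, so supervision targets meaning, not surface wording.
    """
    return {
        "prompt": f"{AUDIO_TOKEN} {caption}",
        "target": continuation,
    }


example = build_example(
    "A dog barks while a car passes by.",
    "Sounds like a busy street; the dog seems alert to the traffic.",
)
```

Because the target is a free-form response rather than one canonical caption, the caption-variation problem described in the abstract is sidestepped: any caption conveying the same meaning tends to elicit a similar continuation.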
Problem

Research questions and friction points this paper is trying to address.

Reference captions for the same audio clip vary widely, acting as annotation noise
Conventional end-to-end caption generation overfits to surface-level wording
Instruction-following audio models typically require costly multitask instruction tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulates audio captioning as LLM dialogue continuation
Trains the model to respond as if the input caption opened a dialogue, capturing meaning beyond surface wording
Achieves zero-shot instruction-following without multitask instruction tuning, despite training only on captioning data