Aligned Better, Listen Better for Audio-Visual Large Language Models

📅 2025-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-visual large language models (AV-LLMs) exploit audio only weakly, which leads to modality hallucination and cross-modal inconsistency. To address this, the paper proposes Dolphin, a fine-grained audio-visual co-alignment architecture, together with AVU, an open-domain, question-answering-style audio-visual instruction dataset of 5.2 million samples. Methodologically, Dolphin introduces a multi-scale audio-visual adapter for spatial alignment, an interleaved audio-visual merging mechanism for temporal alignment, and a unified encoding of (video, audio, question) inputs. Extensive experiments show that Dolphin achieves state-of-the-art results on multiple audio-visual understanding benchmarks, markedly improving factual accuracy and cross-modal consistency while mitigating modality hallucination.
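To make the spatial-alignment idea concrete, here is a minimal PyTorch sketch of a multi-scale audio-visual adapter, assuming audio tokens cross-attend to visual patch tokens pooled at several spatial scales. The class name, scale choices, and dimensions are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch (not the authors' code): visual patch tokens are
# pooled at several spatial scales, audio tokens attend to each scale,
# and the per-scale results are fused into one audio-aligned stream.
import torch
import torch.nn as nn


class MultiScaleAVAdapter(nn.Module):
    def __init__(self, dim: int = 768, scales=(1, 2, 4), heads: int = 8):
        super().__init__()
        self.scales = scales
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in scales
        )
        self.fuse = nn.Linear(dim * len(scales), dim)

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        # vis: (B, H*W, D) visual patch tokens; aud: (B, T, D) audio tokens.
        B, N, D = vis.shape
        side = int(N ** 0.5)  # assume a square patch grid
        grid = vis.transpose(1, 2).reshape(B, D, side, side)
        outs = []
        for s, attn in zip(self.scales, self.attn):
            pooled = nn.functional.adaptive_avg_pool2d(grid, s)  # (B, D, s, s)
            kv = pooled.flatten(2).transpose(1, 2)               # (B, s*s, D)
            out, _ = attn(aud, kv, kv)                           # audio queries vision
            outs.append(out)
        return self.fuse(torch.cat(outs, dim=-1))                # (B, T, D)
```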

📝 Abstract
Audio is essential for multimodal video understanding. On the one hand, video inherently contains audio, which supplies information complementary to vision; on the other hand, video large language models (Video-LLMs) encounter many audio-centric settings. However, existing Video-LLMs and Audio-Visual Large Language Models (AV-LLMs) exhibit deficiencies in exploiting audio information, leading to weak understanding and hallucinations. To address these issues, we examine both the model architecture and the dataset. (1) From the architectural perspective, we propose a fine-grained AV-LLM, namely Dolphin. Concurrently aligning the audio and visual modalities in both the temporal and spatial dimensions ensures a comprehensive and accurate understanding of videos. Specifically, we devise an audio-visual multi-scale adapter that aggregates multi-scale information to achieve spatial alignment, and we propose audio-visual interleaved merging for temporal alignment. (2) From the dataset perspective, we curate an audio-visual caption and instruction-tuning dataset, called AVU. It comprises 5.2 million diverse, open-ended data tuples (video, audio, question, answer) and introduces a novel data partitioning strategy. Extensive experiments show that our model not only achieves remarkable performance in audio-visual understanding but also mitigates potential hallucinations.
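The temporal-alignment side can be pictured with the hedged sketch below: audio-visual interleaved merging that splits both token streams into the same number of temporal segments and alternates them. The chunking scheme and token counts are assumptions, not the paper's exact recipe.

```python
# Hedged sketch of audio-visual interleaved merging for temporal
# alignment; segment counts and layout are illustrative assumptions.
import torch


def interleave_av(vis: torch.Tensor, aud: torch.Tensor, segments: int) -> torch.Tensor:
    """Interleave visual and audio tokens segment by segment.

    vis: (B, Nv, D) visual tokens; aud: (B, Na, D) audio tokens.
    Nv and Na must each be divisible by `segments`.
    """
    B, Nv, D = vis.shape
    Na = aud.shape[1]
    v_chunks = vis.reshape(B, segments, Nv // segments, D)
    a_chunks = aud.reshape(B, segments, Na // segments, D)
    # [v_1, a_1, v_2, a_2, ...] keeps co-occurring tokens adjacent in time.
    merged = torch.cat([v_chunks, a_chunks], dim=2)
    return merged.reshape(B, segments * (Nv // segments + Na // segments), D)
```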
Problem

Research questions and friction points this paper is trying to address.

Enhance audio-visual alignment in multimodal video understanding
Address weak audio exploitation in Video-LLMs and AV-LLMs
Mitigate hallucinations in audio-visual large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained AV-LLM with multi-scale adapter
Audio-visual interleaved merging for alignment
Curated AVU dataset of 5.2M diverse, open-ended tuples (see the sketch after this list)
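As a concrete picture of the dataset format, the following sketch shows one plausible record layout for an AVU-style (video, audio, question, answer) tuple. All field names and values are hypothetical, based only on the abstract's description rather than the released schema.

```python
# Illustrative record layout for one AVU-style tuple; every field name
# and value here is an assumption, not the actual dataset schema.
from dataclasses import dataclass


@dataclass
class AVUSample:
    video_path: str    # path to the video clip
    audio_path: str    # path to the extracted audio track
    question: str      # open-ended audio-visual question
    answer: str        # free-form reference answer


sample = AVUSample(
    video_path="clips/000001.mp4",
    audio_path="clips/000001.wav",
    question="What sound accompanies the dog entering the frame?",
    answer="A doorbell rings just as the dog walks in.",
)
```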
👥 Authors

Yuxin Guo
School of Artificial Intelligence, University of Chinese Academy of Sciences; MAIS, Institute of Automation, Chinese Academy of Sciences (CASIA)

Shuailei Ma
Northeastern University, China
Open-World Object Detection · Human Object Interaction Detection

Shijie Ma
Institute of Automation, Chinese Academy of Sciences
Computer Vision · Machine Learning

Xiaoyi Bao
School of Artificial Intelligence, University of Chinese Academy of Sciences; MAIS, Institute of Automation, Chinese Academy of Sciences (CASIA)

Chen-Wei Xie
Alibaba Group
Computer Vision · Machine Learning

Kecheng Zheng
Ant Group

Tingyu Weng
Tongyi Lab, Alibaba Group

Siyang Sun
Alibaba Group
Deep Learning · Multi-Modal Large Language Models

Yun Zheng
Alibaba
Computer Vision · Multimodal Modeling

Wei Zou
PKU, Samsung, Baidu, Didi, Ke
Speech · NLP · LLM · Multimodal