Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding

๐Ÿ“… 2024-03-24
๐Ÿ›๏ธ AAAI Conference on Artificial Intelligence
๐Ÿ“ˆ Citations: 16
โœจ Influential: 1
๐Ÿค– AI Summary
Existing untrimmed audio-visual video datasets lack precise temporal annotations, which prevents multimodal large models from learning joint alignment across the audio, visual, textual, and temporal modalities. Method: We introduce PU-VALOR, a large-scale pseudo-untrimmed dataset of over 114K videos, built with a generation paradigm that combines event-based clustering, random temporal scaling, and reordering to overcome the scarcity of real untrimmed data. This enables, for the first time, end-to-end joint alignment of audio-visual events, fine-grained timestamps, and text tokens. Built upon Vicuna, we propose AVicuna, a model fine-tuned on PU-VALOR for cross-modal temporal alignment. Results: Experiments demonstrate state-of-the-art performance on open-ended video QA, audio-visual QA, and dense audio-visual event localization, with significant gains in temporal localization accuracy and time-aware dialogue capability.

๐Ÿ“ Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in natural language and multimodal domains. By fine-tuning multimodal LLMs with temporal annotations from well-annotated datasets, e.g., dense video captioning datasets, they can acquire temporal understanding in video-language tasks. However, there is a notable lack of untrimmed audio-visual video datasets with precise temporal annotations for events. This deficiency hinders LLMs from learning the alignment between time, audio-visual events, and text tokens, thus impairing their ability to temporally localize audio-visual events in videos. To address this gap, we introduce PU-VALOR, a comprehensive audio-visual dataset comprising 114,081 pseudo-untrimmed videos with detailed temporal annotations. PU-VALOR is derived from the large-scale but coarsely annotated audio-visual dataset VALOR through a simple pipeline involving event-based video clustering, random temporal scaling, and permutation. By fine-tuning a multimodal LLM on PU-VALOR, we developed AVicuna, a model capable of aligning audio-visual events with temporal intervals and corresponding text tokens. AVicuna excels in temporal localization and time-aware dialogue capabilities. Our experiments demonstrate that AVicuna effectively handles temporal understanding in audio-visual videos and achieves state-of-the-art performance on open-ended video QA, audio-visual QA, and audio-visual event dense localization tasks.
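The pseudo-untrimmed generation pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each trimmed clip from an event cluster is represented as a caption plus a duration, and the scale range and function names are hypothetical.

```python
import random

def make_pseudo_untrimmed(clips, scale_range=(0.5, 1.5), seed=None):
    """Sketch of the PU-VALOR idea: take trimmed clips grouped by event,
    randomly scale each clip's duration, permute their order, and
    concatenate them, so that every clip's position in the synthetic
    video yields a precise start/end timestamp annotation.

    clips: list of (caption, duration_seconds) tuples (illustrative format).
    scale_range: assumed bounds for random temporal scaling.
    """
    rng = random.Random(seed)
    # Random temporal scaling of each clip's duration
    scaled = [(cap, dur * rng.uniform(*scale_range)) for cap, dur in clips]
    # Random permutation of event order
    rng.shuffle(scaled)
    # Concatenate and record pseudo ground-truth temporal annotations
    annotations, t = [], 0.0
    for caption, dur in scaled:
        annotations.append({"caption": caption,
                            "start": round(t, 3),
                            "end": round(t + dur, 3)})
        t += dur
    return annotations
```

The resulting `annotations` list pairs each event caption with a known temporal interval, which is exactly the supervision signal the real VALOR clips lack.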
Problem

Research questions and friction points this paper is trying to address.

Lack of untrimmed audio-visual datasets with precise temporal annotations
LLMs struggle to align time, audio-visual events, and text tokens
Impaired temporal localization of audio-visual events in videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

PU-VALOR dataset with pseudo-untrimmed videos
Fine-tuning multimodal LLMs for temporal alignment
AVicuna model localizes audio-visual events temporally