StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification

πŸ“… 2024-11-11
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 1
✨ Influential: 0
πŸ€– AI Summary
Current large vision-language models (LVLMs) struggle to generate coherent, minute-long video descriptions: character identities drift across time, and plot-level semantics spanning both the audio and visual streams are poorly modeled. To address this, we introduce MovieStory101, a dataset of dense descriptions for three-minute movie clips, and StoryQA, a multiple-choice benchmark for evaluating long video descriptions. We propose StoryTeller, a system built around a multimodal large language model that performs global audio-visual character identification (matching character names to each dialogue line) and feeds the results into an LVLM in a second stage to keep descriptions consistent. On StoryQA, StoryTeller achieves 9.5% higher accuracy than the strongest baseline, Gemini-1.5-pro, and a +15.56% advantage in human side-by-side evaluation. Supplying its character identification to Gemini-1.5-pro and GPT-4o yields relative accuracy improvements of 5.5% and 13.0%, respectively, demonstrating strong transferability.
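Below is a minimal sketch of the staged inference described above, assuming a two-step pipeline in which dialogue lines are first attributed to named characters and the name-attributed script then conditions the description model. All function names, interfaces, and data structures here (`identify_characters`, `describe_clip`, `av_model.match_speaker`, `lvlm.generate`) are hypothetical illustrations, not the paper's released API.

```python
# Hypothetical sketch of StoryTeller's staged inference. All interfaces
# below are assumptions for illustration, not the paper's actual code.
from dataclasses import dataclass

@dataclass
class Dialogue:
    start: float       # seconds into the clip
    end: float
    text: str          # ASR transcript of the spoken line
    speaker: str = ""  # character name, filled in by stage 1

def identify_characters(video_clip, dialogues, av_model):
    """Stage 1: global audio-visual character identification.

    The multimodal model (assumed interface) sees frames, audio, and the
    transcript, and assigns a character name to each dialogue line.
    """
    for d in dialogues:
        d.speaker = av_model.match_speaker(video_clip, d)
    return dialogues

def describe_clip(video_clip, dialogues, lvlm):
    """Stage 2: LVLM decoding conditioned on the identified speakers.

    A name-attributed script keeps character references consistent
    across the minute-long clip.
    """
    script = "\n".join(f"{d.speaker}: {d.text}" for d in dialogues)
    return lvlm.generate(video=video_clip, context=script)
```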

πŸ“ Abstract
Existing large vision-language models (LVLMs) are largely limited to processing short, seconds-long videos and struggle with generating coherent descriptions for extended videos spanning minutes or more. Long video description introduces new challenges, such as consistent character identification and plot-level descriptions incorporating both visual and audio information. To address these, we identify audio-visual character identification, i.e., matching character names to each dialogue, as a key factor. We propose StoryTeller, a system for generating dense descriptions of long videos, incorporating both low-level visual concepts and high-level plot information. StoryTeller uses a multimodal large language model that integrates visual, audio, and text modalities to perform audio-visual character identification on minute-long video clips. The results are then fed into an LVLM to enhance the consistency of video descriptions. We validate our approach on movie description tasks and introduce MovieStory101, a dataset with dense descriptions for three-minute movie clips. To evaluate long video descriptions, we create StoryQA, a large set of multiple-choice questions for the MovieStory101 test set. We assess descriptions by inputting them into GPT-4 to answer these questions, using accuracy as an automatic evaluation metric. Experiments show that StoryTeller outperforms all open- and closed-source baselines on StoryQA, achieving 9.5% higher accuracy than the strongest baseline, Gemini-1.5-pro, and demonstrating a +15.56% advantage in human side-by-side evaluations. Additionally, incorporating audio-visual character identification from StoryTeller improves the performance of all video description models, with Gemini-1.5-pro and GPT-4o showing relative improvements of 5.5% and 13.0%, respectively, in accuracy on StoryQA.
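
The GPT-4-based automatic evaluation on StoryQA lends itself to a short sketch: each generated description is handed to GPT-4 together with a multiple-choice question, and accuracy over the question set is the score. The prompt wording and the `llm_answer` callable below are assumptions for illustration, not the authors' evaluation code.

```python
# Hypothetical sketch of the StoryQA evaluation protocol: answer
# multiple-choice questions from the description alone, score by accuracy.
def storyqa_accuracy(description, questions, llm_answer):
    """questions: dicts with 'question', 'options' (list), 'answer' (e.g. 'B').
    llm_answer: callable sending a prompt to GPT-4 and returning its reply.
    """
    correct = 0
    for q in questions:
        options = "\n".join(
            f"{chr(65 + i)}. {opt}" for i, opt in enumerate(q["options"])
        )
        prompt = (
            "Based only on the video description below, answer the question "
            "with a single option letter.\n\n"
            f"Description:\n{description}\n\n"
            f"Question: {q['question']}\n{options}\nAnswer:"
        )
        reply = llm_answer(prompt).strip()
        if reply[:1].upper() == q["answer"]:
            correct += 1
    return correct / len(questions)
```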
Problem

Research questions and friction points this paper is trying to address.

Generating coherent descriptions for long videos spanning minutes.
Consistent character identification using audio-visual information.
Enhancing video description models with multimodal integration.
Innovation

Methods, ideas, or system contributions that make the work stand out.

A multimodal LLM integrates visual, audio, and text modalities.
Global audio-visual character identification enhances description consistency.
StoryTeller outperforms all open- and closed-source baselines on StoryQA.
πŸ‘₯ Authors
Yichen He (ByteDance Research)
Yuan Lin (Ocean College, Zhejiang University)
Jianchao Wu (ByteDance Research)
Hanchong Zhang (Shanghai Jiao Tong University)
Yuchen Zhang (ByteDance Research)
Ruicheng Le (Peking University)