StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification

📅 2024-11-11

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Current vision-language models struggle to generate coherent descriptions of minutes-long videos because they identify characters inconsistently over time and model audio-visual, plot-level semantics poorly. To address this, we introduce MovieStory101, a dataset of dense descriptions for three-minute movie clips, and StoryQA, a multiple-choice evaluation benchmark built on the MovieStory101 test set. Our system, StoryTeller, works in two stages: a multimodal large language model first performs audio-visual character identification, matching character names to each line of dialogue, and the results are then fed into an LVLM to make the generated description more consistent. We identify audio-visual character identification as the key factor in long-video description. On StoryQA, StoryTeller achieves 9.5% higher accuracy than Gemini-1.5-pro, the strongest baseline, and a +15.56% advantage in human side-by-side evaluation. Adding its character identification results to Gemini-1.5-pro and GPT-4o yields relative accuracy improvements of 5.5% and 13.0%, respectively.

📝 Abstract

Existing large vision-language models (LVLMs) are largely limited to processing short, seconds-long videos and struggle to generate coherent descriptions for extended videos spanning minutes or more. Long video description introduces new challenges, such as consistent character identification and plot-level descriptions that incorporate both visual and audio information. To address these, we identify audio-visual character identification, which matches character names to each line of dialogue, as a key factor. We propose StoryTeller, a system for generating dense descriptions of long videos that incorporates both low-level visual concepts and high-level plot information. StoryTeller uses a multimodal large language model that integrates visual, audio, and text modalities to perform audio-visual character identification on minute-long video clips. The results are then fed into an LVLM to enhance the consistency of the video description. We validate our approach on movie description tasks and introduce MovieStory101, a dataset with dense descriptions for three-minute movie clips. To evaluate long video descriptions, we create StoryQA, a large set of multiple-choice questions for the MovieStory101 test set. We assess descriptions by inputting them into GPT-4 to answer these questions, using accuracy as an automatic evaluation metric. Experiments show that StoryTeller outperforms all open- and closed-source baselines on StoryQA, achieving 9.5% higher accuracy than the strongest baseline, Gemini-1.5-pro, and demonstrating a +15.56% advantage in human side-by-side evaluations. Additionally, incorporating audio-visual character identification from StoryTeller improves the performance of all video description models, with Gemini-1.5-pro and GPT-4o showing relative improvements of 5.5% and 13.0%, respectively, in accuracy on StoryQA.
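
The evaluation protocol described in the abstract (answer multiple-choice questions from the generated description alone, then report accuracy) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not the authors' code: the `MultipleChoiceQuestion` type, the `qa_accuracy` function, and the `keyword_answerer` stand-in, which replaces the GPT-4 call with a trivial string match so the example runs offline. The sample description and questions are made up and are not drawn from StoryQA.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class MultipleChoiceQuestion:
    """Hypothetical container for one multiple-choice item."""
    question: str
    options: Sequence[str]
    answer_index: int  # index of the correct option


# An answerer sees only the description text and the question, never the video.
Answerer = Callable[[str, MultipleChoiceQuestion], int]


def qa_accuracy(description: str,
                questions: Sequence[MultipleChoiceQuestion],
                answerer: Answerer) -> float:
    """Fraction of questions the answerer gets right using the description."""
    if not questions:
        raise ValueError("need at least one question")
    correct = sum(answerer(description, q) == q.answer_index for q in questions)
    return correct / len(questions)


def keyword_answerer(description: str, q: MultipleChoiceQuestion) -> int:
    """Toy stand-in for the LLM judge (assumption): choose the first option
    whose text appears in the description, falling back to option 0."""
    text = description.lower()
    for i, option in enumerate(q.options):
        if option.lower() in text:
            return i
    return 0


if __name__ == "__main__":
    # Made-up toy data, for illustration only.
    description = "Anna hands the letter to Mark in the kitchen."
    questions = [
        MultipleChoiceQuestion("Where does the scene take place?",
                               ["garden", "kitchen"], 1),
        MultipleChoiceQuestion("What is handed over?", ["letter", "key"], 0),
        MultipleChoiceQuestion("What is the weather like?", ["rain", "snow"], 1),
    ]
    print(f"accuracy = {qa_accuracy(description, questions, keyword_answerer):.3f}")
```

Keeping the answerer as a plain callable means a real LLM-backed judge could be swapped in without changing the accuracy computation. Because the answerer never sees the video, a description that omits or misattributes a character's actions directly lowers the score.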

Problem

Research questions and friction points this paper is trying to address.

Generating coherent descriptions for long videos spanning minutes.

Consistent character identification using audio-visual information.

Enhancing video description models with multimodal integration.

Innovation

Methods, ideas, or system contributions that make the work stand out.

A multimodal large language model integrates visual, audio, and text modalities.

Audio-visual character identification enhances consistency.

StoryTeller outperforms all open- and closed-source baselines on StoryQA, with 9.5% higher accuracy than Gemini-1.5-pro.

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs