V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

📅 2024-04-18
🏛️ AAAI Conference on Artificial Intelligence
📈 Citations: 37
Influential: 3
🤖 AI Summary
Existing video summarization datasets suffer from limited scale, modality scarcity, and misalignment between textual summaries and video frames, hindering effective training of large vision-language models (VLMs) for cross-modal tasks. Method: We propose V2Xum, a unified cross-modal video summarization framework featuring the first joint V2V (video-to-video), V2T (video-to-text), and V2VT (video-to-video-and-text) modeling using an LLaMA-based text decoder; a temporal prompt instruction-tuning mechanism for controllable generation; and Instruct-V2Xum—a novel large-scale YouTube dataset (30K videos) with frame-level aligned textual summaries. We further introduce new V2V and V2VT evaluation metrics. Results: Experiments show that V2Xum-LLaMA significantly outperforms strong baselines; Instruct-V2Xum advances high-quality, fine-grained, multimodal video summarization; and the proposed metrics enhance result interpretability and consistency.

📝 Abstract
Video summarization aims to create short, accurate, and cohesive summaries of longer videos. Despite the existence of various video summarization datasets, a notable limitation is their limited number of source videos, which hampers the effective training of advanced large vision-language models (VLMs). Additionally, most existing datasets are created for video-to-video summarization, overlooking the contemporary need for multimodal video content summarization. Recent efforts have been made to expand from unimodal to multimodal video summarization, categorizing the task into three sub-tasks based on the summary's modality: video-to-video (V2V), video-to-text (V2T), and a combination of video and text summarization (V2VT). However, the textual summaries in previous multimodal datasets are inadequate. To address these issues, we introduce Instruct-V2Xum, a cross-modal video summarization dataset featuring 30,000 diverse videos sourced from YouTube, with lengths ranging from 40 to 940 seconds and an average summarization ratio of 16.39%. Each video summary in Instruct-V2Xum is paired with a textual summary that references specific frame indexes, facilitating the generation of aligned video and textual summaries. In addition, we propose a new video summarization framework named V2Xum-LLM. V2Xum-LLM, specifically V2Xum-LLaMA in this study, is the first framework that unifies different video summarization tasks into one large language model's (LLM) text decoder and achieves task-controllable video summarization with temporal prompts and task instructions. Experiments show that V2Xum-LLaMA outperforms strong baseline models on multiple video summarization tasks. Furthermore, we propose an enhanced evaluation metric for V2V and V2VT summarization tasks.
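To make the V2V evaluation setting concrete: a V2V summary is a subset of frame indexes selected from the source video, and the conventional baseline metric scores the overlap between predicted and ground-truth keyframe sets with F1. The sketch below shows that standard set-overlap F1 only; it is not the enhanced metric the paper proposes, whose exact formulation is not given in this summary.

```python
def keyframe_f1(pred_frames, gt_frames):
    """Set-overlap F1 between predicted and ground-truth keyframe indexes.

    Illustrative baseline only: the paper's enhanced V2V/V2VT metric is
    not reproduced here.
    """
    pred, gt = set(pred_frames), set(gt_frames)
    if not pred or not gt:
        return 0.0
    tp = len(pred & gt)            # frames selected by both summaries
    precision = tp / len(pred)     # fraction of predicted frames that are correct
    recall = tp / len(gt)          # fraction of ground-truth frames recovered
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, `keyframe_f1([0, 10, 20, 30], [10, 20, 40])` gives precision 0.5, recall 2/3, and F1 of 4/7 ≈ 0.571.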
Problem

Research questions and friction points this paper is trying to address.

Existing video summarization datasets lack sufficient source videos for training large vision-language models
Current datasets focus on video-to-video summarization, ignoring multimodal content summarization needs
Textual summaries in previous multimodal datasets are inadequate for cross-modal video summarization tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Instruct-V2Xum dataset with 30,000 diverse videos
Unifies video summarization tasks into one LLM decoder
Achieves task control via temporal prompts and instructions
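Because Instruct-V2Xum's textual summaries reference specific frame indexes, a single decoded string can yield both the text summary and the frame selection for the video summary. A minimal sketch of that post-processing step follows; the bracketed `[<frame>]` token format is an assumption for illustration, as the dataset's actual markup may differ.

```python
import re

def parse_aligned_summary(text):
    """Split a generated summary into its clean textual summary and the
    frame indexes it references.

    Assumes frame references appear as bracketed tokens like "[12]";
    this format is hypothetical, not taken from the paper.
    """
    frame_ids = [int(m) for m in re.findall(r"\[(\d+)\]", text)]
    clean_text = re.sub(r"\s*\[\d+\]", "", text).strip()
    return clean_text, frame_ids
```

For instance, `parse_aligned_summary("A chef chops onions [12] then sears the steak [87].")` returns the cleaned sentence plus the frame list `[12, 87]`, from which the V2V summary can be assembled.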