Valley: Video Assistant with Large Language model Enhanced abilitY

📅 2023-06-12
🏛️ arXiv.org
📈 Citations: 160
Influential: 16
🤖 AI Summary
This work addresses the underexplored challenge of joint video-language understanding with large language models (LLMs). We propose Valley, the first unified vision-language foundation model supporting video, image, and text modalities. Methodologically, we introduce a novel two-stage video instruction tuning paradigm, incorporating a lightweight learnable projection layer and a dedicated temporal modeling module to overcome the limitations of conventional image-text alignment approaches in video understanding. Valley keeps its visual encoder frozen, is first pretrained for vision-language alignment, and is then instruction-tuned on a high-quality video instruction dataset constructed with ChatGPT's assistance. Experiments demonstrate that Valley significantly outperforms existing baselines on complex video instruction tasks, including long-horizon understanding, multi-shot description, action recognition, and causal reasoning, while exhibiting strong generalization and interactive analytical capabilities.
📝 Abstract
Large language models (LLMs), with their remarkable conversational capabilities, have demonstrated impressive performance across various applications and have emerged as formidable AI assistants. In view of this, an intuitive question arises: can we harness the power of LLMs to build multimodal AI assistants for visual applications? Recently, several multi-modal models have been developed for this purpose. They typically pre-train an adaptation module to align the semantics of the vision encoder and language model, followed by fine-tuning on instruction-following data. However, despite the success of this pipeline in image and language understanding, its effectiveness in joint video and language understanding has not been widely explored. In this paper, we aim to develop a novel multi-modal foundation model capable of comprehending video, image, and language within a general framework. To achieve this goal, we introduce Valley, a Video Assistant with Large Language model Enhanced abilitY. Valley consists of an LLM, a temporal modeling module, a visual encoder, and a simple projection module designed to bridge the visual and textual modalities. To empower Valley with video comprehension and instruction-following capabilities, we construct a video instruction dataset and adopt a two-stage tuning procedure to train it. Specifically, we employ ChatGPT to facilitate the construction of task-oriented conversation data covering various tasks, including multi-shot captioning, long video description, action recognition, causal relationship inference, etc. Subsequently, we adopt a pretraining-then-instruction-tuning pipeline to align the visual and textual modalities and improve Valley's instruction-following capability. Qualitative experiments demonstrate that Valley has the potential to function as a highly effective video assistant that makes complex video understanding scenarios easy.
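The architecture described in the abstract (a frozen visual encoder, a temporal modeling module, and a lightweight projection into the LLM's embedding space) can be sketched at the tensor level. This is a minimal illustration, not the paper's implementation: the dimensions are assumptions (ViT-L/14 produces 1024-dim features; the LLM embedding size of 4096 is hypothetical), and the temporal module is simplified here to mean pooling over frames.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: T sampled frames, N patch tokens per frame,
# D_vis visual feature dim (ViT-L/14 outputs 1024), D_llm LLM embedding dim.
T, N, D_vis, D_llm = 8, 256, 1024, 4096

# Per-frame patch features from the frozen visual encoder (stand-in values).
frame_feats = rng.standard_normal((T, N, D_vis))

# Temporal modeling, simplified here as mean pooling across frames,
# yielding one set of visual tokens for the whole clip.
video_tokens = frame_feats.mean(axis=0)          # shape (N, D_vis)

# Lightweight learnable projection bridging vision and language spaces;
# in training, only this layer would be updated during alignment pretraining.
W = rng.standard_normal((D_vis, D_llm)) * 0.01   # shape (D_vis, D_llm)
llm_inputs = video_tokens @ W                    # shape (N, D_llm)

print(llm_inputs.shape)
```

The projected tokens would then be prepended to the text embeddings of the instruction before being fed to the LLM.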
Problem

Research questions and friction points this paper is trying to address.

Enhance video comprehension using multi-modal foundation models.
Develop datasets for diverse video-text alignment tasks.
Improve instruction-following capabilities in video understanding.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal foundation model for video comprehension
Two-stage training approach enhances instruction-following
ViT-L/14 encoder with temporal modeling modules
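The two-stage schedule above can be sketched as a freezing plan. This is a hedged sketch under assumptions: the module names are illustrative, and the exact set of modules unfrozen in each stage is inferred from the summary (alignment pretraining updates only the projection; instruction tuning also updates the temporal module and LLM, with the visual encoder frozen throughout).

```python
# Illustrative module names; not the paper's actual code.
MODULES = ["visual_encoder", "projection", "temporal_module", "llm"]

def trainable_modules(stage):
    """Return which modules receive gradient updates in the given stage."""
    if stage == 1:
        # Stage 1: pretraining for vision-language alignment.
        # Only the lightweight projection layer is updated.
        return {"projection"}
    # Stage 2: instruction tuning on the ChatGPT-assisted dataset.
    # Projection, temporal module, and LLM are updated;
    # the visual encoder stays frozen in both stages.
    return {"projection", "temporal_module", "llm"}

for stage in (1, 2):
    frozen = [m for m in MODULES if m not in trainable_modules(stage)]
    print(f"stage {stage}: frozen = {frozen}")
```

In a PyTorch-style implementation, this plan would translate to toggling `requires_grad` on each module's parameters before each stage.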