Video Understanding with Large Language Models: A Survey

📅 2023-12-29
🏛️ IEEE Transactions on Circuits and Systems for Video Technology (Print)
📈 Citations: 95
✨ Influential: 4
🤖 AI Summary
The explosive growth of video content poses significant challenges for multi-granularity (i.e., general, temporal, spatiotemporal) understanding and the integration of commonsense knowledge. Method: This paper presents a systematic survey of large language model (LLM)-enhanced video understanding, proposing the first unified "Video Analyzer/Embedder × LLM" taxonomy. It formally defines five functional roles of LLMs in video understanding and reveals their open-ended, multi-granularity reasoning capabilities. The survey integrates advances in multimodal LLMs, video representation learning, cross-modal alignment, and prompt engineering to establish a comprehensive evaluation framework covering over 100 works, major benchmarks, and task paradigms. Contribution/Results: The authors release an open-source resource repository. Key future directions are identified: scalable modeling, fine-grained temporal reasoning, and causal inference.
πŸ“ Abstract
With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (general, temporal, and spatiotemporal) reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer × LLM, Video Embedder × LLM, and (Analyzer + Embedder) × LLM. Furthermore, we identify five sub-types based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. Moreover, this survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs. Additionally, it explores the expansive applications of Vid-LLMs across various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are recommended to visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.
Problem

Research questions and friction points this paper is trying to address.

Surveying advancements in video understanding using large language models
Exploring Vid-LLMs' multi-granularity reasoning and commonsense knowledge integration
Analyzing Vid-LLMs' applications, limitations, and future research directions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging LLMs for multi-granularity video reasoning
Categorizing Vid-LLMs into three main approach types
Identifying five functional sub-types of LLMs
Yunlong Tang
University of Rochester
Jing Bi
University of Rochester
Siting Xu
Southern University of Science and Technology
Luchuan Song
University of Rochester
Computer Vision, Computer Graphics, Animation
Susan Liang
University of Rochester
Computer Vision
Teng Wang
The University of Hong Kong
Daoan Zhang
PhD Student, University of Rochester
Computer Vision, Multimodal Learning, LLM
Jie An
University of Rochester
Jingyang Lin
University of Rochester
Rongyi Zhu
University of Rochester
A. Vosoughi
University of Rochester
Chao Huang
University of Rochester
Zeliang Zhang
PhD Candidate @ University of Rochester; BEng @ HUST
Trustworthy and Efficient AI
Feng Zheng
Southern University of Science and Technology
Jianguo Zhang
Southern University of Science and Technology
Ping Luo
National University of Defense Technology
Distributed Computing
Jiebo Luo
University of Rochester
Chenliang Xu
Associate Professor of Computer Science, University of Rochester
Computer Vision, Multimodal Learning, Video Understanding, Vision and Language