Video Understanding with Large Language Models: A Survey

📅 2023-12-29
🏛️ IEEE Transactions on Circuits and Systems for Video Technology (Print)
📈 Citations: 95
✨ Influential: 4
🤖 AI Summary
The explosive growth of video content poses significant challenges for multi-granularity (i.e., general, temporal, spatiotemporal) understanding and the integration of commonsense knowledge. Method: This paper presents a systematic survey of large language model (LLM)-enhanced video understanding, proposing the first unified "Video Analyzer/Embedder × LLM" taxonomy. It formally defines five functional roles of LLMs in video understanding and reveals their open-ended, multi-granularity reasoning capabilities. The survey integrates advances in multimodal LLMs, video representation learning, cross-modal alignment, and prompt engineering to establish a comprehensive evaluation framework covering over 100 works, major benchmarks, and task paradigms. Contribution/Results: The authors release an open-source resource repository. Key future directions are identified: scalable modeling, fine-grained temporal reasoning, and causal inference.
πŸ“ Abstract
With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (general, temporal, and spatiotemporal) reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer × LLM, Video Embedder × LLM, and (Analyzer + Embedder) × LLM. Furthermore, we identify five sub-types based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. Moreover, this survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs. Additionally, it explores the expansive applications of Vid-LLMs across various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are recommended to visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.
Problem

Research questions and friction points this paper is trying to address.

Surveying advancements in video understanding using large language models
Exploring Vid-LLMs' multi-granularity reasoning and commonsense knowledge integration
Analyzing Vid-LLMs' applications, limitations, and future research directions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging LLMs for multi-granularity video reasoning
Categorizing Vid-LLMs into three main approach types
Identifying five functional sub-types of LLMs
Yunlong Tang
University of Rochester
Jing Bi
University of Rochester
Siting Xu
Southern University of Science and Technology
Luchuan Song
University of Rochester
Computer Vision, Computer Graphics, Animation
Susan Liang
University of Rochester
Computer Vision
Teng Wang
The University of Hong Kong
Daoan Zhang
PhD Student, University of Rochester
Computer Vision, Multimodal Learning, LLM
Jie An
University of Rochester
Jingyang Lin
University of Rochester
Rongyi Zhu
University of Rochester
A. Vosoughi
University of Rochester
Chao Huang
University of Rochester
Zeliang Zhang
PhD Candidate @ University of Rochester; BEng @ HUST
Trustworthy and Efficient AI
Feng Zheng
Southern University of Science and Technology
Jianguo Zhang
Southern University of Science and Technology
Ping Luo
National University of Defense Technology
Distributed Computing
Jiebo Luo
University of Rochester
Chenliang Xu
Associate Professor of Computer Science, University of Rochester
Computer Vision, Multimodal Learning, Video Understanding, Vision and Language