AI Summary
The explosive growth of video content poses significant challenges for multi-granularity (i.e., general, temporal, spatiotemporal) understanding and integration of commonsense knowledge. Method: This paper presents a systematic survey of large language model (LLM)-enhanced video understanding, proposing the first unified “Video Analyzer/Embedder × LLM” taxonomy. It formally defines five functional roles of LLMs in video understanding and highlights their open-ended, multi-granularity reasoning capabilities. The survey integrates advances in multimodal LLMs, video representation learning, cross-modal alignment, and prompt engineering to establish a comprehensive evaluation framework covering over 100 works, major benchmarks, and task paradigms. Contribution/Results: The authors release an open-source resource repository. Key future directions are identified: scalable modeling, fine-grained temporal reasoning, and causal inference.
Abstract
With the rapid growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their capacity for open-ended multi-granularity (general, temporal, and spatiotemporal) reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer x LLM, Video Embedder x LLM, and (Analyzer + Embedder) x LLM. We further identify five sub-types based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. The survey also presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs, and it explores the applications of Vid-LLMs across various domains, highlighting their scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers can visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.