Video Understanding: From Geometry and Semantics to Unified Models

📅 2026-03-18
🤖 AI Summary
Video understanding requires modeling temporal dynamics and evolving visual context, which places stronger demands on a model’s spatiotemporal reasoning than image understanding does. This survey organizes the literature into three complementary perspectives: low-level video geometry understanding, high-level semantic understanding, and unified video understanding models. It highlights a broader shift from isolated, task-specific pipelines toward unified modeling paradigms that can be adapted to diverse downstream objectives. By consolidating these perspectives, it provides a coherent map of the field, summarizes key modeling trends and design principles, and outlines open challenges toward robust, scalable, and unified video foundation models.

📝 Abstract
Video understanding aims to enable models to perceive, reason about, and interact with the dynamic visual world. In contrast to image understanding, video understanding inherently requires modeling temporal dynamics and evolving visual context, placing stronger demands on spatiotemporal reasoning and making it a foundational problem in computer vision. In this survey, we present a structured overview of video understanding by organizing the literature into three complementary perspectives: low-level video geometry understanding, high-level semantic understanding, and unified video understanding models. We further highlight a broader shift from isolated, task-specific pipelines toward unified modeling paradigms that can be adapted to diverse downstream objectives, enabling a more systematic view of recent progress. By consolidating these perspectives, this survey provides a coherent map of the evolving video understanding landscape, summarizes key modeling trends and design principles, and outlines open challenges toward building robust, scalable, and unified video foundation models.
Problem

Research questions and friction points this paper is trying to address.

video understanding
temporal dynamics
spatiotemporal reasoning
visual context
computer vision
Innovation

Methods, ideas, or system contributions that make the work stand out.

unified video understanding
spatiotemporal reasoning
video foundation models
geometric understanding
semantic understanding
Zhaochong An
Department of Computer Science, University of Copenhagen, Copenhagen 2100, Denmark.
Zirui Li
College of Computer Science, Nankai University, Tianjin 300350, China.
Mingqiao Ye
EPFL
Computer Vision, Multimodality
Feng Qiao
Washington University in St. Louis
Computer Vision, Artificial Intelligence, Autonomous Driving
Jiaang Li
University of Copenhagen
Computer Vision, Multimodality, Natural Language Processing
Zongwei Wu
University of Würzburg | CNRS - Université de Bourgogne | ETH Zurich
Sensor Fusion, Perception
Vishal Thengane
Computer Science Research Centre, University of Surrey, Guildford GU2 7XH, UK. School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, Wollongong 2500, Australia.
Chengzu Li
University of Cambridge
Natural Language Processing
Lei Li
School of Artificial Intelligence, Beijing Institute of Technology, Beijing 100081, China.
Luc Van Gool
Professor of computer vision, INSAIT, Sofia University; em. KU Leuven; em. ETHZ; Toyota Lab TRACE
computer vision, machine learning, AI, autonomous cars, cultural heritage
Guolei Sun
ETH Zurich
Visual Attention, Video, Weak Supervision, Camouflage, Low-level Vision
Serge Belongie
University of Copenhagen
Computer Vision, Machine Learning