🤖 AI Summary
This survey systematically categorizes video action understanding into three temporal regimes: full-action recognition, partial-observation prediction, and unobserved-action forecasting, clarifying the distinct modeling challenges of each. To address limitations in long-horizon causal reasoning and cross-modal synergy, it comprehensively reviews deep temporal modeling, contrastive learning, multimodal alignment, generative video modeling, and zero-shot transfer, spanning Transformer-, CNN-LSTM-, and diffusion-based paradigms. Synthesizing over 100 seminal works, it identifies critical performance bottlenecks and evaluation biases, organizing them into a structured knowledge graph. The core contribution is the proposed “dynamic reasoning” paradigm: a conceptual and methodological shift that advances action understanding from static classification toward systematic, causal, anticipatory, and generalizable modeling.
📝 Abstract
We have witnessed impressive advances in video action understanding. Increased dataset sizes, variability, and computation availability have enabled leaps in performance and task diversification. Current systems can provide coarse- and fine-grained descriptions of video scenes, extract segments corresponding to queries, synthesize unobserved parts of videos, and predict context across multiple modalities. This survey comprehensively reviews advances in uni- and multi-modal action understanding across a range of tasks. We focus on prevalent challenges, provide an overview of widely adopted datasets, and survey seminal works, with an emphasis on recent advances. We broadly distinguish between three temporal scopes: (1) recognition tasks for actions observed in full, (2) prediction tasks for ongoing, partially observed actions, and (3) forecasting tasks for subsequent unobserved action(s). This division allows us to identify the specific action-modeling and video-representation challenges of each. Finally, we outline future directions to address current shortcomings.