🤖 AI Summary
Action anticipation aims to infer future actions from partially observed videos, yet existing approaches predominantly rely on a single visual modality and suffer from coarse-grained label noise. This paper proposes a multimodal hierarchical action anticipation framework that jointly leverages visual and textual cues, employing a multi-level semantic encoder that explicitly models the temporal–semantic hierarchy of actions. It introduces a fine-grained label generator to mitigate label coarseness and incorporates a temporal consistency loss to enhance prediction stability. Evaluated on the Breakfast, 50 Salads, and DARai datasets, the method achieves an average improvement of 3.08% in anticipation accuracy, establishing new state-of-the-art performance. The core contributions lie in the unified design of multimodal collaborative modeling, hierarchical semantic representation, and robust learning against label noise.
📝 Abstract
Action anticipation, the task of predicting future actions from partially observed videos, is crucial for advancing intelligent systems. Unlike action recognition, which operates on fully observed videos, action anticipation must handle incomplete information. Hence, it requires temporal reasoning and the handling of inherent uncertainty. While recent advances have been made, traditional methods often focus solely on visual modalities, neglecting the potential of integrating multiple sources of information. Drawing inspiration from human behavior, we introduce *Multi-level and Multi-modal Action Anticipation (m&m-Ant)*, a novel multi-modal action anticipation approach that combines both visual and textual cues, while explicitly modeling hierarchical semantic information for more accurate predictions. To address the challenge of inaccurate coarse action labels, we propose a fine-grained label generator paired with a specialized temporal consistency loss function to optimize performance. Extensive experiments on widely used datasets, including Breakfast, 50 Salads, and DARai, demonstrate the effectiveness of our approach, achieving state-of-the-art results with an average anticipation accuracy improvement of 3.08% over existing methods. This work underscores the potential of multi-modal and hierarchical modeling in advancing action anticipation and establishes a new benchmark for future research in the field. Our code is available at: https://github.com/olivesgatech/mM-ant.
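The abstract does not specify the form of the temporal consistency loss. A minimal sketch of one common formulation, which penalizes abrupt changes between consecutive per-frame prediction vectors, is shown below; the function name and the squared-difference formulation are illustrative assumptions, not the paper's actual definition:

```python
import numpy as np

def temporal_consistency_loss(preds: np.ndarray) -> float:
    """Hypothetical sketch of a temporal consistency penalty.

    preds: array of shape [T, C] holding per-time-step prediction
    vectors (e.g. class probabilities over C action classes).
    Returns the mean squared difference between predictions at
    adjacent time steps, so smoother sequences incur lower loss.
    """
    diffs = preds[1:] - preds[:-1]   # [T-1, C] adjacent differences
    return float(np.mean(diffs ** 2))

# Constant predictions incur zero penalty; erratic ones do not.
smooth = np.tile(np.array([0.7, 0.3]), (5, 1))
erratic = np.array([[0.9, 0.1], [0.1, 0.9], [0.9, 0.1],
                    [0.1, 0.9], [0.9, 0.1]])
print(temporal_consistency_loss(smooth))   # 0.0
print(temporal_consistency_loss(erratic))  # 0.64
```

In practice such a term would be added, with a weighting coefficient, to the main prediction loss during training.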