🤖 AI Summary
This work addresses long-horizon video action forecasting, jointly modeling action classes and their durations to ensure temporal coherence and physical plausibility. We propose an encoder-decoder framework with three key contributions: (1) a novel bidirectional action context regularization module that explicitly captures long-range temporal dependencies; (2) a globally optimized action transition matrix learned from annotated segments to enforce physically realistic state transitions; and (3) an action-segment-specific encoder that enhances representation quality of observed segments. The method supports parallel decoding and probabilistic duration prediction, augmented by explicit temporal consistency constraints. Evaluated on four major benchmarks—EpicKitchen-55, EGTEA+, 50Salads, and Breakfast—our approach achieves state-of-the-art or competitive performance, significantly outperforming both LLM-based and traditional probabilistic methods that operate on trimmed inputs.
📝 Abstract
This paper proposes a method for long-term action anticipation (LTA), the task of predicting action labels and their durations in a video given the observation of an initial untrimmed video interval. We build on an encoder-decoder architecture with parallel decoding and make two key contributions. First, we introduce a bi-directional action context regularizer module on top of the decoder that ensures temporal context coherence between temporally adjacent segments. Second, we learn from classified segments a transition matrix that models the probability of transitioning from one action to another, and we optimize the predicted sequence globally over the full prediction interval. In addition, we use a specialized encoder for the task of action segmentation to increase the quality of the predictions in the observation interval at inference time, leading to a better understanding of the past. We validate our method on four benchmark datasets for LTA, EpicKitchen-55, EGTEA+, 50Salads, and Breakfast, demonstrating superior or comparable performance to state-of-the-art methods, including probabilistic models as well as those based on Large Language Models that assume trimmed video as input. The code will be released upon acceptance.
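To make the second contribution concrete: optimizing a label sequence globally under a learned pairwise transition matrix is a classic dynamic-programming problem. The sketch below is a minimal, hypothetical illustration (not the paper's actual implementation) using standard Viterbi decoding, where per-segment class scores from a decoder are combined with log-transition scores to pick the globally best action sequence.

```python
import numpy as np

def viterbi_decode(log_probs, log_trans):
    """Globally optimal label sequence under per-segment scores plus
    pairwise transition scores (standard Viterbi dynamic programming).

    log_probs : (T, C) per-segment class log-probabilities (e.g. from a decoder)
    log_trans : (C, C) log transition matrix, log_trans[i, j] = log P(j | i)
    Returns the best path as a list of T class indices.
    """
    T, C = log_probs.shape
    dp = np.full((T, C), -np.inf)      # best score ending in each class
    back = np.zeros((T, C), dtype=int)  # backpointers for path recovery
    dp[0] = log_probs[0]
    for t in range(1, T):
        # score of arriving at class j from every predecessor i: (C, C)
        scores = dp[t - 1][:, None] + log_trans
        back[t] = scores.argmax(axis=0)
        dp[t] = scores.max(axis=0) + log_probs[t]
    # backtrack from the best final class
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With a "sticky" transition matrix (high self-transition probability), the global optimum can override a locally preferred but transition-implausible label, which is exactly the temporal-coherence effect the transition matrix is meant to enforce.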