🤖 AI Summary
Existing Transformer-based approaches for video action prediction rely on pixel-level attention, which lacks high-level semantic modeling, struggles to capture latent intentions, and is prone to overfitting to historical visual cues, thereby limiting generalization. To address these limitations, this work proposes an Action-Guided Attention (AGA) mechanism that explicitly incorporates the predicted action sequence into the attention computation as a guiding signal for both queries and keys. This enables the model to focus on the past video segments most relevant to future actions and to integrate current-frame information through a gating function. The proposed method not only enhances semantic understanding and generalization in action prediction—demonstrating stable performance from the validation set to unseen test sets on the EPIC-Kitchens-100 benchmark—but also facilitates interpretable analysis of the learned action dependencies and counterfactual evidence.
📝 Abstract
Anticipating future actions in videos is challenging, as the observed frames provide only evidence of past activities, requiring the inference of latent intentions to predict upcoming actions. Existing transformer-based approaches, which rely on dot-product attention over pixel representations, often lack the high-level semantics necessary to model video sequences for effective action anticipation. As a result, these methods tend to overfit to explicit visual cues present in the past frames, limiting their ability to capture underlying intentions and degrading generalization to unseen samples. To address this, we propose Action-Guided Attention (AGA), an attention mechanism that explicitly leverages predicted action sequences as queries and keys to guide sequence modeling. Our approach encourages the attention module to emphasize relevant moments from the past based on the upcoming activity and to combine this information with the current frame embedding via a dedicated gating function. The design of AGA enables post-training analysis of the knowledge discovered from the training set. Experiments on the widely adopted EPIC-Kitchens-100 benchmark demonstrate that AGA generalizes well from validation to unseen test sets. Post-training analysis can further examine the action dependencies captured by the model and the counterfactual evidence it has internalized, offering transparent and interpretable insights into its anticipative predictions.
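The core idea described above—deriving attention queries and keys from a predicted action sequence rather than from pixel features, then gating the aggregated past context with the current frame embedding—can be sketched roughly as follows. This is a minimal NumPy illustration under assumed toy dimensions, not the authors' implementation; all names (`action_guided_attention`, `Wq`, `Wk`, `Wg`) and the exact form of the gate are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def action_guided_attention(past_frames, action_seq, current_frame, Wq, Wk, Wg):
    """Toy sketch of an action-guided attention step.

    past_frames:   (T, d) embeddings of observed frames (used as values)
    action_seq:    (T, d) predicted action embeddings aligned with past frames
    current_frame: (d,)   embedding of the current frame
    Wq, Wk, Wg:    (d, d) projection matrices (random here, learned in practice)
    """
    # Queries and keys come from the predicted action sequence, so attention
    # weights reflect relevance to the anticipated activity, not pixel similarity.
    Q = action_seq @ Wq                              # (T, d)
    K = action_seq @ Wk                              # (T, d)
    d = past_frames.shape[1]
    attn = softmax(Q @ K.T / np.sqrt(d))             # (T, T), rows sum to 1
    context = attn @ past_frames                     # (T, d) action-weighted past evidence

    # Pool the attended context, then gate it against the current frame:
    # the sigmoid gate decides how much past evidence vs. present observation to keep.
    pooled = context.mean(axis=0)                    # (d,)
    gate = 1.0 / (1.0 + np.exp(-(current_frame @ Wg)))  # (d,) elementwise sigmoid
    return gate * pooled + (1.0 - gate) * current_frame  # (d,) fused representation

# Tiny usage example with random toy data.
rng = np.random.default_rng(0)
T, d = 8, 16
out = action_guided_attention(
    rng.standard_normal((T, d)),   # past frame embeddings
    rng.standard_normal((T, d)),   # predicted action embeddings
    rng.standard_normal(d),        # current frame embedding
    rng.standard_normal((d, d)),
    rng.standard_normal((d, d)),
    rng.standard_normal((d, d)),
)
print(out.shape)
```

Because the attention weights are computed over explicit action embeddings, each row of `attn` can be inspected after training to see which past moments the model considers relevant to an anticipated action, which is the property the abstract exploits for interpretability analysis.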