🤖 AI Summary
This work addresses first-person (egocentric) action anticipation on the EPIC-KITCHENS-100 dataset with an efficient approach built on V-JEPA 2.1. The method uses a frozen encoder and predictor to extract context representations from the observed video segment and latent features of the near future, then trains lightweight attentive probes with separate task queries to predict verbs, nouns, and complete actions. A field-aware ensemble selectively fuses epoch-level predictions per output field (verb, noun, action), so that each field draws on its most reliable candidates, with the aim of improving robustness. The approach placed first in the EgoVis 2026 EPIC-KITCHENS-100 Action Anticipation Challenge, as evaluated on the official challenge server.
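To make the probe design concrete, the following is a minimal forward-pass sketch of a task-query attentive probe, not the authors' implementation. It assumes a single attention head, randomly initialized weights, and toy dimensions; the class name `TaskQueryProbe`, its parameters, and the choice to concatenate context and future tokens before pooling are all hypothetical. Training code is omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class TaskQueryProbe:
    """Hypothetical single-head cross-attention probe, one query per field."""

    FIELDS = ("verb", "noun", "action")

    def __init__(self, dim, num_classes, seed=0):
        rng = np.random.default_rng(seed)
        scale = dim ** -0.5
        self.dim = dim
        # One query vector per output field (learned in practice, random here).
        self.queries = {f: rng.normal(0.0, scale, size=dim) for f in self.FIELDS}
        # Key/value projections shared by all three queries (an assumption).
        self.w_k = rng.normal(0.0, scale, size=(dim, dim))
        self.w_v = rng.normal(0.0, scale, size=(dim, dim))
        # One linear classification head per field.
        self.heads = {
            f: rng.normal(0.0, scale, size=(dim, num_classes[f]))
            for f in self.FIELDS
        }

    def forward(self, context_tokens, future_tokens):
        """Pool frozen-backbone tokens with each task query and classify.

        context_tokens: (N_ctx, dim) features from the frozen encoder.
        future_tokens:  (N_fut, dim) latent tokens from the frozen predictor.
        Returns a dict mapping each field to a 1-D logit vector.
        """
        tokens = np.concatenate([context_tokens, future_tokens], axis=0)
        keys = tokens @ self.w_k
        values = tokens @ self.w_v
        logits = {}
        for field in self.FIELDS:
            scores = keys @ self.queries[field] / np.sqrt(self.dim)
            attn = softmax(scores)      # attention weights over all tokens
            pooled = attn @ values      # (dim,) field-specific summary
            logits[field] = pooled @ self.heads[field]
        return logits
```

Because the backbone is frozen, only the queries, projections, and heads would carry trainable parameters in a setup like this, which is what keeps the probe lightweight.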
📝 Abstract
We propose JFAA, a JEPA-based Future Action Anticipation method for the EPIC-KITCHENS-100 (EK-100) Action Anticipation task. Motivated by the representation-learning and future-prediction capabilities of V-JEPA 2.1, JFAA uses a frozen encoder and predictor to extract context features from the observed video and latent tokens for the near future. A lightweight attentive probe is then trained to predict verb, noun, and action logits using a separate task query for each. To improve robustness, we further build a field-aware ensemble over selected epoch-level predictions, allowing each output field to benefit from its most reliable candidates. Results on the official challenge server show that JFAA achieves first place in the EgoVis 2026 EK-100 Action Anticipation Challenge. Our code will be released at https://github.com/CorrineQiu/JFAA.
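One way to picture the field-aware ensemble is the sketch below. It is an illustration under stated assumptions, not the released method: the abstract does not say how candidates are selected or fused, so this version assumes each epoch-level candidate has a per-field validation score, keeps the `top_k` best candidates for each field, and averages their class probabilities. The function name `field_aware_ensemble` and its arguments are hypothetical.

```python
import numpy as np


def field_aware_ensemble(candidates, val_scores, top_k=2):
    """Fuse epoch-level predictions separately for each output field.

    candidates: {name: {field: array of shape (num_samples, num_classes)}}
                holding class probabilities per candidate and field.
    val_scores: {name: {field: float}} validation score per candidate and
                field (assumed selection criterion; higher is better).
    Returns (fused, chosen): fused probabilities and selected names per field.
    """
    fields = list(next(iter(candidates.values())).keys())
    fused, chosen = {}, {}
    for field in fields:
        # Rank candidates by how well they did on this field only.
        ranked = sorted(
            candidates, key=lambda name: val_scores[name][field], reverse=True
        )
        picked = ranked[:top_k]
        chosen[field] = picked
        # Plain averaging of probabilities (assumed fusion rule).
        fused[field] = np.mean([candidates[n][field] for n in picked], axis=0)
    return fused, chosen
```

The point of selecting per field is that the checkpoint that is best for verbs need not be the best for nouns or actions, so each field can draw on a different subset of candidates.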