🤖 AI Summary
Early action prediction is challenging because only limited visual evidence is available. This work proposes EAST, a framework that generalizes across test-time observation ratios by randomly sampling, during training, the time step that separates observed from unobserved frames, and by jointly learning representations of both the observed and the future (oracle) segments. Joint learning lets even an encoder-only model perform well, and a token masking procedure halves memory usage and speeds up training by 2× with negligible accuracy loss. Combined with a forecasting decoder, EAST surpasses the previous best results on NTU60, Something-Something V2, and UCF101 by 10.1, 7.7, and 3.9 percentage points, respectively.
📝 Abstract
Early action prediction seeks to anticipate an action before it fully unfolds, but limited visual evidence makes this task especially challenging. We introduce EAST, a simple and efficient framework that enables a model to reason about incomplete observations. In our empirical study, we identify the key components for training early action prediction models. Our key contribution is a randomized training strategy that samples a time step separating observed and unobserved video frames, enabling a single model to generalize seamlessly across all test-time observation ratios. We further show that joint learning on both observed and future (oracle) representations significantly boosts performance, even allowing an encoder-only model to excel. To improve scalability, we propose a token masking procedure that cuts memory usage in half and accelerates training by 2x with negligible accuracy loss. Combined with a forecasting decoder, EAST sets a new state of the art on NTU60, SSv2, and UCF101, surpassing previous best work by 10.1, 7.7, and 3.9 percentage points, respectively.
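The randomized boundary sampling and token masking described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the sampling distribution or the masking scheme, so the uniform sampling, the `keep_ratio` value, and the helper names `sample_boundary`, `split_at_boundary`, and `mask_tokens` are all assumptions made for illustration. Frames and tokens are represented as plain integers.

```python
import random


def sample_boundary(num_frames, rng):
    """Pick a boundary index t with 1 <= t <= num_frames - 1.

    Uniform sampling is an assumption; the abstract only says the
    boundary time step is sampled randomly during training.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames to split")
    return rng.randint(1, num_frames - 1)


def split_at_boundary(frames, t):
    """Frames before t are observed; frames from t onward are the future."""
    return frames[:t], frames[t:]


def mask_tokens(tokens, keep_ratio, rng):
    """Keep a random subset of tokens, preserving their original order.

    Hypothetical masking scheme: random selection of a fixed fraction.
    """
    if not tokens:
        return []
    num_keep = max(1, round(len(tokens) * keep_ratio))
    kept_indices = sorted(rng.sample(range(len(tokens)), num_keep))
    return [tokens[i] for i in kept_indices]


rng = random.Random(0)
frames = list(range(16))  # stand-in for 16 video frames
t = sample_boundary(len(frames), rng)
observed, future = split_at_boundary(frames, t)
kept = mask_tokens(observed, keep_ratio=0.5, rng=rng)  # keep_ratio is an assumed value
observation_ratio = len(observed) / len(frames)
```

Because a new boundary is drawn for every training sample, the model sees many different observation ratios over the course of training, which is what would let a single model serve any ratio at test time.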