🤖 AI Summary
This work addresses the challenging task of long-horizon (multi-minute) dense action forecasting from video, focusing on modeling multi-modal future uncertainty and capturing long-range temporal dependencies. We propose MANTA, the first architecture that deeply integrates state-space models (specifically Mamba) with stochastic prediction mechanisms in a unified framework that jointly predicts past and future actions. MANTA enables end-to-end, fine-grained estimation of both action categories and durations over dense temporal sequences. Its design achieves strong representational capacity while maintaining linear computational complexity in sequence length, effectively mitigating the error accumulation and GPU memory bottlenecks prevalent in transformer- and RNN-based approaches on long videos. Extensive experiments demonstrate state-of-the-art performance on three major benchmarks (Breakfast, 50Salads, and Assembly101), along with a 2.1× speedup in inference latency and a 38% reduction in GPU memory consumption.
📝 Abstract
Our work addresses the problem of stochastic long-term dense anticipation. The goal of this task is to predict actions and their durations several minutes into the future based on the provided video observations. Anticipation over extended horizons introduces high uncertainty, as a single observation can lead to multiple plausible future outcomes. To address this uncertainty, stochastic models are designed to predict several potential future action sequences. Recent work has further proposed to incorporate uncertainty modelling for observed frames by simultaneously predicting per-frame past and future actions in a unified manner. While such joint modelling of actions is beneficial, it requires long-range temporal capabilities to connect events across distant past and future time points. However, previous work struggles to achieve such long-range understanding due to its limited or sparse receptive field. To alleviate this issue, we propose the novel MANTA (MAmba for ANTicipation) network. Our model enables effective long-term temporal modelling even for very long sequences while maintaining linear complexity in sequence length. We demonstrate that our approach achieves state-of-the-art results on three datasets (Breakfast, 50Salads, and Assembly101) while also significantly improving computational and memory efficiency.
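The linear-complexity claim rests on the state-space recurrence underlying Mamba: each timestep updates a fixed-size hidden state, so cost grows linearly with sequence length rather than quadratically as in self-attention. The sketch below is a minimal, purely illustrative diagonal SSM scan in NumPy; it is not the paper's architecture (actual Mamba blocks use input-dependent, i.e. selective, parameters and hardware-aware parallel scans), and all names here are hypothetical.

```python
import numpy as np

def diagonal_ssm_scan(x, a, b, c):
    """Toy diagonal state-space scan (illustrative, not MANTA's actual block).

    x: (T,) input sequence; a, b, c: (d,) diagonal SSM parameters.
    Runs in O(T * d): one fixed-size state update per timestep,
    with no T x T attention matrix ever materialised.
    """
    T = x.shape[0]
    d = a.shape[0]
    h = np.zeros(d)          # fixed-size hidden state
    y = np.empty(T)
    for t in range(T):
        h = a * h + b * x[t]  # state update: h_t = a .* h_{t-1} + b * x_t
        y[t] = c @ h          # readout:      y_t = <c, h_t>
    return y
```

Because the state is a fixed-size vector, memory stays constant in the sequence length, which is the property the abstract highlights for very long videos.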