AI Summary
Unidirectional modeling in long-horizon video action forecasting struggles to capture semantically heterogeneous sub-actions. Method: We propose BiAnt, the first framework that deeply integrates bidirectional action-sequence learning with large language models (LLMs). BiAnt employs an encoder-decoder architecture that jointly performs forward future prediction and backward contextual reconstruction, leveraging LLMs to explicitly model semantic dependencies and temporal symmetry among actions, thereby overcoming the representational limitations of conventional unidirectional models. Contribution/Results: On the Ego4D benchmark, BiAnt achieves significant improvements in edit distance over state-of-the-art baselines, empirically validating the efficacy of bidirectional collaborative reasoning for long-term action anticipation. This work establishes a novel, interpretable, and robust action-forecasting paradigm that is particularly beneficial for safety-critical applications requiring early risk identification, such as autonomous driving and service robotics.
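The forward/backward structure described above can be sketched at a very high level. This is a minimal, hypothetical illustration (all function names are invented, not from the paper): string-passing stand-ins replace the real encoder, LLM-based decoders, and feature tensors, but the control flow shows how a forward branch predicts future actions while a backward branch reconstructs the observed past as an auxiliary training signal.

```python
# Hypothetical sketch of a bidirectional anticipation loop (names invented,
# not the authors' code). Toy string ops stand in for the real encoder/LLM.
from typing import List, Tuple


def encode(past: List[str]) -> List[str]:
    """Stand-in encoder: a real model would emit feature vectors per step."""
    return list(past)


def forward_decode(context: List[str], horizon: int) -> List[str]:
    """Toy forward decoder: repeats the last observed action as a placeholder
    for LLM-based future prediction."""
    return [context[-1]] * horizon if context else ["<unk>"] * horizon


def backward_decode(context: List[str]) -> List[str]:
    """Toy backward decoder: reconstructs the observed sequence in reverse,
    serving as the backward contextual-reconstruction branch."""
    return list(reversed(context))


def bidirectional_anticipate(past: List[str], horizon: int) -> Tuple[List[str], List[str]]:
    ctx = encode(past)
    future = forward_decode(ctx, horizon)       # forward branch: future actions
    reconstruction = backward_decode(ctx)       # backward branch: past, reversed
    # Training would jointly penalize future-prediction error and
    # reconstruction error; inference uses only the forward branch.
    return future, reconstruction
```

At inference, only `future` would be kept; the backward branch exists to regularize the shared encoder during training.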
Abstract
Video-based long-term action anticipation is crucial for early risk detection in areas such as automated driving and robotics. Conventional approaches extract features from past actions with an encoder and predict future events with a decoder, but their unidirectional nature limits performance: they struggle to capture semantically distinct sub-actions within a scene. The proposed method, BiAnt, addresses this limitation by combining forward prediction with backward prediction using a large language model. Experimental results on Ego4D demonstrate that BiAnt improves edit distance over baseline methods.
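The edit-distance metric mentioned above compares a predicted action sequence against the ground-truth sequence. As a concrete illustration, here is a plain Levenshtein distance over action labels (Ego4D's long-term anticipation benchmark uses a normalized variant of edit distance; the exact normalization is omitted here, and the action labels are invented examples):

```python
from typing import List, Sequence


def edit_distance(pred: Sequence[str], gold: Sequence[str]) -> int:
    """Levenshtein distance between two action-label sequences:
    the minimum number of insertions, deletions, and substitutions
    needed to turn `pred` into `gold`."""
    m, n = len(pred), len(gold)
    # dp[i][j] = distance between pred[:i] and gold[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all i predicted actions
    for j in range(n + 1):
        dp[0][j] = j          # insert all j ground-truth actions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == gold[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # match or substitution
            )
    return dp[m][n]
```

For example, predicting `["wash", "cut", "cook"]` against ground truth `["wash", "cook"]` costs 1 (one spurious action to delete); a lower score means a better forecast.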