Technical Report for Ego4D Long-Term Action Anticipation Challenge 2025

📅 2025-06-03
🤖 AI Summary
To address the challenges of modeling future action sequences and weak cross-modal temporal reasoning in the Ego4D Long-Term Action Anticipation (LTA) task, this paper proposes a three-stage framework: (1) a high-performance vision encoder for frame-level feature extraction; (2) a Transformer module incorporating verb-noun co-occurrence matrix embeddings to enable fine-grained joint recognition; and (3) a fine-tuned large language model (LLM), augmented with semantic prompt engineering, that maps verb-noun pairs into natural-language action sequences. The work explicitly integrates co-occurrence statistics, derived from prior linguistic knowledge, into the visual recognition module and establishes an end-to-end cross-modal long-horizon generation paradigm. Evaluated on the CVPR 2025 Ego4D LTA Challenge, the method achieves first place and sets a new state of the art. The code is publicly available.

📝 Abstract
In this report, we present a novel three-stage framework developed for the Ego4D Long-Term Action Anticipation (LTA) task. Inspired by recent advances in foundation models, our method consists of three stages: feature extraction, action recognition, and long-term action anticipation. First, visual features are extracted using a high-performance visual encoder. The features are then fed into a Transformer to predict verbs and nouns, with a verb-noun co-occurrence matrix incorporated to enhance recognition accuracy. Finally, the predicted verb-noun pairs are formatted as textual prompts and input into a fine-tuned large language model (LLM) to anticipate future action sequences. Our framework achieves first place in this challenge at CVPR 2025, establishing a new state-of-the-art in long-term action prediction. Our code will be released at https://github.com/CorrineQiu/Ego4D-LTA-Challenge-2025.
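The second stage fuses a verb-noun co-occurrence prior into recognition. A minimal sketch of one way such a prior can re-rank joint verb-noun predictions follows; the function name, the `alpha` weight, and the log-count form of the prior are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def cooccurrence_rerank(verb_logits, noun_logits, cooc, alpha=0.5):
    """Re-rank joint verb-noun predictions with a co-occurrence prior.

    verb_logits: (V,) raw scores per verb class
    noun_logits: (N,) raw scores per noun class
    cooc:        (V, N) verb-noun co-occurrence counts from the training set
    alpha:       prior weight (hypothetical hyperparameter)
    """
    # Turn counts into a log-prior; add 1 to avoid log(0) for unseen pairs.
    prior = np.log(cooc + 1.0)
    prior = prior - prior.max()  # shift for numerical stability
    # Joint score: independent per-class logits plus the weighted prior.
    joint = verb_logits[:, None] + noun_logits[None, :] + alpha * prior
    v, n = np.unravel_index(np.argmax(joint), joint.shape)
    return int(v), int(n)
```

With `alpha=0` this reduces to picking the top verb and top noun independently; a positive `alpha` steers the decision toward verb-noun pairs that actually co-occur, which is the intuition behind injecting linguistic co-occurrence statistics into recognition.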
Problem

Research questions and friction points this paper is trying to address.

Modeling long-horizon future action sequences from egocentric video
Weak cross-modal temporal reasoning between visual features and language
Mapping recognized verb-noun pairs into natural-language action predictions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Three-stage framework for action anticipation
Transformer with verb-noun co-occurrence matrix
Fine-tuned LLM for future action sequences
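The third stage formats the recognized verb-noun pairs as a textual prompt for the fine-tuned LLM. A sketch of such a prompt builder is below; the template wording is an assumption (the report's exact prompt is not given here), and the default horizon of 20 follows the Ego4D LTA protocol of predicting Z=20 future actions:

```python
def build_lta_prompt(history, horizon=20):
    """Format recognized verb-noun pairs as an LLM prompt for anticipation.

    history: list of (verb, noun) string pairs from the recognition stage
    horizon: number of future actions to request (Ego4D LTA uses Z=20)
    The template below is a hypothetical example, not the paper's prompt.
    """
    past = ", ".join(f"{verb} {noun}" for verb, noun in history)
    return (
        f"Observed actions: {past}. "
        f"Predict the next {horizon} actions as comma-separated verb-noun pairs."
    )
```

The LLM's comma-separated completion would then be parsed back into verb-noun pairs for evaluation against the ground-truth future sequence.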
Qiaohui Chu
Harbin Institute of Technology (Shenzhen)
Multimodal Analysis, Egocentric Vision
Haoyu Zhang
Harbin Institute of Technology (Shenzhen), Pengcheng Laboratory
Yisen Feng
Harbin Institute of Technology (Shenzhen)
Multimodal Analysis
Meng Liu
Shandong Jianzhu University
Weili Guan
Harbin Institute of Technology (Shenzhen)
Yaowei Wang
The Hong Kong Polytechnic University
Liqiang Nie
Harbin Institute of Technology (Shenzhen)