TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action

📅 2025-05-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Video causal event understanding and fine-grained temporal localization face two key challenges: existing methods either compress temporal resolution, which weakens the modeling of causal dependencies, or ignore event boundaries altogether. This paper proposes a two-stage training framework: (1) an event-level masked prediction stage that reconstructs missing events and generates causal explanations; and (2) a joint optimization stage for non-overlapping event segmentation and temporally aligned dense caption generation. The work embeds causal reasoning into event-granular masked modeling and introduces VER, the first million-scale, temporally aligned event dataset. To the authors' knowledge, this is the first work to jointly optimize causal explanation generation and end-to-end event segmentation. The method achieves significant improvements over state-of-the-art approaches on temporal localization and highlight detection benchmarks, demonstrating the effectiveness of integrating causal modeling with fine-grained temporal segmentation.
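The event-level masked prediction stage described above can be illustrated with a small data-construction sketch: given a list of timestamped event captions, hide one event and ask the model to reconstruct it with a causal explanation. The function name, prompt wording, and mask token below are illustrative assumptions, not the paper's actual implementation.

```python
import random

def build_masked_event_example(events, mask_token="<MASKED_EVENT>", seed=None):
    """Turn timestamped event captions into one masked-prediction training
    instance: one event is hidden, and the target is its caption.
    (Illustrative sketch; the paper's pipeline is not specified in this detail.)"""
    rng = random.Random(seed)
    idx = rng.randrange(len(events))  # pick which event to mask
    context = [
        f"[{start:.1f}s-{end:.1f}s] {mask_token if i == idx else caption}"
        for i, (start, end, caption) in enumerate(events)
    ]
    prompt = (
        "One event in this video timeline is hidden. Infer the missing event "
        "and explain step by step why it must occur:\n" + "\n".join(context)
    )
    target = events[idx][2]  # the hidden caption is the supervision label
    return prompt, target

events = [
    (0.0, 4.2, "A chef cracks two eggs into a bowl"),
    (4.2, 9.8, "The chef whisks the eggs"),
    (9.8, 15.0, "The mixture is poured into a hot pan"),
]
prompt, target = build_masked_event_example(events, seed=0)
```

The masked caption becomes the target while the surrounding timeline stays visible, which is what lets the model practice inferring causally necessary events from context.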

πŸ“ Abstract
Understanding causal event relationships and achieving fine-grained temporal grounding in videos remain challenging for vision-language models. Existing methods either compress video tokens to reduce temporal resolution, or treat videos as unsegmented streams, which obscures fine-grained event boundaries and limits the modeling of causal dependencies. We propose TEMPURA (Temporal Event Masked Prediction and Understanding for Reasoning in Action), a two-stage training framework that enhances video temporal understanding. TEMPURA first applies masked event prediction reasoning to reconstruct missing events and generate step-by-step causal explanations from dense event annotations, drawing inspiration from effective infilling techniques. TEMPURA then learns to perform video segmentation and dense captioning to decompose videos into non-overlapping events with detailed, timestamp-aligned descriptions. We train TEMPURA on VER, a large-scale dataset curated by us that comprises 1M training instances and 500K videos with temporally aligned event descriptions and structured reasoning steps. Experiments on temporal grounding and highlight detection benchmarks demonstrate that TEMPURA outperforms strong baseline models, confirming that integrating causal reasoning with fine-grained temporal segmentation leads to improved video understanding.
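The second stage's output, a decomposition of the video into non-overlapping events with timestamp-aligned descriptions, implies a simple structural invariant. The helper below is a hypothetical validity check under that assumption; the paper does not specify this exact function.

```python
def is_valid_segmentation(segments, video_len, tol=1e-6):
    """Check that (start, end, caption) segments form ordered, non-overlapping
    events within the video, as stage-two training targets require.
    (Hypothetical helper, not from the paper.)"""
    prev_end = 0.0
    for start, end, _caption in segments:
        if start < prev_end - tol:   # overlaps the previous event
            return False
        if end <= start:             # empty or inverted interval
            return False
        if end > video_len + tol:    # runs past the video
            return False
        prev_end = end
    return True
```

Enforcing non-overlap at the representation level is what distinguishes this formulation from treating the video as an unsegmented stream.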
Problem

Research questions and friction points this paper is trying to address.

Enhancing video temporal understanding with causal reasoning
Improving fine-grained event boundary detection in videos
Integrating dense captioning with temporal event segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked event prediction for causal reasoning
Video segmentation with dense captioning
Large-scale dataset with aligned event descriptions