🤖 AI Summary
Existing video understanding models struggle to localize fine-grained event boundaries, particularly when identifying and describing character state transitions. To address this, we introduce GEB+, the first multi-task benchmark for generic event boundary understanding, comprising over 15K real-world long-video segments with human-verified, fine-grained boundary annotations. GEB+ unifies three core tasks: event boundary description generation, precise boundary localization, and cross-modal retrieval. We propose a temporally aware annotation protocol and a weakly supervised alignment strategy to support joint task learning. Methodologically, we design a multimodal Transformer enhanced with temporal contrastive learning, boundary-aware attention, and weakly supervised cross-modal alignment. Our approach improves substantially on all three tasks, with an average R@1 gain of 12.7% over prior methods. GEB+ establishes a new baseline for video event understanding and advances standardized evaluation in this domain.
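The summary does not spell out how the temporal contrastive objective is formulated, so the following is only a minimal sketch of one common instantiation (an InfoNCE-style loss over clip features on either side of a boundary); the function name, feature shapes, and temperature are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    """InfoNCE-style contrastive loss (illustrative sketch).

    Row i of `positives` is the positive for row i of `anchors`;
    all other rows in the batch serve as negatives.
    """
    # L2-normalize so dot products become cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # matched (anchor, positive) pairs sit on the diagonal
    return -np.mean(np.diag(log_probs))

# Toy example: hypothetical features of clips just before vs. just after
# four event boundaries; matched pairs are made nearly identical.
rng = np.random.default_rng(0)
before = rng.normal(size=(4, 16))
after = before + 0.05 * rng.normal(size=(4, 16))
loss = info_nce(before, after)
```

In a real boundary-aware model the `before`/`after` features would come from the video encoder, and whether pairs across one boundary are treated as positives or negatives depends on the specific training objective, which the summary leaves unspecified.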