GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval

📅 2022-04-01
🏛️ European Conference on Computer Vision
📈 Citations: 19 · Influential: 1
🤖 AI Summary
Existing video understanding models struggle to localize fine-grained event boundaries, particularly when identifying and describing character state transitions. To address this, we introduce GEB+, the first multi-task benchmark for generic event boundary understanding, comprising over 15K real-world long-video segments with human-verified, fine-grained boundary annotations. GEB+ unifies three core tasks: event boundary description generation (captioning), boundary localization (grounding), and cross-modal retrieval. We propose a temporal-aware annotation protocol and a weakly supervised alignment strategy to support joint task learning. Methodologically, we design a multimodal Transformer enhanced with temporal contrastive learning, boundary-aware attention, and weakly supervised cross-modal alignment. Our approach yields substantial improvements across all three tasks, including an average R@1 gain of 12.7% over prior methods. GEB+ establishes a new baseline for video event understanding and advances standardized evaluation in this domain.
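The summary mentions weakly supervised cross-modal alignment between boundary segments and their text descriptions. A minimal sketch of the InfoNCE-style contrastive objective commonly used for this kind of video–text alignment is shown below; the function name, shapes, and temperature value are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric-free InfoNCE loss: row i of video_emb is the positive
    match for row i of text_emb; all other rows act as negatives."""
    # L2-normalize each embedding so the dot product is cosine similarity
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # (N, N) similarity matrix scaled by temperature
    logits = v @ t.T / temperature
    # numerically stable log-softmax over each row
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # matched (positive) pairs lie on the diagonal
    return -np.mean(np.diag(log_probs))
```

When the i-th video and text embeddings are aligned, the diagonal dominates each softmax row and the loss is small; misaligned pairings raise it, which is the signal that pulls matching boundary and caption embeddings together.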
Problem

Research questions and friction points this paper is trying to address.

Video Understanding
Fine-grained Changes
State Transition Recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Kinetic-GEB+
TPD (Temporal-based Pairwise Difference) Modeling
Video Understanding
Yuxuan Wang
Show Lab, National University of Singapore
Difei Gao
National University of Singapore; Institute of Computing Technology, Chinese Academy of Sciences
Artificial Intelligence · AI Agent · Vision and Language
Licheng Yu
Meta AI
Stan Weixian Lei
Show Lab, National University of Singapore
Matt Feiszli
Facebook AI Research
Machine Learning · Computer Vision · Harmonic Analysis · Geometry
Mike Zheng Shou
Show Lab, National University of Singapore