🤖 AI Summary
Existing video understanding models struggle to localize fine-grained event boundaries, particularly when identifying and describing character state transitions. To address this, we introduce GEB+, the first multi-task benchmark for generic event boundary understanding, comprising over 15K real-world long-video segments with human-verified, fine-grained boundary annotations. GEB+ unifies three core tasks: event boundary description generation, precise boundary localization, and cross-modal retrieval. We propose a temporally aware annotation protocol and a weakly supervised alignment strategy to support joint task learning. Methodologically, we design a multimodal Transformer enhanced with temporal contrastive learning, boundary-aware attention, and weakly supervised cross-modal alignment. Our approach improves substantially on all three tasks, with an average R@1 gain of 12.7% over prior methods. GEB+ establishes a new baseline for video event understanding and advances standardized evaluation in this domain.
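The summary does not spell out how the temporal contrastive objective is formulated, so the following is only a minimal sketch of one common instantiation (an InfoNCE-style loss over clip features on either side of a boundary); the function name, feature shapes, and temperature are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    """InfoNCE-style contrastive loss (illustrative sketch).

    Row i of `positives` is the positive for row i of `anchors`;
    all other rows in the batch serve as negatives.
    """
    # L2-normalize so dot products become cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # matched (anchor, positive) pairs sit on the diagonal
    return -np.mean(np.diag(log_probs))

# Toy example: hypothetical features of clips just before vs. just after
# four event boundaries; matched pairs are made nearly identical.
rng = np.random.default_rng(0)
before = rng.normal(size=(4, 16))
after = before + 0.05 * rng.normal(size=(4, 16))
loss = info_nce(before, after)
```

In a real boundary-aware model the `before`/`after` features would come from the video encoder, and whether pairs across one boundary are treated as positives or negatives depends on the specific training objective, which the summary leaves unspecified.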