🤖 AI Summary
Video action detection relies on dense spatio-temporal annotations that are costly to obtain, yet samples vary substantially in difficulty and rarely need the same level of supervision. To improve annotation efficiency, this paper proposes an active learning framework that first estimates the annotation granularity each video actually needs, ranging from video-level labels and temporal points to scribbles, bounding boxes, and pixel-level masks. A dynamic annotation-type selection strategy is then combined with spatio-temporal 3D superpixel segmentation to generate high-quality pseudo-labels, enabling unified modeling and joint training across annotation granularities and supporting progressive integration from weak to strong supervision within a single pipeline. Evaluated on UCF101-24 and JHMDB-21, the method cuts annotation cost by up to 72% while staying close to fully supervised performance (an mAP drop of only 1.2-2.5%), establishing a scalable, annotation-efficient paradigm for low-resource video understanding.
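To make the selection idea concrete, here is a minimal sketch of mapping a per-sample uncertainty score to an annotation type. The threshold values, relative costs, and function names are illustrative assumptions, not the paper's actual policy; the paper's strategy estimates the needed type via active learning signals.

```python
# Hypothetical annotation types ordered from weakest/cheapest to
# strongest/most expensive; the relative costs below are illustrative
# placeholders, not figures from the paper.
ANNOTATION_TYPES = ["video_tag", "point", "scribble", "bbox", "pixel_mask"]
COSTS = {"video_tag": 1, "point": 2, "scribble": 5, "bbox": 10, "pixel_mask": 80}

def select_annotation_type(uncertainty, thresholds=(0.2, 0.4, 0.6, 0.8)):
    """Map a model uncertainty score in [0, 1] to an annotation type:
    confident (easy) samples get weak labels, hard samples get strong ones."""
    for level, t in enumerate(thresholds):
        if uncertainty < t:
            return ANNOTATION_TYPES[level]
    return ANNOTATION_TYPES[-1]  # hardest samples get pixel-level masks

def batch_annotation_cost(uncertainties):
    """Total annotation cost of a batch under the adaptive policy."""
    return sum(COSTS[select_annotation_type(u)] for u in uncertainties)
```

Under such a policy, only the hardest fraction of videos pays the full pixel-mask cost, which is where the overall annotation savings come from.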
📝 Abstract
Video action detection requires dense spatio-temporal annotations, which are both challenging and expensive to obtain. However, real-world videos often vary in difficulty and may not require the same level of annotation. This paper analyzes the appropriate annotation types for each sample and their impact on spatio-temporal video action detection. It focuses on two key aspects: 1) how to obtain varying levels of annotation for videos, and 2) how to learn action detection from different annotation types. The study explores video-level tags, points, scribbles, bounding boxes, and pixel-level masks. First, a simple active learning strategy is proposed to estimate the necessary annotation type for each video. Then, a novel spatio-temporal 3D-superpixel approach is introduced to generate pseudo-labels from these annotations, enabling effective training. The approach is validated on the UCF101-24 and JHMDB-21 datasets, significantly cutting annotation costs with minimal performance loss.
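The pseudo-label step can be sketched as follows: group voxels of the video volume into spatio-temporal regions, then expand sparse weak labels (e.g. points) into dense masks by labelling every voxel whose region contains an annotation. The regular grid cells below are a crude stand-in for the paper's 3D superpixels, which would also use appearance and motion cues; all names here are hypothetical.

```python
import numpy as np

def grid_supervoxels(shape, cell=(2, 4, 4)):
    """Partition a (T, H, W) video volume into regular 3D cells -- a crude
    stand-in for 3D superpixel segmentation (a real method would also group
    by appearance/motion, not just position)."""
    T, H, W = shape
    t, y, x = np.meshgrid(np.arange(T) // cell[0],
                          np.arange(H) // cell[1],
                          np.arange(W) // cell[2], indexing="ij")
    n_y = -(-H // cell[1])  # number of cells along H (ceil division)
    n_x = -(-W // cell[2])  # number of cells along W
    return (t * n_y + y) * n_x + x  # unique integer id per cell

def propagate_points(supervoxels, points):
    """Expand sparse point annotations (t, y, x) into a dense boolean
    pseudo-mask by labelling every voxel whose region holds a point."""
    ids = {int(supervoxels[p]) for p in points}
    return np.isin(supervoxels, list(ids))
```

A single annotated point thus yields a dense (if coarse) mask over its whole spatio-temporal region, which is what lets weak annotations train a detector that normally expects pixel-level supervision.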