OmViD: Omni-supervised active learning for video action detection

📅 2025-08-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Video action detection relies on costly, dense spatiotemporal annotations, with substantial variation in sample difficulty. To optimize annotation efficiency, this paper proposes an active learning framework that first analyzes the actual granularity requirements—ranging from video-level labels and temporal points to scribbles, bounding boxes, and pixel-level masks—across diverse video samples. It then introduces a dynamic annotation-type selection strategy, leveraging spatiotemporal 3D superpixel segmentation to generate high-quality pseudo-labels, enabling unified modeling and joint training across multiple annotation granularities. The method supports progressive integration—from weak to strong supervision—within a single pipeline. Evaluated on UCF101-24 and JHMDB-21, it reduces annotation cost by up to 72% while maintaining near–fully supervised performance (mAP degradation of only 1.2–2.5%). This establishes a scalable, annotation-efficient paradigm for low-resource video understanding.

Technology Category

Application Category

📝 Abstract
Video action detection requires dense spatio-temporal annotations, which are both challenging and expensive to obtain. However, real-world videos often vary in difficulty and may not require the same level of annotation. This paper analyzes the appropriate annotation types for each sample and their impact on spatio-temporal video action detection. It focuses on two key aspects: 1) how to obtain varying levels of annotation for videos, and 2) how to learn action detection from different annotation types. The study explores video-level tags, points, scribbles, bounding boxes, and pixel-level masks. First, a simple active learning strategy is proposed to estimate the necessary annotation type for each video. Then, a novel spatio-temporal 3D-superpixel approach is introduced to generate pseudo-labels from these annotations, enabling effective training. The approach is validated on UCF101-24 and JHMDB-21 datasets, significantly cutting annotation costs with minimal performance loss.
Problem

Research questions and friction points this paper is trying to address.

Determining appropriate annotation types for video action detection
Learning action detection from varying annotation levels
Reducing annotation costs while maintaining detection performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Active learning strategy for annotation estimation
Spatio-temporal 3D-superpixel pseudo-label generation
Multi-level annotation integration for video action
🔎 Similar Papers
No similar papers found.