🤖 AI Summary
Video action detection relies on dense spatio-temporal annotations that are costly to obtain, yet samples vary substantially in difficulty and rarely need the same level of supervision. To improve annotation efficiency, this paper proposes an active learning framework that first estimates the annotation granularity each video actually needs, ranging from video-level labels and temporal points to scribbles, bounding boxes, and pixel-level masks. A dynamic annotation-type selection strategy is then combined with spatio-temporal 3D superpixel segmentation to generate high-quality pseudo-labels, enabling unified modeling and joint training across annotation granularities and supporting progressive integration from weak to strong supervision within a single pipeline. Evaluated on UCF101-24 and JHMDB-21, the method cuts annotation cost by up to 72% while staying close to fully supervised performance (an mAP drop of only 1.2-2.5%), establishing a scalable, annotation-efficient paradigm for low-resource video understanding.
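To make the selection idea concrete, here is a minimal sketch of mapping a per-sample uncertainty score to an annotation type. The threshold values, relative costs, and function names are illustrative assumptions, not the paper's actual policy; the paper's strategy estimates the needed type via active learning signals.

```python
# Hypothetical annotation types ordered from weakest/cheapest to
# strongest/most expensive; the relative costs below are illustrative
# placeholders, not figures from the paper.
ANNOTATION_TYPES = ["video_tag", "point", "scribble", "bbox", "pixel_mask"]
COSTS = {"video_tag": 1, "point": 2, "scribble": 5, "bbox": 10, "pixel_mask": 80}

def select_annotation_type(uncertainty, thresholds=(0.2, 0.4, 0.6, 0.8)):
    """Map a model uncertainty score in [0, 1] to an annotation type:
    confident (easy) samples get weak labels, hard samples get strong ones."""
    for level, t in enumerate(thresholds):
        if uncertainty < t:
            return ANNOTATION_TYPES[level]
    return ANNOTATION_TYPES[-1]  # hardest samples get pixel-level masks

def batch_annotation_cost(uncertainties):
    """Total annotation cost of a batch under the adaptive policy."""
    return sum(COSTS[select_annotation_type(u)] for u in uncertainties)
```

Under such a policy, only the hardest fraction of videos pays the full pixel-mask cost, which is where the overall annotation savings come from.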
📝 Abstract
Video action detection requires dense spatio-temporal annotations, which are both challenging and expensive to obtain. However, real-world videos often vary in difficulty and may not require the same level of annotation. This paper analyzes the appropriate annotation types for each sample and their impact on spatio-temporal video action detection. It focuses on two key aspects: 1) how to obtain varying levels of annotation for videos, and 2) how to learn action detection from different annotation types. The study explores video-level tags, points, scribbles, bounding boxes, and pixel-level masks. First, a simple active learning strategy is proposed to estimate the necessary annotation type for each video. Then, a novel spatio-temporal 3D-superpixel approach is introduced to generate pseudo-labels from these annotations, enabling effective training. The approach is validated on the UCF101-24 and JHMDB-21 datasets, significantly cutting annotation costs with minimal performance loss.
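The pseudo-label step can be sketched as follows: group voxels of the video volume into spatio-temporal regions, then expand sparse weak labels (e.g. points) into dense masks by labelling every voxel whose region contains an annotation. The regular grid cells below are a crude stand-in for the paper's 3D superpixels, which would also use appearance and motion cues; all names here are hypothetical.

```python
import numpy as np

def grid_supervoxels(shape, cell=(2, 4, 4)):
    """Partition a (T, H, W) video volume into regular 3D cells -- a crude
    stand-in for 3D superpixel segmentation (a real method would also group
    by appearance/motion, not just position)."""
    T, H, W = shape
    t, y, x = np.meshgrid(np.arange(T) // cell[0],
                          np.arange(H) // cell[1],
                          np.arange(W) // cell[2], indexing="ij")
    n_y = -(-H // cell[1])  # number of cells along H (ceil division)
    n_x = -(-W // cell[2])  # number of cells along W
    return (t * n_y + y) * n_x + x  # unique integer id per cell

def propagate_points(supervoxels, points):
    """Expand sparse point annotations (t, y, x) into a dense boolean
    pseudo-mask by labelling every voxel whose region holds a point."""
    ids = {int(supervoxels[p]) for p in points}
    return np.isin(supervoxels, list(ids))
```

A single annotated point thus yields a dense (if coarse) mask over its whole spatio-temporal region, which is what lets weak annotations train a detector that normally expects pixel-level supervision.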