AI Summary
Existing micro-action detection studies predominantly assume single-label, mutually exclusive events, overlooking the realistic scenario in which micro-actions (e.g., head and hand movements) frequently co-occur at low intensities. This work formally introduces the Multi-label Micro-Action Detection (MMAD) task, which requires simultaneously recognizing, temporally localizing (onset/offset), and classifying all concurrently occurring micro-actions in short videos. To support this task, we present MMA-52, the first benchmark dataset with fine-grained, multi-label temporal annotations for micro-actions. We propose a dual-path spatio-temporal adapter that explicitly models both short- and long-range micro-action dependencies, integrated with optical-flow-enhanced feature learning, a multi-label temporal localization network, and micro-action-specific feature distillation. On MMA-52, our method achieves a 12.3% mAP@0.5 gain over single-label baselines, substantially improving the modeling of overlapping micro-actions and enabling richer perceptual representations for downstream applications such as affective computing.
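The mAP@0.5 figure above is computed from temporal intersection-over-union (tIoU) between predicted and ground-truth action segments. The paper's exact evaluation code is not reproduced here; the following is a minimal illustrative sketch (the function names `temporal_iou` and `match_detections` are my own, not from the paper) of how predictions for one class can be matched to ground truth at a tIoU threshold of 0.5:

```python
def temporal_iou(a, b):
    """tIoU between two (onset, offset) segments, e.g. in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_detections(preds, gts, thresh=0.5):
    """Greedily match predictions to ground truth for one action class.

    preds: list of (onset, offset, score); gts: list of (onset, offset).
    Each ground-truth segment is matched at most once; returns the
    number of true positives at the given tIoU threshold.
    """
    used, tp = set(), 0
    for onset, offset, _ in sorted(preds, key=lambda p: -p[2]):
        best, best_iou = None, thresh
        for i, gt in enumerate(gts):
            if i in used:
                continue
            iou = temporal_iou((onset, offset), gt)
            if iou >= best_iou:
                best, best_iou = i, iou
        if best is not None:
            used.add(best)
            tp += 1
    return tp
```

In the multi-label setting this matching runs per class, so two micro-actions overlapping in time (say, a head action and a hand action) are scored independently rather than competing for a single label.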
Abstract
Human body actions are an important form of non-verbal communication in social interactions. This paper focuses on a subset of body actions known as micro-actions, which are subtle, low-intensity body movements with promising applications in human emotion analysis. In real-world scenarios, human micro-actions often temporally co-occur, with multiple micro-actions overlapping in time, such as concurrent head and hand movements. However, current research primarily focuses on recognizing individual micro-actions while overlooking their co-occurring nature. To address this gap, we propose a new task named Multi-label Micro-Action Detection (MMAD), which involves identifying all micro-actions in a given short video, determining their start and end times, and categorizing them. Accomplishing this requires a model capable of accurately capturing both long-term and short-term action relationships to detect multiple overlapping micro-actions. To facilitate the MMAD task, we introduce a new dataset named Multi-label Micro-Action-52 (MMA-52) and propose a baseline method equipped with a dual-path spatio-temporal adapter to address the challenge of subtle visual changes in MMAD. We hope that MMA-52 can stimulate research on micro-action analysis in videos and prompt the development of spatio-temporal modeling in human-centric video understanding. The proposed MMA-52 dataset is available at: https://github.com/VUT-HFUT/Micro-Action.
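The abstract motivates capturing both short-term and long-term action relationships. The paper's adapter architecture is not detailed here, so the following is only a toy sketch of the dual-path idea (all names are hypothetical): two 1-D temporal filters over a per-frame feature sequence, one with dilation 1 for short-range context and one with a larger dilation for long-range context, fused by element-wise sum.

```python
def conv1d(seq, kernel, dilation=1):
    """Valid 1-D convolution over a scalar feature sequence.

    Dilation widens the temporal receptive field without adding weights.
    """
    span = (len(kernel) - 1) * dilation
    return [
        sum(k * seq[t + i * dilation] for i, k in enumerate(kernel))
        for t in range(len(seq) - span)
    ]

def dual_path(seq):
    """Toy dual-path temporal model: a short-range path (dilation 1) and a
    long-range path (dilation 4) with the same smoothing kernel, fused by
    element-wise sum over the temporally aligned region."""
    short = conv1d(seq, [0.25, 0.5, 0.25], dilation=1)  # local context
    long_ = conv1d(seq, [0.25, 0.5, 0.25], dilation=4)  # wide context
    # The long path yields fewer outputs; crop the short path's centre to align.
    off = (len(short) - len(long_)) // 2
    return [s + l for s, l in zip(short[off:], long_)]
```

A real adapter would of course operate on learned multi-channel features inside a video backbone; the sketch only illustrates why two paths with different receptive fields can jointly cover brief micro-actions and their longer-range co-occurrence patterns.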