Deconstruct Complexity (DeComplex): A Novel Perspective on Tackling Dense Action Detection

📅 2025-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of detecting dense, co-occurring, and semantically ambiguous actions in untrimmed videos, this paper proposes a novel action-concept disentanglement paradigm: each action class is decomposed into two complementary conceptual components, static concepts (objects/scenes) and dynamic concepts (verbs/motions), which are modeled by dedicated, specialized subnetworks. A language-embedding-guided contrastive learning loss provides explicit supervision on inter-concept co-occurrence patterns, overcoming a limitation of conventional binary cross-entropy optimization, which treats each class independently. The two concept streams are trained under a multi-label joint optimization strategy. The method achieves state-of-the-art performance, with relative mAP improvements of 23.4% on Charades and 2.5% on MultiTHUMOS over prior approaches. The core contributions are twofold: (1) a principled decomposition of actions into interpretable, collaboratively modeled conceptual units; and (2) a language-vision contrastive learning framework that strengthens semantic alignment and contextual reasoning.

📝 Abstract
Dense action detection involves detecting multiple co-occurring actions in an untrimmed video while action classes are often ambiguous and represent overlapping concepts. To address this challenging task, we introduce a novel perspective inspired by how humans tackle complex tasks by breaking them into manageable sub-tasks. Instead of relying on a single network to address the entire problem, as in current approaches, we propose decomposing the problem into detecting the key concepts present in action classes, specifically detecting dense static concepts and detecting dense dynamic concepts, and assigning them to distinct, specialized networks. Furthermore, simultaneous actions in a video often exhibit interrelationships, and exploiting these relationships can improve performance. However, we argue that current networks fail to effectively learn these relationships due to their reliance on binary cross-entropy optimization, which treats each class independently. To address this limitation, we propose providing explicit supervision on co-occurring concepts during network optimization through a novel language-guided contrastive learning loss. Our extensive experiments demonstrate the superiority of our approach over state-of-the-art methods, achieving substantial relative improvements of 23.4% and 2.5% mAP on the challenging benchmark datasets Charades and MultiTHUMOS.
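The language-guided contrastive supervision described in the abstract can be sketched as a multi-positive contrastive loss: each frame's visual feature is pulled toward the text embeddings of all concepts active at that frame (its co-occurring labels) and pushed away from inactive ones. The sketch below is a plausible reading of that idea, not the paper's implementation; the function name, dimensions, and temperature are illustrative assumptions, and the random arrays stand in for real frame features and language embeddings:

```python
import numpy as np

def l2norm(x, axis=-1):
    """Unit-normalize vectors along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def language_guided_contrastive_loss(vis, txt, labels, tau=0.07):
    """Multi-positive contrastive loss between frames and concept embeddings.

    vis:    (T, D) per-frame visual features
    txt:    (C, D) language embeddings of the C concept classes
    labels: (T, C) multi-label matrix; labels[t, c] = 1 if concept c
            co-occurs at frame t
    tau:    softmax temperature (illustrative choice)
    """
    logits = l2norm(vis) @ l2norm(txt).T / tau          # (T, C) cosine / tau
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n_pos = labels.sum(axis=1)                          # positives per frame
    # Average the log-probability over each frame's active concepts,
    # skipping frames with no active concept.
    per_frame = -(labels * log_prob).sum(axis=1) / np.maximum(n_pos, 1)
    return per_frame[n_pos > 0].mean()

# Toy data: 8 frames, 5 concepts, 16-dim embeddings.
rng = np.random.default_rng(0)
vis = rng.normal(size=(8, 16))
txt = rng.normal(size=(5, 16))
labels = (rng.random((8, 5)) < 0.3).astype(float)
loss = language_guided_contrastive_loss(vis, txt, labels)
```

Because every active concept of a frame counts as a positive, minimizing this loss directly encourages features of frames with overlapping labels to align with the same language embeddings, which is the co-occurrence supervision that plain per-class binary cross-entropy lacks.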
Problem

Research questions and friction points this paper is trying to address.

Dense Action Recognition
Untrimmed Videos
Behavior Categorization
Innovation

Methods, ideas, or system contributions that make the work stand out.

DeComplex
Contrastive Learning
Decomposition of Dense Action Detection