Disentangled Concepts Speak Louder Than Words: Explainable Video Action Recognition

📅 2025-11-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video action recognition explanation methods struggle to disentangle motion dynamics from static spatial background features, while language-based explanations are inherently limited by the tacit nature of motion. To address this, we propose DANCE, the first framework that explicitly models motion as human pose sequences and decouples motion, object, and scene semantics. DANCE employs an ante-hoc concept bottleneck architecture: it jointly leverages pose estimation and large language models to extract interpretable, semantically grounded concepts, enforcing predictions to flow exclusively through this concept layer. This design substantially enhances explanation clarity and model transparency. Evaluated on four benchmark datasets, DANCE achieves both state-of-the-art interpretability and competitive accuracy. A user study confirms that its explanations are more intuitive and effectively support model debugging, editing, and failure attribution.

📝 Abstract
Effective explanations of video action recognition models should disentangle how movements unfold over time from the surrounding spatial context. However, existing methods based on saliency produce entangled explanations, making it unclear whether predictions rely on motion or spatial context. Language-based approaches offer structure but often fail to explain motions due to their tacit nature -- intuitively understood but difficult to verbalize. To address these challenges, we propose Disentangled Action aNd Context concept-based Explainable (DANCE) video action recognition, a framework that predicts actions through disentangled concept types: motion dynamics, objects, and scenes. We define motion dynamics concepts as human pose sequences. We employ a large language model to automatically extract object and scene concepts. Built on an ante-hoc concept bottleneck design, DANCE enforces prediction through these concepts. Experiments on four datasets -- KTH, Penn Action, HAA500, and UCF-101 -- demonstrate that DANCE significantly improves explanation clarity with competitive performance. We validate the superior interpretability of DANCE through a user study. Experimental results also show that DANCE is beneficial for model debugging, editing, and failure analysis.
Problem

Research questions and friction points this paper is trying to address.

Disentangling motion dynamics from spatial context in video action recognition
Addressing limitations of saliency and language-based explanation methods
Providing interpretable explanations through motion, object, and scene concepts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangles motion dynamics from spatial context
Uses pose sequences and LLM-extracted concepts
Implements ante-hoc concept bottleneck design
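
To make the ante-hoc concept bottleneck idea concrete, here is a minimal sketch of such an architecture: each feature stream is projected only onto its own concept type (motion, object, scene), and the classifier sees nothing but the concatenated concept scores. All dimensions, weights, and function names below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not taken from the paper.
D_POSE, D_FRAME = 34, 128                 # pose-sequence / frame feature sizes
N_MOTION, N_OBJECT, N_SCENE = 8, 12, 6    # concepts per type
N_CLASSES = 5

# Independent "concept heads": each feature stream is projected only
# onto its own concept type, keeping the three types disentangled.
W_motion = rng.normal(size=(D_POSE, N_MOTION))
W_object = rng.normal(size=(D_FRAME, N_OBJECT))
W_scene = rng.normal(size=(D_FRAME, N_SCENE))

# The classifier sees nothing but the concept layer (the "bottleneck"),
# so every prediction must flow through interpretable concepts.
N_CONCEPTS = N_MOTION + N_OBJECT + N_SCENE
W_cls = rng.normal(size=(N_CONCEPTS, N_CLASSES))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def predict(pose_feat, frame_feat):
    """Return (class logits, concept scores) for one video."""
    concepts = np.concatenate([
        sigmoid(pose_feat @ W_motion),    # motion-dynamics concepts
        sigmoid(frame_feat @ W_object),   # object concepts
        sigmoid(frame_feat @ W_scene),    # scene concepts
    ])
    return concepts @ W_cls, concepts


logits, concepts = predict(rng.normal(size=D_POSE), rng.normal(size=D_FRAME))

# Because each class logit is a weighted sum over named concepts, the
# per-concept contribution to the predicted class is directly readable.
contrib = concepts * W_cls[:, np.argmax(logits)]
```

This is also why the design supports editing and debugging: zeroing a row of `W_cls` removes one concept's influence on every class without retraining the feature extractors.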