🤖 AI Summary
Existing explanation methods for video action recognition struggle to separate motion dynamics from static spatial context, and language-based explanations are limited by the tacit nature of motion, which is intuitively understood but hard to verbalize. To address this, we propose DANCE, a framework that represents motion as human pose sequences and disentangles motion, object, and scene concepts. DANCE uses an ante-hoc concept bottleneck architecture: motion concepts are defined as human pose sequences, object and scene concepts are extracted automatically with a large language model, and every prediction must pass through this concept layer. Evaluated on four benchmark datasets (KTH, Penn Action, HAA500, and UCF-101), DANCE significantly improves explanation clarity while keeping accuracy competitive. A user study supports its interpretability, and further experiments show that it helps with model debugging, editing, and failure analysis.
📝 Abstract
Effective explanations of video action recognition models should disentangle how movements unfold over time from the surrounding spatial context. However, existing methods based on saliency produce entangled explanations, making it unclear whether predictions rely on motion or spatial context. Language-based approaches offer structure but often fail to explain motions due to their tacit nature: motions are intuitively understood but difficult to verbalize. To address these challenges, we propose Disentangled Action aNd Context concept-based Explainable (DANCE) video action recognition, a framework that predicts actions through disentangled concept types: motion dynamics, objects, and scenes. We define motion dynamics concepts as human pose sequences. We employ a large language model to automatically extract object and scene concepts. Built on an ante-hoc concept bottleneck design, DANCE enforces prediction through these concepts. Experiments on four datasets (KTH, Penn Action, HAA500, and UCF-101) demonstrate that DANCE significantly improves explanation clarity with competitive performance. We validate the superior interpretability of DANCE through a user study. Experimental results also show that DANCE is beneficial for model debugging, editing, and failure analysis.
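
The concept bottleneck idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class name, concept names, weights, and activation values are all hypothetical, and a real system would obtain the activations from pose estimation and learned concept detectors rather than hard-coded numbers. The sketch only shows the structural point that the classifier sees concept activations alone, so each prediction can be split into motion, object, and scene contributions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical grouping that mirrors the three concept types named in the abstract.
CONCEPT_TYPES = ("motion", "object", "scene")


@dataclass
class ConceptBottleneckHead:
    """Toy linear classifier that only sees concept activations (assumed design)."""

    class_names: List[str]
    # concept type -> weight matrix of shape [n_classes][n_concepts_of_that_type]
    weights: Dict[str, List[List[float]]]

    def logits_by_type(self, activations: Dict[str, List[float]]) -> Dict[str, List[float]]:
        """Return, for each concept type, its contribution to every class logit."""
        contributions = {}
        for ctype in CONCEPT_TYPES:
            acts = activations[ctype]
            contributions[ctype] = [
                sum(w * a for w, a in zip(row, acts)) for row in self.weights[ctype]
            ]
        return contributions

    def predict(self, activations: Dict[str, List[float]]) -> Tuple[str, Dict[str, float]]:
        """Predict a class and report how much each concept type contributed to it."""
        per_type = self.logits_by_type(activations)
        totals = [
            sum(per_type[t][c] for t in CONCEPT_TYPES)
            for c in range(len(self.class_names))
        ]
        best = max(range(len(totals)), key=totals.__getitem__)
        explanation = {t: per_type[t][best] for t in CONCEPT_TYPES}
        return self.class_names[best], explanation


# Illustrative values only: two classes, two concepts per type.
head = ConceptBottleneckHead(
    class_names=["tennis swing", "soccer kick"],
    weights={
        "motion": [[2.0, 0.0], [0.0, 2.0]],  # e.g. arm-swing vs. leg-kick pose sequences
        "object": [[1.0, 0.0], [0.0, 1.0]],  # e.g. racket vs. ball
        "scene": [[0.5, 0.0], [0.0, 0.5]],   # e.g. court vs. field
    },
)
activations = {
    "motion": [0.9, 0.1],
    "object": [0.8, 0.2],
    "scene": [0.3, 0.7],
}
label, explanation = head.predict(activations)
print(label, explanation)
```

In this toy input the scene activations favor the second class, yet the motion and object contributions dominate the decision. Because the contributions are kept per concept type, that kind of disagreement is visible directly, which is the sort of motion-versus-context question the abstract says saliency-based explanations leave unclear.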