🤖 AI Summary
This work addresses the lack of interpretability in existing skeleton-based action recognition models, which typically operate as black boxes. The authors propose a concept-driven interpretable framework that reformulates action recognition as first-order logical reasoning grounded in motion primitives. Their approach employs a spatio-temporal skeleton encoder and a concept decoder to learn differentiable spatio-temporal motion concepts, which are instantiated as logical predicates. By using a large language model to align atomic action semantics, the method constructs a shared conceptual space bridging perception and reasoning. The work is presented as the first effort to incorporate differentiable first-order logic into skeleton-based action recognition, achieving competitive accuracy on the NTU RGB+D 60/120 and NW-UCLA benchmarks while generating human-readable logical rules that enable explicit, interpretable action understanding.
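As a rough illustration of what a shared conceptual space could look like, the sketch below scores a skeleton embedding against text embeddings of motion-primitive descriptions and squashes each similarity into a soft truth value. It is a minimal sketch under stated assumptions, not the authors' implementation: the concept names (`arm_raised`, `knee_bent`), the toy vectors, the `temperature` parameter, and the choice of cosine similarity followed by a sigmoid are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def concept_scores(skeleton_emb, concept_embs, temperature=0.1):
    """Turn similarities into soft predicate truth values in (0, 1).

    Assumption: a temperature-scaled sigmoid is used here for illustration only.
    """
    scores = {}
    for name, emb in concept_embs.items():
        sim = cosine(skeleton_emb, emb)
        scores[name] = 1.0 / (1.0 + math.exp(-sim / temperature))
    return scores


# Hypothetical toy embeddings standing in for LLM-derived text embeddings.
concept_embs = {
    "arm_raised": [1.0, 0.0, 0.0],
    "knee_bent": [0.0, 1.0, 0.0],
}

# Hypothetical output of a skeleton encoder for one clip.
skeleton_emb = [0.9, 0.1, 0.0]

scores = concept_scores(skeleton_emb, concept_embs)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this toy setup the clip embedding lies close to the `arm_raised` description, so that predicate receives the higher soft truth value. Such values are the kind of input a downstream logic layer could consume.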
📝 Abstract
Skeleton-based human activity recognition (HAR) has achieved strong empirical performance, yet most existing models remain black boxes that are difficult to interpret. In this work, we introduce a neurosymbolic formulation of skeleton-based HAR that reframes action recognition as concept-driven first-order logical reasoning over motion primitives. Our framework bridges representation learning and symbolic inference by grounding first-order logic predicates in learnable spatial and temporal motion concepts. Specifically, we employ a standard spatio-temporal skeleton encoder to extract latent motion representations, which are then mapped to interpretable concept predicates via a spatio-temporal concept decoder that explicitly separates pose-centric and dynamics-centric abstractions. These concept predicates are composed through differentiable first-order logic layers, enabling the model to learn human-readable logical rules that govern action semantics. To impose semantic structure on the learned concepts, we align skeleton representations with LLM-derived descriptions of atomic motion primitives, establishing a shared conceptual space for perception and reasoning. Extensive experiments on NTU RGB+D 60/120 and NW-UCLA demonstrate that our approach achieves competitive recognition performance while providing explicit, interpretable explanations grounded in logical structure. Our results highlight neurosymbolic reasoning as an effective paradigm for interpretable spatio-temporal action understanding. Code: https://github.com/Mr-TalhaIlyas/REASON
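To make the idea of composing concept predicates through differentiable logic more concrete, here is a minimal sketch of one common relaxation: a weighted product t-norm for conjunction and a probabilistic sum for disjunction, with existential quantification over frames expressed as a soft OR. The abstract does not specify which relaxation the paper uses, so the operators, the predicate names, the rule for "wave", and the per-frame values below are all assumptions chosen for illustration.

```python
def soft_and(values, weights):
    """Weighted product t-norm.

    Each weight in [0, 1] says how strongly a predicate belongs to the rule;
    a weight of 0 turns its factor into 1, dropping it from the conjunction.
    """
    result = 1.0
    for x, w in zip(values, weights):
        result *= 1.0 - w * (1.0 - x)
    return result


def soft_or(values):
    """Probabilistic sum: 1 - prod(1 - x)."""
    result = 1.0
    for x in values:
        result *= 1.0 - x
    return 1.0 - result


def exists_over_time(per_frame_values):
    """Soft existential quantifier over frames, realised as a soft OR."""
    return soft_or(per_frame_values)


# Hypothetical per-frame truth values of three concept predicates (3 frames).
arm_raised = [0.1, 0.8, 0.9]
hand_oscillates = [0.0, 0.7, 0.9]
knee_bent = [0.2, 0.1, 0.1]

# Hypothetical rule: wave <- exists t. arm_raised(t) AND hand_oscillates(t).
# The zero weight on knee_bent excludes it from the rule body.
weights = [1.0, 1.0, 0.0]

per_frame = [
    soft_and([a, h, k], weights)
    for a, h, k in zip(arm_raised, hand_oscillates, knee_bent)
]
wave_score = exists_over_time(per_frame)
print(f"wave score: {wave_score:.4f}")
```

Because every operation is a product or a difference, the rule score is differentiable with respect to both the predicate values and the weights, which is what would let such weights be trained by gradient descent. Weights that converge near 0 or 1 can then be read off as a discrete, human-readable rule.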