🤖 AI Summary
This work addresses the challenge of insufficient discriminability in fine-grained human action recognition from RGB-only videos, where subtle spatiotemporal differences between actions hinder performance. To this end, we propose TAG-Head, a lightweight, plug-and-play spatiotemporal graph head that explicitly couples high-resolution spatial interactions, low-variance temporal continuity, and global context. TAG-Head pairs a Transformer encoder with learnable 3D positional encodings to model global spatiotemporal dependencies, then refines the resulting features with a graph of fully connected intra-frame edges and temporally aligned inter-frame edges. It integrates seamlessly with mainstream 3D backbones and supports end-to-end training. Despite its minimal parameter count and computational overhead, TAG-Head achieves state-of-the-art results among RGB-only methods on FineGym and HAA500, even outperforming multimodal approaches that rely on privileged information such as pose or textual annotations.
📝 Abstract
Fine-grained human action recognition (FHAR) is challenging because visually similar actions differ only by subtle spatio-temporal cues. Many recent systems enhance discriminability with extra modalities (e.g., pose, text, optical flow), but this increases annotation burden and computational cost. We introduce TAG-Head, a lightweight spatio-temporal graph head that upgrades standard 3D backbones (SlowFast, R(2+1)D-34, I3D, etc.) for FHAR using RGB only. Our pipeline first applies a Transformer encoder with learnable 3D positional encodings to the backbone tokens, capturing long-range dependencies across space and time. The resulting features are then refined by a graph with (i) fully connected intra-frame edges that resolve subtle appearance differences within frames, and (ii) time-aligned temporal edges that connect features at the same spatial location across frames, stabilising motion cues without over-smoothing. The head is compact (little parameter/FLOP overhead), plug-and-play across backbones, and trained end-to-end with the backbone. Extensive evaluations on FineGym (Gym99 and Gym288) and HAA500 show that TAG-Head sets a new state-of-the-art among RGB-only models and surpasses many recent multimodal approaches (video + pose + text) that rely on privileged information. Ablations disentangle the contributions of the Transformer and the graph topology, and complexity analyses confirm low latency. TAG-Head advances FHAR by explicitly coupling global context with high-resolution spatial interactions and low-variance temporal continuity inside a slim, composable graph head. The simplicity of the design enables straightforward adoption in practical systems that favour RGB-only sensors, while delivering performance gains typically associated with heavier or multimodal models. Code will be released on GitHub.
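To make the graph topology concrete, the sketch below builds the adjacency structure described in the abstract: fully connected edges among tokens of the same frame, plus temporal edges linking the same spatial position in adjacent frames. The function name `build_tag_adjacency` and the choices of no self-loops and adjacent-frame-only temporal links are illustrative assumptions; the paper's exact construction may differ.

```python
def build_tag_adjacency(num_frames, tokens_per_frame):
    """Sketch of a TAG-Head-style binary adjacency matrix (assumed form).

    Nodes are backbone tokens indexed as (frame t, spatial position n).
    Edges:
      (i)  fully connected intra-frame: every pair of distinct tokens
           within the same frame (no self-loops, an assumption here);
      (ii) time-aligned inter-frame: the same spatial position n in
           adjacent frames t and t+1 (adjacency-only, also an assumption).
    """
    num_nodes = num_frames * tokens_per_frame
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    idx = lambda t, n: t * tokens_per_frame + n  # flatten (t, n) -> node id

    for t in range(num_frames):
        # (i) fully connected intra-frame edges
        for i in range(tokens_per_frame):
            for j in range(tokens_per_frame):
                if i != j:
                    adj[idx(t, i)][idx(t, j)] = 1
        # (ii) temporal edges to the same spatial location in the next frame
        if t + 1 < num_frames:
            for n in range(tokens_per_frame):
                adj[idx(t, n)][idx(t + 1, n)] = 1
                adj[idx(t + 1, n)][idx(t, n)] = 1
    return adj
```

Keeping the temporal edges restricted to one spatial position per frame pair is what gives the "low-variance temporal continuity" property: motion cues are aggregated along a single trajectory of aligned features rather than over all token pairs, which would over-smooth them.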