🤖 AI Summary
This work addresses the challenge of insufficient discriminability in fine-grained human action recognition from RGB-only videos, where subtle spatiotemporal differences between actions hinder performance. To this end, we propose TAG-Head, a lightweight, plug-and-play spatiotemporal graph head that explicitly couples high-resolution spatial interactions, low-variance temporal continuity, and global context. TAG-Head pairs a Transformer encoder with learnable 3D positional encodings to model global spatiotemporal dependencies, then refines the resulting features with a graph of fully connected intra-frame edges and temporally aligned inter-frame edges. It integrates seamlessly with mainstream 3D backbones and supports end-to-end training. Despite its minimal parameter count and computational overhead, TAG-Head achieves state-of-the-art results among RGB-only methods on FineGym and HAA500, even outperforming multimodal approaches that rely on privileged information such as pose or textual annotations.
📝 Abstract
Fine-grained human action recognition (FHAR) is challenging because visually similar actions differ only by subtle spatio-temporal cues. Many recent systems enhance discriminability with extra modalities (e.g., pose, text, optical flow), but this increases annotation burden and computational cost. We introduce TAG-Head, a lightweight spatio-temporal graph head that upgrades standard 3D backbones (SlowFast, R(2+1)D-34, I3D, etc.) for FHAR using RGB only. Our pipeline first applies a Transformer encoder with learnable 3D positional encodings to the backbone tokens, capturing long-range dependencies across space and time. The resulting features are then refined by a graph with (i) fully connected intra-frame edges that resolve subtle appearance differences within frames, and (ii) time-aligned temporal edges that connect features at the same spatial location across frames, stabilising motion cues without over-smoothing. The head is compact (little parameter/FLOP overhead), plug-and-play across backbones, and trained end-to-end with the backbone. Extensive evaluations on FineGym (Gym99 and Gym288) and HAA500 show that TAG-Head sets a new state-of-the-art among RGB-only models and surpasses many recent multimodal approaches (video + pose + text) that rely on privileged information. Ablations disentangle the contributions of the Transformer and the graph topology, and complexity analyses confirm low latency. TAG-Head advances FHAR by explicitly coupling global context with high-resolution spatial interactions and low-variance temporal continuity inside a slim, composable graph head. The simplicity of the design enables straightforward adoption in practical systems that favour RGB-only sensors, while delivering performance gains typically associated with heavier or multimodal models. Code will be released on GitHub.
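To make the graph topology concrete, the sketch below builds the adjacency structure described in the abstract: fully connected edges among tokens of the same frame, plus temporal edges linking the same spatial position in adjacent frames. The function name `build_tag_adjacency` and the choices of no self-loops and adjacent-frame-only temporal links are illustrative assumptions; the paper's exact construction may differ.

```python
def build_tag_adjacency(num_frames, tokens_per_frame):
    """Sketch of a TAG-Head-style binary adjacency matrix (assumed form).

    Nodes are backbone tokens indexed as (frame t, spatial position n).
    Edges:
      (i)  fully connected intra-frame: every pair of distinct tokens
           within the same frame (no self-loops, an assumption here);
      (ii) time-aligned inter-frame: the same spatial position n in
           adjacent frames t and t+1 (adjacency-only, also an assumption).
    """
    num_nodes = num_frames * tokens_per_frame
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    idx = lambda t, n: t * tokens_per_frame + n  # flatten (t, n) -> node id

    for t in range(num_frames):
        # (i) fully connected intra-frame edges
        for i in range(tokens_per_frame):
            for j in range(tokens_per_frame):
                if i != j:
                    adj[idx(t, i)][idx(t, j)] = 1
        # (ii) temporal edges to the same spatial location in the next frame
        if t + 1 < num_frames:
            for n in range(tokens_per_frame):
                adj[idx(t, n)][idx(t + 1, n)] = 1
                adj[idx(t + 1, n)][idx(t, n)] = 1
    return adj
```

Keeping the temporal edges restricted to one spatial position per frame pair is what gives the "low-variance temporal continuity" property: motion cues are aggregated along a single trajectory of aligned features rather than over all token pairs, which would over-smooth them.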