🤖 AI Summary
Existing methods for Precise Event Spotting (PES) in sports videos pair 2D CNN feature extractors with lightweight temporal modules such as the Gate Shift Module (GSM) or Gate Shift Fuse (GSF), but these modules have limited temporal receptive fields and weak spatial adaptability. To address this, the authors propose the Multi-Scale Attention Gate Shift Module (MSAGSM), which extends GSM with multi-scale dilated temporal modeling and multi-head spatial attention, jointly capturing short- and long-term dependencies while focusing on salient regions. They also introduce and release the Table Tennis Australia (TTA) dataset, the first PES benchmark for table tennis, containing over 4800 precisely annotated events. Across five PES benchmarks, MSAGSM consistently improves performance with minimal computational overhead, setting new state-of-the-art results. The module is lightweight, plug-and-play, and compatible with various 2D CNN backbones.
📝 Abstract
Precise Event Spotting (PES) in sports videos requires frame-level recognition of fine-grained actions from single-camera footage. Existing PES models typically incorporate lightweight temporal modules such as Gate Shift Module (GSM) or Gate Shift Fuse (GSF) to enrich 2D CNN feature extractors with temporal context. However, these modules are limited in both temporal receptive field and spatial adaptability. We propose a Multi-Scale Attention Gate Shift Module (MSAGSM) that enhances GSM with multi-scale temporal dilations and multi-head spatial attention, enabling efficient modeling of both short- and long-term dependencies while focusing on salient regions. MSAGSM is a lightweight plug-and-play module that can be easily integrated with various 2D backbones. To further advance the field, we introduce the Table Tennis Australia (TTA) dataset, the first PES benchmark for table tennis, containing over 4800 precisely annotated events. Extensive experiments across five PES benchmarks demonstrate that MSAGSM consistently improves performance with minimal overhead, setting new state-of-the-art results.
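To give a sense of the core mechanism, here is a minimal NumPy sketch of a gated temporal shift with multi-scale dilations, the GSM-style idea that MSAGSM builds on. This is an illustrative toy (the function name, group assignment, and fixed gate are assumptions, not the authors' implementation, which also adds multi-head spatial attention and learns the gates):

```python
import numpy as np

def gated_multiscale_shift(x, gate, dilations=(1, 2, 4)):
    """Illustrative sketch (not the paper's code): shift channel groups
    forward/backward in time at several dilations, then blend shifted and
    original features with a per-channel gate in [0, 1].

    x    : (T, C) array of frame features (T frames, C channels)
    gate : (C,) per-channel blending weights (learned in the real module)
    """
    T, C = x.shape
    out = x.copy()
    # Split channels into one forward- and one backward-shifted group per dilation.
    groups = np.array_split(np.arange(C), 2 * len(dilations))
    for i, d in enumerate(dilations):
        fwd, bwd = groups[2 * i], groups[2 * i + 1]
        shifted = np.zeros_like(x)
        shifted[d:, fwd] = x[:-d, fwd]    # shift forward in time by d frames
        shifted[:-d, bwd] = x[d:, bwd]    # shift backward in time by d frames
        cols = np.concatenate([fwd, bwd])
        out[:, cols] = gate[cols] * shifted[:, cols] + (1 - gate[cols]) * x[:, cols]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 12))   # 8 frames, 12 channels
gate = np.full(12, 0.5)        # toy fixed gate; learned in practice
y = gated_multiscale_shift(x, gate)
print(y.shape)                 # (8, 12)
```

Larger dilations exchange information across more distant frames, which is how the multi-scale design widens the temporal receptive field without extra parameters beyond the gates.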