🤖 AI Summary
Existing methods for Precise Event Spotting (PES) in sports videos pair 2D CNN feature extractors with lightweight temporal modules such as the Gate Shift Module (GSM) or Gate Shift Fuse (GSF), but these modules have limited temporal receptive fields and weak spatial adaptability. To address this, the authors propose the Multi-Scale Attention Gate Shift Module (MSAGSM), which extends GSM with multi-scale dilated temporal modeling and multi-head spatial attention, jointly capturing short- and long-term dependencies while focusing on salient regions. They also introduce and release the Table Tennis Australia (TTA) dataset, the first PES benchmark for table tennis, containing over 4800 precisely annotated events. Across five PES benchmarks, MSAGSM consistently improves performance with minimal computational overhead, setting new state-of-the-art results. The module is lightweight, plug-and-play, and compatible with various 2D CNN backbones.
📝 Abstract
Precise Event Spotting (PES) in sports videos requires frame-level recognition of fine-grained actions from single-camera footage. Existing PES models typically incorporate lightweight temporal modules such as Gate Shift Module (GSM) or Gate Shift Fuse (GSF) to enrich 2D CNN feature extractors with temporal context. However, these modules are limited in both temporal receptive field and spatial adaptability. We propose a Multi-Scale Attention Gate Shift Module (MSAGSM) that enhances GSM with multi-scale temporal dilations and multi-head spatial attention, enabling efficient modeling of both short- and long-term dependencies while focusing on salient regions. MSAGSM is a lightweight plug-and-play module that can be easily integrated with various 2D backbones. To further advance the field, we introduce the Table Tennis Australia (TTA) dataset, the first PES benchmark for table tennis, containing over 4800 precisely annotated events. Extensive experiments across five PES benchmarks demonstrate that MSAGSM consistently improves performance with minimal overhead, setting new state-of-the-art results.
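To give a sense of the core mechanism, here is a minimal NumPy sketch of a gated temporal shift with multi-scale dilations, the GSM-style idea that MSAGSM builds on. This is an illustrative toy (the function name, group assignment, and fixed gate are assumptions, not the authors' implementation, which also adds multi-head spatial attention and learns the gates):

```python
import numpy as np

def gated_multiscale_shift(x, gate, dilations=(1, 2, 4)):
    """Illustrative sketch (not the paper's code): shift channel groups
    forward/backward in time at several dilations, then blend shifted and
    original features with a per-channel gate in [0, 1].

    x    : (T, C) array of frame features (T frames, C channels)
    gate : (C,) per-channel blending weights (learned in the real module)
    """
    T, C = x.shape
    out = x.copy()
    # Split channels into one forward- and one backward-shifted group per dilation.
    groups = np.array_split(np.arange(C), 2 * len(dilations))
    for i, d in enumerate(dilations):
        fwd, bwd = groups[2 * i], groups[2 * i + 1]
        shifted = np.zeros_like(x)
        shifted[d:, fwd] = x[:-d, fwd]    # shift forward in time by d frames
        shifted[:-d, bwd] = x[d:, bwd]    # shift backward in time by d frames
        cols = np.concatenate([fwd, bwd])
        out[:, cols] = gate[cols] * shifted[:, cols] + (1 - gate[cols]) * x[:, cols]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 12))   # 8 frames, 12 channels
gate = np.full(12, 0.5)        # toy fixed gate; learned in practice
y = gated_multiscale_shift(x, gate)
print(y.shape)                 # (8, 12)
```

Larger dilations exchange information across more distant frames, which is how the multi-scale design widens the temporal receptive field without extra parameters beyond the gates.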