🤖 AI Summary
This work addresses fine-grained temporal action localization in professional badminton videos, where complex spatio-temporal dynamics make visually similar strokes hard to tell apart. The authors propose the Decoupling Spatio-Temporal Adapter (DSTA), which decouples motion modeling into three parallel branches that separately capture temporal dynamics, vertical spatial variations, and horizontal spatial variations. Because DSTA operates within a parameter-efficient framework, it models subtle motion differences at only a marginal cost in parameters and computation. Evaluated on the newly curated Fine-Badminton dataset and the established ShuttleSet benchmark, the method achieves state-of-the-art performance and improves discrimination among highly similar actions.
📝 Abstract
Temporal Action Localization (TAL) has been extensively studied in generic video understanding, whereas fine-grained sports scenarios such as professional badminton remain underexplored because of their complex and subtle spatio-temporal dynamics. In this paper, we focus on fine-grained TAL in professional badminton videos and introduce a new benchmark dataset, Fine-Badminton, which consists of 31 matches with 29 fine-grained stroke categories, covering 2,104 rallies and 27,597 annotated actions. To capture the intricate motion patterns in such scenarios, we propose a Decoupling Spatio-Temporal Adapter (DSTA), which models spatio-temporal features within a parameter-efficient framework. Specifically, DSTA decomposes motion representation into three parallel branches that capture temporal dynamics, vertical spatial variations, and horizontal spatial variations, respectively. This design allows the model to better distinguish subtle differences among fine-grained actions. Extensive experiments on both the Fine-Badminton dataset and the ShuttleSet benchmark demonstrate that the proposed method achieves state-of-the-art performance while introducing only a marginal increase in computational and parameter cost. These results validate the effectiveness and efficiency of the proposed approach for fine-grained temporal action localization.
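To make the three-branch decomposition concrete, the following is a minimal forward-pass sketch in NumPy. It is not the authors' implementation: the abstract does not specify how the branches are built, so everything here is an assumption made for illustration, including the bottleneck projection, the 3-tap depthwise filters along each axis, the summation of branch outputs, the ReLU, the zero-initialized up-projection, and all names (`axis_conv3`, `DecoupledAdapterSketch`, `bottleneck`).

```python
import numpy as np


def axis_conv3(x, kernel, axis):
    """Depthwise 3-tap convolution along one axis, zero-padded.

    x has shape (T, H, W, C); kernel has shape (3, C), one filter per channel.
    """
    pad = [(0, 0)] * x.ndim
    pad[axis] = (1, 1)
    xp = np.pad(x, pad)
    n = x.shape[axis]
    out = np.zeros_like(x)
    for k, w in enumerate(kernel):
        # Slice k selects the input shifted by (k - 1) positions along `axis`.
        out += w * np.take(xp, np.arange(k, k + n), axis=axis)
    return out


class DecoupledAdapterSketch:
    """Hypothetical adapter: down-project, three parallel axis branches, up-project."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        self.down = rng.normal(0.0, 0.02, (dim, bottleneck))
        # Assumption: zero-initialized up-projection, so the adapter starts
        # as an identity mapping around the frozen backbone features.
        self.up = np.zeros((bottleneck, dim))
        self.k_t = rng.normal(0.0, 0.02, (3, bottleneck))  # temporal branch
        self.k_h = rng.normal(0.0, 0.02, (3, bottleneck))  # vertical branch
        self.k_w = rng.normal(0.0, 0.02, (3, bottleneck))  # horizontal branch

    def num_params(self):
        weights = (self.down, self.up, self.k_t, self.k_h, self.k_w)
        return sum(w.size for w in weights)

    def __call__(self, x):
        z = x @ self.down                      # (T, H, W, bottleneck)
        z = (axis_conv3(z, self.k_t, axis=0)   # motion over time
             + axis_conv3(z, self.k_h, axis=1)  # variation along height
             + axis_conv3(z, self.k_w, axis=2))  # variation along width
        z = np.maximum(z, 0.0)                 # ReLU, chosen for simplicity
        return x + z @ self.up                 # residual connection


features = np.random.default_rng(1).normal(size=(4, 3, 5, 8))  # (T, H, W, C)
adapter = DecoupledAdapterSketch(dim=8, bottleneck=4)
adapted = adapter(features)
```

Under these assumptions the adapter adds `2 * dim * bottleneck + 9 * bottleneck` weights, which stays small relative to a backbone layer when `bottleneck` is much smaller than `dim`. This is one way a three-branch design could keep the parameter increase marginal, though the paper's actual layer configuration may differ.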