Shot2Tactic-Caption: Multi-Scale Captioning of Badminton Videos for Tactical Understanding

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses multi-scale tactical understanding in badminton match videos, proposing the first dual-branch, multi-scale captioning framework capable of generating both shot-level action descriptions and tactic-level captions that describe how tactics unfold over time. Methodologically, each branch pairs a ResNet50 visual encoder with a spatio-temporal Transformer encoder to extract multi-granularity features, and a prompt-guided cross-attention decoder explicitly models the state and temporal coherence of tactical units (e.g., rally interruptions/resumptions). Contributions include: (1) the first multi-scale badminton captioning dataset, comprising 5,494 shot-level and 544 tactic-level captions; and (2) the first end-to-end model that jointly detects tactic units and generates coherent, multi-scale semantic captions. Experiments demonstrate significant improvements over existing baselines on BLEU, METEOR, and human evaluation.

📝 Abstract
Tactical understanding in badminton involves interpreting not only individual actions but also how tactics are dynamically executed over time. In this paper, we propose **Shot2Tactic-Caption**, a novel framework for semantic and temporal multi-scale video captioning in badminton, capable of generating shot-level captions that describe individual actions and tactic-level captions that capture how these actions unfold over time within a tactical execution. We also introduce the Shot2Tactic-Caption Dataset, the first badminton captioning dataset containing 5,494 shot captions and 544 tactic captions. Shot2Tactic-Caption adopts a dual-branch design, with both branches including a visual encoder, a spatio-temporal Transformer encoder, and a Transformer-based decoder to generate shot and tactic captions. To support tactic captioning, we additionally introduce a Tactic Unit Detector that identifies valid tactic units, tactic types, and tactic states (e.g., Interrupt, Resume). For tactic captioning, we further incorporate a shot-wise prompt-guided mechanism, where the predicted tactic type and state are embedded as prompts and injected into the decoder via cross-attention. The shot-wise prompt-guided mechanism enables our system not only to describe successfully executed tactics but also to capture tactical executions that are temporarily interrupted and later resumed. Experimental results demonstrate the effectiveness of our framework in generating both shot and tactic captions. Ablation studies show that the ResNet50-based spatio-temporal encoder outperforms other variants, and that shot-wise prompt structuring leads to more coherent and accurate tactic captioning.
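The core of the prompt-guided mechanism is that decoder queries cross-attend to visual features augmented with tactic-prompt embeddings. The NumPy sketch below is a minimal illustration of that idea, not the paper's implementation: all shapes, names, and the concatenation strategy are our assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prompt_guided_cross_attention(queries, visual_feats, prompt_embs):
    """Single-head cross-attention where decoder queries attend jointly to
    visual tokens and prompt embeddings (tactic type + state).
    A loose sketch of the paper's shot-wise prompt injection."""
    memory = np.concatenate([visual_feats, prompt_embs], axis=0)  # (T+P, d)
    d = queries.shape[-1]
    scores = queries @ memory.T / np.sqrt(d)   # (Q, T+P) scaled dot products
    weights = softmax(scores, axis=-1)         # attention over tokens+prompts
    return weights @ memory                    # (Q, d) attended context

# toy example: 4 decoder steps, 8 visual tokens, 2 prompt tokens, dim 16
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 16))
v = rng.normal(size=(8, 16))
p = rng.normal(size=(2, 16))   # e.g. embeddings of tactic type and state
out = prompt_guided_cross_attention(q, v, p)
print(out.shape)  # (4, 16)
```

Concatenating prompts into the attention memory is one plausible way to inject them; the paper's decoder may instead keep prompts in a separate cross-attention stream.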
Problem

Research questions and friction points this paper is trying to address.

Generating multi-scale tactical captions for badminton videos
Describing individual actions and dynamic tactic executions over time
Detecting tactic units, types, and states for coherent tactical analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-branch design for multi-scale video captioning
Tactic Unit Detector identifies tactical elements
Shot-wise prompt mechanism enhances tactic description
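To make the Tactic Unit Detector's role concrete, the sketch below encodes a tactic unit with the two states the abstract names (Interrupt, Resume) plus an assumed default, and builds a shot-wise textual prompt from type and state. Field names, the extra state, and the prompt format are our illustrative assumptions, not the paper's.

```python
from dataclasses import dataclass
from enum import Enum

class TacticState(Enum):
    EXECUTING = "Executing"    # assumed default state (not named in the paper)
    INTERRUPT = "Interrupt"    # tactic temporarily broken off
    RESUME = "Resume"          # interrupted tactic picked up again

@dataclass
class TacticUnit:
    start_shot: int            # index of the first shot in the unit
    end_shot: int              # index of the last shot in the unit
    tactic_type: str           # e.g. "net-play pressure" (illustrative)
    state: TacticState

def shot_prompt(unit: TacticUnit) -> str:
    """Combine predicted tactic type and state into a prompt string,
    loosely mirroring the prompt injected into the decoder."""
    return f"{unit.tactic_type} | {unit.state.value}"

unit = TacticUnit(3, 7, "net-play pressure", TacticState.INTERRUPT)
print(shot_prompt(unit))  # net-play pressure | Interrupt
```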
Ning Ding
Nagoya Institute of Technology, Nagoya, Japan
Keisuke Fujii
Nagoya University, Nagoya, Japan
Toru Tamaki
Nagoya Institute of Technology
Computer vision · Pattern recognition · Deep learning