Action Tube Generation by Person Query Matching for Spatio-Temporal Action Detection

📅 2025-03-17
🏛️ Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
This work addresses the reliance of spatio-temporal action detection on post-processing steps such as IoU-based matching and clip splitting. The authors propose an end-to-end framework that generates action tubes directly from video, eliminating hand-crafted heuristics. The core component is the Query Matching Module (QMM), built on the DETR architecture: DETR performs frame-level person detection, and the QMM links the same person across frames by matching queries with metric learning, jointly optimizing action localization and classification. Action classes are then predicted from the matched query sequences, which supports variable-length video inputs. Experiments on JHMDB, UCF101-24, and AVA show clear gains for actions with large positional changes, while reducing computational overhead and GPU memory consumption, indicating the method's efficiency and generalization across benchmarks.
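The matching step of the QMM can be pictured as a nearest-neighbor search over per-person query embeddings from consecutive frames. The sketch below is illustrative only: the function name `match_queries`, the greedy assignment strategy, and the similarity threshold are assumptions rather than the paper's published interface; the actual QMM learns its embedding space with metric learning and may resolve matches differently.

```python
import torch
import torch.nn.functional as F

def match_queries(prev_feats: torch.Tensor, curr_feats: torch.Tensor,
                  sim_threshold: float = 0.5):
    """Greedy cross-frame matching of person query embeddings (illustrative).

    prev_feats: (P, D) embeddings of person queries from the previous frame.
    curr_feats: (Q, D) embeddings of person queries in the current frame.
    Returns (prev_idx, curr_idx) pairs whose cosine similarity exceeds
    `sim_threshold`; unmatched current queries would start new tubes.
    """
    prev_n = F.normalize(prev_feats, dim=-1)
    curr_n = F.normalize(curr_feats, dim=-1)
    sim = prev_n @ curr_n.T                  # (P, Q) cosine similarity matrix

    pairs = []
    for _ in range(min(sim.shape)):
        idx = torch.argmax(sim)              # best remaining pair, flattened index
        p, q = divmod(idx.item(), sim.shape[1])
        if sim[p, q] < sim_threshold:        # no sufficiently similar pair left
            break
        pairs.append((p, q))
        sim[p, :] = -1.0                     # each query matched at most once
        sim[:, q] = -1.0
    return pairs
```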

📝 Abstract
This paper proposes a method for spatio-temporal action detection (STAD) that directly generates action tubes from the original video without relying on post-processing steps such as IoU-based linking and clip splitting. Our approach applies query-based detection (DETR) to each frame and matches DETR queries to link the same person across frames. We introduce the Query Matching Module (QMM), which uses metric learning to bring queries for the same person closer together across frames compared to queries for different people. Action classes are predicted using the sequence of queries obtained from QMM matching, allowing for variable-length inputs from videos longer than a single clip. Experimental results on JHMDB, UCF101-24, and AVA datasets demonstrate that our method performs well for large position changes of people while offering superior computational efficiency and lower resource requirements.
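The abstract's "bring queries for the same person closer together ... compared to queries for different people" suggests a contrastive or triplet-style objective over query embeddings. Below is a minimal sketch of one such loss; the margin form, cosine normalization, and hardest-negative selection are illustrative assumptions, since the abstract does not specify the exact loss.

```python
import torch
import torch.nn.functional as F

def query_matching_loss(anchor: torch.Tensor, positive: torch.Tensor,
                        negatives: torch.Tensor, margin: float = 0.2):
    """Triplet-style metric learning over DETR query embeddings (illustrative).

    anchor:    (D,)   query embedding of a person in frame t.
    positive:  (D,)   embedding of the same person in frame t+1.
    negatives: (N, D) embeddings of other people in frame t+1.
    Pulls same-person queries together, pushes different people apart.
    """
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negatives, dim=-1)

    pos_sim = (a * p).sum()      # similarity to the same person
    neg_sim = (n @ a).max()      # hardest negative in the next frame
    return F.relu(neg_sim - pos_sim + margin)
```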
Problem

Research questions and friction points this paper is trying to address.

Existing STAD pipelines depend on hand-crafted post-processing, such as IoU-based linking of per-frame detections and splitting videos into fixed-length clips.
IoU-based linking fails when a person's position changes substantially between frames.
Clip-based processing cannot handle actions longer than a single clip and inflates computation and GPU memory use.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates action tubes directly from video, end to end, with no post-processing
Links person queries across frames with the metric-learning Query Matching Module (QMM)
Predicts action classes from variable-length query sequences (see the sketch after this list)
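As referenced above, one way to realize variable-length classification is to pool each tube's matched query sequence with a small sequence model. The class count (21, as in JHMDB), embedding dimension, and transformer pooling below are illustrative assumptions; the paper's actual classifier head may differ.

```python
import torch
import torch.nn as nn

class TubeActionClassifier(nn.Module):
    """Classify an action from a variable-length sequence of matched queries.

    Pools the per-frame query embeddings of one person tube with a small
    transformer encoder, so videos longer than a single clip can be handled
    without splitting them into fixed-length chunks.
    """

    def __init__(self, dim: int = 256, num_classes: int = 21, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tube_queries: torch.Tensor) -> torch.Tensor:
        # tube_queries: (T, D) -- one matched query embedding per frame, T varies.
        x = self.encoder(tube_queries.unsqueeze(0))   # (1, T, D)
        return self.head(x.mean(dim=1)).squeeze(0)    # (num_classes,) logits

# Example: a 37-frame tube of 256-d query features.
logits = TubeActionClassifier()(torch.randn(37, 256))
```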
Kazuki Omi
Nagoya Institute of Technology, Japan
Jion Oshima
Nagoya Institute of Technology, Japan
Toru Tamaki
Nagoya Institute of Technology, Japan
Computer vision · Pattern recognition · Deep learning