🤖 AI Summary
To address the quadratic time complexity of video-based repetitive action counting and its limited robustness in open-set and multi-speed scenarios, this paper proposes a linear-complexity method based on dynamic action queries. The approach eliminates explicit similarity-matrix construction via a lightweight Transformer architecture. Key contributions include: (1) a novel learnable, temporally adaptive action query mechanism that enables flexible modeling of action instances; (2) inter-query contrastive learning to enhance discriminability and suppress frame-level noise; and (3) end-to-end optimization without handcrafted features or pre-defined templates. On the RepCountA benchmark, the method achieves a 26.5% improvement in off-by-one (OBO) accuracy over TransRAC, reduces mean counting error by 22.7%, and cuts computational overhead by 94.1%. These results demonstrate substantial gains in both efficiency and generalization across diverse real-world settings.
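The quadratic-vs-linear contrast can be made concrete with a back-of-the-envelope cost count (illustrative only; the number of queries `Q=32` is an assumed constant, not a value from the paper): a frame-to-frame similarity matrix over `T` frames has `T * T` entries, while `Q` action queries attending over the video touch only `Q * T` feature interactions, which is linear in `T` for fixed `Q`.

```python
# Illustrative cost comparison (assumptions: Q is a small fixed
# number of action queries; costs are counted as pairwise
# feature-interaction entries, ignoring constant factors).

def sim_matrix_cost(T: int) -> int:
    """Entries in a full T x T frame similarity matrix: O(T^2)."""
    return T * T

def query_cost(T: int, Q: int = 32) -> int:
    """Query-to-frame interactions for Q action queries: O(Q * T)."""
    return Q * T

# For a 10,000-frame video the matrix-based cost grows much faster.
T = 10_000
ratio = sim_matrix_cost(T) / query_cost(T)  # = T / Q = 312.5
```

Doubling the video length doubles the query-based cost but quadruples the similarity-matrix cost, which is why the matrix-based methods struggle to scale to long sequences.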
📝 Abstract
Temporal repetition counting aims to quantify the repeated action cycles within a video. Most existing methods rely on a similarity correlation matrix to characterize the repetitiveness of actions, but their scalability is hindered by its quadratic computational complexity. In this work, we introduce a novel approach that employs an action query representation to localize repeated action cycles with linear computational complexity. Based on this representation, we further develop two key components to tackle the essential challenges of temporal repetition counting. Firstly, to facilitate open-set action counting, we propose a dynamic update scheme on action queries. Unlike static action queries, this approach dynamically embeds video features into action queries, offering a more flexible and generalizable representation. Secondly, to distinguish between actions of interest and background noise actions, we incorporate inter-query contrastive learning to regularize the video representations corresponding to different action queries. As a result, our method significantly outperforms previous works, particularly on long video sequences, unseen actions, and actions at various speeds. On the challenging RepCountA benchmark, we outperform the state-of-the-art method TransRAC by 26.5% in OBO accuracy, with a 22.7% decrease in mean error and a 94.1% reduction in computational burden. Code is available at https://github.com/lizishi/DeTRC.
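The two components described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the single-head cross-attention residual update, the tensor shapes, and the InfoNCE-style form of the contrastive loss are all assumptions made for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_query_update(queries, video_feats):
    """Dynamic update scheme (sketch): embed per-frame video features
    into the action queries via one cross-attention step.

    queries:     (Q, D) learnable action queries
    video_feats: (T, D) per-frame features; cost is O(Q*T), linear in T
    """
    d = queries.shape[1]
    attn = softmax(queries @ video_feats.T / np.sqrt(d), axis=-1)
    return queries + attn @ video_feats  # residual update (assumed form)

def inter_query_contrastive_loss(query_feats, labels, tau=0.1):
    """Inter-query contrastive regularizer (InfoNCE-style sketch):
    pull together queries with the same label (e.g. the action of
    interest) and push them away from background queries.

    query_feats: (Q, D) per-query video representations
    labels:      length-Q list; same label = positive pair
    """
    z = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sim = z @ z.T / tau
    Q = len(labels)
    loss, n = 0.0, 0
    for i in range(Q):
        pos = [j for j in range(Q) if j != i and labels[j] == labels[i]]
        if not pos:
            continue
        denom = np.exp(sim[i][np.arange(Q) != i]).sum()
        for j in pos:
            loss += -np.log(np.exp(sim[i, j]) / denom)
            n += 1
    return loss / max(n, 1)
```

Because the attention runs from a fixed number of queries to the `T` frames (never frame-to-frame), the update stays linear in video length, while the contrastive term separates action queries from background queries in representation space.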