DiGIT: Multi-Dilated Gated Encoder and Central-Adjacent Region Integrated Decoder for Temporal Action Detection Transformer

📅 2025-05-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing query-based temporal action detection (TAD) models, which directly adopt object detection architectures, suffer from multi-scale feature redundancy and insufficient long-range temporal modeling. To address these issues, this paper proposes a novel temporal-aware Transformer architecture. Its core contributions are: (1) a multi-dilated gated encoder that jointly leverages multi-dilation convolutions and gating mechanisms to suppress cross-scale feature redundancy while preserving fine-grained localization accuracy and enabling effective long-range contextual modeling; and (2) a central-adjacent region integrated decoder that enhances the completeness of temporal context sampling in deformable cross-attention. The proposed method achieves state-of-the-art performance on THUMOS14, ActivityNet v1.3, and HACS-Segment, with significant improvements in temporal boundary localization precision and long-range temporal relationship modeling capability.

📝 Abstract
In this paper, we examine a key limitation in query-based detectors for temporal action detection (TAD), which arises from their direct adaptation of architectures originally designed for object detection. Despite the effectiveness of existing models, they struggle to fully address the unique challenges of TAD, such as the redundancy in multi-scale features and the limited ability to capture sufficient temporal context. To address these issues, we propose a multi-dilated gated encoder and central-adjacent region integrated decoder for temporal action detection transformer (DiGIT). Our approach replaces the existing encoder, which consists of multi-scale deformable attention and a feedforward network, with our multi-dilated gated encoder. The proposed encoder reduces the redundant information caused by multi-level features while maintaining the ability to capture fine-grained and long-range temporal information. Furthermore, we introduce a central-adjacent region integrated decoder that leverages a more comprehensive sampling strategy for deformable cross-attention to capture the essential information. Extensive experiments demonstrate that DiGIT achieves state-of-the-art performance on THUMOS14, ActivityNet v1.3, and HACS-Segment. Code is available at: https://github.com/Dotori-HJ/DiGIT
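To make the encoder idea concrete, the following is a minimal NumPy sketch of a multi-dilated gated fusion step: several depthwise 1-D convolutions with different dilation rates run over the same temporal feature sequence, and a sigmoid gate learned from the input weights each dilation branch per time step. All function and variable names here (`dilated_conv1d`, `multi_dilated_gated_layer`, `gate_w`, etc.) are illustrative assumptions, not the authors' implementation, which should be consulted at the linked repository.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Depthwise 1-D dilated convolution over time with zero padding.
    x: (T, C) temporal features; w: (K, C) per-channel kernel taps."""
    T, C = x.shape
    K = w.shape[0]
    pad = dilation * (K - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for k in range(K):
        # Each kernel tap reads the input shifted by k * dilation steps.
        out += xp[k * dilation : k * dilation + T] * w[k]
    return out

def multi_dilated_gated_layer(x, kernels, gate_w, gate_b):
    """Run one branch per dilation rate, then fuse branches with
    per-time-step sigmoid gates computed from the input itself."""
    branches = [dilated_conv1d(x, w, d) for d, w in kernels]   # each (T, C)
    stacked = np.stack(branches)                               # (B, T, C)
    gates = 1.0 / (1.0 + np.exp(-(x @ gate_w + gate_b)))       # (T, B)
    # Gated sum over branches: low gate values suppress redundant scales.
    return np.einsum('btc,tb->tc', stacked, gates)

rng = np.random.default_rng(0)
T, C, K = 16, 8, 3
x = rng.standard_normal((T, C))
kernels = [(d, rng.standard_normal((K, C)) * 0.1) for d in (1, 2, 4)]
gate_w = rng.standard_normal((C, len(kernels))) * 0.1
gate_b = np.zeros(len(kernels))
y = multi_dilated_gated_layer(x, kernels, gate_w, gate_b)
print(y.shape)  # (16, 8)
```

Small dilation rates keep fine-grained boundary detail while large ones widen the temporal receptive field, which is the trade-off the gating is meant to arbitrate.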
Problem

Research questions and friction points this paper is trying to address.

Addresses redundancy in multi-scale features for temporal action detection
Improves limited temporal context capture in query-based detectors
Enhances deformable cross-attention sampling for essential information extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-dilated gated encoder reduces feature redundancy
Central-adjacent decoder enhances temporal context capture
Transformer-based architecture improves action detection accuracy
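The decoder contribution can likewise be sketched. In ordinary deformable cross-attention each query samples a few interpolated points near its reference segment; a central-adjacent strategy additionally samples the regions just outside the segment on both sides, so boundary context is not missed. The sketch below is a hedged illustration under assumed names (`interp1d`, `central_adjacent_sample`) and an assumed placement of the adjacent regions, not the paper's exact sampling scheme.

```python
import numpy as np

def interp1d(feats, t):
    """Linearly interpolate features (T, C) at fractional time t (clamped)."""
    T = feats.shape[0]
    t = float(np.clip(t, 0.0, T - 1.0))
    lo = int(np.floor(t))
    hi = min(lo + 1, T - 1)
    a = t - lo
    return (1 - a) * feats[lo] + a * feats[hi]

def central_adjacent_sample(feats, center, width, n_points=4, attn=None):
    """Sample points inside the central segment plus the adjacent region on
    each side, then combine all samples with softmaxed attention weights."""
    offsets = np.linspace(-0.5, 0.5, n_points)
    central = center + offsets * width                 # inside [c - w/2, c + w/2]
    frac = 0.5 + 0.5 * (offsets + 0.5)                 # 0.5 .. 1.0 of the width
    left = center - width * frac                       # region left of the segment
    right = center + width * frac                      # region right of the segment
    points = np.concatenate([central, left, right])
    samples = np.stack([interp1d(feats, t) for t in points])   # (3*n_points, C)
    if attn is None:
        attn = np.zeros(len(points))                   # uniform weights by default
    w = np.exp(attn - attn.max())
    w /= w.sum()
    return w @ samples                                 # (C,) aggregated value

rng = np.random.default_rng(1)
feats = rng.standard_normal((32, 8))                   # encoder memory (T, C)
out = central_adjacent_sample(feats, center=10.0, width=6.0)
print(out.shape)  # (8,)
```

Sampling both inside and just outside the predicted segment gives the decoder evidence about where an action starts and ends, which is what the abstract credits for the improved boundary localization.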
Ho-Joong Kim
Korea University
computer vision
Yearang Lee
Dept. of Artificial Intelligence, Korea University, Seoul, Korea
Jung-Ho Hong
Korea University
Artificial Intelligence · Deep Learning · Explainable AI
Seong-Whan Lee
Dept. of Artificial Intelligence, Korea University, Seoul, Korea