🤖 AI Summary
This work addresses the challenge of aligning action semantics with video representations in open-vocabulary temporal action detection, where the semantic imbalance between short action labels and rich video content hinders cross-modal correspondence. To bridge this gap, the paper is the first to use a diffusion model to generate foreground knowledge, which serves as a cross-modal semantic anchor to guide alignment. The proposed framework has three key components: semantic-unified conditioning, background-suppressed denoising, and foreground-prompted alignment, which together reduce the semantic discrepancy between the video and text modalities. The approach improves localization and recognition of unseen action categories and achieves state-of-the-art results on two established open-vocabulary temporal action detection benchmarks.
📝 Abstract
Open-Vocabulary Temporal Action Detection (OV-TAD) aims to localize and classify action segments of unseen categories in untrimmed videos, where effective alignment between action semantics and video representations is critical for accurate detection. However, existing methods struggle to mitigate the semantic imbalance between concise, abstract action labels and rich, complex video content, which introduces semantic noise and misleads cross-modal alignment. To address this challenge, we propose DFAlign, the first framework that leverages diffusion-based denoising to generate foreground knowledge that guides action-video alignment. Following a "conditioning, denoising, and aligning" pipeline, we first introduce the Semantic-Unify Conditioning (SUC) module, which unifies action-shared and action-specific semantics as conditions for diffusion denoising. Then, the Background-Suppress Denoising (BSD) module generates foreground knowledge by progressively removing background redundancy from videos through the denoising process. This foreground knowledge serves as an effective intermediate semantic anchor between video and text representations, mitigating the semantic gap and enhancing the discriminability of action-relevant segments. Furthermore, we introduce the Foreground-Prompt Alignment (FPA) module, which injects the extracted foreground knowledge as prompt tokens into text representations, guiding the model's attention towards action-relevant segments and enabling precise cross-modal alignment. Extensive experiments demonstrate that our method achieves state-of-the-art performance on two OV-TAD benchmarks. The code is available at https://anonymous.4open.science/r/Code-2114/.
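The idea of using foreground knowledge as prompt tokens can be illustrated with a toy example. The sketch below is not DFAlign's implementation: the vectors are hand-picked, the foreground token is a stand-in for what the BSD module would generate, and mean pooling stands in for a text encoder. All function names (`cosine`, `mean_pool`, `prompt_text`, `score_snippets`) are hypothetical. It only shows how adding a foreground token to a label embedding can separate an action snippet from a background snippet that the bare label scores identically.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_pool(tokens):
    """Average a list of token vectors (assumed stand-in for a text encoder)."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]

def prompt_text(label_embedding, foreground_tokens):
    """Prepend foreground tokens to the label embedding, then pool."""
    return mean_pool(foreground_tokens + [label_embedding])

def score_snippets(snippets, text_vec):
    """Score each video snippet against a text vector."""
    return [cosine(s, text_vec) for s in snippets]

# Toy 3-d features: axis 0 = label concept, axis 1 = foreground cue,
# axis 2 = background clutter. All values are illustrative assumptions.
label = [1.0, 0.0, 0.0]
foreground_tokens = [[0.0, 1.0, 0.0]]  # stand-in for generated foreground knowledge
snippets = [
    [1.0, 1.0, 0.0],  # action snippet: label concept + foreground cue
    [1.0, 0.0, 1.0],  # background snippet: label concept + clutter
]

plain = score_snippets(snippets, label)
prompted = score_snippets(snippets, prompt_text(label, foreground_tokens))
```

In this toy setting the bare label gives both snippets the same score, while the prompted text vector ranks the action snippet above the background one.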