Denoise and Align: Diffusion-Driven Foreground Knowledge Prompting for Open-Vocabulary Temporal Action Detection

📅 2026-04-20

🤖 AI Summary
This work addresses the challenge of aligning action semantics with video representations in open-vocabulary temporal action detection, where the semantic imbalance between concise action labels and rich video content hinders effective cross-modal correspondence. To bridge this gap, the paper introduces what the authors describe as the first use of a diffusion model to generate foreground knowledge, which serves as a cross-modal semantic anchor to guide alignment. The proposed framework has three components: semantic-unified conditioning, background-suppressed denoising, and foreground-prompted alignment, which together reduce the semantic discrepancy between the video and text modalities. The authors report improved localization and recognition of unseen action categories, with state-of-the-art results on two established open-vocabulary temporal action detection benchmarks.

📝 Abstract
Open-Vocabulary Temporal Action Detection (OV-TAD) aims to localize and classify action segments of unseen categories in untrimmed videos, where effective alignment between action semantics and video representations is critical for accurate detection. However, existing methods struggle to mitigate the semantic imbalance between concise, abstract action labels and rich, complex video content, which inevitably introduces semantic noise and misleads cross-modal alignment. To address this challenge, we propose DFAlign, the first framework that leverages diffusion-based denoising to generate foreground knowledge for guiding action-video alignment. Following a 'conditioning, denoising, and aligning' pipeline, we first introduce the Semantic-Unify Conditioning (SUC) module, which unifies action-shared and action-specific semantics as conditions for diffusion denoising. Then, the Background-Suppress Denoising (BSD) module generates foreground knowledge by progressively removing background redundancy from videos through the denoising process. This foreground knowledge serves as an effective intermediate semantic anchor between video and text representations, mitigating the semantic gap and enhancing the discriminability of action-relevant segments. Furthermore, we introduce the Foreground-Prompt Alignment (FPA) module, which injects the extracted foreground knowledge as prompt tokens into text representations, guiding the model's attention towards action-relevant segments and enabling precise cross-modal alignment. Extensive experiments demonstrate that our method achieves state-of-the-art performance on two OV-TAD benchmarks. Code is available at https://anonymous.4open.science/r/Code-2114/.
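
The idea of injecting foreground knowledge as prompt tokens can be illustrated with a toy sketch. This is not the DFAlign implementation: the function names, the mean-pooling stand-in for a text encoder, and the 3-D vectors below are all assumptions chosen only to show how a prompted text embedding might score video snippets by cosine similarity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pool(vectors):
    """Average a list of equal-length vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def prompted_text_embedding(label_embedding, foreground_tokens):
    """Hypothetical stand-in for a text encoder: foreground tokens are
    placed before the label token and the sequence is mean-pooled."""
    return mean_pool(list(foreground_tokens) + [label_embedding])


def score_snippets(snippets, label_embedding, foreground_tokens=()):
    """Score each video snippet feature against the (optionally prompted) text."""
    text = prompted_text_embedding(label_embedding, foreground_tokens)
    return [cosine(snippet, text) for snippet in snippets]


# Toy features; dimensions are (generic action, specific motion, background scene).
label = [1.0, 0.0, 0.0]               # concise, abstract action label
foreground = [[0.5, 1.0, 0.0]]        # assumed output of a denoising step
action_snippet = [0.6, 0.8, 0.3]
background_snippet = [0.1, 0.0, 1.0]
snippets = [action_snippet, background_snippet]

plain_scores = score_snippets(snippets, label)
prompted_scores = score_snippets(snippets, label, foreground)
```

In this toy setup the prompted text embedding scores the action snippet higher and the background snippet lower than the bare label does, which is the qualitative effect the abstract attributes to foreground prompting.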
Problem

Research questions and friction points this paper is trying to address.

Open-Vocabulary Temporal Action Detection
Semantic Alignment
Semantic Imbalance
Cross-Modal Alignment
Action Localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Denoising
Foreground Knowledge
Cross-Modal Alignment
Open-Vocabulary Temporal Action Detection
Prompting
Sa Zhu
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; State Key Laboratory of Cyberspace Security Defense
Wanqian Zhang
Institute of Information Engineering, Chinese Academy of Sciences
Lin Wang
Hangzhou Dianzi University
Jinchao Zhang
WeChat AI - Pattern Recognition Center
Deep Learning, Natural Language Processing, Machine Translation, Dialogue System
Cong Wang
Zhejiang University
LLM Safety/Efficiency
Bo Li
Institute of Information Engineering, Chinese Academy of Sciences; State Key Laboratory of Cyberspace Security Defense