🤖 AI Summary
This work addresses semantic recognition and precise temporal localization of target actions in untrimmed videos by adapting masked diffusion vision-language models to the task. Autoregressive decoders generate tokens from left to right and therefore cannot use later semantic context to revise earlier timestamp predictions; the proposed approach instead applies bidirectional attention within an iterative denoising process, so that action semantics and temporal boundaries are refined jointly. Key components include a Planned Training Objective, which combines boundary-aware masking with step-weighted reconstruction to rehearse the late recovery of time tokens, and a Step-Level IoU Reward, which provides overlap-aware supervision during denoising. Experiments on ActivityNet-RTL, ActivityNet-1.3, and THUMOS-14 show that the method outperforms autoregressive vision-language baselines, with especially strong gains at stricter IoU thresholds.
📝 Abstract
Temporal action localization (TAL) requires recognizing the target event and localizing its start and end times precisely in untrimmed videos. Recent vision-language formulations improve semantic reasoning and support language-conditioned outputs, but their autoregressive decoders still generate tokens from left to right, preventing later semantic evidence from revising earlier timestamp predictions.
We adapt masked diffusion vision-language models (MDVLMs) to TAL so that semantic tokens and boundary tokens remain editable throughout iterative denoising with bidirectional attention, allowing temporal boundaries and semantic content to be refined jointly.
Direct adaptation, however, creates two TAL-specific mismatches. First, standard masked diffusion training corrupts all positions uniformly at random, whereas time tokens are predicted more reliably once enough semantic context is available. Second, token-level cross-entropy does not reflect temporal IoU. To address these mismatches, we introduce a Planned Training Objective that uses boundary-aware masking and step-weighted reconstruction to rehearse the late recovery of time tokens, together with a Step-Level IoU Reward that provides overlap-aware supervision during denoising. A standard sequence-level cross-entropy term provides the base reconstruction signal.
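To make the contrast with uniform corruption concrete, the following is a minimal sketch of one way a boundary-aware masking rule could look. It is not the paper's formulation: the function name `boundary_aware_mask`, the `time_bias` parameter, the `[MASK]` placeholder, the `<t_..>` token spelling, and the specific probability rule are all assumptions made for illustration. The idea is only that, at a given noise level, time tokens are masked at least as often as semantic tokens, so the model more often has to recover them last.

```python
import random

MASK = "[MASK]"  # hypothetical mask placeholder


def boundary_aware_mask(tokens, is_time, t, time_bias=2.0, rng=None):
    """Mask tokens at noise level t in [0, 1].

    Assumption (illustrative, not from the paper): semantic tokens are
    masked with probability t, and time tokens with probability
    1 - (1 - t) ** time_bias. For time_bias > 1 the time-token
    probability is never lower than the semantic one.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    rng = rng or random.Random()
    p_time = 1.0 - (1.0 - t) ** time_bias
    masked = []
    for tok, time_flag in zip(tokens, is_time):
        p = p_time if time_flag else t
        masked.append(MASK if rng.random() < p else tok)
    return masked


# Hypothetical sequence: three semantic tokens followed by two time tokens.
tokens = ["a", "person", "dives", "<t_12>", "<t_30>"]
is_time = [False, False, False, True, True]
print(boundary_aware_mask(tokens, is_time, t=0.5, rng=random.Random(0)))
```

With `time_bias=1.0` the rule reduces to uniform masking, which gives a simple baseline to compare against.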
Experiments on ActivityNet-RTL, ActivityNet-1.3, and THUMOS-14 show that MDVLM-TAL improves both temporal reasoning and boundary localization over autoregressive vision-language baselines, with especially strong gains under stricter temporal IoU criteria.
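For readers unfamiliar with the metric, temporal IoU is the length of the overlap between a predicted segment and a ground-truth segment divided by the length of their union. The sketch below computes it and shows why stricter thresholds are harder to meet; the function name `temporal_iou` and the example segments are illustrative and not taken from the paper.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    # Overlap is zero when the segments do not intersect.
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


# Illustrative segments: the predicted start is 4 s late, the end 2 s late.
pred, gt = (14.0, 30.0), (10.0, 28.0)
iou = temporal_iou(pred, gt)  # overlap 14 s, union 20 s

for threshold in (0.5, 0.75):
    print(f"IoU >= {threshold}: {iou >= threshold}")
```

Here the prediction counts as correct at a threshold of 0.5 but not at 0.75, so modest boundary errors that are invisible under a loose criterion become misses under a strict one.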