🤖 AI Summary
This work addresses three key challenges in localizing video and audio manipulations (ambiguous boundaries, sparse tampering patterns, and insufficient long-range modeling) by proposing DeformTrace, a framework that combines the global modeling capacity of Transformers with the computational efficiency of State Space Models (SSMs). DeformTrace introduces deformable SSMs (DS-SSM and DC-SSM) to dynamically adapt receptive fields, incorporates relay tokens to mitigate long-range dependency decay, and designs a query-aware subspace mechanism to enhance sensitivity to sparse manipulations. DeformTrace attains state-of-the-art accuracy and robustness on temporal forgery localization benchmarks while using fewer parameters and delivering faster inference.
📝 Abstract
Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments in video and audio, offering strong interpretability for security and forensics. While recent State Space Models (SSMs) show promise in precise temporal reasoning, their use in TFL is hindered by ambiguous boundaries, sparse forgeries, and limited long-range modeling. We propose DeformTrace, which enhances SSMs with deformable dynamics and relay mechanisms to address these challenges. Specifically, Deformable Self-SSM (DS-SSM) introduces dynamic receptive fields into SSMs for precise temporal localization. To further enhance the capacity of DS-SSM for temporal reasoning and to mitigate long-range decay, a Relay Token Mechanism is integrated into it. In addition, Deformable Cross-SSM (DC-SSM) partitions the global state space into query-specific subspaces, which reduces the accumulation of non-forgery information and boosts sensitivity to sparse forgeries. These components are integrated into a hybrid architecture that combines the global modeling of Transformers with the efficiency of SSMs. Extensive experiments show that DeformTrace achieves state-of-the-art performance with fewer parameters, faster inference, and stronger robustness.
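The "long-range decay" that the abstract refers to can be illustrated with a toy example. The sketch below is not DeformTrace and does not implement DS-SSM, DC-SSM, or relay tokens. It is a plain scalar linear recurrence, h_t = a * h_{t-1} + b * x_t, used here as a simplified stand-in for an SSM layer. The function names `ssm_scan` and `impulse_influence` and the values `a = 0.9` and `b = 1.0` are assumptions chosen only for illustration.

```python
def ssm_scan(inputs, a=0.9, b=1.0):
    """Run the scalar linear recurrence h_t = a * h_{t-1} + b * x_t.

    Illustrative stand-in for an SSM layer; `a` and `b` are assumed constants.
    """
    h = 0.0
    states = []
    for x in inputs:
        h = a * h + b * x
        states.append(h)
    return states


def impulse_influence(seq_len, a=0.9, b=1.0):
    """Return how much a unit input at step 0 still contributes to each later state."""
    impulse = [1.0] + [0.0] * (seq_len - 1)
    return ssm_scan(impulse, a, b)


# With |a| < 1 the contribution of the first input after k steps is b * a**k,
# so it shrinks geometrically with distance.
influence = impulse_influence(seq_len=101, a=0.9)
for k in (0, 10, 100):
    print(f"steps since input: {k:3d}  remaining influence: {influence[k]:.3e}")
```

In this toy setting, a short manipulated segment far from the current position contributes almost nothing to the state, which is one way to read the motivation for mechanisms that carry information across long spans.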