Keeping the Evidence Chain: Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing training-free visual token pruning methods suffer significant performance degradation on video temporal grounding because they neglect boundary-sensitive evidence and cross-frame reasoning chains. To address this, the paper proposes SemVID, the first training-free pruning framework that is both efficient and high-fidelity for this task. SemVID constructs a compact yet semantically complementary subset of visual tokens by introducing two principles, Evidence Retention (ER) and Connectivity Strength (CS), and allocating tokens to three semantic roles: object, motion, and context. Without any training, it retains 95.4% mIoU using only 12.5% of the original visual tokens and attains up to a 5.8× prefill speedup, substantially outperforming existing approaches.
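
The ER and CS principles are only named here, so as a rough illustration, below is a minimal PyTorch sketch of how such per-token scores might be computed. The function name, the boundary mask, and the similarity heuristics are assumptions for illustration, not the paper's formulas.

```python
import torch
import torch.nn.functional as F

def er_cs_scores(tokens, query_emb, boundary_mask):
    """Hypothetical scoring for the two principles (not the paper's exact math).

    tokens:        (T, N, D) visual tokens, T frames x N patches
    query_emb:     (D,) text-query embedding in the same space
    boundary_mask: (T,) bool, True for frames near candidate event boundaries
    """
    tok = F.normalize(tokens, dim=-1)
    q = F.normalize(query_emb, dim=-1)

    # Evidence Retention: query relevance, up-weighted near boundaries
    # so boundary-sensitive evidence survives pruning.
    er = torch.einsum('tnd,d->tn', tok, q)
    er = er * (1.0 + boundary_mask.float().unsqueeze(1))       # (T, N)

    # Connectivity Strength: mean similarity of each token to the next
    # frame's tokens, i.e. how well it relays evidence across frames.
    cs = torch.einsum('tnd,tmd->tnm', tok[:-1], tok[1:]).mean(dim=-1)
    cs = torch.cat([cs, cs[-1:]])                              # pad last frame
    return er, cs
```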

📝 Abstract
Video Temporal Grounding (VTG) localizes the temporal boundaries of a query-relevant moment in long, untrimmed videos, whose length makes video-language-model (VLM) pipelines prohibitively expensive. While recent training-free visual token pruning has shown success in video question answering, naively applying these objectives to VTG often causes drastic degradation, as VTG crucially depends on boundary-sensitive evidence and cross-frame reasoning chains. We therefore identify two VTG-specific pruning principles: Evidence Retention (ER), which keeps query-critical patches, especially around event boundaries, and Connectivity Strength (CS), which preserves token-level cross-frame connectivity for long-range evidence aggregation. Building on these insights, we propose SemVID, a training-free pruning framework that constructs a compact yet coherent token subset with complementary semantic roles. SemVID first allocates per-frame token budgets by balancing query relevance and inter-frame variation to avoid over-pruned segments, and then selects three types of tokens: object tokens for diverse query-critical evidence, motion tokens to capture meaningful transitions and serve as cross-frame relays, and a small set of context tokens for scene continuity. Extensive experiments on VTG benchmarks show that SemVID achieves a strong accuracy-efficiency trade-off, retaining up to 95.4% mIoU with only 12.5% of visual tokens and delivering up to a 5.8× prefill speedup, consistently outperforming prior methods under the same budgets.
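
As one concrete reading of that pipeline, here is a minimal, self-contained sketch of SemVID-style budget allocation and three-role selection. Every name, fraction split, and scoring heuristic below is an illustrative assumption, since the abstract does not specify the exact formulas.

```python
import torch

def semvid_prune(tokens, query_emb, keep_ratio=0.125,
                 obj_frac=0.6, mot_frac=0.3):
    """Illustrative sketch of SemVID-style pruning (hypothetical API).

    tokens:    (T, N, D) visual tokens for T frames, N patches each
    query_emb: (D,) text-query embedding in the same space
    """
    T, N, D = tokens.shape
    total_budget = int(keep_ratio * T * N)

    # Per-frame budgets balance query relevance and inter-frame variation,
    # so strongly changing (boundary-like) segments are not over-pruned.
    relevance = torch.einsum('tnd,d->tn', tokens, query_emb)        # (T, N)
    frame_rel = relevance.mean(dim=1)                               # (T,)
    variation = (tokens[1:] - tokens[:-1]).norm(dim=-1).mean(dim=1)
    variation = torch.cat([variation[:1], variation])               # pad frame 0
    frame_score = frame_rel.softmax(0) + variation.softmax(0)
    budgets = (frame_score / frame_score.sum() * total_budget).long().clamp(min=1)

    kept = []
    for t in range(T):
        b = int(budgets[t])
        n_obj = max(1, int(obj_frac * b))
        n_mot = max(1, int(mot_frac * b))
        n_ctx = max(0, b - n_obj - n_mot)
        # Object tokens: most query-relevant patches in this frame.
        obj_idx = relevance[t].topk(n_obj).indices
        # Motion tokens: patches that change most vs. the previous frame,
        # acting as cross-frame relays.
        prev = tokens[t - 1] if t > 0 else tokens[t]
        mot_idx = (tokens[t] - prev).norm(dim=-1).topk(n_mot).indices
        # Context tokens: a small uniform sample for scene continuity.
        ctx_idx = torch.randperm(N)[:n_ctx]
        idx = torch.cat([obj_idx, mot_idx, ctx_idx]).unique()
        kept.append(tokens[t, idx])
    return kept  # list of (n_t, D) pruned token sets, one per frame
```
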
Problem

Research questions and friction points this paper is trying to address.

Video Temporal Grounding
Token Pruning
Evidence Chain
Boundary Sensitivity
Cross-frame Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free token pruning
video temporal grounding
evidence retention
connectivity strength
semantic token allocation
Jiaqi Li
University of Warwick, Coventry CV4 7AL, United Kingdom
Shuntian Zheng
University of Warwick, Coventry CV4 7AL, United Kingdom
Yixian Shen
University of Amsterdam
Efficient DNN · Computer Architecture · System Optimization
Jia-Hong Huang
University of Amsterdam, Amsterdam 1012 WX, The Netherlands
Xiaoman Lu
University of Warwick, Coventry CV4 7AL, United Kingdom
Minzhe Ni
University of Warwick, Coventry CV4 7AL, United Kingdom
Yu Guan
Associate Professor, University of Warwick, UK
Activity Recognition · AI for Healthcare · Ubiquitous Computing · Visual Computing · Machine Learning