Keeping the Evidence Chain: Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing training-free visual token pruning methods suffer significant performance degradation on video temporal grounding because they neglect boundary-sensitive evidence and cross-frame reasoning chains. To address this, the paper proposes SemVID, the first training-free pruning framework that is both efficient and high-fidelity for this task. SemVID constructs a compact yet semantically complementary subset of visual tokens by introducing two principles, Evidence Retention (ER) and Connectivity Strength (CS), and allocating tokens to three semantic roles: object, motion, and context. Without any training, it retains 95.4% mIoU using only 12.5% of the original visual tokens and attains up to a 5.8× prefill speedup, substantially outperforming existing approaches.
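
The ER and CS principles are only named here, so as a rough illustration, below is a minimal PyTorch sketch of how such per-token scores might be computed. The function name, the boundary mask, and the similarity heuristics are assumptions for illustration, not the paper's formulas.

```python
import torch
import torch.nn.functional as F

def er_cs_scores(tokens, query_emb, boundary_mask):
    """Hypothetical scoring for the two principles (not the paper's exact math).

    tokens:        (T, N, D) visual tokens, T frames x N patches
    query_emb:     (D,) text-query embedding in the same space
    boundary_mask: (T,) bool, True for frames near candidate event boundaries
    """
    tok = F.normalize(tokens, dim=-1)
    q = F.normalize(query_emb, dim=-1)

    # Evidence Retention: query relevance, up-weighted near boundaries
    # so boundary-sensitive evidence survives pruning.
    er = torch.einsum('tnd,d->tn', tok, q)
    er = er * (1.0 + boundary_mask.float().unsqueeze(1))       # (T, N)

    # Connectivity Strength: mean similarity of each token to the next
    # frame's tokens, i.e. how well it relays evidence across frames.
    cs = torch.einsum('tnd,tmd->tnm', tok[:-1], tok[1:]).mean(dim=-1)
    cs = torch.cat([cs, cs[-1:]])                              # pad last frame
    return er, cs
```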

📝 Abstract
Video Temporal Grounding (VTG) localizes the temporal boundaries of a query-relevant moment in long, untrimmed videos, whose length makes video-language-model (VLM) pipelines prohibitively expensive. While recent training-free visual token pruning has shown success in video question answering, naively applying these objectives to VTG often causes drastic degradation, as VTG crucially depends on boundary-sensitive evidence and cross-frame reasoning chains. We therefore identify two VTG-specific pruning principles: Evidence Retention (ER), which keeps query-critical patches, especially around event boundaries, and Connectivity Strength (CS), which preserves token-level cross-frame connectivity for long-range evidence aggregation. Building on these insights, we propose SemVID, a training-free pruning framework that constructs a compact yet coherent token subset with complementary semantic roles. SemVID first allocates per-frame token budgets by balancing query relevance and inter-frame variation to avoid over-pruned segments, and then selects three types of tokens: object tokens for diverse query-critical evidence, motion tokens to capture meaningful transitions and serve as cross-frame relays, and a small set of context tokens for scene continuity. Extensive experiments on VTG benchmarks show that SemVID achieves a strong accuracy-efficiency trade-off, retaining up to 95.4% mIoU with only 12.5% of visual tokens and delivering up to a 5.8× prefill speedup, consistently outperforming prior methods under the same budgets.
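
As one concrete reading of that pipeline, here is a minimal, self-contained sketch of SemVID-style budget allocation and three-role selection. Every name, fraction split, and scoring heuristic below is an illustrative assumption, since the abstract does not specify the exact formulas.

```python
import torch

def semvid_prune(tokens, query_emb, keep_ratio=0.125,
                 obj_frac=0.6, mot_frac=0.3):
    """Illustrative sketch of SemVID-style pruning (hypothetical API).

    tokens:    (T, N, D) visual tokens for T frames, N patches each
    query_emb: (D,) text-query embedding in the same space
    """
    T, N, D = tokens.shape
    total_budget = int(keep_ratio * T * N)

    # Per-frame budgets balance query relevance and inter-frame variation,
    # so strongly changing (boundary-like) segments are not over-pruned.
    relevance = torch.einsum('tnd,d->tn', tokens, query_emb)        # (T, N)
    frame_rel = relevance.mean(dim=1)                               # (T,)
    variation = (tokens[1:] - tokens[:-1]).norm(dim=-1).mean(dim=1)
    variation = torch.cat([variation[:1], variation])               # pad frame 0
    frame_score = frame_rel.softmax(0) + variation.softmax(0)
    budgets = (frame_score / frame_score.sum() * total_budget).long().clamp(min=1)

    kept = []
    for t in range(T):
        b = int(budgets[t])
        n_obj = max(1, int(obj_frac * b))
        n_mot = max(1, int(mot_frac * b))
        n_ctx = max(0, b - n_obj - n_mot)
        # Object tokens: most query-relevant patches in this frame.
        obj_idx = relevance[t].topk(n_obj).indices
        # Motion tokens: patches that change most vs. the previous frame,
        # acting as cross-frame relays.
        prev = tokens[t - 1] if t > 0 else tokens[t]
        mot_idx = (tokens[t] - prev).norm(dim=-1).topk(n_mot).indices
        # Context tokens: a small uniform sample for scene continuity.
        ctx_idx = torch.randperm(N)[:n_ctx]
        idx = torch.cat([obj_idx, mot_idx, ctx_idx]).unique()
        kept.append(tokens[t, idx])
    return kept  # list of (n_t, D) pruned token sets, one per frame
```
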
Problem

Research questions and friction points this paper is trying to address.

Video Temporal Grounding
Token Pruning
Evidence Chain
Boundary Sensitivity
Cross-frame Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free token pruning
video temporal grounding
evidence retention
connectivity strength
semantic token allocation
Jiaqi Li
University of Warwick, Coventry CV4 7AL, United Kingdom
Shuntian Zheng
University of Warwick, Coventry CV4 7AL, United Kingdom
Yixian Shen
University of Amsterdam
Efficient DNN · Computer Architecture · System Optimization
Jia-Hong Huang
University of Amsterdam, Amsterdam 1012 WX, The Netherlands
Xiaoman Lu
University of Warwick, Coventry CV4 7AL, United Kingdom
Minzhe Ni
University of Warwick, Coventry CV4 7AL, United Kingdom
Yu Guan
Associate Professor, University of Warwick, UK
Activity Recognition · AI for Healthcare · Ubiquitous Computing · Visual Computing · Machine Learning