KTV: Keyframes and Key Tokens Selection for Efficient Training-Free Video LLMs

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing training-free video understanding methods suffer from high visual redundancy, substantial computational overhead, and bias in keyframe selection, leading to suboptimal performance. To address these limitations, this work proposes KTV, a two-stage framework that first performs question-agnostic keyframe selection by clustering CLIP frame features and then prunes visual tokens within each selected frame based on their importance and redundancy. The approach requires no training, significantly reduces input scale, and improves comprehension accuracy. On the MLVU-Test benchmark, KTV achieves 44.8% accuracy on 60-minute videos using only 504 visual tokens, outperforming multiple training-free methods and even some training-based approaches, demonstrating its efficiency and effectiveness.

📝 Abstract
Training-free video understanding leverages the strong image comprehension capabilities of pre-trained vision-language models (VLMs) by treating a video as a sequence of static frames, thus obviating the need for costly video-specific training. However, this paradigm often suffers from severe visual redundancy and high computational overhead, especially when processing long videos. Crucially, existing keyframe selection strategies, especially those based on CLIP similarity, are prone to biases and may inadvertently overlook critical frames, resulting in suboptimal video comprehension. To address these significant challenges, we propose KTV, a novel two-stage framework for efficient and effective training-free video understanding. In the first stage, KTV performs question-agnostic keyframe selection by clustering frame-level visual features, yielding a compact, diverse, and representative subset of frames that mitigates temporal redundancy. In the second stage, KTV applies key visual token selection, pruning redundant or less informative tokens from each selected keyframe based on token importance and redundancy, which significantly reduces the number of tokens fed into the LLM. Extensive experiments on the Multiple-Choice VideoQA task demonstrate that KTV outperforms state-of-the-art training-free baselines while using significantly fewer visual tokens, e.g., only 504 visual tokens for a 60-minute video with 10,800 frames, achieving 44.8% accuracy on the MLVU-Test benchmark. In particular, KTV also exceeds several training-based approaches on certain benchmarks.
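The two-stage pipeline described above can be sketched in a few lines. This is not the paper's exact procedure: the k-means clustering, the cosine-similarity threshold, and the function names below are illustrative stand-ins using NumPy, meant only to show the shape of "cluster frames, then greedily keep important, non-redundant tokens".

```python
import numpy as np

def select_keyframes(frame_feats: np.ndarray, k: int, iters: int = 20, seed: int = 0) -> list[int]:
    """Stage 1 sketch: k-means over frame-level features (e.g. CLIP embeddings),
    then keep the frame closest to each centroid -- a compact, diverse subset."""
    rng = np.random.default_rng(seed)
    centroids = frame_feats[rng.choice(len(frame_feats), k, replace=False)]
    for _ in range(iters):
        # assign each frame to its nearest centroid
        d = np.linalg.norm(frame_feats[:, None] - centroids[None], axis=-1)
        labels = d.argmin(axis=1)
        # recompute centroids (keep the old one if a cluster empties)
        for c in range(k):
            if (labels == c).any():
                centroids[c] = frame_feats[labels == c].mean(axis=0)
    d = np.linalg.norm(frame_feats[:, None] - centroids[None], axis=-1)
    # one representative frame per centroid; duplicates collapse via set()
    return sorted(set(d.argmin(axis=0).tolist()))

def prune_tokens(tokens: np.ndarray, importance: np.ndarray,
                 keep: int, sim_thresh: float = 0.9) -> np.ndarray:
    """Stage 2 sketch: greedily keep the highest-importance tokens, skipping
    near-duplicates (cosine similarity above sim_thresh) of tokens already kept."""
    order = importance.argsort()[::-1]  # most important first
    unit = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    kept: list[int] = []
    for i in order:
        if all(unit[i] @ unit[j] < sim_thresh for j in kept):
            kept.append(i)
        if len(kept) == keep:
            break
    return tokens[sorted(kept)]
```

In a real system the importance scores would come from the model itself (e.g. attention to the visual tokens) rather than being supplied externally, and clustering would run over thousands of frames; the structure of the two stages stays the same.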
Problem

Research questions and friction points this paper is trying to address.

video understanding
visual redundancy
keyframe selection
training-free
computational overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free video understanding
keyframe selection
visual token pruning
vision-language models
video question answering