AI Summary
This work addresses the high computational cost of video multimodal large language models when processing high-frame-rate videos, where uniform frame sampling often discards critical semantic content. To overcome this limitation, the authors propose TiFRe, a framework that dynamically selects keyframes guided by user-provided text prompts while preserving semantic information from non-keyframes through task-aware frame compression. TiFRe leverages a large language model to generate CLIP-style textual prompts and employs a pretrained CLIP encoder to compute frame-text similarity scores. A novel Frame Matching and Merging (FMM) mechanism then merges non-keyframe information into the selected keyframes, substantially reducing input length without compromising semantic fidelity. Experimental results demonstrate that TiFRe outperforms fixed-frame-rate baselines across multiple video-language tasks, achieving a favorable balance between efficiency and performance.
Abstract
With the rapid development of Large Language Models (LLMs), Video Multi-Modal Large Language Models (Video MLLMs) have achieved remarkable performance in video-language tasks such as video understanding and question answering. However, Video MLLMs incur high computational costs, particularly when processing many video frames as input, which leads to significant attention computation overhead. A straightforward way to reduce these costs is to decrease the number of input video frames. However, simply selecting key frames at a fixed frame rate (FPS) often overlooks valuable information in non-key frames, resulting in notable performance degradation. To address this, we propose Text-guided Video Frame Reduction (TiFRe), a framework that reduces the number of input frames while preserving essential video information. TiFRe uses a Text-guided Frame Sampling (TFS) strategy to select key frames based on the user's input: an LLM converts the input into a CLIP-style prompt, and pre-trained CLIP encoders compute the semantic similarity between the prompt and each frame, selecting the most relevant frames as key frames. To preserve video semantics, TiFRe further employs a Frame Matching and Merging (FMM) mechanism, which integrates non-key frame information into the selected key frames, minimizing information loss. Experiments show that TiFRe effectively reduces computational costs while improving performance on video-language tasks.
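The two stages described above (TFS, then FMM) can be sketched with plain NumPy. This is an illustrative sketch under stated assumptions, not the paper's implementation: it assumes frame and prompt embeddings have already been produced by a CLIP encoder, and it stands in for FMM with a simple similarity-based matching plus feature averaging; the function name and signature are hypothetical.

```python
import numpy as np

def select_and_merge(frame_feats, text_feat, k):
    """Hypothetical sketch of TiFRe's two stages (names are illustrative).

    frame_feats: (T, D) CLIP frame embeddings; text_feat: (D,) CLIP prompt
    embedding. Stage 1 (TFS): keep the k frames most similar to the prompt.
    Stage 2 (simplified FMM stand-in): merge each non-key frame into its
    most similar key frame by feature averaging.
    """
    # Cosine similarity between the prompt and every frame.
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    sims = f @ t  # shape (T,)

    # TFS: the k most prompt-relevant frames, kept in temporal order.
    key_idx = np.sort(np.argsort(sims)[-k:])
    nonkey_idx = np.setdiff1d(np.arange(len(f)), key_idx)

    # Simplified FMM: match each non-key frame to its most similar key
    # frame, then fold the matched features in by averaging.
    merged = frame_feats[key_idx].copy()
    counts = np.ones(k)
    if len(nonkey_idx):
        match = np.argmax(f[nonkey_idx] @ f[key_idx].T, axis=1)
        for j, m in zip(nonkey_idx, match):
            merged[m] += frame_feats[j]
            counts[m] += 1
    merged /= counts[:, None]
    return key_idx, merged
```

The sketch returns both the selected key-frame indices and the merged features, so a downstream Video MLLM would consume k frame tokens instead of T while each kept token still carries averaged information from the frames it absorbed.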