Improving LLM Video Understanding with 16 Frames Per Second

📅 2025-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video understanding methods with multimodal LLMs rely on low-frame-rate sampling (≤2 FPS), which discards dynamic spatiotemporal information. This work introduces F-16, the first multimodal large language model explicitly designed for high-frame-rate (16 FPS) video input. F-16 raises the input frame rate to 16 FPS and compresses the visual tokens within each 1-second clip, efficiently capturing dynamic visual features while preserving key semantic information. The results show that increasing the input frame rate is an effective optimization axis beyond merely scaling model size or training data. The 7B-parameter model achieves state-of-the-art performance among video LLMs of its size on benchmarks including Video-MME and TemporalBench, outperforms GPT-4o and Gemini-1.5-pro on high-speed sports analysis, and supports efficient low-frame-rate inference without retraining via a novel decoding method.

📝 Abstract
Human vision is dynamic and continuous. However, in video understanding with multimodal large language models (LLMs), existing methods primarily rely on static features extracted from images sampled at a fixed low frame rate of ≤2 frames per second (FPS), leading to critical visual information loss. In this paper, we introduce F-16, the first multimodal LLM designed for high-frame-rate video understanding. By increasing the frame rate to 16 FPS and compressing visual tokens within each 1-second clip, F-16 efficiently captures dynamic visual features while preserving key semantic information. Experimental results demonstrate that higher frame rates considerably enhance video understanding across multiple benchmarks, providing a new approach to improving video LLMs beyond scaling model size or training data. F-16 achieves state-of-the-art performance among 7-billion-parameter video LLMs on both general and fine-grained video understanding benchmarks, such as Video-MME and TemporalBench. Furthermore, F-16 excels in complex spatiotemporal tasks, including high-speed sports analysis (e.g., basketball, football, gymnastics, and diving), outperforming SOTA proprietary visual models like GPT-4o and Gemini-1.5-pro. Additionally, we introduce a novel decoding method for F-16 that enables highly efficient low-frame-rate inference without requiring model retraining. Upon acceptance, we will release the source code, model checkpoints, and data.
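The abstract's core pipeline, sampling frames at 16 FPS and then compressing the visual tokens of each 1-second clip, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`sample_frame_indices`, `compress_clip`) are invented here, and the paper does not specify the compression operator, so simple average pooling stands in for it.

```python
# Sketch of 16-FPS frame sampling plus per-clip token compression.
# Names and the pooling operator are illustrative assumptions, not from the paper.

def sample_frame_indices(duration_s: float, native_fps: float, target_fps: int = 16):
    """Pick frame indices so that target_fps frames are taken per second of video."""
    total_target = int(duration_s * target_fps)
    last_frame = int(duration_s * native_fps) - 1
    step = native_fps / target_fps
    return [min(int(i * step), last_frame) for i in range(total_target)]

def compress_clip(clip_tokens, keep: int):
    """Compress the tokens of one 1-second clip down to `keep` tokens.
    Average pooling over fixed-size groups is a stand-in for the paper's
    (unspecified here) compression module."""
    group = max(1, len(clip_tokens) // keep)
    pooled = [sum(clip_tokens[i:i + group]) / group
              for i in range(0, len(clip_tokens), group)]
    return pooled[:keep]
```

For example, a 2-second video decoded at a native 30 FPS yields 32 sampled frames, and a clip of 16 scalar "tokens" compressed to 4 keeps one pooled value per group of 4.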
Problem

Research questions and friction points this paper is trying to address.

Low-frame-rate sampling (≤2 FPS) in existing video LLMs discards dynamic visual information
High-frame-rate input inflates visual token counts, demanding efficient compression
Fine-grained spatiotemporal tasks, such as high-speed sports analysis, remain difficult for current models
Innovation

Methods, ideas, or system contributions that make the work stand out.

First multimodal LLM designed for high-frame-rate (16 FPS) video understanding
Visual token compression within each 1-second clip, preserving key semantics at manageable cost
Novel decoding method enabling efficient low-frame-rate inference without retraining
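One plausible reading of the training-free low-frame-rate inference idea is to subsample the 16-FPS frame-token stream at decode time, a sketch under that assumption only; the paper's actual decoding method is not described at this level of detail, and the function below is hypothetical.

```python
def subsample_for_inference(frame_tokens, infer_fps: int, train_fps: int = 16):
    """Training-free low-FPS inference sketch: keep every
    (train_fps // infer_fps)-th frame's tokens from a stream produced by a
    model trained at 16 FPS. Illustrative assumption, not the paper's method."""
    stride = max(1, train_fps // infer_fps)
    return [tok for i, tok in enumerate(frame_tokens) if i % stride == 0]
```

Running inference at 4 FPS on a 16-FPS stream would then keep every 4th frame's tokens, cutting the visual context roughly 4x without any retraining.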