🤖 AI Summary
Existing video-understanding methods with multimodal LLMs sample frames at low rates (≤2 FPS), losing dynamic visual information. This work introduces F-16, the first multimodal large language model designed for high-frame-rate video understanding. F-16 raises the input frame rate to 16 FPS and compresses the visual tokens within each 1-second clip, efficiently capturing dynamic visual features while preserving key semantic information. The results show that increasing the input frame rate is an effective way to improve video LLMs beyond scaling model size or training data. The 7B-parameter model achieves state-of-the-art performance among video LLMs of its size on general and fine-grained benchmarks such as Video-MME and TemporalBench, and outperforms proprietary models including GPT-4o and Gemini-1.5-pro on high-speed sports analysis. A novel decoding method additionally enables highly efficient low-frame-rate inference without retraining.
📝 Abstract
Human vision is dynamic and continuous. However, in video understanding with multimodal large language models (LLMs), existing methods primarily rely on static features extracted from images sampled at a fixed low frame rate of ≤2 frames per second (FPS), leading to critical visual information loss. In this paper, we introduce F-16, the first multimodal LLM designed for high-frame-rate video understanding. By increasing the frame rate to 16 FPS and compressing visual tokens within each 1-second clip, F-16 efficiently captures dynamic visual features while preserving key semantic information. Experimental results demonstrate that higher frame rates considerably enhance video understanding across multiple benchmarks, providing a new approach to improving video LLMs beyond scaling model size or training data. F-16 achieves state-of-the-art performance among 7-billion-parameter video LLMs on both general and fine-grained video understanding benchmarks, such as Video-MME and TemporalBench. Furthermore, F-16 excels in complex spatiotemporal tasks, including high-speed sports analysis (*e.g.*, basketball, football, gymnastics, and diving), outperforming SOTA proprietary visual models like GPT-4o and Gemini-1.5-pro. Additionally, we introduce a novel decoding method for F-16 that enables highly efficient low-frame-rate inference without requiring model retraining. Upon acceptance, we will release the source code, model checkpoints, and data.
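The core idea above — sampling at 16 FPS and compressing the visual tokens within each 1-second clip — can be sketched as follows. The abstract does not specify the compression module, so the mean pooling used here is an illustrative placeholder, and the shapes (196 tokens per frame, 8-dim features) are arbitrary example values, not the paper's actual configuration:

```python
import numpy as np

def compress_clip_tokens(clip_tokens: np.ndarray) -> np.ndarray:
    """Compress one 1-second clip of per-frame visual tokens into a
    single frame-sized token grid. Placeholder: average over time.
    clip_tokens: (frames_per_clip, num_tokens, dim)"""
    return clip_tokens.mean(axis=0)

def encode_video(frame_tokens: np.ndarray, fps: int = 16) -> np.ndarray:
    """Group a (T, num_tokens, dim) stream of per-frame visual tokens
    into 1-second clips of `fps` frames and compress each clip,
    so the token count grows with seconds of video, not frames."""
    t, n, d = frame_tokens.shape
    # Drop any trailing partial clip, then split into (clips, fps, n, d).
    clips = frame_tokens[: (t // fps) * fps].reshape(-1, fps, n, d)
    return np.stack([compress_clip_tokens(c) for c in clips])

# 4 seconds of video at 16 FPS, 196 tokens per frame, 8-dim features.
tokens = np.random.rand(64, 196, 8)
compressed = encode_video(tokens)
print(compressed.shape)  # (4, 196, 8): one token grid per second
```

The point of the per-clip compression is that a 16 FPS input feeds the LLM the same number of visual tokens per second as a 1 FPS input would, while the pooled features still reflect motion within each second.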