🤖 AI Summary
Current vision-language models (VLMs) show significant limitations in fine-grained video motion understanding, a gap compounded by the lack of dedicated evaluation benchmarks and efficient modeling frameworks. To address this, we introduce MotionBench, the first benchmark explicitly designed for fine-grained motion understanding, comprising six categories of motion-oriented tasks that systematically expose performance bottlenecks across state-of-the-art VLMs. To model high-frame-rate video efficiently and accurately, we propose the Through-Encoder (TE) Fusion architecture, which performs cross-frame fusion inside the visual encoder to compress video features and fit within the LLM's sequence-length constraints. Drawing on multi-source, heterogeneous data with fine-grained motion annotations, our experiments show that TE Fusion combined with high-frame-rate input yields substantial performance gains, though considerable room for improvement remains. This work establishes a new benchmark, introduces a novel architectural paradigm, and offers foundational insights for advancing video motion understanding.
📝 Abstract
In recent years, vision-language models (VLMs) have made significant advances in video understanding. However, a crucial capability, fine-grained motion comprehension, remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLMs' ability to perceive fine-grained motion within the limited sequence length of the LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel, efficient Through-Encoder (TE) Fusion method. Experiments show that higher-frame-rate inputs and TE Fusion improve motion understanding, yet substantial room for enhancement remains. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: https://motion-bench.github.io.
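To see why compressing video features before the LLM matters, the sequence-length arithmetic can be sketched as below. This is an illustrative back-of-the-envelope calculation only: the token-per-frame count, context budget, and fusion group size are hypothetical values chosen for the example, not numbers from MotionBench, and the function does not implement TE Fusion itself (which fuses frames inside the visual encoder).

```python
def llm_token_count(num_frames: int, tokens_per_frame: int,
                    fusion_group: int = 1) -> int:
    """Visual tokens the LLM must consume for one video clip.

    With per-frame ("shallow") fusion, every frame contributes its full
    token grid. If groups of `fusion_group` adjacent frames are instead
    merged inside the visual encoder, each group emits roughly a single
    frame's worth of tokens.
    """
    groups = -(-num_frames // fusion_group)  # ceiling division
    return groups * tokens_per_frame

# Hypothetical numbers: a 64-frame, high-frame-rate clip at 196
# tokens/frame overflows a 4k-token visual budget without compression;
# fusing groups of 4 frames brings it back under budget.
naive = llm_token_count(64, 196)                   # 12544 tokens
fused = llm_token_count(64, 196, fusion_group=4)   # 3136 tokens
```

The point of the sketch is only the scaling: token load grows linearly with frame rate, so higher-frame-rate input is infeasible without some in-encoder compression scheme.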