FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding

📅 2025-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) exhibit significant limitations in fine-grained video motion understanding, particularly in modeling temporal dynamics. To address this gap, the authors introduce FAVOR-Bench, a comprehensive benchmark explicitly designed to evaluate this capability. It comprises 1,776 videos with multi-stage, human-annotated, structured motion descriptions and supports both closed-ended tasks (8,184 multiple-choice question-answer pairs across six sub-tasks) and open-ended captioning. A novel, cost-efficient LLM-free evaluation framework makes open-ended assessment reproducible and interpretable, complemented by a GPT-assisted alternative. Experiments on FAVOR-Bench reveal consistently weak fine-grained motion understanding across 21 state-of-the-art MLLMs. To help close this gap, the authors also release FAVOR-Train, a dataset of 17,152 videos with fine-grained motion annotations; fine-tuning Qwen2.5-VL on FAVOR-Train yields consistent improvements on the motion-related tasks of TVBench, MotionBench, and FAVOR-Bench.

📝 Abstract
Multimodal Large Language Models (MLLMs) have shown remarkable capabilities in video content understanding but still struggle with fine-grained motion comprehension. To comprehensively assess the motion understanding ability of existing MLLMs, we introduce FAVOR-Bench, comprising 1,776 videos with structured manual annotations of various motions. Our benchmark includes both closed-ended and open-ended tasks. For closed-ended evaluation, we carefully design 8,184 multiple-choice question-answer pairs spanning six distinct sub-tasks. For open-ended evaluation, we develop both a novel cost-efficient LLM-free caption assessment method and a GPT-assisted one, where the former enhances benchmarking interpretability and reproducibility. Comprehensive experiments with 21 state-of-the-art MLLMs reveal significant limitations in their ability to comprehend and describe detailed temporal dynamics in video motions. To alleviate this limitation, we further build FAVOR-Train, a dataset consisting of 17,152 videos with fine-grained motion annotations. Finetuning Qwen2.5-VL on FAVOR-Train yields consistent improvements on motion-related tasks of TVBench, MotionBench, and our FAVOR-Bench. Comprehensive assessment results demonstrate that the proposed FAVOR-Bench and FAVOR-Train provide valuable tools to the community for developing more powerful video understanding models. Project page: https://favor-bench.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Existing MLLMs struggle to comprehend fine-grained motion and detailed temporal dynamics in video.
No prior benchmark provides structured, motion-focused annotations for comprehensively evaluating this capability.
Fine-grained motion training data is scarce, limiting efforts to improve MLLMs' temporal dynamics comprehension.
Innovation

Methods, ideas, or system contributions that make the work stand out.

FAVOR-Bench: 1,776 videos with structured motion annotations
Cost-efficient LLM-free and GPT-assisted caption assessment methods
FAVOR-Train: 17,152 videos for fine-grained motion understanding
👥 Authors
Chongjun Tu (Fudan University): neural architecture search, dataset pruning, MLLM inference acceleration
Lin Zhang (Fudan University, StepFun)
Pengtao Chen (Ph.D. Student, Fudan University): computer vision, diffusion models, efficient deep learning
Peng Ye (The Chinese University of Hong Kong)
Xianfang Zeng (StepFun)
Wei Cheng (StepFun)
Gang Yu (StepFun)
Tao Chen (Fudan University)