MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs

📅 2025-06-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) exhibit significant limitations in fine-grained video motion understanding: they struggle to model inter-frame dynamic variations, conflate object motion with camera motion, and lack sensitivity to subtle visual cues. To address this, we propose MotionSight, a zero-shot visual prompting framework introducing a dual-prompt mechanism comprising an object-centric visual spotlight and motion blur, which enhances inter-frame dynamic perception and disentangles object motion from camera motion without any fine-tuning. We further introduce MotionVid-QA, the first large-scale video question-answering benchmark tailored to fine-grained motion understanding, containing roughly 40K video clips and 87K QA pairs with multi-granularity motion annotations and preference data. Experiments demonstrate that MotionSight achieves state-of-the-art performance among open-source MLLMs and is competitive with leading commercial closed-source models. Code, models, and full annotations will be publicly released.
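The summary does not spell out how the spotlight prompt is constructed, but the idea can be pictured concretely: given a bounding box for the object of interest (here assumed to come from some off-the-shelf detector, not necessarily the paper's pipeline), the background is dimmed so the model attends to the object's motion. A minimal NumPy sketch under those assumptions:

```python
# Minimal sketch of an object-centric visual spotlight prompt.
# Assumption: `box` comes from an off-the-shelf object detector; the exact
# detection and dimming scheme used by MotionSight may differ.
import numpy as np

def spotlight(frame: np.ndarray, box: tuple[int, int, int, int],
              dim: float = 0.4) -> np.ndarray:
    """Dim everything outside `box` so the object of interest stands out.

    frame: H x W x 3 uint8 image
    box:   (x0, y0, x1, y1) pixel coordinates of the object
    dim:   brightness multiplier for the background (0 = black, 1 = unchanged)
    """
    out = frame.astype(np.float32) * dim          # darken the whole frame
    x0, y0, x1, y1 = box
    out[y0:y1, x0:x1] = frame[y0:y1, x0:x1]       # restore the object region
    return out.astype(np.uint8)
```

Applied per frame before the clip is fed to the MLLM, a prompt like this gives the model an explicit, training-free cue about which region's motion a question refers to.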

📝 Abstract
Despite advancements in Multimodal Large Language Models (MLLMs), their proficiency in fine-grained video motion understanding remains critically limited. They often lack inter-frame differencing and tend to average or ignore subtle visual cues. Furthermore, while visual prompting has shown promise for static images, its application to video's temporal complexities, particularly for fine-grained motion understanding, remains largely unexplored. We investigate whether MLLMs' inherent motion perception can be unlocked without training, and whether distinct visual signatures can be tailored to decouple object-motion from camera-motion cues. In this study, we introduce MotionSight, a novel zero-shot method pioneering object-centric visual spotlight and motion blur as visual prompts to effectively improve fine-grained motion understanding without training. To convert this into valuable data assets, we curated MotionVid-QA, the first large-scale dataset for fine-grained video motion understanding, with hierarchical annotations including SFT and preference data, Θ(40K) video clips, and Θ(87K) QAs. Experiments show MotionSight achieves state-of-the-art open-source performance and competitiveness with commercial models. In particular, for fine-grained motion understanding we present a novel zero-shot technique and a large-scale, high-quality dataset. All code and annotations will be publicly available.
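As a companion to the spotlight sketch above, one plausible realization of the motion-blur prompt (an assumption; the abstract does not describe its mechanics) is temporal averaging over a short frame window: static, camera-stabilized content stays sharp while moving objects leave a smear whose direction and length encode their motion, which helps separate object motion from global camera motion. A sketch under those assumptions:

```python
# Hedged sketch of a motion-blur visual prompt via temporal averaging.
# Assumptions: frames are already sampled and spatially aligned; the window
# size and uniform weighting are illustrative choices, not the paper's.
import numpy as np

def motion_blur_prompt(frames: list[np.ndarray]) -> np.ndarray:
    """Average a short window of frames; moving content leaves a blur trail."""
    stack = np.stack(frames).astype(np.float32)   # shape (T, H, W, 3)
    return stack.mean(axis=0).astype(np.uint8)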
Problem

Research questions and friction points this paper is trying to address.

Enhancing fine-grained motion understanding in MLLMs
Exploring visual prompts for video temporal complexities
Addressing lack of inter-frame differencing in MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot object-centric visual spotlight technique
Motion blur as visual prompts
Large-scale MotionVid-QA dataset creation
👥 Authors
Yipeng Du, Nanjing University
Tiehan Fan, Nanjing University (AIGC, Multimodal Learning)
Kepan Nan, Nanjing University (Computer Vision, Video Generation)
Rui Xie, Nanjing University & ByteDance
Penghao Zhou, ByteDance
Xiang Li, Nankai University
Jian Yang, Nanjing University
Zhenheng Yang, TikTok (Computer Vision, Machine Learning, Deep Learning)
Ying Tai, Nanjing University