MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs

📅 2025-06-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) exhibit significant limitations in fine-grained video motion understanding: they struggle to model inter-frame dynamic variations, conflate object motion with camera motion, and lack sensitivity to subtle visual cues. To address this, we propose MotionSight, a zero-shot visual prompting framework introducing a dual-prompt mechanism comprising an object-centric visual spotlight and motion blur, which enhances inter-frame dynamic perception and disentangles object motion from camera motion without any fine-tuning. We further introduce MotionVid-QA, the first large-scale video question-answering benchmark tailored to fine-grained motion understanding, containing roughly 40K video clips and 87K QA pairs with multi-granularity motion annotations and preference data. Experiments demonstrate that MotionSight achieves state-of-the-art performance among open-source MLLMs and is competitive with leading commercial closed-source models. Code, models, and full annotations will be publicly released.
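The summary does not spell out how the spotlight prompt is constructed, but the idea can be pictured concretely: given a bounding box for the object of interest (here assumed to come from some off-the-shelf detector, not necessarily the paper's pipeline), the background is dimmed so the model attends to the object's motion. A minimal NumPy sketch under those assumptions:

```python
# Minimal sketch of an object-centric visual spotlight prompt.
# Assumption: `box` comes from an off-the-shelf object detector; the exact
# detection and dimming scheme used by MotionSight may differ.
import numpy as np

def spotlight(frame: np.ndarray, box: tuple[int, int, int, int],
              dim: float = 0.4) -> np.ndarray:
    """Dim everything outside `box` so the object of interest stands out.

    frame: H x W x 3 uint8 image
    box:   (x0, y0, x1, y1) pixel coordinates of the object
    dim:   brightness multiplier for the background (0 = black, 1 = unchanged)
    """
    out = frame.astype(np.float32) * dim          # darken the whole frame
    x0, y0, x1, y1 = box
    out[y0:y1, x0:x1] = frame[y0:y1, x0:x1]       # restore the object region
    return out.astype(np.uint8)
```

Applied per frame before the clip is fed to the MLLM, a prompt like this gives the model an explicit, training-free cue about which region's motion a question refers to.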

📝 Abstract
Despite advancements in Multimodal Large Language Models (MLLMs), their proficiency in fine-grained video motion understanding remains critically limited. They often lack inter-frame differencing and tend to average or ignore subtle visual cues. Furthermore, while visual prompting has shown promise for static images, its application to video's temporal complexities, particularly for fine-grained motion understanding, remains largely unexplored. We investigate whether MLLMs' inherent motion perception can be unlocked without training, and whether distinct visual signatures can be tailored to decouple object-motion from camera-motion cues. In this study, we introduce MotionSight, a novel zero-shot method pioneering object-centric visual spotlight and motion blur as visual prompts to effectively improve fine-grained motion understanding without training. To convert this into valuable data assets, we curated MotionVid-QA, the first large-scale dataset for fine-grained video motion understanding, with hierarchical annotations including SFT and preference data, Θ(40K) video clips, and Θ(87K) QAs. Experiments show MotionSight achieves state-of-the-art open-source performance and competitiveness with commercial models. In particular, for fine-grained motion understanding we present a novel zero-shot technique and a large-scale, high-quality dataset. All code and annotations will be publicly available.
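As a companion to the spotlight sketch above, one plausible realization of the motion-blur prompt (an assumption; the abstract does not describe its mechanics) is temporal averaging over a short frame window: static, camera-stabilized content stays sharp while moving objects leave a smear whose direction and length encode their motion, which helps separate object motion from global camera motion. A sketch under those assumptions:

```python
# Hedged sketch of a motion-blur visual prompt via temporal averaging.
# Assumptions: frames are already sampled and spatially aligned; the window
# size and uniform weighting are illustrative choices, not the paper's.
import numpy as np

def motion_blur_prompt(frames: list[np.ndarray]) -> np.ndarray:
    """Average a short window of frames; moving content leaves a blur trail."""
    stack = np.stack(frames).astype(np.float32)   # shape (T, H, W, 3)
    return stack.mean(axis=0).astype(np.uint8)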
Problem

Research questions and friction points this paper is trying to address.

Enhancing fine-grained motion understanding in MLLMs
Exploring visual prompts for video temporal complexities
Addressing lack of inter-frame differencing in MLLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot object-centric visual spotlight technique
Motion blur as visual prompts
Large-scale MotionVid-QA dataset creation
👥 Authors
Yipeng Du, Nanjing University
Tiehan Fan, Nanjing University (AIGC, Multimodal Learning)
Kepan Nan, Nanjing University (Computer Vision, Video Generation)
Rui Xie, Nanjing University & ByteDance
Penghao Zhou, ByteDance
Xiang Li, Nankai University
Jian Yang, Nanjing University
Zhenheng Yang, TikTok (Computer Vision, Machine Learning, Deep Learning)
Ying Tai, Nanjing University