FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos

πŸ“… 2025-12-11
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Video motion understanding is hindered by the scarcity of large-scale, fine-grained annotated datasets, since manual annotation is costly and scales poorly. To address this, the paper proposes the first end-to-end, fully automated framework for motion dataset construction: it combines YOLO-based object detection, multi-object tracking (MOT), and spatiotemporal video feature extraction, then feeds the resulting trajectories and video frames to large language models to generate fine-grained motion descriptions and spatial-reasoning question-answer pairs. Crucially, this trajectory-guided, LLM-based spatiotemporal reasoning removes the reliance on human annotation. Across multiple motion understanding benchmarks, open-source models fine-tuned on the generated data (NVILA-Video-15B and Qwen2.5-7B) significantly outperform Gemini-2.5 Flash and Qwen2.5-VL-72B, while retaining full performance on general vision-language tasks.

πŸ“ Abstract
Motion understanding is fundamental to physical reasoning, enabling models to infer dynamics and predict future states. However, state-of-the-art models still struggle on recent motion benchmarks, primarily due to the scarcity of large-scale, fine-grained motion datasets. Existing motion datasets are often constructed from costly manual annotation, severely limiting scalability. To address this challenge, we introduce FoundationMotion, a fully automated data curation pipeline that constructs large-scale motion datasets. Our approach first detects and tracks objects in videos to extract their trajectories, then leverages these trajectories and video frames with Large Language Models (LLMs) to generate fine-grained captions and diverse question-answer pairs about motion and spatial reasoning. Using datasets produced by this pipeline, we fine-tune open-source models including NVILA-Video-15B and Qwen2.5-7B, achieving substantial improvements in motion understanding without compromising performance on other tasks. Notably, our models outperform strong closed-source baselines like Gemini-2.5 Flash and large open-source models such as Qwen2.5-VL-72B across diverse motion understanding datasets and benchmarks. FoundationMotion thus provides a scalable solution for curating fine-grained motion datasets that enable effective fine-tuning of diverse models to enhance motion understanding and spatial reasoning capabilities.
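The pipeline described in the abstract (detect and track objects, extract trajectories, then prompt an LLM with trajectories plus frames) can be sketched in a few functions. The sketch below is illustrative only: the `Detection` record, the coarse direction summary, and the prompt wording are assumptions, and the actual detection/tracking stages (YOLO + MOT) and the LLM call are abstracted away.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Detection:
    """One tracked box center, as a detector + tracker would emit per frame."""
    frame: int
    track_id: int
    label: str
    cx: float  # box center x (pixels)
    cy: float  # box center y (pixels)

def group_tracks(detections: List[Detection]) -> Dict[int, List[Detection]]:
    """Group per-frame detections into per-object trajectories by track id."""
    tracks: Dict[int, List[Detection]] = {}
    for d in detections:
        tracks.setdefault(d.track_id, []).append(d)
    for track in tracks.values():
        track.sort(key=lambda d: d.frame)
    return tracks

def describe_track(track: List[Detection]) -> str:
    """Summarize one trajectory as coarse motion text for the LLM prompt."""
    first, last = track[0], track[-1]
    dx, dy = last.cx - first.cx, last.cy - first.cy
    horiz = "right" if dx > 0 else "left"
    vert = "down" if dy > 0 else "up"  # image y grows downward
    return (f"{first.label} (track {first.track_id}) moves {horiz} and {vert} "
            f"over frames {first.frame}-{last.frame} (dx={dx:.1f}, dy={dy:.1f})")

def build_prompt(detections: List[Detection]) -> str:
    """Assemble trajectory summaries into a caption/QA-generation prompt."""
    lines = [describe_track(t) for t in group_tracks(detections).values()]
    return ("Given these object trajectories, write a fine-grained motion "
            "caption and one spatial-reasoning Q&A pair:\n" + "\n".join(lines))

# Toy usage: one tracked car drifting right and slightly up.
dets = [Detection(0, 1, "car", 10.0, 50.0), Detection(30, 1, "car", 90.0, 40.0)]
print(build_prompt(dets))
```

In the real pipeline the prompt would also include sampled video frames and be sent to a vision-language model; this fragment only shows how trajectory structure can be serialized into text so the LLM's output stays grounded in tracked motion rather than free-form guessing.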
Problem

Research questions and friction points this paper is trying to address.

State-of-the-art models still struggle on recent motion understanding benchmarks
Large-scale, fine-grained motion datasets are scarce
Existing motion datasets depend on costly manual annotation, which limits scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline for large-scale motion dataset curation
Leverages object trajectories and LLMs for fine-grained labeling
Enables effective fine-tuning of diverse models for motion understanding