FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos

πŸ“… 2025-12-11
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Video motion understanding is hindered by the scarcity of large-scale, fine-grained annotated datasets, since manual annotation is costly and scales poorly. To address this, the paper proposes the first end-to-end, fully automated framework for motion dataset construction: it combines YOLO-based object detection, multi-object tracking (MOT), and spatiotemporal video feature extraction, then feeds the resulting trajectories and video frames to large language models to generate fine-grained motion descriptions and spatial-reasoning question-answer pairs. Crucially, this trajectory-guided, LLM-based spatiotemporal reasoning removes the reliance on human annotation. Across multiple motion understanding benchmarks, open-source models fine-tuned on the generated data (NVILA-Video-15B and Qwen2.5-7B) significantly outperform Gemini-2.5 Flash and Qwen2.5-VL-72B, while retaining full performance on general vision-language tasks.

πŸ“ Abstract
Motion understanding is fundamental to physical reasoning, enabling models to infer dynamics and predict future states. However, state-of-the-art models still struggle on recent motion benchmarks, primarily due to the scarcity of large-scale, fine-grained motion datasets. Existing motion datasets are often constructed from costly manual annotation, severely limiting scalability. To address this challenge, we introduce FoundationMotion, a fully automated data curation pipeline that constructs large-scale motion datasets. Our approach first detects and tracks objects in videos to extract their trajectories, then leverages these trajectories and video frames with Large Language Models (LLMs) to generate fine-grained captions and diverse question-answer pairs about motion and spatial reasoning. Using datasets produced by this pipeline, we fine-tune open-source models including NVILA-Video-15B and Qwen2.5-7B, achieving substantial improvements in motion understanding without compromising performance on other tasks. Notably, our models outperform strong closed-source baselines like Gemini-2.5 Flash and large open-source models such as Qwen2.5-VL-72B across diverse motion understanding datasets and benchmarks. FoundationMotion thus provides a scalable solution for curating fine-grained motion datasets that enable effective fine-tuning of diverse models to enhance motion understanding and spatial reasoning capabilities.
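The pipeline described in the abstract (detect and track objects, extract trajectories, then prompt an LLM with trajectories plus frames) can be sketched in a few functions. The sketch below is illustrative only: the `Detection` record, the coarse direction summary, and the prompt wording are assumptions, and the actual detection/tracking stages (YOLO + MOT) and the LLM call are abstracted away.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Detection:
    """One tracked box center, as a detector + tracker would emit per frame."""
    frame: int
    track_id: int
    label: str
    cx: float  # box center x (pixels)
    cy: float  # box center y (pixels)

def group_tracks(detections: List[Detection]) -> Dict[int, List[Detection]]:
    """Group per-frame detections into per-object trajectories by track id."""
    tracks: Dict[int, List[Detection]] = {}
    for d in detections:
        tracks.setdefault(d.track_id, []).append(d)
    for track in tracks.values():
        track.sort(key=lambda d: d.frame)
    return tracks

def describe_track(track: List[Detection]) -> str:
    """Summarize one trajectory as coarse motion text for the LLM prompt."""
    first, last = track[0], track[-1]
    dx, dy = last.cx - first.cx, last.cy - first.cy
    horiz = "right" if dx > 0 else "left"
    vert = "down" if dy > 0 else "up"  # image y grows downward
    return (f"{first.label} (track {first.track_id}) moves {horiz} and {vert} "
            f"over frames {first.frame}-{last.frame} (dx={dx:.1f}, dy={dy:.1f})")

def build_prompt(detections: List[Detection]) -> str:
    """Assemble trajectory summaries into a caption/QA-generation prompt."""
    lines = [describe_track(t) for t in group_tracks(detections).values()]
    return ("Given these object trajectories, write a fine-grained motion "
            "caption and one spatial-reasoning Q&A pair:\n" + "\n".join(lines))

# Toy usage: one tracked car drifting right and slightly up.
dets = [Detection(0, 1, "car", 10.0, 50.0), Detection(30, 1, "car", 90.0, 40.0)]
print(build_prompt(dets))
```

In the real pipeline the prompt would also include sampled video frames and be sent to a vision-language model; this fragment only shows how trajectory structure can be serialized into text so the LLM's output stays grounded in tracked motion rather than free-form guessing.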
Problem

Research questions and friction points this paper is trying to address.

State-of-the-art models still struggle on recent motion understanding benchmarks
Large-scale, fine-grained motion datasets are scarce
Existing motion datasets depend on costly manual annotation, which limits scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline for large-scale motion dataset curation
Leverages object trajectories and LLMs for fine-grained labeling
Enables effective fine-tuning of diverse models for motion understanding